Unit 3 Characteristics of Tests: Test Validity and Reliability

Meaning, nature and principles of validity; categories of validity evidence; factors affecting validity; meaning of the reliability concept; definition of terms (obtained scores, error scores, true scores, standard error of measurement); methods of estimating reliability; factors affecting reliability.


3.1 Explain the concept of validity and how it applies to all educational assessment results

3.2 State and explain the four principles of validation

3.3 List, explain and differentiate between the categories of validity evidence.

3.4 Explain the factors that influence the validity of test results.

3.5 Explain the concept of reliability and how it relates to inconsistency in students’ assessment results.

3.6 Explain the relationship between reliability, validity and the quality of educational decisions.

3.7 Explain the term “reliability coefficient”

3.8 Explain the meaning of observed/obtained, true and error scores

3.9 Explain the meaning of standard error of measurement, its relationship to reliability and how it is used to interpret scores.

3.10 State and explain the methods of estimating the reliability of test results.

3.11 Identify and explain the factors influencing reliability.





Test Validity


Definition:  Nitko (1996, p. 36) defined validity as the “soundness of the interpretations and use of students’ assessment results”.  Validity emphasizes the interpretations and use of the results and not the test instrument.  Evidence needs to be provided that the interpretations and use are appropriate.


Nature of validity:

In using the term validity in relation to testing and assessment, five points have to be borne in mind.

  1. Validity refers to the appropriateness of the interpretations of the results of an assessment procedure for a group of individuals. It does not refer to the procedure or instrument itself.
  2. Validity is a matter of degree. Results have different degrees of validity for different purposes and for different situations.  Assessment results may have high, moderate or low validity.
  3. Validity is always specific to some particular use or interpretation. No assessment is valid for all purposes.
  4. Validity is a unitary concept that is based on various kinds of evidence.
  5. Validity involves an overall evaluative judgment. Several types of validity evidence should be studied and combined.


Principles for validation

There are four principles that help a test user/giver to decide the degree to which his/her assessment results are valid.

  1. The interpretations (meanings) given to students’ assessment results are valid only to the degree that evidence can be produced to support their appropriateness.
  2. The uses made of assessment results are valid only to the degree that evidence can be produced to support their appropriateness and correctness.
  3. The interpretations and uses of assessment results are valid only when the educational and social values implied by them are appropriate.
  4. The interpretations and uses made of assessment results are valid only when the consequences of these interpretations and uses are consistent with appropriate values.


Categories of validity evidence


There are 3 major categories of validity evidence.


  1. Content-related evidence

This type of evidence refers to the content representativeness and relevance of tasks or items on an instrument.  Judgments of content representativeness focus on whether the assessment tasks are a representative sample from a larger domain of performance.  Judgments of content relevance focus on whether the assessment tasks are included in the test user’s domain definition when standardized tests are used.

Content-related evidence answers questions like:

  1. How well do the assessment tasks represent the domain of important content?
  2. How well do the assessment tasks represent the curriculum as defined?
  3. How well do the assessment tasks reflect current thinking about what should be taught and assessed?
  4. Are the assessment tasks worthy of being learned?


To obtain answers to these questions, a description of the curriculum and the content to be learned (or already learned) is obtained.  Each assessment task is checked to see if it matches important content and learning outcomes.  Each assessment task is rated for its relevance, importance, accuracy and meaningfulness.


One way to ascertain content-related validity is to inspect the table of specifications, which is a two-way chart showing the content coverage and the instructional objectives to be measured.


  2. Criterion-related evidence

This type of evidence pertains to the empirical technique of studying the relationship between test scores or other measures (predictors) and independent external measures (criteria), for example between intelligence test scores and university grade point average.  Criterion-related evidence answers the question: How well can the results of an assessment be used to infer or predict an individual’s standing on one or more outcomes other than the assessment procedure itself?  The outcome is called the criterion.


There are two types of criterion-related evidence.  These are concurrent validity and predictive validity.

Concurrent validity evidence refers to the extent to which individuals’ current status on a criterion can be estimated from their current performance on an assessment instrument.  For concurrent validity, data are collected at approximately the same time and the purpose is to substitute the assessment result for the score of a related variable, e.g. scores on a test of swimming ability versus scores for the actual swimming.

Predictive validity evidence refers to the extent to which individuals’ future performance on a criterion can be predicted from their prior performance on an assessment instrument.  For predictive validity, data are collected at different times.  Scores on the predictor variable are collected prior to the scores on the criterion variable.  The purpose is to predict future performance on a criterion variable, e.g. using WASSCE results to predict the first-year GPA in the University of Cape Coast.


Criterion-related validation is done by computing the coefficient of correlation between the assessment result and the criterion.  The correlation coefficient is a statistical index that quantifies the degree of relationship between the scores from one assessment and the scores from another.  This coefficient is often called the validity coefficient and takes values from –1.0 to +1.0.
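As an illustration of how such a coefficient might be computed, the sketch below correlates a predictor with a criterion using the Pearson product-moment formula. The function name `pearson_r` and the score lists are hypothetical and are not data from this unit.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: entrance test scores (predictor) and first-year GPA (criterion).
entrance_scores = [45, 52, 60, 68, 75, 83]
first_year_gpa = [1.8, 2.1, 2.4, 2.9, 3.1, 3.6]

validity_coefficient = pearson_r(entrance_scores, first_year_gpa)
print(round(validity_coefficient, 3))
```

A value close to +1.0 would suggest that the predictor ranks students in much the same order as the criterion does; a value near 0 would suggest little predictive value.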


Expectancy tables can also be used for validation.  An expectancy table is a two-way table that allows one to say how likely it is for a person with a specific assessment result to attain each criterion score level.



An example of an expectancy table

Percent of pupils receiving each grade

Test score    F     D     C     B     A     Total
90-99         -     -     20    60    20    100
80-89         -     8     33    42    17    100
70-79         -     20    33    40    7     100
60-69         -     22    44    28    6     100
50-59         6     28    44    22    -     100
40-49         7     40    33    20    -     100
30-39         17    42    33    8     -     100
20-29         25    50    25    -     -     100
10-19         100   -     -     -     -     100


Determine the degree of success by using a grade, e.g. C or better.  A question could then be: What is the probability that a person with a score of 65 will succeed in this course (i.e. obtain grade C or better)?  The score of 65 lies in the 60-69 class and for this class, 78% (44 + 28 + 6) are successful, so the person has a 78% chance of success.
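The lookup described above can be sketched in a few lines. Only the 60-69 row is taken from the worked example in the text; the dictionary layout and the function name `chance_of_success` are illustrative assumptions.

```python
GRADES = ["F", "D", "C", "B", "A"]  # ordered from lowest to highest

# Row for the 60-69 score class, as used in the worked example
# (percent of pupils receiving each grade). The data structure is an assumption.
expectancy = {
    (60, 69): {"F": 0, "D": 22, "C": 44, "B": 28, "A": 6},
}

def chance_of_success(score, table, minimum_grade="C"):
    """Percent of pupils in the score's class who earned minimum_grade or better."""
    passing = GRADES[GRADES.index(minimum_grade):]
    for (low, high), row in table.items():
        if low <= score <= high:
            return sum(row[grade] for grade in passing)
    raise KeyError(f"no score class contains {score}")

print(chance_of_success(65, expectancy))  # 78
```

Changing `minimum_grade` to "B" would answer the same question for a stricter definition of success.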


  3. Construct-related evidence: This type of evidence refers to how well the assessment results can be interpreted as reflecting an individual’s status regarding an educational or psychological trait, attribute or mental process.  Examples of constructs are mathematical reasoning, reading comprehension, creativity, honesty and sociability.


Methods of construct validation

  1. Define the domain or tasks to be measured. Specifications must be very well defined so that the meaning of the construct is clear.  Expert judgment is then used to judge the extent to which the assessment provides a relevant and representative measure of the task domain.
  2. Analyze the mental processes required by the assessment tasks. Examine the assessment tasks or administer the tasks to individual students and have them “think aloud” as they perform the tasks.
  3. Compare the scores of known groups. Groups that are expected to differ on the construct, e.g. by age or training, may be given the same assessment tasks and their scores compared.
  4. Correlate the assessment scores with other measures. Similar constructs are expected to produce high correlation, e.g. two assessments of creativity are expected to correlate highly.






Factors affecting validity

  1. Unclear directions. Validity is reduced if students do not clearly understand how to respond to the items and how to record the responses or the amount of time available.
  2. Reading vocabulary and sentence structure that are too difficult. These tend to reduce validity because the assessment may end up measuring reading comprehension, which it is not intended to measure.
  3. Ambiguous statements in assessment tasks and items. This confuses students and makes way for different interpretations thus reducing validity.
  4. Inadequate time limits. Students who do not have enough time to respond may perform below their level of achievement.  This reduces validity.
  5. Inappropriate level of difficulty of the test items. Items that are too easy or too difficult do not provide high validity.
  6. Poorly constructed test items. These items may provide unintentional clues which may cause students to perform above their actual level of achievement.  This lowers validity.
  7. Test items inappropriate for the outcomes being measured. Such items lower validity.
  8. Test being too short. If a test is too short, it does not provide a representative sample of the performance of interest and this lowers validity.
  9. Improper arrangement of items. Placing difficult items at the beginning of the test may put some students off and unsettle them, causing them to perform below their level of achievement and thus reducing validity.
  10. Identifiable pattern of answers. Placing the answers to items such as multiple-choice and true/false types in an identifiable pattern enables students to guess the correct answers more easily and this lowers validity.
  11. Cheating. When students copy answers or help their friends with answers to test items, validity is reduced.
  12. Unreliable scoring. Test items, especially essay items, that are not scored reliably lower validity.
  13. Student emotional disturbances. These interfere with their performance thus reducing validity.
  14. Fear of the assessment situation. Students can be frightened by the assessment situation and are unable to perform normally. This reduces their actual level of performance and consequently, lowers validity.


Test Reliability



Reliability is the degree of consistency of assessment results.  It is the degree to which assessment results are the same when (1) the same tasks are completed on two different occasions (2) different but equivalent tasks are completed on the same or different occasions, and (3) two or more raters mark performance on the same tasks.


Points to note when applying the concept of reliability to testing and assessment.

  1. Reliability refers to the results obtained with an assessment instrument and not to the instrument itself.
  2. An estimate of reliability refers to a particular type of consistency.
  3. Reliability is a necessary condition but not a sufficient condition for validity.
  4. Reliability is primarily statistical. It is determined by the reliability coefficient, which is defined as a correlation coefficient that indicates the degree of relationship between two sets of scores intended to be measures of the same characteristic.  It ranges from 0.0 to 1.0.


Definition of terms:

Obtained (Observed) score:  Actual scores obtained in a test or assessment.

Error score:  The amount of error in an obtained score.

True score:  The difference between the obtained and the error scores.  It is the portion of the observed score that is not affected by random error.  An estimate of the true score of a student is the mean score obtained after repeated assessments under the same conditions.

X = T + E

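Reliability can be defined theoretically as the ratio of the true score variance to the observed score variance, i.e. $r_{xx} = S_T^2 / S_x^2$, where $S_T^2$ is the variance of the true scores and $S_x^2$ is the variance of the obtained scores.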
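The ratio definition can be unpacked with a short derivation. The sketch below assumes, as classical test theory usually does, that true scores and error scores are uncorrelated; the symbols $S_T^2$, $S_E^2$ and $S_x^2$ (variances of the true, error and obtained scores) are notation introduced here for illustration.

```latex
\begin{align*}
X &= T + E \\
S_x^2 &= S_T^2 + S_E^2 \quad \text{(assuming $T$ and $E$ are uncorrelated)} \\
r_{xx} &= \frac{S_T^2}{S_x^2} = 1 - \frac{S_E^2}{S_x^2} \\
S_E &= S_x \sqrt{1 - r_{xx}}
\end{align*}
```

The last line gives the standard deviation of the error scores in terms of two quantities that can be estimated from data: the standard deviation of the obtained scores and the reliability coefficient.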

Standard error of measurement:  It is a measure of the variation within individuals on a test.  It is an estimate of the standard deviation of the errors of measurement.  It is obtained by the formula $S_e = S_x\sqrt{1 - r_{xx}}$, or $SEM = SD_x\sqrt{1 - r_{xx}}$, where $S_x$ or $SD_x$ is the standard deviation of the obtained scores and $r_{xx}$ is the reliability coefficient.  For example, given that $r_{xx} = 0.8$ and $S_x = 4.0$,

$SEM = 4\sqrt{1 - 0.8} = 4\sqrt{0.2} \approx 4 \times 0.447 = 1.788$
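The same calculation can be checked with a small function. This is a minimal sketch; the function name `standard_error_of_measurement` is hypothetical, and the inputs are the values from the example above.

```python
from math import sqrt

def standard_error_of_measurement(sd_obtained, reliability):
    """SEM = SD of obtained scores times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0.0 and 1.0")
    return sd_obtained * sqrt(1.0 - reliability)

sem = standard_error_of_measurement(4.0, 0.8)
print(round(sem, 3))  # 1.789
```

The hand calculation gives 1.788 because the square root is rounded to 0.447 before multiplying; without that intermediate rounding the result is about 1.789.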


Interpreting standard errors of measurement

  1. It estimates the amount by which a student’s obtained score is likely to deviate from her/his true score. For example, SEM = 4 indicates that a student’s obtained score is likely to lie within 4 points above or below the true score. For an obtained score of 75, the true score is therefore likely to lie between 71 and 79.  The range 71-79 provides a confidence band for interpreting the obtained score.  A small standard error of measurement indicates high reliability, providing greater confidence that the obtained score is near the true score.
  2. In interpreting the scores of two students, if the bands overlap, as in Example 1, there is no real difference between the two scores. However, if the two bands do not overlap, as in Example 2, there is a real difference between the scores.

Example 1.  SEM = 5.  Suppose Grace had 34 and Fiifi 32 in a quiz.  There is overlapping.

Example 2.  SEM = 2.  Suppose George had 38 and Aku 32 in a quiz.  There is no overlapping.
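The two examples can be worked through by building a band of one SEM on either side of each obtained score and checking whether the bands overlap. This is a minimal sketch; the function names `score_band` and `bands_overlap` are hypothetical.

```python
def score_band(obtained, sem):
    """Confidence band of one SEM below and above the obtained score."""
    return (obtained - sem, obtained + sem)

def bands_overlap(band_a, band_b):
    """True when the two (low, high) bands share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Example 1: SEM = 5
grace = score_band(34, 5)   # (29, 39)
fiifi = score_band(32, 5)   # (27, 37)
print(bands_overlap(grace, fiifi))  # True: no real difference

# Example 2: SEM = 2
george = score_band(38, 2)  # (36, 40)
aku = score_band(32, 2)     # (30, 34)
print(bands_overlap(george, aku))   # False: a real difference
```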



Reliability coefficient:  A correlation coefficient that indicates the degree of relationship between two sets of scores intended to be measures of the same characteristic (e.g. correlation between scores assigned by two different raters or scores obtained from administration of two forms of a test)


Methods of estimating reliability

  1. Test-retest method. This is a measure of the stability of scores over a period of time.  The same test is given to a group of students twice within an interval ranging from several minutes to years.  The scores on the two administrations are correlated and the result is the estimate of the reliability of the test.  The interval should be reasonable, neither too short nor too long.
  2. Equivalent forms method. Two test forms, which are alternate or parallel with the same content and level of difficulty for each item, are administered to the same group of students.  The forms may be given on the same or nearly the same occasion, or a time interval may elapse before the second form is given.  The scores on the two administrations are correlated and the result is the estimate of the reliability of the test.
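  3. Split-half method. This is a measure of internal consistency.  A single test is given to the students.  The test is then divided into two halves for scoring.  The two scores for each student are correlated to obtain the estimate of reliability.  The test can be split into two halves in several ways.  These include using (i) odd-even numbered items, and (ii) first half-second half.  Because the correlation between the halves describes a test only half as long as the original, the Spearman-Brown prophecy formula is often used to obtain the reliability coefficient of the full test.  This is given by:

$r_{xx} = \frac{2\,r_{hh}}{1 + r_{hh}}$

where $r_{hh}$ is the correlation between the scores on the two halves.

Suppose correlation between half test scores was 0.75.  Then

$r_{xx} = \frac{2(0.75)}{1 + 0.75} = \frac{1.50}{1.75} \approx 0.86$
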
  4. Kuder-Richardson method. This is also a measure of internal consistency.  A single administration of the test is used.  Kuder-Richardson Formulas 20 and 21 (KR20 and KR21) are used mostly for dichotomously scored items (i.e. right or wrong).  KR20 can be generalized to items scored on more than two levels (e.g. attitude scales ranging from 5 to 1 on a 5-point scale).  Such estimates are called Coefficient Alpha.
  5. Inter-rater reliability. Two raters each score every student’s paper.  The two sets of scores for all the students are correlated.  This estimate of reliability is called scorer reliability or inter-rater reliability.  It is an index of the extent to which the raters were consistent in rating the same students.
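The two internal-consistency methods above can be sketched in code. The example below assumes an odd-even split followed by the Spearman-Brown correction, and the usual KR20 formula, k/(k - 1) × (1 - Σpq / variance of total scores). The function names and the small matrix of right/wrong responses are hypothetical and serve only as an illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step the half-test correlation up to a full-length reliability."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """Odd-even split; item_scores holds one list of item scores per student."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))

def kr20(item_scores):
    """KR20 for items scored 1 (right) or 0 (wrong)."""
    n_students = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_students  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical responses: 6 students (rows) by 6 items (columns).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(spearman_brown(0.75), 3))  # 0.857
print(round(split_half_reliability(responses), 3))
print(round(kr20(responses), 3))
```

A different way of splitting the test (for example first half versus second half) would generally give a somewhat different split-half estimate, whereas KR20 does not depend on any particular split.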







Factors influencing reliability

  1. Test length. Longer tests give more reliable scores.  A test consisting of 40 items will give a more reliable score than a test consisting of 25 items.  Wherever practicable, use more items.
  2. Group variability. The more heterogeneous the group, the higher the reliability.  The narrower the range of a group’s ability, the lower the reliability.  Differentiate among students.  Use items that differentiate the best students from the less able students.
  3. Difficulty of items. Too difficult or too easy items produce little variation in the test scores.  This in turn lowers reliability.  The difficulty of the assessment tasks should be matched to the ability level of the students.
  4. Scoring objectivity. Subjectively scored items result in lower reliability.  More objectively scored assessment results are more reliable.  For subjectively-scored items, multiple markers are preferred.
  5. Time limits. Tests in which most students do not complete the items because too little time was allocated result in lower reliability.  Sufficient time should be provided to students to respond to the items.
  6. Sole marking. Using multiple markers improves the reliability of the assessment results.  A single person grading may lead to low reliability especially of essay tests, term papers, and performances.  Averaging the results of several markers increases reliability.
  7. Testing conditions. Where test administrators do not adhere strictly to uniform test regulations and practices, students’ scores may not represent their actual level of performances and this tends to reduce reliability.  In cases of the test-retest method of estimating reliability, this issue is of a great concern.
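The effect of test length (factor 1 above) can be illustrated with the general form of the Spearman-Brown prophecy formula, $r_k = \frac{k r}{1 + (k - 1) r}$, where $k$ is the factor by which the test is lengthened. This general form is not stated in the unit, and it assumes the added items are comparable in quality to the existing ones. The starting reliability of 0.70 and the function name `projected_reliability` are hypothetical.

```python
def projected_reliability(current_reliability, current_items, new_items):
    """General Spearman-Brown prophecy formula for a change in test length."""
    k = new_items / current_items  # factor by which the test is lengthened
    r = current_reliability
    return (k * r) / (1 + (k - 1) * r)

# Hypothetical: a 25-item test with reliability 0.70 is extended to 40 items.
print(round(projected_reliability(0.70, 25, 40), 3))  # 0.789
```

Under these assumptions, lengthening the test from 25 to 40 items raises the projected reliability from 0.70 to about 0.79.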





In this unit, you have been exposed to the concept of validity and reliability of tests. Any good test must achieve these two characteristics.  A test is said to be valid if it measures what it is supposed to measure. A test is reliable if it measures what it is supposed to measure consistently.
