Reliability and Validity of the ITERS-3™

Infant/Toddler Environment Rating Scale®, Third Edition (ITERS-3)

The ITERS-3™ is a revision of the widely used and documented Infant/Toddler Environment Rating Scale, Revised and Updated (ITERS-R) (2006), one in a family of instruments designed to assess the overall quality of early childhood programs. Since the concurrent and predictive validity of the ITERS-R™ and ECERS-3™ are well established and the current revision maintains the basic properties of the original instrument, the first field study of the ITERS-3 focused primarily on whether trained observers can use the Third Edition reliably and whether its basic psychometric properties remain similar to those of earlier versions. Additional studies will be needed to document the scale's continued relationship with other measures of quality and its ability to predict child outcomes. As further studies are conducted, they will be posted on the ERSI website (www.ersi.info).

After extensive revision, the authors conducted small pilot trials of the ITERS-3 in the late spring and summer of 2016 and a larger field test of the scale that fall. The study sample consisted of 53 classrooms in four states: Georgia (15), Pennsylvania (16), Washington (18), and North Carolina (4). Classrooms were recruited with the goal of having approximately one third of the total be low-quality programs, one third mid-level quality, and one third high quality, based on available data from state licensing and Quality Rating and Improvement System information. In the end, the sample was somewhat skewed, with the field test data showing relatively few high-scoring classrooms and more in the moderate- to low-scoring range, but an adequate distribution was attained to allow examination of the scale's use across a range of program quality.

 

Study Findings

Indicator Reliability. Indicator reliability is the percentage of indicator scores that match exactly when two assessors independently complete the ITERS-3 during the same time sample. Across the 33 Items in the ITERS-3, there were a total of 476 indicators in the pilot test version of the instrument. Assessors were instructed to score all indicators for each classroom. The average reliability across all of the indicators and assessor pairs was 86.9%. The indicators were scored either Yes or No, with several indicators allowed to be assigned NA (not applicable) in specified circumstances. In addition, six of the Items could be scored NA; in such cases, NA was assigned as the score for all indicators in those Items. A few indicators scored below 75%; after the field test the authors examined those indicators and either eliminated them or made adjustments to improve their reliability. The final version of the scale includes 457 indicators across the 33 Items.
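As a rough sketch of this calculation (not the authors' analysis code), the following Python snippet computes percent exact agreement between two assessors' indicator scores. The score values and the simple handling of NA, which is treated here as just another value that must match, are illustrative assumptions.

```python
# Sketch of indicator-level reliability: percent of indicators scored
# identically by two independent assessors. The scores and the NA handling
# are illustrative assumptions, not the authors' analysis code.

def indicator_agreement(assessor_a, assessor_b):
    """Percent exact agreement across paired indicator scores.

    Each argument is a list of indicator scores ("Yes", "No", or "NA"),
    aligned so that position i refers to the same indicator.
    """
    if len(assessor_a) != len(assessor_b):
        raise ValueError("Assessors must score the same set of indicators")
    matches = sum(a == b for a, b in zip(assessor_a, assessor_b))
    return 100.0 * matches / len(assessor_a)

# Hypothetical scores for a handful of indicators in one classroom
a = ["Yes", "No", "Yes", "NA", "Yes"]
b = ["Yes", "No", "No", "NA", "Yes"]
print(f"Indicator agreement: {indicator_agreement(a, b):.1f}%")  # 80.0%
```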

Total Scale and Item Reliability. Because of the nature of the scoring system, it is theoretically possible to have high indicator agreement but low agreement at the Item level. Two measures of Item agreement were therefore calculated. First, we calculated agreement between pairs of observers within 1 point on the 7-point scale. Across all 33 Items, exact agreement occurred in 60.6% of the cases and agreement within 1 point was obtained 86.1% of the time. For individual Items, agreement within one point ranged from a low of 69.8% for Item 4, Display for children, to 94.4% for both Item 5, Meals/Snacks, and Item 20, Nature/Science.
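For illustration, the sketch below computes both exact agreement and agreement within 1 point from paired Item scores on the 1-7 scale; the scores shown are hypothetical, not field-test data.

```python
# Sketch of Item-level agreement on the 1-7 scale: exact matches and
# matches within 1 point between two assessors. Scores are illustrative.

def item_agreement(scores_a, scores_b):
    """Return (percent exact agreement, percent agreement within 1 point)."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs)
    n = len(pairs)
    return 100.0 * exact / n, 100.0 * within_one / n

# Hypothetical Item scores from one assessor pair across several Items
a = [3, 5, 4, 6, 2, 7, 4]
b = [3, 4, 4, 6, 4, 7, 5]
exact, within = item_agreement(a, b)
print(f"Exact: {exact:.1f}%  Within 1 point: {within:.1f}%")
```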

A second, more conservative measure of reliability is Cohen’s kappa, which corrects observed agreement for the agreement expected by chance. For measures with an ordinal scale, a weighted kappa, which also takes into account the magnitude of the difference between scores, is most appropriate and was used for the pilot test data. The mean weighted kappa for the 33 Items was .600. Weighted kappa ranged from a low of .376 for Item 3, Room arrangement, to a high of .682 for Item 13, Staff use of books with children. Item 3 was the only Item with a score below .40; six additional Items had a weighted kappa below .500. All of these Items received minor edits to help improve reliability. The edits made to the indicators discussed above should result in somewhat higher kappas for the low-scoring Items without changing their basic content, and these changes are included in the current version of the scale. Even using this more conservative measure of reliability, the overall results indicate an acceptable level of reliability for the instrument as a whole.
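A minimal sketch of how a weighted kappa could be computed for a single Item is shown below, using scikit-learn's cohen_kappa_score. The report does not state whether linear or quadratic weights were used, so linear weighting is an assumption here, and the scores are hypothetical.

```python
# Sketch of a weighted Cohen's kappa for one Item, using scikit-learn.
# The weighting scheme (linear vs. quadratic) is not specified in the
# report; linear weights are shown purely as an illustration.

from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-7 Item scores for the same classrooms from two assessors
assessor_a = [3, 5, 4, 6, 2, 7, 4, 1, 5, 3]
assessor_b = [3, 4, 4, 6, 4, 7, 5, 1, 5, 2]

kappa = cohen_kappa_score(assessor_a, assessor_b, weights="linear")
print(f"Weighted kappa: {kappa:.3f}")
```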

Intraclass Correlation. A third way of looking at reliability is the intraclass correlation, which examines the level of agreement between independent observers, accounting both for the correlation between the two observers' ratings and for differences in their absolute magnitude. We computed the absolute-agreement intraclass correlation coefficient (ICC) using a two-way mixed-effects model with average-measures estimates, where 0 represents no agreement between assessments and 1 represents perfect agreement. At the Item level the mean coefficient was .83, ranging from .637 for Item 3, Room arrangement, to .917 for Item 33, Group play activities. Coefficients for the subscales and for the total score are shown in the table below. Generally, correlations of .85 or higher are considered acceptable; only the Space and Furnishings subscale fell below this level.

Intraclass Correlation: Subscales and Full Scale

Subscale                                    N     N paired    ICC
Subscale 1: Space and Furnishings           106   53          0.764
Subscale 2: Personal Care Routines          106   53          0.857
Subscale 3: Language/Books                  106   53          0.940
Subscale 4: Activities                      106   53          0.895
Subscale 5: Interaction                     106   53          0.917
Subscale 6: Program Structure               106   53          0.870
Mean of Subscales 1–6                                         0.874
Full Scale (based on Observation score)     106   53          0.915
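As an illustration of the statistic reported above, the sketch below computes an absolute-agreement, average-measures ICC (Shrout and Fleiss ICC(2,k), equivalent in form to McGraw and Wong's ICC(A,k)) directly from two-way ANOVA mean squares. The ratings matrix is hypothetical, and the exact model specification used in the field test analysis may differ.

```python
# Sketch of an absolute-agreement, average-measures ICC for a two-way
# design, computed from ANOVA mean squares. The ratings are hypothetical.

import numpy as np

def icc_absolute_average(ratings):
    """ratings: (n_targets, k_raters) array of scores."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-classroom means
    col_means = ratings.mean(axis=0)          # per-assessor means

    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_error = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    # ICC(2,k): (MSR - MSE) / (MSR + (MSC - MSE) / n)
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)

# Hypothetical subscale scores: 6 classrooms each rated by 2 assessors
scores = [[3.2, 3.4], [5.1, 4.8], [2.5, 2.9], [6.0, 5.7], [4.4, 4.6], [3.8, 3.5]]
print(f"ICC (absolute agreement, average measures): {icc_absolute_average(scores):.3f}")
```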

 

Internal Consistency. Finally, we examined the scale for internal consistency, a measure of the degree to which the full scale and the subscales appear to be measuring common concepts. Cronbach’s alpha estimates of .6 and higher are generally considered acceptable levels of internal consistency. Overall, the scale has a high level of internal consistency, with a Cronbach’s alpha of .914. This estimate allows a high degree of confidence that a unified concept, which we call quality of the environment, is being measured. We also examined the degree to which the subscales show consistency, that is, whether they measure the subcomponents of quality consistently. The table below shows the Cronbach’s alpha estimates for each subscale and for the full scale, based on the 106 field-test observations:

Internal Consistency

Subscale                                    Cronbach’s Alpha
Subscale 1: Space and Furnishings           0.761
Subscale 2: Personal Care Routines          0.855
Subscale 3: Language/Books                  0.940
Subscale 4: Activities                      0.893
Subscale 5: Interaction                     0.915
Subscale 6: Program Structure               0.868
Mean of Subscales 1–6                       0.872
Full Scale (Items 1–33)                     0.914
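For reference, Cronbach's alpha can be computed from the Item-level variances and the variance of the summed score. The sketch below applies the standard formula to a hypothetical score matrix, not the field-test data.

```python
# Sketch of Cronbach's alpha: alpha = k/(k-1) * (1 - sum of item variances
# / variance of the total score). The Item-score matrix is hypothetical.

import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: (n_observations, n_items) array of Item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)         # variance of each Item
    total_var = x.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical scores: 5 observations on 4 Items of one subscale
scores = [[3, 4, 3, 5],
          [5, 5, 6, 6],
          [2, 3, 2, 3],
          [6, 5, 6, 7],
          [4, 4, 5, 4]]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.3f}")
```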