B. Development of the ITERS-R

The Infant/Toddler Environment Rating Scale-Revised Edition (ITERS-R) is a thorough revision of the original Infant/Toddler Environment Rating Scale (ITERS, 1990). It is one of a series of four scales that share the same format and scoring system but vary considerably in requirements, because each scale assesses a different age group and/or type of child development setting. The ITERS-R retains the original broad definition of environment including organization of space, interaction, activities, schedule, and provisions for parents and staff. The 39 items are organized into seven subscales: Space and Furnishings, Personal Care Routines, Listening and Talking, Activities, Interaction, Program Structure, and Parents and Staff. This scale is designed to assess programs for children from birth to 30 months of age, the age group that is most vulnerable physically, mentally, and emotionally. Therefore, the ITERS-R contains items to assess provision in the environment for the protection of children's health and safety, appropriate stimulation through language and activities, and warm, supportive interaction.

Admittedly, it is very challenging to meet the needs of infants and toddlers in a group care setting because each of these very young children requires a great deal of personal attention in order to thrive. The economic pressure of raising a family continues to make the use of out-of-home group care for infants and toddlers the norm rather than the exception. Therefore, as a society, we are increasingly aware that we must face the challenge of providing child care settings for very young children that promote optimal development. It has long been the personal challenge of professional early childhood educators to provide the nurturance and stimulation that very young children need on a daily basis. A comprehensive, reliable, and valid instrument that assesses process quality and quantifies what is observed to be happening in a classroom can play an important role in improving the quality of infant/toddler care.

In order to define and measure quality, the ITERS-R draws from three main sources: research evidence from a number of relevant fields (health, development, and education), professional views of best practice, and the practical constraints of real life in a child care setting. The requirements of the ITERS-R are based on what these sources judge to be important conditions for positive outcomes in children both while they are in the program and long afterward. The guiding principle here, as in all of our environment rating scales, has been to focus on what we know to be good for children.

Process of Revision

The process of revision drew on four main sources of information: (1) research on development in the early years and findings related to the impact of child care environments on children's health and development; (2) a content comparison of the original ITERS with other assessment instruments designed for a similar age group, and with additional documents describing aspects of program quality; (3) feedback from ITERS users, solicited through a questionnaire that was circulated and also put on our website, as well as from a focus group of professionals familiar with the ITERS; and (4) intensive use of the scale for more than two years by two of the ITERS co-authors and over 25 ITERS-trained assessors in the North Carolina Rated License Project.

The data from studies of program quality gave us information about the range of scores on various items, the relative difficulty of items, and their validity. The content comparison helped us to identify items to consider for addition or deletion. By far the most helpful guidance for the revision was the feedback from direct use in the field. Colleagues from the US, Canada, and Europe who had used the ITERS in research, monitoring, and program improvement gave us valuable suggestions based on their experience with the scale. The focus group discussed in particular what was needed to make the revised ITERS more sensitive to issues of inclusion and diversity.

Changes in the ITERS-R

While retaining the basic similarities in format and content that provide continuity between the ITERS and ITERS-R, the following changes were made:

  1. The indicators under each level of quality in an item were numbered so that they could be given a score of "Yes", "No", or "Not Applicable" (NA) on the scoresheet. This makes it possible to be more exact in reflecting observed strengths and weaknesses in an item.
  2. Negative indicators at the minimal level were removed from the one item that contained them and are now found only at the 1 (inadequate) level. In levels 3 (minimal), 5 (good), and 7 (excellent), only indicators of positive attributes are listed. This eliminates the one exception to the scoring rule in the original ITERS.
  3. The Notes for Clarification have been expanded to give additional information to improve accuracy in scoring and to explain the intent of specific items and indicators.
  4. Indicators and examples were added throughout the scale to make the items more inclusive and culturally sensitive. This follows the advice of scale users, who recommended incorporating such indicators and examples throughout the scale rather than adding a separate subscale.
  5. New items were added to several subscales including the following: Listening and Talking: Item 12. Helping children understand language, and Item 13. Helping children use language; Activities: Item 22. Nature/science, and Item 23. Use of TV, video and/or computer; Program Structure: Item 30. Free play, and Item 31. Group play activities; Parents and Staff: Item 37. Staff continuity, and Item 38. Supervision and evaluation of staff.
  6. Some items in the Space and Furnishings subscale were combined to remove redundancies, and two items were dropped from Personal Care Routines: Item 12. Health policy, and Item 14. Safety policy. Research showed that these items were routinely rated with high scores because they were based on regulation, while the corresponding items assessing practice were rated much lower. It is practice that the ITERS-R should concentrate on, since the aim is to assess process quality.
  7. The scaling of some of the items in the subscale Personal Care Routines was made more gradual to better reflect varying levels of health practices in real-life situations, including Item 6. Greeting/departing, Item 7. Meals/snacks, Item 9. Diapering/toileting, Item 10. Health practices, and Item 11. Safety practices.
  8. Each item is printed on a separate page, followed by the Notes for Clarification.
  9. Sample questions are included for indicators that are difficult to observe.

Reliability and Validity

As noted earlier in this introduction, the ITERS-R is a revision of the widely used and documented ITERS, which is one of a family of instruments designed to assess the overall quality of early childhood programs. Together with the original instrument, the Early Childhood Environment Rating Scale (ECERS), and its more recent revision, the ECERS-R, these scales have been used in major research projects in the United States as well as in a number of other countries. This extensive research has documented both the ability of trained observers to use the scales reliably and the validity of the scales, in terms of their relation to other measures of quality and their tie to developmental outcomes for children in classrooms with varying environmental ratings.

In particular, both the ECERS and ITERS scores are predicted by structural measures of quality such as child-staff ratios, group size, and staff education levels (Cryer, Tietze, Burchinal, Leal, & Palacios, 1999; Phillipsen, Burchinal, Howes, & Cryer, 1998). The scores are also related to other characteristics normally expected to be related to quality such as teacher salaries and total program costs (Cryer et al., 1999; Marshall, Creps, Burstein, Glantz, Robeson, & Barnett, 2001; Phillipsen et al., 1998; Whitebook, Howes, & Phillips, 1989). In turn, rating scale scores have been shown to predict children's development (Burchinal, Roberts, Nabors, & Bryant, 1996; Peisner-Feinberg et al., 1999).

Since the concurrent and predictive validity of the original ITERS is well established and the current revision maintains the basic properties of the original instrument, the studies of the ITERS-R have focused on the degree to which the revised version maintains the ability of trained observers to use the scale reliably. Additional studies will be needed to document the continued relationship with other measures of quality as well as to document its ability to predict child outcomes. A two-phase study was completed in 2001 and 2002 to establish reliability in use of the scale.

The first phase was a pilot phase, in which a total of 10 trained observers, working in groups of two or three, used the first version of the revised scale in 12 observations in nine centers with infant and/or toddler groups. After these observations, modifications were made to the revised scale to address issues that arose in the pilot observations.

The final phase of the field test involved a more formal study of reliability. In this phase, six trained observers conducted 45 paired observations. Each observation lasted approximately three hours and was followed by a 20-30 minute teacher interview. The groups observed were selected to be representative of the range of quality in programs in North Carolina. North Carolina has a rated license system that awards points for various features related to quality. Centers are given a license with one to five stars depending on the total number of points earned. A center receiving a one-star license meets only the very basic requirements of the licensing law, while a five-star center meets much higher standards. For our sample we selected 15 groups in centers with one or two stars, 15 in centers with three stars, and 15 in centers with four or five stars. The programs were also chosen to represent various age ranges of children served: of the 45 groups observed, 15 served children under 12 months of age, 15 served children 12-24 months old, and 15 served children 18-30 months old. The groups were in 34 different centers, and seven of the groups included children with identified disabilities. All centers were in the central portion of North Carolina.

The field test thus yielded 90 observations in all, with each of the 45 group settings scored simultaneously by a pair of observers. Several measures of reliability were calculated from these data.

Indicator Reliability. Across all 39 items in the revised ITERS, there are a total of 467 indicators. The paired raters agreed on 91.65% of all indicator scores given. Because some researchers omit the Parents and Staff subscale in their work, we also calculated indicator reliability for the child-specific items in the first six subscales, Items 1-32. Observer agreement for the 378 indicators in these items was 90.27%. Only one item had indicator agreement of less than 80% (Item 11. Safety practices, at 79.11%). The item with the highest level of indicator agreement was Item 35. Staff professional needs, at 97.36%. It is apparent that a high level of observer agreement at the indicator level can be obtained using the ITERS-R.
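
As an illustration of the arithmetic behind this indicator-level statistic, the sketch below computes exact percent agreement between two paired observers. It is not the authors' actual analysis code; the data and variable names are hypothetical.

```python
# Minimal sketch: exact percent agreement on indicator scores between
# two paired observers. Indicators are recorded as "Y", "N", or "NA"
# on the scoresheet; the scores below are hypothetical.

def indicator_agreement(rater_a, rater_b):
    """Return the percentage of indicators on which two raters agree."""
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must score the same set of indicators")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

obs_a = ["Y", "Y", "N", "NA", "Y", "N"]
obs_b = ["Y", "N", "N", "NA", "Y", "N"]
print(f"{indicator_agreement(obs_a, obs_b):.2f}% agreement")  # 83.33%
```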

Item Reliability. Because of the nature of the scoring system, it is theoretically possible to have high indicator agreement but low agreement at the item level. Two measures of item agreement were therefore calculated. First, we calculated the proportion of cases in which a pair of observers agreed within 1 point on the seven-point scale. Across the 32 child-related items, there was agreement at this level 83% of the time; for the full 39 items, agreement within 1 point was obtained in 85% of the cases. Item agreement within one point ranged from a low of 64% for Item 4. Room arrangement, to 98% for Item 38. Supervision and evaluation of staff.
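
For concreteness, here is a similar minimal sketch for the within-one-point criterion on the seven-point item scale, again with hypothetical scores rather than field-test data.

```python
# Minimal sketch: percentage of items on which two paired observers'
# scores on the 7-point scale differ by no more than one point.
# The scores below are hypothetical, not field-test data.

def within_one_point(scores_a, scores_b):
    """Percentage of item scores on which two raters differ by <= 1 point."""
    pairs = list(zip(scores_a, scores_b))
    close = sum(abs(a - b) <= 1 for a, b in pairs)
    return 100.0 * close / len(pairs)

items_a = [3, 5, 4, 1, 6, 7, 2]
items_b = [4, 5, 2, 1, 7, 7, 4]
print(f"{within_one_point(items_a, items_b):.0f}% within 1 point")  # 71%
```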

A second, somewhat more conservative measure of reliability is Cohen's Kappa, which corrects for chance agreement; in its weighted form it also takes into account the size of the difference between scores. The mean weighted Kappa for the first 32 items was .55, and for the full 39-item scale it was .58. Weighted Kappas ranged from a low of .14 for Item 9. Diapering/toileting, to a high of .92 for Item 34. Provisions for personal needs of staff. Only two items had weighted Kappas below .40: Item 9. Diapering/toileting (.14) and Item 11. Safety practices (.20). In both cases the mean item score was extremely low, and a characteristic of the Kappa statistic is that for items with little variability the reliability estimate is particularly sensitive to even minor differences between observers. The authors and observers agreed that the low scores on these items accurately reflected the situation in the groups observed, and that any changes made merely to increase variability would give an inaccurate picture of the features of quality reflected in these two items. For all items with a weighted Kappa below .50, the authors examined the items carefully and made minor changes to improve reliability without changing basic content. These changes are included in the printed version of the scale. Even using this more conservative measure of reliability, the overall results indicate a clearly acceptable level of reliability.
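
The weighted Kappa computation can be sketched as follows. The field test report does not state which disagreement weights were used, so linear weights are assumed here purely for illustration, and the data are hypothetical.

```python
import numpy as np

def weighted_kappa(scores_a, scores_b, categories=tuple(range(1, 8))):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    Linear disagreement weights are assumed here as one common choice;
    the field test report does not specify the weighting scheme used.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(scores_a)

    # Observed joint proportions of the two raters' scores.
    obs = np.zeros((k, k))
    for a, b in zip(scores_a, scores_b):
        obs[idx[a], idx[b]] += 1.0 / n

    # Proportions expected by chance, from the marginal distributions.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))

    # Linear weights: 0 on the diagonal, growing with score distance.
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k))) / (k - 1)

    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Hypothetical item scores from two paired observers:
a = [1, 2, 2, 3, 5, 4, 6, 7, 3, 2]
b = [1, 2, 3, 3, 5, 5, 6, 6, 2, 2]
print(f"weighted kappa = {weighted_kappa(a, b):.2f}")
```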

Overall Agreement. For the full scale, the intraclass correlation was .92, both for the full 39 items and for the 32 child-related items. Intraclass correlations for the seven subscales are shown in Table 1. It should be noted that the intraclass correlation for the Program Structure subscale is calculated excluding Item 32. Provision for children with disabilities, since only a small portion of the groups received a score on this item. Taken together with the high levels of agreement at the item level, these results show that the scale has clearly acceptable levels of reliability. It should be remembered, however, that this field test used observers who had been trained and had a good grasp of the concepts used in the scale.

Table 1. Intraclass Correlations of Subscales

  Subscale                     Correlation
  Space and Furnishings            .73
  Personal Care Routines           .67
  Listening and Talking            .77
  Activities                       .91
  Interaction                      .78
  Program Structure                .87
  Parents and Staff                .92
  Full Scale (Items 1-39)          .92
  All Child Items (1-32)           .92
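
The intraclass correlations in Table 1 can in principle be reproduced from a groups-by-observers score matrix. The report does not state which ICC variant was used, so the sketch below shows the one-way random-effects form, ICC(1,1), as one standard possibility; the data are hypothetical.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects intraclass correlation, ICC(1,1).

    `scores` is an (n_groups, n_raters) array: one row per observed
    group, one column per paired observer. This variant is shown only
    as an illustration; the report does not name the ICC form used.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)

    # Mean squares from a one-way ANOVA with groups as the random factor.
    ms_between = k * ((row_means - x.mean()) ** 2).sum() / (n - 1)
    ms_within = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical subscale means for five groups, two observers each:
demo = [[3.1, 3.4], [5.0, 4.8], [2.2, 2.0], [6.1, 6.3], [4.4, 4.9]]
print(f"ICC = {icc_oneway(demo):.2f}")
```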

Internal Consistency. Finally, we examined the scale for internal consistency, a measure of the degree to which the full scale and the subscales appear to be measuring a common concept. Overall, the scale has a high level of internal consistency, with a Cronbach's alpha of .93; for the child-related items, 1-32, the alpha is .92. These values give a high degree of confidence that a unified concept is being measured. A second issue is the degree to which the subscales also show internal consistency. Table 2 shows the alpha for each subscale.

Table 2. Internal Consistency

  Subscale                     Alpha
  Space and Furnishings          .47
  Personal Care Routines         .56
  Listening and Talking          .79
  Activities                     .79
  Interaction                    .80
  Program Structure              .70
  Parents and Staff              .68
  Full Scale (Items 1-39)        .93
  All Child Items (1-32)         .92

Cronbach's alphas of .6 and higher are generally considered acceptable levels of internal consistency. Two subscales fall below this level (Space and Furnishings at .47 and Personal Care Routines at .56), so caution should be taken in using these subscale scores on their own. In the Program Structure subscale, Item 32. Provisions for children with disabilities was rated for only the few groups that had children with identified disabilities, and the internal consistency score for this subscale was calculated excluding this item. Thus, the authors recommend using the Program Structure subscale excluding Item 32 unless most programs being assessed include children with disabilities.
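
Cronbach's alpha itself is straightforward to compute from an observations-by-items score matrix. The sketch below is illustrative only, using hypothetical data; it is not the authors' analysis code.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_observations, n_items) score matrix."""
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)        # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical 1-7 item scores for six observations on four items:
demo = [[3, 4, 3, 5],
        [5, 5, 4, 6],
        [2, 3, 2, 3],
        [6, 6, 5, 7],
        [4, 4, 4, 5],
        [3, 3, 3, 4]]
print(f"alpha = {cronbach_alpha(demo):.2f}")
```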

Overall, the field test demonstrated a high level of interrater agreement at the level of individual scale items as well as at the full-scale score level. These findings are quite comparable to those found in similar studies of the original ITERS, the ECERS, and the ECERS-R. The findings of those previous studies have been confirmed by the work of other researchers, and the scales have proven to be quite useful in a wide range of studies involving the quality of environments for young children. At the same time, the scales have been shown to be user-friendly, in that observers can be brought to acceptable levels of reliability with a reasonable amount of training and supervision.