India Shining and Bharat Drowning: Comparing Two Indian States to the Worldwide Distribution in Mathematics Achievement

This paper uses student answers to publicly released questions from an international testing agency together with statistical methods from Item Response Theory to place secondary students from two Indian states - Orissa and Rajasthan - on a worldwide distribution of mathematics achievement. These two states fall below 43 of the 51 countries for which data exist. The bottom 5 percent of children rank higher than the bottom 5 percent in only three countries - South Africa, Ghana and Saudi Arabia. But not all students test poorly. Inequality in the test-score distribution for both states is next only to South Africa in the worldwide ranking exercise. Consequently, and to the extent that these two states can represent India, the two statements "for every ten top performers in the United States there are four in India" and "for every ten low performers in the United States there are two hundred in India" are both consistent with the data. The combination of India's size and large variance in achievement gives rise to both perceptions: that India is shining even as Bharat, the vernacular for India, is drowning. Comparable estimates of inequalities in learning are the building blocks for substantive research on the correlates of earnings inequality in India and other low-income countries; the methods proposed here allow for independent testing exercises to build up such data by linking scores to internationally comparable tests.


Policy Research Working Paper 4644
This paper - a product of the Human Development and Public Services Team, Development Research Group - is part of a larger effort in the department to measure and understand inequality in the provision of education. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at jdas1@worldbank.org.

Introduction
Net primary enrollment in India has risen steadily over the last several decades and now exceeds 90 percent in most of the country. Large planned increases in the government education budget suggest renewed interest and action on the part of the state, with an emphasis on secondary schooling. Not surprisingly, increasing enrollments and resources have shifted the debate from how many children are in school to what they are learning. A consensus is building that getting children into schools may not be enough. Filmer et al. (2006) go so far as to propose augmenting the Millennium Development Goals with a Millennium Learning Goal that provides international benchmarks on how much children know at a pre-specified age. We ask the following question: Is there a way to place Indian children in secondary schools on an international scale (given India's reluctance to participate in internationally benchmarked tests) and, if so, what would we find in terms of the average score and variance of the achievement distribution?

* We thank Lant Pritchett for extensive discussions of the paper. Kin Bing Wu, who led a World Bank sector study on secondary education in India, designed the collection of the data we use here, and we are grateful to her for making the data and her report available to us. Eric Hanushek and Eugene Gonzalez provided invaluable comments on an early version of this paper and their insights have been critical for the current revision. Michelle Riboud and Sam Carlson provided useful comments that pertain to India's education sector. The findings, interpretations, and conclusions expressed in this paper are those of the authors and do not necessarily represent the views of the World Bank, its Executive Directors, or the governments they represent. Working papers describe research in progress by the authors and are published to elicit comments and to further debate.
We propose a method that uses publicly released questions (items) from the Trends in International Mathematics and Science Study (TIMSS) 1999 8th-Grade Mathematics test to place Indian students on an internationally comparable achievement scale. The test, which consists of 36 items taken from the full TIMSS item bank, was administered to 6,000 students in public and private schools in two Indian states - Rajasthan and Orissa. Using the published item parameters for these 36 questions in conjunction with the Item Response Theory test-equating methods used by TIMSS, we construct a distribution of scores for the tested children that is directly comparable to the worldwide distribution; this allows us to compare the tested children to the international average and to place them in reference to the 51 countries tested by TIMSS in 1999 and 2003. The average scores of children in Rajasthan and Orissa place these states below 46 and 42 of the 51 countries tested in 1999 or 2003. After nine years of education, between 30 and 40 percent of enrolled children in these two states cannot pass a low international benchmark, described as "some basic mathematical knowledge." Children enrolled in secondary schools in these two Indian states are 3.1 OECD standard deviations below the OECD mean. Where children in these two states stand relative to the rest of the world is harder to ascertain. On the one hand, the TIMSS sample is heavily biased towards relatively high-income countries. The median scores in Rajasthan and Orissa, for instance, do not look too bad compared to the Philippines and Chile.
On the other hand, secondary school enrollments in India are also lower-53 percent of the appropriate age group is enrolled, compared to more than 90 percent in South Africa, the worst performer in the TIMSS sample. To the extent that children currently out of school are less "motivated" or "able", test scores would arguably look worse at higher levels of enrollment.
The test-score distribution is also highly unequal-the difference between the top 5 percent and bottom 5 percent in both states is among the highest in the world, next only to South Africa. Students at the bottom of the distribution in both states score similarly or worse than the bottom students in the three worst performing countries. At the same time, students at the top of the distribution score higher than the top students in other low performing countries, and higher than the median student in all but the best countries. The top 5 percent of students in Orissa, for example, score higher than the median student in more than 42 of 46 countries tested in 2003.
Faced with similar results on learning, defenders of the quality of education in Indian schools often point to the large number of globally competitive Indians. We perform the following thought experiment: Suppose that these two states represent India (more on this below). Could the country's size combined with the large variance in scores explain how divergent beliefs can be sustained by the same data? As it turns out, in absolute terms, India has just under half the number of 14-year olds who pass the advanced international benchmark as the United States - 100 thousand compared to 250 thousand - and roughly the same number who pass the intermediate international benchmark. Indeed, India has more top achievers than any European country tested, which, although not surprising given India's size, helps explain India's visible position on the academic stage. But another view is also sustainable. The average child scores far below any reasonable curricular standard and a large minority in these two states fails completely. If the results from these two states hold more generally, over 18 million 14-year olds in India are either not enrolled or are failing the lowest international benchmark if enrolled. That number is 22 times the number of failing children in the United States and more than in any other country tested.
Beyond providing illustrative results for India, this paper is about the building blocks for research on learning and learning inequality in low-income countries where data on internationally comparable tests are typically absent. This requires 1) techniques to place individual students on a single comparable achievement metric and 2) methods to calculate other population quantities, such as the fraction of children passing particular criterion-referenced thresholds or the 5th to 95th percentile achievement spread. Clarifying what is required for comparable measures of learning and learning dispersion allows the research to focus on substantive rather than statistical issues, without worrying about whether results are driven by measurement tools and differing methodologies.
To preview the methodology, independent tests can be linked to the TIMSS achievement distribution provided at least one question is drawn from the TIMSS item bank to fix the free parameters. The primary methodological difficulty arises because "knowledge" or "achievement" is inferred from the data rather than directly observed. Since individual knowledge is measured with error, the variance of the achievement distribution aggregated from Maximum Likelihood estimates of individual knowledge overestimates the true variance. An alternate method, outlined by Mislevy, Beaton, Kaplan & Sheehan (1992), draws from the posterior of every student's achievement distribution to obtain an unbiased measure of the full learning distribution. These draws-known as "plausible values"-are interpreted as individual achievement with the property that when aggregated to a population distribution they recover the correct population moments.
We show that the variance of the distribution is sensitive to the estimation method used (i.e. Maximum Likelihood, Bayesian, or Plausible Values), primarily because the TIMSS test is too difficult for a large fraction of Indian children. 1 The method of plausible values offers an alternative for the calculation of higher moments in any setting - such as poverty mapping - where individual attributes are estimated with a known standard error.
Linking scores to an international distribution contributes to the literature on education in low-income countries in several ways. First, linked test scores are comparable across space and time. Despite increasing worldwide testing using standardized methods - e.g. TIMSS (51 countries), PIRLS (35 countries), IALS (22 countries) and PISA (49 countries) - the Indian government, like many others, is reluctant to participate in such large-scale testing exercises. As a result, what little is known about learning achievement in India, and most low-income countries, arises from an ad-hoc collection of criterion-referenced exams. 2 These tests, administered by independent agencies, are typically not validated using standard testing tools, cannot be equated over time or across countries, and are not subject to the battery of robustness checks that accompany large-scale testing in the OECD countries. The methods applied here allow independent researchers to report achievement distributions for the tests they control that are directly comparable to those obtained worldwide. 3

1 Brown & Micklewright (2004) also highlight the importance of using a consistent methodology. They find, for instance, that rankings of countries by within-country difference in TIMSS changed substantially for some countries when the scoring model used in 1999 was retrospectively applied to 1995 data.

2 Examples for India include a large national study by the National Center for Educational Research and Training (NCERT) in 1994, which found that children scored an average of 47 percent in language and 41 percent in mathematics (Shukla et al. 1994), and state-wide studies with smaller samples in Bihar, Tamil Nadu, Delhi, Uttar Pradesh, Madhya Pradesh and Rajasthan (Bashir 1994, Hasan 1995, Govinda & Varghese 1993, Aggarwal 2000, Goyal 2007). In a major recent effort, the NGO Pratham tested children from almost all districts and found low levels of learning: 52 percent of children between the ages of 7 and 10 could read a small paragraph with short sentences at first grade difficulty levels, 32 percent could read a story text and 54 percent were unable to divide or subtract (Pratham 2006). Similar results have been reported for Africa. In a relatively large effort, the Monitoring Learning Achievement Project (Chinapah et al. 2000, Strauss & Burger 2000) covered 13 African countries and found literacy, numeracy, and life-skills scores for fourth graders between 30 and 70 percent.

Comparable achievement measures contribute to our understanding of earnings inequality and its correlates. A growing literature examines the relationship between earnings inequality and test-score dispersion. Nickell (2004) and Blau & Kahn (2005) report a high correlation between test-score dispersion and wage inequality; Nickell (2004), for instance, suggests that 70 percent of the dispersion in earnings internationally is attributable to the dispersion in test scores.
Similarly, Bedard & Ferrall (2003) show that test-score inequality at early ages is correlated with wage inequality in the same cohort later in life. In contrast to this literature, Devroye & Freeman (2001) argue that wage dispersion within narrowly defined skill sets is higher than dispersion across them, and that institutional mechanisms of collective bargaining matter more. India has recently seen a dramatic increase in inequality (Debroy & Bhandari 2007) at the same time that inequality in educational attainment is falling (Jalan & Murgai 2007). It is likely that as inequality in attainment declines further and returns to skill increase (Kijima 2006), attention will increasingly focus on inequality in cognitive ability.
The remainder of this paper is structured as follows. Section 2 outlines the Item Response Theory method for equating test scores; the technical section and accompanying appendix provide sufficient detail for critique and replication. Section 3 discusses the data, sampling strategy, and test design. Section 4 reports the international benchmarking results and variance decompositions. Section 5 outlines some caveats to our method and several robustness checks; Section 6 concludes.

Overview of Linking Methodology
Properly linking India's mathematics achievement to the world distribution requires either a single test given across all countries (and each year) or a means of linking alternate test forms which may include different items. Since giving a single test is clearly infeasible in most situations, educational testing organizations have developed statistical tools that allow scores from different exams to be expressed on a unified scale. Item Response Theory (IRT) is one such technique and is used in most large-scale testing situations such as TIMSS, PIRLS, NAEP and the SAT and GRE. The basic intuition behind this technique is to model the behavior of each item-i.e. its difficulty, ability to discriminate between two children, and likelihood of being guessed-so that any differences in items can be removed from the score. This contrasts with the commonly reported percent correct score, which gives performance on a test-specific scale.
The fundamental building block of IRT is therefore the item response function (IRF), which links the latent ability, θ, to the probability that a randomly drawn examinee of a given ability will answer the item correctly. One of the most popular models for dichotomous responses is the three-parameter logistic (3PL) model introduced by Birnbaum (1968) and used by TIMSS for multiple choice items. Letting X_ig represent the (0/1) response for individual i on item g, the IRF for the 3PL model is

P_g(θ; a_g, b_g, c_g) = c_g + (1 − c_g) / (1 + exp(−1.7 a_g (θ − b_g))).   (1)

This function describes all 36 items administered to our sample and gives the probability of observing a correct response given ability θ and item parameters (a_g, b_g, c_g). The guessing parameter, c_g, incorporates the fact that on multiple choice exams even the worst performers (θ → −∞) will sometimes guess correctly. The difficulty parameter, b_g, measures the item's overall difficulty since the probability of answering correctly depends equally on ability and difficulty. The discrimination parameter, a_g, captures how quickly the likelihood of success changes with respect to ability. Intuitively, an item with a high discrimination parameter can distinguish between examinees with abilities just below and above b_g. Overall, this relatively flexible functional form has proved adept at fitting item response patterns.
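As a minimal sketch of the 3PL function above (the item parameters here are hypothetical illustrations, not actual TIMSS values), the IRF can be computed directly:

```python
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic IRF: probability of a correct response
    given ability theta and item parameters (a, b, c)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty,
# guessing floor of 0.25 (a four-option multiple choice item).
p_weak = irf_3pl(theta=-2.0, a=1.0, b=0.0, c=0.25)
p_strong = irf_3pl(theta=2.0, a=1.0, b=0.0, c=0.25)
```

Even the weakest examinees answer correctly with probability approaching c, while the strongest approach 1; at θ = b the probability is exactly halfway between c and 1.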
To illustrate graphically how IRT links items and tests, Figure 2 plots the item response functions for two TIMSS items, mapping ability on the horizontal axis to the percentage correct on the vertical axis. A third curve plots the test characteristic curve for a test composed of these two items only. Since the item response functions are fully characterized by the published TIMSS item parameters and the structural assumption of a logistic function, it is easy to read off the mean ability corresponding to a given percentage correct on the test. For instance, if item 19 is administered and 60 percent of children respond correctly, the mean ability is 425. By comparison, the same result on item 21 would suggest a higher mean ability level since that question is more difficult.
The key advantage of IRT in large testing situations is this ability to link tests, either in a cross-section (when different children are administered different test questions) or over time (when children are tested more than once). Formally, IRT equates competence levels by identifying off the set of common items across the tests and defining a reference population. Absent a reference population, the IRF given by (1) provides competence levels and item parameters that are identified only up to an affine transformation - poor performance cannot be distinguished from a difficult test and a large variance in achievement cannot be distinguished from a highly discriminating test. Specifically, for any transformation θ* = αθ + β with rescaled item parameters a*_g = a_g/α, b*_g = αb_g + β and c*_g = c_g, the transformed model yields identical characteristic curves: P_g(θ*; a*_g, b*_g, c*_g) = P_g(θ; a_g, b_g, c_g). However, if item parameters are fixed, the scale of θ - the mean and variance - is fixed as well. Thus by calibrating items using a defined reference group we can score the performance of all other children relative to that group, regardless of which items children actually complete. In our case, the reference group is given by the TIMSS knowledge scale. This scale fixes the item parameters such that the TIMSS 1995 sample of eighth grade children have mean 500 and standard deviation 100 (Yamamoto & Kulick 2000).
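The affine indeterminacy is easy to verify numerically: under any rescaling θ* = αθ + β with a* = a/α and b* = αb + β, the response probabilities are unchanged. A sketch with illustrative (made-up) parameter values:

```python
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    # 3PL item response function
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

alpha, beta = 100.0, 500.0  # e.g. moving from an N(0,1) scale to a 500/100 scale
a, b, c = 1.2, 0.3, 0.2     # hypothetical item parameters on the N(0,1) scale

for theta in (-2.0, 0.0, 1.5):
    p_original = irf_3pl(theta, a, b, c)
    p_rescaled = irf_3pl(alpha * theta + beta, a / alpha, alpha * b + beta, c)
    # a*(theta* - b*) = (a/alpha)(alpha*theta - alpha*b) = a(theta - b), so the
    # curves coincide and the data cannot pin down the scale of theta.
    assert abs(p_original - p_rescaled) < 1e-12
```

This is why fixing the item parameters, as TIMSS does with its 500/100 reference scale, pins down the location and scale of θ.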
In our application, all students receive the same exam and all item parameters are fixed using TIMSS. In general, however, this need not be the case. Students can receive different exams and new items so long as each item can be linked to a common set of fixed items or a fixed reference population. For example, three two-item exams with item pairs (1,2), (2,3) and (3,4) can all be linked provided that one of the four items is fixed, even if each test is administered to a different population. 4

Estimating the Mean
Given a set of individuals who were administered the same test, the likelihood function of observed responses is

L(x_i | θ_i) = ∏_g P_g(θ_i)^(x_ig) [1 − P_g(θ_i)]^(1 − x_ig),   (6)

where P_g is the 3PL model given by (1) and x_ig is the 0/1 response for individual i on item g.
Because of convergence issues associated with joint maximum likelihood methods that iterate between solutions for item parameters and individual abilities, most researchers use marginal maximum likelihood (MML) to estimate the 3PL model. To estimate any unknown item parameters, this method integrates out the ability distribution f(θ) to obtain the marginal likelihood function. Bock & Aitkin (1981) propose an efficient EM algorithm to perform the resulting maximization problem. In addition to the parameter estimates, this algorithm returns a summary measure of the ability distribution f(θ), such as a mean and variance or a quadrature approximation. To obtain individual ability estimates, one can maximize the full likelihood function (6) treating the item parameters as fixed. For our application, this is all that is required to produce MLEs since all item parameters are known. The sample means - the average score in Orissa and Rajasthan - can be computed from the individual ability estimates or, potentially, from the means obtained during the marginalization of the full distribution.
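With item parameters fixed, the individual MLE reduces to a one-dimensional maximization of the log of (6); a simple grid search is enough for a sketch. The five-item test and the response pattern below are made up for illustration:

```python
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector given ability theta,
    with item parameters treated as known."""
    ll = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        ll += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return ll

def mle_ability(responses, items, lo=-4.0, hi=4.0, steps=2001):
    """Grid-search MLE of ability, bounded to [lo, hi] as in practice."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical 5-item test and one student's answers
items = [(1.0, -1.0, 0.2), (1.2, -0.5, 0.2), (0.8, 0.0, 0.25),
         (1.5, 0.5, 0.2), (1.0, 1.0, 0.25)]
responses = [1, 1, 1, 0, 0]
theta_hat = mle_ability(responses, items)
```

Note that an all-correct (or sufficiently poor) response pattern drives the maximizer to the boundary of the grid, mirroring the undefined-MLE problem discussed in the text.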
While maximum likelihood methods are usually perfectly adequate to estimate sample means, there are some exceptions. One significant problem is that MLE proficiency is undefined if children answer fewer items correctly than would be expected by chance. So long as one child has an undefined ability estimate, so too is the sample average. As a result, researchers commonly limit the proficiency scale to some finite range. We follow TIMSS and bound MLE scores between 5 and 995 - in our sample, 91 of the 6,000 tested children are bounded below by 5. A second, more technical concern relates to the methods used to maximize the likelihood function (6) for ability. Yen et al. (1991) find that this likelihood function is often multimodal even for tests with up to 50 items, which is a potential pitfall for many numerical maximization algorithms commonly employed.

4 Suppose one item on the first exam is fixed. We can then estimate the parameters for item 2 using the first exam. Given parameters for item 2, we can then estimate the parameters for item 3 using students who received the second test. These students need not have the same ability distribution as the first group because they can be compared directly using item 2. Using a similar argument we can link the third exam to the first two.
Bayesian methods avoid some of these problems by incorporating additional information through a prior. Using just enough notation to capture the basic idea, the Bayesian approach focuses on the posterior distribution, which is proportional to the product of the likelihood and the prior. The expected a posteriori (EAP) estimate of ability is simply the mean of the posterior distribution for each individual θ_i. One advantage of EAP scores is that they are always well defined, even for poorly performing students; when the likelihood function provides no additional information, the posterior simply converges to the prior. Moreover, provided that the prior distribution is correctly specified, the mean of the EAP scores is an unbiased estimate of the sample mean and has a smaller mean squared error than the corresponding MLE-based estimate.
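A sketch of the EAP score under a normal prior, computed by simple quadrature over a grid (the prior and the three hypothetical items below are illustrative, not the TIMSS operational settings):

```python
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def likelihood(theta, responses, items):
    l = 1.0
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        l *= p if x else (1.0 - p)
    return l

def eap_score(responses, items, mu=0.0, sigma=1.0, lo=-6.0, hi=6.0, steps=601):
    """Expected a posteriori ability: posterior mean under a N(mu, sigma^2) prior,
    approximated on an equally spaced grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    post = [likelihood(t, responses, items) *
            math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in grid]
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total

items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.9, 1.0, 0.25)]  # hypothetical items
eap_all_wrong = eap_score([0, 0, 0], items)  # finite even with zero items correct
```

With an all-wrong response vector the EAP is finite and pulled toward the prior mean, unlike the MLE, which diverges.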

Estimating the Variance and Quantiles
In addition to the average performance level in Rajasthan and Orissa, we are also interested in the shape of the full distribution. The primary difficulty here is that if the test is too short, too easy or too difficult, the individual errors become too large to ignore and the distribution of estimated individual proficiencies no longer converges to the population distribution (Yamamoto & Kulick 2000, Mislevy, Beaton, Kaplan & Sheehan 1992). To get a sense for whether this is an issue in the Indian case, Figure 3 plots the distribution of MLE abilities in a histogram (left axis) and the associated ±1.96 * se confidence interval on the right axis. 5 For children below the mean, the precision of the ability estimate is very low. Simply put, for most Indian children, the test is too hard. In this situation, the mean of the sample will still generally approach the population mean, but the same is not true for the estimated variance.

5 Item Response Theory provides the standard error for each score from the inverse Fisher information matrix after ML estimation of the IRT model. As the number of items grows large, this standard error summarizes the normal sampling distribution of the estimator. However, as the number of items shrinks, the sampling distribution becomes highly non-normal. In particular, our test is weakly informative for poorly performing students because we cannot distinguish between students scoring poorly and those scoring very poorly; we can reject that such students are high achievers. Consistent with how ML standard errors are calculated, Figure 3 does not capture this non-normal behavior and instead graphs ±1.96 * se.
To see this, consider the variance of the MLE scores θ̂ = θ + e and the EAP scores θ̃. The variance of the MLE scores includes both the variance of true scores θ and measurement error e. That is,

Var(θ̂) = Var(θ) + Var(e).

Defining the test reliability ratio as ρ ≡ Var(θ)/Var(θ̂), we have

Var(θ) = ρ Var(θ̂) ≤ Var(θ̂).

By comparison, the EAP scores are a weighted average of the MLE score and the population mean, θ̃ = ρθ̂ + (1 − ρ)µ. The variance of the EAP scores is therefore

Var(θ̃) = ρ² Var(θ̂) = ρ Var(θ) ≤ Var(θ).

The true variance, Var(θ), is bounded above by the MLE score variance and below by the EAP score variance. It should be clear that this argument extends to other percentile moments such as the top and bottom quintile. Unfortunately, the error structure in IRT is complicated and closed-form corrections are not readily available.
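The ordering Var(EAP) ≤ Var(θ) ≤ Var(MLE) can be illustrated with a simple simulation of the classical true-score-plus-error setup above; all numbers here are synthetic, chosen only to make the shrinkage visible:

```python
import random

random.seed(0)
N = 20000
sigma_theta, sigma_e = 1.0, 0.7   # illustrative true-score and error s.d.

theta = [random.gauss(0.0, sigma_theta) for _ in range(N)]  # true abilities
mle = [t + random.gauss(0.0, sigma_e) for t in theta]       # theta_hat = theta + e

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rho = sigma_theta ** 2 / (sigma_theta ** 2 + sigma_e ** 2)  # reliability ratio
eap = [rho * x + (1.0 - rho) * 0.0 for x in mle]            # shrink toward mean 0

# var(eap) ~= rho * Var(theta) < Var(theta) < var(mle) ~= Var(theta) / rho
```

The MLE scores overstate the spread of true ability while the shrunken EAP scores understate it, exactly the bounding argument in the text.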
One simple way to address this issue is to bound the distribution estimates using MLE and EAP scores. Where these estimates are similar, no further work may be required - convenient because both MLE and EAP scores are readily available from standard reports in test analysis programs such as BILOG-MG. Unfortunately, in parts of the distribution where the test is only weakly informative the bounds may be quite large; in our application, this turns out to be true for estimates of lower quantiles. 6 A more satisfactory solution, and the one followed by TIMSS, is to draw "plausible values" from the posterior distribution of each student's ability estimate and then use these draws to approximate the true achievement distribution (Mislevy 1991, Mislevy, Beaton, Kaplan & Sheehan 1992, Mislevy, Johnson & Muraki 1992, Yamamoto & Kulick 2000). Staying with our simplified posterior notation, we draw five plausible values for each child and then estimate the sample moment of interest as

ŝ = (1/5) Σ_{k=1}^{5} s(θ_k),

where s(θ_k) may be the variance, 90th percentile, etc., of the N-element vector of plausible values θ_k. Unfortunately, no publicly available software can draw plausible values for the model we estimate, making it difficult for other researchers to replicate the TIMSS methodology precisely.
We use the Markov Chain Monte Carlo (MCMC) algorithm proposed by Patz & Junker (1999a,b) to compute the EAP scores and plausible values. This technique differs from the computational approach used by TIMSS but is highly flexible and relatively straightforward to implement. We provide a fuller explanation of our estimation strategy in Appendix A.
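In the same classical true-score sketch, plausible values are draws from each student's posterior rather than its mean; aggregating the draws recovers the population variance that EAP scores understate. All values below are synthetic, and the normal-normal posterior is a simplification of the actual IRT posterior:

```python
import random

random.seed(1)
N, K = 20000, 5                    # students and plausible values per student
sigma_theta, sigma_e = 1.0, 0.7    # illustrative true-score and error s.d.

theta = [random.gauss(0.0, sigma_theta) for _ in range(N)]
mle = [t + random.gauss(0.0, sigma_e) for t in theta]

rho = sigma_theta ** 2 / (sigma_theta ** 2 + sigma_e ** 2)
post_mean = [rho * x for x in mle]        # EAP scores (prior mean 0)
post_sd = (rho * sigma_e ** 2) ** 0.5     # posterior s.d. in the normal-normal case

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Average the variance estimate over K sets of plausible values
pv_vars = []
for _ in range(K):
    pv = [random.gauss(m, post_sd) for m in post_mean]
    pv_vars.append(var(pv))
pv_var = sum(pv_vars) / K   # approximately Var(theta), unlike var(post_mean)
```

Adding posterior noise back to the shrunken means undoes the variance compression, which is the essence of why TIMSS reports plausible values rather than point estimates.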
To see whether these concerns are of practical importance, Figure 4 shows the estimated MLE, EAP, and plausible value distributions.

Comparisons between India and South Africa would therefore favor India; alternatively, comparisons between India and Ghana favor the latter. Although problematic for the mean, the lack of information on non-enrolled children may not be as problematic for different percentiles - it may be plausible to assume, for instance, that the 50 percent of children not tested are likely to join the group that performs "poorly", in a sense to be made precise below.
Second, although all attempts were made to ensure that no type of school or location was left out of the sampling procedure, it has been difficult to accurately weight the data given paucity of data on enrollments in private unaided and aided schools at the district level. This is a general problem that any testing exercise has to address and it calls for an urgent compilation of a universal dataset that can be used for sampling in the future.
Third, the data are from two states only, and therefore generalizations to all of India may be misleading - Rajasthan and Orissa are both poorer states with larger tribal populations. Learning outcomes, though, may differ from those suggested by income rankings. Results from a countrywide testing exercise in rural areas (Pratham 2006) suggest that performance in these two states is surprisingly not far off the Indian average - if anything, children in these two states may be scoring higher than the rest of the country. However, considerable caution is still warranted - particularly since Orissa performs better than Rajasthan in the tests we use while Pratham finds the opposite.
In the selected schools, students in ninth grade were administered a 36-item test in which all items were selected from the list of publicly released items published by TIMSS. The test sought to cover the content domains tested under TIMSS, with 11 items on Algebra, 5 on Data Representation, Analysis and Probability, 9 on Fractions and Number Sense, 7 on Geometry and 4 on Measurement. The performance expectation across these content domains also varied, ranging from "Communicating and Reasoning" to "Using Complex Procedures" (Table 1). The items selected were neither too easy nor too difficult in the TIMSS calibration, with difficulty parameters ranging from -1.07 (a student 1 standard deviation below the mean would have a 50 percent chance of answering this question correctly, absent guessing) to 1.244; the items were also uniformly distributed across this difficulty range.

International Benchmarking
There are two views that currently dominate thinking about educational policy in India. One view-active proponents of which include prominent NGOs-is that Bharat is drowning. Average learning levels are so low that the typical child will leave primary school without knowing how to read or perform elementary mathematical operations. A second view-often expressed by those in the government and in the media-is that India is shining. This group points to India's increasing global presence, the large number of Indian professionals in high paying jobs, and the dramatic growth of its service industry, particularly in information technology. As it turns out, both views contain an element of truth, and both views can be justified by presenting different pieces of the same data.
Mirroring the view that Bharat is drowning, absolute achievement, as measured by the percent correct score, is low compared to curricular standards. A significant fraction of children have not mastered the content categories expected for their grade (Table 1). While the item-by-item comparison suggests that Indian children are performing significantly below the international average, interpreting the magnitude of this effect is difficult because it depends on a test-specific metric. As discussed, the percentage correct score is a function of latent achievement differences - our true parameter of interest - and the discriminating power of the test, and is thus inseparable from the specific test design. The true picture may be worse. Since the tests included only enrolled children, the comparisons favor India to the extent that enrollment is lower relative to other countries. In both Botswana (75 percent) and South Africa (90 percent), gross enrollment in secondary schools is higher. It is likely that a representative sample of children (enrolled and unenrolled) would place India below additional countries.

8 We follow the TIMSS methodology as closely as possible and compute sample averages using the EAP scores, which is, in this case, simply more efficient than using plausible values. The MLE scores, which are estimated using BILOG-MG rather than our custom MCMC routines, yield somewhat lower estimates of the average: 374 and 386. The discrepancy between the EAP and MLE averages is likely due to students scoring in an area where the likelihood function is virtually flat or undefined. In this situation, regularity and stability become a major concern with MLE.
That the average child is performing poorly masks considerable variation in the distribution. At the bottom, children score extremely poorly, and there is no evidence that the distribution is more compressed at the bottom than for other low-performing countries. In fact, only three countries - Saudi Arabia, Ghana, and South Africa - score worse than Rajasthan or Orissa when ranked by the 5th percentile cutoff score (Figure 6). When the education system fails, it fails completely.

Inequality in the Learning Distribution
Following Micklewright & Schnepf (2006), we report a simple statistic measuring test-score dispersion: the difference between the 5th and 95th percentiles of the test score distribution. Figure 7 shows the significant educational inequality in the Indian learning distribution. In both Indian states, the 5-95 percentile spread is greater than 300, just below the most unequal country in the TIMSS sample - South Africa.
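This dispersion statistic is straightforward to compute from raw scores. The sketch below is illustrative only: the `scores` array is simulated on a TIMSS-like scale (mean 500, standard deviation 100) and is not the study's data.

```python
import numpy as np

def percentile_spread(scores, lo=5, hi=95):
    """Micklewright & Schnepf-style dispersion: the difference between
    the hi-th and lo-th percentiles of a test-score distribution."""
    p_lo, p_hi = np.percentile(scores, [lo, hi])
    return p_hi - p_lo

# Illustrative data: a TIMSS-scaled normal distribution (mean 500, sd 100)
rng = np.random.default_rng(0)
scores = rng.normal(500, 100, size=10_000)
spread = percentile_spread(scores)  # close to 2 * 1.645 * 100 for a normal
```

For a normal distribution with standard deviation 100 the 5-95 spread is about 329, so a spread above 300, as observed for both states, signals a very wide achievement distribution.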
TIMSS 2003 also presents achievement benchmarks based on an intensive effort to anchor performance to objective criteria. We compute the fraction of children reaching the low (400), intermediate (475), high (550), and advanced (625) international benchmarks; Table 3 shows the results. In Rajasthan and Orissa, 1 percent of children pass the advanced benchmark.
This is actually above many other poorly performing countries. At the same time, only 42 percent of children in Rajasthan and 50 percent in Orissa pass the lowest benchmark. Put another way, only 40 to 50 percent of Rajasthan and Orissa's enrolled ninth graders have "some basic mathematical knowledge"-the description of the low international benchmark.
A second useful exercise that demonstrates the vast differences between tested children is to rank Table 3 by those who reach each of the different international benchmarks. Ranked by the low international benchmark, Rajasthan is 8th from the bottom and Orissa 9th; ranked by the intermediate benchmark, they are now 9th and 14th from the bottom respectively; ranked by the high international benchmark they are now 11th and 16th from the bottom. The advanced international benchmarks put both states at the respectable positions of 12th and 18th, although the precise ranking is difficult to obtain given rounding.
To the extent that these two states represent India, the combination of a wide achievement distribution and immense population explains why perceptions of India can vary so dramatically.
In Table 4, we use population age-cohort estimates and enrollment rates to estimate the number of 14-year-olds in each country who pass the international benchmarks set by TIMSS. The results are striking. If one percent of Indian children reach the advanced international benchmark - the average suggested by Rajasthan and Orissa - the total cohort size ranks 5th out of all the countries tested. Only Japan, the United States, South Korea, and Taiwan have more students passing the top benchmark. For every ten children who pass the advanced benchmark in the United States, there are four in India.

Variance Decomposition
The striking disparity between top- and bottom-achievers hints that children receive different educational inputs, both based on the state in which they live and the characteristics of their families and schools. While it is impossible to draw definitive causal conclusions using simple correlations or variance decompositions, the patterns that emerge from even a basic analysis are broadly consistent with a view of an education system rife with inequality but rich in potential.
In a hopeful sign, the form inequality takes suggests that public policy plays a role. The impact of household attributes - educational inputs that the government has little power to control - appears mitigated by the institutional structure of states and schools.
We present a heuristic approach to examining the sources of achievement in Figure 8. Here, we first regress test scores on district dummies and then plot the residuals; this shows how much of the variation is accounted for by districts. We then add child and household characteristics - age, gender, caste, parental literacy, and wealth - and plot the residuals again; finally, we repeat the exercise including school dummies. To the extent that districts, households, or schools explain a large portion of the variation in the test score data, we expect the residual plot to become more "concentrated" once the appropriate dummies are accounted for. So, if districts matter a lot, we expect the residual plot from a regression of test scores on district dummies to be "tighter" than the distribution of all test scores.
As Figure 8 shows, schools seem to matter most. Progressively adding district effects and family characteristics compresses the distribution only slightly; only when we add school fixed effects is the collapse noticeable. The gaps between schools account for more than the gaps between children with different household characteristics. Table 5 confirms this result more formally using a simple regression-based variance decomposition. Here, we first regress achievement on district dummies. The R² from this regression gives a measure of the variance explained by districts alone. The change in R² after adding household controls gives the fraction of achievement variation explained by observable characteristics above and beyond the district effect. While indicative of households' contribution to learning, we cannot claim that households causally explain this fraction of the variance, since children sort into schools; observable household characteristics may explain achievement simply because schools determine learning and children sort. Proceeding onward, we add school dummy variables and report the increase in R². This gives some sense of the importance of schools, but again we cannot make definitive causal statements: a significant increase in variance explained at this stage implies either that schools matter or that children sort on unobservable characteristics. After accounting for districts, observables, and schools, the remaining variation is idiosyncratic. As Figure 8 shows, measurement error, which by definition cannot be decomposed, forms a significant portion of this idiosyncratic variation. Table 5 shows the results of this exercise. In Orissa (Rajasthan), schools explain an additional 32 percent (41 percent) of the test score variation above districts and observable household characteristics.
This is twice the amount of variation explained by districts and household characteristics in Orissa and five times the variation explained by those attributes in Rajasthan.
Even if half of this effect is due to selection on unobservables, schools remain important. For comparison, the maximum variation possibly attributable to school specific factors in OECD countries is 14 percent-less than half the value for India (Pritchett 2004). If we were to remove the variation due to measurement error and renormalize our decomposition to sum to one, the schools' role would appear even more significant.
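The sequential R² decomposition described above can be sketched in a few lines. Everything in this sketch is illustrative: the simulated data, the 5-district/50-school nesting, and the helper names `r_squared` and `dummies` are assumptions for exposition, not the paper's actual specification.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X must span a constant,
    which a full set of dummy columns does)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

def dummies(codes):
    """Full set of 0/1 indicator columns for an integer code vector."""
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

# Illustrative data: 50 schools nested in 5 districts, one household covariate
rng = np.random.default_rng(1)
n = 2000
district = rng.integers(0, 5, n)
school = district * 10 + rng.integers(0, 10, n)   # schools nested in districts
wealth = rng.normal(size=n)
score = 10 * district + 25 * (school % 10) + 5 * wealth + rng.normal(0, 30, n)

X_d = dummies(district)
X_dh = np.hstack([X_d, wealth[:, None]])
X_dhs = np.hstack([dummies(school), wealth[:, None]])  # schools absorb districts

r2_d = r_squared(score, X_d)      # districts alone
r2_dh = r_squared(score, X_dh)    # + household observables
r2_dhs = r_squared(score, X_dhs)  # + school fixed effects

share_households = r2_dh - r2_d   # added by household characteristics
share_schools = r2_dhs - r2_dh    # added by schools, as in Table 5
```

Because schools are nested within districts, the school dummies absorb the district effects, so the final R² captures districts, observables, and schools together, exactly as in the sequential exercise of Table 5.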

Robustness Checks
Some caveats are in order. TIMSS uses a complex test design where children are given a subset of items in a specific format. Our results are based on a test that includes 36 TIMSS questions, but the test-design is clearly different. The educational testing literature has many examples of design effects, where test scores are shown to change depending on the design of the test. By presenting results using IRT equating methods, we are essentially ignoring this rich literature.
One robustness check used in the item response literature compares the actual responses of children, averaged across ability groups, with those predicted on the basis of the item parameters.
In our particular case, these tests of "item fit" reveal the extent to which the shape of the item response function predicted from the TIMSS item parameters corresponds to the actual responses of examinees. Figure A1 shows the predicted and actual responses for all 36 items.
For the majority of items, both the 3PL model and the item parameters closely predicted how children would perform. In a few instances, however, the fit could be improved. As an example, item 33 is a poorly-fitted item where high ability Indian children seem to struggle more than their international peers. While these few items are unlikely to introduce significant bias, future researchers should carefully select items during the pilot phase to minimize deviations from the expected response patterns.
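An item-fit check of this kind bins examinees by ability and compares the empirical proportion correct in each bin with the 3PL item response function. The sketch below uses illustrative item parameters and simulated responses, not the TIMSS values.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function: guessing floor c plus a logistic curve."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# Illustrative item parameters and simulated examinees
rng = np.random.default_rng(2)
a, b, c = 1.2, 0.3, 0.2
theta = rng.normal(size=5000)
responses = (rng.random(5000) < p_3pl(theta, a, b, c)).astype(int)

# Observed vs. predicted proportions correct within ten ability bins
bins = np.quantile(theta, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(theta, bins) - 1, 0, 9)
observed = np.array([responses[idx == k].mean() for k in range(10)])
midpoints = np.array([theta[idx == k].mean() for k in range(10)])
predicted = p_3pl(midpoints, a, b, c)
max_misfit = np.abs(observed - predicted).max()  # small when the model fits
```

A poorly fitting item like item 33 would show systematic gaps between the observed and predicted curves in particular ability ranges, as described above.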
Further, a factor model of the item responses generated a first eigenvalue (3.9) roughly nine times greater than the second (0.4), easily satisfying Drasgow & Lissak's (1983) rule-of-thumb for assessing the unidimensionality assumption. Nevertheless, we could not conduct formal tests of Differential Item Functioning (DIF), given that we do not have access to item-by-item responses for other TIMSS examinations (these are typically not available in the public domain). Mullis & Martin (2000), however, conduct the required analysis for the TIMSS 1999 sample, and there is little reason to suspect the results would not extend to India.
The methods and results discussed here should not be taken as advocacy for dispensing with TIMSS altogether and simply using its publicly released items to place tested children on international distributions. TIMSS provides a level of analysis and robustness checking that independent researchers cannot easily replicate. We view the methods presented here more as a bridge between current practices and TIMSS-like comparability than as an alternative. Even so, a larger pilot that compares TIMSS results with those obtained by the methods suggested here would yield important information on the biases inherent in our equating methods.

Conclusion
The educational administration in India has often shaken off the bad news emerging from the primary educational sector on the grounds that the Indian system is based on the rigors of selection. A gruelling primary schooling would weed out all but the best performers, who would then graduate onwards to secondary schools and receive a higher quality education. One response to the poor testing results from the primary level has in fact been to point to India's position in the global economy and the comparable performance of its top firms and professionals to their international counterparts. In essence, if the schooling system is so poor, how is it that India has all these top global performers? How this situation plays out over the next decade has much to do with how production technologies evolve in the labor market. If Indian firms manage to adopt "Ford Model-T" technologies that require a handful of highly skilled and educated workers to match with a large number of unskilled workers, India shining can act as a "rising tide that lifts all boats." But if Indian firms adopt "McKinsey" technologies that require skilled workers and unskilled workers to match among themselves (as the IT consulting firms require, but not necessarily call-centers) it is likely that the country will be characterized by increasing inequalities; an enclave of a few privileged and self-perpetuating rich surrounded by a majority poor.
There is some hope in the variance decompositions and associations that inequalities in the educational system can be addressed through government policies. A consistent finding across OECD countries is the low explanatory power of schools, relative to households, for the variation in test scores. This is problematic for policy, since it is easier to change teacher behavior and to improve schools than it is to do the same for parents. That a large fraction of the variation in achievement arises from differences across schools suggests that there are school-level variables, manipulable by policy, that could result in positive impacts.
What these might be, and where to go from here, should form the basis of future research and evaluations.
More generally, the methods proposed in this paper highlight the potential benefits of linking scores to the worldwide achievement distribution. While such efforts cannot replace the important work undertaken by TIMSS, they represent a clear improvement over the collection of ad-hoc exams employed by most researchers, and require little additional work. India is hardly alone in its absence from the TIMSS rankings, and many countries could benefit from an analysis similar to ours. Over time, through such efforts, independent researchers may help make tracking a Millennium Learning Goal a reality.

A.1 Estimating MLE Scores
Linking our test form to the TIMSS knowledge score distribution requires an underlying model of the response process. In our case, all 36 items presented can be described by the 3PL model given in (1). Letting x_ig ∈ {0, 1} denote the response for individual i on item g and X be the full data matrix, the likelihood of observing X given a vector of associated abilities, θ, is

P(X|θ) = ∏_i ∏_g P_g(θ_i)^x_ig [1 − P_g(θ_i)]^(1−x_ig),

where P_g(θ_i) is the 3PL probability of a correct response and the product form arises from assuming independence across items and individuals. Unlike most IRT models, we have suppressed the notation for the item parameters to highlight the fact that they are fixed. In many cases there may be a mix of fixed anchor items and new uncalibrated items, but we do not face that situation here.
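The likelihood above is simple to evaluate once the item parameters are fixed. A minimal sketch (item parameter values and the function names are illustrative):

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, item parameters held fixed."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def log_likelihood(theta, x, a, b, c):
    """Log-likelihood of one examinee's 0/1 response vector x across items,
    using the product form implied by independence across items."""
    p = p_3pl(theta, a, b, c)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```

Because individuals are independent, the full-sample likelihood is just the sum of these per-examinee log-likelihoods, which is what makes per-person maximization tractable in the next step.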
With fixed parameters it is relatively trivial to maximize the likelihood function associated with each individual using Newton-Raphson or some other numerical procedure; each first order condition is independent of the others so we do not face a curse of dimensionality. But some difficulties remain. In particular, the 3PL model's guessing parameter makes MLEs undefined for those scoring below the guessing rate. These flat parts of the likelihood function can make numerical estimates unstable. Yen et al. (1991) also find that some response vectors can produce likelihood functions with multiple modes even for tests of a reasonable length (such as 36 items).
These modes can trap derivative-based maximization algorithms at local rather than global peaks.
To study these issues, we computed ML estimates using both a Newton-Raphson algorithm and BILOG-MG. While the estimates agreed perfectly for most individuals, there appeared to be some instability, particularly near the bottom of the distribution where our test is only weakly informative and where students often score below the guessing rate. Given these differences, we chose to report only BILOG-based ML estimates.
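A Newton-Raphson ability search of the kind described above can be sketched as follows; this is not the paper's implementation (which used BILOG-MG and a custom routine), and the 3PL function is restated for completeness with illustrative parameters.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def loglik(theta, x, a, b, c):
    p = p_3pl(theta, a, b, c)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def mle_theta(x, a, b, c, theta0=0.0, h=1e-5, tol=1e-8, max_iter=100):
    """Newton-Raphson on the ability log-likelihood with numerical
    derivatives. Stops when the likelihood is locally flat -- the
    instability that arises for scores near or below the guessing rate."""
    theta = theta0
    for _ in range(max_iter):
        g = (loglik(theta + h, x, a, b, c)
             - loglik(theta - h, x, a, b, c)) / (2 * h)
        H = (loglik(theta + h, x, a, b, c) - 2 * loglik(theta, x, a, b, c)
             + loglik(theta - h, x, a, b, c)) / h**2
        if abs(H) < 1e-12:
            break  # flat likelihood: no curvature to exploit
        step = g / H
        theta -= step
        if abs(step) < tol:
            break
    return theta
```

For a response vector with all items identical (a = 1, b = 0, c = 0.2) and 25 of 36 correct, the first-order condition reduces to matching the proportion correct, giving θ ≈ 0.28; for patterns scoring below the guessing floor c, the gradient stays positive-flat and the routine stalls, illustrating the undefined-MLE problem noted above.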

A.2 Estimating EAP Scores and Plausible Values by Markov Chain Monte Carlo

Both EAP and plausible values are based on the posterior distribution of individuals' ability. In Section 2 we introduced the basics of the Bayesian approach using simplified notation. To be more precise, we now change the setup slightly and introduce notation for manifest predictors of the score. Letting Y denote the matrix of predictors such as state, gender, age, wealth, parental literacy and school type, we follow TIMSS and assume that covariates are linked to ability using a simple linear model,

θ_i = Y_i β + ε_i, where ε_i ∼ N(0, σ²).

Given this model, we can express the joint posterior distribution for all parameters as

P(θ, β, σ|X, Y) ∝ P(X|θ)P(θ, β, σ|Y) = P(X|θ)P(θ|β, σ, Y)P(β, σ|Y) = P(X|θ)P(θ|β, σ, Y)P(β)P(σ), (16)

where the last equality follows from assuming independent priors for β and σ. The objective is then to construct a Markov chain whose stationary distribution is this posterior. There are many strategies for constructing a chain with this property. In the IRT context, MH-within-Gibbs achieves the objective in a relatively straightforward manner.
The basic motivation behind "Gibbs samplers" is to reduce the simulation problem to a lower dimensional, perhaps univariate, space. In our case, we are interested in the joint distribution of N + K + 1 random variables, π = P(θ_1, ..., θ_N, β_1, ..., β_K, σ|X, Y). Gibbs sampling constructs a Markov chain M_t = (θ_1^(t), ..., θ_N^(t), β_1^(t), ..., β_K^(t), σ^(t)) by sampling each parameter in turn from its full conditional distribution given the current values of all the others. It can be shown that this chain converges to the stationary distribution π (e.g. Casella & George 1992, Tierney 1994). In the IRT context the full conditionals simplify considerably because of independence between individuals: each θ_i can be updated using only that individual's responses, via P(θ_i|X_i, β, σ, Y_i) ∝ P(X_i|θ_i)P(θ_i|β, σ, Y_i), while β and σ are updated from the linear model of θ on Y. If sampling from these full conditional distributions is easy, Gibbs sampling provides a means to generate a sample from the posterior of each parameter.
In practice, computing the normalizing constant in the denominator of each conditional may be difficult-e.g. a closed form solution may not exist. The MH-within-Gibbs algorithm avoids this complication by inserting a Metropolis step when sampling from the full conditionals. Chib & Greenberg (1995) provide an excellent pedagogic introduction to Metropolis-Hastings algorithms.
A representative example of the algorithm for parameter θ_i is:

1. Draw a proposal θ* = θ_i^(t) + ε, where ε ∼ N(0, s_i).
2. Accept the proposed value with probability

α = min{1, [P(X_i|θ*)P(θ*|β, σ, Y_i)] / [P(X_i|θ_i^(t))P(θ_i^(t)|β, σ, Y_i)]},

and otherwise set θ_i^(t+1) = θ_i^(t). By using a symmetric proposal distribution N(0, s_i), the standard MH criterion α does not include the proposal distribution. Moreover, note that by substituting (21) into (26) we are left with an algorithm that includes only known functions, since the denominator (the normalizing constant) cancels. We can therefore easily compute α and simulate a Markov chain that converges to the posterior of interest. The MH steps for the regression parameters β_k and σ are completely analogous. For a more comprehensive description of MCMC methods applied to IRT problems see Patz & Junker (1999a,b).
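The full MH-within-Gibbs scheme can be sketched compactly. This is an illustrative toy version under flat priors, not the paper's production routine: the function names, the vectorized MH step, the conjugate draws for β and σ, and all simulation settings are assumptions made for exposition.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def loglik(theta, x, a, b, c):
    # theta: (N,), item parameters: (G,), x: (N, G)
    p = p_3pl(theta[:, None], a, b, c)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=1)

def mh_within_gibbs(x, a, b, c, Y, n_iter=2000, s=1.0):
    """MH-within-Gibbs for theta_i = Y_i beta + eps_i, eps_i ~ N(0, sigma^2),
    with flat priors on beta and sigma (an illustrative sketch)."""
    N, K = Y.shape
    theta, beta, sigma = np.zeros(N), np.zeros(K), 1.0
    chain = []
    for _ in range(n_iter):
        # --- MH step for each theta_i, with symmetric normal proposals ---
        prop = theta + rng.normal(0, s, N)
        mu = Y @ beta
        log_alpha = (loglik(prop, x, a, b, c) - loglik(theta, x, a, b, c)
                     - 0.5 * ((prop - mu)**2 - (theta - mu)**2) / sigma**2)
        accept = np.log(rng.random(N)) < log_alpha
        theta = np.where(accept, prop, theta)
        # --- Gibbs step for beta: linear regression of theta on Y ---
        XtX_inv = np.linalg.inv(Y.T @ Y)
        beta_hat = XtX_inv @ Y.T @ theta
        beta = rng.multivariate_normal(beta_hat, sigma**2 * XtX_inv)
        # --- Gibbs step for sigma (scaled draw under a flat prior) ---
        sse = np.sum((theta - Y @ beta)**2)
        sigma = np.sqrt(sse / rng.chisquare(N - K))
        chain.append(theta.copy())
    return np.array(chain)
```

Averaging the post-burn-in draws of each θ_i gives the EAP estimate, and evenly spaced draws from the retained portion of the chain serve as plausible values, mirroring the procedure described below.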
To compute the EAP and plausible values estimates we ran a chain of 4,000 observations, discarding the first 2,000 as a burn-in period. As part of the linear model, we included private school attendance, age, age squared, family size, family size squared, gender, father literacy, mother literacy, wealth category, caste, state, school facilities category, an intercept and a missing data dummy as explanatory variables. Including these manifest predictors makes our estimates more precise and is required for subsequent analysis using plausible values to be valid (Mislevy, Beaton, Kaplan & Sheehan 1992). We assumed flat priors for the β and σ parameters, making the EAP estimates analogous to empirical Bayes, although this assumption has little effect since the data dominate the prior for these parameters. To ensure convergence, we experimented with the proposal distribution variances until the acceptance rates averaged around 44 percent with no significant outliers. Visually checking the chain graphs, running multiple chains, and comparing the results confirmed that the chains rapidly converged after several hundred observations and that autocorrelations were modest. Finally, we averaged the last 2,000 observations to compute the EAP estimate. Even with this relatively modest chain length, the Monte Carlo error was tiny compared to the variance associated with each score. We also took five evenly spaced draws from the posterior as plausible values.

Advanced International Benchmark -625
Students can organize information, make generalizations, solve non-routine problems, and draw and justify conclusions from data. They can compute percent change and apply their knowledge of numeric and algebraic concepts and relationships to solve problems.
Students can solve simultaneous linear equations and model simple situations algebraically. They can apply their knowledge of measurement and geometry in complex problem situations. They can interpret data from a variety of tables and graphs, including interpolation and extrapolation.

High International Benchmark -550
Students can apply their understanding and knowledge in a wide variety of relatively complex situations. They can order, relate, and compute with fractions and decimals to solve word problems, operate with negative integers, and solve multi-step word problems involving proportions with whole numbers. Students can solve simple algebraic problems including evaluating expressions, solving simultaneous linear equations, and using a formula to determine the value of a variable. Students can find areas and volumes of simple geometric shapes and use knowledge of geometric properties to solve problems. They can solve probability problems and interpret data in a variety of graphs and tables.

Intermediate International Benchmark -475
Students can apply basic mathematical knowledge in straightforward situations. They can add, subtract, or multiply to solve one-step word problems involving whole numbers and decimals. They can identify representations of common fractions and relative sizes of fractions. They understand simple algebraic relationships and solve linear equations with one variable. They demonstrate understanding of properties of triangles and basic geometric concepts including symmetry and rotation. They recognize basic notions of probability. They can read and interpret graphs, tables, maps, and scales.

[Figure 8 notes: Panel 2 shows the residual distribution controlling for a district fixed effect. Panel 3 shows the residual distribution controlling for a district fixed effect and child age, age squared, gender, caste, mother literacy, father literacy, and household wealth. Panel 4 shows the residual distribution including an additional school fixed effect. A considerable portion of the distribution in Panel 4 is due to measurement error.]