In this statistical vacuum, all eyes are on the Centre for Monitoring Indian Economy (CMIE) and its Consumer Pyramids Household Survey (CPHS). CMIE is a private agency that collects and compiles Indian statistics for sale. CPHS is a periodic survey that has been conducted three times a year since January 2014, in consecutive four-month “waves”. It is based on “a representative sample from all over India of over 170,000 households,” according to the CMIE website. It is also said to be a panel dataset that covers largely the same households over time, although response rates vary and new households are often added to compensate for attrition.
Aside from its high price, the CPHS dataset is, or at least sounds like, a researcher’s dream. It has become a veritable barometer of the Indian economy, watched especially closely for data on income, spending and employment. Research papers based on CPHS data are also springing up like mushrooms. CPHS stayed on track even during the Covid-19 crisis, we are told. The country owes CMIE considerable thanks for coming to the rescue of its tarnished statistical system.
However, is it really true that CPHS is a “robust, nationally representative and panel survey of households”, as a June 2021 World Bank discussion paper puts it, echoing many similar descriptions of this survey in influential articles?
Consider this: according to the CPHS, adult literacy (ages 15-49) was 100% in urban areas and 99% in rural areas at the end of 2019. That is too good to be true. It suggests that the CPHS sample is biased towards better-off households.
The plot thickens when we compare literacy rates at different points in time. Four years earlier, at the end of 2015, the literacy rate in the same age group was only 83% according to CPHS data. Could it really be that adult illiteracy was eradicated within four years? Seems unlikely.
We can investigate this question by looking at literacy for the same cohort over this period. For example, we can compare 15 to 49 year olds at the end of 2019 with 11 to 45 year olds at the end of 2015. These two groups correspond to the same cohort. If CPHS is primarily a panel dataset, then the literacy rate for this cohort should be roughly the same in 2015 and 2019. In fact, however, it increases in successive waves from 84% in 2015 to 99% in 2019. This suggests that the CPHS sample became increasingly biased towards better-off households over time.
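The cohort check can be spelled out in a few lines of Python. This is only an illustrative sketch: the function name `cohort_age_band` is our own and not part of any CPHS tool, and the two literacy rates are simply the figures quoted above.

```python
def cohort_age_band(age_min, age_max, survey_year, earlier_year):
    """Return the age band that the same birth cohort occupied in an earlier year.

    Hypothetical helper for illustration only.
    """
    shift = survey_year - earlier_year
    if shift < 0:
        raise ValueError("earlier_year must not be later than survey_year")
    return age_min - shift, age_max - shift


# People aged 15-49 at the end of 2019 were aged 11-45 at the end of 2015.
band_2015 = cohort_age_band(15, 49, 2019, 2015)

# CPHS literacy rates (%) for this cohort, as quoted in the text.
cohort_literacy = {2015: 84, 2019: 99}

# Change in percentage points for what should be the same group of people.
drift = cohort_literacy[2019] - cohort_literacy[2015]

print(f"Matching age band in 2015: {band_2015[0]}-{band_2015[1]}")
print(f"Change in cohort literacy, 2015 to 2019: {drift} percentage points")
```

In a stable panel, this change should be close to zero, since few people become literate after childhood.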
The bias was already in place at the end of 2015, based on a comparison with the earlier NFHS-4. The CPHS estimate of adult (15-49 years) literacy at this point is 6 percentage points higher for both men and women than the NFHS-4 estimate for 2015-16. The bias can also be seen in the household wealth data. For example, according to the CPHS, at the end of 2015 98% of households had electricity, 93% water in the house, 89% a television and 42% a refrigerator. The corresponding numbers from NFHS-4 are much lower: 88%, 67%, 67% and 30%, respectively.
There is no guarantee that NFHS-4 is more reliable than CPHS. But at least we know that it is a nationally representative survey, and the NFHS-4 numbers also look more plausible than their CPHS counterparts. Additionally, the NFHS-4 illiteracy numbers are in line with the 2011 census data for the same cohorts, but the CPHS literacy numbers are not: they are too high.
As mentioned earlier, the CPHS bias towards better-off households seems to have increased over time. By 2019, the bias was truly embarrassing, judging from similar comparisons with NFHS-5 data for the 11 large states for which NFHS-5 data are available. Look at Bihar. According to the CPHS, at the end of 2019, 100% of households in Bihar had electricity, 100% had water in the house, 98% had a toilet, and 95% had a TV. Paradise! The corresponding NFHS-5 numbers are much lower and much more plausible (96%, 89%, 62% and 35%, respectively). Bihar is just one state. For these 11 states together, however, there is a similar contrast (see table).
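To see how large the Bihar gaps are, the two sets of figures can be set side by side. The short Python sketch below uses only the percentages quoted above; the variable names are ours.

```python
# Share of Bihar households (%) with each amenity, as quoted in the text.
indicators = ["electricity", "water in the house", "toilet", "TV"]
cphs_2019 = [100, 100, 98, 95]  # CPHS, end of 2019
nfhs5 = [96, 89, 62, 35]        # NFHS-5

# Gap in percentage points: CPHS estimate minus NFHS-5 estimate.
gaps = {name: c - n for name, c, n in zip(indicators, cphs_2019, nfhs5)}

for name, gap in gaps.items():
    print(f"{name}: CPHS exceeds NFHS-5 by {gap} percentage points")
```

The gap is smallest for electricity (4 points) and largest for toilets (36 points) and TVs (60 points).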
Another clue comes from comparisons with the Periodic Labour Force Survey (PLFS) 2018-19, presented in Azim Premji University's State of Working India 2021 report (bit.ly/2SQvcqO). These suggest that CPHS vastly overestimates median earned income, perhaps by around 50% in rural areas.
In short, far from being nationally representative, the CPHS sample shows a strong bias towards better-off households, and this bias appears to have grown over time. The bias is perhaps unsurprising, as the sampling approach appears to be to canvass the “main street” first in each sample village or block and to move to the inner streets only if the sample size requires it. For this reason alone, poor households are inevitably underrepresented.
We noticed this bias in CMIE data, incidentally, in a recent review of the evidence on the economic impact of Covid-19. A series of household surveys, focusing on informal sector workers and their families, strongly suggests that employment, income, spending and food intake remained well below pre-lockdown levels throughout 2020. CPHS, on the other hand, suggests a fairly quick recovery shortly after the national lockdown. This apparent contradiction is easily resolved once one considers that poor households are severely underrepresented in the CPHS data.
All of this is just one example of the statistical questions that urgently need review, given the prominent role CMIE data plays in today’s economic debate. The first step is for CMIE to confirm or withdraw its claim that CPHS is a nationally representative survey (a reasonable expectation, surely, from an agency that charges $180,000 per wave for adding a one-minute question to the survey). Then let a hundred voices boom.