COVID-19 Insights: Working with Imperfect and Incomplete Data

With so much at stake, what do we need to acknowledge with the COVID-19 data set?


At Metopio, we are dedicated to data hygiene.

After nearly eight months of regularly updating cases, case rates, testing rates, positive tests and case fatality rates, we wanted to assess where there are gaps and share how you should think about them in your analyses.

COVID-19 data is imperfect, yet we need to keep collecting and analyzing what is available to understand the impact of the pandemic. Imperfect does not mean unusable. As the pandemic continues, though, we need to understand these gaps and how to improve data collection going forward.

Government Context

Because COVID-19 is an infectious disease, all positive tests have to be reported to health departments, but there isn’t a similar mandate for negative tests. Early in the pandemic, the Centers for Disease Control and Prevention (CDC) was not collecting data from private labs. In June, it revised that guidance so certified labs could submit test results. This is an example of how government policies shape this data set.
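To see why the gap in negative-test reporting matters, here is a rough sketch of how test positivity shifts when only some negative results reach a health department. The numbers and the function name are invented for illustration and do not describe any real jurisdiction.

```python
def positivity_rate(positives: int, negatives: int) -> float:
    """Share of reported tests that came back positive."""
    total = positives + negatives
    if total == 0:
        raise ValueError("no tests reported")
    return positives / total

# Hypothetical week: 100 positive and 900 negative tests actually occurred.
true_rate = positivity_rate(100, 900)

# Assumption: every positive is reported, but only half of the negatives are.
reported_rate = positivity_rate(100, 450)

print(f"true positivity:     {true_rate:.1%}")
print(f"reported positivity: {reported_rate:.1%}")
```

In this made-up case the reported positivity is higher than the true one even though the number of infections is identical, simply because the denominator is incomplete.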

The U.S. public health system is made up of a vital network of local agencies that provide care, coordinate, collaborate and roll up to state agencies. The pandemic has exposed antiquated systems and wide variation in resources, technical skills and staffing from jurisdiction to jurisdiction. In some places, these agencies even overlap geographically, making tracking more complicated.

Variation in how data is reported adds to the complexity. For example, how deaths are defined varies by jurisdiction. As of September 2020, Oregon reports COVID-19 deaths as those who died from the disease, while Washington reports them as those who died with the disease. This is not just semantics: the two definitions can produce different death counts for the same set of cases.
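A minimal sketch of the "died from" versus "died with" distinction, using a deliberately simplified record with two hypothetical fields. Actual state definitions are more detailed than this, and the records below are made up.

```python
from dataclasses import dataclass

@dataclass
class DeathRecord:
    had_covid: bool        # person had a positive COVID-19 result (assumed field)
    covid_was_cause: bool  # COVID-19 is listed as a cause of death (assumed field)

def count_died_from(records):
    """Count deaths where COVID-19 is recorded as a cause."""
    return sum(1 for r in records if r.had_covid and r.covid_was_cause)

def count_died_with(records):
    """Count deaths of anyone who had COVID-19, whatever the cause."""
    return sum(1 for r in records if r.had_covid)

# Four invented death records for the same period.
records = [
    DeathRecord(had_covid=True, covid_was_cause=True),
    DeathRecord(had_covid=True, covid_was_cause=True),
    DeathRecord(had_covid=True, covid_was_cause=False),
    DeathRecord(had_covid=False, covid_was_cause=False),
]

print("died from:", count_died_from(records))
print("died with:", count_died_with(records))
```

The same four records yield two different totals, which is why death counts from jurisdictions using different definitions should not be compared without adjustment.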

Uncertainty at the federal level has exacerbated challenges in data collection. On July 14, the Trump Administration issued guidance that the Department of Health and Human Services would be collecting COVID-19 data rather than the CDC’s National Healthcare Safety Network. Then, only a month later, the Administration reversed that decision, citing issues of continuity, timeliness and transparency.

Shifting policy in a politically charged environment, layered on top of the operational challenges of a decentralized public health system, makes data collection and standardization difficult.

Impact of Access

Another important variable confounding COVID-19 data collection is the number of tests available and who is able to get tested. This brief timeline shows how testing and the collection of results have evolved and where challenges remain.

  1. On February 29, the FDA loosened the regulations on the development of COVID-19 tests. Before this date, all tests had to be conducted by the CDC for a case to be counted as a “confirmed positive”.
  2. On March 13, the U.S. declared the coronavirus a national emergency. Prior to this, tests were extremely limited and prioritized for those who were coming to a hospital. That means the tested population was more likely to be sick and had an increased chance of adverse outcomes.
  3. On May 18, HHS announced the CDC would distribute $10.25 billion in funding for state and local jurisdictions for testing.
  4. On August 5, the CDC updated its guidance so that a “confirmed” case is one based on a polymerase chain reaction (PCR) test. Positive results from antigen tests are considered “probable” cases because they can be less accurate. It is critical for testing to evolve, become more accurate and more available. The Food and Drug Administration (FDA) is a good resource as testing continues to evolve.
  5. In late August, colleges and universities began returning to campus. Many were administering antigen tests because the results are delivered in minutes without needing a lab to process them. However, colleges are not typical healthcare providers, and they don’t have an easy or uniform way to send data electronically to public health authorities.

If increased use of antigen tests coincides with a decrease in PCR tests, the count of infected individuals could be artificially low, because antigen positives are classified as probable rather than confirmed and may not reach public health authorities at all. As specific populations gain more access to testing and science races to keep pace with the pandemic, data collection must also evolve. Meanwhile, we need to recognize that this evolution creates a wide variety of results that cannot always be reconciled.
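The effect of a shift from PCR to antigen testing can be sketched with a simple tally. The mapping of test type to case status follows the confirmed/probable distinction described in the timeline; the weekly figures and names such as `tally_cases` are hypothetical.

```python
from collections import Counter

# Case status by test type, per the confirmed/probable distinction above.
CASE_STATUS = {"pcr": "confirmed", "antigen": "probable"}

def tally_cases(positive_tests):
    """Count positive results by the case status their test type implies."""
    return Counter(CASE_STATUS[test_type] for test_type in positive_tests)

# Two invented weeks with the same 100 positive results but a different test mix.
week_1 = ["pcr"] * 80 + ["antigen"] * 20
week_2 = ["pcr"] * 50 + ["antigen"] * 50

print("week 1:", dict(tally_cases(week_1)))
print("week 2:", dict(tally_cases(week_2)))
```

Confirmed cases appear to drop between the two weeks even though the number of positive results is unchanged. An analysis that tracks only confirmed cases, or that never receives the antigen results, would read this as a decline.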

Race/Ethnicity Data

It is essential to collect race/ethnicity data to understand the disproportionate impacts of COVID-19. However, how it is collected varies widely across government agencies. Several entities are tracking how this data is being collected.

  1. In June, the American Medical Association reported that 14 states were still not collecting race/ethnicity data for COVID-19 deaths. By September, that number had fallen to 2.
  2. The COVID Tracking Project provides updates on how much of each state’s data includes race/ethnicity. Not all states were collecting it from the beginning, and some only collect it for certain data points.

While there are many efforts to provide consistent, national daily updates for researchers and decision-makers, only 13 states provide the data in a machine-readable format. That makes data collection inefficient and time-consuming when time is of the essence.

Our Curation Process

As your provider of quality, trusted data, Metopio considers all these questions and more when curating public data. We have internal processes in place to ensure consistent reporting and definitions across jurisdictions, account for the impact of reporting issues, estimate the likely bias and skew resulting from reporting and definitional challenges, and provide clearly defined demographic breakdowns where available.

Raw public data sets are often thorny and challenging to interpret; we do this work so you don’t have to. Check out our Curated Data Library, available to all subscribers, for more information on our public data sets, including topics related to COVID-19.

Can we help your organization understand the impact of COVID-19 on populations and places you care about? Contact us