A study published in the medical journal The Lancet has found a pattern of undisclosed modifications to United States government health datasets. The investigation reports that more than 100 federal health datasets were quietly altered between late January and late March 2025: nearly half of the examined files underwent potentially substantive wording changes, most of them without public notice.
Experts warn that such hidden edits could severely compromise the integrity of public health research and erode public trust in federal data sources, potentially leading to flawed policy decisions and misdirected resources.
Unveiling the Hidden Changes: The Research Methodology
To find these alterations, researchers downloaded online catalogs, known as harvest sources, maintained by federal agencies under the 2019 Open Government Data Act. They focused on entries from key agencies, including the Centers for Disease Control and Prevention (CDC), the Department of Health and Human Services (HHS), and the Department of Veterans Affairs (VA), that showed modification dates between January 20 and March 25, 2025.
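The date filter described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes each catalog entry is a dictionary with a `modified` field holding an ISO 8601 date, and the sample entries and the function name `modified_in_window` are hypothetical.

```python
from datetime import date

# Study window reported in the article.
WINDOW_START = date(2025, 1, 20)
WINDOW_END = date(2025, 3, 25)


def modified_in_window(entries, start=WINDOW_START, end=WINDOW_END):
    """Return the entries whose 'modified' date falls within [start, end]."""
    selected = []
    for entry in entries:
        # Keep only the YYYY-MM-DD prefix so full timestamps also parse.
        stamp = str(entry.get("modified", ""))[:10]
        try:
            modified = date.fromisoformat(stamp)
        except ValueError:
            continue  # skip entries with a missing or unparseable date
        if start <= modified <= end:
            selected.append(entry)
    return selected


# Hypothetical catalog entries, for illustration only.
sample = [
    {"title": "Dataset A", "modified": "2025-01-19"},
    {"title": "Dataset B", "modified": "2025-02-14"},
    {"title": "Dataset C", "modified": "2025-03-25T09:30:00Z"},
    {"title": "Dataset D"},
]
print([e["title"] for e in modified_in_window(sample)])
# ['Dataset B', 'Dataset C']
```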
After filtering for duplicates and frequently updated files, the team analyzed 232 unique datasets. For each, they sourced an archived pre-modification copy, primarily utilizing the Internet Archive’s Wayback Machine. A word-processing program’s comparison feature was then used to highlight all textual differences, specifically excluding numerical tables. Crucially, investigators cross-referenced these changes with the official public change logs appended to each dataset’s web page, discovering a widespread absence of documentation.
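The researchers used a word processor's comparison feature. A rough programmatic equivalent, offered here only as a sketch, is a word-level diff with Python's standard `difflib` module. The helper names and the rule for what counts as a numerical row are assumptions made for this example, not part of the study.

```python
import difflib
import re


def is_numeric_row(line):
    """True if a non-empty line holds only digits, whitespace, and punctuation."""
    return bool(line.strip()) and re.fullmatch(r"[\d\s.,;:%|/+-]+", line) is not None


def text_words(text):
    """Split text into words, skipping rows that look like numerical tables."""
    return [
        word
        for line in text.splitlines()
        if not is_numeric_row(line)
        for word in line.split()
    ]


def wording_changes(old_text, new_text):
    """Return (operation, old words, new words) for every textual difference."""
    old_words = text_words(old_text)
    new_words = text_words(new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


# Hypothetical archived and current descriptions, for illustration only.
archived = "Deaths by gender and age\n12, 34, 56\n"
current = "Deaths by sex and age\n12, 34, 99\n"
print(wording_changes(archived, current))
# [('replace', 'gender', 'sex')]
```

Because numerical rows are dropped before the comparison, the changed figure in the second row is not reported, which mirrors the study's choice to leave numerical tables out of scope.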
Core Findings: “Gender” to “Sex” Swaps and Opaque Revisions
The study found a striking consistency in the alterations. Out of the 232 datasets analyzed, 114 (49%) contained what the authors deemed potentially substantive wording changes. A dominant pattern emerged: 106 of these alterations involved switching the term “gender” to “sex.” Other notable changes included replacing “social determinants of health” with “non-medical factors” and “socio-economic status” with “socio-economic characteristics.” One clinical trial listing even revised its title from “gender diverse” to “include men and women.”
The majority of these revisions (89 cases) directly affected data definitions, such as column names or category labels. The remaining 25 changes appeared in narrative descriptions or tags. Only 15 of the 114 altered files, fewer than one in seven, publicly acknowledged these revisions in their official change logs.
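The proportions behind these counts are easy to verify. The short check below uses only figures reported in the article; the helper `share` is a hypothetical name introduced for this example.

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


analyzed = 232          # unique datasets examined
altered = 114           # datasets with potentially substantive wording changes
gender_to_sex = 106     # alterations that swapped "gender" for "sex"
in_definitions = 89     # changes to column names or category labels
in_descriptions = 25    # changes to narrative descriptions or tags

# The two placement categories should account for every altered dataset.
assert in_definitions + in_descriptions == altered

print(share(altered, analyzed))         # 49.1
print(share(gender_to_sex, altered))    # 93.0
print(share(in_definitions, altered))   # 78.1
print(share(in_descriptions, altered))  # 21.9
```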
The timing of these edits also raised eyebrows, with a marked acceleration observed: four edits in late January, 30 in February, and 82 during the first three and a half weeks of March, suggesting a concentrated effort.
Far-Reaching Implications for Research and Public Health
These government datasets are the bedrock of countless research projects in psychology, sociology, and public health. For instance, the Behavioral Risk Factor Surveillance System (BRFSS) provides vital data on health behaviors, while CDC files on heart disease and stroke mortality aid in understanding public health trends. VA summaries are indispensable for veteran mental health research.
When crucial variable labels like “gender” inexplicably shift to “sex,” studies comparing data collected under different terminologies become unreliable. Even a single undocumented change can invalidate prior statistical models, hinder replication attempts, or obscure genuine population trends. This distinction is particularly critical, as “gender” refers to a social identity, while “sex” denotes biological classification. Without clarity, analysts cannot discern if a change in demographic ratios reflects actual shifts, a mere wording tweak, or unannounced re-coding, potentially leading to misinformed public health policies and medical guidelines.
Potential Political Motivations and Lack of Transparency
The study authors point to a possible political impetus for these changes, noting a White House directive issued in January 2025 that told agencies to remove material perceived as advancing “gender ideology.” While no federal office has confirmed a direct link, the timing and narrow focus on the term “gender” suggest coordinated action.
Even if the aim was simply to align terminology across agencies, the investigation indicates that the edits disregarded the transparency mandated by the Open Government Data Act.
Study Limitations and Recommendations for Data Integrity
The researchers acknowledge certain limitations, including the inability to examine earlier periods due to archive constraints and the subjective nature of judging change substance. Furthermore, numerical content was not re-examined, leaving open the question of whether figures were also altered.
In light of their findings, the authors propose several protective measures for scholars and institutions. These include independent mirroring of federal datasets on private servers, local archiving by individual investigators, and routine spot checks against archived versions. International repositories like Europe PubMed Central offer alternative hosting for biomedical resources.
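One way an individual investigator might run such spot checks is to record a checksum for every file at download time and compare it again later. The sketch below assumes a local folder of archived files and a manifest mapping file names to SHA-256 digests. Both the layout and the function names are hypothetical, and the study does not prescribe any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file name in the directory to its current SHA-256 digest."""
    return {
        path.name: sha256_of(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


def changed_files(manifest, directory):
    """List manifest entries that are now missing or whose content differs."""
    changed = []
    for name, expected in sorted(manifest.items()):
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(name)
    return changed
```

A checksum only shows that a file changed, not what changed, so a flagged file would still need a textual comparison against the archived copy.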
Ultimately, the researchers emphasize the paramount need for a cultural commitment to full version tracking within federal agencies. This would ensure that every member of the public can clearly see what changed, when it changed, and, most importantly, why.
The study, titled “Data manipulation within the US Federal Government,” was co-authored by Janet Freilich and Aaron S. Kesselheim.