In the three decades since Brewster Kahle spun up the nonprofit Internet Archive’s Wayback Machine, it has scaled up to include government websites and datasets—many of which are essential to the engineering and scientific communities. U.S. government agencies like the National Science Foundation, Department of Energy, and NASA are critical sources of research data, technical specifications, and standards documentation in pretty much every area where IEEE Spectrum’s audience works—AI & computer science, biomedical devices, power and energy, semiconductors, telecommunications…the list goes on.
Access to that governmental data directly affects the reproducibility of experiments, the validation of models, and the integrity of the scholarly record.
So what happens if an entire dataset vanishes? Among other things, it can invalidate years of research built upon that foundation.
Until recently, wholesale deletion of government data has been rare. In the United States, presidential transitions typically bring some changes to government websites to reflect new policy priorities. And after 9/11, the George W. Bush administration removed “millions of bytes” of information from government sites for security reasons, along with hundreds of Department of Defense documents and “tens of thousands” of Federal Energy Regulatory Commission files.
The Obama and Biden administrations likewise made changes to government websites but didn’t engage in large-scale removal of Web pages or datasets. Obama, in fact, expanded public access to government data in 2009 by launching Data.gov, whose stated mission is in part “to unleash the power of government open data to inform decisions by the public and policymakers.”
During President Donald J. Trump’s first term, researchers at the Environmental Data & Governance Initiative found that some government sites became inaccessible, and the phrase “climate change” was purged from several government Web pages.
But watchdog groups mostly didn’t observe outright data destruction, according to Spectrum Assistant Editor Gwendolyn Rak.
The second term has been different. In February, a few weeks after Trump was sworn in for his second term, The New York Times reported that his administration had taken down more than 8,000 Web pages and databases. Many of those pages have since reappeared, but some of the restored pages and files have been altered, including the erasure of terms like “climate change” (again) and “clean energy,” Grist reports. These moves have faced multiple court challenges; on 11 February, for instance, a federal judge ordered that public access to Web pages and datasets belonging to the Centers for Disease Control and Prevention and the Food and Drug Administration be restored.
In our April issue, Rak reports on efforts to preserve public access to information. In addition to the ongoing work at the Internet Archive, she describes how archivists at the Library Innovation Lab at Harvard Law School amassed a copy of the 16-terabyte archive of Data.gov, which includes more than 311,000 public datasets. That copied archive is being updated daily with new data hoovered up via automated queries to application programming interfaces (APIs).
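To give a sense of what those automated API queries can look like, here is a minimal Python sketch that pages through Data.gov’s public catalog API (a CKAN package_search endpoint) and lists dataset names with their last-modified timestamps. The endpoint and fields shown are assumptions about the public catalog, not a description of the Harvard lab’s actual pipeline.

    # Minimal sketch (assumption): enumerate dataset metadata from Data.gov's
    # CKAN-based catalog API. A real archiving pipeline would also download
    # each dataset's files and track changes between daily runs.
    import json
    import urllib.request

    BASE = "https://catalog.data.gov/api/3/action/package_search"
    PAGE_SIZE = 100  # records per API call

    def fetch_page(start: int) -> dict:
        """Fetch one page of catalog records, returning the 'result' object."""
        url = f"{BASE}?rows={PAGE_SIZE}&start={start}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)["result"]

    def list_datasets(limit: int = 300):
        """Yield (name, last-modified) pairs for up to `limit` catalog entries."""
        start, seen = 0, 0
        while seen < limit:
            page = fetch_page(start)
            for pkg in page["results"]:
                yield pkg["name"], pkg.get("metadata_modified")
                seen += 1
                if seen >= limit:
                    return
            start += PAGE_SIZE
            if start >= page["count"]:  # no more records in the catalog
                return

    if __name__ == "__main__":
        for name, modified in list_datasets():
            print(modified, name)

Run daily, a script along these lines could flag datasets whose metadata_modified timestamp has changed since the previous pass, so only new or updated material needs to be fetched.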
Archivists are the guardians of memory. We depend on them to help us stay in touch with our history, maintain our knowledge base, and provide context, allowing us to understand how we came to be where we are and to light the way forward. In the fields of science, engineering, and medicine, where today’s innovations stand on the shoulders of yesterday’s discoveries, these digital preservationists ensure that the circuit of human knowledge remains unbroken.
This article appears in the April 2025 print issue as “Lots of Copies Keep Stuff Safe.”