Protect Government Data for Future Historians: Announcing Endangered Data Week

Brandon Locke, April 2017

A woman uses a Hollerith card punch machine for the 1940 census. NARA/Wikimedia Commons

The federal government has been collecting and producing electronic data about the nation for more than 125 years. The 1890 census was counted with one of Herman Hollerith’s first electronic tabulator machines, built specifically for that purpose. Ninety-six years ago, that first dataset was lost to history: a fire broke out in the basement of the Department of Commerce building, destroying nearly all of the census schedules, which were kept on the floor outside a vault (Blake, 1996). Making things worse, counties were not required to keep backup copies of their schedules locally, as they were for the 10 prior censuses. Due to these costly mistakes, historians will never be able to closely examine the westward movement that led Frederick Jackson Turner to proclaim the frontier closed, to trace Civil War veterans and widows, or to study the individuals that made up neighborhoods.

Historians often use government-produced data, including census records, housing and home loan records, and manufacturing and shipping records. These and other datasets provide crucial information about the nation and its people. It’s nearly impossible to tell exactly how much data is produced and shared by the government, but since President Barack Obama’s 2013 executive order concerning open data, 186 organizations (including federal agencies and state and local governments) have contributed to the Data.gov portal to make access easier. The portal currently points to over 175,000 datasets, ranging from weather and pollination data to county-level crime reports. The data is generally hosted by the agency that produced it, so there is no guarantee that it is mirrored by any nongovernmental organization, which makes its accessibility dependent upon current administrations and governmental agencies.

There is good reason for concern about the continued collection and availability of US government data, all of which belongs to the people. Donald Trump has signaled his opposition to a number of data-collecting initiatives, most notably those concerning climate change. In just the first two weeks of the Trump administration, the Environmental Protection Agency (EPA) was allegedly ordered to remove the climate change page from its website, and the EPA, the Department of Agriculture, the National Park Service, and other federal agencies were also given temporary gag orders.

Climate data is not the only endangered domain on the horizon, nor is the executive the only branch of government raising concerns. The Senate (S.103) and House of Representatives (H.R.482) have both introduced bills mandating that “no Federal funds may be used to design, build, maintain, utilize, or provide access to a Federal database of geospatial information on community racial disparities or disparities in access to affordable housing.” This language is deeply disturbing: not only would it prevent the collection of data regarding crucial inequalities, it would also proscribe access to existing data. While these bills would likely only apply to the Department of Housing and Urban Development, the language could set a precedent to disregard race in the collection of other data. This proposed legislation is a challenge to economic justice and evidence-based policies, and would impair historians who use this data to analyze changes in neighborhood demographics, urban development, policing, and the impact of redlining and other discriminatory housing policies.

Historians thus have a stake in preserving government data, for current projects and for future generations of historians. While many researchers, archivists, and librarians have been working independently to preserve this data for years, historians should join the growing movement for saving these datasets, considering the enormous scope and scale of the task. In the 2005 book Digital History, Dan Cohen and Roy Rosenzweig argued that historians need to be a part of the conversation, alongside archivists and librarians, on how digital materials are saved, and these datasets are no exception. Historians who frequently work with electronic data or have expertise in the particular domain of the data can help with input on the description, format, coverage, and metadata, making the data more usable to current and future historians.

While our technology and preservation policies have obviously come a long way since the loss of the 1890 census, some fundamental risks associated with electronic data loss and suppression remain. The practice of printing and distributing government data to national repositories continues to shift toward digital publication. Repository libraries often make duplicate digital copies, but this is done in an ad hoc manner and depends on the interests and infrastructural capabilities of each library. If nobody makes an effort to mirror the publications in a way that retains authenticity and provenance, we are left with a single edition that can easily disappear on the whim of a subsequent administration or with the elimination of an agency. Realizing this, a number of concerned researchers, librarians, and citizens have been stepping up to independently mirror federal data. The most visible of these efforts are DataRescues, events spearheaded by the Penn Program in the Environmental Humanities Lab, University of Michigan Libraries, and Project_ARCC. Volunteers across the country are working to download these datasets, both as part of DataRescue events and independently, and are uploading the data to new repositories like DataLumos and DataRefuge.

As an increasing number of constituencies are realizing, the future of public data is in doubt. What can we do to respond?

Perhaps the best way to fight information suppression is to persistently shine light on the issue. In the late 1970s, the rise of the Moral Majority in the United States prompted widespread campaigns to challenge materials in public and school libraries as immoral and anti-American. Local challenges to books increased threefold between 1979 and 1981. In response, Judith Krug, the longtime director of the American Library Association Office for Intellectual Freedom, launched Banned Books Week in 1982 to fight censorship and to highlight the need to preserve the “freedom to read.” Thirty-five years later, Banned Books Week is still widely observed across the United States, most notably in public and academic libraries. Activities often include book displays and public readings of commonly challenged books, lectures and film festivals, and letter-writing campaigns to promote the values of intellectual freedom.

We should build upon the efforts of the DataRefuge program and the success of Banned Books Week to develop sustained advocacy for governmental open data. This month will witness the inaugural Endangered Data Week, a coordinated series of events across campuses, nonprofits, libraries, citizen science initiatives, and cultural heritage centers to shed light on public datasets that are in danger of being deleted, repressed, or lost. Endangered Data Week’s goals are to bring awareness of different types of threats to publicly available data, engage with the power dynamics involved in data creation, sharing, and retention, and make endangered data more secure and accessible. These events can capitalize on endangered collections in DataLumos, DataRefuge, and elsewhere by improving access and preservation, visualizing and publicizing the data, critically engaging with the power dynamics of data collection and sharing, encouraging political activism for open data policies, and conducting workshops on data curation, documentation, and preservation.

While much of the urgency regarding federal data has revolved around Trump’s election and inauguration, this is hardly an issue confined to the current administration. Political threats to information are nothing new, and several notable gaps in federal data, such as mandatory hate crime reporting and officer-involved shootings, have persisted through administrations of both parties. This is all the more reason to organize and coordinate programming that sheds light on the reality of public data, elevates and protects endangered data, and examines the complex power dynamics behind data collection, organization, and sharing. Historians have a duty to stand up as users of past records for the preservation of crucial information, as scholars for evidence-based policy, and as citizens for government transparency.

Brandon Locke is director of the Lab for the Education and Advancement in Digital Research (LEADR) at Michigan State University. You can find him on Twitter at @brandontlocke.

