News

Data Overload

Web Archives and the Challenges of Scale

Seth Denbo | May 7, 2019

Imagine future historians studying the public discourse on autism in the early 21st century. They sift through an archive as vast as anything we know today, but they must contend with born-digital sources—blogs written by people with autism, for example, or the websites of advocacy organizations and government agencies, not to mention video, audio, and social media content, all gathered from across the web. The extent of this archive means that the traditional methods of doing historical research will no longer be relevant, at least for this project. Historians will have to use new techniques and digital tools to interrogate the archive. The scale of pertinent sources, the technical skills required to analyze them, and the need to assess what was and wasn’t collected by the archivists who processed these materials in the first place will raise a host of challenges.

The Internet Archive has preserved over 350 billion webpages using the PetaBox storage system. One PetaBox stores one petabyte of information. Steve Rhodes/Flickr/CC BY-NC-ND 2.0

As the web becomes ever more integrated into our lives, numerous entities, such as the Library of Congress and the Internet Archive, have begun archiving it. But these new web archives contain so much data that historians have begun reconsidering research methods, skills, and epistemology. In fact, few historians now possess the requisite qualifications to perform professional research in web sources.

In March 2019, participants in a “datathon” held at George Washington University in Washington, DC, got a taste of what research with born-digital web archives could look like. The event was organized by the Andrew W. Mellon Foundation-funded Archives Unleashed Project, which, according to its website, “aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.” The project’s goal is to lower barriers to working with large-scale web archives by creating accessible tools and a web-based interface with which to use them. The datathon brought together librarians, archivists, computer scientists, and researchers from a variety of disciplines, including the humanities and social sciences, to explore web archives on a wide range of topics.

At datathons, people can broadly experiment with specific datasets, asking and answering questions about them. The Archives Unleashed Project has hosted datathons since 2016 to explore the possibilities that web archives present for research. A “big challenge for this project,” explains Ian Milligan, principal investigator of Archives Unleashed, is determining “where should the project end and the researcher take over.” In other words, how can the project ensure that it sufficiently prepares archival custodians and researchers to continue to be able to do this work in the future? Through these datathons, Archives Unleashed strives to build communities of users around the tools it is creating and to develop expertise in using web archives and sources.

Studying the recent past all but compels historians to use web archives.

At the datathon in Washington, the project team provided pre-selected collections of web sources, and participants chose which materials they wanted to work with. Topics included the Washington, DC-area punk music scene, web content from former Soviet Bloc countries, and the #MeToo movement. Participants identified the questions they wanted to ask of the sources, used analytical tools from the Archives Unleashed Toolkit to explore the data, and presented their findings.

One group explored the non-textual elements—images, audio, video—in the 48 gigabytes of the DC Punk Archive they had been given to work with. With a tool in the Archives Unleashed Toolkit, they extracted over 10,000 digital objects from the collection, then identified each object’s file type to determine what kinds of materials they were working with. Expecting to find mostly audio and video of concerts, the group also discovered tickets, posters, flyers, album covers, photographs of artists, and more—objects that would be vital to telling the history of the scene. Another team, working with the former Soviet bloc websites, analyzed those sites’ outgoing links to understand which other sites around the world were important within the context of their collection.
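Both explorations amount to a pass over the records in a web archive. As a rough illustration, here is a minimal Python sketch of the first team’s file-type tally, written against the open-source warcio library rather than the Spark-based Archives Unleashed Toolkit itself; the file name is hypothetical.

```python
# A rough sketch of a file-type tally over a web archive, using the
# open-source warcio library (pip install warcio). This is not the
# Archives Unleashed Toolkit itself, which runs on Apache Spark;
# "dc_punk.warc.gz" is a hypothetical file name.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

mime_counts = Counter()
with open("dc_punk.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        # The Content-Type header identifies what kind of object was captured.
        mime = record.http_headers.get_header("Content-Type", "unknown")
        mime_counts[mime.split(";")[0].strip()] += 1

for mime, count in mime_counts.most_common(20):
    print(f"{count:8d}  {mime}")
```

A similar pass that parses each HTML record for its outgoing links, rather than counting Content-Type headers, is the essence of the second team’s link analysis.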

Another team explored sites from the #MeToo Digital Media Collection, which is being gathered by Harvard University’s Schlesinger Library as part of a project to comprehensively document the movement. Several participants approached the harvested material from an archival perspective, asking questions like: Is the collection capturing what’s necessary in order to be useful to researchers now and in the future? What are the assessment criteria that should be used to ensure that the collection has archival value? How do you document those decisions and ensure that the resulting archives are usable?

These are questions that archivists have always asked in making decisions about records and other archival materials. But web archiving includes its own problems of scale, preservation, privacy, and copyright. The Internet Archive began preserving sites from the World Wide Web in 1996. Since then it has archived over 350 billion webpages. The storage required for all of this content is well in excess of 15 petabytes. (Your computer at home probably has about a thousand gigabytes of hard drive storage; one petabyte is a little over one million gigabytes.) Users of the Internet Archive’s Wayback Machine can explore a treasure trove of websites, including the entire GeoCities network; over 1,600 versions of algore.com—the website of the former vice president—dating back to 1998; and the earliest US federal government web pages. The majority of this content remains untapped by historians.

And the Internet Archive is not the only institution capturing what exists on the web; traditional institutions are involved in this effort as well. The DC Punk Archive is the work of special-collections librarians at the District of Columbia Public Library. National libraries and legal-deposit libraries also do this archival labor, and a growing number of countries have passed non-print legal-deposit laws, which mandate the collection of sites within national domains, such as .fr or .no in Europe. The British Library has worked with the United Kingdom’s network of deposit libraries to routinely archive the entire .uk web domain, as required by a 2013 law that complemented the long tradition of legal deposit of print materials in national libraries.

Working with the Internet Archive, the Library of Congress has also been creating archives of websites in the public interest since 2000. The library currently collects between 20,000 and 25,000 gigabytes of content per month: sites on a wide range of topics, the sites of the legislative branch of the US federal government and a selection of those maintained by executive agencies, and some international websites, such as those covering general elections around the world and major political and social upheavals. In a phone interview, Abigail Grotke, web archiving team lead at the Library of Congress, explained how reference librarians and overseas operations officers with subject-matter expertise provide guidance on which web archive collections the library should create and maintain, and on “the urgent events” that should be documented and preserved as they happen.

Web archiving brings its own problems of scale, preservation, privacy, and copyright.

According to Grotke, the Library of Congress always obtains permission from a site’s owners before “crawling”—a term derived from the use of a piece of software called a web crawler that systematically browses and collects data from websites. While this adds complexity to the task and requires that the library be much more selective about what it collects, Grotke says that it also allows its collecting to be more “focused and deeper.” Since no collecting work can capture everything on the web, decisions always need to be made about where a crawl stops. Attention to details like these will ensure that historians can explore what’s preserved in these vast collections of data. Still, gleaning meaningful information from these sources will require historians to use new tools.
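For readers unfamiliar with the term, the toy sketch below shows the basic shape of a crawl in Python: fetch a page, harvest its links, follow them, and stop at a chosen boundary, consulting robots.txt along the way. It is an illustration only; archival crawlers such as the Internet Archive’s Heritrix are far more sophisticated, and the seed URL is a placeholder.

```python
# A toy illustration of what "crawling" means: fetch a page, harvest its
# links, and follow them, checking robots.txt first. Production crawlers
# such as Heritrix are far more careful about politeness, deduplication,
# and error handling; the seed URL below is a placeholder.
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=25):
    robots = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        page = requests.get(url, timeout=10)
        # An archival crawler would write the raw response to a WARC file here.
        for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(seed).netloc:
                queue.append(link)  # stay within the seed site's domain
    return seen

# crawl("https://example.com/")
```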

Many of the barriers to using these archives are simply a result of scale—the archives are just too big to provide good results from keyword searches or even to browse through. As a result, analytical tools are necessary. Web archiving crawls create files in the WARC format, an international standard that has been adopted by libraries and other web archiving organizations. WARC files preserve the content of a website along with other archival information, such as when the content was collected. The Archives Unleashed Toolkit used in the datathon (available for free on the project’s website, at archivesunleashed.org/aut/) includes scripts (little programs that do discrete tasks) to sort and manage the data and metadata in WARC files.
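It is easy to see what that archival information looks like. The minimal sketch below, again using the warcio library (an assumption; any WARC-aware tool would serve), prints the record type, crawl date, and target URL stored with each record; the file name is a placeholder.

```python
# A minimal look inside a WARC file with the warcio library. Each record
# pairs the captured content with archival metadata; "capture.warc.gz"
# is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator

with open("capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        print(record.rec_type,                                   # e.g., "response"
              record.rec_headers.get_header("WARC-Date"),        # when it was crawled
              record.rec_headers.get_header("WARC-Target-URI"))  # what was crawled
```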

The toolkit allows users to, for example, strip out everything but the main content, eliminating secondary information such as website navigation and ads. Other scripts in the toolkit allow users to see what is included in the archive they are working with. Users can also filter by language, group sites in a collection by the date on which they were crawled, or find all names of individuals, organizations, or places in a group of sites. These techniques do require some basic knowledge of how websites work, but they don’t necessitate years of training.
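As a rough analogue of those scripts, the sketch below reduces each captured page to its visible text and keeps only English-language captures, using the warcio, beautifulsoup4, and langdetect packages; the file name is a placeholder. The named-entity extraction described above would typically layer an NLP library on top of the same text-extraction step.

```python
# A rough analogue of the toolkit's filtering scripts in Python: reduce
# each captured page to its visible text, then keep only English-language
# captures. Assumes the warcio, beautifulsoup4, and langdetect packages;
# "collection.warc.gz" is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup
from langdetect import detect

with open("collection.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()  # strip navigation, scripts, and other boilerplate
        text = soup.get_text(" ", strip=True)
        try:
            if text and detect(text) == "en":
                print(record.rec_headers.get_header("WARC-Target-URI"))
        except Exception:
            pass  # too little text to identify a language
```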

The web is of recent enough vintage that only a small subset of historians, those who study contemporary history, currently use it as a source, but more will need to be prepared to do so in the coming years. As one datathon participant put it, software programs such as the Archives Unleashed Toolkit provide means for “trying to understand your dataset before you dive into it.” As we get further and further away from the early days of the web, and with so much of our history recorded there, historians, now more than ever, need to know how to work with these materials.


Seth Denbo is director of scholarly communication and digital initiatives at the AHA. He tweets @seth_denbo.




This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Attribution must provide author name, article title, Perspectives on History, date of publication, and a link to this page. This license applies only to the article, not to text or images used here by permission.

The American Historical Association welcomes comments in the discussion area below, at AHA Communities, and in letters to the editor. Please read our commenting and letters policy before submitting.

