Publication Date

April 30, 2007

Thematic

Archives

The Google Books project promises to open up a vast amount of older literature, but a closer look at the material on the site raises real worries about how well it can fulfill that promise and what its real objectives might be.

Over the past three months I spent a fair amount of time on the site as part of a research project on the early history of the profession, and from a researcher’s point of view I have to say the results were deeply disconcerting. Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function for finding a reference that might not be in the index. But my experience suggests the project is falling far short of its central promise of exposing the literature of the world, and is instead piling mistake upon mistake with little evidence of basic quality control. The problems I encountered fit into three broad categories—the quality of the scans is decidedly mixed, the information about the books (the “metadata” in info-speak) is often erroneous, and the public domain is curiously restricted.

Poor Scan Quality
My reading of the materials was not scientific or comprehensive, by any means, but a significant number of the books I encountered included basic scanning errors. For instance, the site currently offers a version of the Report of the Committee of Ten from 1893 (the start of the great curriculum chase for the secondary schools). The scan is a catalog of errors: Google has double-scanned pages (page 3 appears twice, for instance), pulled in pages improperly so they are now unreadable (page 147 appears between pages 164 and 166), and cut off some pages (page 146, for example).

I’ve digitized a number of the AHA’s old publications and appreciate that scanners don’t always work as they should and pages can often get jammed. But even fairly rudimentary quality controls should catch those problems before they go live online. After years of implementing those kinds of quality checks here—precisely because friends in the library community took me to task about their necessity—I find it passing strange that so many libraries are joining in Google’s headlong rush to digitize without similar quality requirements.

Faulty Metadata
Beyond the fundamental quality of the scanning, a more significant problem is the incredibly poor descriptive information attached to many of the books on the site (the “metadata”). This is particularly evident in the serial publications, where having the proper name and date of a publication is particularly important. Take for example a volume of History Teacher’s Magazine that is labeled as a volume of Social Studies (the name the magazine took in 1934) and dated as published in 1953 (even though it seems to be from 1917).

These kinds of problems have two unfortunate effects. First, they make it more difficult to place a particular work in time and thus actually locate a particular item “discovered” by using Google Books. Second, in many instances you will be unable to inspect public domain items more closely, because the erroneous date places the information on the wrong side of the copyright line.

Truncated Public Domain
These problems are exacerbated by Google’s rather peculiar views on copyright. While taking an expansive view of copyright for recent works, it has taken a very narrow view about books that actually are in the public domain. As I have always understood it (and the U.S. Copyright Office confirms), “works by the U.S. government are not eligible for U.S. copyright protection.” But Google locks all government documents published after 1922* behind the same wall as any other copyrighted work. Among other things, that locks up works that should be in the public domain, such as the AHA’s Annual Report (published by the Government Printing Office from 1890 to 1993) and circulars from the U.S. Bureau of Education. This problem is exacerbated by the often errant data about when these materials were published, which places these works even further beyond reach.

For more than a year now, Siva Vaidhyanathan, a cultural historian and media scholar at New York University, has been objecting that the rush to digitize is moving far in advance of considered thought. His concerns seemed rather abstract when I first heard them last year, but working with Google Books over the past few months made his objections seem much more tangible and worrying.

What particularly troubles me is the likelihood that these problems will just be compounded over time. From my own modest experience here at the AHA, I know how hard it is to go back and correct mistakes online when the imperative is always to move forward, to add content and inevitably pile more mistakes on top of the ones already buried one or two layers down. With Google adding in more than 3,000 new books each day, the growth in the number of mistakes seems that much higher.

The problem of quality control only exacerbates my most basic worry about the larger rush to digitize every scrap of information: that we are adding to the pile much faster than the technology can advance to extract the information in a useful or meaningful way. When I have asked people who know a lot more about the technology than I do about this problem, they tend to wave their hands and mumble about “brilliant scientists” and “technological progress.” Forgive me if I remain unconvinced. Even as someone fairly proficient in Boolean search terms, I find a lot of the results from Google Books (and Google more generally) to be just page after page of useless and irrelevant information. I find it increasingly hard to believe that Google can add tens of thousands of additional books each month to the information pile (many containing basic mistakes in content and metadata) and that the search results will actually grow better over time.

So I have to ask, what’s the rush? In Google’s case the answer seems clear enough. Like any large corporation with a lot of excess cash, the company seems bent on scooping up as much market share as possible, driving competition off the board and increasing the number of people seeing (and clicking on) its highly lucrative ads. But I am not sure why the rest of us should share the company’s sense of haste. Surely the libraries providing the content, and anyone else who cares about a rich digital environment, need to worry about the potential costs of creating a “universal library” that is filled with mistakes and an impenetrable smog of information. Shouldn’t we ponder the costs to history if the real libraries take error-filled digital versions of particular books and bury the originals in a dark archive (or the dumpster)? And what is the cost to historical thinking if the only substantive information one can glean out of Google is precisely the kind of narrow facts and dates that make history classes such a bore? The future will be here soon enough. Shouldn’t we make sure we will be happy when we get there?

* Thanks to Ralph Luker for this correction.

This post first appeared on AHA Today.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Attribution must provide author name, article title, Perspectives on History, date of publication, and a link to this page. This license applies only to the article, not to text or images used here by permission.