Viewpoints

Why Digitize?

Abby Smith | Feb 1, 1999

The dream of the virtual library comes forward now not because it promises an exciting future, but because it promises a future that will be just like the past, only better and faster.
—James J. O'Donnell, Avatars of the Word

In the digital world, all knowledge is divided into two parts. The binary strings of 0s and 1s that make up the genetic code of data allow information to be fruitful and multiply, and allow us to create, manipulate, and share data in ways that appear to be revolutionary. It is often said that digital information is transforming the way we learn, the way we communicate, even the way we think. It is also changing not only the way that libraries and archives work but, more fundamentally, the very work that they do. It is easy to overstate—as well as underestimate—the transformative power of a new technology, especially when we do not yet understand the full implications of its various applications. Nonetheless, people have embraced this technology enthusiastically, often as an answer to questions that have not yet been posed. Librarians everywhere hear the voices of people speaking like evangelicals, urging a conversion of text and visual materials into digital form as if conversion per se were a self-evident good. But because we tend to imagine the future in terms of the present, as O'Donnell points out, such projections of the present onto the future may be misleading at best. If this new technology does, indeed, turn out to be revolutionary and we cannot anticipate its impact in full, at least we should be cautious about letting the radiance of the bright future blind us to the achievements of the past.

While we may not yet fully understand the ways in which this technology will and will not change libraries, we can already discern some simple, yet profoundly important, patterns in digital applications that presage their effective and creative use in the traditional library functions of collecting, preserving, and making information accessible. A critical mass of experience is accumulating among libraries and archives active in digitizing parts of their collections, ranging in size from the Library of Congress, the National Archives, and major research libraries in the Digital Library Federation, to smaller institutions such as the Huntington and Denver public libraries. We can already see instances in which the technology meets expectations for improving traditional library services, instances in which it cannot, and instances in which it may do so, but not in a cost-effective manner. This article will address the question of why a library should invest in the conversion of its traditional materials into digital form—in other words, what are the advantages and disadvantages of converting traditional analog materials into digital form?

What Is Digital Information?

Until very recently, all recorded information was analog—that is, a continuous stream of information of varying density and type. Analog information can range from the subtle tones and gradations of the chiaroscuro in a Berenice Abbott photograph of Manhattan in early morning light to the changes in volume, tone, and pitch recorded on a tape that might, when played back, turn out to be the basement tapes of Bob Dylan or the Welsh accents of Dylan Thomas reading Under Milk Wood. But when such information is fed into a computer, broken up into 0s and 1s and put together in a binary code, its character is changed in quite precise ways. Digitally encoded data do not represent the continuous nature of information as faithfully as analog forms of recording do. Digits are assigned fixed numeric values, so that great precision is gained in lieu of the infinitesimal gradations that carry meaning in analog forms. Those bits of data can be recombined for easy manipulation and compressed for storage. Voluminous encyclopedias that take up yards of shelf space in analog form can fit into a minuscule space on a computer drive, and that same digital encyclopedia can be searched in many ways other than alphabetically, making possible kinds of retrieval that would have been unimaginable with only the analog copy on paper or microfilm. When a photograph is digitized for viewing on a computer screen, the original image is divided into dots with assigned values that are mapped against a grid. The pattern of the dots is remembered and reassembled by the computer upon command. Data that are not being used are not like books on a shelf or the family correspondence and photos stored in shoe boxes at the back of a closet. They are more like the stacks of LPs or the 16mm family home movies in storage in a basement. That is, digital information is not eye-legible: it is dependent on a machine to decode and re-present the bit streams as images on a computer screen. Without that machine, and without active human intervention, those data will not last.
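
To make the grid-of-dots description concrete, the short Python sketch below illustrates quantization in general terms. It is an editorial illustration rather than a description of any particular scanner; the 8-bit depth and sample values are assumptions chosen for the example. It shows how two analog readings that differ by an infinitesimal gradation collapse into the same fixed digital value.

    # Illustrative sketch only: map a continuous ("analog") brightness reading
    # in the range 0.0-1.0 onto one of 256 fixed levels, as an 8-bit grayscale
    # scanner would. The bit depth and sample values are assumptions.
    def quantize(brightness, levels=256):
        """Return the fixed integer level nearest to a continuous brightness."""
        level = round(brightness * (levels - 1))
        return max(0, min(levels - 1, level))

    # Two subtly different analog readings become the same digital value:
    print(quantize(0.50000), quantize(0.50001))   # prints: 128 128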

One of the most important ramifications of information in digital form is that by its very nature it is not fixed in the way that texts printed on paper are. Digital texts are not final or finite, not fixed in essence or form unless printed out in hard copy, for they can be changed easily and without trace of erasures or emendations. Flexibility is one of the chief assets of digital information. It is precisely why we like text poured into a word processing program. It is easy to edit, to reformat, and to commit to print in a variety of iterations without the same effort as typing onto carbon paper. That is why visual designers like computer-assisted design programs. It is easy to summon up quickly any number of variations of tone, shape, value, and location to see, rather than to imagine, what various visual options look like. Furthermore, we can create an endless number of identical copies from a digital file, because the file does not decay by virtue of copying. From the creator's point of view this may be ideal, but from the perspective of a library or archives that endeavors to collect a text that is final and in one sense or another definitive, this can complicate things considerably. Because the digital text is flexible and easily changed, the matter of preserving digital information becomes conceptually problematic. Which version of the file, or how many versions, should be archived? There are also considerable technical obstacles to ensuring the persistence of digital information.

Digitization Is Not Preservation—At Least Not Yet

All recorded information, from the paintings on the walls of caves and drawings in the sand, to clay tablets and videotaped speeches, has value, even if temporary, or it would not have been recorded in the first place. That which the creator or transcriber deems to be of enduring value is written on a more or less durable medium and entrusted to the care of responsible custodians. Other bits of recorded information, like laundry lists and tax returns, are created to serve a temporary purpose and are allowed to vanish. Libraries and archives were created to collect and serve that which has some long-term value. And libraries and archives serve not only to save that information, but also to provide evidence of one type or another of the work's provenance, which goes toward establishing the authenticity of that work.

Though digitization is sometimes loosely referred to as "preservation," it is clear that, so far, digital resources are at their best at facilitating access to information and weakest in the traditional library responsibility of preservation. Ironically, because digitization is a type of reformatting, like microfilming, it is often confused with microfilming for preservation and seen as a superior, if as yet more expensive, form of preservation reformatting. Digital imaging is not preservation, however. Much is gained by digitizing, but permanence and authenticity, at this juncture of technological development, are not among those gains.

The reasons for that are complex. Microfilm, the preservation reformatting medium of choice, is projected to last several centuries if made on silver halide film and kept in a stable environment. It requires only a lens and a light to read, unlike computer files, which require hardware and software to play back, both of which are developed in often proprietary forms that quickly become obsolete and can render the information they encode inaccessible. At present, retrieving information captured in an obsolete file format and stored on an obsolete medium (like eight-inch floppies) is extremely expensive and labor intensive, when it is possible at all.

Often the medium on which digital information is recorded is itself inherently unstable; magnetic tape is one example of a common digital medium that requires special care and handling and has been known to degrade within decades beyond the point at which the information can be recovered. In its inherent physical fragility, it is not different in essence from the acid paper so widely produced in the last century, but the life span of tape is dramatically shorter than that of even the poorest-quality paper. Of course, many 20th-century forms of analog recording, which are also magnetic (such as video and audio tape), are equally fragile and unreliable for long-term storage.

More important perhaps than the durability of the medium itself is the necessity of keeping data fresh and encoded in readable file formats. Current investigations into two possible ways of ensuring data persistence—the migration of data from one software and hardware configuration to a more current one, and the creation of software that emulates obsolete encoding formats—may develop solutions to this problem. As yet, we have no tested and reliable technique for ensuring continued access to digital data of enduring value, although information stored in non-proprietary formats such as ASCII has been migrated successfully on numerous occasions. Nevertheless, migration is often accompanied by some loss of information, if not necessarily of intellectual content, as files move from one software environment to another, thus changing the file in certain fundamental ways.

Another reason that preservation goals are in some fundamental way challenged by digital imaging is that it is quite difficult to ascertain the authenticity and integrity of an image, database, or text when it is in digital form. How can one tell if a digital file has been tampered with and the content changed or falsified? Looked at from the traditional perspective of published or manuscript materials, it is futile even to try: there is no "original" with which to compare a suspect file. Copies can be deceptively faithful: one cannot tell the difference between the "original" output of a scan of the Declaration of Independence and one that is output four months later. In contravention of a core principle of archival authenticity, one can change the bit stream of a file and leave no record of its having been altered. That may not be important for a digital image of a well-known document like the Declaration of Independence, for which access to either the (analog) original or a good photographic image is easy enough to obtain for comparison's sake. But anyone who has seen the digitally engineered commercial in which Fred Astaire can be seen dancing with a vacuum cleaner can readily understand the ease with which improbable digital occurrences can become "real" because we "see" them. After all, the evidence is before our eyes, and our eyes cannot detect a falsehood; it is our cognitive reasoning, not our eyes, that detects it. That image of the suave dancer gliding across the floor with the functional appliance amuses because it confounds our expectations. But what if we arrive at a library web site, for example, looking for an image that we have never seen and about which we have few expectations? The only reason we expect that image to be a truthful representation of the original is that we can rely on the integrity of the institution that has mounted the files and makes them available to us. We transfer the confidence we experience in the reading room of that library to our workstation, wherever it may be. We go to the New York Public Library site with the full expectation that the library "guarantees" the integrity of the images it mounts. But it would be very hard indeed for a researcher in Alaska looking at Digital Schomburg to verify independently that any given image is indeed a faithful representation of the original. There is much research and development at present dedicated to solving the dilemma posed by the stunning fidelity of digital cloning, including methods for marking images and time-stamping them, but there is as yet no solution.
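
One practice that has since become routine in digital repositories, though it addresses only part of the problem described above, is the fixity check: recording a cryptographic digest of a file when it is accessioned and recomputing it later to detect any change to the bit stream. The sketch below is an editorial illustration in Python, not a method named in this article; the file name is hypothetical, and a matching digest proves only that the bits are unchanged since the digest was recorded, not that the scan was faithful or who made any alteration.

    # A minimal sketch of a fixity check, assuming only the Python standard
    # library. The file name is a hypothetical example.
    import hashlib

    def file_digest(path):
        """Return the SHA-256 digest of a file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    digest_at_accession = file_digest("declaration_scan.tif")
    # ... years later, recompute and compare ...
    if file_digest("declaration_scan.tif") != digest_at_accession:
        print("The bit stream has been altered since accession.")
    # Caveat: this detects alteration only if the recorded digest is itself
    # kept securely; it says nothing about the fidelity of the original scan.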

This problem is far from unique to the digital realm. Forgers and impostors have a distinguished history of operating successfully and often long undetected in print and photographic media, although they have had to work harder as well as smarter than their digital counterparts.

The traditional methods that have served the library and archival professions so far for authenticating documents have relied largely on practices derived from markers carried on the physical medium itself. After a textual examination to look for obvious differences in content, researchers have often then examined the physical carrier itself—the book or manuscript leaf—to see if there are any signs of modification or falsification.

From simple examination of watermarks to a variety of sophisticated chemical, optical, and physical tests that can verify the age of paper, the composition of inks, and the physical traces of erasures and palimpsests, researchers have access to a number of strategies to verify the authenticity of a document. Granted, there are few who routinely insist on that level of authentication in doing research, but that is because the pitfalls of using books, manuscripts, and visual materials are familiar to us and we tend to discount them without much conscious thought.

We should be wary of reposing the same quality of trust in digital resources that we do in print and photographic media until we are equally familiar with their evidentiary weaknesses.

As in other forms of reformatting, digital scanning has implications for the original item and its physical integrity. Of course, depending on the policy of a library or archive, the original of a scanned item will probably be retained after reformatting. To the extent that a reader can make do without handling the original, the digital preservation surrogate can serve to protect it from wear and tear. If there is some concern that scanning itself could damage materials, one would choose to scan from a film version of the original.

One can combine the advantages of scanning for access purposes with those of preservation microfilming by using the model of hybrid conversion, that is, creating preservation-standard microfilm and scanning it for access purposes, or, conversely, beginning with a high-quality scan and creating computer-output microfilm (COM) for preservation purposes. Work is presently underway to articulate and refine best practices for implementing the hybrid approach to reformatting, so that it can be adopted by libraries across the country. Of course COM, unlike microfilm created from the original, is only an analog form of digital images. Though it has been fixed on a durable medium, some would argue that the image itself, having been generated digitally, has lost some essential information—or at least it has lost its fundamental analog character—and cannot therefore claim to be as desirable a preservation medium as film made from the original source.

Although this may seem a minor point to those who are less interested in that level of authenticity than in easy access, it is still important to understand that digital technology literally transforms analog information and changes its nature radically simply by disaggregating it into 0s and 1s. There has to be some measure of loss of information when an analog item is made digital, just as there is when one analog copy is made from another. On the other hand, there is virtually no loss of information from one generation of digital copy to another. Images will not degrade when copied, in contrast to microfilm, which will lose about 10 percent of its information with each copy (Don Willis, A Hybrid Systems Approach to Preservation of Printed Materials, 1992, p. 6). As noted above, once there is more than one copy of a digital file, it is impossible to pick out the "original," and one will never speak of "vintage files" the way that one now speaks of vintage photographs. Moreover, digital files are far less likely to decay in storage if they are refreshed, and they do not wear out in use, unlike paper, film, and magnetic tape.
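
The generational contrast can be put in rough numbers. Taking the 10 percent figure cited above at face value, and assuming purely for illustration that the loss compounds multiplicatively, the short calculation below shows how quickly successive film generations erode, while a digital copy remains bit-for-bit identical at every generation.

    # Rough illustrative arithmetic only: assumes the cited ~10% loss per
    # microfilm generation compounds multiplicatively, a simplification.
    loss_per_generation = 0.10

    for generation in range(1, 6):
        remaining = (1 - loss_per_generation) ** generation
        print(f"film generation {generation}: about {remaining:.0%} of the information remains")
    # A digital copy, by contrast, is identical at every generation: 100% remains.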

Digitization Is Access—Lots of It

Digital files can provide extraordinary access to information. They can make the remote accessible and the hard to see visible. Digital surrogates can bring together research materials that are widely scattered about the globe, allowing viewers to conflate collections and compare items that coexist solely by virtue of digital representation. The easy access to reference surrogates—images that provide a great deal of information contained in the original even at fairly low resolution—is a boon to researchers when developing efficient and effective research strategies. Through the use of thumbnail images, which do not require a high resolution, one can at a minimum acquaint oneself with the source enough to know whether or not one needs to consult the original. Very often one can make do with the digital surrogate because it provides all the information required. The image of the 1612 map of Virginia by John Smith may provide a scholar enough information to determine how far inland Smith actually traveled. The black crosses he laid down on paper to mark the furthest points he reached on various treks are clearly legible even on a low-resolution image.

One must think about the nature of the source materials—color, black and white, or shades of gray—and about the use of the images—who will be consulting them and for what—when making decisions about the parameters for image capture. The quality and utility of an image depend upon the technology of capture and display, and even the usefulness of an image for reference can be severely compromised by the low-resolution monitor on which it is displayed. While work is ongoing to address the quality control and variability of computer monitors, the lack of control over display mechanisms as yet constitutes one of the weakest links in the digital chain of transmission.

Image processing—the manipulation of images after initial digital capture—can greatly expand the capacity of the researcher to compare and contrast details that the human eye cannot see unaided. Images can be enhanced in size, sharpness of detail, and color contrast. Through image processing, a badly faded document can be read more easily by heightening contrast. A dirty image can be cleaned up. Faint pencil marks can be made legible.
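
As an illustration of the simplest kind of enhancement mentioned here, the sketch below performs a linear contrast stretch on a grayscale scan. It assumes the Pillow and NumPy libraries and uses hypothetical file names; real conservation imaging involves far more sophisticated processing, so this is only a sketch of the principle of heightening contrast in a faded document.

    # A minimal contrast-stretch sketch, assuming Pillow and NumPy are installed.
    # File names are hypothetical examples.
    import numpy as np
    from PIL import Image

    pixels = np.asarray(Image.open("faded_document.tif").convert("L"), dtype=float)

    # Spread the narrow range of grays in a faded original across the full
    # 0-255 range, so the darkest pixel becomes black and the lightest white.
    lo, hi = pixels.min(), pixels.max()
    stretched = (pixels - lo) / max(hi - lo, 1e-6) * 255.0

    Image.fromarray(stretched.astype(np.uint8)).save("faded_document_enhanced.tif")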

The map of the District of Columbia, prepared by Pierre-Charles L'Enfant for George Washington in 1791, is so badly faded, discolored, and brittle that it resembles a potato chip. It cannot be used by researchers and yields very little information to the unaided eye. Digitized several years ago, the map now allows us to make out all the subtle contours of the architect's plan and read the numerous annotations that Thomas Jefferson made on the map after looking it over at the president's request.

Like successful archaeologists, we have, with our digital picks and brushes, excavated important historical evidence that has changed the way we understand the planning of the nation's capital.

Digital technology can also make available powerful teaching materials for students who would not otherwise have access to certain resources. Among the most valuable types of materials to digitize from a classroom perspective are those from the special collections of research institutions, including rare books, manuscripts, musical scores and performances, photographs and graphic materials, and moving images. Often these items are extremely rare, fragile, or, in fact, unique, and gaining access to them is very difficult. Digitizing these types of primary source materials offers teachers at all levels unheard-of opportunities to expose their students to the raw materials of history. The richness of special collections as research tools lies in part in the representation of an event or phenomenon in so many different formats. The chance to study the presidential election of 1860 by looking at digital images of daguerreotypes of the candidates, political campaign posters (a recent innovation of the time), cartoons from contemporary newspapers, abolitionist broadsides and notices of slave auctions, the manuscript of Lincoln's inaugural address in draft form reflecting several different stages of composition—all this would be possible with a well-developed plan of digital conversion of materials from different repositories normally beyond the reach of students. While we know, for example, that the daily hit rate at the Library of Congress American Memory site is greater than the number of readers who visit the reading rooms each day, we have very little data on how much the images currently online are used, and for what purposes. Some large libraries are attempting to compile and analyze use statistics, but this is understandably quite a challenge. We need more user studies before we can assert confidently what may seem self-evident to us now—that adding digitized special collections to the mass of information available on the Internet is in the public interest and enhances education. We also need to ensure that libraries are working collaboratively to digitize materials that, taken together, create a critical mass of research sources that are complementary and not duplicative, and that begin to fulfill the promise of coordinated digital collection building. At present there is no central source of information about what has been digitized, and with what care in the process, as there is for titles that have been microfilmed for preservation.

Some of the drawbacks of digital technology for access, as in preservation, stem from the technology's uncanny ability to represent the original in a seemingly authentic way. Working with digital surrogates can distort the research experience somewhat by taking research materials out of the context of the reading room. The nature of computer display makes only serial viewing possible, very different indeed from the kind of viewing researchers are used to in a reading room, where one can spread photographs, for example, around a flat surface and look at them simultaneously and in different groupings. Everything, all information, is mediated by the screen, and that automatically flattens and decontextualizes an image. And a digital image, no matter how high the resolution and how sensitive the display monitor, is always presented through the relatively low information density of the computer screen, compromising the high information density of analog materials that is critical for assessing any visual evidence.

Many of the items that are up now on the web sites of such institutions as the National Archives, the Library of Congress, and the New York Public Library come from special collections that are large and often cataloged only at the collection level. In order to digitize them, curators familiar with the materials have sifted through collections and made selections from them. The collections that are on the Web are, in a real sense, publications, accompanied as they are by a great deal of descriptive information that was created in order to make the items understandable in the context of the Internet. Most research collections, by contrast, are unedited, with few descriptions that aid a scholar. The digital "raw materials" of the past are thus not as raw as they might appear to be. Often an extraordinary effort has gone into research and description before an item is ever scanned. A collection of daguerreotypes that may have been in reasonably good physical condition but not very well cataloged may undergo extensive conservation review and treatment before it can be scanned, and labor-intensive searches into the identities of faces that have been anonymous for decades will precede the cataloging, description, and markup. While these may be viewed as extraneous, or at least discretionary, editorial expenses, in fact they are more commonly incurred than not. As a rule, people who visit a library web site arrive with different expectations than those who visit a reading room. They expect higher levels of functionality of digital objects than they do of library materials. (They also do not have a reference librarian available to help them in their searches when they have difficulty following a lead.) Scanning is a very expensive process, and most of the cost is incurred before the item is ever laid on the scanner. The amount of physical preparation and intellectual control work that is needed for every digital project is very large indeed.

Nevertheless, many institutions are taking on ambitious digital conversion projects in order to find out for themselves what the technology can do for them. They are investing large amounts of money in projects to make their collections more accessible and, too often, believing that they are also accomplishing preservation goals at the same time. The impact of digitizing projects on an institution, its way of operating, its traditional audience, and its core functions is often hard to anticipate. The dilemma of how best to select which parts of a large collection to scan is, for some, a novel challenge that calls into question basic principles of collection development and access policies. Many libraries and archives have collections that are intrinsically valuable by virtue of being comprehensive and containing much information that is essentially unpublished. But they may also contain sensitive materials, those that deal with historical events or popular attitudes that may be offensive to us now and that must be understood in a larger context, which is precisely what a comprehensive collection provides: the context in which to understand things. How does one deal with sensitive materials in a networked environment? If one makes the difficult decision to edit out materials that are readily served in a reading room but are too powerful to broadcast on the Internet, what does that do to the integrity of a research collection? Making information available on the Internet removes the very barriers to use that we take for granted in physical collections. No one has to travel to a library, nor does anyone have to present proof of a serious research interest, in order to gain access to complex, disturbing, and uninterpreted materials. There are ways to build in electronic barriers to access for all or portions of a site, using much the same technology that commercial entities use in granting fee-based access. But even though the technology exists, this adds a layer of administrative complexity to managing the site that libraries and archives may not be prepared to take on. When digitization is viewed specifically as a form of publishing, and not simply as another way to make resources available to researchers, the thornier issues of selection for conversion are put into an editorial context that provides a strong intellectual and ethical basis for imaginative selection of complex materials.

Furthermore, many of the collections that may be of the highest research and teaching value will not be digitized for Web access because of the strictures of copyright that might apply. There is a disproportionate amount of public domain material on library web sites these days, and that fact alone distorts the nature of the source base for any given research project. Young students in particular have taken so quickly to looking for information on the Internet that they tend to restrict a search to what is available on the Web. The notion that, if it is not on the Web (or in an online catalog), then it must not exist, has the effect of orphaning the vast majority of information resources, especially those that are not in the public domain. This is not what the framers had in mind when they wrote the copyright clause into the Constitution, "to promote the Progress of Science and useful Arts." This skewed representation of created works on the Web will continue for quite some time, and the complications that surround moving image and recorded sound rights mean, ironically, that these will be the least accessible resources on the most dynamic information source around. And until OCR (Optical Character Recognition), the post-processing technology that makes scanned text searchable, works as well for non-Latin scripts as it does for Latin ones, resources from around the world in vernacular languages will not take their proper place in the scanning queue.

What Is Gained and What Is Lost?

In contemplating a digital conversion project, an institution must ask itself what can be gained from digitization and whether the value added is worth the price. Many libraries have begun the difficult task of developing criteria for selecting materials for digitization and publishing those criteria on the Internet. Columbia University, for example, has posted guidelines for selection for digital conversion, among which is the category of "added value." They define added value thus:

  • Digital capture that will enhance intellectual control through creation of new finding aids, links to bibliographic records, and development of indices and other tools.
  • The ability to search widely, manipulate images and text, and study disparate images in new contexts will increase and enrich use.
  • Widespread dissemination of local or unique collections will encourage new scholarly use by providing enhanced resources.
  • Digital capture that will enhance use through improved quality of image—for example, through improved legibility of faded or stained documents.
  • Digitization that will allow the flexible integration and synthesis of a variety of formats, or of related materials scattered among many locations, thereby creating a "virtual collection" (http://www.columbia.edu/libraries/digital/criteria.html).

At present, however, the cost of digitization and of creating and maintaining a migration path for preserving the files is very high, and creating greater accessibility to an underused collection should be weighed along with other factors, such as compatibility with other digital resources and the collection's intrinsic intellectual value. As the Society of American Archivists has said, "The mere potential for increased access to a digitized collection does not add value to an underutilized collection. It is a rare collection of digital files indeed that can justify the cost of a comprehensive migration strategy without factoring in the larger intellectual context of related digital files stored everywhere and their combined uses for research and scholarship." (http://www.archivists.org/governance/resolutions/digitize.html)

As Donald Waters of the Digital Library Federation has expressed it, the "promise of digital technology is for libraries to extend the reach of research and education, improve the quality of learning, and reshape scholarly communication." This is not an extravagant claim for the technology, but rather a declaration of an ambition shared by many who are developing and managing the technology. And the key to fulfilling that promise lies within the communities of higher education, science, and public policy responsible for making digital technology serve those ends. Digital conversion of library holdings has its stake in this ambition, particularly to the extent that it can broaden access to valuable but scarce resources. But the cost of conversion and of the institutional commitment to keeping those converted materials refreshed and accessible for the long term is high—precisely how high, we do not know—and libraries must also ensure the longevity of information that is created in digital form and exists in no other form. We need more information about what imaging projects cost, about who uses the converted materials, and about how to judge whether the investment is worth it. In the meantime, libraries must continue to be responsible custodians of their analog holdings, the print, image, and sound recording collections that are their core assets and the legacy of many generations. That means continuing to use tried-and-true preservation techniques such as microfilming to ensure the longevity of imperiled information. Analog is a different way of knowing than digital, and each has its intrinsic virtues and limitations. Digital will not and cannot replace analog. To convert everything to digital form, even if we could, would be wrong-headed.

The real challenge is how to make those analog materials more accessible using the powerful tool of digital technology, not only through conversion, but also through digital finding aids and linked databases of search tools.

Digital technology can, indeed, prove to be a valuable instrument to enhance learning and extend the reach of information resources to those who seek them, wherever they are, but only if we develop it as an addition to an already well-stocked tool kit, rather than a replacement for all of those tools which generations before us have ingeniously crafted and passed on to us in trust.


Abby Smith is the director of programs at the Council on Library and Information Resources. She has a PhD in Russian history and was a Fulbright Fellow. Until recently, she worked at the Library of Congress, where she managed the preservation microfilming program for Russia and Lithuania and was a curator for several exhibitions at the library, including the much-visited Treasures of the Library of Congress.




