Digital History Glossary

This glossary is intended to provide departments and committees evaluating scholars with brief definitions of some of the terms, concepts, and tools they may encounter in project descriptions, research statements, and evaluations. Based upon a glossary developed by Sheila Brennan for the Doing Digital History summer institute at the Roy Rosenzweig Center for History and New Media in 2016, it has been expanded by members of the AHA’s digital history working group. While it covers many important digital humanities and related technological terms, it cannot include every relevant concept. Users of the glossary should also see it as defining aspects of the field of digital history with which they should be familiar when working with digital historians, planning to create positions that will include digital work, or otherwise exploring the role digital tools and methodologies play in our discipline.


3D models: A computational method for spatial analysis. Three-dimensional visualizations are created by specialized software using geometric data. Objects can be created, placed in scenes, and produced in physical form by 3D printers. Within 3D modeling there is variation between models that are static, such as re-creations of architectural spaces, and models that are meant to be experienced, such as those used in simulations that can be explored. SketchUp, which has a free version, is the most widely used 3D software in digital humanities (https://www.sketchup.com/). 3D models are also being created in game development platforms, such as Unity Game Engine.

Algorithm: An unambiguous set of instructions telling a computer what to do and how to do it.

API (Application Programming Interface): Software that allows two web applications to communicate. Commonly used to access data in an online database. Museums, libraries and archives can provide an API as a way to share their databases, providing users with the ability to gather large quantities of data more efficiently than with the search tools and interfaces provided for the database.
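As a rough illustration of how a program might use such an API, the sketch below builds a request URL with search parameters and reads a JSON reply. Everything specific here is an assumption made for the example: the endpoint https://api.example.org/items, the parameter names q and per_page, and the shape of the response are hypothetical, and the reply is a canned string so that the code runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; a real API documents its own address and parameters.
BASE_URL = "https://api.example.org/items"


def build_request_url(query, per_page=50):
    """Combine the endpoint with URL-encoded search parameters."""
    return BASE_URL + "?" + urlencode({"q": query, "per_page": per_page})


def extract_titles(response_text):
    """Pull the title of each record out of a JSON response body."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload["items"]]


# Canned reply standing in for what a server might send back (invented data).
SAMPLE_RESPONSE = (
    '{"items": [{"id": 1, "title": "Sample Letter"},'
    ' {"id": 2, "title": "Sample Map"}]}'
)

request_url = build_request_url("labor strike")
titles = extract_titles(SAMPLE_RESPONSE)
print(request_url)
print(titles)
```

In practice the request URL would be sent to the institution's server and the reply read from the network; each API publishes its own parameters and response format.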

Augmented reality, AR: A form of visualization that overlays information on a user’s view of a real world environment and the objects in it, to create a composite picture that alters perception of the environment (as distinct from VR, which replaces the real world environment with a simulation). AR is commonly displayed on smartphones and tablets, using the device’s camera and GPS to determine what information to display and where to display it. Uses of AR in digital humanities include augmenting locations with historical photographs of the place, augmenting classical statues with the colors in which they were originally painted, and augmenting displays with additional descriptions.

Backend (aka control panel or dashboard): The administrative side of software, not visible or accessible to visitors to the site (not public-facing), where you can make technical and content changes.

Blog: A website that contains discrete, short, often informal entries (posts) that appear in reverse chronological order and can combine text, multimedia and links. Originally a form of online diary that allowed readers to leave public comments to which authors could respond, blogs have evolved into venues for commentary on a variety of topics by public figures, institutions and journalists as well as individuals, including scholars. Most blogs are published using free content management systems designed for that purpose such as WordPress and Blogger, and are freely available.

Born digital: Material that originates in digital form; in contrast to material that is digitized, which originated in another form. Common forms of born digital content are digital photographs, web pages, and electronic records like email and spreadsheets.

CMS (Content Management System): A computer program that allows content to be published, edited, and modified from a central interface. A CMS typically provides an interface that removes the need for the user to write code, although that option is also often available. CMSs are often used to run websites containing blogs and digital collections.

CSS (Cascading Style Sheets): A style sheet language used to modify the design and appearance of a webpage. Used in conjunction with HTML to create online content.

CSV (Comma Separated Values): A file with a set of information, where each value is separated by a comma or other specific character (; | / ). Can be created with spreadsheet software like Excel, Google Sheets, Numbers; when a spreadsheet is saved as a CSV file, the values in the rows and columns are separated by a comma or other specified character. Most databases and many CMS platforms (e.g. Omeka) and digital tools can import CSV files, making them a commonly used means of transferring information.
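To make the format concrete, here is a minimal sketch that reads a small CSV file with Python's built-in csv module. The column names and the two rows are invented for illustration. Note that a value containing a comma is wrapped in quotation marks so that it is not split in two.

```python
import csv
import io

# Invented sample data: a header row followed by one row per item.
SAMPLE_CSV = """title,creator,date
Letter to the Editor,Author A,1912-03-04
"Map of the City, Revised",Unknown,1925
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row["title"], "|", row["date"])
```

A file saved from a spreadsheet could be read the same way by opening it with open(path, newline="") in place of io.StringIO.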

Computational methods/tools: Programming and software that analyzes data; the most commonly used methods in digital humanities are text analysis, spatial analysis, and network analysis. Using computational methods requires transforming historical sources into data by extracting information and features, and creating structured data by normalizing them to fit the chosen categories in service of particular research goals. The results of computational analysis are generally presented in visualizations, such as maps, graphs and charts.

Corpus Linguistics: Builds on text analysis to elucidate meaning by examining syntactic and semantic structures larger than single words. A corpus of texts is annotated with tags for parts of speech, and for the different modifying functions and relations that a word can have in different contexts. The corpus is analyzed by combining search and collocation to use context to establish the meaning of a word.

Data cleaning: The process of detecting and correcting (or removing) incomplete records or data with inconsistent spelling or formatting from a database.
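A minimal sketch of what such cleaning can look like in code follows. It assumes records are held as dictionaries with hypothetical "name" and "place" fields; it trims stray whitespace, standardizes the capitalization of place names, and drops records that are missing a required value. Real projects usually need rules specific to their sources.

```python
def clean_records(records, required=("name", "place")):
    """Normalize spacing and capitalization; drop incomplete records."""
    cleaned = []
    for record in records:
        # Trim leading/trailing spaces and collapse runs of whitespace.
        normalized = {key: " ".join(str(value).split())
                      for key, value in record.items()}
        # Remove the record if any required field is missing or empty.
        if any(not normalized.get(field) for field in required):
            continue
        # Make inconsistent capitalization uniform ("CHICAGO" -> "Chicago").
        normalized["place"] = normalized["place"].title()
        cleaned.append(normalized)
    return cleaned


# Invented sample records with typical inconsistencies.
raw = [
    {"name": " Mary  Smith", "place": "CHICAGO"},
    {"name": "Mary Smith", "place": "chicago "},
    {"name": "", "place": "Memphis"},
]
cleaned = clean_records(raw)
print(cleaned)
```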

Database: A form of structured data in which related information is organized into fields (a single item of data), records (a complete set of fields; a row in a spreadsheet) and files (a collection of records). Also software that enables you to enter, organize, store, and retrieve information in a database.

Digitization: The conversion of analog content into a digital format. The creation of digital images by photography or scanning is the most common form of digitization, used in the case of documents, photographs, artworks, or objects. Sound and moving images can also be digitized, by re-recording video and audio onto digital media. See also JPEG; TIFF.

Digital Archive: A collection of digitized sources organized, described with metadata, and made accessible through an online interface. In the context of digital humanities, the term generally refers to a collection brought together online from a variety of different physical collections and locations. Archivists would generally not consider such a collection to be an archive; in that field, the term archive is only used to refer to material created by the originating organization or person, or by a third party brought together in a repository.

Distant Reading: From Franco Moretti, a term for using text analysis to look for patterns over large corpora of texts.

DOI (Digital Object Identifier): A managed, persistent link to an online publication. To obtain a DOI you must register with a DOI Registration Agency, which collects metadata about publications and assigns them DOI names. If the URL of the publication changes, the publisher must update the DOI metadata for the DOI to continue to link to the publication.

Domain name, domain: A unique identifier for a resource on the internet such as a web server, web site or web app; used in place of the numerical addresses employed by internet protocols, as part of a URL. Domain names are used to establish a unique identity for a project. Anyone can lease a domain name by registering it with a domain name registrar, which charges an annual fee. A domain name can include one of a number of top level domains, with .org, .net, and .com being the most common (.edu domains are restricted to educational institutions and .gov domains to government institutions). A second level domain, which precedes the top level domain, is a string of letters, numbers and hyphens of up to 63 characters (the full domain name can be up to 253 characters). Individuals, organizations and projects often use their name as a second level domain.

Dublin Core: An internationally recognized metadata standard for describing any conceivable resource, comprising 15 elements, including “title,” “description,” “date,” and “format.” Dublin Core is used in Omeka, an open source content management system for publishing resources online that is widely used in digital humanities.

FTP (File Transfer Protocol) Client: A program that lets a user transfer files from their computer to a web server so that the files can be made available or viewed online.

GIS (Geographic Information Systems): Software that combines a database and a mapping application to relate information to a location. ArcGIS is the best known example of this software; it is a commercial product with a steep learning curve designed primarily for social scientists working with quantitative data. An open source alternative is QGIS. See also Web Mapping.

Georeferencing: Transforming place names and addresses into coordinates for mapping.

GIF: A lossless image file format now most commonly used for images containing animations. See also TIFF.

GitHub: A web-based platform for hosting and sharing code and any other kinds of files, built on the open source version control system Git.

GLAM: Acronym for Galleries, Libraries, Archives, Museums.

Hosting; see Web Hosting

HTML (HyperText Markup Language): A markup language that uses tags to describe the structure of a web page and to specify the format of text (font, bold, italics), the header of a page, etc. HTML is now commonly used in conjunction with CSS, a style sheet language that modifies the design and appearance of HTML elements and offers an easier way of creating the style of a site.

JPEG: An image format that uses lossy compression - which compresses images by discarding some data when they are edited and saved. The most commonly used format in digital cameras and for storing and transmitting image files online. See also TIFF.

KML (Keyhole Markup Language); KMZ file: An XML-based markup language that uses tags to describe geographic information about a place that can be displayed on maps. Originally developed for Google Earth. KMZ files are compressed KML files.

LAMP (Linux, Apache, MySQL, PHP/Python): An open source software bundle that is used to create web sites and web applications: Linux is the operating system, Apache is the webserver, MySQL is the database, PHP/Python is the scripting language.

LMS (Learning Management System): A content management system designed for teaching and learning, offering the ability to organize content by classes and courses, design quizzes, and manage grades and monitor the activity of students. The best known example is Blackboard.

Lossless compression; see TIFF; GIF

Lossy compression; see JPEG

Machine Learning: Algorithms that automate analysis by taking a sample of training data and progressively building a statistical model to categorize or classify data. Commonly used when the features and patterns of the data are too fuzzy to make it feasible to use strict instructions to sort the data.

Markup language: A computer language that uses tags to define elements within a document. The language contains standard words rather than code so is human readable. The two most popular markup languages are HTML and XML. Historians and literary scholars often use an adaptation of XML called TEI to identify and mark up particular non-technical elements of a document (e.g. people or places). See also KML.

Metadata: Data about data, or information that describes an item. Metadata is what you read in library catalog records or museum collections management systems. Standardized metadata uses agreed-on spelling, language, date formats, etc. in order to allow metadata to be compared. Metadata standards or schemas are sets of structured and standardized metadata, developed to describe resources for a particular purpose or community. Dublin Core is a widely used metadata standard for describing digital and physical resources.

Named Entity Recognition (NER): A form of Natural Language Processing that uses algorithms that identify words referring to people, places, and organizations.

Natural Language Processing (NLP): Algorithms that identify features of language such as the part of speech for each word, the basic form of a word (lemmatization), nouns that refer to real world entities like people, places, events, and organizations (NER), and the relationship of the words in a sentence (dependency parsing).

Network Analysis: A computational method that uses network graphs to visualize and measure non-spatial relationships between people, groups or information. These graphs render the components of the network as nodes and the relationships between them as edges or links and allow multiple types of both nodes and edges. The resulting networks can describe which entities are most central to those relationships, or the density or degree of centralization of the whole network. Gephi has been the open source network visualization software most commonly used in the digital humanities (https://gephi.org/).
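The measures mentioned above can be sketched in a few lines of code. The example below is an illustration only, using an invented set of four correspondents: it stores the network as a list of edges, then computes degree centrality (the share of other nodes each node is directly connected to, one of several centrality measures) and the density of the whole network (the share of possible edges that are present). Tools such as Gephi calculate these and many other measures.

```python
from collections import defaultdict

# Invented example: each pair is an edge linking two nodes (people).
edges = [("Ana", "Ben"), ("Ana", "Cora"), ("Ana", "Dev"), ("Ben", "Cora")]


def build_graph(edge_list):
    """Map each node to the set of nodes it shares an edge with."""
    neighbors = defaultdict(set)
    for a, b in edge_list:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def degree_centrality(neighbors):
    """Fraction of the other nodes to which each node is directly linked."""
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def density(neighbors):
    """Edges present divided by edges possible in an undirected network."""
    n = len(neighbors)
    if n < 2:
        return 0.0
    edge_count = sum(len(adj) for adj in neighbors.values()) / 2
    return edge_count / (n * (n - 1) / 2)


graph = build_graph(edges)
centrality = degree_centrality(graph)
network_density = density(graph)
print(max(centrality, key=centrality.get), round(network_density, 2))
```

Here "Ana" is linked to every other node and so has the highest degree centrality, while four of the six possible edges are present.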

OCR (Optical Character Recognition): Software that converts digital images (photographs, scans) of text to machine readable text that can be analyzed with computational methods. Generally only effective for text in modern typefaces (although machine learning algorithms are being developed to convert older typefaces and handwriting).

Omeka: An open-source content management system which uses an item (object/image/document) as the primary piece (as opposed to WordPress, which uses the post) and Dublin Core metadata to describe items. Omeka is commonly used for the creation of digital collections and for exhibitions based on those collections, and by archives, libraries and museums, and in classrooms. www.omeka.org

Open access: Material made freely available online. Usually refers to published peer-reviewed research made available without cost to the reader.

Open source: Software whose source code is made freely available and can be modified and redistributed, encouraging open collaboration on development of the software. Examples widely used in the field of digital humanities are content management systems such as WordPress, Omeka, and Scalar, and computational tools such as Voyant and Gephi.

Plugin: Software that adds a specific feature to an existing computer program. Used in WordPress and Omeka.

Programming language: A formal language consisting of instructions for computers, used to create programs that implement specific algorithms telling a computer what to do and how to do it. Each language has its own vocabulary and a syntax or grammar for organizing instructions. Languages commonly used in digital humanities include R, Python, JavaScript, and Ruby/Ruby on Rails.

Responsive web design: The design of web pages to render well on a variety of devices and windows or screen sizes. Since the default design for web pages generally assumes they will be viewed on a computer monitor, responsive web design means ensuring that those pages also render well on a phone or tablet.

Scalar: An open source content management system for publishing long-form digital texts. It is designed to allow for publications to be organized in nested, recursive and non-linear formats, and annotation of a variety of media. https://scalar.me/anvc/scalar/

Server; see Web Server

Spatial analysis: A computational method that involves mapping and other forms of visualization that employ spatial data to analyze historical processes. Mapping involves georeferencing location information to generate coordinates that can be mapped and visualizing that data using GIS software, web mapping platforms, or programming with open source tools such as Leaflet and OpenLayers.

SQL (Structured Query Language): A programming language used to query, insert, update and modify data in a database. WordPress uses SQL to manage the database that stores information about your site, one component of a CMS.
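For readers who have not seen SQL, the sketch below runs a few statements against a temporary database using SQLite, which ships with Python (WordPress itself typically uses MySQL, but the basic statements are similar). The table name, column names, and rows are all invented for the example.

```python
import sqlite3

# An in-memory database that disappears when the program ends.
conn = sqlite3.connect(":memory:")

# Define a table: each column is a field, each row will be a record.
conn.execute(
    "CREATE TABLE letters (id INTEGER PRIMARY KEY, sender TEXT, year INTEGER)"
)

# Insert three invented records.
conn.executemany(
    "INSERT INTO letters (sender, year) VALUES (?, ?)",
    [("A. Smith", 1905), ("B. Jones", 1910), ("A. Smith", 1912)],
)

# Update one record to correct its year.
conn.execute(
    "UPDATE letters SET year = ? WHERE sender = ? AND year = ?",
    (1911, "B. Jones", 1910),
)

# Query: how many letters did each sender write?
counts = conn.execute(
    "SELECT sender, COUNT(*) FROM letters GROUP BY sender ORDER BY sender"
).fetchall()
print(counts)
conn.close()
```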

Structured Data: Data organized in a database or with markup tags, where each element fits a field in a table or has a label in a markup language. Structured data can be analyzed using computational methods. See also unstructured data.

SVG (Scalable Vector Graphic): An XML-based image format; as it is based on markup language, an SVG image can be edited as code in a text editor. SVG images can be created in graphics software such as Adobe Illustrator and Sketch.

Text analysis (aka text mining): The computational analysis of textual data (words) in digitized documents. Algorithms identify words by looking for spaces and punctuation, a process called tokenization. The simplest form of text analysis discards word order to count the frequency of words in a corpus of documents. Voyant is an open source tool for simple text analysis (https://voyant-tools.org/). This form of text analysis can also be used to measure and compare the similarity of texts by counting the words and phrases they have in common. Other forms of text analysis build on those algorithms to try to identify the semantic relationships between words, and consequently the concepts in texts; see Corpus Linguistics; Distant reading; Topic Modeling.
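The simplest steps described above can be sketched as follows. This is a simplified illustration, not how any particular tool works: the tokenizer just lowercases the text and keeps runs of letters and apostrophes, and similarity is measured with the Jaccard index (shared distinct words divided by all distinct words), which is one of several possible measures. The two sentences are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words, discarding spaces and punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents):
    """Count how often each word occurs across a corpus, ignoring order."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    return counts


def jaccard_similarity(text_a, text_b):
    """Shared distinct words divided by all distinct words in both texts."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Invented two-document corpus.
corpus = [
    "The strike began in the mill.",
    "The mill owners opposed the strike.",
]
frequencies = word_frequencies(corpus)
similarity = jaccard_similarity(corpus[0], corpus[1])
print(frequencies.most_common(3))
print(round(similarity, 2))
```

Very common words such as "the" dominate raw counts, which is why many tools offer the option of filtering them out before analysis.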

TEI (Text Encoding Initiative): A set of guidelines that define an XML markup language format to tag textual components (e.g. word, sentence) and concepts (e.g. person, place). TEI is widely used in literary studies and in digital editions of texts.

TIFF (Tagged Image File Format); TIF: An image file format supported by a wide variety of software that uses lossless compression (meaning no image quality is lost when the file is edited or saved) and consequently is the file format used for preserving images. Other common lossless file formats are PNG and GIF. See also JPEG.

Tool: A term for software used in the digital humanities.

Topic modeling: Builds on text analysis using algorithms that capture semantic features by identifying clusters of words (topics) that are more likely to appear in proximity to each other. The algorithm divides the texts into as many topics as the user specifies to produce a model of the possible themes of the corpus. It is up to the researcher to determine the meaning of those topics; a topic could capture stylistic features or systematic OCR errors as well as themes.

Unstructured Data: Data that is not organized in a database or with markup tags. The text documents that humanities scholars commonly study, for example, are unstructured data; they can have elements of structure, such as the date, sender and recipient information in a letter, but not all the text fits those categories. Information in unstructured data needs to be tagged in a consistent way or extracted and organized in a database before it can be analyzed using computational methods such as mapping and network analysis. Unstructured textual data can be analyzed with computational methods such as text analysis, topic modeling and corpus linguistics. See also structured data.

URL (Uniform Resource Locator): Commonly referred to as a web address, a URL specifies the location of a web site or web application and a mechanism for retrieving it. It is usually displayed in a web browser above the page in an address bar. A typical URL includes a protocol for how the data is transmitted (usually http or https), a domain name identifying the location of the web site (e.g. historians.org), and a file name (e.g. index.html) identifying a specific part of the web site.
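The parts of a URL can be separated with Python's built-in urllib.parse module, as in the short sketch below. The address is assembled from the examples in this entry purely for illustration; the code only splits the text and does not visit the page.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into its protocol, domain name, and path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # how the data is transmitted
        "domain": parts.hostname,   # which web site is being addressed
        "path": parts.path,         # which file or page within the site
    }


components = describe_url("https://historians.org/index.html")
print(components)
```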

Virtual reality, VR: A computer-generated simulation that immerses the user in a three-dimensional environment with which they can interact. Current technology uses headsets to generate images, sounds and sensations, sometimes augmented by controllers that transmit vibrations and other tactile sensations.

Visualization, data visualization: Placing data in a visual context in order to analyze and communicate it; encompasses images, diagrams, graphs, maps and animations. Most computational methods produce visualizations. Visualizations in digital humanities are commonly research tools produced to explore data, but they can also be used to communicate arguments.

Web application, Web app: Software that runs in a web browser rather than on your computer desktop. Web apps are stored on web servers rather than installed on your computer. See also API.

Web archive: Content collected from the web in order to preserve and provide long term access to information available online. Collection is typically done automatically using web crawlers. The information collected includes web pages, CSS style sheets, images, video and metadata. The largest web archiving organization is the Internet Archive, which aims to archive the whole web. National and local agencies are also creating web archives of specific domains.

Web crawler, aka spider: An internet bot that systematically browses the web. Generally used for indexing the web, but also to automatically collect data for web archiving.

Web hosting: Providing a web server on which files, instances of CMS and web publishing platforms, and web applications/software can be made available on the internet. Some free hosting is available, usually only for specific platforms and with limited functionality and advertising. For example, a free WordPress site is available through WordPress.com, and a free Omeka site is available through omeka.net. Users of that hosting do not need to manage the servers in any way, so they are easy to use, but in both instances only some of the platform's features are available. A dedicated or managed hosting service leases space on its web servers, on which clients can store files and install software of their choice. Dedicated hosting requires an annual payment and some knowledge to manage. Both the cost and skill required are diminishing. Reclaim Hosting is a service widely used in higher education in the US, and offers hosting beginning at $30/year (2018) and one-click install of platforms such as WordPress, Omeka and Scalar that handles the most complex aspects of installing software.

Web mapping: Platforms such as Google Maps that offer online access to geographical data and APIs that allow users to create custom maps. An alternative to GIS used widely in digital humanities. Open source web mapping software developed for the humanities includes Neatline (a set of plugins for Omeka) and Palladio.

Web Page: A file written in HTML and stored on a web server connected to the internet.

Web Server, or Server: Refers to computers connected to the internet, and to the software they run that delivers files to the web in response to requests from other computers. See also LAMP

Web Site: A collection of web pages stored on a web server connected to the internet. Web sites are now typically created by using a CMS such as WordPress or Omeka, but they can simply be a set of files written in HTML.

WordPress: An open source content management system originally developed for blogs. WordPress allows the creation of pages and posts; pages do not have a publication date and are intended for static content in a fixed location; posts have a publication date and appear in reverse chronological order, and can be tagged and categorized. Additional features can be added to a WordPress site by installing plugins.

WYSIWYG (“What You See Is What You Get”): Interfaces for editing content that display the content as it will appear when published. They provide an alternative to interfaces that display the tags and markup language used to make the content appear in that way. The classic WordPress editorial interface provided a tab to view the content as it would appear (Visual) and a second tab to view the markup that produced that appearance (Text).

XML (EXtensible Markup Language): A markup language that uses tags to describe the content that it is identifying: title, author, year, genre etc. XML files are a form of structured data that can be analyzed using computational methods.
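As a brief illustration of XML as structured data, the sketch below reads a small invented XML document with Python's built-in xml.etree.ElementTree module and turns each record into a dictionary. The tag names (library, book, title, author, year) and the genre attribute are assumptions chosen to echo the examples in this entry; real XML formats define their own vocabularies.

```python
import xml.etree.ElementTree as ET

# Invented sample document: tags label what each piece of content is.
SAMPLE_XML = """<library>
  <book genre="memoir">
    <title>Example Title One</title>
    <author>Author A</author>
    <year>1901</year>
  </book>
  <book genre="novel">
    <title>Example Title Two</title>
    <author>Author B</author>
    <year>1923</year>
  </book>
</library>"""


def read_books(xml_text):
    """Turn each <book> element into a dictionary of its labeled values."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": book.findtext("title"),
            "author": book.findtext("author"),
            "year": int(book.findtext("year")),
            "genre": book.get("genre"),
        }
        for book in root.findall("book")
    ]


books = read_books(SAMPLE_XML)
novels = [book["title"] for book in books if book["genre"] == "novel"]
print(books)
print(novels)
```

Because every value carries a label, a program can select or count records by any field without having to interpret free-running text.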