AHA Topics
Research & Publications
Thematic
Digital Methods
Episode Description
Historian and quantitative methods expert Jo Guldi discusses text mining, AI, and the wider landscape of digital history in this longform conversation. Guldi’s work on these subjects can be found in two recent AHR articles—“The Algorithm: Mapping Long-Term Trends and Short-Term Change at Multiple Scales of Time” published in the June 2022 issue and “The Revolution in Text Mining for Historical Analysis is Here” from the June 2024 issue—and in the book The Dangerous Art of Text Mining: A Methodology for Digital History published in 2023 by Cambridge University Press.
Daniel Story
Welcome to History in Focus, a podcast by the American Historical Review. I’m Daniel Story. And a very happy new year to you and yours. Today’s episode is something a little different. It’s a longform conversation with historian Jo Guldi on the topics of text mining, AI, and the broader landscape of digital history. Guldi is Professor of Quantitative Methods at Emory University and the author of a number of recent works exploring the implications of text mining for historical work, including two AHR articles: “The Algorithm: Mapping Long-Term Trends and Short-Term Change at Multiple Scales of Time,” published in 2022, and in 2024, “The Revolution in Text Mining for Historical Analysis is Here.” Guldi is also author of the book The Dangerous Art of Text Mining: A Methodology for Digital History, which was published by Cambridge University Press in 2023. Our conversation ranged widely, and even before we could get into the interview proper, we were talking about the sudden ubiquity of AI, the faulty update from the cybersecurity firm CrowdStrike that had crashed millions of Windows machines around the globe just a few days prior, and Jo’s recent house guest.
Yeah, that was something else. It’s like, we’re all in a frenzied discussion over many months now about AI, and then in one day, something, I mean, who knows if AI was the one that designed and pushed out this update?
Jo Guldi
Oh, I think it could’ve easily been.
Daniel Story
There’s food for thought.
Jo Guldi
Yeah, yeah. I don’t know how much we want to get into it, but the house guest is actually the director of an AI company that works on LLMs, and we have this, you know, who knows if I will ever publish about it, or in what form, but we have this, like, little experiment going where I’m the AI skeptic, and I’m like, “the AI can never do history. Software sucks. Hallucinations are a real problem. I trust counting words, but not your neural network.” And he’s like, “no, no, no, it’s gonna work.” And so, like, he comes out here for a week at a time, and then we goof around, and I’m like, “make it interpret parliament. Make it look up these words,” and then it fails. And I’m like, “see!” And he’s like, “no, no, it was a technical issue. We can do it right.” And I’m like, “okay.”
Daniel Story
Yeah, I’d love to talk about that. I mean, how would you characterize the difference between the sort of work that you do, which you just kind of shorthanded as counting words, and what we are all now talking about as AI? Like, for someone who really doesn’t understand. Maybe nobody understands. But what’s the difference between those?
Jo Guldi
So that’s a really good question. So how is the kind of word counting that I do different from LLMs?
Daniel Story
Can you, sorry, can you tell people what LLM stands for?
Jo Guldi
Yeah. So large language models, LLMs, are the state of the art as of 2024 in artificial intelligence. So when people talk about GPT, GPT is a brand name; the larger category is large language model. It’s a huge model of weighted networks of words with some predictive algorithms that allow it to predict the next word, to create human-sounding sentences. But as most people who have experience with LLMs know, it can also create a lot of what the industry calls hallucinations, what the rest of us call errors. Like you ask it who the first Black president of the United States is, and it might just say John Adams today, you know, especially if it’s been watching Hamilton. And that’s not because it’s thinking; it’s because it doesn’t know how to think about time or anything else. It hasn’t been programmed to think about things. It’s been programmed to string together words in such a way that it sounds like a human, not necessarily an educated human. So it can’t do math. And I would make the argument that it can’t do history. It really can’t do history. It can sometimes recycle things that historians have said about the American Civil War, and about Abraham Lincoln particularly. A lot of people have written about him, and a lot of reviewers, newspaper critics, and folks on Reddit have taken up the theme of Lincoln’s childhood, so there the LLM can do a decent job. But can the LLM go to an archive online, look up an artifact that nobody has seen before, study it, transcribe it, talk about its historical significance? We are very far from that. We are very, very far from that. And I think there are some structural reasons why it’s going to be very difficult for LLMs to become helpful to historians in the ways that we would like. And that’s not to say that no historians are doing that work. Bill Turkel, in Canada, is throwing a workshop this fall that I’m very excited to be heading to, where he’s going to show us some of the work.
Bill has been thinking about having artificial intelligences help him in the writing of history, crawling Google Books. He’s been doing that for 10 years; I’ve been hearing about it. So he’s really far ahead. You know, at the AHA this past January, I talked to a gentleman who had been using LLMs in the context of a business history course. This is Louis Hyman of Johns Hopkins University, who has been using LLMs in his business history course to take scans of punch cards about the labor movement at the beginning of the 20th century, to create a quantitative data set of union membership, I believe, and then to analyze it using statistical models and then create visualizations. So that’s work that he says would have taken his students three semesters to do traditionally, and they are able to do it in the course of one semester, just because they can ask the LLM for help in basic processing tasks and basic statistics tasks and to make recommendations about how they proceed. So LLMs are affecting the work of historians. But the work of analyzing texts over time, which is the work that I’ve been doing with computers for 10 years, is work LLMs may have a much harder time with, and there are some structural reasons for that. One of the big reasons is the databases that OpenAI and most of the other LLM companies are using for their texts. You know, they have huge numbers of texts; legendarily, their LLMs got much better when they just increased the number of texts that the computers were looking at. But their texts don’t have a history field. What do we mean by that? It’s not that the LLMs are undergraduates who aren’t history majors. I mean, obviously they’re not history majors.
But at the back of everything, there are some spreadsheets that have text in one column, like a web page and all of its text, and it might have another column where they label the source, like, this is from the media, and this is from Reddit, and this is from Twitter, and this is from the New York Times. There isn’t a column for the date. There isn’t a column for the date.
Daniel Story
How can this be?
Jo Guldi
Why might that matter? Well, it might matter if you care about things like discoveries as they affect the truth or new knowledge. Like, if you expect medical research to progress, you might want your computer to be able to discount things that we believed 30 years ago and place a priority on new medical research that everyone’s talking about. But, you know, it also might be important for new discoveries about the American Civil War, or new attitudes about the presidency, just keeping up with the latest trend of what the young people are talking about on the social apps. So part of the problem is that the people who were designing the LLM weren’t history majors, and so they know about computers and they know about data, but they don’t know how to think about time. So I’m working my way back to telling you about the stuff that I do and how that’s different. I started doing digital history 10 years ago. I’m part of the cohort of digital historians, digital humanists, who came of age when Google Books was released after 2006, and we said, “Wow, there are a lot of words, and they have a date field, and we could teach computers to read them, and then maybe we could make some discoveries.” And I did this at the same time as I was doing my archival research; I’ve done both the whole way through. As I started writing these articles, at first they had a terrible, terrible review process. You know, for my first digital history article in the Journal of Modern History, there were like three reviewers who were confused, and then we threw them out, and there were another three reviewers. I think we had like nine reviewers in all, and then I had to respond to the nine reviewers. It was insane.
Daniel Story
Wow, and what year would that have been roughly?
Jo Guldi
That process probably started in about 2009. The article came out a couple of years later. That was the “words for walking” article. The readers weren’t sure what they were looking at, and I wasn’t sure how to be persuasive. But over these last 10 years of talking to readers and talking to editors in the history field, those conversations have become a lot more productive, and I mean really seriously engaged with: how do we know what we know, and how do we get to consensus about what the facts of history are, and what can we trust the computers to do, and what do we not trust them to do? So the baseline is that I use computers to count words over time. There’s always a time field. There’s always a column in my spreadsheet that’s the year of the speech in the parliamentary debates. And then I want to know things like, “Do they mention refugees during the Crimean War? Did they mention them during the Napoleonic Wars? Do they use the same word? Did it change?” And you know, you could look that up. My computer can look that up, and a series of other questions, instantaneously, and do a little bit of math to tell you which words change their statistical significance between those two periods. And then I get a list of words that have changed. So the kind of work that I’m doing is influenced by the thinking of generations of concept historians, from Raymond Williams to Reinhart Koselleck, but, you know, I’m coming from a place, like other historians, of asking how words change their meaning in a way that’s totally transparent. Like, I can go back to a passage in the text and show you, “We used to call the refugees this. Now we call them that.” I just made up that example from whole cloth, so I don’t have a good one at hand, but say we used to talk about culture as an elite artifact. Now we talk about culture as a national artifact.
That would be a Raymond Williams type argument. I can show you the passages in the text where that’s drawn from. So it’s all what we call “white box methods.” “White box methods” is a term for statistical methods that are not a black box. I’m not going to wave my hands and tell you the computer knows what’s going on, that it’s generating automatic intelligence. I’m doing word counts, and those counts are based on actual primary texts, actual sentences. I can pull them up. I often do pull them up, and I can show you exactly what it’s based on. And we can check: if there’s a third meaning of the word culture that the computer didn’t get or I didn’t anticipate, then we can actually go back to the text and argue about it. So the process of reviewing and thinking about what we’re doing in the history field has made for a set of “white box methods” that I think at this point are something that we can all agree upon. They’re not alien to the field of history. They’re very much grounded in theories of history. They’re grounded in techniques of argumentation and sourcing that make sense to historians. When historians sit down with them, even the math behind a topic model or behind the distinctiveness measures that I use to understand how one decade is different from another, I can walk you through it. It’s all pretty transparent.
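[Editor's note: the "distinctiveness measure" Guldi mentions can take several forms; the following is a minimal illustrative sketch, not her actual pipeline. It uses Dunning's log-likelihood (G²), one common choice for ranking which words most separate two periods of a corpus, and the toy "decades" are invented data.]

```python
import math
from collections import Counter

def distinctive_words(period_a, period_b, top_n=5):
    """Rank words by Dunning's log-likelihood (G^2): which words are
    used at the most different rates in two corpora?

    period_a, period_b: lists of tokenized documents (lists of words).
    Returns the top_n words whose rates differ most between the periods.
    """
    ca = Counter(w for doc in period_a for w in doc)
    cb = Counter(w for doc in period_b for w in doc)
    na, nb = sum(ca.values()), sum(cb.values())

    def g2(word):
        a, b = ca[word], cb[word]
        # Expected counts if the word were used at the same rate in both
        e_a = na * (a + b) / (na + nb)
        e_b = nb * (a + b) / (na + nb)
        score = 0.0
        if a:
            score += a * math.log(a / e_a)
        if b:
            score += b * math.log(b / e_b)
        return 2 * score

    vocab = set(ca) | set(cb)
    return sorted(vocab, key=g2, reverse=True)[:top_n]

# Toy "decades": the shared word scores 0; period-specific words rank first.
napoleonic = [["war", "war", "fugitives"]]
crimean = [["war", "war", "refugees"]]
top = distinctive_words(napoleonic, crimean, top_n=2)
```

Crucially for the "white box" point, every score traces back to raw counts in named documents, so a skeptical reader can pull up the underlying passages and argue with the result.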
So what the field of digital history is doing is designed to be totally transparent and reliable, in the same way that the work of historians always has to be reliable, trustworthy, truth-based, source-based. Arguably, the standards for sources, for annotating your sources, and for making arguments about truth are working at a higher level in the field of history than they are in many data-driven fields in the academy at this point, because we have centuries of arguing about words, line by line, in original documents, of checking each other’s archives, of understanding where disagreements about interpretation can arise, of theorizing the causes of those disagreements, and of trying to move to greater consensus and greater skills of validation as a field. So there’s a way in which history has, if you will, been exempt from the kind of validation crises that shook economics and shook psychology. It’s not that we never disagree about things; sure, we disagree about them all the time. But we have a productive way of reviewing, of asking questions in the review process, and then, when we have theoretical divisions in the field about, you know, the nature of American history, it’s more or less clear what those divisions are about. It’s not just interpretive; it’s not just separate sources. We know what we disagree about and why. As computer science and data science in general start working with words, historians have a lot of expertise to contribute. We understand there’s still polysemy. Ten readers of a novel can still create 10 different interpretations, and yet we can come to some consensus about the meaning of any given Jane Austen novel in its time, its significance, what it’s representing, and so on. We can have debates about fundamental laws, fundamental ideas of history, the meaning of certain events and wars or personae. We could disagree up to a point, but fundamentally we’re after an objective truth.
And so I think we’ve got a lot of expertise to contribute at a fairly high level to these distant domains as they get deeper and deeper into the world of language.
Daniel Story
It seems to me like what we need to be working on is both thinking through that potential contribution we can make, as well as learning how to articulate it and actually engaging, right? Which is what I see you doing in this kind of prolific streak of text-mining-related publishing that you are, very impressively, currently on, with these couple of articles in the AHR and then your recent book, “The Dangerous Art of Text Mining.”
Jo Guldi
Yes, yeah. So those articles in the AHR and “The Dangerous Art of Text Mining” are all products of this 10-year project that I embarked on while I was still at the Harvard Society of Fellows. I worked on it while I was at Brown University and continued to work on it at other universities, and now I find myself in a data science department, because I’ve talked to statisticians for so long about these things that it turns out they wanted to keep talking, which is, like, a nice feather in the cap of a historian, to be like, okay, I got it right. Clearly I’m making sense to some people.
Daniel Story
Yeah, so you’re, you’re sort of building these connections and bridges full time now.
Jo Guldi
That’s right, that’s right. But I haven’t, you know, I haven’t given up my membership in the AHA. The AHR is still the high bar of publishing in our field, and the place that I’ve come to is not that I’m giving up on history for data science. It’s that the high bar of truth, of historical understanding, set by the history field is incredibly important in an age of relative truthiness, an age when we’re outsourcing the work of analyzing language, and even writing language, to computers. It’s not just that we have to correct our students’ essays. It’s that we as a field know a lot of fundamental things about the range of what we can agree on and where we’re going to disagree, of what you can argue based on the sources and what’s just wrong. Those high standards of objectivity are now more and more relevant in an age of computational knowledge, and I think the field of digital history has a lot in particular to offer. So this is the argument of “The Dangerous Art of Text Mining.” It’s written for a lot of different audiences. It’s written for humanists and historians who may know nothing about the age of artificial intelligence and may feel totally uncomfortable thinking about algorithms. It’s also written for data scientists who may know nothing about history or the humanities. The argument is that we are in a crisis of truth, in which a lot of the tools journalists, businesspeople, even data scientists are using to analyze texts are just wrong. They’re just wrong for the job. They’ll get you garbage results. They’ll tell you that Donald Trump and Barack Obama sound exactly the same in their speeches. That’s obviously wrong. An example that I give in the introduction is how to make a word cloud that says exactly that; it’s the easiest thing to do.
In this age of a crisis of truth, historians have a lot to contribute, and historians’ methods and ways of thinking about truth and thinking about text can be the basis for a newly reinvigorated, smarter data science, one that can really work with questions about change over time, questions about what the most important event is, questions about how language changes, in a way that’s deeply grounded in the theory of semantic meaning that’s emerged over the last 70 years of scholarship in the humanities and social sciences. So the book is organized in three parts. The first part begins with a kind of rogues’ gallery of all of the science papers that have had to be retracted publicly, to great embarrassment, because historians looked at them and were like, “It’s not true that there’s no incest in Europe in the age of the Enlightenment. Have you heard of the Hapsburgs?” Generally that happens because there’s a mismatch between data and analysis, or there are gaps in the data that have to be understood by people familiar with the historical context or the generation of the database, or by people who just know where all of the data is. Often we do have digitalized data sets; it’s not just that the material is in the archive, it’s that there are a lot of different data sets, and you should talk to a librarian. So that first part is like a long exercise in the things that you need to know: you should probably work with a librarian; you should probably think a little about the value of words and the polysemy of words. Words can be barriers. Words can be doors. The second part takes a turn into a more technical dimension.
It starts off with theories of temporality that include memory, event, causation, and it goes back to Koselleck, to Astrid Erll, to other theorists of temporality from recent decades, and then for each of those categories of temporal experience, it asks: is there an algorithm that can help us to understand this temporal experience? So what’s really distinctive about this is that if you think about culturomics 10 years ago, you think about, okay, we’ve got Google Ngrams. We can count the number of times people say science versus religion over the course of the 19th century: science goes up and up, religion goes down and down. We kind of knew that, but now we know it in a different way, because we can count things. That’s good, but there’s just one x-axis. There’s just one timeline, and that’s not how historians think about anything. Right? Time is multidimensional. We have lots of different ongoing timelines at the same time. We’ve got the social dynamics, social movements. We’ve got what the diplomats and politicians are doing. We’ve got intellectual history and the history of science happening somewhere else. There’s also the climate. Like, how many dimensions can I visualize at once? Because that’s going on in the head of any historian when they’re analyzing any particular event. So the argument about how to deal with data is that it’s insufficient to reduce everything to the Ngram. We can do a lot better if we start thinking about how different algorithms can model memory, or how a different algorithm can just look for discontinuities in the expression of a certain kind of word. So it’s looking for events. It’s looking for periodization. So it helps to go back to the theory. I spent a lot of time talking to Astrid Erll and Koselleck about what these categories of temporal experience mean, and then trying to map them onto an algorithm. And it’s never one-to-one, like this algorithm is a perfect fit.
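[Editor's note: the single-timeline Ngram counting Guldi describes as the baseline can be sketched in a few lines of Python. The tiny corpus below is invented for illustration, not drawn from Google Books or the parliamentary debates.]

```python
from collections import Counter

def yearly_frequency(docs, term):
    """Ngram-viewer-style trend: relative frequency of `term` per year.

    docs: list of (year, text) pairs.
    Returns {year: occurrences of term per 1,000 words}.
    """
    hits, totals = Counter(), Counter()
    for year, text in docs:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    return {year: 1000 * hits[year] / totals[year] for year in sorted(totals)}

# Invented stand-in for a dated corpus
docs = [
    (1800, "religion guides the nation and religion guides the home"),
    (1890, "science advances and science explains what religion once did"),
]
trend = yearly_frequency(docs, "science")
```

This is exactly the "one x-axis" picture the book pushes past: a single per-term curve, with no way to represent the parallel social, diplomatic, intellectual, and climatic timelines she lists.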
Usually, there are three algorithms that get at different aspects of a problem, like memory. And so you have to become familiar enough with the algorithms that you know what each one can do and what it can’t do. I tried to work through that in very plain language that assumes no mathematical ability, only some general-interest understanding of the British history data set that I’m working with, and no familiarity with the historical concepts, so we define each of these in turn and then work through them. That’s the second part. And then the third part is very brief. It’s an attempt to look into the future. What does this imply for what the study of digital history could do? A map of culture shorn of bias, where we recognize the silences of the archive enough to map what we can know, and to not try to map the things that we can’t know, and to respect the voices that have been silenced in the archive and respect the limits of digital history. What might this mean for North American departments? In comparison with “The History Manifesto,” my latest version of this is quite modest. It’s like, you know, maybe there will be a day in which departments in the United States rise to the level of Europe and Canada, where today there’s a digital historian in every department. It’s not that everyone becomes a digital historian. It’s that there may be one person who’s teaching these text mining methods, and there might be one person who’s dealing with the new demographic or geographical data sets. Those are two very different forms of expertise. Maybe there are two of them in your department. That would be a huge transformation for the United States, where the leading universities that create graduate students have no one fully employed, certainly not on the tenure track, or next to no one, who works in these methods.
So we might see that sort of a transformation. Or another possible future is that there might be collaborations between data science departments, like the one where I’m located currently, and traditional departments of history. We may see greater outsourcing than collaboration, where you see graduate students working in data science departments responsible for digitalizing new data sets, for coming up with new algorithms, for thinking about the match between historical theory and algorithmic functioning. I think this is a moment in which the future of history in North America, and I should say in the United States, because Canada is a different story, is really at a turning point. You know, we could dive into this moment. Or we could continue on a more antiquarian line, where we’re like, no, no, we don’t need to talk about LLMs; we’ll leave it to the computer science department to rediscover the laws of temporality, or whether there’s truth in language, on their own. So is there going to be a coalescing of the university, or is it going to be a kind of further splintering into C.P. Snow’s two cultures, where it’s going to be more history for history’s sake, just occasionally digitalizing a pretty volume so you can see it on the internet? I don’t know. I don’t know. So that’s the book in a nutshell.
Daniel Story
Yeah. One of the things that just occurred to me is how, given the state of digital history in the US, many of us who do this kind of work, and who are not just curious about it but really invested in it, tend to be, and I don’t know if you would agree with this adjective, a little bit nomadic in the way that we navigate our careers, thinking about both the institutions that we might end up in and the types of departments we end up collaborating with or, in fact, working in. I have a history PhD, but I am a librarian. My job title is digital scholarship librarian, and I do all the kinds of work that I want to do in this role, the kinds of work that probably most history departments wouldn’t allow me to do, or wouldn’t be quite so excited about. And I know that your journey has been an interesting one, exciting, but you’ve taken different routes and paths as well.
Jo Guldi
Absolutely. Yeah. So, you know, I started becoming interested in whether methods might have something to do with the future of the history discipline while I was still in graduate school, studying for a PhD in British history with James Vernon and Tom Laqueur at the University of California, Berkeley, which I left in 2008. So Google Books launched in 2006, and I was just, you know, doing totally normal, traditional history stuff. I had been in the archives. I was back in the library, looking up secondary sources. And then occasionally I was Google searching things at night, like when the library was closed, and one day, the Google search results on one of the obscure Scottish politicians I was interested in exploded, and I had looked for this guy the week before. I was just trying to wrap my head around the biography of Sir John Sinclair of Ulbster. And there was nothing on Sir John Sinclair of Ulbster; there was a genealogy website that acknowledged his existence in week one, and then one week later, there were like 500 hits on Sir John Sinclair of Ulbster. And I was like, what just happened? And I almost fell out of my chair. The answer was that Google Books had just launched, and they had digitalized the libraries of the New York Public Library, the Harvard library, the Stanford library, and you could just keyword search all of these with the stroke of a keyboard while sitting in your pajamas at two in the morning in Berkeley. So, you know, I had friends in Silicon Valley who were like, “Jo, do you think the internet is changing the practice of your field?” And my answer had been “no, it’s not,” until this happened. And then I started to say, “yeah, you know what, I think it is.” And so I wrote some blog entries about this. The blog entries went viral. They were taken up not by historians, but by Silicon Valley people who were excited to hear that Google had done something that might change the future of history.
And then meanwhile, in a totally different corner of North American university practice, some historians at the University of Chicago, Adrian Johns and James Sparrow, had been talking to the Mellon Foundation, and the Mellon Foundation had long believed that computers were going to change the practice of the humanities. So they asked Jim and Adrian if they would like a postdoctoral fellowship in digital history. Our colleagues said yes, and they looked around for someone to hire who had been thinking about digital history. What they found was my blog entries. So all of a sudden, I found myself with a position at the University of Chicago. Chicago from 2008 to 2011, while I was there, was an amazing place to think about these things, because there was a project in the digital humanities in French literature, associated with Robert Morrissey, which argued about concept history, using the Encyclopédie and other sources to try to trace meaning during the French Revolution. They’d also engaged, in the Journal of Intellectual History, in a debate with a Princeton historian about the nature of concepts and whether computers could engage and promote more knowledge. Chicago was also a place where people were talking about longue durée history, where people like Frederick Johnson and Dipesh Chakrabarty were already starting to think about how considering history in light of climate change might mean that we reached over longer and longer time scales. So I arrived at Chicago and I was like, this is a vision of history which is totally different from what I was learning at Berkeley. This is not what they’re doing at Harvard. This is not what they’re doing at Cambridge. Chicago is doing something totally new, and this is great. We’re going to be able to do something that we’ve never done before. I should try this. I would love to try this.
So, you know, I wrote a first article about words for walking and how words like “lurch” and “slipslopper” and “streel” started to describe the movements of strangers as they walked down the streets of 19th-century London. There’s like one decade where, if you search the terms in Google Books, you can see an explosion of these words for walking. It was my first digital history essay, published in the Journal of Modern History, because the Chicago editor of the Journal of Modern History, Jan Goldstein, was willing to take a bet on a young historian and just said, “Keep on going. You know, it doesn’t matter if the readers are confused. Answer them as best as you can. This is an exciting new development.” So I was encouraged there. And then I got the Harvard Society of Fellows, and they said, you know, “Take three years, just really do this, go after all of your crazy ideas; here are tons of resources.” So I was experimenting with digital history at the same time I was thinking about the longue durée. I was thinking about environmental history. And so I had started the work that later became “The Long Land War: The Global Struggle for Occupancy Rights,” about property ownership in the 20th century. I had started the archival work for that project. I was imagining a way in which all of this would come together, and I would be able to use text mining to tell a global history of longue durée property rights, a kind of ambitious project that no one who did traditional history would have been able to do. And aspects of that plan turned out to be total fantasy. I’ve written an extensive blog for the Royal Historical Society about the parts of that I just got wrong, flat wrong, things you just cannot do with the data sets we have right now. Like, check in in 10 years.
But at no point in the last 15 years has it been possible to do 200 years of global history with data. We just don’t have the right data sets, and the tools that we had right then to assemble data sets could not support my ambitious project. So I had to do it archivally, on the one hand, and then try data analysis with the parliamentary texts, which we had a pretty good data set for. So I was doing 19th-century political histories of ideas with data at the same time as I was doing global 20th-century histories of development economics and property ownership in the archive with totally traditional methods. I was thinking maybe they would converge. They didn’t converge. “The History Manifesto” was basically intended by me as a postcard to the rest of the discipline: “Dear discipline, dear colleagues, before I plunge off of the cliff of doing something really crazy, this is what I was trying to do: see if the longue durée is relevant, do something relevant to the environment, and see if computers matter.” So it took a long time to figure out how computers matter. It took a long time to figure out how to do it to the high standards of argumentation expected by the history discipline. So in the process of working out what became “The Dangerous Art of Text Mining,” I was invited to all of these historical conferences in Europe and Latin America and Asia. I was invited to argue with historians from concept history, from political history, from diplomatic history, from cultural history, most of whom didn’t agree with my methods and had great misgivings about whether digital history was possible or whether text mining could ever be used for their subject. So I went and I had terrific arguments with them.
Now, in the meantime, things got very strange for me when I was hired at Brown University as an assistant professor in British history. I arrived and told them in my job talk that I intended to do this longue durée analysis of property law and that I was working with digital history, and I presented to them on my very traditional first book, and they said, “Great, come.” So I was quite surprised when two years later they said, “Um, no, we’re not sure about this digital history thing. It’s a little scary. We think you should drop it.” And at this point I had just been awarded a million dollar NSF grant to support the longue durée analysis of textual archives. So they said, “No digital history, no longue durée, no property rights. Could you just write a sequel to the first book? Could you do Victorian trains?” So this was, like, handed to me after I had been promised tenure on the basis of my first book. There were misgivings, some chatter. There were some British historians who really liked “The History Manifesto.” The “History Manifesto” program is now basically what’s pursued at the University of Chicago and a lot of other places. But there were some British historians who were like, “Oh, I don’t know if this is the future.” And so my senior colleagues at Brown had some misgivings, and they responded by saying, “Well, we really need to take this assistant professor under control. We’ll set some boundaries, and then we’ll promote her, and then she can experiment.” Now, there’s a problem with taking that attitude. I think this happens at a lot of places where senior colleagues have good intentions. They want to ensure that their junior colleagues are as successful as possible. So they want to show them what they think the road to Zion looks like: for your own good, you should write the following kinds of scholarship. I think that happens a lot. It happens a lot with the best of intentions. 
But what I knew, and I think what many young people know, is that
Jo Guldi
if you’ve got an ambition and you’ve got a vision, you can get there if you have enough freedom and enough support. And I had the financial support; I had had my vision endorsed by all sorts of journals and institutions and publishers already. And so as a young historian, I was saying, “Well, I can do this, and actually now I’ve got a million dollars to make it happen,” but my institution didn’t support me. And so the choice was clear to me: I could drop the vision and stay with the institution and have the prestige of being at this lovely university with my brilliant colleagues, or I could drop the institution and go after the vision. And you know, so many of us become historians. We become scholars. We give up on dreams of money because we’re in love with the ideas, and so we follow the ideas wherever they go. And they might go to the library, they might go to a nonprofit, they might go to the museum, they might go to adjuncting. They might go to the community college and just writing book after book about the thing that you care about. For me, it meant going to a quiet university where I could spend my NSF grant. I could learn to code again, because I had coded as a child. I could work out what a robust version of digital history would look like that was acceptable to the historical profession as a whole. I could work on my longue durée history of property rights. I did both of those things from my quiet university. I published them. I found myself in dialog with lots of different historians who I had never been talking to before. And so for me, this was a totally successful trajectory. I’ve landed in a place where data scientists and statisticians want to talk about what history has to offer them, because we’re used to taking a deep, sober look at facts and truth, and so that’s a wonderful place for any historian to land. 
And then I have other conversations where, you know, I’m talking to people in development economics about the history of property rights and Elinor Ostrom and postcolonial economics and that sort of thing. And that’s another conversation that’s been wonderful to join. It’s not the world of Victorian British history that I left, but that world, in many ways, no longer exists. It certainly no longer exists in the form in which it existed 20 years ago. So it’s okay. It’s been okay, but I think it’s a story that highlights some of the tensions in modern history departments, as well as the way that those tensions can structure the careers of younger historians. It would have been a very different outcome had an Ivy League history department said, “Ah, this is brilliant. This is the future. We need to promote this and protect this. We need to give this resources. Let’s have more classes. Let’s hire more people like that.” That has happened in department after department in Europe, which is one of the reasons why I’ve been spending a lot of time in Europe over the last decade. It has not yet happened in the United States. There’s no US history department that has made that move, which is one of the reasons why so many of the people who have the expertise and the knowledge, the people I talk about in “The Revolution in Text Mining is Here,” are at work in libraries, or they’ve gone into private industry. Our first generation of pioneers is underemployed, underrecognized. There are some very good books from all angles. Now, there are some exceptions: Roopika Risam is at Dartmouth, Jessica Johnson is at Johns Hopkins. They have tenure. They’re doing fine, and well done them. They deserve it. Those are cases that have been really well protected, and I’m very proud of them. 
But in general, I think the methodological sophistication of the discipline as a whole has suffered because of the unwillingness of history departments to place bets on intellectually daring new uses of methods.
Daniel Story
I mean, I also started on this journey in grad school, and I think in some respects there’s a strange kind of ongoing identity crisis that I feel as a scholar. Because, you know, I’m trained as a historian, and I still have very close ties to the discipline, but I also work in a different context that I really embrace as well. And I don’t want to signal to either one that I don’t identify with you, or that I identify with you instead. You describe this kind of tension that’s out there in the world, but I think there’s also a tension that can happen internally, when you don’t seem to fit into any particular box very well from a kind of professional or disciplinary point of view.
Jo Guldi
I think that’s right. I think that’s right. And I think, you know, it happens in all sorts of dimensions in our lives. It could be, I’m a first gen scholar, and here I am in an Ivy League PhD program. Or it could be, you know, I do have a quantitative brain, and I learned GIS and I made all sorts of discoveries, but yes, I can still talk about literary theory with you. So those tensions exist in all of our lives, and they can be the source of enormous richness in conversation with each other. But I think you’re right. We’re in a curious historical moment in which those tensions feel embattled. It almost feels like both of those tensions that I named could be part of the culture wars. You could be seen as an outsider. You might not get a job or an interview because you didn’t match some set of expectations. Some of the digital humanists in literary studies have engaged this as a moral issue and recommended radical generosity as the attitude that is appropriate to these times, a radical spirit of generosity. And that can be around many kinds of diversity, right? It can be about being differently abled. It can be about different family experiences that I have in my past, or different training, different intellectual proclivities and methodology. A spirit of radical generosity is healthy for any community that cares deeply about the truth. And I think that is the core identity of history at this moment in the 21st century. We are the kind of community so dedicated to the truth that you can talk about postmodernism and deconstruction and the limits of interpretation, but you also believe that the American Civil War happened, and we know what it was about, because they left documents where they told us. We’re both a humanities subject and a social science subject. I mean, the discipline of history is kind of like Hinduism as a religious structure: it doesn’t conquer, it assimilates. 
You have a different way of knowing? That’s fine, the Buddha’s interesting, we’ll just make him into another god, and there’s another shrine over here, and there’s another shrine over here for Jesus. And, like, that’s okay. That illustrates where the history discipline is now, and we have fought wars about it. There was that moment of the quantitative turn in the 1970s. There was the narrative turn, which almost conquered it. But we have history departments in their diversity, where there are demographic historians and diplomatic historians and cultural historians and historians with very different experiences of what it means to be an intellectual. You know, “my parents worked in a factory,” “my parents were university professors”: we have that in the same department. So we’ve learned to deal with that. We can deal with that with digital methods, but I think we haven’t yet. And you know, the fact that we’re having this conversation about your experience in graduate school makes so much sense to me. I think it’s familiar to the graduate students who I talk to at leading departments of history right now. They feel these pressures, and they feel these pressures all the more given the job market pressures. They want to have the tools of digital history, just in case, because they see the numbers. They see the jobs in information science and data science departments that their friends are getting, and the increasing digital humanities asks from certain kinds of universities. At the same time, they want to be able to signal to leading universities that they’re still humanists to their core. And they play the same games about, you know, identity and personal history: what shall I disclose? What shall I not disclose? Sometimes it feels radically honest, but sometimes it feels imperiled and toxic, like you could make the wrong decision and be out of a job. And I think that’s terrifying. 
When I talk to graduate students today, I hear that it is absolutely terrifying to know what to signal to whom, not just about your personal history, but about, like, how you wrote the book. And my hope for our colleagues and for leading departments of history is that we should find, in this radical spirit of generosity, a way to make it possible to just esteem excellence wherever it’s found, to esteem insight however it is found, and to seek out a spirit of radical generosity in others and commend it and offer jobs to people who are similarly radically generous. Because that’s what’s going to allow this mood of inquiry and dedication to the truth to continue into another generation, into the future of the university. That’s what’s really needed.
Daniel Story
Very well said. So yeah, thanks for hanging with me for so long. Actually, I’m really enjoying this conversation, and
Jo Guldi
I’m enjoying it as well.
Daniel Story
You said that at one point you transitioned—I think it was when you left Brown—that part of what you wanted to do was find a quiet place to relearn how to code, or maybe learn how to code in the way that was going to be useful to you now.
Jo Guldi
Yes.
Daniel Story
I’m curious what that involved, and even down to the level of you know, when you code, what coding languages or language do you use? What tools do you use?
Jo Guldi
Yeah, so I learned how to code when I was 10 years old.
Daniel Story
As one does.
Jo Guldi
Not because I’m a prodigy, but because of where I grew up. So I grew up in a place that’s kind of like many communities in Silicon Valley today, places where people working in the data industry make sure that their kids go to maker faires, go to coding camps. I grew up in public schools in a Texas suburb on the silicon prairie. My daddy and everybody else’s daddy worked at Texas Instruments. My mom was a computer programmer before she had a kid. So they were mathy. And then we were early adopters of personal computing. So there was, like, a personal computer in the house with one of those green screens, which my mom used to balance the checkbook. And I was like, Mommy, Mommy, please, I want to play with this machine. And my mom, with, like, the busy professional woman mothering skills of the kind that I really admire, said, well, kid, the way you learn a programming language is that you read the manual, and you read the manual 10 times in a row, and after the eighth time, it will start to make sense. I was like, what? But I followed this advice, and indeed, I read it, and it was total gobbledygook. And I read it again, and it was gobbledygook. And maybe the third or fourth time, I was like, oh, that’s how a programming language works. And I used this, when I was just a kid, to create a, like, which-way text-based video game, no kidding, to navigate the Encyclopedia Americana entries on the history of the kings and queens of England. You know, obviously I needed to be a history nerd, but there were no history nerds available other than the encyclopedias, so I had to kind of invent it myself. Yeah, so I had that long-ago background, but it’s kind of like riding a bike. Once you’ve learned how a computer programming language works, it’s easy to pick it up again. 
I spent my teenage and young adult years really trying to figure out what it was that humanities nerds knew that nobody knew in Richardson, Texas, in this company town of a Silicon Valley proto-company. It was clear to me that there were just cultural limits to the world of engineers that I grew up in, in a way that I think was kind of prototypical of the culture wars today. The people who I grew up with didn’t talk about politics. They didn’t talk about where anybody’s ancestors came from. They didn’t talk about the causes of the Vietnam War, even though my parents and everybody else’s parents were of a generation where that could have been incredibly politically meaningful. And then I found myself later on the coasts, in places like Berkeley, California, interacting with all sorts of people my age whose parents had met each other at protests of the Vietnam War, or who had served in the Vietnam War and then come back with opinions about the state and a new perspective on Foucault. So, you know, I came from this place where I just didn’t have a map of history, and consequently I didn’t have a map of ideas or of ethnic experiences and time. And I, you know, rapidly started learning that the people who understood those things had useful perspectives on politics and literature, everything. So all I had when I left home was these skills with math and coding, because that’s what they equipped their children with in this company town. So, you know, mad respect for historians who come from a different way of knowing. My entire formation was like, wow, there’s secret knowledge out there. Code for me was just what everybody knew growing up. It wasn’t special. It wasn’t because I was a genius. It was like, everybody learned this thing; a trained monkey can learn this thing. There were really good coding classes in the public schools. 
So, you know, I like to tell that story, because the story says to me, anybody can learn to code. It’s just about repetition. It’s just about being exposed to it. Now, that said, I think that there are a lot of impediments in the way of people who have always pursued humanistic knowledge and who want to learn to code for the first time. One of the biggest impediments is that the way coding is taught assumes that you’re going to do accounting. Most coding classes start with “let’s program a calculator,” and you can work through the entire coding class and never learn anything that’s useful for digital history, beyond, like, the basics of what is a command and what is a data set and that sort of thing. So I’m involved in this: I’ve been teaching historians to code for ten years now. I teach in R, which is the language preferred by most economists and statisticians, and I also teach in Python, which is the language preferred by most computer scientists. They can both do wonderful things. Each one is slightly better for different kinds of tasks. Python I much prefer for scraping data from the web; if all you want to do is scrape the Old Bailey archive online or scrape the Library of Congress, learn Python. If what you really want to do is take a data set of the parliamentary debates of Great Britain or of the debates of the United States Congress and Senate and understand how the words changed over time, start with R—the packages are better. Although you can do both things with both languages. There are more resources for getting started right now. Programming Historian is very good. I share all of my scripts for learning to code in Python on my GitHub page. And we have a textbook I’ve written with my coauthor, Stephanie Buongiorno, on learning to code in R for textual analysis, which we hope will be under contract soon. It’s in review at Cambridge. 
It’s just gone through the review process at Cambridge. We had a lovely review process. So that will be written specifically for historians, with no prior knowledge of code and no comfort in code required. So, you know, I think it’s out there. The difficulty is just, you know, don’t get lost in the weeds of hardcore quants teaching you how to code like them. It’s a different thing to learn to code like a humanist.
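To give a flavor of the kind of first exercise Guldi describes, one closer to historical questions than to programming a calculator, here is a minimal, hypothetical Python sketch (the toy corpus and function name are invented for illustration; her actual teaching scripts are the ones shared on her GitHub page) that counts a keyword’s appearances year by year:

```python
from collections import Counter

# A toy corpus: year -> text of a (hypothetical) debate.
# A real project would load thousands of speeches from files or a scraped archive.
corpus = {
    1845: "the corn laws the corn laws free trade",
    1846: "repeal the corn laws free trade free trade",
    1880: "tenant right rent the land war tenant right",
}

def keyword_counts(corpus, keyword):
    """Count occurrences of a keyword in each year's text."""
    return {year: Counter(text.split())[keyword] for year, text in corpus.items()}

# Talk of tenants clusters in the 1880 entry of this toy data.
print(keyword_counts(corpus, "tenant"))
```

The same loop-over-years shape carries over to real archives once the texts are loaded from disk; only the data-loading step changes.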
Daniel Story
Yeah. Code like a humanist.
Jo Guldi
A different set of ideas and questions.
Daniel Story
Yeah. And I think part of what is so interesting here: there’s the, if you will, nitty-gritty of figuring out how to actually do these kinds of things. But kind of circling back to some stuff we’ve already touched on, the thing that got me excited about digital history, really, was being involved in conversations. This is when I worked with Kalani Craig at IU, at the Institute for Digital Arts and Humanities there. I was a grad student at the time, but she, you know, involved me in these consultations, really, with faculty and some grad students as well, who had, like, a research question, and then we would puzzle through with them: okay, what are you really after? And, you know, is this particular digital method or that one gonna be the better fit, right? And it was this brainstorming back and forth, trying to find the right tool. And maybe sometimes it was like, no, this isn’t really a problem for a digital tool. But you know what I mean. That kind of getting into the weeds of the disciplinary questions, alongside the questions of what tools are out there that might speak to them. I feel like this is a common refrain in what you have written in recent times. I’m looking at one thing that you said, I think in the “Revolution is Here” article, where you’re looking at disciplines outside of or adjacent to history, and you say what theorizing does for disciplines near to history is underline how the concerns of traditional scholarship can remain the governing impulse for work modeling text through computational means. So this idea of sifting through the disciplinary concerns as well as the digital tool possibilities, I feel, is crucial.
Jo Guldi
Oh, it’s so crucial. It’s so crucial because you get one kind of output if you have a team full of computer scientists who are told, “do history with this data set.” They tend to produce results that historians, real research historians, look at and shrug. I mean, that was the problem with the culturomics book. They had access to all of Google Ngrams, and what did they do with it? They showed that there was censorship during the Nazi regime. Historians were like, yeah, what else could you show us? Did you find anything surprising? And the answer was, they had no idea what’s surprising, because they don’t know anything about the history field. They’ve never read a book of history, right? So if there’s going to be a digital history that merits the interest of practicing historians and also educated members of the general public, it’s got to be a field of digital history that’s capable of creating something surprising, something that historians didn’t know. And so it’s got to be able to start with ideas about what history is that came from inside of history departments. Now, computer science has a big tradition of looking to other departments and listening patiently and saying, what are you interested in? That’s how we got computational neuroscience, where we look at brain scans and compare them and then learn things about different brain injuries. That’s why computational linguistics is so robust as a discipline. They’ve been talking to each other since at least the 1960s, since Chomsky. And through every revolution, linguists and computationalists exchanged ideas about what it was that they were looking for. So a robust discipline of digital history is going to come from the same thing. It isn’t going to be that computationalists invent it whole cloth and history says, oh, thank you. It isn’t going to be that historians invent it whole cloth without the computationalists. 
There’s going to be borrowing of algorithms and writing of new algorithms to investigate new ideas. So I think that process has begun, and that’s really what I wanted to document in “The Revolution in Text Mining is Here,” which is the most recent of the articles published in the AHR: that it’s coming because there are hybrid teams, where computationalists and librarians and historians hang out on the same team, not just for a day, but for, like, two years. And they talk about what the computers can do, and they try and they try and they try, and they keep on tweaking it and working on what you called fit. Fit is a very important word. They work on the fit between the questions and the methods. And then there are theorists like Roopika Risam and Jessica Marie Johnson who take the whole enterprise of text mining a set of documents and say, oh, there are some things that are problematic about that. What is this dataset? What is its relationship to the archive? Who wrote the archive? Was it all white men? Who was left out? Could it have been subjects of color, subjects of empire, colonized people, perhaps the voices of women? Let’s theorize what’s problematic about that, and also look at what radical librarians and radical communities and indigenous communities and scholars are doing to document their own experience. Let’s look at the concerns of African American genealogists and what they do with the data, which is totally different from the historians of ideas. So you know, we’re very grateful to those theorists of historical experience for showing the blank spaces on the map, shining the flashlight on the perpetual silences that are the limits of what you can do with data. All of that is a massive advance in the field of digital history. And now we’re starting to see, you know, more generations that combine these insights: here’s what we can’t do, here’s what we can do, I’m going to take these algorithms. 
These are the algorithms that are a good fit for this question, and I’m going to ask questions about the history of political ideas, or the history of 19th century keywords, or I’m going to, like Ryan Heuser, revisit Koselleck’s thesis of the Sattelzeit and ask what computers can tell us that’s new about 18th century revolutions in democracy and empire. What are the new words? And we are actually making discoveries. Maybe 80% of the work is validation, just figuring out what the fit is. But 20% can be discovery, and discoveries that are actually meaningful and valid to other historians and to members of the public. So the discipline has been progressing, and it’s been progressing through many hands. And a lot of my work is really just to shine the light on the work of all of these different communities. Some of them are theorists, some of them are technical. A lot of them are doing what you’re doing, Daniel, and just working with the fit. What are the historical questions? What are the algorithms we have? Which ones match each other? If we change the parameters of how the algorithm is applied, do we get different results? When can we trust the results of the algorithm? When shouldn’t we trust it? That work is what allows the work of data scientists to rise to the high standards of truth in the history profession.
Daniel Story
Yeah, and I think that kind of conversation is so generative on many levels. Of course, it’s important for actually implementing analyses that are going to be robust and actually worthwhile. But in addition to that, just having the conversation can bring aspects of what it means to do history to the surface in a way that sometimes doesn’t happen when we’re just doing business as usual, as it were, right? Because historians are notorious for not wanting to talk about methodology, which may or may not be true in practice, but it’s a bit of a stereotype that there are certain aspects of the way we do things that we don’t theorize. And doing some of this kind of work causes us to delve into theorizing a bit more, which I feel is a very worthwhile thing.
Jo Guldi
Yes. I think digital history, as it exists right now, is a very methods-forward space. You have to think very critically about what the available tools are and how best to understand them. And it’s not a small investment. At the same time, the digital history community has produced a set of trustworthy techniques, and those techniques can now be applied by other practitioners who don’t necessarily put methods forward. So we’re starting to see the first generation of historical monographs that are based in a lot of text mining but don’t necessarily talk through the methods, where you don’t necessarily have to think about which algorithm they used in order to appreciate the results. So I can think of a handful of books which used text mining on the back end but don’t really talk about it. One of them that’s really fun to think about is by Niall Whelehan, a historian of the Irish diaspora across the north and south Atlantic who’s looked at Irish ideas of land ownership in Canada and Argentina. Sometimes Irish immigrants become advocates for property rights for all and for rent control, because they were doing that at home to defeat the English colonizer. Sometimes they just become another colonizer who’s taking away land from other people. But that tension is really interesting. And there are, like, women involved. And how did Niall do this beautiful book, Changing Land? I mean, I think he used Lara Putnam’s AHR article about text mining newspapers as a guidebook. When you look at his sources, it’s a pandemic-era book where he was using global newspaper archives to trace these Irish actors and where they landed, and then to think critically about their ideas. So he was doing something very methodological on the back end that never makes it into the preface or the introduction. He doesn’t talk about it. 
He doesn’t sell himself as a digital historian, but it’s very much a digital book, and I can say that because, like, I’ve been on a panel with him, and I pointed this out and celebrated the fact that Irish history is responding to Lara Putnam in this really methodologically forward way. Irish historians were delighted to hear that theirs was a digital history subject. They seemed cool with that, so I think I can tell that story and celebrate it for them. But I think you’re right. There’s this kind of reluctance. They’re like, well, we’re just doing history. Here’s our book. We want to tell you about the finding, not about the methods that we used to create it. But in a sense, they can do that. They can do that because they’re standing, we all stand, on ladders that were built by other people for us.
Daniel Story
Yeah, yeah. I think that’s a great example. So in your “The Revolution Is Here” article, you have this kind of sweeping view of some of the developments around text mining. Your article “The Algorithm: Mapping Long-Term Trends and Short-Term Change at Multiple Scales of Time” is a super interesting article, and a really great specific example, right, of puzzling through this question of fit and appropriateness of a particular text mining technique to a historical question or historical problem. Do you want to unpack a little of what you were doing in that article and the specific text mining technique you employ?
Jo Guldi
Yes, yes. So “The Algorithm” is really a methods-forward article that makes some real historical discoveries. So it makes real surprising finds about 19th century British history on the basis of just trying to pay attention to what one algorithm can and cannot do. So to follow along, I need to explain a couple of concepts. One concept is what I call distinctiveness. So there’s a group of statistical measures whose job is to tell us, if we’ve got a Venn diagram and we’ve got two overlapping circles and one of them is labeled A and one of them is labeled B, we can ask questions about these circles in various ways. We can say, what is in circle A and in circle B—what’s their intersection? We can say, what’s in circle A but not in circle B? So what’s specific, what’s distinctive, about circle A? And then we can do the opposite: what’s in circle B but not circle A? So a distinctiveness algorithm is supposed to do that for a dataset. So if I’ve got all of the words from Shakespeare and the words from Sir Francis Bacon, are there any words that Shakespeare uses that Bacon doesn’t use, and that Bacon uses and Shakespeare doesn’t use? This is a very simple statistical concept, and there’s a whole group of algorithms that you could use. They all get slightly different results, because they’re going to weight words differently. Like, what if Bacon uses it just once? Should we throw it out? Shakespeare uses it 50 times? What if Bacon uses it five times? Do we count that? So, you know, the weighting changes depending on which algorithm you use. Traditionally, in library science and in the digital humanities, we use this to compare authors, or particular plays, or particular novels. But in my work with historians, what’s clear is that we sometimes care about Shakespeare and Bacon, but what we really care about is time. We really care about time. So we care about what are the words said in the 1850s but not the 1860s or the 1840s? 
What are the words that are distinctive of this period of time? You can think about it: when historians of the future look back, they’re going to want to know, what are the kinds of things that people said when Trump was president that they never said before in American history and never said after? What are the Trump-distinctive words? And you could ask that of what’s said in Congress, or what’s said on social media. You could ask that about what Trump himself said that no other president said. That’s very key for us. So it turns out that you can take these old-fashioned algorithms that were used for the Shakespeare-but-not-Bacon words, and you can just apply them to your database of years, treating each year as a category. But nobody had ever done this before, because, you know, because digital history is new. So we started doing this. And at first we started looking at 20 year time periods. If you do this for 20 year time periods, for all of the words said in the House of Commons and in the House of Lords, in the British parliamentary debates, which are known as Hansard, and you look at Hansard for 1800 to 1820, and 1820 to 1848, and you ask, what are the words that are most distinctive, you get a lot of things that make sense. During the Napoleonic Wars, there are debates about censorship that don’t appear for the rest of the century. During the debates about abolishing the tax on grain, there are debates about free trade and grain. During the Irish land wars, there are debates about tenant property rights and rent control that don’t turn up in other parts of the century. So it makes sense. It’s validating. It makes sense that the algorithm can help you do history. So that doesn’t create any new revelations about 19th century Britain, but it would be very useful if you had something like the Trump debates and you were trying to do more recent history. 
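The article itself weighs several competing statistical measures against the full Hansard corpus; as an illustrative stand-in rather than the article’s actual method, one very simple distinctiveness score is a smoothed ratio of a word’s count inside a period to its count in all other periods. All data below is invented for the sketch:

```python
from collections import Counter

# Toy data: period -> words spoken in that period (hypothetical).
periods = {
    "1800-1820": "war censorship trade war censorship".split(),
    "1820-1840": "trade corn corn repeal trade".split(),
    "1840-1860": "famine corn tenant famine famine".split(),
}

def distinctive_words(periods, k=3, smoothing=1.0):
    """For each period, rank words by how much more frequent they are
    inside the period than in all other periods (add-one smoothing)."""
    totals = {p: Counter(words) for p, words in periods.items()}
    result = {}
    for p, counts in totals.items():
        # Pool the counts from every other period.
        other = Counter()
        for q, c in totals.items():
            if q != p:
                other.update(c)
        scores = {w: (counts[w] + smoothing) / (other[w] + smoothing)
                  for w in counts}
        result[p] = sorted(scores, key=scores.get, reverse=True)[:k]
    return result

# "repeal" marks the 1820s slice, "famine" the 1840s slice of this toy data.
print(distinctive_words(periods, k=1))
```

Swapping in a different weighting formula (log-likelihood, TF-IDF, and so on) changes only the `scores` line, which is the sense in which the measures form a family that can be compared against the same corpus.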
But what’s fascinating about these algorithms is that once you’ve done that, you could just tweak the data in all sorts of ways, and you could see what the algorithm can tell you that’s new. So you can ask about the most distinctive words for each 20-year period, but you could also ask about the most distinctive word for each year or each month of every year. And then the computer will also assign a number, a rating of how distinctive each word-period pair is. So some years are more distinctive than others. Some years you’re inventing new words from whole cloth, and some days or weeks in Parliament, you’re talking about things that have never been talked about. And it turns out that once you get down to the level of the week or the day, the words that are distinctive are very often linked to working class concerns, so that, for example, a petition of the dyers, the people who are dyeing clothing, for Parliament to help them regulate their industry, only shows up in one week. Because, you know, the dyers may be a real, legitimate working class social movement, but they don’t have a lot of money. They’re not the railway interests. They’re not going to be in Parliament whenever they want to be. It takes them a lot of energy to get into Parliament, to get Parliament to debate what they’re talking about. So that one week is pretty, pretty important. So once you have used the distinctiveness algorithm in this way, you also have, as an artifact of having ranked these time periods, a proxy measure of parliamentary attention. So you can assign how long each issue got talked about. Was this a one-week event? Was this a one-month event? Was this a one-year event? So it turns out that a really big deal, like a really powerful lobby, like the railway lobby, gets about a month or two of parliamentary time, but an issue as important as slavery gets significantly less.
A working class issue, like the regulation of sidewalks, the regulation of carriages, of hansom cabs, or the regulation of the fabric dyeing industry, only gets a week at a time, or maybe just a day. So you get a measure of how much parliamentary attention each social movement is able to get. Even the Chartists, the Chartists who are super important, the movement that gets the working class to vote, they’re able to command parliamentary attention for about a week of time. We understood Chartist politics in the past, but we weren’t able to measure Chartist relationships to parliamentary attention before, and so that’s a big deal. But also the method is applicable to any other text base in which you might be interested in power dynamics. You could have court cases, and you could ask similar questions. How frequently do Indigenous groups get their cases tried in court, given all of the barriers to getting to court? You know, that there’s a power dynamic there. But how much of the time is it, you know, the equivalent of days or weeks, or is it entire months? Who gets the attention of the judges for months at a time? That’s a question that you can ask of court records. You can ask it of news records. The answers would be different. You can ask it about the presidency. You can ask it about the congressional debates, the parliamentary debates. You can ask it about practically any text archive, so long as you have a relatively long textual archive, and then even if there are only a handful of mentions, it allows you to perform some kind of an analysis of power dynamics in the archive. So it’s a useful tool for creating a chronology if you don’t understand why one year is different or one decade is different than the other in a textual database. It can also scale. You can go down to a micro history and ask how particular months and days were different than others, and then you can use it to measure parliamentary attention.
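[The “attention” proxy Guldi describes, how many contiguous weeks an issue stays at the top of the distinctiveness rankings, can be sketched as a run-length calculation. The weekly ranking data below is invented for illustration; in practice it would come from running a distinctiveness algorithm over weekly slices of the debates.]

```python
from itertools import groupby

# Hypothetical output of a weekly distinctiveness ranking: for each week,
# the issues whose keywords topped the scores. A powerful lobby holds the
# floor for a month; a working class movement gets a single week.
weekly_top_issues = [
    {"railways"}, {"railways"}, {"railways"}, {"railways"},  # a month of railway debate
    {"chartists"},                                           # one week for the Chartists
    {"dyers"},                                               # one week for the dyers
    {"railways"}, {"railways"},                              # the railways return
]

def attention_spans(issue):
    """Lengths (in weeks) of each contiguous run in which `issue` dominates."""
    flags = [issue in week for week in weekly_top_issues]
    return [sum(1 for _ in run) for present, run in groupby(flags) if present]

print(attention_spans("railways"))   # [4, 2]: month-scale attention
print(attention_spans("chartists"))  # [1]: a single week
```

[Comparing those run lengths across issues is the power-dynamic measurement in miniature: the same archive, sliced by time, shows who can command parliamentary attention for months and who gets only a day or a week.]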
Daniel Story
That’s really fascinating, and it strikes me, please feel free to disagree with this observation, but it strikes me that your real kind of innovation here, or the substance of the work, was not inherently technical. It was, you know, it was you interrogating the different ways that you could think historically about the dataset, or the material that you had access to, the different kinds of questions you could ask. Although, I guess probably as you applied the technical aspect, it perhaps, you know, kind of helped you think about those conceptual questions differently too. Maybe it’s kind of a back and forth as you tinker.
Jo Guldi
Yeah. I mean, it’s fun to think about that question that you just raised: Is this a historical intervention, or is it a technical intervention? And I think at least two things are important about that question. One is that you know many historians who might listen to this podcast may be asking themselves, could I create an intervention like this without doing all of this work to learn code? And the answer is, importantly, yes. Because the kind of intervention that happened in this article is really an intervention in thinking about time and thinking about what it is that the algorithm has found and how it’s meaningful, given the reality of power dynamics. The kind of technical expertise that you need in order to create a methodological intervention like this is not a mastery of mathematics, or a mastery of code, or even the ability to code. It’s the ability to ask questions about an algorithm if you’re collaborating with a data scientist, and collaborating could just mean having a beer with a data scientist. You’re having a beer with a data scientist, and you say, I’m interested in algorithms that allow me to understand change over time in language over a year or a century. What are the components, what are the inputs into the algorithm, and what’s the output? And then, if I change the parameters of this algorithm, what kind of output comes out differently? So all of my data science has been in collaboration with statisticians who were the masters of the algorithms. I don’t write my own algorithms. I don’t choose my own algorithms. I have asked data scientists and statisticians the whole way through: which algorithm would you use? Now, can you explain to me what the inputs are and what the outputs are? The insight comes when you say, ah, one of the inputs is the length of time. Is it 20-year periods? A ten-year period? A five-year period? A month? A day? What happens when we vary that? What might we learn? And the insight isn’t instantaneous.
It isn’t over that beer. It’s like, the data scientist goes home, runs the algorithm, comes back with this output in which the answer is, if it’s on the level of a week, it’s the Chartists. If it’s on the level of the day, it’s the dyers. If it’s the level of the month, it’s the railway interests. And you look at that and you say, oh, that’s a power dynamic. That’s absolutely about class interests. I know what that is. That’s your training as a historian kicking in and saying, I recognize those patterns. I do pattern recognition. That’s a social force. That’s a social assortment via time. What’s the implication of that for how I understand Parliament? Well, it’s no big deal to say Parliament is run on privilege. Certain things get answers, but we could never measure it before. We could never visualize it. We could never map it on a calendar before, and now we can. How might that change the kinds of questions that we’re able to ask about the 19th century, the 20th century, or modernity as a whole? And it opens up, I think, doors that other historians will be able to walk through, explorations that other people will be able to take. I haven’t exhausted it at all. But that process, that process of pulling apart the algorithm and saying, do I tweak it again? You don’t have to be a statistician. You don’t have to have done the statistics class. You have to really patiently be able to interview a statistician and then really patiently look at the results and then really patiently ask, what if we change this component in order to do this kind of work? It can be done in a collaborative environment. Maybe it’s best done in a collaborative environment. But that’s work that I’ve theorized in another set of writings.
It’s collected in The Dangerous Art of Text Mining, but also in an earlier journal article called “Critical Search,” published in Cultural Analytics, which is a very serious DH journal. In it I tried to theorize what, essentially, we need, what historians need for high standards of the truth from the search process. Where we’re asking algorithms for things, we need to not take it for granted what the buttons on this black box do. We need to make it a white box. So in “Critical Search,” I think about what it’s like to pull apart an algorithm, and I’ve laid that out in a way that a lot of digital humanists have told me is useful. I think it’s a potentially challenging article for a history classroom. If there’s nobody with a DH background, nobody with a quantitative skill set, maybe don’t start with “Critical Search.” If you’re, you know, at an antiquarian Department of History where it’s just about narrative, maybe start with “The Algorithm” or with “The Revolution in Text Mining Is Here.” But if you have some graduate students who have some skills, or you yourself have started to think about an algorithm, or use an algorithm, “Critical Search” is an article that can help you think about what are some questions that I ask of the algorithm to get some answers. Because there are so many algorithms, and there are so many ways of doing this. It’s just an approach that originates in the desire to make use of humanists’ respect for iterativity, for polysemy, for multiple meanings, and a respect for what it is that algorithms do. It’s a coming together of those two sets of concerns to produce a new way of critical thinking that respects our historical traditions as well as the real work of quantitative algorithms.
Daniel Story
Yeah, all the wonderful things that can happen when historians and statisticians share a beer.
Jo Guldi
Indeed.
Daniel Story
Yeah. Well, I think we can wrap it there, unless you have something else that you want to add.
Jo Guldi
I think that’s great. Yeah, we have covered a lot of territory and some in great depth.
Daniel Story
Yeah, thank you, Jo, for your time.
Jo Guldi
Thank you. Thank you, Daniel, for such a great interview and for asking such good questions. It’s a pleasure to spend time with you.
Daniel Story
Thank you. Likewise.
That was my conversation with Jo Guldi on text mining, AI, and the wider landscape of digital history. You can find Guldi’s articles on these topics in the June 2022 and June 2024 issues of the AHR. Her book, “The Dangerous Art of Text Mining: A Methodology for Digital History,” was published by Cambridge University Press in 2023. History in Focus is a production of the American Historical Review, in partnership with the American Historical Association and the University Library at the University of California, Santa Cruz. This episode was produced by Syrus Jin and me, Daniel Story. You can find out more about this and other episodes at historians.org/ahr. That’s it for now. I wish you a very happy new year. See you next time.
Show Notes
In this Episode
Jo Guldi (Professor of Quantitative Methods, Emory University)
Daniel Story (Host and Producer, UC Santa Cruz)
Links
- “The Algorithm: Mapping Long-Term Trends and Short-Term Change at Multiple Scales of Time”
- “The Revolution in Text Mining for Historical Analysis is Here”
- The Dangerous Art of Text Mining: A Methodology for Digital History
Music
By Blue Dot Sessions
Production
- Produced by Daniel Story
- Transcription support by Phoebe Rettberg