For the Record

Preserving today for tomorrow

Lisa Betel

History in the Age of Abundance? How the Web Is Transforming Historical Research

Ian Milligan

McGill-Queen’s University Press

328 pages, hardcover, softcover, and ebook

It turns out that the line between what we think of as “history” and what we consider “current events” is roughly twenty years in the past. The Meech Lake Accord, for example, is firmly in the realm of history. The 9/11 attacks, by contrast, still retain some of the immediacy of current events. That means we are now entering the age when — in order to preserve the historical record — librarians, archivists, and historians will need to collect and study “born-digital” documents.

The migration from paper to electronic records has ushered in a host of complications for anyone who works with the written word, and Ian Milligan’s book is a primer on how to deal with those complications going forward. Indeed, archives have practical consequences for all of us: Functioning governments can base their policies only on the information they have, and social problems can be analyzed only if records are kept, catalogued, and retrievable. Archives shape the history we teach in schools. And if we lose (or fail to save) the garbage permits, how will we tell our grandchildren where we put the toxic waste?

Milligan, who teaches history at the University of Waterloo, identifies two overarching problems in this age of electronic record keeping: the technical differences between static paper and dynamic zeros and ones; and the sheer volume of what is now available to collect. In both cases, a new kind of ethics emerges.

Consider a website, which is really a collection of disparate elements — text and links and widgets — all assembled and viewed through a browser window. “A webpage, in the sense of a single document sitting in a box in the archives, does not exist,” Milligan writes. An individual’s WordPress site “can amount to almost 20,000 files if you wanted to replicate or ‘mirror’ it on your own system so you could have a copy of the site as it existed at a single point in time.” It’s an even more daunting task with a commercial or institutional site.

When a web page is archived, an image is taken of the main page, and then a web crawler follows and records the content of linked pages, which may change at any moment, “leading to technically fictive sites being archived.” This so-called temporal shift can be hard to detect, and it means that different users may all see slightly different versions of Wikipedia and other sites. What we are archiving today is in no way definitive.

Digital documents are something of a paradox in other ways. They are at once seemingly immortal and in danger of being lost entirely. We are all warned of the dangers, especially for young people, of sharing that questionable photograph on Instagram. The very real prospect of every (searchable) bad decision we ever made being stored in perpetuity has sparked the “right to forget” movement, along with never-ending discussions of how to safeguard our privacy. Yet the right to forget must not overshadow the right to remember: we are in real danger of losing crucial information.

In the last weeks of 2016, the Faculty of Information at the University of Toronto held a “guerrilla archiving” event to save environmental and climate change data from erasure by the incoming Trump administration. Other changes in government, such as the recent election of Jason Kenney in Alberta, have inspired similar rushes to archive politically charged information. In the past, this kind of information would have been printed and stored in depository libraries across the country.

Since 1996, many of those depository libraries have worked with the Internet Archive to systematically collect and preserve huge swaths of the web. Some of the earliest sites — Milligan considers the ones GeoCities hosted, for example — have been saved. More recently, new tools have equipped smaller institutions and even individuals to assemble targeted collections pegged to #OccupyWallStreet, #MeToo, and other trending hashtags.

But even with these tools, Milligan reminds us, records can be wiped out with a few keystrokes. A decade ago, for example, AOL announced it would shutter a twenty-year-old hosting service with some fourteen million home pages, “giving only a month’s notice to some people that their personal sites would be destroyed.” Bewildered users were faced with the surprisingly daunting task of downloading or moving their information. “People’s online presence was being rapidly destroyed — without sufficient notice, comment, or compassion.”

Nothing of those fourteen million home pages now remains. The stakes are still higher for public and private institutions. And even those of us without a personal website must reckon with the risk: increasingly, we can no longer read emails from even a decade ago, and the countless documents we carefully saved on floppy disks and DVDs might as well be lost. It’s enough to make you think twice before digitizing that family photo album.

In addition to dealing with the counterintuitive scarcity of digital records, Milligan considers the problem of abundance. Never has recorded history been so vast and the sources — from governments, organizations, and individuals — so varied. These records can both illuminate and obscure. No one is sure how big the web is, but it is too big to be saved in its entirety or to be closely read, one document at a time. So Milligan offers historians something of a handbook, showing how they might change their techniques and, perhaps, ask new questions of the record.

Step one: Run a search for all the documents relevant to your subject. We all do this all the time, of course, and we are used to getting hundreds of thousands of results for any question we pose. But increasingly we must ask what our search engines are selecting and excluding. By what criteria are they determining relevance? In other words, what are the underlying (probably biased) assumptions in the algorithm that control results and how they are presented? “With millions of results, we need to know how the search engine ranks results — and, just as importantly, historians need to be able to make sense of that method as well,” Milligan writes. He also reminds us that “the same search, run hours or months apart, might lead to very different results.”

If close (and fixed) reading is impossible, then historians must learn “distant reading,” where data mining techniques and examination of metadata reveal trends over time — trends that can form a new kind of narrative, albeit an opaque one. Intentionally or not, what’s recorded in the metadata constrains the narrative, and nuance can be lost without close reading.

Step two: Get the training. Few if any schools are currently equipping future historians with the technical skills to run database queries and understand the metadata, and Milligan advocates for a curriculum that includes such instruction. (We should also teach these skills to journalists and policy analysts.) In the absence of formal courses, Milligan suggests where scholars can develop their technical skills independently. However, considering the ever-changing web, it is doubtful his suggestions will be useful for long.

If history is getting harder to write, it’s also benefitting from many, many more first-hand perspectives. Through web pages, blog posts, and emails, today’s historian has an embarrassment of unmediated primary documents, written by people from all walks of life in real time:

The political historian can now begin to explore how everyday people engaged with political parties and the everyday process of policymaking, not just through letters to ministers and editors, but through their blogs and social media feeds, and a cultural historian can now understand say the Pokémon GO phenomenon not through New York Times reporters but from the personal accounts of players on the web and Twitter.

Which brings us to our step three: Deal ethically with material created by people who never expected their work to enter the historical record. We have had to contend with such assumptions in the past, when an estate donated a person’s diary or correspondence to an archive, for example. And within a university context, oral histories have been subject to ethical review boards, with strict guidelines to ensure that people are not harmed by their participation in research. But in this age of abundance, we have fewer mediators — and the line between public and private is blurred. There are few guidelines on how or when it is appropriate to use privately created electronic records. Does the simple fact that they are “published” on the internet make them fair game?

We are debating the issue as we go. Take, for instance, the Dalhousie dentistry students who shared racist and misogynistic posts in an invitation-only Facebook group back in 2014. Do historians have an ethical obligation to respect, in perpetuity, the privacy of adults who published their thoughts (closed-minded as they may be) in a closed community? Does such an obligation lapse at some point, so that future scholars can interpret such communications as part of larger evidence of bias toward women and people of colour? While a high school principal or university administrator may lean one way, Milligan suggests the appropriate privacy litmus test for a historian: “If the author was posting a heart-felt story on a friend’s GeoCities guest book, part of a seemingly closed social network of a few high school chums, there probably was such an expectation. This means that when engaging with individual sites, the central metric should be ‘expectation of privacy.’ ”

From now on, those who dwell in the archives will face practical and ethical problems that their predecessors simply didn’t. Ultimately, Milligan believes, emerging techniques will allow us to deal with large data sets, preserve the anonymity of individuals, and still incorporate their perspectives into the record: “We can use ‘distant reading’ to zoom our gaze away from the individual websites and to look for larger patterns within an archive.” While such an approach does not eliminate “all ethical concerns about a collection” (to say nothing of the technical ones), it can help “mitigate them to some degree.”

The future of history may well be one where “people are obscured, but . . . still read into the historical record.” Let’s just hope the record we’re compiling today is one that’s heard.

Lisa Betel holds a master’s degree in library and information science from the University of Toronto.
