Welcome to Recordkeeping Roundcasts, a series of conversations with interesting people who are doing interesting things.
In this first episode the Recordkeeping Roundtable’s Cassie Findlay is talking with John Sheridan, Digital Director at The National Archives UK, about the challenges of scale and complexity that come with digital recordkeeping. As background for this conversation, take a look at TNA’s Digital Strategy, released in May 2017.
CF: It’s terrific to be speaking to you, John Sheridan from The National Archives UK. Thank you very much for agreeing to chat with me about a number of topics of interest.
JS: It’s a real pleasure to have this opportunity to talk about all these great things that we have to think about.
CF: Yes, I was really fascinated to read The National Archives’ digital strategy that came out last year, and a number of other posts to the blog that touch on many issues that I have been interested in myself, and that the people of the Recordkeeping Roundtable, my colleagues in Australia, have also been interested in. So today what I wanted to do was dive down into some of those issues. I must admit it was quite hard to pick them, because we could have gone down any number of paths in this conversation, but perhaps I’ll start off with the first subject, which, in broad terms, is about scale and complexity. In The National Archives’ digital strategy, the idea of transfer and ingest of digital records is mentioned, and there is a commitment to standardize and streamline the process for transfer, talking about a self-service digital interface for government departments to automate the registration, transfer and ingest of new records. And there’s also some mention of changes to how digital records are described and cleverer use of metadata.
So I guess my question to you, having worked myself in the very challenging and complicated world of accepting, ingesting and preserving digital records, is this: these are fantastic goals, but are there limits to what can be achieved in this regard? And if so, where do they lie?
JS: That’s a great question. And of course, there are limits. To give you a little bit of context for how we arrived at this place, we’ve been archiving digital records for perhaps as long as 15 years now, and we’re therefore on maybe a third incarnation of the digital records infrastructure and a digital archiving system. So we’ve quite a lot of experience of what it takes to manage the kinds of complex, varied records that government is producing and that we’re needing to archive and make available, and our processes up to now have been quite, I suppose, hand-crafted. Each collection has looked very different, and each EDRM system that we might take records from has thrown up different issues: databases look very different from emails managed in an EDRM system, which look very different from a collection that’s been managed, say, on a shared drive.
And we’re trying to get from this point-to-point solution to a place where we essentially make digital transfer a commodity: a thing that we can do readily and easily and efficiently. Now, we’ve been poking at what it is about digital transfer that causes us real difficulty, and the variation in source material, in collections, and in the issues that we discover when we try and ingest digital records is certainly one of the areas that generates quite a lot of problems for us, particularly around things like consistency of metadata. So some of this is about creating a transfer system that’s much more forgiving about the information we have about the records, because we know that one of the most important risks that we can manage is the preservation risk, which depends on the understanding that we have of a digital collection, and in order to be able to effectively manage that risk, we need to ingest it and we need to be actively preserving it.
Now, that brings in questions around automatic metadata creation, but also around being much more forgiving about how accurate we need information to be and about our expectations of the information that we have. So if you take something like a creation date, you can be in a world where you absolutely require the creation date for a record to be perfectly accurate, or at least a plausibly accurate claim. That sets the bar quite high in a world where the records that you are confronted with routinely report a date that you know must be wrong. Or you can take an approach to the metadata that’s available where you accept that there will be variation in its reliability, say, about a date of creation, and you base your decision-making on that acceptance of variation in quality, rather than setting the bar so high that the data must be right for you to be able to make a perfectly right decision.
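To make the idea concrete, here is a minimal sketch of what a “forgiving” approach to a claimed creation date might look like: rather than rejecting a record whose date must be wrong, the claim is recorded alongside a coarse reliability judgement and ingest carries on. The bands, thresholds and field names here are invented for illustration, not TNA’s actual practice.

```python
from datetime import date

def assess_creation_date(claimed: date, transfer_date: date,
                         earliest_plausible: date = date(1980, 1, 1)):
    """Return the claimed date plus a coarse reliability label.

    The record is accepted for ingest either way; the judgement
    travels with the metadata instead of blocking transfer.
    """
    if claimed > transfer_date:
        # e.g. a file stamped in the future by a misconfigured clock
        reliability = "implausible"
    elif claimed < earliest_plausible:
        # e.g. the classic 1 January 1970 epoch-default timestamp
        reliability = "implausible"
    else:
        reliability = "plausible"
    return {"claimed_date": claimed.isoformat(), "reliability": reliability}

# An epoch-default date is flagged but not rejected:
record = assess_creation_date(date(1970, 1, 1), transfer_date=date(2018, 3, 1))
```

The point of the sketch is the shape of the decision: variation in quality becomes data the archive manages, not an error the transfer process must eliminate.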
And so it’s partly about having a transfer process that works in real life, it’s partly about streamlining and automating, but it’s also about, in our minds, having a more realistic expectation of the information that we’re likely to gather, particularly as we have more automated processes, and then being realistic in how we use that information for our decision-making. And it’s about, I suppose, a more general shift towards accepting, and having archival practice for, an era where the record-keeping traditions at play among record producers are just very different from the traditions and practices that were at play in the era of physical records, and accepting that the digital record is going to be much messier, it’s going to be more chaotic, and we’re going to need different expectations of the information that we have and that we use, if we are to have transfer processes that scale, rather than going back and fixing stuff up or putting an enormous burden back on record creators to produce perfect metadata that they’re never going to be able to produce.
So it’s as much, I suppose, from my perspective, this idea of building a technological capability that enables us to automate or increasingly commoditize transfer, but also this acceptance about data quality and being much more realistic about our expectations of the data that we have. In a way, that’s actually quite uncomfortable for an archive because, frankly, your catalogue gets really badly messed up. Archival institutions have to accept that the born-digital record of a 21st century government is going to be more chaotic, the metadata is going to be a lot poorer, and your catalogue is going to be much, much messier than you’re used to. And we just have to live with it.
CF: And I suppose, there’s fantastic opportunities, I think, in that chaos and complexity in the way that relationships can be drawn out between records and between agents or actors in the business, but it does stress and strain some of our existing descriptive practices, as well as our management tools. Do you have any sense that description at a higher level and letting the complexity play out and using tools to analyze and investigate records that are different to the traditional discovery tools is partly the solution, or are you looking to adapt descriptive practices down to the item within the item, within the item type of scenario?
JS: So how the archive goes about providing context, archival context, so that the user of a record can have some sense of its evidential value, based on some claims about the record creator, what the record creator was doing at the time the record was produced, and something about how the record was managed, the record-keeping system that was at play, that’s all crucial, and it’s key to how the archive, how a digital archive, goes about providing value. The approaches and the standards that we have for doing that, though, are, I would go so far as saying, badly out of date. The ISAD(G) type of cataloguing standard and its incarnation in technological form, with its interchange format, Encoded Archival Description (EAD), this whole body of how archives go about providing context is serving us really poorly in practice, and it’s because we’re taking approaches to description that worked well for physical records and applying them to digital records, and they don’t work so well anymore.
So we need to break free of that, but we still have to be able to give the user of the record an answer to the key questions: who created this record? What were they doing? How was this managed? What was this part of as a system? It’s just that the answer isn’t going to be a conceptualization against original order in a hierarchical descriptive inventory. It’s not going to look like that and it shouldn’t look like that. It’s going to be something that’s much more fluid: connected, related concepts in something that looks more like a graph-based data model than a hierarchical model, more data-oriented than textual description. And the ability that we have, as you say, to compute over the records means that, by actively computing over the digital collections that we have, we will be continually identifying new connections between records.
The other piece where there’s just the most extraordinary opportunity for digital archives is that the digital record of government becomes contemporaneous with the World Wide Web, this massive globally deployed information infrastructure, which itself is being archived. So there’s a possibility that was never there before: for a digital archive that is on the World Wide Web, and itself part of that globally deployed information infrastructure, which has been significantly archived, the archive can be contextualized with the rest of what’s known and available in the world. I’m thinking, for example, of connecting email records from 1997 about a key decision that was made inside the UK government. Now, 1997 is the same year that the BBC launches its website, and you can see the news stories that the BBC was presenting in 1997 to this day. So you can compare what a politician was saying to the BBC, or what they’re reported as saying on the Web in 1997, with the exchange they may have had about the same issue, and the briefing they may have received ahead of that interview, which would be part of the digital public record, and use one as a piece of context for the other.
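The kind of linkage John describes can be sketched very simply: a dated record can point at a contemporaneous snapshot in a web archive. The Internet Archive’s Wayback Machine resolves URLs of the form `/web/<timestamp>/<original-url>` to the capture nearest that timestamp; the record metadata below is a made-up example, not a real TNA record.

```python
def wayback_link(original_url: str, when: str) -> str:
    """Build a Wayback Machine URL for the snapshot nearest `when`.

    `when` is a timestamp prefix in YYYYMMDD (optionally down to
    YYYYMMDDhhmmss); the Wayback Machine redirects to the closest capture.
    """
    return f"https://web.archive.org/web/{when}/{original_url}"

# Context for a hypothetical 1997 email record: what the BBC's news site
# looked like at the time of the exchange.
link = wayback_link("http://news.bbc.co.uk/", "19971201")
```

The same idea generalizes via the Memento protocol (RFC 7089), which gives a standard way to ask any participating web archive for the capture of a URL closest to a given datetime.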
Now, this is a wholly new opportunity, and, again, it matters that we design our approach to providing context in a way that enables those connections, those linkages, something that’s much more fluid than the traditional approach we’ve had through the rigid hierarchical descriptive inventory. That’s the prize. The problem is that it’s a long way to go from where we are, and we don’t want to, if I use the cliche, throw the baby out with the bathwater. We’re looking to do something that realizes some of these possibilities, but still means that we retain intellectual control over our collection and that we’re able to answer the most basic questions around archival context.
We’ve said quite a lot about it in our guidelines for cataloguing born-digital records, but this is very much a work in progress, and I think it’s something that digital archives around the world are going to have to work on and be much more radical about. And we’re going to have to work hard on this over the next decade. We’ve a long, long way to go towards having the kinds of practice around archival description and context that we need. I think from the National Archives’ perspective, we are putting ourselves out there saying, “The old world will not do. This is just not going to work anymore. It’s not going to scale. It’s not going to meet users’ needs, and it doesn’t allow us to realize the possibilities that we have at this juncture.” Archival context is still key, but we have to now move on and re-imagine it, and we’re trying to do this very much by working in the open. It’s why we’re blogging and talking about some of our ideas so openly.
Next episode: We talk with John Sheridan about blockchain experimentation and the ARCHANGEL project.
About John Sheridan
John Sheridan is the Digital Director at The National Archives, with overall responsibility for the organisation’s digital services and digital archiving capability. His role is to provide strategic direction, developing the people and capability needed for The National Archives to become a disruptive digital archive.
John’s academic background is in mathematics and information technology, with a degree in Mathematics and Computer Science from the University of Southampton and a Master’s Degree in Information Technology from the University of Liverpool.
Prior to his current role, John was the Head of Legislation Services at The National Archives, where he led the team responsible for creating legislation.gov.uk, as well as overseeing the operation of the official Gazette. John recently led, as Principal Investigator, an Arts and Humanities Research Council-funded project, ‘big data for law’, exploring the application of data analytics to the statute book, which won the Halsbury Legal Award for Innovation.
John has a strong interest in the web and data standards and is a former co-chair of the W3C e-Government Interest Group. He serves on the UK Government’s Data Leaders group and Open Standards Board which sets data standards for use across government. John was an early pioneer of open data and remains active in that community.