Recordkeeping Roundcasts Episode 3: Machine learning


Robot | by Sebastianlund

In this episode, we conclude our three-part conversation with The National Archives UK’s John Sheridan, with a chat about machine learning and its possible applications in the management and use of records.

Our thanks again to John for being generous with his time and offering us so many interesting ideas. Remember to subscribe to the TNA blog for updates on the many interesting projects underway there.

You can subscribe to this podcast series on Google Play; iTunes is having a little moment with adding new podcasts but hopefully soon.

Transcript

CF: Now, we have a short period of time remaining, so I hope I can continue for another little while with the conversation, if that’s all right with you, if you’ve still got the time.

JS: Absolutely.

CF: Excellent, very good. Because this third subject is one that I think I’ve certainly noticed being discussed more and more within our profession, and indeed, there’s been some interesting experimentation happening in Australia by some former colleagues of mine. I’m talking about machine learning. So, for example, State Archives and Records New South Wales recently reported on an experiment that they ran, and I saw that the National Archives has blogged about your own experimentation. So I guess this is a pretty broad question, to find out what it is that you’ve been doing to test out machine learning and its applications for management, searching and use of records.

JS: So our primary focus is just to develop our own knowledge and understanding as an institution around machine learning approaches, capabilities, what kinds of things is this good for, what’s relatively easy for us to pick up and use. So we’ve done things like, we had inside the archive a two-day machine learning Hackathon, where running into the Hackathon, we gave primarily technical people working at the archive like a primer into the main machine learning algorithms and some of the tools you can use, some of the software libraries you can use if you want to develop a machine learning project or a machine learning-based project, and then just have people working in teams to create their own project.

And we had a really amazing variety of candidate applications, whether it was trying to solve the problem of which research guide do you put against which part of the catalogue and can you use machine learning to… And we’ve got like 500 research guides, and 30 million catalogue descriptions. We’ve never been able to work out which research guide you might want to offer a user when they’re looking at a given catalogue description, and the catalogue’s too big for us to be able to manually do that. So that could be trying machine learning as a way of doing that and that actually works pretty well.
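To make the guide-matching idea concrete, here is a minimal sketch of one possible approach. It is not the method used at The National Archives, which the conversation does not describe. It assumes a simple TF-IDF weighting with cosine similarity, and the guide titles, guide texts and function names are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical research guides (title -> text); invented for this sketch.
GUIDES = {
    "Army service records": "soldiers army regiment service records war office",
    "Wills and probate": "wills probate estates death testators courts",
    "Railway staff": "railway companies staff employment records workers",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_idf(documents):
    """Smoothed inverse document frequency for every term in the guides."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; words never seen in any guide are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest_guides(description, guides=GUIDES):
    """Return (score, guide title) pairs, best match first."""
    idf = build_idf(list(guides.values()))
    query = tfidf_vector(description, idf)
    scored = [(cosine(query, tfidf_vector(text, idf)), title)
              for title, text in guides.items()]
    return sorted(scored, reverse=True)

ranking = suggest_guides("Service record of a soldier in the Royal Army regiment")
for score, title in ranking:
    print(f"{score:.2f}  {title}")
```

The scores here are similarities rather than true probabilities. A system working across hundreds of guides and millions of descriptions would need a much richer model and proper evaluation, but the ranking step would have a similar shape.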

CF: Okay, so potentially you could have a bot that popped up and said, “Oh, I see you’re interested in records of X, Y, Z. Here is a guide.” Would that be one potential use of that?

JS: Right, exactly right. With some kind of probability. So that turned out to be just a quick experiment, but that looks a viable area for us to explore. On the digital preservation side, we’re really interested in whether we could use machine learning to help us characterize digital objects that we can’t characterize using file format signature identification. Typically, this would be things like computer programs, they all report as text files, and signature identification is not a good way for being able to distinguish between whether you’ve got computer software that’s written in Java or C# or Python, say. And we’re particularly alive to the fact that in government, we see people increasingly producing things that have all of the characteristics of records that are computer programs or mixtures of things that are computer programs and textual content.

So, there’s a thing called R Markdown, which statisticians are using increasingly inside government, where they’re mixing Markdown textual content with statistical algorithms in R in single assets. These things sure have many of the characteristics of records that we think are likely to be appraised and selected, so that’s a mixture of textual content and computer code. Now, signature identification is not good at identifying that kind of content. So could we use machine learning and a training set to start identifying some of these object types that signature identification doesn’t work for? And it turns out, yes, actually that works pretty well if you have a good training set. So we learnt that.

We’ve also tried machine learning, and this is where I think the strongest interest is for decision-making about selection. Now, we know that there’s a real challenge for identifying records from non-records. The classic challenge, of course, is around email, and we know that machine learning is pretty good at solving classification problems, so you can do things like, “Is this personal email? Is this business email?” And you can set up a machine learning-based approach for giving you a probabilistic insight into, “That’s almost certainly personal email, that’s almost certainly business email, that may be some kind of combination and I’m only 60% sure.”
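As an illustration of that kind of probabilistic classification, the sketch below trains a tiny naive Bayes model on a handful of invented emails and sends anything it is not confident about to a person. This is an assumption-laden toy, not the approach used at The National Archives: the training texts, the function names and the 0.9 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny invented training set; a real one would need many labelled emails.
TRAINING = [
    ("business", "please review the attached contract before the board meeting"),
    ("business", "minutes of the policy meeting and budget report attached"),
    ("personal", "are you free for lunch on saturday with the kids"),
    ("personal", "happy birthday see you at the party on saturday"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per label; returns (word_counts, label_counts, vocab)."""
    word_counts = {}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict_proba(text, model):
    """Multinomial naive Bayes with add-one smoothing; returns label -> probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}

def triage(text, model, threshold=0.9):
    """Return the label if the model is confident, otherwise flag for a person."""
    probs = predict_proba(text, model)
    label, prob = max(probs.items(), key=lambda item: item[1])
    return (label if prob >= threshold else "needs human review"), prob

model = train(TRAINING)
print(triage("budget report for the board meeting", model))
print(triage("lunch meeting", model))
```

With this toy data the first message comes out as confidently business, while "lunch meeting" mixes vocabulary from both classes and falls below the threshold, so it is flagged for review. The threshold is a policy choice, not something the model decides.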

We are, I think, reasonably confident that the future digital record of government is going to be selected through machine learning, and that that is the future for digital archives from here on in. There are some problems that current machine learning capability is well disposed to being able to address in that context of using machine learning for record selection, and business/personal email is a classic case in point. There are some things which are super hard. If the decision-making relies on knowledge of the wider context, then it’s really difficult to see, without having some kind of model for that context, how a machine learning system is going to be able to be helpful. So we’re really keen to try and understand some of those parameters and then get a sense for where do we need to lean on I suppose more traditional forms of knowledge engineering alongside some of these newer machine learning-based approaches. And that’s very much our focus, but initially, it’s very much about just building our own appreciation and know-how as an organization so that we can experiment and ask and answer some of these kinds of questions.

CF: Yes, the questions around the use of machine learning to make significant decisions affecting government services, or individuals that are happening in the wider community. I guess archives need to start to think about and take ownership of what implications are there for using these technologies in terms of things like trust and things like perceptions of bias. So transparency, I suppose, is partly the solution to that. Do you have any thoughts on those kinds of wider societal perceptions of an archive that uses machine learning to identify and manage records of its societal memory?

JS: So there’s two elements to this. In terms of the possibility, then machine learning provides the only really scalable way, I think, that we’re going to arrive at decision-making for selection of the record that needs to be permanently preserved from the information that can be discarded or left behind. And typically, government departments in the UK, and we see this being a pattern elsewhere, have a heap, a digital heap, from which the long-term public record needs to somehow be extracted, permanently preserved, and made available. And when you’re talking about tens of terabytes or maybe even hundreds of terabytes of heterogeneous digital records, e-mails, and Word documents and all sorts of other things, the only way that’s going to get done is through machine learning.

It may even be that a machine learning approach to selection is more consistent, more reliable, and can be more transparent than the historical approach to how we’ve gone about it, which has been human decision-making. But we are going to run into some hard limitations, where machine learning is good for classification problems, but it’s bad for having an appreciation of contextual understanding, and whether we can find ways of training systems that can give us some proxies for contextual understanding is stuff that we need to explore and research. On the flip side, we can also see that machine learning systems are going to be increasingly widely deployed by government agencies and that we can imagine deep networks themselves becoming public records, so we need to figure out how on earth might we appraise and just select a deep network and preserve it and make it available.

So we have that element too, and again, the only way that we even begin to get an appreciation for what that means is by rolling our sleeves up, developing our understanding of the technology, trying it out, learning and asking these questions and trying to answer them. We’re clearly too small to do this just on our own. It’s something where I think archives around the world need to be developing all of our knowledge and our understanding and capability and having these conversations if we’re to really be able to build the body of practice that we clearly are going to need.

CF: Yes. Well, it’s such an interesting area, and I’m sure when you attend iPRES later in the year, you’ll probably run into some conversations on the experimentation that’s happening in other sites as well, and hopefully that can bring about some more partnerships and collaboration in this space, because it’s such an important one.

JS: I hope so.

CF: I think, perhaps, I need to draw this fascinating conversation to a close. And so I want to say thank you so much for giving your time today and so thoughtfully addressing all of those topics, it’s very much appreciated. So thank you so much, John Sheridan from the National Archives UK.

JS: Thank you very much.

About John Sheridan

John Sheridan is the Digital Director at The National Archives, with overall responsibility for the organisation’s digital services and digital archiving capability. His role is to provide strategic direction, developing the people and capability needed for The National Archives to become a disruptive digital archive.

John’s academic background is in mathematics and information technology, with a degree in Mathematics and Computer Science from the University of Southampton and a Master’s Degree in Information Technology from the University of Liverpool.

Prior to his current role, John was the Head of Legislation Services at The National Archives where he led the team responsible for creating legislation.gov.uk, as well as overseeing the operation of the official Gazette. John recently led, as Principal Investigator, an Arts and Humanities Research Council funded project, ‘big data for law’, exploring the application of data analytics to the statute book, winning the Halsbury Legal Award for Innovation.

John has a strong interest in the web and data standards and is a former co-chair of the W3C e-Government Interest Group. He serves on the UK Government’s Data Leaders group and Open Standards Board which sets data standards for use across government. John was an early pioneer of open data and remains active in that community.

About Cassie Findlay

Digital recordkeeping, archives and privacy professional, co-founder of the Recordkeeping Roundtable. @CassPF on Twitter.