Transcribing 3 million scans at the National Archives of the Netherlands

Some Transkribus projects are just a few pages long. Many are a few hundred or thousand pages long. But the latest Transkribus project at the National Archives of the Netherlands involved a whopping 3 million pages of documents. And this is just the beginning. Over the next few years, the Dutch archive aims to scan about 10% of its entire collection—that’s more than 10 million scans a year—and transcribe at least a part of the collection to make it more accessible.

We spoke to Liesbeth Keijser, Project Manager for Digitisation at the National Archives of the Netherlands, to discover more about digitising such large collections of documents with Transkribus.

Welcome to the National Archives of the Netherlands

Based in the Dutch coastal city of The Hague, the National Archives of the Netherlands is the country’s largest archive. It is home to hundreds of years of governmental and official documents, as well as private documents relevant to the history of the Netherlands. Millions of pages are looked after at the archive. In fact, the collection is so large that if you were to line it all up in a row, it would stretch for over 140 km!

*The National Archives of the Netherlands is the country’s largest archive © Tineke Dijkstra*

However, most of the archive’s documents are still paper-based, which makes accessing them difficult in two ways. Firstly, you have to travel to The Hague to browse the archive. Secondly, and probably more importantly, there is no way to quickly search whole collections for specific information. Instead of simply typing a search term into a database, you have to search through collections of papers by hand, which is far more time-consuming.

With that in mind, the National Archives embarked on an ambitious digitisation strategy. “Our plan is to scan 10% of our archives over the next 15 years,” digitisation manager Liesbeth explained. “That will add up to more than 100 million scans in a couple of years.”

To make the scans more accessible, the archive is using handwritten text recognition technology to automatically transcribe the handwriting and convert it into digital text. They decided to start with a collection of 3 million pages, mainly records regarding the Dutch East India Company in the 17th and 18th centuries and notarial deeds from the 19th century. This first project would lay the groundwork for later parts of the digitisation strategy.

Creating an AI model with Transkribus

The National Archives started working with handwriting recognition technology about five years ago and the team have been pleasantly surprised by how easy it is. “Using Transkribus and creating a custom AI model was actually quite straightforward,” Liesbeth said. “At the start, we were aiming for a CER [character error rate] of 20%; we would have been happy with that. But after creating 6000 pages of training data, we got down to a CER of 7%, which was even better for us.”
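
For readers unfamiliar with the metric, the character error rate mentioned above is commonly defined as the number of character-level edits (insertions, deletions and substitutions) needed to turn the automatic transcription into the correct one, divided by the length of the correct one. The following is a minimal Python sketch of that common definition, not a description of how Transkribus itself computes the figure. The function names and the sample strings are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text.
    Assumes the common definition of CER; tools may differ in detail."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a hand-checked line versus a model's output.
    reference = "Compagnie"
    hypothesis = "Compagnle"
    print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

Under this definition, a CER of 7% means that roughly 7 in every 100 characters of the automatic transcription would need correcting.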

In keeping with Transkribus’ cooperative values, Liesbeth’s team also decided to make their AI model public, so that other people can benefit from their work. Their model, Dutch Handwriting 17th-19th century, now contains almost 1.5 million words and can be used by any Transkribus user working with similar documents.

Publishing the transcriptions

For Liesbeth and her team, the transcription was actually the less complicated step of the project. “Transcribing everything was the easy part,” she explained. “Publishing everything online was a lot more complex, both from an archival and a technical perspective.” Deciding how to organise everything into a logical online format was one challenge; finding people with the right development skills to build exactly what the archive needed was another.

*Over 3 million pages were automatically transcribed during the project, which ran from 2020 to 2021. © Zoeken in transcripties*

After considering different solutions, the team decided to have two suppliers build a custom system, divided into a back end and a separate front end. The result was the “Zoeken in transcripties” (“Search in transcriptions”) platform. Although the project is still ongoing, the platform already provides access to a wealth of documents, making it much easier for researchers and other interested people to find the information they need. The team also added named entity recognition to the system, so that it automatically enriches the transcriptions with named entities such as people and places.

“Ideally, we would have a platform that integrates seamlessly with our existing IT infrastructure. That isn’t quite possible yet, but we are still pretty happy with the results so far.”

The benefits of digitisation

And it is not just Liesbeth’s team that is happy with the new digitised collection. “We’re still collating exact data about user satisfaction, but our impression is that people like the new system.”

“A good example of this was the bittersweet feedback we got from some academic researchers. They really liked that so many documents were suddenly so easily accessible. But because they suddenly had so many new sources to work with, they realised they had to scrap their current conclusions and start again. I think this shows just how much impact a digitisation project like this can have on academic research.”

Thank you, Liesbeth, for talking to us!

Liesbeth’s Transkribus Tip:

“When embarking on a project like this, make sure there is someone in your team who has a background in AI. It is hard to compare different technologies if you don’t understand the differences between them, so make sure the team has that knowledge before you start.”