Transcribing historical documents is vital yet often painstaking work, particularly if you’re dealing with English secretary hand. This script, widely used in England from the 16th to early 17th centuries, appears in countless documents from the era. However, English secretary hand is notoriously difficult to read, with its intricate and elaborate style leaving many historians and researchers struggling to decipher it. As a result, millions of pages of documents written in this script remain inaccessible, locked away in archives with their secrets still untold.
Fortunately, recent advancements in AI and machine learning offer promising solutions for unlocking such documents. Emily Kadens, a lawyer and historian at Northwestern University, has embarked on a mission to harness this technology, decode the complicated script, and make these historical documents accessible to all. We sat down with Emily to learn more about her journey and the development of the “Egerton” model for English secretary hand.
English secretary hand is one of the trickiest hands in the English language. © National Archives, London
The challenge of English secretary hand
Emily’s expertise lies at the intersection of law and history, with a focus on early modern legal documents, many of which were written in English secretary hand. Among them are the records of English equity courts, which, unlike the regular royal common law courts, used a fully written, English-language procedure.
“These records are incredibly rich sources of information about all aspects of English life, including commerce, agriculture, property, family and marriage, [and] crime,” Emily explained. Over the years, she has manually transcribed over a thousand pages of such documents—a laborious and time-consuming task given the script’s complexity.
Determined to make these legal documents more accessible, Emily discovered Transkribus, a platform for transcribing historical documents using AI. “I was thinking about making the 1000+ pages of equity court case files that I had manually transcribed over the years [...] available online. As I looked for a way to do that, I learned about Transkribus and realised that I could use my manual transcriptions as Ground Truth to [train] a model.”
However, due to the known complexity of English secretary hand, not everyone was so convinced by Emily’s plan. “A couple of archivists from major libraries holding large quantities of secretary hand documents assured me it was not possible to create a secretary hand model,” Emily said. “But Transkribus created a workable model very quickly.”
The Egerton model now contains thousands of transcribed documents. © National Archives, London via Transkribus
Building the Egerton model
The starting point for that model was a selection of Emily’s manually transcribed documents. Together with a dedicated team of three or four researchers, Emily embarked on the meticulous task of correcting these transcriptions. After accumulating around 50,000 words, they trained their first model, aptly named “Egerton” after Thomas Egerton, the Lord Chancellor who presided over many of the equity court cases reported in the documents.
“We then started picking images [...] of documents, mostly from the equity courts, that were good enough [...] for Transkribus to read, and running them through the model. [We] put every new transcription through three checks before marking it as Ground Truth.”
As the team gathers more Ground Truth data — much of it from the National Archives in London — they continually retrain the model. The current version of the Egerton model boasts an impressive 750,000 words and a character error rate (CER) of just 3%. “This fall, we plan to begin a sustained project transcribing one term of depositions from the English Court of Chancery from 1597. That will add about 550 more pages to the model.”
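For readers unfamiliar with the metric, CER is commonly defined as the number of character-level edits (insertions, deletions, substitutions) needed to turn the model's output into the Ground Truth, divided by the length of the Ground Truth. The sketch below illustrates that common definition in Python. It is not taken from Transkribus, and the function names and the sample strings are hypothetical, so the platform's own calculation may differ in details such as whitespace handling.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference (Ground Truth)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in an 18-character line.
ground_truth = "the said defendant"
model_output = "the sayd defendant"
print(f"CER: {character_error_rate(ground_truth, model_output):.1%}")
```

Under this definition, a CER of 3% means roughly three character corrections for every hundred characters of Ground Truth.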
Emily’s team had to work out a system for transcribing the many different abbreviations and symbols. © National Archives, London
The art of standardisation
But training a model for English secretary hand wasn’t without its difficulties. One of the biggest hurdles was the lack of standardisation. Unlike other official scripts, secretary hand varied significantly from scribe to scribe. “Many scribes wrote very idiosyncratic forms of secretary hand that to a lay eye seem nothing alike. The extent to which [Egerton] can recognise really different-looking hands is completely amazing.”
“It has also been fascinating to develop conventions for dealing with the various, often non-standard abbreviations we have encountered,” Emily added. “Because we are prioritising readability in our transcriptions, we have trained Egerton to silently expand almost all abbreviations, but some abbreviations represent more than one word, so we had to figure out how to deal with that.”
“We now have a 17-page conventions document to explain to transcribers how to handle various situations with abbreviations, capitals, punctuation, the Unicodes and Transkribus tags we use, and the occasional abbreviated Latin words.”
Ensuring the layout is recognised correctly helps achieve more accurate transcriptions. © National Archives, London via Transkribus
Built on words, not characters
As Transkribus deciphers texts word by word, instead of character by character like conventional OCR systems, Emily’s team also had to think more broadly than they used to about Transkribus’ capabilities. “It [...] took us a while to fully understand that Transkribus is not doing OCR and what that means for what we can ask the technology to do.”
“For instance, [in secretary hand] I/J and U/V were written in the same way. At first, we thought we had to be consistent in our transcriptions and pick either I or J and either U or V.” But because Transkribus learns what words look like, rather than individual characters, Emily’s team realised that the Egerton model could be trained to recognise when a word contained an I instead of a J and a U instead of a V. “So now we use modern spelling for words with capital I, J, U, and V even though the I/J and U/V letter forms are the same.”
A work in progress
Currently, the Egerton model remains a private tool as Emily and her team continue to refine it. Their focus is now on incorporating data from deposition hands, which are some of the most difficult in terms of readability. “We are trying to put large enough samples of the various bad deposition hands into our Ground Truth to achieve better than a 7% CER for even the most difficult hands. So far, we are making surprisingly good progress on this.”
Ultimately, though, the team wants to make the model public and is aiming to do this once they hit a million words of training data. “[This] seemed like a huge amount two years ago, but [it] is now a milestone we should hit in early 2025.”
Emily’s team are continually adding new documents to the model. © National Archives, London
Lessons learnt from model training
Reflecting on her journey, Emily notes that one of the most valuable lessons she learned was the importance of thinking through transcription conventions from the start. “We have had to rethink earlier decisions and [...] spend a lot of time going back and changing Ground Truth to reflect new choices,” Emily explained. “Things are going to come up as you build, but to the extent that you have thought a lot of it through before you start, you will save time fixing things later.”
Another key lesson is that training models is not a linear process. “Your character error rate and the quality of the transcriptions will not always seem to be improving,” Emily said. “Egerton took a giant step forward at 300,000 words, and saw another massive improvement at 600,000 words. But not every new model run brings such noticeable change. So don’t get frustrated.”
The historical documents have also thrown up some surprises outside of the legal field. © Emily Kadens via X
From scepticism to success
The Egerton model stands as a testament to the potential of AI in historical research, challenging long-held assumptions about complex scripts and creating new possibilities for historians working with documents from the early modern period.
Despite the Egerton model still being in development, Emily is willing to share her work with others in the field. “I am happy to make the model available privately now to anyone working on English secretary hand material,” she said. If you are interested in using Egerton, Emily invites you to reach out to her by email or follow her updates on X (formerly Twitter) as the model continues to evolve.
Thanks, Emily, for sharing your experiences with us!
Interested in using Transkribus for your research?
Transkribus has been used for research projects around the world, helping to create searchable, digital versions of paper resources. We have several subscription plans aimed at researchers, from our solo 'Scholar' plan to the 'Team' plan, which is ideal for small research groups. You can view all the available subscriptions on our Plans and pricing page.