Training a bilingual Irish-English model in Transkribus using An Gaodhal

What happens to the historical texts of a living language once the script used to write it is abandoned for another? Does a script even have a future, if people no longer learn how to read it?

There are many ways to approach these problems, including education and transliteration, which converts text from one script to another rather than translating it from one language to another. Where a language is endangered, the issue takes on a particular urgency.

Machine-reading techniques such as OCR are a good option for making texts accessible. But what if the texts you want to recognise appear in bilingual contexts? Or the spelling pre-dates standardisation of the language? What if texts feature contributions from writers who were unaccustomed to writing or who had only just learned to write themselves?

A collaboration between New York University and the University of Galway is exploring these questions and more in the An Gaodhal project. One of its many aims is to create the first publicly available OCR model for bilingual texts printed in Irish and English. This new model is doubly innovative as it is also the first to combine multilingual and multiscript functionality in a single OCR model. We spoke to team members Oksana Dereza, Deirdre Ní Chonghaile, and Nicholas Wolf to find out more.

The history of Cló Gaelach

In Ireland, the Irish language, or Gaeilge, was once written in a script known as Cló Gaelach. You can still see it on signs, shopfronts, monuments, and headstones throughout the country. However, during a period of standardisation and modernisation in the 1960s, Cló Gaelach was replaced by Roman letters and the script was no longer taught in schools. "It is remarkable how quickly a script can fade from public memory," says Deirdre. "Still, Cló Gaelach has a strong nostalgic appeal in Ireland — and in its significant diasporic community."

*Office of Public Works notice using Cló Gaelach.*

The Irish diaspora is central to the An Gaodhal project. In the late 19th century, large numbers of people moved from Ireland to the U.S., and many of them were native speakers of Irish. By the 1890s, 40% of the world's Irish speakers lived outside Ireland. Given the number of speakers in major cities such as New York, it is hardly surprising that the world's first Irish-language newspaper was produced not in Ireland, but in Brooklyn, New York, by Michael J. Logan.

*"An Bratach Ghealréaltach" — Translation of "The Star-Spangled Banner" by Fr. Eoghan Ó Gramhnaigh, An Gaodhal 13, no. 1 (Sept 1898): 5.*

A bilingual newspaper for a diasporic community

Logan’s bilingual Irish-English newspaper was called An Gaodhal. It ran monthly from October 1881 to December 1898 and had a readership of around 3,000 people throughout the USA and Ireland. The newspaper contained a variety of texts — such as articles, ads, subscriber names, folklore, poetry, and songs — typically arranged in a unique format with the Irish text and the English translation side by side. This format not only helped Irish speakers to maintain their language skills, it also enabled many bilingual speakers who were literate only in English to learn how to read Irish for the first time.

*"The Gaelic Alphabet" frequently reprinted in An Gaodhal.*

The only complete set of An Gaodhal in existence, housed at the University of Galway Library, had already been scanned and published online. However, the text had not yet been extracted from those images, making it impossible to search the collection or perform data analysis. It was this problem that the project team wanted to resolve, by extracting accurate text from all 2,298 pages and creating a digital and searchable corpus. "381 pages feature Irish mostly, 896 English mostly, and 1,019 both languages together," Deirdre said. "Also, the corpus reflects the three major dialects of Irish so, however small its 1.86 million tokens may seem, it presents a welcome diversity in the prospective training data."

Choosing Transkribus

When the project began in January 2023, there were no publicly available OCR models suitable for Cló Gaelach and the pre-standardised spelling of the Irish language, in either monolingual or multilingual contexts. The only related project in existence was a Cló Gaelach training dataset for the Tesseract software, published by Scannell et al. (2020).

As Nicholas Wolf, the principal investigator of the An Gaodhal project, explained: "We knew we would need to develop a bilingual model from scratch, as none were available to us. We decided to train an Irish-only model and then use that model to train a bilingual Irish-English model. Transkribus seemed well placed to serve our needs and it has worked well so far!"

Deirdre added: "In our two models, the selected Unicode characters do not replicate exactly the design of Cló Gaelach (as the fonts provided by Gaelchló do, see https://www.gaelchlo.com/); rather, in deference to long-standing practice, Roman type characters, including those with diacritics, were chosen, thus ensuring interoperability between this dataset and others (see http://corpas.ria.ie/)."

A simple solution to training a bilingual model

The project team chose a simple, yet effective, workflow for the model training process. "We first ran a preliminary OCR process using Amazon Textract across all 2,298 pages of the newspaper," Nicholas explained. "This identified text regions containing English-language content, enabling us to isolate the Irish-language content. We then masked English-language text regions to produce page images with Irish-language text only, which would facilitate training a model for Cló Gaelach."
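The masking step Nicholas describes can be pictured with a minimal sketch. This is not the project's actual pipeline: the page is modelled as a plain grid of grayscale values, the `language` label on each region is a hypothetical field, and the box format (`Left`, `Top`, `Width`, `Height` as ratios of the page size) is an assumption modelled on the way OCR services such as Textract commonly report bounding boxes.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels: 0 = black, 255 = white


def box_to_pixels(box: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a ratio-based box (values in 0..1) to pixel bounds (left, top, right, bottom)."""
    left = int(box["Left"] * width)
    top = int(box["Top"] * height)
    right = min(width, int((box["Left"] + box["Width"]) * width))
    bottom = min(height, int((box["Top"] + box["Height"]) * height))
    return left, top, right, bottom


def mask_regions(image: Image, regions: List[dict], language: str = "en", fill: int = 255) -> Image:
    """Return a copy of the image with every region of the given language painted over."""
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]  # copy so the original page stays untouched
    for region in regions:
        if region["language"] != language:  # hypothetical label, assumed to be known per region
            continue
        left, top, right, bottom = box_to_pixels(region["box"], width, height)
        for y in range(top, bottom):
            for x in range(left, right):
                masked[y][x] = fill
    return masked


# Toy 4x4 "page" with Irish text on the left half and English on the right half.
page = [[0] * 4 for _ in range(4)]
regions = [
    {"language": "ga", "box": {"Left": 0.0, "Top": 0.0, "Width": 0.5, "Height": 1.0}},
    {"language": "en", "box": {"Left": 0.5, "Top": 0.0, "Width": 0.5, "Height": 1.0}},
]
irish_only = mask_regions(page, regions)
```

On a real scan the same idea would be applied with an imaging library rather than nested lists, but the logic stays the same: English regions are blanked out so that only Irish-language lines remain for transcription and training.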

After that, the team set about transcribing the Irish-language content on 60 pages of the newspaper entirely by hand. As there is no fully integrated keyboard for Cló Gaelach available, the team customised the virtual keyboard embedded in Transkribus so that the required Unicode characters could be entered. A preliminary model, Version 1, was then trained on those 60 pages of Ground Truth; with 18,533 training tokens, it achieved a character error rate (CER) of under 1%. Version 2, which had 164,015 training tokens, is now publicly available under the name of An Gaodhal Irish (Gaeilge) Monolingual Model.
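For readers unfamiliar with the metric, CER is commonly defined as the number of character edits (insertions, deletions, and substitutions) needed to turn the model output into the reference transcription, divided by the length of the reference. The sketch below follows that common definition; the exact calculation Transkribus uses may differ in details such as text normalisation, and the sample strings are illustrative only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings: the reference has a dotted d (U+1E0B), the output a plain d.
reference = "An Gao\u1e0bal"
hypothesis = "An Gaodal"
cer = character_error_rate(reference, hypothesis)  # one substitution over nine characters
```

A CER below 1% therefore means fewer than one character edit per hundred characters of reference text.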

The team is currently working on the bilingual model.

Lessons learned from layout deviations

Some layout elements of the newspaper — including small tables appearing between text regions, varying line directionality, and curved or acrostic texts — needed the layout analysis to be performed by hand. Given the limitations of the project’s resources and the limited number of such elements, the team decided against training workable baseline, table, or field models. Instead, all layout elements were either automatically generated using the default baseline recognition settings or manually applied and then fully reviewed by a member of the project team before any text recognition was performed.

*Layout variables — Different table layouts and mix of orthographies in page sections and individual lines.*

Along the way, the team learnt a couple of shortcuts for effective manual layout analysis. “It’s important to ensure line polygons capture all the diacritics and punctuation,” Nicholas told us. “This reduces the need for manual correction afterwards.”

“Within the confines of our project resources, we limited the tagging of text regions to page numbers, paragraphs, and marginalia,” he went on to explain. “Where required, tags for ‘gaps’ or for words that were ‘supplied’ or ‘unclear’ were applied. If you go on to train a model, though, don’t forget that lines featuring the ‘unclear’ tag are excluded from the Ground Truth.”
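The effect of that exclusion rule on the size of a training set can be estimated before training. The sketch below is a hypothetical illustration: the `Line` structure, its `tags` field, and the sample lines are assumptions made for the example, not the Transkribus data model or export format.

```python
from typing import List, Set, TypedDict


class Line(TypedDict):
    text: str
    tags: Set[str]  # hypothetical: names of tags applied anywhere in the line


def ground_truth_lines(lines: List[Line], excluded_tag: str = "unclear") -> List[Line]:
    """Keep only the lines that carry no occurrence of the excluded tag."""
    return [line for line in lines if excluded_tag not in line["tags"]]


# Placeholder lines: one untagged, one with a 'supplied' word, one with an 'unclear' word.
lines: List[Line] = [
    {"text": "first sample line", "tags": set()},
    {"text": "second sample line", "tags": {"supplied"}},
    {"text": "third sample line", "tags": {"unclear"}},
]
kept = ground_truth_lines(lines)
dropped = len(lines) - len(kept)
```

Counting the dropped lines in this way shows how much transcribed text a liberal use of the ‘unclear’ tag would remove from training.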

What the future holds

The project team has already published its Irish model and is currently completing its bilingual Irish-English model. The team is making all of its outputs publicly available, including the manually corrected full text (ALTO XML); a BART-based bilingual OCR post-correction model and the dataset with which it was trained; and a paper published in the LT4HALA @ LREC-COLING 2024 proceedings.

The team's expert on computational linguistics, Oksana Dereza, has developed an OCR post-correction model for historical bilingual Irish-English data and is currently working on Named Entity Recognition (NER) for historical Irish. "There have been many developments in the provision of digital tools for the Irish language," Oksana said. "And we are delighted that the An Gaodhal project is at the forefront of that wave of innovation. We hope that other under-resourced language communities will take inspiration from this project, especially its approach to multilingualism."

For more on this project, you can listen to this 7-minute report on Irish radio, and follow #AnGaodhal for updates.

Our Transkribus Tip

“When you get stuck, ask for help, either from the Transkribus team or from its user community, many of whom share their experiences in real time. Such consideration and generosity meant a lot to the project team, which was working remotely. Long live this ethos of open scholarship!”

Funding

The project is funded by the Robert D. L. Gardiner Foundation, the Irish Institute of New York, Glucksman Ireland House at New York University, and the University of Galway.