Working with historical documents written in Ottoman Turkish has long been a difficult task. Not only is it considered a 'dead' script today, but from accessibility to legibility, there are certainly challenges for those trying to research Ottoman Turkish texts. However, the Digital Ottoman Corpora team has taken on these challenges of transcribing, digitising, and analysing collections of Ottoman Turkish print to make Ottoman Turkish sources more accessible for research. We spoke to Suphan Kirmizialtin, who told us more about the Digital Ottoman Corpora project and how they even published the first text recognition model of Ottoman Turkish print.
The Digital Ottoman Corpora Project
Suphan Kirmizialtin, a historian specialising in the Ottoman Empire and researching the intersection of gender and modernisation in the Middle East, is leading the Digital Ottoman Corpora team in this project. Within the last few years, Kirmizialtin’s work shifted to digital crowdsourcing projects in cultural heritage as well as text recognition and textual analysis of historical archives. This brings us to today's project of the Digital Ottoman Corpora, where we talk about preserving and understanding Ottoman history through technology.
Accessing historical sources is crucial for gaining a deeper understanding of the past. The question of accessibility becomes rather complex with Ottoman Turkish archives, due to the replacement of this language with a latinised and Turkified version in the late 1920s. Today, Ottoman Turkish, now considered a 'dead' script, is accessible only to specialists, unless the documents are transcribed into 'Modern Turkish'. With this in mind, the Digital Ottoman Corpora set the following goals for their project:
- Improve the accessibility of the Ottoman Turkish historical archives to both researchers and the public
- Help to create a digital research infrastructure for Ottoman Turkish.
Transforming handwritten or printed historical documents into transcribed and digital format enables researchers to easily search, analyse, and extract valuable information from large volumes of historical texts. “Overall, I can say that the Digital Ottoman Corpora project is part of a larger effort to use computational methods to disseminate and analyse historical texts. By making these important resources more widely available, we hope to contribute to a deeper understanding of the past and its relevance to contemporary society.”, explains Suphan Kirmizialtin. When combined with textual analysis, transcribed and digitised texts can reveal patterns, trends, and discussions that enhance our understanding of historical events, social dynamics, and cultural phenomena.
Challenges of Ottoman Turkish Script
For researchers and scholars looking for valuable information within historical documents, Ottoman Turkish, a literary language written in Arabic script and deeply influenced by Arabic and Persian, is intriguing but intricate. In 1928, it was replaced by 'Modern Turkish', transforming the blended writing, vocabulary, and grammatical structures into a Latin script. This transition also replaced Persian and Arabic loan words with Turkish equivalents, shifting text directionality and writing systems.
Original Ottoman Turkish documents were written and read from right-to-left, but transcriptions in Modern Turkish follow a left-to-right format. Beyond the directionality, the writing system switch poses a significant challenge for historians transcribing Ottoman Turkish. Compared to Modern Turkish, Ottoman Turkish's writing system is complex; word pronunciation isn't explicitly stated, but merely implied.
Because the transcription process relies more on educated decisions about how to solve this linguistic puzzle, Kirmizialtin describes transcribing Ottoman Turkish as more of an art form than an academic exercise. Faced with all these challenges that the language and therefore the historical documents present, the Digital Ottoman Corpora set out to transform research on Ottoman Turkish texts, and to make these types of materials more accessible and available to a wider audience.
Unlocking Ottoman Turkish Historical Archives
Without digital access, without a document catalog or structure, looking for valuable information within archives and documents almost becomes a Sisyphean task. Having spent most of the research time manually going through each document to identify relevant research material for her dissertation research, Suphan Kirmizialtin was just too familiar with how time-consuming and inefficient analysing and transcribing historical documents can be.
Curious to see how this process can be made more time-efficient, Kirmizialtin started looking for options on how to improve this. Discovering Transkribus, Kirmizialtin recognized that it not only has the potential to save time for individual researchers, but also offers the capacity to fulfill multiple needs, potentially transforming the way research is conducted in Ottoman studies. “Ottoman Turkish has not been represented in textual digital humanities in a meaningful way, mainly due to the lack of machine-readable corpora in this language. [...] we aim to produce searchable and computable historical text collections in Ottoman Turkish that can then be studied with textual analysis and data visualisation tools.”
Preparing the Material
Having considered the challenges, the Digital Ottoman Corpora team started preparing their work together with Transkribus’ AI-powered text recognition software in order to create machine-readable and accessible text. With the goal in mind to train and publish a text recognition model with Transkribus, the team hopes to “raise awareness of HTR in the Ottomanist community and encourage more scholars to participate in the digital corpora creation project for Ottoman Turkish”.
The first step was to select and prepare the text material for the transcription process. The material chosen to train the automated text recognition model consisted of six Ottoman Turkish periodicals from the late 19th and early 20th centuries and an Ottoman Turkish dictionary. These journals covered topics ranging from politics to literature and culture, providing "valuable insights into the social and intellectual history of the late Ottoman Empire".
As previously mentioned, a major challenge is the transition from the Arabic script used in Ottoman Turkish to the Latin script used in modern Turkish, along with the difference in directionality (right-to-left vs. left-to-right). To overcome this, the team defined a standardised transcription scheme, taking into account both the requirements of Transkribus’ transcription engine and human readability.
Transcribing the Documents
After choosing the most suitable transcription method for the project, the next step was to create training data. The creation of training data involves providing a large number of correctly transcribed pages as ground truth, which is used to train an automatic text recognition model that is capable of recognising the script. In this case, the generation of training data was a mixed approach, including manual transcription and the repurposing of existing Latinised texts through normalising spellings, correcting errors and standardising transcription schemes.
An additional obstacle that was unique to this project was the contrasting text orientation between Ottoman Turkish and modern Turkish. For this project of transforming Ottoman Turkish journals into transcribed and digital Modern Turkish Latin script text, the transcription software would need to support matching right-to-left images with left-to-right transcription. At first, Transkribus didn't support the alignment of right-to-left images with left-to-right transcription. However, Suphan Kirmizialtin was able to develop a workaround to reverse the direction of the transcribed text in Modern Turkish. By working closely with the Transkribus development team, another feature has been added to Transkribus that drastically reduced the time and effort required to train a text recognition model for Ottoman Turkish.
“While we may not have arrived at the ultimate solution to all these complexities of Ottoman Turkish, I believe that adopting an algorithmically-compatible transcription scheme and consistently applying it in ground truth creation holds the key to successful HTR training in Transkribus.”
The effort of Suphan Kirmizialtin and the Digital Ottoman Corpora team has resulted in the first text recognition model for Ottoman Turkish with a CER on the validation set of 7.21%.
Try the model here: Ottoman Turkish Print
From Ottoman Turkish Periodicals to the First Public Model for Ottoman Turkish Print
Although the Digital Ottoman Corpora project faced some skepticism about working with automated transcription, as well as challenges due to the nature of the language they were working with, the team was able to publish the first automated text recognition model for Ottoman Turkish print. The collaboration with Transkribus also demonstrated the potential of HTR's software in making Ottoman Turkish texts and Ottoman cultural heritage accessible to a wider audience. With the success of this project, the Digital Ottoman Corpora team hopes to start a conversation within the community about important issues such as creating consistent guidelines for converting Ottoman Turkish texts from the original Arabic script to Latin script which are compatible with algorithms and automated systems.
With this innovative project, Suphan Kirmizialtin and the Digital Ottoman Corpora team are at the forefront of the fascinating field of automated transcription of Ottoman Turkish printed texts. They are redefining the approach to accessing, analysing and interacting with Ottoman Turkish texts, paving the way for new discoveries and a deeper understanding of this rich cultural heritage.
Thank you for talking to us Suphan, we will be following the development of the project with great interest! To find out more about the project, please visit the official Digital Ottoman Corpora website.
Suphan Kirmizialtin’s Transkribus Tips:
“Create detailed documentation and metadata for the HTR training process. I read somewhere that ‘metadata is a love note to the future’ and I completely agree!”
“Avail yourself of the experience and wisdom of the awesome Transkribus user community.”