Creator(s)
habib.ibrahim@live.com
Language(s)
Arabic
Centuries
17th, 18th
CER on Validation Set
8.15%
Size (Nr. of Words)
115,606
Model ID
242669
Project: Agapet - Advanced HTR for Christian Arabic Manuscripts
Background: Existing OCR tools for Arabic have achieved notable accuracy, yet HTR (Handwritten Text Recognition) for Arabic manuscripts remains challenging, with current models demonstrating variable effectiveness depending on the specific handwriting styles and manuscript collections used. Leveraging extensive experience with Christian Arabic manuscripts and expertise in copyists, scriptoria, and stylistic trends in regions such as Syria, Palestine, and Sinai from the 9th to 18th centuries, this project aims to develop a robust HTR model trained on representative Christian Arabic texts.
Representative manuscript groups include:
• 9th-10th century: Palestinian milieu
• 11th century: Antiochian milieu
• 13th century: Antiochian milieu and Damascus
• 14th century: Sinai
• 16th century: Tripolitan milieu (Lebanon)
• 17th-18th centuries: Aleppo milieu
Previous Experience: Between 2020 and 2021, I published a critical edition of the Abridged Antiochian Menologion (approximately 1,000 pages), a collection of Christian hagiographies. This work led to a project using Transkribus to train an HTR model. Between 2022 and 2024, the model’s accuracy rose from 50-60% with general models to 95% once it was specialized for specific handwriting tendencies. This result demonstrated the model’s potential to recognize texts associated with Simeon of Homs, a prominent 17th-century copyist of the Aleppo school.
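For readers comparing the accuracy figures above with the CER reported for this model, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between the reference and the recognized text, divided by the length of the reference. The function names are illustrative, and it is an assumption here that accuracy is read as 1 - CER; the text does not say how the 95% figure was measured, and Transkribus may normalize text differently before scoring.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref (classic dynamic programming, one row at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Counts Unicode code points, so Arabic diacritics count as characters."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dotted letter misread in a four-letter word.
    print(f"{cer('كتاب', 'كناب'):.2%}")
```

Under the same assumption, a CER of 8.15% would correspond to a character accuracy of about 91.85%.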
Model Limitations and Areas for Improvement: While the model performs well for the unique Abridged Menologion manuscript, its application is limited by the specific characteristics of Simeon of Homs's handwriting. Expanding the dataset with manuscripts from prominent Aleppo copyists, such as Thalja ibn Huran and Marqus ibn Dughan, as well as versions of the Menologion from the 13th-century Antiochian milieu, could improve the model’s versatility.
Project Goals
1. Expand Training Data: Incorporate manuscripts from diverse historical milieus—early Palestinian, Antiochian, and Damascus traditions, as well as those from Sinai and Tripoli.
2. Test and Enhance: Integrate tools such as e-Scriptorium for further testing and refinement.
3. Integrate Syntactic and Lexical Analysis: Combine this HTR model with syntactic and Part-of-Speech recognition. In collaboration with the GREgORI project and the e-cheikho project, the goal is to simulate a human-like ability to decipher ambiguous or degraded text through syntax and vocabulary context, ultimately aiming for a 6-million-token, neuron-based model. This combined approach will enable the model to improve its interpretative abilities, making it more intelligent and adaptable to complex handwritten texts.
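The idea behind the third goal, using vocabulary context to settle an ambiguous reading, can be illustrated with a toy sketch. It is not the project's method: the function name, the candidate confidences and the word counts below are all hypothetical, and it uses only a word-frequency lexicon with add-one smoothing, whereas the project envisages syntactic and Part-of-Speech information as well.

```python
import math


def pick_reading(candidates, lexicon_counts, weight=1.0):
    """Choose among alternative readings of one word.

    candidates     -- list of (word, htr_confidence) pairs, confidence in (0, 1]
    lexicon_counts -- dict mapping known words to corpus frequencies (hypothetical)
    weight         -- how strongly the lexicon prior counts against the HTR score
    """
    total = sum(lexicon_counts.values())
    vocab = len(lexicon_counts) + 1  # one extra slot for unseen words

    def score(item):
        word, confidence = item
        # Add-one smoothing so unseen words keep a small, non-zero prior.
        prior = (lexicon_counts.get(word, 0) + 1) / (total + vocab)
        return math.log(confidence) + weight * math.log(prior)

    return max(candidates, key=score)[0]


if __name__ == "__main__":
    # Hypothetical case: the recognizer slightly prefers a non-word that
    # differs from a common word only by the dots on one letter.
    readings = [("قدبس", 0.55), ("قديس", 0.45)]
    lexicon = {"قديس": 120, "كنيسة": 80}
    print(pick_reading(readings, lexicon))
```

With weight set to 0 the lexicon is ignored and the recognizer's own top choice is returned, which makes the contribution of the vocabulary prior easy to see.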
The project is named 'Agapet', after the renowned scribe of the Monastery of St. Elie of the Black Mountain, honoring the legacy of early Christian scribes.
Short Bibliography – Works on Copyists
• « Poimen al-Sīqī moine copiste (fl. 1223-1237) et la cellule des moines sinaïtes à Damas », in: Parole de l’Orient 50 (2024), p. 93-129.
• « Marqus of Aleppo, a seventeenth century forgotten scribe. Biography reconstructed from the colophons », in: George Kiraz and Sabine Schmidtke (ed.), Literary Snippets: Colophons Across Space and Time, 2023, p. 255-283.
• « Talǧat an-nāsiẖ fils du prêtre Ḥūrān al-ḥamawī », in: Chronos 39 (2019), p. 125-171.
You can use this model to automatically transcribe handwritten documents with Handwritten Text Recognition in Transkribus.