TibNewsOne4All 0.2

Model details

Creator(s)

f.erhard@uni-leipzig.de

Language(s)

Tibetan

Centuries

20th

CER on Validation Set

2.52%

Size (Nr. of Words)

92,423

Model ID

169581

About this Model

TibNewsOne4All was trained on 500 pages (ca. 100,037 words) from 13 different Tibetan-language newspapers of the 1950s and 1960s, published in both India and the PRC. The model mainly transcribes Tibetan Uchen script, but it can also handle cursive scripts and, to a very limited extent, Chinese and English. TibNewsOne4All was trained for Divergent Discourses, a collaborative research project led by Robert Barnett at SOAS and Franz Xaver Erhard at Leipzig University, with funding from the AHRC and the DFG.

Settings:

- training set of 500 pages

- validation set of 27 pages

- lines tagged 'unclear' were excluded.

- 250 epochs

- early stopping: 20.

- Existing line polygons were not used in the training.

- The Tibetan language model TMUP 0.1 was used as the base model.
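To make the figures above easier to read, here is a minimal Python sketch of how a character error rate (CER) and an early-stopping rule are commonly defined. It is not the Transkribus implementation. The function names (`levenshtein`, `cer`, `epochs_run`) are hypothetical, and it assumes that "early stopping: 20" means training halts after 20 consecutive epochs without improvement in validation CER, capped at 250 epochs.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total edit distance divided by total reference length.

    Assumption: characters are counted as Unicode code points, so a stacked
    Tibetan syllable contributes several characters.
    """
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


def epochs_run(val_cer_per_epoch: Sequence[float],
               max_epochs: int = 250,
               patience: int = 20) -> int:
    """Number of epochs completed before training stops.

    Assumption: training stops once the validation CER has failed to
    improve for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("inf")
    stale = 0
    epochs = 0
    for value in val_cer_per_epoch[:max_epochs]:
        epochs += 1
        if value < best:
            best = value
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs
```

Under this definition, a validation CER of 2.52% corresponds to roughly 2.5 character edits per 100 reference characters on the 27 validation pages.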

Try it out

TibNewsOne4All 0.2 is freely available to everyone.

You can use this model to automatically transcribe handwritten documents with Handwritten Text Recognition in Transkribus.