OCR: Modern Tool for Old Texts

The OCR4all tool converts historical prints into computer-readable text. It is very reliable, user-friendly, and open source. It was developed by scientists at the University of Würzburg.

Page from a French version of the "Narrenschiff". Such old fonts can be reliably converted into computer-readable text with OCR4all. Source: Dresden State and University Library (Staats- und Universitätsbibliothek Dresden), CC BY-SA 4.0

Historians and other humanities scholars often have to deal with difficult research objects: centuries-old printed works that are hard to decipher and often poorly preserved. Many of these documents have now been digitized – usually photographed or scanned – and are available online worldwide. For research purposes, this is already a step forward.

However, one challenge remains: using text recognition software to convert the digitized old fonts into a modern form that is readable for non-specialists as well as for computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to further development in this field.

With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints into computer-readable texts with an error rate of less than one percent, and it offers a graphical user interface that requires no IT expertise. Previous tools of this kind were not always user-friendly, as users mostly had to work with programming commands.

Developed in cooperation with the humanities

The new OCR4all tool was developed under the direction of Christian Reul together with his computer science colleagues Professor Frank Puppe (Chair of Artificial Intelligence and Applied Computer Science) and Christoph Wick, as well as Uwe Springmann (Digital Humanities expert) and numerous students and assistants.

OCR4all originates from the JMU Kallimachos project, which is funded by the German Federal Ministry of Education and Research. This cooperation between the Humanities and computer science will be continued and institutionalized in the newly founded JMU Center for Philology and Digitality.

In developing OCR4all, computer scientists collaborated with the Humanities at JMU, including German studies, Romance studies and literature studies, in the project "Narragonien digital". The aim was to digitize the "Narrenschiff", a moral satire by Sebastian Brant and a bestseller of the 15th century that was translated into many languages. OCR4all has also been used frequently in JMU's Kolleg "Medieval and Early Modern Times".

OCR4all is freely available to the public on the GitHub platform (with instructions and examples): https://github.com/OCR4all

Each print shop had its own font

Christian Reul explains the challenges involved in the development of OCR4all: Automatic text recognition (OCR = Optical Character Recognition) has been working very well for modern fonts for some time now. However, this has not yet been the case for historical fonts.

"One of the biggest problems was typography," says Reul. One of the reasons for this is that the first printers of the 15th century did not use uniform fonts. "Their printing stamps were all carved by themselves, each printing house practically had its own letters."

Error rates below one percent

Whether e or c, whether v or r - it is often not easy to distinguish in old prints, but software can learn to recognize such subtleties. To do so, it has to be trained on sample material. In his work, Reul has developed methods to make training more efficient. In a case study with six historical prints from the years 1476 to 1572, the average error rate in automatic text recognition was reduced from 3.9 to 1.7 percent.
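The article does not say how the error rates were measured. A common measure for OCR output is the character error rate (CER): the edit distance between the recognized text and a manually checked transcription, divided by the length of that transcription. The following is a minimal sketch under that assumption; the function names are illustrative and are not taken from OCR4all or Calamari.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription
    (assumed definition; the article does not specify the metric)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, e.g. from 3.9 to 1.7."""
    return (before - after) / before
```

For example, an OCR result that confuses a single e with a c in an eight-character line, as in character_error_rate("der narr", "dcr narr"), yields 1/8, or 12.5 percent. Applied to the figures above, relative_reduction(3.9, 1.7) is roughly 0.56, meaning the improved training removed a little more than half of the remaining errors.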

Not only was the methodology improved: JMU computer scientist Christoph Wick also decisively refined the technical component by developing the Calamari OCR tool, which is likewise freely available and has since been fully integrated into OCR4all. This led to even better results: error rates of less than one percent can now generally be achieved, even for the oldest printed works.

Lexical projects

Reul has also convinced external partners of the quality of Würzburg's OCR research. In cooperation with the "Zentrum für digitale Lexikographie der deutschen Sprache" (Berlin), Daniel Sanders' "Wörterbuch der deutschen Sprache" (Dictionary of the German Language) has been digitally indexed, and a scientific publication on this work is currently being prepared. The individual lines of this text often contain different fonts that carry different semantic information. Here, the existing approach to character recognition was extended so that not only the text but also the typography, and thus the complex content structure of the lexicon, can be reproduced very precisely.

The computer scientist from Würzburg will soon complete his doctoral thesis, and he intends to continue working with OCR in the future: "The computer science behind OCR is extremely exciting," he says. A possible project in the near future: the creators of the "Idiotikon", a dictionary of the Swiss-German language, have indicated their interest in a collaboration, since they might well need Würzburg's specialist knowledge.

Center for Philology and Digitality

The JMU Center for Philology and Digitality is the result of an initiative launched by Professors Dag Nikolaus Hasse, Fotis Jannidis and Ulrich Konrad. The Center bridges the gap between the Humanities, computer science and Digital Humanities. It represents the first building block for a new Humanities Center on the JMU North Campus.

A new building for the Center (ZPD) is to be erected there, close to the Graduate School building. From 2022 on, around 100 people are expected to work in the new ZPD building, which will have a total area of 2,700 square metres. The total cost of the building is estimated at 15 million euros. A digital lab, research rooms and lecture halls are planned for the ground floor. The upper floors will be used primarily for offices and communication rooms.

Press release from the Julius-Maximilians-Universität Würzburg, JMU, by Robert Emmerich


Cambridge and Heidelberg announce major project to digitise treasured medieval manuscripts

Hundreds of medieval and early modern Greek manuscripts – including classical texts and some of the most important treatises on religion, mathematics, history, drama and philosophy – are to be digitised and made available to anyone with access to the internet.

In a major collaboration announced today (March 28), Cambridge University Library, 12 Cambridge colleges, the Fitzwilliam Museum, Heidelberg University Library and the Vatican Library have come together as part of a two-year £1.6m project, funded by the Polonsky Foundation, to digitise more than 800 medieval manuscripts.

The project between two of Europe’s oldest universities, both renowned for their medieval collections, will see the digitisation of every medieval Greek manuscript in Cambridge and all those belonging to the Bibliotheca Palatina collection, split between Heidelberg and the Vatican. It will provide a unique insight into the chronological range of Greek manuscript culture, from the early Christian period to the early modern.

Dr Suzanne Paul, Keeper of Rare Books and Early Manuscripts at Cambridge University Library, said: “The Cambridge and Heidelberg collections bear witness to the enduring legacy of Greek culture – classical and Byzantine – and the lasting importance of Greek scholarship.

“The works of Homer and Plato were copied and recopied throughout the medieval period and the early biblical and liturgical manuscripts are profoundly important for our understanding of a Christian culture based on the written word.

“These multilingual, multicultural, multifarious works, that cross borders, disciplines and the centuries, testify to a deep scholarly engagement with Greek texts and Greek culture that both universities are committed to upholding.”

Once digitised, the Cambridge manuscripts will join the works of Charles Darwin, Isaac Newton, Stephen Hawking and Alfred Lord Tennyson on the Cambridge Digital Library. Since its launch in 2010 – with the digitisation of Newton’s Principia Mathematica making headlines around the world – the treasures of Cambridge’s Digital Library have been accessed more than 13.5 million times.

Dr Veit Probst, Director of Heidelberg University Library, said: “Numerous discoveries await. We still lack detailed knowledge about the production and provenance of these books, about the identities and activities of their scribes, their artists and their owners – and have yet to uncover how they were studied and used, both during the medieval period and in the centuries beyond. The meanings of the annotations and marginalia in the original manuscripts have yet to be teased apart. From such threads, a rich tapestry of Greek scholarship will be woven.”

With more than 38,000 volumes digitised to date, Heidelberg’s Digital Library has been visited by scholars and members of the public in 169 countries, underlining the global appetite for digital access to collections which would be impossible for most to access directly.

The current status of these collections presents significant challenges to scholars both in terms of cataloguing and conservation, with the medieval bindings of many manuscripts in a fragile state. The current catalogues for them date from the nineteenth century; many of those for the Cambridge manuscripts were written by the scholar M.R. James, Provost of King’s College, Cambridge, but best known for his ghost stories which remain popular to this day.

Of the Cambridge Greek manuscripts, around 210 are held at the University Library, 140 at Trinity College, and a further 60 spread across 11 other colleges and the Fitzwilliam Museum. Of the Bibliotheca Palatina Greek manuscripts, 29 are in Heidelberg and 403 are in the Vatican Library, having been transferred there from Germany as a spoil of war in 1623.

Dr Jessica Gardner, Cambridge University Librarian, said: “Opening up some of the most important Greek medieval manuscripts to not just scholars, but the widest possible audience, is another key milestone towards our goal of sharing Cambridge’s treasured collections with the world.

“I would like us to get to the stage where the University’s entire medieval collections are digitised. This project is testament to what can be achieved when Cambridge’s libraries, colleges and museums work in tandem – while at the same time building ever-closer relationships with a distinguished European research library like our own.”

Dr Leonard S. Polonsky CBE, Founding Chairman, The Polonsky Foundation, said: “Our Foundation is proud to support this important collaboration between the ancient universities of Cambridge and Heidelberg, which represents a significant development for both institutions. For Heidelberg the project will complete the virtual reconstruction of the Palatine Library that is being carried out with the Vatican Library.

“For Cambridge it is the first phase of a collaboration among the University Library, Cambridge colleges and the Fitzwilliam Museum to digitise their collections of Western medieval manuscripts. Benefiting from the extraordinary opportunities afforded by digitisation, the project brings together the treasures of these great institutions and makes them available to researchers and the wider public in innovative and attractive ways.”

Press release from Cambridge University, by Stuart Roberts


The sword of a Hispano-Muslim warlord is digitized in 3D

A treasure from the Toledo Army Museum (Spain)

At age 90, Ali Atar, one of the main military chiefs of King Boabdil of Granada, fought to his death in the Battle of Lucena in 1483. It was there that his magnificent Nasrid sword was taken from him, and researchers from the Polytechnic University of Valencia and a company from Toledo have now modelled it in order to graphically document and present it on the web.

Ali Atar, Warden of Loja and Lord of Zagra, was a Hispano-Muslim warlord in the service of King Boabdil, the last sultan of Granada, to whom he was also related, since Boabdil had married his daughter Moraima. In April 1483 Boabdil tried to take the Christian city of Lucena (Cordoba) with the help of his father-in-law, but they lost the battle: the Nasrid king was captured and Ali Atar died fighting at the age of 90.

The sword has been digitized in the workshops of the Toledo Army Museum (MUSEJE). Credit: IngHeritag3D

His magnificent sword, covered with gold, ivory and precious metals, then passed into the hands of the Christians and, after many historical vicissitudes, this Andalusian treasure is now preserved and exhibited in the Toledo Army Museum (MUSEJE, the Spanish acronym for Museo del Ejército).

To graphically document this valuable piece and make it known through the web, researchers from the Universitat Politècnica de València (UPV, Spain) and the company Ingheritag3D have carried out a three-dimensional digitization process. The study has just been published in Virtual Archaeology Review.

This is a photogrammetric image. Credit: IngHeritag3D

First they photographed the sword from many angles using a technique called photogrammetry. Then they overlapped all the images, drew planimetries (drawings of the meticulous filigree of the grip) and generated its 3D model.

"These techniques offer the possibility of valuing relevant pieces inside and outside museums, since three-dimensional modelling is prepared both for specialists, who can manipulate the piece virtually, and for being shared publicly and interactively through the Internet," says engineer Margot Gil-Melitón, co-author of the work.

Using a web viewer, any user can examine with their mouse an exact replica of the handle of this genet sword, a type of genuinely Nasrid weapon introduced in Al-Andalus by the Zenetas (the Berber people from whom it takes its name). Ali Atar's sword has a knob in the shape of a bulbous dome, an ivory fist carved with drawings and Arabic letters, and a golden arriax (sword grip) topped with zoomorphic figures.

To record the details of this fine ornamentation, the researchers have devised solutions that have facilitated the analysis of highly reflective materials and complicated geometries. Their workflow could also be applied to characterize other museum pieces.

This is a 3D modelling process of Ali Atar's Nasrid sword. Credit: IngHeritag3D

The other author of the study, Professor José Luis Lerma of the UPV, concludes: "A resource as valuable as cultural heritage can no longer be satisfied with physical conservation: it must be complemented by exhaustive digital preservation in all its forms, which facilitates the investigation of the pieces, their correct safeguarding and dissemination of knowledge to the general public."

References:

Margot Gil-Melitón, José Luis Lerma. "Historical military heritage: 3D digitisation of the Nasri sword attributed to Ali Atar". Virtual Archaeology Review, Vol 10 - No 20, pp. 52-69, 2019. DOI: https://doi.org/10.4995/var.2019.10028

Interactive 3D animation of the handle of the genet sword of Ali Atar: https://skfb.ly/ZzzA

Press release from Fundación Española para la Ciencia y la Tecnología (FECYT), Agencia SINC