MacBERTh – Bidirectional Encoder Representations from Transformers of historical English and Dutch

Applicant: dr. Lauren Fonteyn (Leiden University)

The aim of this infrastructure project is to create the first set of deep neural language models pre-trained on historical textual material (Dutch and English) from different time periods. This semantic encoding infrastructure, or ‘MacBERTh’, will serve as an invaluable research tool for the social sciences and humanities (SSH) that enables new ways of analysing historical text. By making the underlying meaning of words, phrases and abstract sentence patterns accessible, searchable, and analysable in a bottom-up, data-driven way, the infrastructure will allow present-day researchers to uncover and draw connections between concepts and ideas from the past.

The technology underlying this infrastructure is based on a crucial insight from distributional semantics: the linguistic context in which words and phrases appear provides a good approximation of their meaning. Building on this idea, a number of powerful computational models have been developed that produce detailed, compressed linguistic representations when trained on large bodies of text. These representations have already proven crucial in addressing various challenges in computational linguistics (NLP/NLU) and related disciplines. However, these models have not yet been used to study meaning representation in historical language. ‘MacBERTh’ will fill this gap.
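To make the distributional idea concrete, here is a minimal sketch in Python. It is not the MacBERTh model or its training procedure: it swaps in simple co-occurrence counts for the neural representations described above, and the toy corpus, the function names (`cooccurrence_vectors`, `cosine`) and the window size are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, invented for this example.
TOY_CORPUS = [
    "the king ruled the realm",
    "the queen ruled the realm",
    "the farmer ploughed the field",
]

vectors = cooccurrence_vectors(TOY_CORPUS)
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_farmer = cosine(vectors["king"], vectors["farmer"])

# "king" and "queen" occur in the same contexts here, so they score
# higher than "king" and "farmer".
print(sim_king_queen > sim_king_farmer)
```

In this sketch each word receives a single vector regardless of where it occurs. Models of the BERT family instead produce a separate representation for every occurrence of a word in its sentence, which is what makes them attractive for studying how meanings differ across contexts and time periods.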

More information

A whole new (computational) world – Leiden University news