Project DBBE

Colin Swaelens, Lemmatisation & Morphological Analysis of Unedited Greek: Do Simple Tasks Need Complex Solutions?

Abstract

Fine-tuning transformer-based models for part-of-speech tagging of unedited Greek text has outperformed traditional systems. However, when applied to lemmatisation or morphological analysis, fine-tuning has not yet achieved competitive results. This paper explores various approaches to combine morphological features to both reduce label complexity and enhance multi-task training. Specifically, we group three nominal features into a single label, and combine the three most distinctive features of verbs into another unified label. These combined labels are used to fine-tune DBBERT, a BERT model pre-trained on both ancient and modern Greek. Additionally, we experiment with joint training — both among these labels and in combination with POS tagging — within a multi-task framework to improve performance by transferring parameters. To evaluate our models, we use a manually annotated gold standard from the Database of Byzantine Book Epigrams. Our results show a nearly 9 pp. improvement, demonstrating that multi-task learning is a promising approach for linguistic annotation in less standardised corpora.
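The label grouping described in the abstract can be illustrated with a minimal sketch. The abstract does not say which features are combined, so the choices below (case, number and gender for nominals; tense, mood and voice for verbs), the feature names, the tag values and the helper `combine_features` are all assumptions made for illustration, not the author's actual label scheme.

```python
# Hypothetical feature groupings; the abstract does not name the features.
NOMINAL_KEYS = ("case", "number", "gender")
VERBAL_KEYS = ("tense", "mood", "voice")


def combine_features(token, keys, missing="_"):
    """Join the selected morphological features into one label.

    Features absent from the token are filled with a placeholder so
    that every combined label has the same number of slots.
    """
    return "|".join(token.get(key, missing) for key in keys)


# Toy annotations with invented tag values.
noun = {"case": "nom", "number": "sg", "gender": "masc"}
verb = {"tense": "aor", "mood": "ind", "voice": "act",
        "person": "3", "number": "sg"}

nominal_label = combine_features(noun, NOMINAL_KEYS)
verbal_label = combine_features(verb, VERBAL_KEYS)
print(nominal_label)
print(verbal_label)
```

A classifier trained on such combined labels predicts one tag per group instead of one tag per feature, which is one way to read the reduction in label complexity mentioned above.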

Practical information

This lecture will be given at ‘The 63rd Annual Meeting of the Association for Computational Linguistics’, which will take place in Vienna, Austria, from 27 July to 1 August 2025.

Date & time: to be confirmed

Location: Austria Center Vienna (Bruno-Kreisky-Platz 1, Vienna, Austria)

Kyriaki Giannikou, Typologies for the study of historical Greek texts: Perspectives from two UGent projects

Abstract

I will discuss the typologies developed by two digitally-oriented projects from Ghent University, EVWRIT and DBBE, for organising, categorising, and describing documentary and literary historical textual material in Greek. The EVWRIT (Everyday Writing in Graeco-Roman and Late Antique Egypt. A Socio-Semiotic Study of Communicative Variation) project focuses on Greek documentary texts, examining their external features to uncover social meaning in the communicative and administrative contexts of the Ptolemaic, Roman, and Byzantine periods. Its main goal is to illuminate the relationship between form and content in these historical texts, providing a multi-aspect and well-structured framework for analysis. Meanwhile, the Database of Byzantine Book Epigrams (DBBE) stores and presents metrical paratexts found in the margins of medieval Greek manuscripts, primarily focusing on original texts and scribal choices, while grouping and linking them to their edited versions. The DBBE focuses heavily on metadata, contextualising the texts through details of their production (date, place, manuscript, etc.) and, where available, their treatment in secondary literature. By comparing the typologies used in both projects, I will highlight different approaches to structuring and presenting historical textual data, showcasing how they can offer equally valuable insights.

Practical information

This lecture will be given at the Typology workshop, organised by the grammateus project at the University of Geneva on 21-22 March 2025. See the full programme here.

Date & time: Saturday 22 March 2025, 15:50

Location: Amphithéâtre 012A – Battelle D (Route de Drize 7, 1227 Carouge)

Eleonora Lauro & Colin Swaelens, Enhancing and Visualising Textual and Material Analysis of Manuscripts: A Graph-Based Approach

Abstract

Manuscripts are no longer studied as purely textual witnesses in a bottom-up approach as in stemmatological philology, but also as physical objects. Current computational developments enable new top-down approaches. Graph databases intuitively visualise complex relationships between chunks of data, coming from – in our case – metrical paratexts of the Database of Byzantine Book Epigrams (Ricceri et al. 2023). We carried out a pilot study in which we clustered 200 occurrences of the same epigram based on textual differences and linguistic annotations (Swaelens et al. 2024). This already revealed complex relationships in the graph representation between clusters of texts, prompting scholars to dive deeper into the reasons why they are grouped. The current paper explores how a graph-based approach can present even more intricate connections between manuscripts by adding metadata (date, place, scribe) to the textual data. A qualitative analysis of both bottom-up and top-down approaches reveals that they complement each other and provide researchers with new perspectives.

Practical information

This lecture will be given at the “International Medieval Congress”, organised by the Institute for Medieval Studies at the University of Leeds. IMC 2025 will take place from Monday 07 July to Thursday 10 July 2025.

Date & time: Wednesday 9 July 2025, 14:15

Location: Leeds

More information about this conference and the full programme can be found here.

Kristoffel Demoen, Les paratextes métriques (‘book epigrams’) dans les manuscrits grecs comme objets matériels et comme liens textuels entre les producteurs des manuscrits, les œuvres transmises et les lecteurs

This lecture will be given at the ‘Séminaire Cultures anciennes et temporalités 2024-2025’, organised by the research centre HiSoMA (Histoire et Sources des Mondes Antiques).

Date & time: Friday 21 February 2025, 9:30-12:30

Location: Maison de l’Orient et de la Méditerranée Jean Pouilloux (86 rue Pasteur – Lyon 7e), Salle Reinach (4e étage)

More information can be found here.

Crash Course in Greek Palaeography

The Greek section of Ghent University, in collaboration with the Research School OIKOS and the Royal Library of Belgium, offers a two-day crash course in Greek palaeography. The course will take place on 27-28 May 2025 in Ghent and Brussels. It is intended for MA, ResMA and doctoral students in Classics, Ancient History, Ancient Civilizations, Byzantine studies, Medieval studies and related fields. Students must have a good command of Greek. The course offers an introduction to Greek palaeography from the Hellenistic period to the end of the Middle Ages and is specifically aimed at acquiring practical skills for research involving literary and documentary papyri and/or manuscripts. Participants will gain hands-on experience with original papyri housed at Ghent University Library and with manuscripts from the Royal Library of Belgium in Brussels.

Programme

The course will take place over two full days, with one session in Ghent on Tuesday, 27 May, and the other in Brussels on Wednesday, 28 May. Specialists in Greek palaeography will deliver lectures providing a chronological overview of the evolution of Greek handwriting, accompanied by introductions to the material features of both papyri and codices. The lectures will be followed by practical sessions, consisting of supervised reading of selected extracts from papyri and manuscripts in small groups. There will be guided exhibitions of selected papyri (in Ghent) and medieval manuscripts (in Brussels).

Tuesday, 27 May 2025
09:30-10:00 Introduction to the Crash Course
10:00-10:30 Introduction to papyrology and the materiality of papyri – Dr. Serena Causo
10:30-11:45 Papyri of the Ptolemaic and Roman period – Dr. Joanne Stolk
11:45-13:00 Practice with papyri of the Ptolemaic and Roman period
13:00-14:00 Lunch break
14:00-14:30 Presentation of papyri from the collection of the Ghent University Library – Dr. Serena Causo
14:30-15:45 Papyri of the Byzantine period – Dr. Yasmine Amory
15:45-17:00 Practice with papyri of the Byzantine period
18:30 Dinner in Ghent (optional)

Wednesday, 28 May 2025
09:30-10:00 Introduction to the codicology of the Byzantine book – Dr. Grigory Vorobyev
10:00-10:30 Display moment 1
10:30-10:45 Coffee break
10:45-12:00 Byzantine book scripts 1: From the first codices to the eleventh century – Prof. dr. Floris Bernard
12:00-13:00 Reading practice 1
13:00-14:00 Lunch break
14:00-15:15 Byzantine book scripts 2: The Comnenian and Palaeologan periods – Prof. dr. Andrea Cuomo
15:15-15:45 Display moment 2
15:45-16:45 Reading practice 2

The teaching staff also includes Kyriaki Giannikou, Dr. Juan Bautista Juan-López, Eleonora Lauro, Dr. Divna Manolova and Dr. Chiara Monaco.

Practical information

The study load is equivalent to 2 ECTS credits (2×28 hours). In preparation for the course, participants will be required to read secondary literature which will be distributed several weeks in advance. Additional materials will be provided in order to help develop further reading skills after the course.

There is no participation fee for this course. Lunches will be provided on both days free of charge. Travel and accommodation expenses are the responsibility of the participants. The train connection between Ghent Sint-Pieters Station and Brussels Central Station is frequent, with a travel time of less than 40 minutes. Participants may choose to lodge in either city.

The course will take place at the following venues:

Registration

Prospective participants should register by sending an e-mail to grigory.vorobyev@ugent.be with a short motivation letter (approximately 300 words), detailing their academic background, research interests and motivation for attending the course. Priority will be given to MA and doctoral students associated with OIKOS and those who have not previously had the opportunity to study palaeography. The deadline for registration is 1 March 2025. Applicants will be notified of the outcome shortly thereafter.

Colin Swaelens, Part-of-Speech Tagging & Lemmatisation in Unedited Greek: Simple Tasks, Complex Challenges?

Abstract

In today’s landscape of language technology, dominated by large language models, tasks like part-of-speech tagging and lemmatisation receive less attention in current NLP research. However, these tasks still pose significant challenges, especially for under-resourced, morphologically rich languages like Ancient Greek. Our project focuses on the verbatim transcriptions of Byzantine marginal poetry stored in the Database of Byzantine Book Epigrams (DBBE). Due to the highly interconnected nature of the poems, we aim to eventually perform similarity detection across the corpus. As a first step, we sought to annotate the DBBE with part-of-speech tags, morphological analyses, and lemmas. Although research on these tasks dates back to more straightforward rule-based systems from the 1970s, current taggers struggle with these unedited texts. The inconsistent orthography — largely due to itacism — adds to this complexity. To mitigate these issues, we trained a transformer-based language model encompassing classical, medieval, and modern Greek. Our experiments, however, revealed that fine-tuning the model for each annotation task was not always fruitful. There is a growing tendency to address such challenges with a multi-task head, allowing the model to process multiple annotations concurrently, drawing inspiration from cognitive psychology. This raises the question: will this more intricate solution outshine the seemingly more transparent methods of the past?

Practical information

This lecture will be given at the Computational Humanities Research Group Seminar Series, organised by the Department of Digital Humanities of King’s College London.

Date & time: Tuesday 10 December 2024, 4:00 pm

Location: Bush House, Strand Campus (30 Aldwych, London) & online

More information about this conference and the full programme can be found here.

Kyriaki Giannikou, Assessing and Reassessing Formulaicity: are editorial practices a blessing or a curse?

Abstract

Formulaicity is a widely discussed concept in the study of historical Greek, primarily due to the influence of the Homeric epics, where it is traditionally understood to arise from oral contexts where formulaic sequences reduce processing effort during lengthy recitations. Besides that, formulaic language also appears in entirely written contexts, such as post-classical Greek administrative and legal documents, where high standardisation meets the need of accuracy and efficiency (see e.g. Nachtergaele 2023; Saradi 2019). The corpus I focus on, Byzantine book epigrams — short, metrical texts found in the margins of Byzantine manuscripts — presents a unique case. These paratexts, embedded in the medieval manuscript tradition, blend literary and documentary functions without any oral performance context, oscillating between practical precision and creative expression. This paper explores a methodological challenge in studying formulaic language within historical Greek corpora, focusing specifically on the Database of Byzantine Book Epigrams.

Even recent comprehensive research on Homer’s formulaic language (Bozzone 2024) relies on modern editions of the Homeric epics that attempt to reconstruct an ‘archetype’ based on medieval manuscript ‘witnesses’. In contrast, the DBBE diverges from strict adherence to traditional editorial practices by presenting epigrams preserving all original scribal choices (‘Occurrences’) while also offering ‘normalised’ versions (‘Types’) that group similar instances of the originals (Ricceri et al. 2023). This raises questions: To what extent can we rely on edited texts to analyse formulaicity? How might editorial choices, driven by the desire for a cohesive text, obscure the original variability of formulaic sequences? Does the interaction between formulaicity and editorial practices facilitate research, or does this create the impression of greater fixedness in formulae, potentially skewing certain aspects of the analysis?

This paper explores the potential impact of editorial intervention on formulaicity research, advocating for a more flexible methodology that balances the use of both edited and original sources. Through a case study on supplications for salvation within a subset of the DBBE corpus, I will demonstrate how formulaic expressions function in this hybrid referential-poetic (cf. Jakobson 1960) context, and how editorial practices may shape our understanding of formulaicity. Ultimately, this study seeks to position this material within the broader framework of formulaicity research and to discuss the implications of editorial practices for linguistic research in historical corpora.

Practical information

This lecture will be given at the conference ‘Formulaic Language in Historical Linguistics: data, methods, tools, and theory’, organised on 2-3 June 2025 by the Academy of Finland project “The learning of Latin in the 8th to 12th century: a linguistic approach to medieval Latin literacies” in collaboration with the Classical Philological Society of Finland.

Date & time: Monday 2 June 2025, 16:40

Location: Tieteiden talo (Kirkkokatu 6, Helsinki, Finland)

More information about this conference and the full programme can be found here.

Colin Swaelens, Similarity Detection: A Starting Point for Greek

Abstract

Ancient literature survived thanks to scribes who, before the advent of printing, painstakingly copied texts from one manuscript to another. Occasionally, these scribes added metrical paratexts to the manuscripts, i.e. texts standing next to the main text (Genette, 1987), introduced in Byzantine scholarship by Lauxtermann (2003) as book epigrams. Ghent University’s Database of Byzantine Book Epigrams (Ricceri et al., 2023) stores more than 12,000 such epigrams as verbatim transcriptions, precisely as they are found in the manuscripts. As a result, the Greek of these epigrams is interspersed with orthographic inconsistencies, mainly due to phonetic changes such as itacism. These verbatim transcriptions are called occurrences and are grouped under one or more so-called types, each a readable representation of its occurrences in standardised, classical Greek. Eventually, we aim to develop a dynamic system to group hemistichs, verses and epigrams based on distinct similarity measures, so that scholars can find all kinds of similar texts instead of only the ones that come to mind. As with any other algorithm, evaluation is an essential part of developing those similarity measures. However, a gold standard for the evaluation of verse similarity measures does not exist. At this point, we have already conducted a pilot study on pairwise annotation of 2 verses with 10 annotators. Each verse was set off alongside six pairs of verses, of which the annotator had to mark the most similar one in their opinion. The inter-annotator agreement (IAA) was 57.69%, which is regarded as moderate agreement (Landis & Koch, 1977). This agreement score is the arithmetic mean of the agreement between each pair of annotators, as all annotators annotated the exact same set of verses. Despite the rather modest size of this pilot study, it is possible to unravel the distinct lines of reasoning of the annotators.
They did not receive detailed instructions for the annotation process, so every annotator was free to choose their own focal point. The most remarkable of those focal points was the metre: one of the annotators based their judgement on the number of syllables a verse counts. The majority, however, seemed to take syntax as the decisive factor in determining the most similar verse; semantics was decisive only if the syntax of both options was identical. While the gold standard is being annotated, we have already started computing similarity between words. These similarities will, in a next stage, be used to compute similarity between (half) verses. The main goal of the experiment is to find out whether transformer embeddings take into account enough context to find identical or similar words with deviant orthography.
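The agreement score described in the abstract, the arithmetic mean of the agreement between each pair of annotators, can be sketched as follows. This is a minimal illustration assuming raw percentage agreement per pair; the function names and the toy choices are invented here and do not reproduce the pilot study's data or its exact computation.

```python
from itertools import combinations


def pairwise_agreement(a, b):
    """Share of items on which two annotators made the same choice."""
    if len(a) != len(b):
        raise ValueError("annotators must label the same set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(annotations):
    """Arithmetic mean of the agreement over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


# Invented toy data: three annotators, four items each.
toy_annotations = [
    ["A", "B", "C", "A"],
    ["A", "B", "A", "A"],
    ["B", "B", "C", "A"],
]

score = mean_pairwise_agreement(toy_annotations)
print(f"{score:.2%}")
```

Averaging over pairs only makes sense when every annotator labels the same items, which is the condition the abstract states for the pilot study.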

Practical information

This lecture will be given at the ‘Computational Approaches to Ancient Greek and Latin Workshop’, organised by KU Leuven and the University of Groningen. This workshop series started in 2021 with the aim of further exploring the potential of computational approaches (Natural Language Processing) applied to Ancient Greek and Latin. The 2024 edition will be held in hybrid format on 28 and 29 November 2024.

Date & time: Friday 29 November 2024, 13:45-14:30

Location: KU Leuven: Mgr. Sencie Instituut (Erasmusplein 2, 3000 Leuven, Belgium) & online

Register via this link. Registration for in-person attendance is no longer possible. The deadline for registration for online attendance is 27 November 2024.

More information about this conference and the full programme can be found here.

Kristoffel Demoen, Kyriaki Giannikou & Colin Swaelens, The Database of Byzantine Book Epigrams. Paratextual Poems from the Margins of Medieval Manuscripts to a Searchable Digital Corpus

This lecture will be given at the 8th International Byzantine Seminar Lecture Series (2024) on “Digital Methods for Byzantine Studies”, organised by the Institute for the History of Ancient Civilizations at the Northeast Normal University in Changchun (China), in collaboration with the Department of Byzantine and Modern Greek Studies at the University of Cologne and the Department of Historical and Classical Studies at the Norwegian University of Science and Technology.

Date & time: Thursday 21 November 2024, 11:00 am (CET)

Location: online via Zoom

Registration is free, but required. The Zoom link will be provided upon registration. To register or for more information, send an email with “IBSLS Registration” to liq762@hotmail.com.

LW Research Day 2024: poster session

The fourth LW Research Day will take place on Wednesday 27 November 2024, in the Ghent University Museum (GUM). The central theme is ‘From Source to Understanding’.

What is the role of interpretation in our journey from studying source material to scientific understanding? Indeed, that journey can never be devoid of interpretation, which, in many cases, serves as the quintessential bridge between source material and understanding, whether it pertains to a historical study based on ego documents, the archaeological perspective on the material culture of the past or the anthropological view of human behaviour. Not infrequently, interpretation itself becomes the object of research. For instance, translation scholars examine translation choices that result from interpretations. Literary and art scholars investigate works that themselves provide an interpretation of the world in which they originate and the world they create. Similarly, language itself reflects a particular understanding of the world in a historical and sociological sense, which linguists further explore. In times of digital humanities, the interpretation of (big) data by AI becomes not only conceivable but even the norm. What do interpretation and hermeneutics signify for our fields today? What constitutes a successful or legitimate interpretation, and what are the pitfalls of interpretation?

The PhD students of the DBBE team will present a poster on their research projects in the framework of the Database of Byzantine Book Epigrams.

Kyriaki Giannikou – Dealing with Building Blocks of Expression: Formulaic Elements & their Creative Variations in Byzantine Book Epigrams
Eleonora Lauro – Epigrams in Context: Glimpses into Medieval Southern Italian Book Culture

More information can be found on the LW Research Day website.