Colin Swaelens, Lemmatisation & Morphological Analysis of Unedited Greek: Do Simple Tasks Need Complex Solutions?

Abstract

Fine-tuning transformer-based models for part-of-speech tagging of unedited Greek text has outperformed traditional systems. However, when applied to lemmatisation or morphological analysis, fine-tuning has not yet achieved competitive results. This paper explores various approaches to combining morphological features, both to reduce label complexity and to enhance multi-task training. Specifically, we group three nominal features into a single label, and combine the three most distinctive features of verbs into another unified label. These combined labels are used to fine-tune DBBERT, a BERT model pre-trained on both ancient and modern Greek. Additionally, we experiment with joint training, both among these labels and in combination with POS tagging, within a multi-task framework to improve performance through parameter sharing. To evaluate our models, we use a manually annotated gold standard from the Database of Byzantine Book Epigrams. Our results show an improvement of nearly 9 percentage points, demonstrating that multi-task learning is a promising approach for linguistic annotation in less standardised corpora.
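To illustrate the kind of multi-task set-up the abstract describes, the sketch below shows a shared BERT-style encoder with one token-classification head per annotation layer (POS, a combined nominal label, a combined verbal label), trained with a summed cross-entropy loss. This is a minimal illustration, not the authors' implementation: the checkpoint name, task names, and label counts are placeholders.

```python
# Minimal multi-task token-classification sketch (illustrative only, not the
# authors' code): a shared pre-trained encoder with one linear head per task.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class MultiTaskTagger(nn.Module):
    def __init__(self, encoder_name: str, task_label_sizes: dict):
        super().__init__()
        # Shared pre-trained encoder (e.g. a BERT model pre-trained on Greek).
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One classification head per task: POS, combined nominal label,
        # combined verbal label.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_label_sizes.items()}
        )
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, labels=None):
        # labels: dict mapping task name -> (batch, seq_len) gold label ids.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        logits = {task: head(hidden_states) for task, head in self.heads.items()}
        loss = None
        if labels is not None:
            # Multi-task objective: sum per-task token-level losses so the
            # shared encoder receives gradients from every annotation layer.
            loss = sum(
                self.loss_fn(
                    logits[task].view(-1, logits[task].size(-1)),
                    labels[task].view(-1),
                )
                for task in labels
            )
        return loss, logits


# Hypothetical usage: checkpoint name and label counts are placeholders.
tasks = {"pos": 14, "nominal": 60, "verbal": 120}
model = MultiTaskTagger("bert-base-multilingual-cased", tasks)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
```

Summing the per-task losses is one simple way to share parameters across the annotation layers; weighting the tasks or training them sequentially are common variants of the same idea.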

Practical information

This paper will be presented at ‘The 63rd Annual Meeting of the Association for Computational Linguistics’ (ACL 2025), which will take place in Vienna, Austria, from July 27 to August 1, 2025.

Date & time: to be confirmed

Location: Austria Center Vienna (Bruno-Kreisky-Platz 1, Vienna, Austria)