Maxime Deforche, An Orthographic Similarity Measure for Graph-based Text Representation

Abstract

Computing the orthographic similarity between words, sentences, paragraphs and texts has become a basic functionality of many text mining and flexible querying systems, and the resulting similarity scores are often used to discover similar text documents. However, this task becomes complex when dealing with a corpus that is known for its orthographic inconsistencies and its intricate interconnections on multiple levels (words, verses and full texts), as is the case with Byzantine book epigrams. In this paper, we propose a technique that tackles these two challenges by representing text in a graph and by computing a similarity score between multiple levels of the text, modelled as subgraphs, in a hierarchical manner. The similarity between all words is computed first, followed by the similarity between all verses (resp. full texts), which is calculated from the previously determined similarity scores between the words (resp. verses). The resulting similarities, on each level, allow for deeper insight into the interconnected nature of (parts of) text collections, indicating how and to what degree the texts are related to each other.
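To make the hierarchical idea concrete, here is a minimal Python sketch of a bottom-up similarity computation. It is not the method from the paper: the abstract does not specify the word-level measure or the aggregation rule, so the sketch assumes a normalised Levenshtein distance for words and a symmetric best-match average for verses and full texts, and it leaves out the graph representation entirely. All function names are hypothetical.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@lru_cache(maxsize=None)
def word_similarity(a: str, b: str) -> float:
    """ASSUMPTION: normalised Levenshtein as the word-level measure."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def aggregate(units_a, units_b, unit_similarity) -> float:
    """ASSUMPTION: symmetric best-match average over lower-level units."""
    if not units_a or not units_b:
        return 0.0

    def directed(xs, ys):
        # For each unit in xs, take its best match in ys, then average.
        return sum(max(unit_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(units_a, units_b) + directed(units_b, units_a)) / 2


@lru_cache(maxsize=None)
def verse_similarity(verse_a: str, verse_b: str) -> float:
    """Verse level: built from the (cached) word-level scores."""
    return aggregate(verse_a.split(), verse_b.split(), word_similarity)


def text_similarity(text_a, text_b) -> float:
    """Text level: built from the (cached) verse-level scores."""
    return aggregate(list(text_a), list(text_b), verse_similarity)


if __name__ == "__main__":
    # Toy transliterated verses with spelling variants (illustrative only).
    text_a = ["logos theou", "charis kai eirene"]
    text_b = ["logos theoy", "xaris kai irene"]
    print(round(text_similarity(text_a, text_b), 3))
```

The caching mirrors the order described in the abstract: word scores are computed once and reused at the verse level, and verse scores are reused at the text level. Any other word-level measure or aggregation rule could be swapped in without changing the overall structure.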

Practical information

This lecture will be presented at the 15th International Conference on Flexible Query Answering Systems.

Date & time: Wednesday 6 September 2023, 12:00 pm

Location: Campus Universitat de les Illes Balears (Carretera de Valldemossa, km 7.5, Palma de Mallorca)