Nadine Beks van Raaij worked on simplifying complicated texts for her master’s thesis at the Jheronimus Academy of Data Science, specifically texts that the government sends to citizens and that most people find hard to understand.
For this purpose, Nadine developed and trained several models, ranging from a rule-based model to a large language model. These models take a difficult-to-understand text and translate it into an easy-to-understand text without losing the content of the letter (and thus its legal validity). To address the problem of legal vocabulary, which cannot simply be replaced with simpler words, a glossary was added as input. Legal terms thus remained the same but were explained in simpler language, so that readers could understand the meaning of each term and the context around it.
To compare these models, she first applied technical evaluation methods, scoring the output of the models with BLEU, BLEURT, ROUGE and LiNT. The first three are common evaluation metrics for text simplification; the last is specific to Dutch-language readability.
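To give a feel for what overlap-based scores such as BLEU and ROUGE measure, here is a minimal sketch in Python. It is a deliberate simplification and not the evaluation code used in the thesis: it only counts single words, whereas full BLEU also looks at longer word sequences and applies a brevity penalty. The function name `unigram_overlap` and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """Clipped unigram precision, recall and F1 between two texts.

    Precision is the idea behind BLEU (how much of the model output
    also appears in the reference); recall is the idea behind ROUGE
    (how much of the reference is covered by the model output).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per word, so a word
    # repeated in the candidate cannot be matched more often than it
    # occurs in the reference ("clipping").
    overlap = sum((cand & ref).values())

    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: a model output compared with a human-written
# simplification of the same sentence.
scores = unigram_overlap(
    "you must pay before 1 May",
    "you have to pay before 1 May",
)
```

Five of the six words in the model output also occur in the seven-word reference, so precision is 5/6 and recall is 5/7. Scores like these only measure word overlap, which is one reason the thesis also relied on expert interviews and a reader survey.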
She also conducted interviews with experts, both linguists and legal experts, to verify that the simplified texts are grammatically and lexically correct, contain the same content as the originals, and are therefore still legally valid.
Finally, a reader survey was conducted among 72 people from different backgrounds (education, reading hours, etc.), with the aim of representing a broad cross-section of society. She found that OpenAI’s well-known GPT model performed best relative to the other models.
The GPT model is a large language model (LLM), a type of model built specifically for language-related tasks. It was optimized through prompt engineering: describing the task to the model in a specific, carefully worded way, so that the model clearly understands the goal and can deliver the desired results. This resulted in a model that specializes in simplifying government documents.
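As a rough illustration of what prompt engineering with a glossary can look like, the sketch below assembles a prompt from task instructions, the relevant glossary entries and the letter itself. The actual prompt used in the thesis is not given here, so the wording, the function name `build_prompt` and the glossary entries are all hypothetical. The sketch stops at building the prompt text and does not call any model.

```python
# Hypothetical glossary: legal term -> plain-language explanation.
GLOSSARY = {
    "objection": "a letter in which you say you disagree with a decision",
    "surcharge": "an extra amount you pay on top of the original amount",
}


def build_prompt(letter: str, glossary: dict[str, str]) -> str:
    """Combine task instructions, relevant glossary entries and the letter."""
    # Keep only the glossary terms that actually occur in the letter,
    # so the prompt stays short and focused.
    relevant = {
        term: explanation
        for term, explanation in glossary.items()
        if term.lower() in letter.lower()
    }
    glossary_lines = "\n".join(
        f"- {term}: {explanation}" for term, explanation in relevant.items()
    )
    return (
        "Rewrite the government letter below in plain language.\n"
        "Keep every fact, amount and date unchanged.\n"
        "Do not replace the legal terms listed in the glossary; "
        "explain each one the first time it appears.\n\n"
        f"Glossary:\n{glossary_lines}\n\n"
        f"Letter:\n{letter}"
    )


prompt = build_prompt(
    "You can file an objection within six weeks.", GLOSSARY
)
```

The resulting text would then be sent to the language model. Keeping the legal terms fixed and passing their explanations along is one way to simplify the wording while leaving the legal content of the letter intact.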
The GPT model did not just perform slightly better than the other models: averaged across three different letters, it increased participants’ comprehension of the text from 60% to 90%, a significant improvement!
Thus, Nadine achieved her goal and went on to implement the best model, so that she can help as many people as possible and make a positive impact on their lives.
Are you interested in writing your thesis with us? Are you excited about the opportunity to be part of the Data Science & AI team at PNA? Learn more about our future projects!