Automation of archival processes with AI-based text classification

In an era where digitization is playing an ever-increasing role, many organizations face the challenge of storing their paper archives digitally. This transition not only offers advantages in terms of accessibility and efficiency, but is also necessary to comply with increasingly stringent regulations and laws surrounding data management and privacy.  

In his master’s thesis, Hakan developed a system that segments large paper archives into individual documents and classifies them according to predetermined rules. This is important because archives often contain some documents that must be destroyed for privacy and legal reasons, and others that must be preserved. By classifying at the document level rather than the archive level, the system can determine accurately which parts of a file should be destroyed and which should be stored. This prevents mishandling of documents and reduces both legal risks and storage costs.

After exploring possible methodologies in the literature, Hakan chose to use machine learning models and deep learning-based LLMs (Large Language Models) to classify text extracted by OCR from scanned PDF files. Two components were developed in this process: the segmentation model, which splits a file into separate, coherent documents, and the classification model, which determines for each document whether it should be destroyed or retained. In addition, a third model was developed: a rule-based information extraction model that extracts and stores important metadata from the texts, such as dates, people and organizations. This metadata supports the overall process by providing additional data needed for accurate classification.
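The two-step flow described above can be illustrated with a minimal sketch. It assumes that segmentation is framed as a per-page decision ("does this page start a new document?") and that classification runs once per segmented document. The names `segment_pages`, `classify_documents`, `toy_boundary` and `toy_classifier` are hypothetical, and the keyword rules are toy stand-ins for the trained models, not the thesis implementation.

```python
from typing import Callable, List


def segment_pages(
    pages: List[str], starts_new_document: Callable[[str], bool]
) -> List[List[str]]:
    """Group consecutive pages into documents using a per-page boundary decision."""
    documents: List[List[str]] = []
    for page in pages:
        # The first page always opens a document; later pages open a new one
        # only when the boundary predictor says so.
        if not documents or starts_new_document(page):
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def classify_documents(
    documents: List[List[str]], classify: Callable[[str], str]
) -> List[str]:
    """Classify each document on the joined text of its pages."""
    return [classify("\n".join(doc)) for doc in documents]


# Toy stand-ins for the trained models (assumptions for illustration only).
def toy_boundary(page: str) -> bool:
    return page.lower().startswith(("dear", "invoice"))


def toy_classifier(text: str) -> str:
    return "destroy" if "medical" in text.lower() else "retain"


pages = [
    "Dear Sir, regarding your medical leave ...",
    "... continued on second page.",
    "Invoice 2019 for office supplies",
]
documents = segment_pages(pages, toy_boundary)
labels = classify_documents(documents, toy_classifier)
for doc, label in zip(documents, labels):
    print(len(doc), "page(s) ->", label)
```

Keeping the two steps behind simple callables means a keyword rule, a classical model, or an LLM call could be swapped in without changing the surrounding flow.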

The segmentation and classification models were optimized to an F1 score of 0.92. The F1 score is the harmonic mean of precision and recall, so this value indicates that the system both finds most of the relevant documents and rarely mislabels them. It is not the same as saying that 92% of documents are handled correctly, but it does indicate a high degree of reliability. This helps ensure that documents are processed correctly, so that we meet legal requirements and handle our archives efficiently.
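To make the metric concrete, here is a small worked example of how an F1 score is computed from counts of true positives, false positives and false negatives. The counts are invented for illustration and are not results from the thesis. They also show that accuracy can differ considerably from F1 on the same predictions.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 50 documents should be destroyed, 950 retained.
tp, fp, fn, tn = 46, 4, 4, 946

f1 = f1_score(tp, fp, fn)
acc = accuracy(tp, fp, fn, tn)
print(f"F1 = {f1:.2f}, accuracy = {acc:.3f}")
```

With these made-up counts, precision and recall are both 46/50, so F1 comes out at 0.92 while accuracy is higher, because the many correctly retained documents count toward accuracy but not toward F1.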

The information extraction model recognizes birth dates with a cumulative accuracy of about 91%. In most cases, however, correct dates can be found with an accuracy of more than 99%. Knowing this difference allows us to determine when the more accurate method can be applied, depending on the context and the precision required.
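As an illustration of what rule-based date extraction can look like, the sketch below finds numeric dates with a regular expression and treats a date as a birth date only when a cue phrase appears shortly before it. The day-month-year format, the cue words, and the names `extract_dates` and `extract_birth_dates` are assumptions made for this example. The rules used in the thesis are not described here.

```python
import re
from datetime import date
from typing import List, Optional

# Assumed format: day-month-year with -, / or . as separator.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\b")
# Assumed cue phrases that mark a nearby date as a birth date.
BIRTH_CUE = re.compile(r"\b(born|date of birth)\b", re.IGNORECASE)


def parse_date(match: "re.Match[str]") -> Optional[date]:
    """Turn a regex match into a date, or None if it is not a real calendar date."""
    day, month, year = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None


def extract_dates(text: str) -> List[date]:
    """Return every valid date found in the text."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        parsed = parse_date(match)
        if parsed is not None:
            found.append(parsed)
    return found


def extract_birth_dates(text: str, window: int = 25) -> List[date]:
    """Return valid dates preceded by a birth cue within `window` characters."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        context = text[max(0, match.start() - window):match.start()]
        parsed = parse_date(match)
        if parsed is not None and BIRTH_CUE.search(context):
            found.append(parsed)
    return found


sample = "Date of birth: 03-11-1964. Contract signed on 15/02/1998. Invalid: 31-02-2001."
all_dates = extract_dates(sample)
birth_dates = extract_birth_dates(sample)
print(all_dates)
print(birth_dates)
```

Finding any date is the easier task. Deciding which date is a birth date depends on surrounding context, which is one plausible reason the two accuracy figures differ.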

Because of the positive results and potential impact of the segmentation and classification models, follow-up research will be conducted. This research will focus on further optimization of the models and their practical implementation to ensure that they fit seamlessly with existing systems and processes.  

Status
Complete 100%

Are you interested in writing your thesis with us? Are you excited about the opportunity to be part of PNA’s Data Science & AI team? Learn more about our future projects!
