Archives have existed for as long as we have been writing information down on paper. This also means that archiving began in a very old-fashioned way, with paper files. Today, in the digital age, working with paper files is very inefficient for organizations: someone looking for a specific file would first have to enter a room, go through all the files stored there, and pick out the right one. This takes a relatively large amount of time that could be spent on more challenging tasks. If the archive existed in a digital environment, the process would be far less demanding and easier.
A Dutch organization (which cannot be named because of the sensitivity of its work) followed this line of thought and began digitizing its archives. The specific task of captioning the images in this archive was delegated to Aurelia van den Berg, who took it on as her Bachelor's thesis project. These captions would then make the images searchable after digitization by naming the objects and details that properly and completely describe each image.
To achieve this goal, Aurelia implemented a combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model. The CNN analyzes the image and extracts its features, so that the LSTM can generate a caption from them, one most-likely next word at a time. She added an "attention mechanism" to this model, which mimics the way a person would analyze and describe an image by paying more attention to the aspects of the image that are most relevant.
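To make the idea concrete, here is a minimal PyTorch sketch of a CNN encoder feeding an LSTM decoder through additive attention. The thesis's actual architecture and training setup are not specified above, so everything below (the ResNet-50 backbone, the layer sizes, the vocabulary size) is an illustrative assumption.

```python
# Minimal sketch of the CNN encoder + attention + LSTM decoder idea.
# Assumes a pretrained ResNet-50 and additive (Bahdanau-style) attention;
# the actual thesis architecture may differ.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN encoder: extracts a grid of spatial feature vectors from the image."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the pooling and classification head; keep spatial feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                           # (B, 3, H, W)
        feats = self.backbone(images)                    # (B, 2048, h, w)
        B, C, h, w = feats.shape
        return feats.view(B, C, h * w).permute(0, 2, 1)  # (B, h*w, 2048)

class Attention(nn.Module):
    """Additive attention: weighs image regions given the decoder state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, N, feat_dim), hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)           # attention over N regions
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim)
        return context, alpha

class Decoder(nn.Module):
    """LSTM decoder: predicts the next word from the previous word + context."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, h, c):
        context, alpha = self.attention(feats, h)
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=1),
                         (h, c))
        return self.out(h), h, c, alpha           # logits over the next word

# Tiny smoke test with random data (vocabulary of 1000 hypothetical words).
enc, dec = Encoder(), Decoder(vocab_size=1000)
feats = enc(torch.randn(1, 3, 224, 224))
h = c = torch.zeros(1, 512)
logits, h, c, alpha = dec.step(torch.tensor([0]), feats, h, c)
```

At each step, the attention weights tell the decoder which image regions matter for the word it is about to produce, which is exactly the "look where a person would look" behavior described above.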
This was then compared with a very different model, the Generative Pre-trained Transformer (GPT). GPT-4o was released by the well-known OpenAI and is publicly available. It is a large language model (LLM) that uses a transformer-based architecture to analyze the images. By writing a well-crafted prompt, we could generate captions that described the images.
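As an illustration, a GPT-4o captioning call could look like the following sketch using the official OpenAI Python client. The prompt wording and file name here are invented stand-ins; the prompt actually used in the project is not reproduced above.

```python
# Hedged sketch of captioning an archive photo with GPT-4o via the OpenAI
# Python client. The prompt text and file name are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe this archival photograph in detail: "
                          "mention colors, garments and their materials, "
                          "and the background.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(caption_image("archive_photo.jpg"))  # hypothetical file name
```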
We then compared the quality of the generated captions to find out which model would work best for archival photographs. This was done using both automatic evaluation metrics and a human evaluation of a smaller set of images.
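The specific automatic metrics are not named above; BLEU is one commonly used caption metric, so as an example it could be computed against a reference caption like this (both captions shown are invented):

```python
# Illustrative automatic evaluation with BLEU; the metrics actually used
# in the thesis may have been different.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a woman in a red wool coat stands before a brick wall".split()]
candidate = "a woman in a red coat in front of a brick wall".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Metrics like this score a generated caption by its overlap with a reference caption, which is why the quality of the references matters so much (a point we return to below).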
Ultimately, we found the human evaluation more relevant, as its higher scores corresponded to the captions the Dutch organization actually wanted. These captions included specific colors, garments and their materials, as well as the full background described in detail. With such details, a user of the archive can find an image based on whatever aspect they remember.
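As a toy illustration of that search scenario, captions could simply be filtered on the details a user remembers. The file names and captions below are made up:

```python
# Toy illustration: detailed captions make the archive keyword-searchable.
captions = {
    "img_001.jpg": "A woman in a red wool coat stands before a brick wall.",
    "img_002.jpg": "Three men in dark suits pose beside a black car.",
}

def search(query: str) -> list[str]:
    """Return images whose caption mentions every word of the query."""
    words = query.lower().split()
    return [name for name, text in captions.items()
            if all(w in text.lower() for w in words)]

print(search("red coat"))  # -> ['img_001.jpg']
```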
When comparing the models, we found that GPT-4o scored highest on human evaluation, the metric that mattered most, and it was therefore selected. The other models often made errors when counting objects and naming colors, so they scored lower in the human evaluation.
Although we expected the automatic metrics to be useful, we did not find them very appropriate for this study. We suspect this was due to the reference captions against which the generated captions were compared: we found afterwards that the references were not of the quality we expected from the captions themselves. It would therefore be worthwhile to compute these metrics again with better references.
Now that a model has been selected, the next step is to optimize it for this organization, so that nothing they need in their work is left out of the captions, and then to actually apply the captioning system to the photos in their archives so that they can continue digitizing them.
Are you interested in writing your thesis with us? Are you excited about the opportunity to join PNA's Data Science & AI team? Read more about our upcoming projects!