Image caption generation

Archives have existed for as long as we have been writing information down on paper. This also means that the process started in a very old-fashioned way, with paper files. Today, in the digital age, working with paper records is very inefficient for organizations: someone looking for a specific file must first enter the room, go through all the files stored there and pick out the right one. That takes a relatively long time, which could be spent on more challenging tasks. If the archive existed in a digital environment, this process would be far less demanding.

A Dutch organization (which cannot be named because of the sensitivity of its work) followed this line of thought and began digitizing its archives. The specific task of generating captions for the images in this archive was taken on by Aurelia van den Berg as her Bachelor’s final project. After digitization, these captions would make the images more searchable by naming the objects and details that properly and completely describe each image.

To achieve this goal, Aurelia implemented a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) model. The CNN analyzes the image and extracts its features, after which the LSTM generates a sentence word by word, each time predicting the most likely next word. She added an “attention mechanism” to this model, which mimics the way a person would analyze and describe an image by focusing on the aspects of the image that are more relevant than others.
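For readers who want a more concrete picture, the sketch below shows one way a CNN encoder, an attention mechanism and an LSTM decoder can be wired together in PyTorch. It is only a minimal illustration: the backbone, dimensions and variable names are assumptions, not the exact configuration used in the project.

```python
# Minimal sketch of a CNN encoder + attention + LSTM decoder for image captioning.
# Backbone, sizes and names (vocab_size, embed_dim, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN that turns an image into a grid of feature vectors."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)                    # pretrained weights optional
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # drop pooling + classifier

    def forward(self, images):                                      # (B, 3, H, W)
        features = self.cnn(images)                                 # (B, 2048, h, w)
        return features.flatten(2).permute(0, 2, 1)                 # (B, h*w, 2048)


class AttentionDecoder(nn.Module):
    """LSTM that generates a caption word by word, attending to image regions."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)             # relevance score per region
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):                          # features: (B, N, feat_dim)
        B, N, _ = features.shape
        h = features.new_zeros(B, self.lstm.hidden_size)
        c = features.new_zeros(B, self.lstm.hidden_size)
        outputs = []
        for t in range(captions.size(1)):
            # Attention: weight each image region by its relevance to the current state.
            scores = self.attn(torch.cat([h.unsqueeze(1).expand(-1, N, -1), features], dim=-1))
            weights = torch.softmax(scores, dim=1)                   # (B, N, 1)
            context = (weights * features).sum(dim=1)                # (B, feat_dim)
            word = self.embed(captions[:, t])                        # teacher forcing during training
            h, c = self.lstm(torch.cat([word, context], dim=-1), (h, c))
            outputs.append(self.fc(h))                               # logits over the next word
        return torch.stack(outputs, dim=1)                           # (B, T, vocab_size)
```

During training the decoder learns from reference captions; at inference time it starts from a start token and repeatedly feeds its most likely next word back in, which is how the sentence is built up.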

This was then compared to an entirely different model, the Generative Pre-trained Transformer (GPT). GPT-4o is published by the well-known OpenAI and is publicly available. It is a large language model (LLM) built on the transformer architecture that can also analyze images. By writing a well-crafted prompt, we were able to generate captions that described the images.
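As an illustration, such a prompt-based caption request to GPT-4o could look roughly like the sketch below, using OpenAI's Python client. The prompt wording and the file name are hypothetical examples, not the exact prompt used in the project.

```python
# Hedged sketch: asking GPT-4o for an image caption via the OpenAI API.
# The prompt text and file name are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this archival photograph in one detailed sentence, "
                         "mentioning visible objects, colours, garments and background."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


print(caption_image("photo_0001.jpg"))  # hypothetical file name
```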

We then compared the quality of the generated image captions to figure out which model would work best for archival photographs. This was done using both automatic evaluation metrics and a human evaluation of a smaller set of images.
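Automatic metrics of this kind generally work by comparing a generated caption with a human-written reference caption. As an illustrative example (the specific metrics used here are an assumption), a BLEU score can be computed with NLTK:

```python
# Illustrative only: BLEU as one common automatic caption metric.
# The reference and candidate sentences are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man in a dark uniform stands in front of a brick building".split()]
candidate = "a man in uniform standing before a brick building".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```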

In the end, we found the human evaluation more relevant, because higher scores there corresponded to the captions that were most useful to the Dutch organization. These captions included specific colors, garments and their materials, as well as a detailed description of the background. Using such details, a user of the archive can find an image based on the aspects they remember.

When comparing the models, we found that the GPT-4o model scored highest in the human evaluation, the measure that mattered most, and it was therefore selected. The other models often made mistakes when counting objects or naming colors, so they scored lower in the human evaluation.

Although we expected the technical metrics to be useful, they turned out not to be very appropriate for this study. We suspect this was due to the reference captions against which the generated captions were compared: on closer inspection, the references were not of the quality we expect from the captions themselves. It would therefore be worthwhile to recalculate these metrics with better references.

Now that the model has been selected, the next step would be to optimize it for this organization, so that nothing they need in their work is left out of the captions, and then to actually apply the captioning system to the photos in their archives so that they can continue digitizing them.


Are you interested in writing your thesis with us? Are you excited about the opportunity to be part of PNA’s Data Science & AI team? Learn more about our future projects!
