How we used gpt-4o for image detection with 350 very similar, single-image classes.

Reading time: 12 min

This story recounts a challenging request that emerged in our small engineering team and how we solved it. The final solution demonstrates how LLMs have shifted what product-oriented teams can achieve with AI.

× × ×

AskMona is a small company with an even smaller engineering team, striving to provide AI solutions to niche markets. We originally specialized in cultural institutions like museums and foundations but have since expanded into tourism and education.

One aspect of our product involves using computer vision to match points of interest—such as artworks in exhibitions—with personalized experiences. Thanks to our reputation for innovation in the cultural field, a museum approached us with a specific request:

They had a large collection of car illustrations printed on exhibition walls and needed an app to match a picture of a car to its related content - identifier, image and links to more information. They also had a lean budget, so maximizing efficiency in infrastructure and maintenance was crucial.

The contract was signed before my arrival, but I know that our product team had initially planned to deliver an augmented reality experience via a web app, leveraging the image-based tracking abilities of that stack. This was not something we had in our main product line, but the tech lead at the time felt it was the way forward. A specialized AR partner, an established company in the field, had even been identified to help us unblock any potential difficulties. The project was then shelved for later development as other matters took priority.

When I joined as CTO months later, we prepared for the first delivery milestones. At this point I had a detailed look at the images we had been given, and I remember my sudden reaction: "But they all look the same."

The provided images were the digital originals used for printing the wall display, laid out in 6 or 7 staggered rows. On top of that, we had just a few pictures of the actual wall, but it was clear that visitors' snapshots would suffer from distortion, lighting and color variation, and shadows.

Quickly, it became apparent that the client wouldn't have time to capture additional real-life snapshots of the illustrations on the wall. And to complicate matters, the museum was over a thousand kilometers away, in another country.

That was the challenge we faced.

× × ×

I knew from previous experience that web-AR technology has limitations. Having seen the images, I felt doubtful about the approach; this was confirmed by our initial proof of concept. I quickly turned to our AR partner for help, providing a detailed view of our use case for feasibility confirmation.

The response came back: They recommended against the approach, arguing that even their battle-tested technology could not handle such a case.

For instance, their system would simply not accommodate 350 detection markers at once. On top of that, the size of their dependency bundle alone would make our app's user experience unwieldy. And finally, the very similar-looking pieces would likely confound the system.

Suddenly, with a looming deadline, we found ourselves without a viable plan. Fortunately, our customer success team negotiated an extension with our client, but we still needed a solution to justify the delay.

And so we went to work.

× × ×

Our small team, though primarily product-focused, has some ML background. Before pivoting to an LLM-based solution, we had built our own NLP models. So our first step was to experiment with training an on-device image classification model, with MobileNet emerging as the ideal candidate.

MobileNet is a lightweight image classification model, pre-trained on a large dataset yet optimized for mobile devices. Its architecture is simple and compact, making it easy to load and run in the browser's JavaScript through the ONNX runtime, which in turn meant very light infrastructure and maintenance effort, a requirement for the project.

To leverage the model's training while specializing it for our challenging use case, we tried transfer learning — a classic strategy where only the final layers of a neural network are trained on the specific dataset, while the model carries the general knowledge acquired during pre-training. However, transfer learning typically requires hundreds of images per class, and we only had one. To address this limitation, we turned to data augmentation, artificially creating new versions of each image by modifying colors, adding noise, applying distortion, or rotating images. By the end, we had generated 600 augmented images per car.
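For illustration, here is a minimal sketch of that augmentation plus transfer-learning setup in Python with PyTorch/torchvision. The specific transforms, their parameters, and the MobileNet variant are assumptions for the example, not our exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_CLASSES = 350              # one class per car illustration
AUGMENTATIONS_PER_IMAGE = 600  # augmented copies generated per original

# Augmentation pipeline: illustrative parameters, not our exact settings.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomPerspective(distortion_scale=0.4, p=0.8),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.GaussianBlur(kernel_size=5),
    transforms.ToTensor(),
])

def augmented_samples(image_path: str, label: int):
    """Yield (tensor, label) pairs derived from a single original image."""
    original = Image.open(image_path).convert("RGB")
    for _ in range(AUGMENTATIONS_PER_IMAGE):
        yield augment(original), label

# Transfer learning: keep the pre-trained backbone frozen and
# train only a new classification head for our 350 classes.
model = models.mobilenet_v2(weights="IMAGENET1K_V1")  # weights argument depends on torchvision version
for param in model.parameters():
    param.requires_grad = False
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.classifier[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```

The fine-tuned model can then be exported to ONNX (via torch.onnx.export) and shipped to the browser as a static file, which is what kept the infrastructure footprint so small.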

Our team experimented with various training parameters until we arrived at a model that seemed promising. Early tests with a few real-life snapshots were encouraging. I felt cautiously optimistic — proud, even — of our resourceful solution.

Yet, more extensive testing revealed inconsistent results. Multiple snapshots of the same car often yielded different matches. The client’s initial trials with our alpha app displayed similar issues, as the model struggled to identify the correct car illustration consistently.

To improve reliability, we implemented some user-guiding features, such as a viewfinder to align the camera with the cars and a multi-snapshot background process to gather additional data. Despite these improvements, the solution wasn’t reliable enough for public use.

At this point, the client started to express concerns about our ability to deliver.

× × ×

Meanwhile, part of the team was working on enhancing our main product's image recognition pipeline. A few words on our method.

To do image matching we use nearest neighbor search (or KNN). This works by encoding the catalog of images into embeddings - a translation of the image produced by some pre-trained model, mapping "features" and meaning into a huge numerical representation. When a candidate image comes in from one of our users, we convert it into an embedding too. Then, we run a search for the nearest vectors in our catalog and return the closest matches. This method shines because most of the work is done at indexing time, when converting the images. But it also means the system relies on the embedder's quality.
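As a rough sketch of the search step itself (assuming the catalog embeddings are already computed and L2-normalized; all the names here are illustrative):

```python
import numpy as np

# catalog_embeddings: (N, D) matrix, one row per indexed image, L2-normalized.
# catalog_ids: list of N identifiers, aligned with the rows above.

def top_k_matches(query_embedding: np.ndarray,
                  catalog_embeddings: np.ndarray,
                  catalog_ids: list[str],
                  k: int = 3) -> list[tuple[str, float]]:
    """Return the k nearest catalog images by cosine similarity."""
    query = query_embedding / np.linalg.norm(query_embedding)
    similarities = catalog_embeddings @ query  # cosine similarity, since rows are normalized
    best = np.argsort(similarities)[::-1][:k]
    return [(catalog_ids[i], float(similarities[i])) for i in best]
```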

Every provider is releasing multimodal embedding services these days, but at the time image embedding was not readily available as a serverless service, so our system used a self-hosted VGG16, an image classification model, to convert the pictures to vectors. This worked fine for museum collections, which are usually diverse-looking, but the embedding quality was limited, and we did not think it would work reliably with the very similar images of our car project.

The release of the AWS Titan multimodal model, alongside an image embedding endpoint, changed the situation: such a large model, jointly trained on text and images, would map finer features from our images. Eager to simplify our architecture, lower our costs, and benefit from better-quality embeddings, we migrated our main pipeline to use it. Pleased with the results, we thought - why not give it a try on our car problem?
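For reference, calling the Titan multimodal embedding model through Amazon Bedrock looks roughly like this sketch. The model identifier and region are assumptions to verify against the Bedrock documentation, not a statement of our exact setup.

```python
import base64
import json

import boto3

# Assumed model id for Titan Multimodal Embeddings; check your Bedrock console.
MODEL_ID = "amazon.titan-embed-image-v1"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is illustrative

def embed_image(image_path: str) -> list[float]:
    """Return the Titan embedding vector for a single image file."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"inputImage": image_b64}),
    )
    return json.loads(response["body"].read())["embedding"]
```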

The proof of concept was quickly implemented. Most of the work could be preloaded on the user's device; we only needed to fetch the embeddings for the incoming image. Initial results felt a lot more reliable; some cars were matched correctly time and again across various angles and lighting conditions. But on the other hand, some cars would never be matched, the system always offering another, very similar one in their place.

× × ×

We communicated progress to the client and even sent a team member to conduct real-life tests and gather more pictures of the actual setup. This confirmed our initial findings of stability combined with partial success, bringing hope and despair in equal measure. But we took the time to analyze the results, and we found something interesting.

When detection failed, the proper match often sat in second or third position, with only a small distance from the first. Our solution suddenly felt just an inch shy of working.

We proposed mitigating this by offering two or three options to the user when candidates were close. If presented neatly, this could complete the feature and visitors would be able to identify the cars with a little natural intelligence on top of the AI.

However, this approach would lose the magical effect of having a machine identify what was on the screen. Understandably, the client ultimately dismissed the proposal.

With other pressing work on our main product, it felt we had given this project all we could. It was time to move on.

× × ×

But the story wasn't over.

After all, we were so close. From three hundred fifty very similar images, we were down to just three - a worthy achievement. The only thing we needed at this point was to find "something" to identify the correct one.

There was a lot of activity at that time around SOTA multimodal LLMs doing incredible things with images. People could do handwriting recognition or get segmented bounding boxes from their prompts. Some talked about moving OCR pipelines to vision LLMs.

So we decided to give it a try: why not prompt a vision model, like gpt-4o, to do this last step for us? As a prompt, we opted for one user message per image - the three candidates and the reference - each containing the image data and some identifier text, plus one last user message to instruct the model. The output is the image identifier, or a specific code if a match isn't found.
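In practice the call looks roughly like the following sketch with the OpenAI Python SDK. The prompt wording, identifiers, and the NO_MATCH sentinel are illustrative, not our production prompt.

```python
import base64

from openai import OpenAI

client = OpenAI()

def as_image_part(image_path: str) -> dict:
    """Encode an image file as a low-detail image part for the chat API."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "low"}}

def pick_match(candidates: list[tuple[str, str]], reference_path: str) -> str:
    """candidates: (identifier, image_path) pairs coming out of the KNN step."""
    messages = []
    # One user message per candidate, carrying its identifier and image data.
    for identifier, path in candidates:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": f"Candidate {identifier}:"},
                                     as_image_part(path)]})
    # The visitor's snapshot, then the final instruction.
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": "Photo taken by the visitor:"},
                                 as_image_part(reference_path)]})
    messages.append({"role": "user",
                     "content": [{"type": "text", "text":
                                  "Which candidate shows the same car as the visitor's photo? "
                                  "Answer with the candidate identifier only, "
                                  "or NO_MATCH if none of them matches."}]})
    response = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0)
    return response.choices[0].message.content.strip()
```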

And suddenly, we had a really good solution.

I got to test it again and again. Obviously, it is not perfect; the software will still get mixed up in some cases (the twin cars with a “4” on the body, from the earlier image, are among them; they're just too similar). However, in the vast majority of cases, the results felt like magic.

The client was pleased with the results and greenlit the development of the web app supporting the feature.

This solution proved so effective that we migrated our main image matching service to use it. It resolved the long-standing problem of assigning meaning to similarity scores (how close must embeddings be for a match?). Previously, we depended on complex heuristics and a trial-and-error approach, which resulted in false positives. Now, for any ambiguous case, a multimodal model can verify potential matches.

× × ×

In summary, our solution filters the images with a KNN search based on image embeddings, then feeds the top candidates to an LLM to check for a match when any doubt remains - a setup not unlike the text retrieval of our chat products, based on a KNN search and a reranking/filtering step leveraging an LLM.
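Tying the earlier sketches together, the whole pipeline fits in a few lines. Here embed_image, top_k_matches, and pick_match are the illustrative helpers sketched above, catalog_paths is an assumed mapping from identifier to image file, and the ambiguity threshold is a made-up value, not our tuned one.

```python
AMBIGUITY_THRESHOLD = 0.03  # illustrative: how close the top scores must be to trigger the LLM check

def identify_car(snapshot_path: str) -> str:
    query_embedding = np.array(embed_image(snapshot_path))
    ranked = top_k_matches(query_embedding, catalog_embeddings, catalog_ids, k=3)

    best_id, best_score = ranked[0]
    runner_up_score = ranked[1][1]

    # Clear winner: trust the embedding search alone.
    if best_score - runner_up_score > AMBIGUITY_THRESHOLD:
        return best_id

    # Ambiguous: let the vision LLM arbitrate between the close candidates.
    candidates = [(identifier, catalog_paths[identifier]) for identifier, _ in ranked]
    return pick_match(candidates, snapshot_path)
```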

While this implementation is slower than our previous KNN-only setup, the results have been exceptional. We have yet to experiment with faster, smaller models. Costs are somewhat higher, but the low-resolution image input (initially downscaled to 512x512) has proven adequate for our use case while minimizing token count. A full prompt for image matching costs around 0.0001 USD with gpt-4o.

More generally, this small adventure illustrates how large language models are changing the dynamics between engineering, product, and AI. It reminds me of how cloud computing and commoditized tooling moved complex processes - that used to require specialized teams - into the hands of generalist product engineers. In the same way, using AI in a product does not require specific engineers anymore, allowing teams to focus on framing domain problems, understanding users, and building products.

LLMs and the platforms powering them are quickly becoming one-stop shops for any ML-related task. From my perspective, the real revolution is not the chat ability or the knowledge embedded in these models, but rather the versatility they bring in a single system. We experience this at our company: while we do offer chat-based experiences in some of our products, most of the interesting ways we use LLMs are similar to the example presented in this article. Small, behind-the-curtain tool use, simple agentic workflows. Text categorization, OCR, document analysis and extraction, search result reranking, recommendations, image matching, and so on.

As a parting word, I'd say that I am really excited about the future. Not for the ever-larger models and the mythical allure of AGI, but for the optimization phase that is already under way, with models getting smaller and smaller for the same quality. Having access to this kind of tooling on a regular server-grade CPU, or even on users' devices, opens up whole new playgrounds for crafting products.