In my last few posts, I’ve been showing small examples of what’s possible when agents like Claude can work with your Gumnut photo library — curating albums, building apps, and so on. Those demos make it look easy, but that’s because there’s a sophisticated photo intelligence layer under the API. Over my next few posts I want to walk through what that layer does, from face detection and clustering to semantic search and more.
For this post, I’m starting with photo and video descriptions.
The problem
In early versions of Gumnut, any agent that wanted to understand the content of a photo had to download the raw file and run it through a vision model itself. Every decision — “is this worth showing the user?”, “which of these 50 matches is actually what they asked for?” — required the agent to pull down and process the asset. For a typical library of tens or hundreds of thousands of photos, that’s prohibitively slow and expensive.
What I observed in practice was worse: agents would often get lazy and guess a photo’s content from metadata like the filename, EXIF data, or album name, then base all of their follow-up responses on that speculation.

Early screenshot of Gumnut in ChatGPT from September 2025. It’s a pretty accurate guess based on the GPS coordinates, but ChatGPT didn’t actually look at the photos.
What those agents lacked was a high-level, token-efficient representation of each photo’s content — something they could consume alongside the rest of an asset’s metadata — without downloading the full file.
The solution
I solved this in Gumnut by pre-processing every photo and video with a vision-language model (VLM) to produce a short prose description, which I expose as a top-level description field on every asset. It’s returned by the standard Asset API, so every client — the MCP server, the web app, the SDKs, the Immich adapter — gets it automatically. With this additional metadata, agents can work as if they’ve opened and inspected every asset in your Gumnut photo library.

Recent screenshot of Gumnut in Claude. Includes much more detail about my weekend.
The descriptions target around 120 output tokens — detailed enough to capture the scene, subjects, lighting, and composition, but short enough that they don’t take over your agent’s context window. Here’s an example of what the production pipeline generates for a photo of my cat sitting in the kids’ toy wagon:

“A gray tabby cat with striking green eyes sits snugly inside a bright orange plastic toy wagon, its body curled slightly as it gazes directly at the camera. The wagon, branded "Green Toys," rests on a light wood floor, surrounded by colorful children’s toys — including a yellow dump truck, red fire engine, and blue-and-white camper — softly blurred in the background. The shallow depth of field keeps the cat and wagon sharply focused while diffusing the surrounding playthings into warm, pastel-toned shapes. Indoor lighting casts gentle shadows and highlights the texture of the cat’s fur and the smooth curves of the plastic.”
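To make the shape concrete, here’s roughly how a description sits alongside the rest of an asset’s metadata (the field names below are illustrative, not the exact Asset API schema):

```python
# Illustrative only: hypothetical field names, not the exact Asset API schema.
asset = {
    "id": "asset-123",                   # hypothetical asset ID
    "filename": "IMG_4302.jpg",
    "taken_at": "2025-09-14T10:42:00Z",
    "width": 4032,
    "height": 3024,
    "description": (
        "A gray tabby cat with striking green eyes sits snugly inside a "
        "bright orange plastic toy wagon..."  # truncated for brevity
    ),
}
```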
Constraints
Generating descriptions is straightforward in concept, but challenging once I added some important constraints:
Cost efficiency
Gumnut is currently free because I haven’t prioritized the time to build any subscription or billing logic, but eventually it does need to cover its costs, including hosting and asset processing. At a naïve $0.01/asset, processing a single 1 TB library of typical 4 MB photos would cost ~$2,500. I wanted the cost to be at least an order of magnitude lower than that.
For planning, I settled on a ceiling of ~$50 to describe a 1 TB / 250,000-asset library (a realistic “typical user” point). That works out to around $0.0002 per photo, and is low enough that the cost can be recovered in under a year of a typical photo storage subscription plan.
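Spelled out as arithmetic:

```python
# Back-of-the-envelope numbers for a "typical" 1 TB library of ~4 MB photos.
assets = 1_000_000_000_000 // 4_000_000      # ~250,000 assets

naive_cost = assets * 0.01                   # ~$2,500 at a naive $0.01/asset
budget = 50.0                                # target ceiling for the whole library
per_asset_budget = budget / assets           # ~$0.0002 per photo

print(f"{assets:,} assets | naive ${naive_cost:,.0f} | budget ${per_asset_budget:.4f}/photo")
```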
Zero data retention
Description generation sends user photos and videos to third-party inference providers. That’s a trust boundary I take seriously. Any provider I use has to be contractually committed to not training on or retaining user data beyond what’s required for the single inference request. For OpenRouter, that means enforcing zdr: true on every request, which hard-routes to ZDR-compliant upstreams (Venice, Together) and returns HTTP 503 rather than silently falling back to a non-compliant provider. For Google’s Gemini API, I use the paid tier, which doesn’t train on API data and only retains prompts for a short abuse-monitoring window.
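Concretely, a ZDR-enforced request looks something like this. This is a minimal sketch: the model slug and prompt are illustrative, and it assumes the zdr flag sits in OpenRouter’s provider-preferences object, consistent with the behavior described above.

```python
import base64
import os
import requests

# Minimal sketch of a ZDR-enforced request to OpenRouter's OpenAI-compatible
# chat completions endpoint. Model slug and prompt are illustrative.
with open("photo.jpg", "rb") as f:
    photo_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "qwen/qwen3.5-9b",    # illustrative slug for the image model chosen below
        "provider": {"zdr": True},     # assumed placement of the zdr flag: route only to ZDR upstreams
        "max_tokens": 150,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this photo in roughly 120 tokens."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
            ],
        }],
    },
    timeout=60,
)
# No ZDR-compliant upstream available -> HTTP 503, not a silent fallback.
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```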
Accuracy
By accuracy, I mean that the description correctly captures what’s actually in the image — the scene, subjects, actions, setting — and doesn’t describe things that aren’t there.
I leaned on two VLM benchmarks as proxies: RealWorldQA (practical understanding of everyday scenes) and HallusionBench (resistance to hallucinating visual content). Classic image captioning benchmarks like COCO Captions turned out to be a poor fit — they test ~15-word captions and are saturated by every modern model. The most directly relevant benchmarks (CapArena-Auto, DeCapBench) didn’t have scores for most of the 2026 models I was evaluating, so I also ran an empirical evaluation on a sample of representative photos from my own library.
Benchmarks helped narrow my model choice, but they didn’t prevent another failure mode: the model confidently narrating things it cannot actually see. I learned this the hard way on the same cat photo above, when the response to an earlier version of my prompt included this sentence:
"The composition centers the cat as the quiet protagonist amid playful chaos, evoking a sense of calm curiosity and whimsical domesticity."
That sentence is charming, but it’s also a hallucination. The model can see a cat in a wagon; it cannot see “quiet protagonist” or “whimsical domesticity.” So the production prompt explicitly instructs the model to state observable facts and to avoid interpreting mood, narrative, or artistic intent. Lighting, focus, and composition are in; whimsy, nostalgia, and tension are out. This enables a downstream agent to draw its own interpretations from a grounded description.
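The production prompt itself isn’t reproduced here, but the grounding instruction is in this spirit (illustrative wording only):

```python
# Illustrative excerpt only, not the actual production prompt.
GROUNDING_INSTRUCTION = (
    "Describe only what is visually observable: subjects, actions, setting, "
    "lighting, focus, and composition. Do not interpret mood, narrative, "
    "symbolism, or artistic intent, and do not describe anything you cannot "
    "see in the image."
)
```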
Token efficiency
The description lives in asset metadata and is returned to every API consumer, so bloating it would hinder an agent’s ability to fit many assets into its context. To determine the right length, I ran a qualitative comparison of descriptions generated at different target lengths, and ultimately landed on 100-150 tokens as a range that balances token efficiency with capturing useful detail.
For some photos, that was far more than necessary; for others, it wasn’t quite enough. I’ll probably continue to tune the prompts for brevity in the future; if an agent wants to “see” more detail, it can always download the original file.
What I chose, and why
Benchmarks and provider pricing narrowed the candidate set; an empirical evaluation on a sample of my own photos picked the winner. The shortlist spanned three categories: cheap open-weight models (Qwen, Gemma, Llama), frontier-provider cheap tiers (Gemini Flash-Lite, GPT-5.4 nano/mini), and frontier models as a quality ceiling.
For images, the production default is Qwen3.5-9B via OpenRouter with ZDR enforced. I landed on this for three reasons:
Vision benchmarks. Qwen’s own model card reports MMStar 79.7, RealWorldQA 80.3, and HallusionBench 69.3 — meaningfully ahead of every other cheap-tier model with vision capabilities I shortlisted, including GPT-5.4 nano (the closest OpenAI analog).
Cost. OpenRouter prices Qwen3.5-9B at $0.10/$0.15 per 1M input/output tokens, significantly cheaper per photo than GPT-5.4 nano’s $0.20/$1.25. Production images are downscaled to 1024 px on the longest side before submission, which helped land the per-library cost comfortably under the $50 budget I set (rough per-photo arithmetic after this list).
Portability. Qwen3.5 is Apache 2.0. If OpenRouter’s economics change, or if ZDR guarantees shift, I can move to Together AI (cheaper batch pricing, first-party SOC 2 / HIPAA), DeepInfra (ZDR by default), or self-host on a single L40S or A100. Nothing in the design ties me to a single provider.
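Here’s the rough per-photo arithmetic referenced above. The input-token count is an assumption for illustration (roughly what a downscaled 1024 px image plus a short prompt might cost), not a figure from my measurements:

```python
# Rough per-photo cost check for Qwen3.5-9B via OpenRouter.
# The input-token count is an assumed figure for illustration, not measured.
input_tokens = 1_500              # assumed: one downscaled 1024 px image + prompt
output_tokens = 120               # target description length

input_price = 0.10 / 1_000_000    # $ per input token
output_price = 0.15 / 1_000_000   # $ per output token

per_photo = input_tokens * input_price + output_tokens * output_price
per_library = per_photo * 250_000
print(f"${per_photo:.6f}/photo -> ${per_library:.2f} per 250k-asset library")
# Roughly $0.00017/photo and ~$42/library under these assumptions, below the $50 ceiling.
```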
For videos, the default is Gemini 2.5 Flash-Lite via Google’s paid tier. This was a different calculus: the deciding factor wasn’t benchmark scores but the fact that Gemini is the only cost-effective option with native video understanding. Every other candidate in the price tier would have forced me to build and maintain a frame-extraction pipeline — ffmpeg, keyframe selection heuristics, metadata compensation when container data gets stripped. Gemini’s Files API lets me upload the raw video and get a coherent description of motion, audio, and temporal context in one call.
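A minimal sketch of that flow with the google-genai Python SDK (the prompt wording and filename are illustrative):

```python
import time
from google import genai

client = genai.Client()  # reads the API key from the environment

# Upload the raw video; no local frame extraction needed.
video = client.files.upload(file="weekend_clip.mp4")

# The Files API processes the upload before it can be referenced.
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=[video, "Describe this video in roughly 120 tokens, stating only observable content."],
)
print(response.text)
```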
This does mean I run two providers. I accepted that trade-off rather than constrain model selection to the subset of providers that cover both media types well.
Designed to outlive any single model choice
A few pieces of the architecture are worth calling out, because they turned out to matter more than I initially expected — and all of them are about not getting locked into today’s model landscape.
Configurations are first-class, immutable database records. Each description generation config stores the model name, provider, full prompt text, temperature, max tokens, and media type. Descriptions reference the config that produced them. This means I can compare outputs from different configs side-by-side on the same asset without a bespoke eval harness — just run a new config over the sample set and diff. Configs are immutable after creation, so every description record is a faithful snapshot of how it was produced. Changing the prompt means creating a new config, not mutating an old one.
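Schematically, a config record looks something like this (hypothetical field names, not the actual schema):

```python
from dataclasses import dataclass

# Hypothetical shape of an immutable description-generation config.
@dataclass(frozen=True)  # frozen: a config is never mutated after creation
class DescriptionConfig:
    id: str
    model: str            # e.g. "qwen/qwen3.5-9b"
    provider: str         # e.g. "openrouter"
    prompt: str           # full prompt text, snapshotted verbatim
    temperature: float
    max_tokens: int
    media_type: str       # "image" or "video"
```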
At most one active description per asset. No more than one description is marked is_active at any time, and that’s the one the API returns. When a better model comes out, I can generate descriptions under a new config in the background, then flip the active flag in bulk — per-asset, so if the new model refuses on a photo, the old description stays active for that one asset.
Refusals are a distinct output state, not a failure. If a model refuses to describe an asset (safety filter, content policy), I record that as a refusal — description=null, is_refused=true — rather than retrying or throwing. The API returns "" (empty string) to distinguish “model refused” from “description hasn’t been generated yet” (which returns null).
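Putting those rules together, a description record and the API’s handling of it might look roughly like this (again hypothetical, not the actual schema):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical description record; an asset may have many, at most one active.
@dataclass
class AssetDescription:
    asset_id: str
    config_id: str               # the immutable config that produced it
    description: Optional[str]   # None when the model refused
    is_refused: bool
    is_active: bool

def serialize_description(descriptions: list[AssetDescription]) -> Optional[str]:
    """What one asset's description field would surface to API consumers."""
    active = next((d for d in descriptions if d.is_active), None)
    if active is None:
        return None   # no description generated yet
    if active.is_refused:
        return ""     # model refused: distinct from "not generated yet"
    return active.description
```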
Planned extensions
Feeding descriptions into search. Gumnut’s semantic search today is essentially a single CLIP embedding per image — fine for short queries like “sunset on the beach,” weaker on the longer compositional ones users sometimes type, where CLIP’s text encoder has known limits. The descriptions unlock a different shape of search: a hybrid pipeline where the description text retrieves alongside the image embedding, and structured metadata filters compose with both. That’s a meaty enough topic to deserve its own post once the implementation is further along — for now, the relevant point is that good descriptions were a prerequisite for moving past single-vector search at all.
Riding the model curve. Models keep getting smaller, cheaper, and more accurate. The immutable-config architecture above is built for this — I expect to re-evaluate cloud candidates periodically, swapping in a new config when something meaningfully better lands. The more interesting question is how far the same curve pushes description generation toward the client: once a mobile NPU can run a 3–9B VLM without tanking battery life, the client could generate descriptions on upload directly. The usual wrinkles apply — battery, thermal, and the fact that “the client” is a thousand devices with wildly different inference characteristics.
What’s next
These blog posts take a while to write, so I hope you find them interesting and helpful! As always, feel free to reach out if you’d like to chat.

