In my last post, I started walking through Gumnut's photo intelligence layer with photo and video descriptions. This time I want to tackle the piece I get asked about most: figuring out who is in a photo.

Many of the photos in my library are of my family, and "who" turns out to be the single most important thing to get right. It makes a big difference whether the child in a photo is my kid or someone else's that I don't know. Almost everything I'd want to do with my photos later — find pictures of one of my kids, build an album for a grandparent, let an agent pick the best shot of a particular person — depends on the library knowing who's who.

The textbook approach

At a basic level, this problem looks solved. The standard recipe has three steps:

Detect, embed, compare. Simple in principle, more challenging in practice.

Detect faces in an image with a face detection model. Gumnut currently uses RetinaFace, which returns a bounding box and several facial landmarks for each face it finds.
Embed each face with a face recognition model. Gumnut currently uses Facenet512, which turns a face crop into a 512-dimensional embedding — a list of 512 numbers that captures what the face looks like. The useful property is that two photos of the same person produce embeddings that sit close together, and two different people sit far apart.
Compare embeddings with a distance function. Gumnut uses cosine distance, where 0 means identical and the number grows as two faces look less alike. If two faces are close enough, call them the same person.

That's the whole idea, and on a clean benchmark it works extremely well — these models clear 99%+ on the standard academic face-recognition tests.
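Here is a minimal sketch of the compare step in plain Python. The function names and the 0.3 cutoff are illustrative assumptions, not Gumnut's actual code or settings, and the toy vectors have 2 dimensions instead of 512.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for vectors pointing the same way, growing as they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def same_person(a, b, threshold=0.3):
    """Hypothetical decision rule: call two faces the same person if they are close enough."""
    return cosine_distance(a, b) <= threshold


# Toy "embeddings": two similar faces and one very different face.
face_1 = [1.0, 0.0]
face_2 = [0.9, 0.1]
face_3 = [0.0, 1.0]

print(same_person(face_1, face_2))  # close together
print(same_person(face_1, face_3))  # far apart
```

Cosine distance only looks at the direction of each vector, not its length, which is why scaling an embedding does not change the result.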

Apparently my family isn't textbook

Back in middle school, one of my friends jokingly created something he called the "Ted Mao scale of siblic identicality" — basically saying that my siblings and I looked identical as far as he was concerned. There must be something to it though, because when I pointed the basic algorithm at a dataset of my 3 kids and their 5 cousins, the textbook approach completely broke down:

The blurry boundary between siblings and their cousins

Siblings and cousins look alike. Some of my kids and their cousins resemble each other strongly enough that even a sophisticated model can't reliably tell them apart — their average embeddings land close together in that 512-dimensional space.
Accessories get in the way. Hats, sunglasses, and masks hide the parts of the face the embedding leans on. A great picture of my kid in ski goggles is, to the model, a much weaker signal than the same kid looking straight at the camera.
Angles and expressions get in the way too. Faces caught in profile, mid-laugh, or contorted into a deliberate goofy face all push the embedding away from that person's "normal" — sometimes far enough to look like a different person entirely.

Honestly, when the kids start wearing each other's clothes and accessories, I can't always tell them apart either, so I'm sympathetic to the model. So I've spent a lot of time building a face recognition pipeline that holds up under these conditions instead of folding.

The pipeline

Every image runs through three stages. The first two happen at the time you upload a photo; the third runs across your whole library periodically.

Stage 1: detect and embed

When a photo arrives, Gumnut normalizes it and detects the faces in the photo. I throw out tiny background-crowd faces, like the strangers behind your kid at the playground or in the crowd at a concert. They're rarely who you care about, they're low quality, and filtering them out first keeps the clutter down. Gumnut uses a combination of face size, focus, and sharpness to decide who is the subject of a photo and who is just in the background. Then it embeds each face it keeps into a 512-dimensional vector.
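A rough sketch of what that kind of filter could look like. Everything here is an assumption for illustration: the `DetectedFace` fields, the single precomputed `sharpness` score standing in for focus and sharpness, and both thresholds.

```python
from dataclasses import dataclass


@dataclass
class DetectedFace:
    width: int        # bounding-box width in pixels
    height: int       # bounding-box height in pixels
    sharpness: float  # hypothetical 0..1 score, assumed computed upstream


def keep_subject_faces(faces, image_width, image_height,
                       min_area_fraction=0.01, min_sharpness=0.3):
    """Keep faces that are both large enough relative to the image and sharp enough."""
    image_area = image_width * image_height
    kept = []
    for face in faces:
        area_fraction = (face.width * face.height) / image_area
        if area_fraction >= min_area_fraction and face.sharpness >= min_sharpness:
            kept.append(face)
    return kept


faces = [
    DetectedFace(400, 500, 0.8),  # large and sharp: likely a subject
    DetectedFace(40, 50, 0.9),    # sharp but tiny: background crowd
    DetectedFace(400, 500, 0.1),  # large but blurry
]
subjects = keep_subject_faces(faces, image_width=4000, image_height=3000)
print(len(subjects))
```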

Stage 2: assign the face to someone you know

Once a face is embedded, Gumnut tries to attach it to a person it already knows about. Each person has a centroid — a weighted average of all the face embeddings collected for them so far. Assigning a new face is theoretically a matter of finding the nearest centroid.
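As a sketch, the naive version is only a few lines. What the weights represent is an assumption here (face quality is one plausible choice), and the person IDs and vectors are made up.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def weighted_centroid(embeddings, weights):
    """Weighted average of a person's face embeddings, one dimension at a time."""
    total = sum(weights)
    dims = len(embeddings[0])
    return [
        sum(w * emb[d] for emb, w in zip(embeddings, weights)) / total
        for d in range(dims)
    ]


def nearest_person(face, centroids):
    """Return (person_id, distance) for the centroid closest to this face."""
    return min(
        ((person_id, cosine_distance(face, centroid))
         for person_id, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )


centroids = {
    "kid_a": weighted_centroid([[1.0, 0.0], [0.0, 1.0]], weights=[3.0, 1.0]),
    "kid_b": [0.0, 1.0],
}
print(nearest_person([1.0, 0.1], centroids))
```

Note that this always returns someone, even when the face is far from everyone.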

In practice, I had to layer on additional rules in order to achieve accurate matching, so a face only gets assigned when it clears a few gates:

Distance: the face has to be genuinely close to that person's centroid, not just closer to them than to anyone else.
Cohesion: the candidate person's existing faces have to be internally consistent. If a person is already a bit of a muddle, I refuse to add to them rather than make the muddle worse. Similarly, a person needs a minimum number of faces before they can attract new ones, so a brand-new "person" based on a single photo doesn't become a magnet that swallows everything nearby.
Margin: the best match has to beat the runner-up by a clear margin. If the two closest people are both plausible and almost equally close, the assignment is ambiguous and the face fails this gate.

If a face fails any of the gates, it doesn't get assigned. I chose this conservative approach based on the principle that polluting an existing person with unrelated faces is worse than leaving those faces unassigned.
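The gates above can be sketched as one function that returns either a person ID or nothing. This is an illustration under stated assumptions, not Gumnut's implementation: the `Person` fields, the way cohesion is measured, and all four thresholds are hypothetical.

```python
import math
from dataclasses import dataclass


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Person:
    centroid: list
    face_count: int
    cohesion: float  # assumed: mean distance of this person's faces to the centroid


def assign_face(face, people, max_distance=0.35, max_cohesion=0.30,
                min_faces=3, min_margin=0.05):
    """Return the matching person_id, or None if any gate fails."""
    ranked = sorted((cosine_distance(face, p.centroid), pid) for pid, p in people.items())
    if not ranked:
        return None
    best_distance, best_id = ranked[0]
    best = people[best_id]

    # Distance gate: genuinely close, not just the least-far option.
    if best_distance > max_distance:
        return None
    # Cohesion gate: the person must be internally consistent and established.
    if best.cohesion > max_cohesion or best.face_count < min_faces:
        return None
    # Margin gate: a plausible runner-up that is almost as close means ambiguity.
    if len(ranked) > 1:
        second_distance = ranked[1][0]
        if second_distance <= max_distance and second_distance - best_distance < min_margin:
            return None
    return best_id


people = {
    "kid_a": Person([1.0, 0.0], face_count=10, cohesion=0.2),
    "kid_b": Person([0.0, 1.0], face_count=10, cohesion=0.2),
}
print(assign_face([1.0, 0.05], people))  # clearly kid_a
print(assign_face([1.0, 1.0], people))   # equally far from both: refused
```

Returning `None` instead of a best guess is what leaves a face for the clustering stage to pick up later.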

Stage 3: cluster the leftovers

Plenty of faces don't match anyone yet — the first photos of a new person, or anyone who failed the gates in Stage 2. Periodically, Gumnut takes all of those leftover faces across your whole library and clusters them: groups the ones that look like the same person and turns each group into a new unnamed person you can name later.

Originally, I tried a density-based clustering approach. The original algorithm (DBSCAN) groups faces by chaining together near-neighbors: if A is close to B and B is close to C, they all join the same group. Unfortunately, in a library full of similar-looking siblings, I found that this chaining ends up linking different kids together through a trail of in-between faces.

One day it did exactly that. The clustering produced a single 3,674-face blob that was really a mash-up of several of my kids, and a too-trusting merge step then added that entire blob to one child in a single step. It was a clear demonstration that a clustering algorithm that chains will eventually chain multiple people together.
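The chaining effect is easy to reproduce on a toy example. The sketch below is not DBSCAN itself: it is a simplified stand-in that links any two points within `eps` of each other and takes connected components, which is roughly what DBSCAN reduces to when every point counts as a core point. The 1-D "faces" and the `eps` value are made up.

```python
def chain_clusters(points, eps):
    """Link every pair of points within eps, then return a cluster id per point."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


# Two "kids" at 0.0 and 1.2, far apart, plus a trail of in-between faces.
with_bridge = [0.0, 0.3, 0.6, 0.9, 1.2]
without_bridge = [0.0, 0.3, 0.9, 1.2]

print(len(set(chain_clusters(with_bridge, eps=0.35))))     # 1 cluster
print(len(set(chain_clusters(without_bridge, eps=0.35))))  # 2 clusters
```

The endpoints are 1.2 apart, more than three times `eps`, yet a single in-between face at 0.6 is enough to fuse both groups into one.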

After that, I replaced DBSCAN with a graph-based algorithm called Chinese Whispers, which produces tighter, better-separated groups and doesn't chain identities together through bridges.

DBSCAN vs. Chinese Whispers
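For a feel of why the label-voting approach resists bridges, here is a small sketch of Chinese Whispers on a weighted similarity graph. It is a simplified variant, not Gumnut's code: ties are broken by the smallest label rather than randomly so the result is reproducible, and the graph, weights, and parameter names are invented for the example.

```python
import random


def chinese_whispers(num_nodes, edges, iterations=10, seed=0):
    """edges maps an undirected pair (i, j) to a similarity weight.

    Each node starts with its own label, then repeatedly adopts the label
    with the highest total edge weight among its neighbors.
    """
    neighbors = {i: [] for i in range(num_nodes)}
    for (i, j), weight in edges.items():
        neighbors[i].append((j, weight))
        neighbors[j].append((i, weight))

    labels = list(range(num_nodes))
    rng = random.Random(seed)
    order = list(range(num_nodes))
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not neighbors[node]:
                continue
            totals = {}
            for other, weight in neighbors[node]:
                totals[labels[other]] = totals.get(labels[other], 0.0) + weight
            best_weight = max(totals.values())
            # Simplification: deterministic tie-break on the smallest label.
            best_label = min(l for l, t in totals.items() if t == best_weight)
            if best_label != labels[node]:
                labels[node] = best_label
                changed = True
        if not changed:
            break
    return labels


# Two tight groups of three faces, joined by one weaker "bridge" edge (2, 3).
edges = {
    (0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9,
    (3, 4): 0.9, (3, 5): 0.9, (4, 5): 0.9,
    (2, 3): 0.55,
}
labels = chinese_whispers(6, edges)
print(len(set(labels)))  # 2 groups
```

In this toy graph the bridge edge is always outvoted by each face's stronger in-group neighbors, so the two groups keep separate labels. A pure threshold-and-chain approach that admitted the 0.55 edge would have merged all six faces.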

I also added guardrails around what's allowed to become a person or merge into one:

A new person has to show up across multiple different photos, not just several faces in one group shot — real people persist across your library over time.
A new cluster has to be internally consistent before it's allowed to materialize or grow. A loose, sprawling cluster is almost never one real person.
Once you've named a person, that person is sacrosanct. No automated step ever moves faces out of a named person, deletes them, or merges them away. Automation can suggest that two named people are the same and let you confirm it, and it can fold an unnamed group into a named person — but only when it clears a strict set of distance, consistency, size, and sibling-margin checks.
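The first two guardrails can be sketched as a single check. The field names, the `spread` measure (assumed to be some cohesion score computed elsewhere, lower meaning tighter), and both limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterFace:
    face_id: str
    photo_id: str


def can_become_person(faces, spread, min_photos=3, max_spread=0.30):
    """A cluster may materialize only if it spans enough distinct photos
    and its faces hang together tightly enough."""
    distinct_photos = {face.photo_id for face in faces}
    return len(distinct_photos) >= min_photos and spread <= max_spread


group_shot = [ClusterFace(f"f{i}", "photo_1") for i in range(4)]
over_time = [ClusterFace(f"f{i}", f"photo_{i}") for i in range(3)]

print(can_become_person(group_shot, spread=0.1))  # False: one photo only
print(can_become_person(over_time, spread=0.2))   # True
print(can_become_person(over_time, spread=0.5))   # False: too loose
```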

The common thread is that when the signal is weak — a sibling, a hat, a bad angle, a group that doesn't hang together — the pipeline is built to refuse rather than guess.

How I know any of this works

To help evaluate different algorithms for face assignment and clustering, I had to build a ground truth: a set of faces where I know, for certain, who each one really is. Naturally, I chose to label my own family photos, full of my kids and their cousins, with their faces aging over multiple years — over 5,600 faces across a dozen people, each one assigned to the right person.

Assigning faces to the right person, face-by-face, across thousands of faces, in a normal photos UI would have been miserable, so naturally I built a small tool for it, which I've since published in Gumnut's demos repo. Rather than make me look at every face, it helps me focus on the suspicious ones.

The face cleanup app I wrote to label faces, with names removed and faces blurred

For a given person, it surfaces the faces that are actually closer to another person's centroid, the faces sitting in that sibling-ambiguous margin where the second-best match is almost as close as the best, and the worst outliers that sit far from everyone. Each face comes up alongside the other people it might belong to, and I reassign it with a single keystroke. It's a thin client over the same public API everything else uses: it reads each face with its nearest-person candidates and writes corrections back with a PATCH to the faces endpoint. Working through the suspicious faces this way, I turned several thousand faces into a trustworthy answer key over a couple of evenings.

Then I froze the labeled set: the face crops, their embeddings, and who each one really is all get snapshotted, so when I test changes months apart I'm grading both runs against the exact same answer key instead of a dataset that's drifted.

What the measurements showed

The clearest result is the one behind the clustering-algorithm switch I mentioned earlier. When I replayed the old density-based clustering (DBSCAN) on my unlabeled library at the settings it actually ran in production, it didn't just occasionally chain two people together; it collapsed 96% of the faces in the library into a single "person".

On a standard clustering-accuracy score that runs from 0 to 1, that scrambled result lands at essentially 0. The graph-based algorithm I replaced it with (Chinese Whispers) scored 0.88 on the same faces. Measured directly, the share of cross-kid face pairs that got wrongly lumped together — the sibling-confusion problem, quantified — fell from 100% to low single digits.
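The sibling-confusion number is straightforward to compute once there is an answer key. The sketch below is one plausible reading of that metric, with invented names: among all pairs of faces that belong to different real people, what fraction ended up in the same cluster?

```python
from itertools import combinations


def cross_person_confusion(true_people, cluster_ids):
    """Share of different-person face pairs that were placed in the same cluster."""
    cross_pairs = 0
    merged_pairs = 0
    for i, j in combinations(range(len(true_people)), 2):
        if true_people[i] != true_people[j]:
            cross_pairs += 1
            if cluster_ids[i] == cluster_ids[j]:
                merged_pairs += 1
    return merged_pairs / cross_pairs if cross_pairs else 0.0


truth = ["kid_a", "kid_a", "kid_b", "kid_b"]

print(cross_person_confusion(truth, [0, 0, 0, 0]))  # 1.0: everyone in one blob
print(cross_person_confusion(truth, [0, 0, 1, 1]))  # 0.0: perfectly separated
print(cross_person_confusion(truth, [0, 0, 0, 1]))  # 0.5: one face leaked across
```

A single giant cluster scores 1.0 by construction, which matches the 100% starting point: when nearly every face lands in one blob, almost every cross-kid pair is merged.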

The same answer key also helped with many other decisions.

One was the matching threshold. When I plotted how far every face sat from the person it was assigned to, the distribution dropped off a cliff, and almost every face past that cliff turned out to be the wrong person. I'd been using a single distance cutoff for two different jobs: deciding whether a new face belongs to someone the library already knows, and deciding whether two unknown faces are the same person. The data said those jobs want different cutoffs, so I split them and pulled the assignment cutoff in to sit right at the cliff. A few more faces end up unassigned as a result, which is the trade I'd already decided I wanted.

Another was the cohesion gates I described back in the assignment and clustering stages. The mega-merge pushed me to measure something I'd been ignoring: not just how far a face is from a person, but how tightly that person's own faces hang together. When I computed that across the answer key, the polluted clusters — the ones that had quietly swallowed a sibling — stood out cleanly from the healthy ones. There was an obvious line between "this is one person" and "this is two people." That line is what the gates check before the system grows or merges into a person: if a cluster has already drifted past it, adding more faces only makes it worse, so the system refuses. With the gate in place I could replay that 3,674-face merge and watch it get rejected — a cluster that big merging into a person that size is suspicious on its face, so it never even reached the distance math.

Up next

The pipeline above is in production, and I'm following up with some additional improvements in the near future:

Scaling the clustering. Grouping faces means comparing them to each other, and the straightforward way to do that compares every face to every other face. That's fine at my current scale, where a few thousand faces fit comfortably in memory, but the work grows with the square of the number of faces, so it won't survive a library with hundreds of thousands of them. The fix is to stop computing every comparison and instead ask the database for just each face's nearest neighbors, using an approximate-nearest-neighbor index (pgvector's HNSW index) that lets Postgres answer each lookup in roughly logarithmic time. The index is already in place; wiring the clustering step to use it is the next step, and it's what takes this from "works for my family" to "works for a library 100x the size."
Better embeddings. The whole pipeline rests on how well the embedding model separates one person from another, and the sibling problem is ultimately an embedding problem. Running the same answer key through several newer ArcFace-family recognition models, the best of them pushed my kids two to three times farther apart than my current model does — exactly what I want. The catch is that switching models is real work: everyone's embedding changes, and the thresholds that I use for gates would need to be re-tuned.
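To put rough numbers on the scaling point above, here is the arithmetic as a short sketch. It uses the roughly 5,600 faces in the labeled set as a stand-in for current scale and a library 100x that size. The neighbor count `k=20` is an arbitrary assumption for illustration.

```python
def all_pairs(n):
    """Comparisons needed to compare every face with every other face."""
    return n * (n - 1) // 2


def knn_edges(n, k):
    """Comparisons kept if each face only asks for its k nearest neighbors."""
    return n * k


today = 5_600
future = today * 100

print(f"{all_pairs(today):,}")      # 15,677,200
print(f"{all_pairs(future):,}")     # 156,799,720,000
print(f"{knn_edges(future, 20):,}") # 11,200,000
```

A library 100x larger needs about 10,000x more all-pairs comparisons, while the nearest-neighbor version grows only 100x.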

As for this newsletter, the obvious next piece of the photo intelligence layer is search — actually finding the photos you're thinking of, by what's in them rather than by filename or date. The descriptions from my last post and the people from this one both feed into that, and it's where I want to go next in this series.

As always, these take a while to write, so I hope you find them interesting — and feel free to reach out if you'd like to chat.

