Reading age from a single frame
Crop tight to a face and you throw away half the story. Bad light, an odd angle, a low-res frame: any of them can fake a few years in either direction.
First the model finds you. It draws a box around each face and the body it belongs to, matches the two up, and ignores everyone else in the shot. Each crop gets resized without squashing and normalised (rescaled to the range the network expects). Then it is chopped into a fine grid of patches, dense enough to catch the things that actually age a face: fine lines, the way light scatters in skin. Two streams, face and body, trade notes through attention (the network deciding which patches matter most). When your face is blurry, the body quietly carries more weight. Out the end comes one number: your age.
Per-channel normalisation feeds a high-density token grid; cross-stream attention fuses face and body evidence.
