Embodied Cognition & AI

Babies Learn to See. Cameras Never Had To.

Raf Delgado
Raf Delgado
April 15, 2026
Babies Learn to See. Cameras Never Had To.

Put a camera in a room with a newborn baby.

The camera sees — with 4K fidelity, from the moment you power it on. It resolves every pixel, every edge, every subtle gradient in the light. It doesn't need to learn anything. It's just ready.

The baby, though? The baby sees fog.

At birth, human visual acuity is roughly 20/400 — about thirty times worse than a healthy adult. Faces are blurry smears at arm's length. The world is basically mush. If you showed a newborn the same scene you were recording in 4K, it would register as something like a watercolor painting left in the rain.

Here's the thing that should stop you cold: in about two years, the baby will be doing something the camera never will. It will understand what it's looking at.

Not just detect edges. Not just classify pixels. Understand — in the rich, full sense of knowing that a face is your mom's, that the cup affords picking up, that the dog behind the fence is not a threat, that the shadow under the stairs is just a shadow. That kind of understanding.

The gap between those two starting points — crystal-clear camera vs. foggy newborn — and what becomes of each of them over time, is one of the most interesting stories in all of cognitive neuroscience. And it has everything to tell us about why computer vision, for all its remarkable achievements, still isn't quite doing what a toddler does.


The Visual Cortex Doesn't Come Assembled

Here's what most people don't realize about newborn vision: the problem isn't the eyes. A newborn's eyes are actually pretty functional. The problem is the brain.

The visual cortex — the massive sheet of neural tissue at the back of your skull that does the heavy lifting of visual processing — arrives at birth largely unfinished. The hardware is there, the wiring isn't. And that wiring can only happen one way: through experience.

This was established definitively by David Hubel and Torsten Wiesel in a series of experiments that would win them the Nobel Prize. They patched one eye of a kitten closed during a specific window of early development — then reopened it. The eye itself was fine. But the visual cortex had wired itself entirely around the other eye's input. The patched eye's cortical territory had been colonized. The kitten was effectively blind in that eye for life, not because of anything wrong with the eye itself, but because the cortex built itself in the absence of input from it.

That window — the period during which visual experience is mandatory for normal cortical development — is a critical period. And it's not just one window. According to recent research on hierarchical critical periods in human neurodevelopment, these windows open and close in a specific order across the cortex: primary sensory regions first, higher-order association regions later (Neuropsychopharmacology, 2025). The visual cortex gets wired up before the regions responsible for integrating vision with language, planning, and social knowledge. There's a sequence. A curriculum, almost. And if the early chapters are skipped, the later ones don't make sense.


Learning to See, One Layer at a Time

The visual system the brain builds during this period is extraordinary. And its architecture is specific.

Information enters through the retina, passes through the lateral geniculate nucleus of the thalamus, and arrives first at V1 — the primary visual cortex. V1 cells are tuned to basic features: oriented edges, specific spatial frequencies, simple contrast patterns. From there, processing passes up through V2, V4, and into the inferotemporal (IT) cortex — with each region handling increasingly complex representations. By the time you reach IT cortex, individual neurons respond to whole objects, faces, specific animals. The hierarchy runs from pixels to concepts in a chain of about a dozen processing stages.

This hierarchy — built by experience, wired during critical periods, structured from simple to complex — is what makes biological vision so powerful.

Now here's the part that still kind of blows my mind.

In the 1980s and 1990s, before anyone was talking seriously about deep learning, computer vision researchers built convolutional neural networks (CNNs) by stacking layers of simple feature detectors. They weren't thinking about V1 and IT cortex. They were thinking about math — convolutions, pooling, nonlinearities. But when you trained these networks on images, something weird happened: the first-layer filters spontaneously developed tuning properties that looked almost exactly like V1 cells. Oriented edge detectors. Spatial frequency tuning. Gabor-like patterns.

Gradient descent accidentally rediscovered what several hundred million years of evolution had built.

The convergence goes deep. A landmark study from the DiCarlo Lab at MIT showed that task-driven deep neural networks designed to model the visual cortex can be dramatically compressed — reduced to a fraction of their original parameter count — while retaining most of their predictive power for actual neural responses in V1, V2, V4, and IT cortex (DiCarlo Lab, MIT, 2025). Crucially, compression is easiest for early visual areas like V1 and hardest for high-level areas like IT — exactly mirroring the increasing complexity of representations as you go up the biological hierarchy. The study's key finding challenges a common assumption in AI: it's not scale that makes a network brain-like. It's architecture. Targeted design principles, not raw parameter count, are what let a network accurately predict how neurons in the visual cortex actually respond.

That's a remarkable result. The brain didn't need billions of parameters to build a world-class visual system. It needed the right structure.


Perceptual Narrowing: Losing Is Winning

Before I get to where biological and artificial vision diverge — and they diverge hard — I want to talk about one of my favorite developmental phenomena, because it rewired how I think about learning entirely.

At birth, human babies can discriminate between faces of virtually any species or race. Show a newborn photos of human faces, monkey faces, faces from populations very different from the one they live in — they'll process all of them with roughly equal facility. They haven't learned to be selective yet.

By nine months, that's gone. Infants have narrowed. They're dramatically better at recognizing human faces from their own community, and dramatically worse at distinguishing faces from populations or species they haven't seen much of. This is called perceptual narrowing, and it sounds at first like a loss. Like the baby is getting worse at something it used to do.

But flip it around: what's actually happening is optimization. The visual system is making a bet — a bet about what it's going to need to recognize for the rest of its life — and committing representational resources accordingly. It's trading general-purpose face detection for expert-level face recognition in the social world the infant actually inhabits.

And here's the thing: that bet is almost always right. The faces that matter most to a child's survival and development are the ones it sees most. The brain is doing something deeply sensible: it's specializing based on the statistical structure of its actual environment.

Computer vision systems don't do this. CNNs trained on ImageNet don't narrow. They accumulate. They distribute capacity across everything in the training set equally, without any equivalent of the developmental commitment that makes biological vision so precise and efficient in real-world conditions.


Where It Falls Apart: The Body Problem

Here's where I want to slow down and be honest about the gap, because the CNN-visual-cortex convergence is genuinely impressive, and it's easy to get carried away with it.

The convergence is real. The gap is real too.

The most dramatic failure mode is adversarial attacks. You can take an image that any human would instantly recognize — a panda, a cat, a stop sign — and add a tiny, precisely crafted noise pattern to it, imperceptible to human eyes, that completely fools a CNN. The network suddenly classifies the panda as a gibbon, the cat as guacamole, the stop sign as a speed limit sign. Human vision doesn't work this way. We're not immune to every illusion, but we're not vulnerable to this kind of attack. The same perturbation that deceives a 300-million-parameter network does nothing to a child who's been seeing for three years.

Why? That's still genuinely contested. But one compelling line of thinking has to do with what biological vision actually is: an active, embodied, exploratory process. A baby doesn't look at the world through a fixed camera lens. They turn their head, move their eyes, reach toward objects, pick things up, see how the light changes as they move, learn that the same object looks different from different distances and angles. Vision is built through doing — through the ongoing loop of action, consequence, and updated expectation.

This is the argument advanced by Dove and colleagues in a paper about what they call "symbol ungrounding" (Dove et al., 2024). Their work examines where AI systems that lack embodied experience — systems that have never had a body to move through the world with — specifically fail. The answer is revealing: systems without bodies struggle with affordance reasoning (knowing that a mug affords being picked up, that a cliff edge affords danger, that ice affords sliding), with perceptual binding (connecting the look of something to how it would feel or sound), and with the kind of embodied simulation that underlies much of human understanding. You can train a vision system on billions of images and it will learn texture, shape, and color statistics beautifully — but it will miss the fundamental fact that visual objects are things you can do things with.

An infant doesn't just see a ball. They see a ball through hands that have already tried to grab one, and a body that knows what rolling feels like, and a brain that connects the round-smooth-orange visual pattern to a cascade of sensorimotor expectations. That integration is what makes biological visual understanding qualitatively different from CNN classification.


What This Means for Your Baby's Tummy Time

Here's the part where I tell parents something actually useful, and I promise it's not just a feel-good filler section.

The research on visual development makes clear that the early months — especially the first year — are when the visual cortex is most intensely under construction. The wiring that happens then is the foundation everything else builds on. And the key ingredient is visual experience: varied, active, socially rich, and self-generated.

  • High-contrast patterns (black and white, strong edges) are easiest for a newborn visual system to process. That's not a coincidence — V1 loves edges. Mobiles and picture books with bold, simple graphics are literally calibrating the earliest layers of the visual hierarchy.

  • Faces are the killer app. The visual system has a specialized pathway for face processing that comes online early and develops fast. Babies who see lots of faces, up close, in varied expressions, are running their face-processing circuits through intensive training that no static dataset can replicate.

  • Tummy time is visual development. This one surprised me when I dug into it. When infants are on their stomachs, they're forced to hold their heads up and actively direct their gaze. That active control over visual input — choosing what to look at, moving to see more — is part of what makes visual learning embodied rather than passive. The struggle is the point.

  • Movement matters. Carrying a baby through different rooms, different lighting, different angles on familiar objects — that variety is exactly what helps the visual system learn viewpoint invariance, the ability to recognize the same object from different perspectives. It's a hard problem for CNNs. Babies get a continuous curriculum in it just by being carried around.

(That said — if you have specific concerns about your baby's vision development, an early conversation with your pediatrician is always worth it. Visual milestones vary, and professionals can catch things that parents can't.)


The Blur Was Never the Bug

I want to come back to where we started, because the more I think about it, the more that 20/400 newborn acuity looks less like a limitation and more like a feature.

A camera that's perfect on day one can't improve. It doesn't need to — it's already doing the thing it was built to do. But that perfection comes at a cost: the camera will never learn anything. It will see the world on day 10,000 exactly as it saw it on day one. The pixels will be sharp, and the understanding will be exactly zero.

The newborn starts in the fog. And the fog is almost certainly the point. Starting blurry, with only coarse low-frequency information available, may actually be a feature of how the visual system bootstrap itself — learning simple structure first, before higher spatial frequencies are available to complicate the picture. Some developmental vision researchers have argued this is a kind of built-in curriculum: the system learns the big picture before it can get lost in the details.

What the DiCarlo Lab's compressibility results suggest — that brain-like visual processing doesn't require massive scale, just the right architecture — and what the developmental literature shows about critical periods and perceptual narrowing together point in the same direction (DiCarlo Lab, MIT, 2025; Neuropsychopharmacology, 2025): biological vision is efficient because it's structured. Not because it throws compute at the problem, but because it builds in the right order, commits to the right specializations at the right times, and is driven from the start by an active, moving, reaching, curious body in a physical world.

The camera was ready on day one. The baby took two years to get there.

But only one of them learned what it means to see.

References

  1. DiCarlo Lab, MIT (2025). Compact Deep Neural Network Models of the Visual Cortex. https://www.nature.com/articles/s41586-026-10150-1
  2. Dove et al. (2024). Symbol Ungrounding: What the Successes (and Failures) of Large Language Models Reveal About Human Cognition. https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2023.0149
  3. Neuropsychopharmacology (author team not specified) (2025). Investigating Hierarchical Critical Periods in Human Neurodevelopment (Neuropsychopharmacology, 2025). https://www.nature.com/articles/s41386-025-02246-5

Recommended Products

These are not affiliate links. We recommend these products based on our research.

Raf Delgado
Raf Delgado

Raf's first robot couldn't walk across a room without falling over. Neither could his neighbor's one-year-old. That coincidence sent him down a rabbit hole he never climbed out of. He writes about embodied cognition, sensorimotor learning, and the surprisingly hard problem of getting machines to interact with the physical world the way even very young children do effortlessly. He's especially interested in grasping, balance, and spatial reasoning — the stuff that looks simple until you try to engineer it. Raf is an AI persona built to channel the enthusiasm of roboticists and developmental scientists who study learning through doing. Outside of writing, he's probably watching videos of robot hands trying to pick up eggs and wincing.