Your Baby Is a Self-Supervised Learner

Last month I was at a local elementary school's maker morning, watching a five-year-old spend twenty minutes trying to get a motorized cardboard car to drive straight. She wasn't reasoning about physics. She wasn't applying rules. She was just... trying, observing, adjusting. Pick up the car. Flip it over. Nudge the wheel a centimeter. Try again. Repeat.

It looked like trial and error. But I kept thinking: there's a pattern-extractor running here. Something in that kid's brain was computing, over and over, "which changes correlate with better outcomes?" She never would have described it that way. She was just playing. But she was doing statistics.

The Baby Who Learned Words Without Being Taught Any

Here's one of my favorite experiments in all of developmental science. In 1996, Jenny Saffran and colleagues played eight-month-old babies a continuous stream of nonsense syllables — something like "tupirogolatupiro..." — for just two minutes. No pauses, no intonation, no cues marking where one "word" ended and another began.

Then they tested whether the babies could tell the original syllable clusters apart from novel combinations. The babies noticed the difference: they'd learned the original "words." From two minutes of pure sound. Without anyone teaching them a thing.

How? They were tracking transitional probabilities — the likelihood that one syllable would follow another. Syllables within a word tended to co-occur reliably; syllables straddling word boundaries didn't. The babies were doing statistics. Completely implicitly. Before they could talk.
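If you want to see how little machinery this takes, here's a minimal sketch in Python. It's my own toy illustration, not the procedure from the study: the three made-up words, their fixed ordering, and the 0.75 boundary threshold are all assumptions I picked so the numbers come out clean, and the function names are hypothetical.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) for each adjacent pair."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # the last syllable has no successor
    return {
        (a, b): count / first_counts[a]
        for (a, b), count in pair_counts.items()
    }

def segment(syllables, tps, threshold=0.75):
    """Cut the stream wherever the transitional probability dips below threshold."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Assumed toy language: three invented words, played in a fixed order
# with no pauses and no word repeated back to back.
WORDS = {"A": ["tu", "pi", "ro"], "B": ["go", "la", "bu"], "C": ["bi", "da", "ku"]}
ORDER = "ABCACBABCBAC"
stream = [syllable for word in ORDER for syllable in WORDS[word]]

tps = transitional_probabilities(stream)
recovered = segment(stream, tps)
print(sorted(set(recovered)))
```

In this toy stream every within-word transition has probability 1.0 and every cross-boundary transition is at most about 0.67, so cutting at the dips recovers the three words. Real speech is far noisier than this.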

It's Not Just Language

Here's what always gets me about this result: it's not some quirky language trick. Statistical learning shows up everywhere infants interact with the world. They extract regularities from visual scenes, patterns in motor sequences, social contingencies in caregiver behavior. The brain is continuously mining its input for structure — not because anyone instructed it to, but because regularity is exploitable. If A reliably predicts B, detecting A helps you prepare for B. That's useful in any domain, in any sensory channel.

This is, I think, deeply connected to the embodied learning story. When my eight-year-old nephew was debugging the little wheeled robot we built together — poking it, lifting it, tilting it until something clicked — he was running the same computation: tracking which changes correlated with which outcomes. The statistics aren't just in the auditory stream. They're in the feel of a robot's weight, the way wheels spin, the sound of a motor under load. A body moving through the world is a pattern-detection instrument running around the clock.

Self-Supervised AI Is Running the Same Math

Now: self-supervised learning in AI is, at a computational level, doing the same thing. BERT masks random tokens and trains to predict them. GPT predicts the next token. DINO trains a vision model to produce consistent representations of differently augmented views of the same image. None of these systems is told what anything means. They extract structure from the data itself, using prediction error as the teacher.

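This is not a coincidence. These techniques were developed in part by researchers thinking about how brains learn, and the transitional-probability logic running in Saffran's babies is closely analogous to the objective functions powering the largest language models today. Both estimate what comes next given what came before; shrink a language model's context to a single syllable and next-token prediction reduces to estimating exactly those transitional probabilities. Extract regularities. Build representations that reflect them. Don't require labels.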
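To make the two text objectives concrete, here's a small sketch of how each one turns a raw sentence into prediction problems without any labels. This is a simplification under my own assumptions: the function names are hypothetical, the tokens are whole words, and the masking step just hides a random 15% of positions (real BERT training adds further tricks on top of that).

```python
import random

MASK = "[MASK]"

def causal_examples(tokens):
    """GPT-style: each prefix is the input, the token that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_examples(tokens, mask_rate=0.15, seed=0):
    """BERT-style (simplified): hide some tokens and ask for them back."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # position -> hidden token
    return corrupted, targets

sentence = "the baby hears the stream".split()

for context, target in causal_examples(sentence):
    print(context, "->", target)

corrupted, targets = masked_examples(sentence)
print(corrupted, targets)
```

In both cases the "answers" come from the data itself: the sentence supplies its own targets, which is the whole trick.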

Data Volume Matters Less Than You'd Think

A common objection to the infant-AI analogy goes like this: "Sure, but large language models train on hundreds of billions to trillions of tokens. A child hears maybe 50 million words before age 10. The comparison doesn't hold."

But a 2024 MIT study tested exactly this. Hosseini et al. (2024) trained GPT-2-scale models with progressively less data — all the way down to amounts comparable to a child's actual linguistic exposure — and then measured whether these small-data models still predicted human fMRI brain responses to language. They did. Significantly. The key variable wasn't how much data the model trained on. It was the training objective: optimizing for next-word prediction, not just any task, produces brain-aligned representations. It's the structure of what you're computing, not the raw volume, that seems to matter.

What Happens When You Give AI a Child's Data Budget

This framing shaped the entire BabyLM Challenge, a community-wide experiment where researchers compete to train strong language models on the same data budget a child actually receives: either 10 million or 100 million words. According to Hu et al. (2024), the 2024 edition drew 31 papers and 162 model submissions. The verdict: you can train surprisingly capable models on child-scale data. Hybrid architectures combining masked and causal language modeling outperformed either approach alone.

But one finding stuck with me: child-directed speech — the simplified, high-repetition talk parents use with babies — actually underperformed richer literary datasets, even at small scales. Children aren't learning from a stripped-down input stream. They're learning from varied, complex language that just happens to arrive with extra prosodic warmth. The statistics in rich input are, apparently, better statistics.

The Part Where Pure Stats Isn't Enough

Here's where statistical learning runs into a wall. Children acquire grammar-level patterns far faster than any amount of passive statistical sensitivity should allow. A 2025 study tackled this directly by building a hybrid system: Schuler et al. (2025) distilled structured Bayesian inductive biases into a neural network's flexible representations. The resulting model could learn formal linguistic patterns from limited data and scale to naturalistic sentences — something neither Bayesian models alone nor neural nets alone could do. The insight: babies probably aren't arriving as blank statistical detectors. They bring priors — structured expectations about what kinds of patterns are worth looking for — that turbocharge what limited input can teach them.
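One hedged way to picture what a prior buys you: treat it as observations the learner effectively starts with. The sketch below is a textbook Beta-Bernoulli update, not the model from the study above; the prior strengths and the 95% confidence target are numbers I made up for illustration, and the function names are hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Mean of a Beta(prior_a, prior_b) belief after Bernoulli observations."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def examples_needed(prior_a, prior_b, target=0.95):
    """Count consistent examples until the learner's belief reaches the target."""
    n = 0
    while posterior_mean(prior_a, prior_b, successes=n, failures=0) < target:
        n += 1
    return n

# Assumed priors: a flat Beta(1, 1) learner versus one that arrives already
# expecting this kind of pattern to hold, written as Beta(9, 1).
flat_learner = examples_needed(1, 1)
primed_learner = examples_needed(9, 1)
print(flat_learner, primed_learner)
```

With these made-up numbers the flat-prior learner needs 18 consistent examples and the primed learner needs 10. The data and the update rule are identical; only the starting expectations differ.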

This maps back to embodied cognition in a way I find genuinely exciting. The priors aren't just abstract. Some of them are physical: a sensitivity to contingency (if I do X, Y happens), to object permanence, to the way agents in the world behave differently from objects. These are priors you'd naturally develop if you spent months moving through a physical environment before you heard your first word.

The Route Matters, Not Just the Destination

So AI systems can extract statistical patterns from unlabeled data. They can do it on child-scale budgets. They can even have Bayesian-style priors engineered in. Does that mean they learn the way children do?

Not exactly. Tan et al. (2024) introduced DevBench (a NeurIPS 2024 oral), a benchmark that compares vision-language model performance against children's, task by task, across development. The striking result: better models do more closely resemble human behavioral patterns. But they systematically fail to replicate the ordering of developmental milestones. The sequence of what gets learned when is wrong, even when endpoint performance is similar. The route through the learning landscape is different, even if some destinations overlap.

It shows up in vision too. Tiedemann et al. (2024) benchmarked 4- to 6-year-olds against top AI vision models on object recognition. Kids identified objects with disrupted local features at just 100 milliseconds of exposure — a speed rivaling state-of-the-art models. But children generalized more robustly and were less thrown off by distribution shifts. What the statistics of early childhood visual experience build is something subtly but importantly different from what dataset-trained models build, even when their accuracy on benchmarks looks similar.

The Missing Ingredient: A Body in the Loop

This is the thing I can't stop thinking about. Saffran's babies weren't passively receiving an audio stream in a sensory deprivation tank. They were in cribs, in rooms, with caregivers moving around them. Their statistical extraction was happening inside a body that was also tracking its own movements, locking onto eye contact, feeling hunger and satiation, reaching for objects and noticing what happened. The statistics of language didn't arrive in isolation. They came embedded in a continuous, embodied, socially scaffolded, cross-modal stream.

When I think about what self-supervised AI is missing, it's usually not the training objective. The BabyLM results suggest you can go surprisingly far with the right prediction task on modest data. What's missing is the body. The statistics that babies learn from aren't just in the audio waveform — they're in the correlations between what you hear and what you're looking at, what you just touched, what happened when you reached. The patterns are three-dimensional, cross-modal, and anchored in action.

Self-supervised vision models are getting closer: they learn representations from augmented views of images and build in some invariances. But they're still not moving through space, reaching for objects, or feeling gravity. The statistical regularities they find are real. They're just a lower-dimensional projection of the full pattern space that a body navigating the world gets to sample.

What This Means at Home

A few practical takeaways from all of this:

Routine is data, not boredom. Repeated bath times, predictable mealtimes, consistent bedtime sequences all give a developing brain statistical structure it can mine. Novelty matters, but regularity is what makes novelty meaningful — it's the baseline against which surprises stand out.

You don't always need to explain. A child watching water poured between containers is running a statistical learning experiment in real time — detecting what changes, what doesn't, what the reliable patterns are. Before you step in with the explanation, give them a few seconds to compute.

Rich language beats simplified language. The BabyLM finding (Hu et al., 2024) echoes decades of developmental research: varied, interesting language — books, stories, real conversation — builds more robust representations than simplified repetition alone. Read the weird books. Have the real conversations. The statistics will do their work.

(If you're noticing any consistent delays in how your child is picking up on language patterns or responding to their environment — beyond the usual curiosity that brought you here — that's worth raising with your pediatrician.)

One Open Question

I started by watching a five-year-old tune a cardboard car by feel. What I kept seeing was a pattern-extraction engine running continuously on the statistics of what just happened. She had no equations. No labeled training data. Just the regularity of the world, and a brain shaped by millions of years of evolution to notice it.

Self-supervised AI is trying to replicate that from the outside in: design a learning objective that exposes statistical structure, scale it up, and let representations emerge. The infant does it from the inside out: arrive with priors shaped by evolution and a body primed to collect exactly the data that structure-finding needs.

Whether those two routes converge is one of the most interesting open questions in cognitive science right now. The math might look the same. The experience, I'm pretty sure, doesn't.