
Why I taped my son's childhood

By Deb Roy, Special to CNN
The birth of a word
  • Deb Roy and his wife installed taping equipment throughout their home
  • They documented son's early childhood with 200,000-plus hours of audio, videotape
  • The tapes show how their son learned to master words such as "water"
  • Roy says huge datasets such as his are changing the way we understand the world

Editor's note: TED is a nonprofit organization dedicated to "Ideas worth spreading," which it makes available through talks posted on its website. Deb Roy is co-founder and CEO of a technology company, Bluefin Labs, and is on leave from the MIT Media Lab, where he directs the Cognitive Machines group.

Cambridge, Massachusetts (CNN) -- Little did I know that studying how my son learned to speak would come to this: a TED Talk gone viral, partially thanks to Ashton Kutcher and his 6 million Twitter followers -- and a technology platform that may change the way we understand social, political and commercial communications.

Six years ago, my wife and I -- a speech scientist and a cognitive scientist, respectively -- wanted to understand, comprehensively and in a natural setting, how a child learns language, since most theories of language acquisition were grounded in surprisingly incomplete observational data.

In my academic work, which involves teaching machines to learn and speak, a data-based understanding of language development is crucial -- as it is to my wife's work in studying speech disorders. So we decided to create a data set to study. A really, really big data set.

Before our son was born, we wired our home with microphones and cameras and started recording, with various carefully designed privacy protections in place. The goal was to capture all verbal and non-verbal interactions between him and his caregivers (my wife, our nanny and I) and understand the contextual environment around his language development.

When we finished taping roughly three years later, we'd amassed more than 200,000 hours of audio and video -- no doubt history's biggest home video collection. Captured in this rich record are countless memorable moments for my family, from my son's first steps to the arrival of his sister and beyond.

Frankly, making the recordings was the easy part. Figuring out how to process and analyze this Big Data was a terrific challenge. My "Speechome" research team (speechome = speech + home) at the MIT Media Lab has developed fast machine-assisted methods for transcribing speech and annotating video at scale.

After transcribing several million words of speech, one of the first magical glimpses we got into the data was the acoustic equivalent of a time-lapse video of a flower blossoming. We were able to hear the evolution of a word form as my son transitioned from saying "gaga" to "water."

The effect of this audio time lapse was striking, allowing us for the first time to hear the trajectory of a spoken word by accelerating through months of child development in seconds.
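The core of the time-lapse idea can be sketched in a few lines. The following is a toy illustration, not the Speechome pipeline: in the real corpus each token is an audio clip with a timestamp, while here each record is just a hypothetical month index and a transcribed word form.

```python
from dataclasses import dataclass

# Hypothetical utterance records; the real data would be timestamped audio clips.
@dataclass
class Utterance:
    month: int   # child's age in months when the token was recorded
    form: str    # how the target word was pronounced at that point

def word_time_lapse(utterances, per_month=1):
    """Sample a few tokens per month, in chronological order, to form an
    accelerated 'time-lapse' of how a word's spoken form evolved."""
    by_month = {}
    for u in sorted(utterances, key=lambda u: u.month):
        by_month.setdefault(u.month, []).append(u)
    lapse = []
    for month in sorted(by_month):
        lapse.extend(by_month[month][:per_month])
    return [u.form for u in lapse]

recordings = [
    Utterance(12, "gaga"), Utterance(14, "gaga"),
    Utterance(16, "wawa"), Utterance(18, "wata"),
    Utterance(20, "water"),
]
print(word_time_lapse(recordings))
# ['gaga', 'gaga', 'wawa', 'wata', 'water']
```

Playing the sampled tokens back-to-back is what compresses months of development into seconds of audio.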

My research team went on to develop deep machine learning algorithms to discover semantic connections within the video and audio data. Our resulting ability to "ground" the meaning of words allows us to trace the birth of words like "water" back to larger social contexts. This lets us pinpoint and study when, where and how language is acquired.
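In its simplest form, grounding means tying a word to the non-linguistic context it reliably co-occurs with. The sketch below reduces that idea to co-occurrence counting over hypothetical (word, context) pairs; the article's actual models are deep-learning systems operating on raw video and audio.

```python
from collections import Counter

def ground_word(events, word):
    """Toy grounding: associate a word with the context it most often
    co-occurs with. `events` is a list of hypothetical (word, context)
    pairs extracted from annotated recordings."""
    contexts = Counter(ctx for w, ctx in events if w == word)
    if not contexts:
        return None
    return contexts.most_common(1)[0][0]

events = [
    ("water", "kitchen"), ("water", "bathroom"), ("water", "kitchen"),
    ("ball", "living room"),
]
print(ground_word(events, "water"))  # kitchen
```

Tracking when such an association first stabilizes is one way to operationalize "the birth of a word."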

Though it's still early in the project, we believe the Speechome's research methodologies and technologies may have profound implications for challenges as diverse as the "education" of cognitive robots and other machines and the understanding and treatment of disorders that affect children's development of communication and related social skills.

This approach to language grounding has broader social implications and commercial applications as well. As it turns out, the communications revolution -- as fueled by mass media, social media and their technologies -- is producing on a daily basis geometrically larger data sets and more varied contexts than the Speechome's.

Between mass media and social media sits what I call a semantic barrier. Mass media actions (shows, news, games, ads, etc.) create social media reactions. The semantic barrier has prevented us from grounding the meaning of people's comments pulled from a social media stream in the larger context of what, if anything, they're reacting to. Or in other words, we haven't been able to align audience response to its mass media stimulus with any real precision or scale.

Crossing the semantic barrier is the focus of a new endeavor for me as CEO of Bluefin Labs. We're using deep machine learning to help organizations such as advertisers, content programmers and distributors measure and understand audience response to mass media content by linking social media comments directly to their televised source -- at scale, with precision and in near real time.

We are able to measure how audiences respond to specific televised events such as plays in a football game or ads in a show, and how response to such events varies by viewing context.
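One simple way to picture this alignment is attributing each comment to the most recent broadcast event within a short time window. The sketch below is purely illustrative -- the event names, timestamps and matching rule are hypothetical stand-ins, not Bluefin's actual method.

```python
from datetime import datetime, timedelta

def link_comments_to_events(events, comments, window=timedelta(minutes=5)):
    """Attribute each timestamped comment to the most recent televised
    event that precedes it within `window`. A toy illustration of
    crossing the 'semantic barrier'."""
    linked = {name: [] for _, name in events}
    for ts, text in comments:
        # Candidate events: aired at or before the comment, within the window.
        candidates = [(ts - ev_ts, name) for ev_ts, name in events
                      if timedelta(0) <= ts - ev_ts <= window]
        if candidates:
            _, name = min(candidates)  # smallest gap = most recent event
            linked[name].append(text)
    return linked

t0 = datetime(2011, 2, 6, 18, 30)  # hypothetical broadcast start
events = [(t0, "touchdown"), (t0 + timedelta(minutes=10), "halftime ad")]
comments = [
    (t0 + timedelta(minutes=2), "What a play!"),
    (t0 + timedelta(minutes=12), "That ad was hilarious"),
    (t0 + timedelta(minutes=30), "Unrelated chatter"),  # outside every window
]
print(link_comments_to_events(events, comments))
```

In practice the hard part is semantic, not temporal: deciding which program or ad a comment is actually about, which is where the machine learning comes in.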

Beyond commercial application, technologies like Bluefin's are in a position to reveal sweeping views into the human condition that will leave few areas of society untouched. Big new data sets about people are revealing insights into how we develop, think, behave and interact.

At the same time, billions of people have come online and are broadcasting their voices in tweets, posts, updates and other forms that simply did not exist just 10 years ago. By analyzing and cross-pollinating these rich streams of human data, our ability to understand, predict and influence human behavior will be expanded dramatically -- with deep implications for health, finance, retail, politics and beyond.

The opinions expressed in this commentary are solely those of Deb Roy.
