Adrian Aoun wants to build a system that instantly understands everything posted to the Internet.
He started the project about three years ago, and on Wednesday, he and his company, Wavii, unveiled version number one. As it stands, Wavii’s online service is a Facebook-like newsfeed for everything other than Facebook.
It feeds you news about what’s going on in the world at large, not just random thoughts from your friends and family. But in building this service, Aoun and company are tackling a much larger problem. They’re trying to organize the Internet’s information in ways that machines can understand.
“There’s a world of untapped information out there, in news articles and blogs and tweets,” Aoun says. “What we’ve done is we’ve taught our machines to read those articles, blogs, and tweets, and we extract the concepts that are being talked about. We’re watching the Web in real time, what everyone is writing about and talking about, and we’re building structured data that can then be used by automated applications.”
With the company’s current service, for instance, users can set up a newsfeed dedicated to a particular person or topic. The service will alert you when anything big happens with Kim Kardashian, Mitt Romney, or IBM, and it will do so in plain English.
That’s a task far more difficult than it might seem. Aoun and his engineering team have built a system that analyzes hundreds of thousands of articles, blogs, tweets, and other websites as they’re posted to the net and then tags them with metadata that describes the information they hold.
It’s an ambitious project – so ambitious that you can’t help but question how successful Aoun and company will be.
Raymie Stata – the former chief technology officer at Yahoo, a company that has built several realtime analysis systems in recent years – says it’s actually not that difficult to analyze such large amounts of data in real time. What’s difficult, he says, is making sure the analysis is correct.
“I don’t see the ‘realtimeness’ of this product as being a particular challenge,” Stata says, adding that this sort of processing is cheap because you can easily spread it across a large number of machines. “The hard part … is a good recommendation engine.”
Aoun agrees. But he goes further. Designing that engine, he says, is even more difficult when you’re trying to use it in real time.
The man who did not work for Myspace
Adrian Aoun did not work for Myspace. He’s careful to point that out. He worked for Fox Interactive Media, the company that owned Myspace. “Let’s not put all the blame on me,” he says.
At Fox, he spent an awful lot of time thinking about why Myspace was “getting creamed by Facebook.” In the end, he decided this had nothing to do with how ugly Myspace was. Myspace was getting creamed by Facebook, he says, because Facebook knew how to structure data.
If you added your company’s name to your Facebook profile, for instance, it wasn’t just empty text. It was a link to a page, and this page, in turn, linked to anyone else who worked for that same company.
This meant that data could be easily reused on pages and services across the site – again and again and again. “Facebook gave your data some underlying representation,” Aoun says, “and it realized the power you can give to a computer interface if you have this sort of underlying data.”
So, after leaving Fox, he founded Wavii. The idea was to structure the Internet in much the same way that Facebook structured data about your online friends – a gargantuan task.
At Facebook, the site’s many users help you build that structure. Facebook asks for information, and users give it. Wavii needed a way of structuring much more data, all on its own.
The company set out to build a system that could understand natural language. But it didn’t use classic natural language processing. It didn’t try to deconstruct the relationships between each individual word in each individual sentence.
It used machine learning, attempting to understand natural language by analyzing the relationship between vast quantities of data.
It’s the Google approach. Rather than trying to build a system that can think, you use large amounts of data to fashion a system that gives the illusion that it can think.
“Wavii isn’t trying to be 100% precise on the meaning of each individual sentence,” says James Pitkow, the former Xerox PARC researcher and Internet pioneer who now serves as an advisor to Wavii. “Instead, it looks at all the data that exists on a subject – tens of articles, hundreds of articles, thousands of articles – and compares them.”
If Google acquires Motorola, he says, hundreds of news stories on the net will discuss the acquisition. Wavii’s system may not know that Motorola is a company, but if it has enough data, it can connect the dots.
“If you know that Google is a company and that companies acquire companies, you can quickly figure out that Motorola is a company,” Pitkow says. “When you have a preponderance of data and examples to look at, it makes your job a lot easier. You can rely on the multitude to resolve the ambiguity.”
But, yes, the system requires a little bootstrapping. Part of the process involves Wavii engineers feeding semantic information into the system. Once these meanings are in place, the system can learn more on its own.
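To make Pitkow’s Google-and-Motorola reasoning concrete, here is a minimal sketch in Python. It is not Wavii’s code, which the company has not published. The seed facts, the relation table, the `infer_types` function, and the `min_votes` threshold are all hypothetical, standing in for the hand-fed semantic information and the “preponderance of data” described above.

```python
from collections import Counter, defaultdict

# Hypothetical seed knowledge, standing in for what engineers feed in by hand.
SEED_TYPES = {"Google": "company"}

# Hypothetical relation signatures: verb -> (subject type, object type).
RELATION_SIGNATURES = {"acquires": ("company", "company")}


def infer_types(triples, known_types, min_votes=3):
    """Guess the type of unknown entities from many (subject, verb, object) triples.

    Each triple whose subject type matches a known relation signature casts one
    vote for the object's type. A guess is accepted only once it has at least
    `min_votes` votes, so a single stray mention is not enough.
    """
    votes = defaultdict(Counter)
    for subject, verb, obj in triples:
        signature = RELATION_SIGNATURES.get(verb)
        if signature is None:
            continue  # no seed knowledge about this verb
        subject_type, object_type = signature
        if known_types.get(subject) == subject_type and obj not in known_types:
            votes[obj][object_type] += 1

    inferred = {}
    for entity, counter in votes.items():
        best_type, count = counter.most_common(1)[0]
        if count >= min_votes:
            inferred[entity] = best_type
    return inferred


# Four stories report the same acquisition; one mentions an unconfirmed deal.
triples = [("Google", "acquires", "Motorola")] * 4
triples.append(("Google", "acquires", "Rumor Inc"))

inferred = infer_types(triples, SEED_TYPES)
print(inferred)
```

In this toy version, the entity mentioned in four stories clears the threshold and the one mentioned once does not, which is the “rely on the multitude” idea in miniature.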
Adrian Aoun’s father is a linguist. Joseph Aoun studied with Noam Chomsky at MIT and spent 25 years at the University of Southern California, before taking over as president of Northeastern University in Boston.
According to Joseph Aoun, his son grew up saying he would never follow him into the field of linguistics. His son hasn’t. But then again, he has. “Clearly, something rubbed off,” says Joseph Aoun.
Google meets Facebook meets the future
To analyze this avalanche of data, Aoun and his team built their own distributed software platform that runs across thousands of virtual servers. Aoun compares the system to the “Caffeine” platform underpinning Google’s search engine. It’s able to crunch data in real time and immediately move it into a much larger database of information.
This database is split into two parts: one holds the structured metadata generated by the Wavii system, and the other holds the actual internet data that will be served up to users. Aoun compares this portion of the system to Haystack, the platform Facebook built to store the billions of photos posted to its social network.
The metadata is stored on Amazon’s Elastic Compute Cloud service with a homegrown in-memory database, and the data itself is housed on Amazon’s sister service, S3. When you use Wavii, the system queries the metadata, and using this metadata, it populates your feed with the links and other information stored on S3.
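The two-part design can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: plain dictionaries stand in for the homegrown in-memory database and for S3, and the `FeedStore` class, its methods, and the field names are hypothetical, not Wavii’s actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    topic: str     # what the item is about, e.g. "IBM"
    summary: str   # plain-English description of the event
    blob_key: str  # where the full item lives in the blob store


class FeedStore:
    """Toy two-part store: a small metadata index plus a bulk content store."""

    def __init__(self):
        self.metadata_by_topic = {}  # stand-in for the in-memory metadata database
        self.blobs = {}              # stand-in for the bulk store (S3 in the article)

    def ingest(self, topic, summary, blob_key, content):
        # Store the bulky content once, and index it by topic in the metadata.
        self.blobs[blob_key] = content
        record = MetadataRecord(topic, summary, blob_key)
        self.metadata_by_topic.setdefault(topic, []).append(record)

    def feed(self, followed_topics):
        # Query the metadata first, then pull only the matching content.
        items = []
        for topic in followed_topics:
            for record in self.metadata_by_topic.get(topic, []):
                items.append({
                    "topic": record.topic,
                    "summary": record.summary,
                    "content": self.blobs[record.blob_key],
                })
        return items


store = FeedStore()
store.ingest("IBM", "IBM announced a new product", "articles/1", "<full article text>")
store.ingest("Mitt Romney", "Mitt Romney gave a speech", "articles/2", "<full article text>")

items = store.feed(["IBM"])
print(items)
```

The point of the split is that a feed request only touches the small, fast metadata index until it knows exactly which stored items it needs.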
At the moment, Aoun and company limit the scope of this system. You can only “follow” certain types of news topics. But the company plans to gradually expand this scope, and eventually, Aoun says, it will offer APIs – application programming interfaces – that will allow other software applications to use its structured data.
Aoun acknowledges that the project is enormously ambitious. But he doesn’t see this as a problem. “That’s the way it should be,” he says.