The social media wave is being followed by a big data tsunami.
OK, the imagery is getting a little outlandish, but the flood of information that must be stored and analyzed is generating excitement, especially in Boston, where many in the tech world worry that they were at the beach while Silicon Valley and New York enjoyed the fruits of the Web 2.0 revolution.
Social networking companies such as Facebook and Twitter are generating terabytes of content, IDC analyst David Reinsel said during a keynote Thursday at the Massachusetts Technology Leadership Council’s Big Data Summit in Burlington, Mass. For example, 3 billion photos are uploaded to Facebook each month, for a total of 3,600 terabytes per year. (A terabyte equals one trillion bytes.)
More important than content creation, he said, is content consumption, which involves vaster amounts of data: “Consumption is what’s driving big IT…Consumption is what drives traffic to your website, and that’s what gets you ad revenue…It demands analytics.
“The future opportunity is around social networks and how to drive that commerce, how to drive that revenue, and then, even beyond that, smart technologies,” Reinsel said. “It’s really about finding answers where we haven’t even asked questions yet.”
Database pioneer Michael Stonebraker, who led a panel after Reinsel spoke, said storing and retrieving data is not that difficult. “What’s hard is managing that data.” He said Facebook is running 4,000 instances of MySQL to manage the social network, but that’s not fast enough, so in front of those databases the company is running 9,000 instances of a database memory caching system. “They’ve got a ton of moving parts just to try and keep up with their load,” Stonebraker said.
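The arrangement Stonebraker describes — a memory cache sitting in front of the databases so most reads never touch MySQL — is commonly called the cache-aside pattern. A minimal sketch of the idea follows; the class, the dict-backed cache, and the time-to-live are illustrative stand-ins for a real networked cache such as memcached, not Facebook’s actual code.

```python
import time

class CacheAside:
    """Read-through cache in front of a slower backing store.

    A plain dict with a time-to-live stands in for a networked
    cache service; `backing_store` stands in for a database query.
    """

    def __init__(self, backing_store, ttl_seconds=60):
        self.store = backing_store      # e.g. a database lookup function
        self.ttl = ttl_seconds
        self.cache = {}                 # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                return value            # cache hit: the database is never touched
        value = self.store(key)         # cache miss: fall through to the database
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        # On writes, drop the cached copy so the next reader
        # falls through to the database and repopulates the cache.
        self.cache.pop(key, None)
```

The “moving parts” Stonebraker mentions come from keeping thousands of such caches consistent with the databases behind them: every write path has to invalidate or update the right cache entries, or readers see stale data.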
While data warehouses in large companies like Wal-Mart are growing at a moderate rate and “continue to be in pain,” the data tsunami is coming from Web 2.0 companies and science applications such as genomics, Stonebraker said. Yahoo is collecting 1 terabyte a day of click-stream data for ad placement. Social gaming company Zynga is recording “everything anyone ever did,” said Stonebraker. “They want to sell you virtual goods…and to sell you virtual goods, they’ve got to figure out what you’re doing.”
As genome sequencing gets cheap enough for average people to afford, drug companies will want to store and analyze all the data, Stonebraker said. Even now, researchers are drowning in data: at Johns Hopkins University alone, for instance, there are 20 research groups with half a petabyte (500 terabytes) or more of data. All these users want scalable storage, better tools to analyze their data and new features, such as the ability to instantly see the source of data used in a particular calculation, Stonebraker said.
“I think my esteemed panel members may be underestimating the size of the tsunami to come,” said Paul Brown, Paradigm4’s chief architect. “The data that’s coming in is being generated not by human beings very much but by machines. And machines are very like reality TV stars: they’re fast, they’re cheap and they’re plentiful.”
“The amount of data that’s trying to be combined from a variety of sources is all machine-driven,” said Rock Gnatovich of TIBCO Spotfire, “but ultimately it’s got to be the human that interacts with that, interprets that, then is able to make the decision and drive the action.” This will require user interfaces with better visualization and more interactivity, and the ability to run on everything from PCs to smartphones.
“I think storytelling is becoming one of the new frontiers,” said Luke Lonergan, co-founder of Greenplum, now part of EMC Corp. But beyond that, “it really matters a lot to bring the brain to the problem in a way that you can untangle the complexities.”
By Russell Garland, Wall St. Journal