Organic SEO Blog


Monday, March 16, 2009

Google's Data Mining System - Packaged For The Masses
As Originally Posted to The New York Times

Cloudera is the quintessential Silicon Valley story.

Three top engineers from Google, Yahoo and Facebook have teamed up with an ex-Oracle executive to tackle the problems inherent in quickly analyzing big piles of data. On Monday, they’re revealing a commercial product based on the open-source software Hadoop, which provides the analytical magic behind the world’s biggest Web sites. The team at Cloudera, based in Burlingame, Calif., thinks it can extend Web smarts to the business world, aiding companies in retail, insurance, biotech and oil and gas.

Hadoop is the open-source version of the file system and MapReduce technology developed by Google. Google has used such software to rewire its entire search index, making it possible for the company to run ever-faster searches on cheap servers and to ask questions of its vast data stores and receive coherent answers.

Rather than keeping data locked in a central database, Google spreads information across thousands of servers. Engineers can then send out requests to these servers via MapReduce and gain new insights into people’s searching behavior and the relationships between Web sites. Best of all, MapReduce keeps these complicated jobs humming along even when computers fail, because it maintains a cohesive picture of all the systems.
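The programming model behind this is simpler than the scale suggests: a "map" step emits key–value pairs from each piece of data, the framework groups the pairs by key, and a "reduce" step aggregates each group. As an illustrative sketch only, here is the classic word-count example in plain Python; the function names and the in-memory shuffle are stand-ins for what Hadoop would run in parallel across many machines, not the Hadoop API itself:

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```

Because each map call touches only one document and each reduce call only one key's values, the framework can rerun any failed piece on another machine without redoing the whole job, which is the fault-tolerance property described above.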

While Google has kept the deep details on this technology to itself, the company did publish a couple of papers describing some of the underlying principles. That gave Doug Cutting, formerly a software consultant and now a Yahoo engineer, enough information to create an open-source take on the code.

Yahoo has since invested millions of dollars improving Hadoop and uses the technology to figure out what users should see on its home page, based on their surfing habits, and what ads to display next to search results.

Other major Web companies, including Microsoft, Facebook and Fox Interactive Media, have picked up Hadoop as well.

The founders of Cloudera argue that the analytical powers of Hadoop can benefit a whole new class of businesses. For example, they want to show biotech firms new ways of analyzing genome and protein data and give oil and gas firms new ways of digging through their reservoir data.

The pitch has proved attractive enough for Accel Partners to pump money into the start-up. Diane Greene, the co-founder of VMware; Marten Mickos, the former chief executive of MySQL; and Gideon Yu, the chief financial officer at Facebook, have invested in the company as well.

While Hadoop remains free, Cloudera plans to sell support and consulting services around the software.

The backgrounds of the executives point to the classic Silicon Valley nature of the story.

Just 26, Jeff Hammerbacher has already worked on Wall Street and at Facebook after graduating from Harvard, where he earned a degree in math. Christophe Bisciglia, 28, arrived at Google after raising and selling horses online during his high school years.

Amr Awadallah, 38, arrived in the United States from Egypt and secured a job at Yahoo, where he helped develop Hadoop. And Mike Olson, the 46-year-old chief executive, is a database executive who sold an open-source software maker called Sleepycat to Oracle in 2006.

While Google could make a fuss about intellectual property rights to the technology, the company has given Cloudera its blessing. Mr. Bisciglia discussed the company with Google’s chief executive, Eric Schmidt, last March.

“He agreed that this technology is not just for researchers, and it’s good for Google to make this pervasive,” Mr. Bisciglia said. “The more data people create, the more data Google can slurp up.”

A number of prominent computer scientists have hailed Hadoop as the right answer for an age when companies have moved from dealing with gigabytes of data to terabytes and now petabytes (one petabyte is equal to 1 million gigabytes or 1,000 terabytes). It’s one thing to store all of that information and another thing to be able to mine it in an efficient manner.

“It is a new reality that people have the ability to store and analyze terabytes and petabytes of data,” Bisciglia said. “Now they need the tools to process it.”