Google BigQuery: Simplified Big Data

What is Google BigQuery?

It is a cloud service that delivers Big Data as SaaS: you pay for what you use. Google BigQuery is a Big Data solution, like Hadoop, with the advantage that you don’t need to buy or rent a bunch of servers, nor depend on highly specialized labor. Well, that is Google’s promise.

I’ve been using BigQuery since December 2012, so just under 30 days. It’s still early to give a thorough impression, but so far everything has been good.

I uploaded a modest dataset: 500 million rows of browsing data from a particular website. And I’m “asking” questions like: which products were viewed? What was actually purchased? Where are the visitors from, and which products were viewed or purchased in a particular region? What offers were “pushed” to each user while they were browsing? Among others.

Attention! BigQuery is not a traditional relational database! It is still Big Data in every sense: unstructured data (though presented as tables), NoSQL (though it has its own SQL-like language), you cannot create indexes, and you cannot modify data (no updates or deletes). It is purely an OLAP system.

I always say: Big Data is not for everyone, nor for every application.

If Big Data is for you and/or your application, it’s possible that BigQuery could also work!

Just like in Hadoop, where you can use Hive (and its HiveQL), in BigQuery you can run extremely fast ad hoc queries with a SQL-like syntax.
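To give an idea of what such an ad hoc question looks like, here is a minimal sketch. It is not BigQuery code: it uses Python’s built-in sqlite3 module as a local stand-in, and the `events` table, its columns, and the sample rows are all hypothetical, invented only for illustration. The point is the shape of the query (views and purchases per product and region), which is plain SQL aggregation.

```python
import sqlite3

# Hypothetical schema: one row per browsing event on the site.
# sqlite3 is only a local stand-in here, not BigQuery itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, region TEXT, product TEXT, action TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("u1", "SP", "tv", "view"),
        ("u1", "SP", "tv", "purchase"),
        ("u2", "RJ", "tv", "view"),
        ("u2", "RJ", "phone", "view"),
        ("u3", "SP", "phone", "view"),
    ],
)


def views_and_purchases_by_region(connection):
    """Count views and purchases of each product per region."""
    query = """
        SELECT region,
               product,
               SUM(CASE WHEN action = 'view' THEN 1 ELSE 0 END) AS views,
               SUM(CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) AS purchases
        FROM events
        GROUP BY region, product
        ORDER BY region, product
    """
    return connection.execute(query).fetchall()


rows = views_and_purchases_by_region(conn)
for row in rows:
    print(row)
```

On a real dataset the same kind of statement would be sent to the query engine instead of a local in-memory database; the exact dialect details may differ.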

Why is BigQuery much faster than Hadoop? Well, this is hard to answer. Very hard. To scale performance, Hadoop relies on adding boxes (servers). While most Hadoop clusters I’ve seen in Brazil have 4 to 10 servers (a few have up to 40, but most are under 10), in BigQuery applications are designed to be large from the start. Your data is replicated to dozens of servers.

Just like a single swallow doesn’t make a summer, a handful of Hadoop servers doesn’t either.

So, comparing BigQuery’s performance with a “small” Hadoop cluster is not the fairest comparison.

As I mentioned before, I imported a modest dataset into BigQuery’s cloud: 500 million rows, spread across various TXT files that add up to around 150 GB. The import and/or copy task to the cloud is tedious, slow, boring, and not very smart. If something goes wrong in any file, things get really bad.
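Since one bad file can ruin a long upload, it may help to check the files locally first. The following is a minimal sketch under stated assumptions: the function name `find_bad_rows` is hypothetical, and it assumes tab-delimited text with a fixed number of columns, which may not match the real files. It only reports rows whose column count is wrong; it does not upload anything.

```python
import csv
import io


def find_bad_rows(lines, expected_columns, delimiter="\t"):
    """Return (row_number, column_count) for rows with the wrong column count."""
    bad = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, fields in enumerate(reader, start=1):
        if len(fields) != expected_columns:
            bad.append((row_number, len(fields)))
    return bad


# Tiny in-memory sample: the second row is missing one column.
sample = io.StringIO("u1\tSP\ttv\tview\nu2\tRJ\ttv\nu3\tSP\tphone\tview\n")
print(find_bad_rows(sample, expected_columns=4))
```

For real files, an open file handle can be passed in place of the in-memory sample, for example `open(path, newline="")`.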

I can only imagine what a 100 GB daily load or an initial load of around 10 TB would be like. It’s possible that, through a commercial relationship with Google, there is something like Star Trek teleporters. I didn’t find it.

Anyway, I am certainly excited about the performance. But again, I’m comparing it to my “small” 6-server cluster.

On my cluster, with everything “perfectly” tuned, I can run my heaviest query in 3 minutes. On Google BigQuery, it takes 5 seconds. Note that, on my cluster, I use Hive/HiveQL so that the comparison is like for like with BigQuery’s SQL. Yes, it is much faster.
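Just to put a number on “much faster,” the arithmetic from the two timings above is:

```python
# Timings quoted in the text: 3 minutes on the Hive cluster, 5 seconds on BigQuery.
hive_seconds = 3 * 60
bigquery_seconds = 5

speedup = hive_seconds / bigquery_seconds
print(f"{speedup:.0f}x faster")  # prints "36x faster"
```

This is a single query on a single dataset, so it is an anecdote, not a benchmark.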

I’m not too happy about how little flexibility I have over my data: accessing it in different ways, modifying it, and overall having control over “my data.” In a traditional Hadoop cluster, I have total freedom.

The price seems quite fair. Much cheaper than buying (or renting) servers and maintaining the whole infrastructure. I’m not sure I would have the courage to put strategic data on a cloud as open as Google’s, even if protected via SSL, blah, blah, blah.

Joins! Yes, it is possible to perform joins in BigQuery, just as you can in Hadoop clusters using Hive and Pig. I didn’t have any problems with joins, but Google documents a number of limitations. Again, my dataset is small, almost ridiculously so by Big Data standards, so I don’t run into these limitations.

I miss several features from the Hadoop ecosystem, but it’s still too early to pass a definitive judgment on BigQuery. I can only say that, for someone with 30 days of experience, I’m satisfied.

You know when you buy a new car? It seemingly has no defects or strange noises. Actually, they are there, but the enthusiasm for the new car masks them. I think I’m in that phase. So far, almost everything is “love.”
