BigQuery

I have been working on BigQuery for a few months and would like to share what I have learnt.

What is BigQuery?

BigQuery is a SAS Database platform by google. BigQuery is very similar to RDBMS, and you use SQL to work with it. The main advantage of BigQuery is that it is a RDBMS for bigdata. You can put as much data (even PB) as you want and then query the data. Since there is no free lunch and we have the CAP Theorem, BigQuery uses eventual consistent. Due to this there are no foreign keys, though you can still do joins and all other SQL Queries. The time to get a result is not sub seconds, but you go over terabytes of data within a few seconds an even petabytes in a few minutes. For a demo see Google BigQuery introduction.

Limitations

Currently (this is an ongoing project and constantly changing) the onset of bigquery is that it is a readonly database. You can load any data you want and then do all your analysis on the data.

There are multiple ways to upload data into bigquery. You can duplicate data from other tables, upload csv files or use streaming.

Google has added update and delete command, but they are very limited.

Streaming

Since we need to transform data online from salesforce to BQ I used the streaming API.

I have faced a few issues with tables that use streaming:

  • The first, is that on tables that you stream you cannot use the update or delete commands :(.

  • Second problem that I faced, was that google supports eventual consistency. I had wanted to use the data from the table after the streaming (as you will see how i overcame the update problem) , but the data was not fully there. To solve this, I query the count on the table and wait in a loop (usually under 10 secs or so) until I have the full count before using the table.

  • Another issue with streaming is that google does not support partitioning of the table (which can improve performance and price)

Update problem

Since basically bigquery does not support update and delete (current support is limited, and does not work on streaming tables), I had the following problem:

I have a table that is a snapshot of a table from an external system (salesforce). Every so often I want to sync the delta that has changed to the BQ table. Appending new records is not an issue, but updating or deleted is. So what is the workaround?

Since I also don’t want half batches, during the sync of the data, all data will be saved into a temp table (another issue I hit - you cannot reuse table names with streaming since deleted tables still stay around for a while, so I added a timestamp to the table name).

I added a column to the temp table so i know if the record is a deleted record or not. I do not know if the others are add or update, but there is a solution for this as you will see.

I created a select to get all records from the main table that the Id’s do not appear in the temp table (these are the records that have not changed). Then I union to this, all the rest of the records from the temp table (without the deleted ones, this will give me all the new and updated records).

The trick is that the output of this query can then override the original table. So in the end I get an updated table with the same name that has all the records (of course the price for this is a full table scan of both main and temp tables).

Note: The nice SQL "select Id from x where Id not in (select Id from y) " does not work well in BQ and on big tables (over 200,000 records) you get an error of "Resources exceeded during query execution". To solve this I changed the query to be an outer join where the right side is null: “select main.* from x main LEFT JOIN y temp ON temp.Id = main.Id WHERE temp.Id IS NULL”

Pricing

With google you pay only what you use. So with BQ you will pay for storage. In addition for storage you will pay for analysis of the data. So on this you must be careful. Part of the magic is that instead of trying to bypass the full table scan, google optimized the full table scan and will always scan the table. So you need to check your query, and do not bring what you don’t need (you will pay for it).

Google has a nice feature in the UI to display the MB processed for the query.

For more information on pricing see https://cloud.google.com/bigquery/pricing.

Summary

BigQuery is a very useful tool for analysis of big data, and is still a product in development. So before you jump in make sure you know your requirements well and check that BigQuery supports them.