Salesforce Data Extractor Conundrum
There are many companies that are using Salesforce as their CRM. Though, what is very lacking in the CRM systems is analytics. So many companies need to export their data from Salesforce to another systems like BigQuery.
If you need to do this and do not want to get into the nitty gritty issues than there are solutions out there that you can use, such as alooma or rivery. These platforms know how to transfer data from many systems to analytic systems like BigQuery or Redshift.
Since you are reading this blog then it means that you have decided to implement some sort of export by yourself. Salesforce as you know has a relational DB but all interfaces to it is via the REST interface of Salesforce. If you writing your code in java, there is a very nice client on top of the REST api by jesperfj.
The tables in Salesforce can vary very much as to how many rows or how many columns each table has. So we will need to deal differently with each table depending on the size in rows and MB due to column numbers.
Another limitation that we need to deal with is Salesforce API limits. Salesforce has many limits per API call, so you need to decide which one to use when. For more information see API Limits.
Our solution is going to be based on three options to use. If you will be syncing multiple tables, in the end you will need to mix all three solutions depending on the table size and table type.
Any time you sync a lot of data, you have to deal with two phases. There is the first phase that needs to download a big bulk of data for the first time. The second phase is to download only the delta per table.
When getting the latest delta from a table, you need a column that is the last modified so that you know which are the updated rows. In salesforce there are two types of tables. One is append only tables (History tables) and these tables do not have a modified column but have a "CreatedDate" column instead. For the other tables you need to decide if you want to use the “LastModifiedDate” or “SystemModstamp” (for more information see systemmodstamp-vs-lastmodifieddate).
For a full list of API’s see which api to use.
The first api that I will discuss, and the most basic one is the standard REST API. Through this API you can request information about the tables in Salesforce and data from the table themselves. The main limitation with this interface is that each request is limited with up to 2000 results. So to download a million rows you will need 500 round trips (each request returns a token for the next one). For more information see the REST API.
This interface is very suited for the case that you need to download only a few thousand of records.
In the case that you do need to use this interface to download a lot of records (as we will see in the next sections that not all tables support the bulk options) you can use the REST API. There is the option to download in multithreading, and see the section on Manual Chunking for some tricks on mass downloading.
Since some of the tables have millions of records we obviously need a bulk download option. Salesforce does give us this option. To use this option you create a job in salesforce to export the records. You then poll to check the status of the job, and once it is finished you can download the data from the job, see the api for more information.
Every solution brings the next problem, in this case if the record size if very large it will be very hard to download the gb in one chunk. For this salesforce added the option for chunking. The request stays the same but in the rest header we add the chunk size. The job then breaks down the results into chunks, and each one can be downloaded in parallel. For more information see chunking solution.
The bulk download with chunking is the best solution for download large sets of data. The problem is that not all tables in salesforce support chunking. Tables like Notes, AccountHistory do not support chunking, and we need to download millions of records.
To do this we need to implement the same chunking that salesforce does on the server side but with the standard REST API.
In a nutshell the Id column in each table has a sequence built into it. Since the Id is the clustered index of all tables the search by it is very fast. So we need to download all the Id’s sorted and then we can create separate jobs to download bulks with ranges of Id’s. For a full description of the solution see pk-chunking-techniques.
I would like to add an enhancement to the article of chunking. To download all Id’s we need to run the query and queryMore for all the Id’s in chunks of 2000 each. Even though this can be done in multithreading it takes a long time. As you can see in the article the queryMore has the offset in the request. So instead of sending 500 requests for 1,000,000 records, you can skip all the Id’s in the middle and send in the queryMore the batch size (say 100,000) and then we can get down to 10 requests for all Id’s.
When working with salesforce, you need to know the different API's and the knowledge to leverage the different options according to the changing needs.