The Caveat of Loading Data to Partitioned Table on BigQuery
Table Partitioning and Why
Table partitioning is physically separating the data from the same logical table for optimisation purpose. The bigger the table, the more we can see the benefit of partitioning. Because of the same reason, table partitioning is a common practice on data warehouse, which usually hold a big amount of data. By partitioning the table, queries become more efficient and data lifecycle is easier to manage.
Let’s say we are managing a data warehouse of social media, one of the biggest table would be fact_post table which contains all posts from all users. The fact_post table is best to be partitioned by date. The most common way to partition the table is by manually separating the physical table into one table per day with table name pattern fact_post_yyyymmdd.
To analyse data in certain timeframe, we can union all tables with table name pattern so that we only read tables we need to. Managing data lifecycle is also easier; for example, to manage data retention, we only need to drop table to delete old data, which is a lot faster compare to DELETE statement with WHERE clause.
Common Partitioning Type
Date is not the only common partition type. Some tables are too small, hence monthly or even annual partition does more make sense. Another common pattern is to partition by fields with sane number of unique possible values, e.g. country. Partitioning by customer, especially for B2C, is not that common since it will generate enormous amount of tables which leads to metadata overhead.
You will need to understand your query pattern and distribution of the data in order to decide what column you should use to partition your data. The most common column used for filtering on WHERE clause is a potential partitioning column. Try to avoid skew / hotspot on your data, e.g. partition by country meanwhile most of transactions happen only in one specific country.
Partitioned Table on BigQuery
The partitioning example I mentioned before is all manual. On BigQuery, there are a feature called partitioned table, where partition is handled as part of BigQuery feature. To quote the official partitioned table documentation (taken 1/Sep/2019):
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
Having partitioned table as part of the BigQuery feature makes our life a lot easier. It is easier to query the whole table for all partition, it is also easier to specify the partitions we need to query. We can choose partitions to read by placing the condition on WHERE clause right away, e.g. using BETWEEN <start date> AND <end date>
Remember that the goal of partitioning is to optimise your query. On BigQuery, it’s a big deal as you likely optimise your cost a lot by doing partitioning. At the time of writing, BigQuery support two partitioning modes:
- Partitioning by ingestion time, which means the time when you load your data to BigQuery. This means your table is partitioned by _PARTITIONTIME column, a meta column.
- Partitioning by date column, based on a column you specify with DATE or TIMESTAMP data type.
The Caveat of Loading Data to Partitioned Table
I had a use case to partition data by date using event_date column. Reading the documentation, I decided to try out partitioning by date column, it felt natural to do so. To understand the behaviour, I loaded the sample data using BigQuery Python client library.
As I designed my pipeline to be idempotent, I used WRITE_TRUNCATE disposition. This means, my pipeline will keep replacing the partition based on the result of pipeline. Read this for more on why idempotent pipeline is easier to manage.
And here’s the caveat. Let’s say the result of my pipeline contains dates of 1 Sep, 2 Sep, and 3 Sep 2019. I expected BigQuery to detect those three days and replace three partitions accordingly, but that was not the case. All partitions were gone, replaced by the new data, hence only three partitions remain. Maybe if I had used WRITE_APPEND, behaviour I expected would happen. But as I want my pipeline idempotent, that was not my option and I didn’t test that.
So, in order to keep my pipeline idempotent, I decided to use ingestion time or _PARTITIONTIME column, and load with table decorator $YYYYMMDD. For example, to load the data of 1-Sep-2019, I load using WRITE_TRUNCATE disposition to fact_something$20190901. This also means I have to filter my data before I load it, so that it only contains data for that date I am going to load to. This is done by using event_date = “2019–01–01” condition.