Partitioning and Bucketing Data

Eswar Kondapavuluri
Chartboost Engineering
4 min read · Aug 13, 2019


There are a lot of things we’d like to do with Big Data, such as querying, aggregating, and transforming it. But one of the most important aspects of Big Data, one that affects all subsequent operations, is storage. The way we store data lets us optimize for the most critical operations while accepting lower performance in other areas.

When discussing storage of Big Data, topics such as orientation (row vs. column), storage backend (in-memory, HDFS, S3, …), and data format (CSV, JSON, Parquet, …) inevitably come up. In this post, I’ll focus on how partitioning and bucketing your data can improve performance as well as decrease cost.

[Figure: simple diagram illustrating the difference between buckets and partitions]

Know Your Data

Before diving in, it is vital to know what kind of data you are working with. For example, you may need to know the size of your data set, the cardinality of the key columns, and the distribution of values in those columns. The better you understand your data, the better you’ll be able to optimize how you store it.
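
The examples later in this post use an auctions table called sample_auctions, and this is the kind of quick, throwaway profiling I have in mind. These queries are just a sketch, assuming the table and columns introduced below already exist:

SELECT COUNT(*)            AS total_rows,
       COUNT(DISTINCT app) AS distinct_apps,
       COUNT(DISTINCT os)  AS distinct_os,
       COUNT(DISTINCT osv) AS distinct_osv
FROM sample_auctions;

-- Distribution of values for a single candidate column
SELECT placement,
       COUNT(*) AS num_rows
FROM sample_auctions
GROUP BY placement
ORDER BY num_rows DESC;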

Partitioning Data

Partitioning data is simply dividing our data into different sections or pieces. Columns that we frequently filter on and whose cardinality (number of unique values) is constant or limited are excellent choices for partitions. We often want to query on various filters, such as date, id, or type. If we partition our data on date and type, our queries will skip unnecessary partitions and read only from the sections that are guaranteed to contain the relevant rows.

For example, at Chartboost we collect billions of auctions each day. I’ll be working with a small subset of the data along with AWS Athena to illustrate how partitioning can be useful.

Below is a CREATE TABLE DDL for my example auctions table. (The storage format and S3 location shown are placeholders, and timestamp and date are escaped with backticks because they are reserved words in Athena DDL.)

CREATE EXTERNAL TABLE sample_auctions (
  id string,
  app string,
  placement string,
  country string,
  bid double,
  os string,
  osv string,
  `timestamp` bigint)
PARTITIONED BY (
  `date` string)
STORED AS PARQUET
LOCATION 's3://my-bucket/sample_auctions/'  -- placeholder format and location
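
One practical note: with an external table like this, Athena only knows about the partitions that have been registered in its catalog, so new date partitions have to be added before they can be queried. A minimal sketch, assuming the data on S3 is laid out Hive-style (date=2019-07-19/…) under the same placeholder bucket:

-- Register every partition found under the table's LOCATION (Hive-style layout only)
MSCK REPAIR TABLE sample_auctions;

-- Or register a single partition explicitly
ALTER TABLE sample_auctions ADD IF NOT EXISTS
PARTITION (`date` = '2019-07-19')
LOCATION 's3://my-bucket/sample_auctions/date=2019-07-19/';  -- placeholder path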

As you can see, this table is already partitioned by date. So if I query for a specific date range:

SELECT *
FROM sample_auctions
WHERE "date" >= '2019-07-19'

Athena will only scan data under the partitions matching those dates. This isn’t quite good enough, however, so let’s try to improve the table. Oftentimes we need to query or aggregate on a specific app, placement, or OS. And since I know my data, I can see that I have 25 unique apps, two placement types (Interstitial, Rewarded), and two operating systems (Android, iOS). These columns have well-defined, limited cardinality and are therefore excellent candidates for partitioning.

Note that the country column would also have made a good candidate for partitioning, since the set of countries is limited. And while it is tempting to partition on osv as well (there are very good use cases), I chose not to because my example data set contains many different OS versions, and that cardinality will only increase in the future.

Now with the following table:

CREATE EXTERNAL TABLE sample_auctions (
  id string,
  country string,
  bid double,
  osv string,
  `timestamp` bigint)
PARTITIONED BY (
  `date` string,
  app string,
  placement string,
  os string)
STORED AS PARQUET
LOCATION 's3://my-bucket/sample_auctions/'  -- placeholders, as above

queries such as:

SELECT osv,
       COUNT(*) AS num_auctions
FROM sample_auctions
WHERE "date" >= '2019-07-19'
AND app = 'testapp'
AND placement = 'interstitial'
AND os = 'iOS'
GROUP BY osv
ORDER BY osv DESC

will be much more performant.

Bucketing Data

Bucketing also divides your data, but in a different way. By defining a constant number of buckets, you force your data into a fixed number of files within each partition. Think of it as grouping rows by an attribute: each row is assigned to a bucket based on a hash of the bucketed column values, so when we filter on those columns, we know exactly which bucket to look in. Bucketing works well on columns with high cardinality and a uniform distribution of values.

For example, in the table above, both id and timestamp are great candidates for bucketing, as both have very high cardinality and a generally uniform distribution.

CREATE EXTERNAL TABLE sample_auctions (
  id string,
  country string,
  bid double,
  osv string,
  `timestamp` bigint)
PARTITIONED BY (
  `date` string,
  app string,
  placement string,
  os string)
CLUSTERED BY (
  id,
  `timestamp`)
INTO 50 BUCKETS
STORED AS PARQUET
LOCATION 's3://my-bucket/sample_auctions/'  -- placeholders, as above
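
Since a row’s bucket is determined by hashing the bucketed columns, an equality filter on those columns lets the engine, at least in principle, skip the bucket files that cannot contain a match instead of reading all 50 of them within the partition. A sketch of such a lookup, with made-up id and timestamp values purely for illustration:

SELECT *
FROM sample_auctions
WHERE "date" = '2019-07-19'
AND app = 'testapp'
AND placement = 'interstitial'
AND os = 'iOS'
AND id = 'abc123'             -- hypothetical auction id
AND "timestamp" = 1563500000  -- hypothetical epoch timestamp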

The bid column seems perfect to bucket on, but this would be a mistake, because in the example data set we have a large number of “no-bids”, which correspond to a bid value of 0.0. So even though the column has high cardinality, the overwhelmingly skewed distribution of bid values would put most of our rows into a single bucket, which would decrease performance.
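
This is the sort of thing that is easy to check up front. A quick skew check along these lines (just a sketch) shows whether a handful of values dominate the column you are considering for bucketing:

SELECT bid,
       COUNT(*) AS num_rows
FROM sample_auctions
GROUP BY bid
ORDER BY num_rows DESC
LIMIT 10;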

Conclusion

Partitioning and bucketing can be very powerful tools for increasing the performance of your Big Data operations, but to use them properly you need to know your data. Data can be complex and difficult to understand, in which case trial and error can give you a better idea of your data distribution or at least point you in the right direction. I found AWS Athena CTAS queries extremely useful here, as they let me quickly create tables to experiment with different partitioning and bucketing schemes.
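
As an illustration of that workflow, a CTAS experiment might look something like the sketch below. The table name, S3 path, and property values are placeholders rather than our production setup, and the partition columns have to come last in the SELECT list. Keep in mind that Athena also caps how many partitions a single CTAS query can write (100 at the time of writing), so experiments over long date ranges may need to be split up.

-- Hypothetical CTAS experiment with a different partitioning and bucketing scheme
CREATE TABLE sample_auctions_experiment
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/experiments/sample_auctions/',  -- placeholder path
  partitioned_by = ARRAY['app', 'placement', 'os'],
  bucketed_by = ARRAY['id'],
  bucket_count = 50
) AS
SELECT id,
       country,
       bid,
       osv,
       "timestamp",
       app,
       placement,
       os
FROM sample_auctions
WHERE "date" >= '2019-07-19';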
