How to put error bars on your data: the easy way

Ryan Topping
Chartboost Engineering
Jul 29, 2019


It has never been easier or cheaper to store and analyse large amounts of data. This has led to an explosion of techniques and technologies for harnessing that data to improve decision making, track customer behaviour and even create brand new products.

This began with the arrival of Hadoop and has continued with the development of serverless analytics (e.g. Glue + Athena on AWS) that drastically reduce the technical know-how and time investment required to start collecting and analysing data.

At Chartboost we collect large amounts of data that is vital to how we function as a business. However, as easy as it is these days to start collecting data, actually extracting knowledge from it can be tricky unless we understand the importance of statistical reasoning. In this post I make the case for why this matters and show how any engineer can use some basic statistics to get more from their data.

The Importance of Noise

It is tempting to believe that if we can collect billions of data points then we can eliminate noise by making the sample size really large. For instance, at Chartboost we see billions of ad auctions per day; surely that should be enough to accurately measure our average revenue per auction, right?

The truth is, we only bid on a fraction of auctions; only a fraction of those bids lead to an impression, and only a fraction of those impressions lead to an install for which we recoup some revenue. This means that in a sample of a billion auctions, we might only see revenue for a few thousand of them, which makes the variance much higher than we might expect. Even in very large datasets we can’t escape the need to understand the variability in our measurements.

The most common approach is to use a measure of variability such as the standard error of the mean (SE) to draw error bars. The SE tells us roughly “if we could repeat the process that collected this data, what range would the measurement fall into?”. This applies to averages only, but metrics that we care about tend to be averages, so this is usually enough.

The formula for SE is:

SE = σ / sqrt(N)

Where σ is the standard deviation and sqrt(N) is the square root of the number of observations. Adding error bars of +/- (2 * SE) gives the interval within which roughly 95% of repeated measurements would fall. This is useful because it tells us how precise our measurement is; if we could repeatedly measure the same thing and see wildly different results, then we should take what we see with a pinch of salt.
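As a quick illustration with made-up numbers: if the standard deviation of revenue per auction were σ = $0.50 and we had N = 10,000 auctions, then SE = 0.50 / sqrt(10000) = $0.005, so the error bars would span roughly +/- $0.01 around the mean.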

For instance, this plot shows the average revenue per auction in the USA vs. China:

At first glance, without error bars, we might conclude that we are making 10% more revenue per auction in China. With error bars, however, we can see that there is too much noise in this measurement to draw any conclusions: we could repeat the experiment exactly the same way and see a swing of up to roughly 50% either way in the metric for China. In this case we need more data!

Calculating Errors is Easy!

Most analytics-focussed databases (e.g. Hive, Presto, AWS Athena) provide a stddev function that makes it easy to calculate SE. You can even calculate the lower and upper limits of your error bars directly in SQL, e.g. in Presto:
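A minimal sketch of what that might look like (the transactions table and its revenue and country columns are placeholder names for illustration):

SELECT
  country,
  AVG(revenue) AS avg_revenue,
  -- error bars at +/- 2 standard errors
  AVG(revenue) - 2 * STDDEV(revenue) / SQRT(COUNT(*)) AS lower_bound,
  AVG(revenue) + 2 * STDDEV(revenue) / SQRT(COUNT(*)) AS upper_bound
FROM transactions
GROUP BY country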

One nice trick is to add the building blocks of the standard deviation calculation into your ETL process. Let’s say we are building an aggregate table, revenue_country_product, of revenue per country and product:
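Something along these lines (again a sketch, with the same placeholder transactions source table):

CREATE TABLE revenue_country_product AS
SELECT
  country,
  product,
  COUNT(*)     AS num_transactions,
  SUM(revenue) AS sum_revenue
FROM transactions
GROUP BY country, product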

We can add the sum of squared revenue like this:
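Continuing the sketch above, that is just one extra column in the aggregation:

CREATE TABLE revenue_country_product AS
SELECT
  country,
  product,
  COUNT(*)               AS num_transactions,
  SUM(revenue)           AS sum_revenue,
  -- sum of squared revenue, the extra building block for the standard deviation
  SUM(revenue * revenue) AS sum_revenue_squared
FROM transactions
GROUP BY country, product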

Then, since standard deviation can be calculated as:

σ(x) = sqrt( mean(x²) - mean(x)² )

… this gives us an easy way to calculate error bars for aggregated data, since the sums we need can be rolled up over any granularity. For instance, we could calculate the average revenue per transaction for each country, with error bars, as:
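A sketch of that query, using the placeholder column names from above:

SELECT
  country,
  mean_revenue,
  -- error bars at +/- 2 standard errors, derived entirely from the aggregate sums
  mean_revenue - 2 * std_dev / SQRT(num_transactions) AS lower_bound,
  mean_revenue + 2 * std_dev / SQRT(num_transactions) AS upper_bound
FROM (
  SELECT
    country,
    mean_revenue,
    num_transactions,
    -- sigma(x) = sqrt( mean(x²) - mean(x)² )
    SQRT(mean_revenue_squared - mean_revenue * mean_revenue) AS std_dev
  FROM (
    SELECT
      country,
      SUM(num_transactions)                             AS num_transactions,
      SUM(sum_revenue) / SUM(num_transactions)           AS mean_revenue,
      SUM(sum_revenue_squared) / SUM(num_transactions)   AS mean_revenue_squared
    FROM revenue_country_product
    GROUP BY country
  ) AS per_country
) AS with_std_dev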

(The nested subqueries are not strictly necessary; you could do this in one SELECT. They just make the query more readable.)

This approach lets us calculate error bars over much larger datasets by working from pre-aggregated data, avoiding scans of the raw tables. Notice that we only needed to query the aggregate revenue_country_product table here, not the much bigger transactions table. It is easy to take the same query and calculate error bars for any dimension, or combination of dimensions, in the table.

Hopefully this post has shown the importance of measuring variability and that you don’t need to know a lot about statistics to calculate some basic error bars!
