Concise Test Databases — Maintaining integration environments while saving 65% of the server cost

Abhilash
Chartboost Engineering
4 min read · Jun 11, 2019


Testing is one of the core pillars of any product or software. As the frequency and scope of testing increase, doing it cost-effectively becomes important. We have taken a programmatic approach to creating and using concise test databases to make testing easier and more efficient.

Background

At Chartboost, a number of new features go out on a daily basis. To test these, we used to spin up integration environments (a.k.a. staging or test environments) with a very large replica of our production database (MongoDB). As a result, we ended up with large test clusters that cost a lot of money and took a while to fully deploy. Over the last few months we realized that not all the data from the mongo dump was actually being used; there were always chunks of collections that were not needed to test most features. Hence, we came up with a trimmed version of our mongo data set for integration environments that contains all the data new features need to be tested, while keeping that data fresh so it does not drift out of date from production.

Approach

We wrote a utility in Python (using the pymongo and requests libraries) to help with this. Below are its components:

Metamarkets

  • We use Metamarkets for tracking live analytics such as impressions, ad requests, and spend for our users.
  • Using the Python requests library, the utility fetches the top 200–250 company IDs from the Metamarkets API. This lets us identify the largest users for the previous day so we can use their data for testing. The class below wraps those API calls.
import requests

class MMXRequest(object):

    def __init__(self, api_token, url=MMX_URL, interval='P200D', granularity='P1D',
                 dimension=None, metrics=None, dimension_filters=None, limit=200):
        # MMX_URL (the Metamarkets query endpoint) is assumed to be defined elsewhere in the utility
        self.url = url
        self.interval = interval
        self.granularity = granularity
        self.dimension = dimension
        self.metrics = metrics
        self.dimension_filters = dimension_filters
        self.api_token = api_token
        self.limit = limit
        self.body = {
            "interval": self.interval,
            "granularity": self.granularity,
            "dataView": "my_data_view",
            "dimension": self.dimension,
            "metrics": self.metrics,
            "dimensionFilters": self.dimension_filters,
            "limit": self.limit
        }

    def make_request(self):
        headers = {
            "Content-Type": "application/json",
            "Authorization": "Bearer " + self.api_token
        }
        response = requests.post(self.url, json=self.body, headers=headers)
        if response.status_code == requests.codes.ok:
            # The result object from the response is parsed to get the required information
            return response.json()['result']
        else:
            print(response.text)
            raise Exception("MMX API query failed")
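
For illustration, the class might be used along these lines; the dimension and metric names (company_id, spend) and the shape of the parsed result are assumptions for the sketch, not necessarily our real Metamarkets fields.

# Hypothetical usage sketch: fetch the top companies by spend for the previous day.
mmx = MMXRequest(
    api_token='my-api-token',
    interval='P1D',            # previous day
    dimension='company_id',    # placeholder dimension name
    metrics=['spend'],         # placeholder metric name
    limit=200
)
result = mmx.make_request()
# Assuming the result is a list of row dicts keyed by the dimension name.
company_ids = [row['company_id'] for row in result]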

Config

  • Stores mongo configuration (for example, the server and port used by the Mongo component below)
  • Holds whitelisted company IDs that get added to the company IDs fetched from Metamarkets (a rough sketch of this module follows)
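
As a rough sketch (the keys and values here are illustrative placeholders, not our real configuration), the config module looks something like this:

# config.py (illustrative sketch; values are placeholders)
MONGO = {
    'mongoserver': 'mongo.{}.example.com',  # formatted with the source environment name
    'remoteport': 27017,
}

# Company IDs that should always be included in the dump,
# regardless of what Metamarkets returns (hypothetical variable name).
WHITELISTED_COMPANY_IDS = [
    'company_id_1',
    'company_id_2',
]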

Mongo

  • Using pymongo, this component creates a mongo dump of the collections that contain the company IDs we got from Metamarkets plus the whitelisted ones from the config.
    Note: We have a large number of company IDs, and we want a single dump file per collection, so producing one dump per collection that covers many different company IDs is a challenge. To solve this, we temporarily add a marker field to the matching documents in each collection and take the mongo dump per collection with a query on that field (a sketch of this tagging step follows the MongoConnect class below). The temporary field is removed once the dump completes, or in case of any exceptions.
  • Takes full dumps of small collections and of collections without company IDs.
import subprocess
from pymongo import MongoClient

class MongoConnect:

    def __init__(self, source_env, user, password):
        self.user = user
        self.password = password
        self.environment = MONGO['mongoserver'].format(source_env)
        self.mongo_port = MONGO['remoteport']
        self.mongo_url = 'mongodb://{}:{}@{}:{}/my_db'.format(
            self.user, self.password, self.environment, self.mongo_port)
        self.client = MongoClient(self.mongo_url)
        self.db = self.client.my_db
        self.db_name = self.db.name
        # Initialize all the collections here

    def get_dumps(self, collection_name, output_file_location, flag=None):
        print('\nGetting mongo dumps for {} collection...'.format(collection_name))
        command = [
            'mongodump',
            '--host', self.environment,
            '--port', str(self.mongo_port),
            '--collection', collection_name,
            '--db', self.db_name,
            '-u', self.user,
            '-p', self.password,
            '--out', output_file_location
        ]
        if flag is not None:
            # Restrict the dump to documents matching the given query (a JSON string)
            command.append('-q')
            command.append(flag)
        try:
            # check_call raises CalledProcessError on a non-zero exit code
            subprocess.check_call(command, stderr=subprocess.STDOUT)
            print('\nMongo dumps for {} collection success!'.format(collection_name))
        except subprocess.CalledProcessError as e:
            print(e.output)
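
The temporary marker field described in the note above is not shown in the snippet, so here is a rough sketch of the idea, assuming a hypothetical field name (_dump_marker), a company_id field on the documents, and a helper that ties it to get_dumps; the real names and flow may differ.

import json

def dump_collection_for_companies(mongo, collection_name, company_ids, output_dir):
    # Hypothetical helper: tag matching documents, dump only those, then clean up.
    collection = mongo.db[collection_name]
    try:
        # Temporarily mark every document belonging to one of the selected companies.
        collection.update_many(
            {'company_id': {'$in': company_ids}},
            {'$set': {'_dump_marker': True}}
        )
        # Dump only the marked documents; mongodump's -q flag takes a JSON query string.
        query = json.dumps({'_dump_marker': True})
        mongo.get_dumps(collection_name, output_dir, flag=query)
    finally:
        # Remove the marker whether the dump succeeded or an exception was raised.
        collection.update_many(
            {'_dump_marker': True},
            {'$unset': {'_dump_marker': ''}}
        )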

AWS S3

  • Compresses the mongo dump into a tar archive and uploads it to an AWS S3 bucket (a sketch of this step follows).
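
A minimal sketch of this step, assuming boto3 for the upload; the bucket name, key, and paths are placeholders, and credential handling is omitted.

import tarfile
import boto3

def compress_and_upload(dump_dir, bucket='my-test-db-bucket', key='dumps/concise_dump.tar.gz'):
    # Compress the mongodump output directory into a tar.gz archive.
    archive_path = '/tmp/concise_dump.tar.gz'
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(dump_dir, arcname='dump')
    # Upload the archive to S3 so the next Jenkins job can pick it up.
    s3 = boto3.client('s3')
    s3.upload_file(archive_path, bucket, key)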

All of this runs as a Jenkins job. The job creates a Python virtual environment and takes the source environment as an argument, which gives the utility more flexibility: it is not limited to creating dumps from production for integration, it can create dumps from any source environment provided. The next job, which spins up the integration environment, picks up the tar file that the Python mongo utility job uploaded to S3. A simplified sketch of the entry point the job runs is shown below.
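
Putting the pieces together, the entry point looks roughly like this; the argument names, collection names, and the helper functions from the sketches above are illustrative, not the exact script.

import argparse

def main():
    parser = argparse.ArgumentParser(description='Create a concise mongo dump for integration environments')
    parser.add_argument('--source-env', required=True, help='environment to dump from, e.g. production')
    parser.add_argument('--mongo-user', required=True)
    parser.add_argument('--mongo-password', required=True)
    parser.add_argument('--mmx-token', required=True)
    parser.add_argument('--output-dir', default='/tmp/mongo_dump')
    args = parser.parse_args()

    # 1. Fetch the top company IDs from Metamarkets and merge in the whitelist from the config.
    mmx = MMXRequest(api_token=args.mmx_token, dimension='company_id', metrics=['spend'])
    company_ids = [row['company_id'] for row in mmx.make_request()]
    company_ids += WHITELISTED_COMPANY_IDS

    # 2. Dump the relevant collections from the source environment.
    mongo = MongoConnect(args.source_env, args.mongo_user, args.mongo_password)
    for collection_name in ['companies', 'campaigns']:  # illustrative collection names
        dump_collection_for_companies(mongo, collection_name, company_ids, args.output_dir)

    # 3. Compress the dump and upload it to S3 for the environment-creation job.
    compress_and_upload(args.output_dir)

if __name__ == '__main__':
    main()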

Benefits

  • The mongo dump produced by this Python utility is about a third the size of the full dump, saving around 65% of the server cost of maintaining integration environments
  • Cuts the spin-up time of a new integration environment by 50%
  • Service response times are noticeably lower than with the full mongo dump
  • Since we spin up multiple integration environments a day to test various features, this helps us run thousands of automated tests within a short period of time.
