Here at Bonusly we have grown quickly as a company and as an engineering team over the past 18 months or so. With that growth comes some growing up, which means that practices that worked for a team of 5-6 are no longer suitable for a team of 15-20. Something that recently fell into that category is how we provide application data for local development work.

This can be a tricky problem to solve for an application with a large and complex data model. I’m not sure there is a perfect solution; at least, I haven’t run across one. Instead, you have to choose the option that is the best fit for your application and team, while understanding and acknowledging the trade-offs.

We identified the major pain-points of our existing solution and settled on three basic requirements for our new solution:

  1. It has to provide a reasonable set of data that allows the application to be functional, and ideally provide a decent facsimile of real-world use.
  2. It needs to be generated via automation. We don’t want a process that requires manual intervention by anyone, and we don’t want to require developers to engage in some complex workflow to generate the data set.
  3. It needs to have a simple delivery method. We also don’t want some time-consuming or complex workflow for getting the data into the development environment.

After considering a wide variety of options, we decided the best path for us is to use our own company’s real-world data. Since we actively and enthusiastically use our platform internally, we have a reasonably sized, reasonably complete set of data that is our own. We can use a set of queries to extract our data into a deliverable and make that available to our developers. This solution is not perfect. There are many ways to use and configure our application, and no single customer could implement all of them simultaneously. That means our local data will be specific to our own configuration and usage, and there will be many cases where developers need to manufacture local data to simulate a use case that our company data doesn’t provide. But by and large, we can count on our own data to cover a pretty wide cross-section of uses and give us a viable, useful local development environment.

But how do we automate the process of gathering the data and making it easily available to developers? That was the problem that I set out to solve.

Our application uses MongoDB, and last summer we moved to a development configuration which runs the MongoDB service in a Docker container. This has made it easier for devs to get up and running with the correct version without needing to rely on a native installation. It’s kind of like having a version manager for your services. This got me thinking about whether I could use a code build tool to query for our company data from our production database and wrap it up into a custom docker image that could be easily distributed to our developers.
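As an aside, that dev setup really is about as simple as it sounds: something along these lines, with an illustrative image tag and container name rather than our exact configuration:

```sh
# Run the MongoDB version the app expects, no native install required
docker run -d --name dev-mongo -p 27017:27017 mongo:4.2
```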

I did a little digging around and settled on the idea that I could probably make this all work using AWS CodeBuild and Elastic Container Registry (ECR). CodeBuild would handle querying our database and building the Docker image, which could then be pushed to ECR to distribute to developers, who would only have to run a simple command whenever they wanted to pull down a new image with up-to-date data. This also has the advantage that we can launch the CodeBuild project into a VPC and peer that with our hosted MongoDB so everything stays nice and secure.

So, how do I put this all together? To the Google bot! But no. I looked far and wide and could not find anyone else who had already had this mad scientist idea and actually implemented it. At least no one who had written about it after the fact. Well, that’s never happened before. Usually someone has written about doing something at least close enough to riff off of.

Alright, we’re going in blind. I have no idea whether this will even work in the end, but it seems like it should be possible. I decided to start by figuring out how to extract the data from our database. I instinctively felt it would be beneficial to work with a language that compiles to a static, portable binary. My reasoning was that this should be a pretty small application, and a portable binary would eliminate any need to set up a VM or runtime on the build platform. I’d also need a language with a full-featured, supported MongoDB driver. That narrowed it down to Rust and Go. Both fit the bill, and both were languages I had an interest in trying out. I settled on Go because it felt like I’d be able to get up to speed and build something with it more quickly.

I really enjoyed working with Go. After a couple of days I felt pretty comfortable and reasonably productive with it. So I set up a Go project with the MongoDB driver and got to work on an application that could query our database and wrap up the data. For the latter part I settled on writing to files in MongoDB’s Extended JSON (ExtJSON) format. ExtJSON is unique to MongoDB and lets you write your data to JSON-like files while preserving MongoDB types and other nice things. It’s not recommended for something like database backups, but for our purposes it was perfect: while mongodump supports some simple queries, it doesn’t support the kinds of complicated queries I needed to surgically extract all of our company data. The mongoimport tool can import ExtJSON files, and if you create a separate file for each collection, with the filename matching the collection name, it will automatically create and name the collections for you as it imports. Dope.
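Here’s a rough sketch of that extraction step using the official Go driver. The database name, collection, and filter below are placeholders for illustration, not our real code:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// dumpCollection writes one ExtJSON document per line to a file named after
// the collection, so mongoimport can recreate the collection on import.
func dumpCollection(ctx context.Context, db *mongo.Database, name string, filter bson.M) error {
	f, err := os.Create(fmt.Sprintf("/tmp/collections/%s.json", name))
	if err != nil {
		return err
	}
	defer f.Close()

	cur, err := db.Collection(name).Find(ctx, filter)
	if err != nil {
		return err
	}
	defer cur.Close(ctx)

	for cur.Next(ctx) {
		var doc bson.D
		if err := cur.Decode(&doc); err != nil {
			return err
		}
		// Canonical Extended JSON preserves MongoDB types (ObjectIds, dates, etc.).
		line, err := bson.MarshalExtJSON(doc, true, false)
		if err != nil {
			return err
		}
		if _, err := f.Write(append(line, '\n')); err != nil {
			return err
		}
	}
	return cur.Err()
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(os.Getenv("MONGO_URI")))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	if err := os.MkdirAll("/tmp/collections", 0o755); err != nil {
		panic(err)
	}

	db := client.Database("placeholder_db")
	if err := dumpCollection(ctx, db, "users", bson.M{"company_id": "placeholder"}); err != nil {
		panic(err)
	}
}
```

The real thing does this for many collections with much more targeted queries, which is where the query framework below comes in.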

Once I had something hacked together that was able to pull out some data and write it to ExtJSON files, I paused to think about how to implement the rest. I knew by this point that I would need several types of queries for different collections, but there were a finite number of types, and for any given collection only a handful of parameters would need to change. This made me think I could create a small framework for handling each query type and store the parameters in a config file of some sort, which I could then read and enumerate to generate and execute the full set of queries I’d require. This idea was also appealing because it meant I could document the query types, along with when and why to use each one and how to add new parameters to the config file. That would empower other developers to update the query engine without actually having to write Go, and I wouldn’t have to be the single gatekeeper for anyone who wants to make a change to this thing. Double Dope.

This worked out quite beautifully. Here’s a rough picture of the query types and how they look in the JSON config file.
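The real file is specific to our data model, so the collection names, filter types, and parameters below are invented for illustration. Each entry names a collection, the query type to apply (the "filter" field), and whatever parameters that type needs:

```json
[
  {
    "collection": "users",
    "filter": "by_company",
    "params": { "company_id": "COMPANY_OBJECT_ID" }
  },
  {
    "collection": "bonuses",
    "filter": "by_date_range",
    "params": { "field": "created_at", "months_back": "12" }
  },
  {
    "collection": "rewards",
    "filter": "all",
    "params": {}
  }
]
```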

Under the hood, the Go app reads this file and iterates over the objects. A switch statement on the "filter" field determines which function in the query module to call, and each of those functions knows how to take the given parameters and build and execute the query. That’s really all there is to it. To make a change, all someone has to do is edit the config file, commit, and push to GitHub, where it will get picked up the next time the build process runs.
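To be concrete, the dispatch looks roughly like this. The filter names are the made-up ones from the example config above, and it reuses the dumpCollection helper from the earlier sketch (plus an encoding/json import):

```go
// QuerySpec mirrors one entry in the JSON config file.
type QuerySpec struct {
	Collection string            `json:"collection"`
	Filter     string            `json:"filter"`
	Params     map[string]string `json:"params"`
}

// runQueries reads the config, builds a filter per entry, and dumps each
// collection to its ExtJSON file via dumpCollection (sketched earlier).
func runQueries(ctx context.Context, db *mongo.Database, configPath string) error {
	raw, err := os.ReadFile(configPath)
	if err != nil {
		return err
	}
	var specs []QuerySpec
	if err := json.Unmarshal(raw, &specs); err != nil {
		return err
	}

	for _, spec := range specs {
		var filter bson.M
		switch spec.Filter {
		case "by_company":
			filter = bson.M{"company_id": spec.Params["company_id"]}
		case "all":
			filter = bson.M{}
		// ...one case per query type we support...
		default:
			return fmt.Errorf("unknown filter type %q", spec.Filter)
		}
		if err := dumpCollection(ctx, db, spec.Collection, filter); err != nil {
			return err
		}
	}
	return nil
}
```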

Which brings me to the automated build process. I set up an AWS CodeBuild project that pulls down the repo with the Go app. Also in the repo are a .dockerignore file, so we can exclude everything we don’t want to end up in the Docker image (like all the Go code itself); a Dockerfile for building the image; a buildspec telling CodeBuild what to do; and a shell script for loading the data into the MongoDB instance.

The build process is fairly simple. Here is an example of what the buildspec file looks like.
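A trimmed-down version looks something like this; the binary path, account ID, region, and repository name are placeholders, and the step that stages the exported files into the Docker build context is my shorthand for however you choose to get them there:

```yaml
version: 0.2

phases:
  install:
    runtime-versions:
      docker: 18
  pre_build:
    commands:
      # Run the compiled Go binary to export our data into /tmp/collections
      - ./bin/export-data
      # Stage the exported files inside the Docker build context
      - cp -R /tmp/collections ./collections
      # Log in to ECR so we can push the image later
      - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
  build:
    commands:
      - docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/dev-mongo:latest .
  post_build:
    commands:
      - docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/dev-mongo:latest
```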

The only dependency we need to add to the build container is for Docker itself.

The pre-build stage just runs the compiled Go binary to fetch all the data we need and store the files in a /tmp/collections folder. Next, it logs in to ECR, which is pretty trivial since everything is in the land of AWS.

The build stage runs docker build and tags the image as latest.

Finally, the post-build stage pushes the newly created image up to ECR so it’s available to developers.

The image build is also pretty simple:
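It boils down to a few lines like these; the MongoDB version tag is a placeholder, and the collections/ folder is the staging directory from the buildspec above:

```dockerfile
FROM mongo:4.2

# The ExtJSON files exported during the pre-build stage
RUN mkdir -p /tmp/collections
COPY collections/ /tmp/collections/

# Runs automatically the first time a container is started from the image
COPY data-import.sh /docker-entrypoint-initdb.d/
```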

We start from the standard mongo image, create a /tmp/collections folder and copy the contents of the build container’s /tmp/collections into it (the data we downloaded in the pre-build stage). Then we copy over the shell script I mentioned earlier.

I originally thought I’d be able to import the data into the MongoDB instance during the image build and ship it totally ready to go. Unfortunately, I couldn’t get that to work, and since I found an equally acceptable solution, I didn’t spend a lot of time fighting with it.

The solution that worked was to add that data-import.sh file to the /docker-entrypoint-initdb.d folder within the mongo image. Any .sh or .js files in that folder are executed the first time a container is started from the image, while the database is being initialized.
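The script itself is very simple; something along these lines, with the database name as a placeholder:

```sh
#!/bin/bash
# Import each exported ExtJSON file. mongoimport names the collection after
# the file when --collection isn't given (users.json -> the users collection).
for file in /tmp/collections/*.json; do
  mongoimport --db app_development --file "$file"
done
```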

We loop over the ExtJSON files we created and run mongoimport on them. Now we can spin up a mongo container with the data we need right inside of it!

CodeBuild allows you to create a schedule for running a build project, and since the build only takes a couple of minutes to complete, we run it every night. I also configured Slack notifications to alert us if the build fails for any reason.

The last thing we needed was to make it seamless and easy for everyone to pull down an image from ECR without having to deal with credentials each time. To accomplish this I used the Amazon ECR Docker Credential Helper, an official tool from AWS for just this purpose.
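Once the docker-credential-ecr-login binary is installed, it just takes a small addition to each developer’s ~/.docker/config.json (the registry URI below is a placeholder):

```json
{
  "credHelpers": {
    "123456789012.dkr.ecr.us-east-1.amazonaws.com": "ecr-login"
  }
}
```

With that in place, a docker pull against our ECR registry picks up AWS credentials automatically, with no separate docker login step.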

By having everyone set that up (and adding it to our developer bootstrap script), we can just create an IAM user with read-only access to ECR and provide its credentials to the environment within our custom rake task, which handles all of the Docker interactions whenever a developer wants a fresh database.

This was a fun challenge and definitely allowed me to get a little creative. The end result is a solution that fulfills all of our requirements and has so far been rock solid.