Development vs Analysis Environments
Docker, Vagrant, Chef, Puppet, Kubernetes… etc
When you're writing code that doesn't have to act on production data, you
want an environment that lets you break things continuously until you find
something that works. This is the ideal development cycle: one where you can
break the state of your program and revert quickly and easily. All the tools
listed above help with that in some form. When you're developing something
that will go into production on a server somewhere, the ideal development
environment is one that mirrors the production one, so that when you deploy,
the process is seamless.
I currently use mostly Vagrant and Ansible for this, with the recent
addition of Packer to build VirtualBox images. I have used Docker in the
past, but back then the toolchain was still not very refined and the
learning curve was steep. These days it's gotten much better, but I haven't
had time to test it out.
I mostly use Amazon EC2 cloud servers, but that can be a whole different
discussion.
But what if you're doing analysis?
For those who deal not just with web-based services but also have to do
analysis, how does this fit into the picture? Often I download datasets and
explore them a bit with R or Python before deciding whether the dataset is
useful at all. Up until now, I've been doing this all locally, since my
thinking never defaults to cloud-based technology first. But recently I've had
to deal with datasets that are larger than usual, 20-40 GB of data. I know
this isn't really that much data, but my laptop gets a bit stuffed when trying
to deal with all of it locally.
So I've been thinking that I should be doing analysis like this on the cloud,
since the analysis might lead to eventually having the data in production. But
then there is an intermediate task of being able to share the results of
analysis easily.
I've played around with setting up an EC2 instance with RStudio on it, using
Louis Aslett's RStudio AMI. This makes it incredibly easy to set up an
RStudio instance that you can access via the web, and it also lets you run
potentially long scripts in the cloud while you do other work. (I say this as
I'm running an image processing script locally that is taking hours…)
In the near future I plan on exploring this further and figuring out the ideal
workflow for doing analysis as well as displaying results on the web.
Something I've also been thinking about is turning processing scripts into AWS
Lambda functions that will output into an S3 bucket for display.
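To make the Lambda idea a bit more concrete, here is a rough sketch in Python of what such a function might look like. Everything specific in it is an assumption for illustration: the `summarize` step stands in for a real processing script, and the event fields `values`, `bucket`, and `key` are made-up names, not an AWS-defined schema. The only AWS call used is boto3's `put_object`.

```python
import json


def summarize(values):
    """Hypothetical processing step: basic summary statistics."""
    values = list(values)
    if not values:
        return {"count": 0, "mean": None}
    return {"count": len(values), "mean": sum(values) / len(values)}


def handler(event, context, s3_client=None):
    """Lambda-style entry point: process the payload and write JSON to S3.

    event["values"], event["bucket"], and event["key"] are assumed field
    names chosen for this sketch.
    """
    if s3_client is None:
        # Imported lazily so the function can be exercised locally
        # with a stand-in client and without boto3 installed.
        import boto3
        s3_client = boto3.client("s3")

    result = summarize(event["values"])
    body = json.dumps(result).encode("utf-8")

    # Write the result where a static page or dashboard could pick it up.
    s3_client.put_object(
        Bucket=event["bucket"],
        Key=event["key"],
        Body=body,
        ContentType="application/json",
    )
    return {"statusCode": 200, "body": json.dumps(result)}
```

Passing the S3 client in as an optional argument is just a convenience so the handler can be tried out locally with a fake client before anything is deployed; Lambda itself only supplies `event` and `context`.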
Stay tuned!