Author: Hunter Blanks
The prior post announced some early spot instance support for boto’s EMR
library, and it’s my pleasure in the current one to announce what would
appear to be full support for spot instances in both boto.emr and in
Yelp’s mrjob. Code, examples, and a brief explanation follow.
So far as code goes, there are two repos to look at — and two corresponding
pull requests out to the maintainers of boto and mrjob:
Please note that the changes for mrjob are on its development branch. It
would have been great to also provide a change on top of the master branch;
however, the master branch version of mrjob has its own “botoemr” library —
so rather than patch boto.emr and a copy of botoemr in mrjob, I’ve just
gone with mrjob’s development branch, which now uses boto’s EMR library.
Once you have the updated library, starting a jobflow with spot instances
from boto is pretty simple:
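
    import boto
    from boto.emr.instance_group import InstanceGroup

    KEYNAME = 'XXX'  # name of your EC2 key pair
    c = boto.connect_emr()
    instance_groups = [
        # InstanceGroup(num_instances, role, type, market, name, bidprice)
        InstanceGroup(1, 'MASTER', 'm1.small', 'SPOT', 'email@example.com', '0.20'),
        InstanceGroup(4, 'CORE', 'c1.medium', 'SPOT', 'firstname.lastname@example.org', '0.20'),
    ]
    # Assumed keyword arguments: the key pair and the group list defined above.
    print(c.run_jobflow('spot-jobflow', 's3://emr-dev.monetate.net/hjb/log/',
                        ec2_keyname=KEYNAME, instance_groups=instance_groups))
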
The first thing of note here is that we’ve replaced the usual
instance_count/master_instance_type/slave_instance_type options with
a list of instance groups. You can have as many groups as you want — but
obviously you need one group with a single instance and the MASTER role.
CORE nodes are used for both tasks and HDFS — and because of the latter,
their number can’t be changed after startup. TASK nodes are only used for
tasks; you can create them now or add them later, as noted in the previous post.
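The role rules above are easy to trip over when a group list is built programmatically. Here is a minimal sketch of a pre-flight check, assuming the groups are described as plain (count, role) tuples rather than boto InstanceGroup objects. The helper name check_roles is hypothetical and not part of boto or mrjob.

```python
VALID_ROLES = ('MASTER', 'CORE', 'TASK')


def check_roles(groups):
    """Return a list of problems found in a list of (count, role) tuples.

    Hypothetical helper, not part of boto. It only encodes the rule
    described in the post: one MASTER group holding a single instance.
    """
    problems = []
    masters = sum(1 for _, role in groups if role == 'MASTER')
    if masters != 1:
        problems.append('need exactly one MASTER group, found %d' % masters)
    for count, role in groups:
        if role not in VALID_ROLES:
            problems.append('unknown role: %s' % role)
        elif role == 'MASTER' and count != 1:
            problems.append('MASTER group must have 1 instance, found %d' % count)
        elif count < 1:
            problems.append('%s group has no instances' % role)
    return problems
```

An empty list means the layout passes these checks; it says nothing about whether AWS will accept the bid prices or instance types.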
The second thing of note here is that, while
instance_count/master_instance_type/slave_instance_type is naturally
still supported in boto, the instance_groups option has no choice but to
supersede it. Specifically, AWS’ API rejects a request with both
instance_count/master_instance_type/slave_instance_type and an
InstanceGroupConfig list. My current take is that instance_groups simply
wins if you call run_jobflow() with both kinds of options. The other take
would be to raise an Exception of some sort. If folks feel that’s a
better way of doing it, please don’t hesitate to let me know.
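To make the two options concrete, here is a rough sketch of both behaviours side by side. It is an illustration only, not the code in the pull request: resolve_instance_options and its strict flag are hypothetical names, and the keyword names simply mirror the ones used in this post.

```python
def resolve_instance_options(instance_groups=None, instance_count=None,
                             master_instance_type=None,
                             slave_instance_type=None, strict=False):
    """Pick one style of instance options, since AWS rejects a mix of both.

    Hypothetical helper. With strict=False the group list quietly wins;
    with strict=True a mixed call raises instead.
    """
    legacy = {
        'instance_count': instance_count,
        'master_instance_type': master_instance_type,
        'slave_instance_type': slave_instance_type,
    }
    legacy_given = dict((k, v) for k, v in legacy.items() if v is not None)
    if instance_groups:
        if legacy_given and strict:
            raise ValueError('pass instance_groups or %s, not both'
                             % '/'.join(sorted(legacy_given)))
        return {'instance_groups': instance_groups}
    return legacy_given
```

The quiet version is friendlier to existing callers; the strict version surfaces a likely mistake before any request reaches AWS.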
Although we start some jobflows directly from boto, one of our common
use cases is to start Python bootstrapped, streaming jobflows using mrjob.
Getting spot instances working with it requires the updated boto library,
the updated mrjob library, and a few extra lines in your mrjob.conf YAML file.
Here in the emr section of the file, we have added an “emr_instance_groups”
stanza describing two instance groups analogous to the ones in the
boto example above.
After you’ve added this section, you can either create a jobflow just by
running an mrjob script with “-r emr”, or you can create a persistent one with:

    python2.5 -m mrjob.tools.emr.create_job_flow -c /path/to/mrjob.conf

(after sourcing your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY into
your environment, of course).
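If you launch jobflows from a wrapper script, it can help to fail early when those two variables are missing. A small sketch follows, assuming nothing beyond the standard library; missing_credentials is a hypothetical helper name.

```python
import os

REQUIRED_VARS = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')


def missing_credentials(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling missing_credentials() with no arguments checks the real environment; a non-empty result is a good reason to stop before starting anything.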
By far the easiest way to verify that your spot instances are working is
using the AWS console — you’ll see spot requests under the EC2 tab,
initially in an “open” state, and you’ll also see a new jobflow in
the Elastic MapReduce tab, along with a description of your two
instance groups in the instance group tab. (Sorry, I’d put a screenshot
here, but posterous doesn’t let you do inline images in Markdown. Hmmf).
One more note: in my limited experience, spot instance jobflows do seem
to take a little bit longer to come up than on-demand jobflows.
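If you would rather check from a script than from the console, the same information can be tallied from boto's spot request objects. This is a sketch under assumptions: count_by_state is a hypothetical helper that works on any objects carrying a state attribute, and the commented usage relies on boto's EC2 connection rather than anything added in the pull requests.

```python
def count_by_state(requests):
    """Tally objects by their .state attribute (for example 'open')."""
    counts = {}
    for request in requests:
        counts[request.state] = counts.get(request.state, 0) + 1
    return counts

# Against a live account, this could be fed from boto, for example:
#   import boto
#   requests = boto.connect_ec2().get_all_spot_instance_requests()
#   print(count_by_state(requests))
```

Requests that stay in the “open” state have not been fulfilled yet, which fits the slower startup noted above.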
If you’re looking to test these libraries without installing them
everywhere, Ian Bicking’s virtualenv offers an easy way:

    virtualenv boto+mrjob --no-site-packages
    easy_install simplejson pyyaml
    git clone git://github.com/hblanks/boto.git
    git clone git://github.com/hblanks/mrjob.git
    (cd mrjob; git fetch origin development:development; git checkout development)