Author: Hunter Blanks

The prior post announced some early spot instance support for boto’s EMR
library, and it’s my pleasure in the current one to announce what would
appear to be full support for spot instances in both boto.emr and in
Yelp’s mrjob. Code, examples, and a brief explanation follow.

So far as code goes, there are two repos to look at — and two corresponding
pull requests out to the maintainers of boto and mrjob:

https://github.com/hblanks/boto

and:

https://github.com/hblanks/mrjob/tree/development

Please note that the changes for mrjob are on its development branch. It
would have been great to also provide a change on top of the master branch;
however, the master branch version of mrjob has its own “botoemr” library —
so rather than patch boto.emr and a copy of botoemr in mrjob, I’ve just
gone with mrjob’s development branch, which now uses boto’s EMR library.

boto

Once you have the updated library, starting a jobflow with spot instances
from boto is pretty simple:

The first thing of note here is that we’ve replaced the usual
instance_count/master_instance_type/slave_instance_type options with
a list of instance groups. You can have as many groups as you want — but
obviously you need one group with a single instance and the MASTER role.
CORE nodes are used for both tasks and HDFS — and because of the latter,
their number can’t be changed after startup. TASK nodes are only used for
tasks; you can create them now or add them later, as noted in the previous
post.
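As a rough illustration of the rules above, here is a minimal sketch in
plain Python that checks a list of instance groups before you hand it to
anything. Note that InstanceGroupSpec and validate_groups are hypothetical
names made up for this example, not part of boto or mrjob, and the
instance types and bid price are placeholder values. It only encodes what
is described above (one MASTER group with a single instance) plus the
assumption that a SPOT group needs a bid price.

```python
from collections import namedtuple

# Hypothetical stand-in for an instance group description.
# This is NOT boto's class; it exists only for this sketch.
InstanceGroupSpec = namedtuple(
    "InstanceGroupSpec", "role instance_type count market bid_price")

VALID_ROLES = ("MASTER", "CORE", "TASK")


def validate_groups(groups):
    """Return a list of problems found in `groups` (empty if none)."""
    problems = []

    # There must be one MASTER group, and it must hold a single instance.
    masters = [g for g in groups if g.role == "MASTER"]
    if len(masters) != 1:
        problems.append(
            "expected exactly one MASTER group, found %d" % len(masters))
    elif masters[0].count != 1:
        problems.append("the MASTER group must contain a single instance")

    for g in groups:
        if g.role not in VALID_ROLES:
            problems.append("unknown role: %r" % (g.role,))
        # Assumption: a spot group is meaningless without a bid price.
        if g.market == "SPOT" and not g.bid_price:
            problems.append("%s group is SPOT but has no bid price" % g.role)

    return problems


# Placeholder values: an on-demand master plus two spot CORE nodes.
groups = [
    InstanceGroupSpec("MASTER", "m1.small", 1, "ON_DEMAND", None),
    InstanceGroupSpec("CORE", "m1.small", 2, "SPOT", "0.05"),
]
```

Calling validate_groups(groups) on the list above returns an empty list;
dropping the MASTER group or the bid price makes it return one message per
problem.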

The second thing of note here is that, while
instance_count/master_instance_type/slave_instance_type is naturally
still supported in boto, the instance_groups option has no choice but to
supersede it. Specifically, AWS’ API rejects a request with both
instance_count/master_instance_type/slave_instance_type and an
InstanceGroupConfig list. My current take has been to simply let
instance_groups supersede the older options if you call run_jobflow()
with both kinds. The other take
would be to raise an Exception of some sort — if folks feel that’s a
better way of doing it, please don’t hesitate to let me know.
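To make the two behaviours concrete, here is a small sketch of the
decision in isolation. It is not the code in the boto patch:
resolve_instance_options and its strict flag are hypothetical, and the
function only mimics the choice described above, using the option names
from this post.

```python
def resolve_instance_options(instance_count=None, master_instance_type=None,
                             slave_instance_type=None, instance_groups=None,
                             strict=False):
    """Decide which kind of instance options would go into the request.

    Hypothetical helper for illustration only. With strict=False the
    instance groups silently supersede the older options; with
    strict=True, supplying both kinds raises ValueError instead.
    """
    legacy = {
        "instance_count": instance_count,
        "master_instance_type": master_instance_type,
        "slave_instance_type": slave_instance_type,
    }
    # Keep only the older-style options the caller actually supplied.
    legacy = {k: v for k, v in legacy.items() if v is not None}

    if instance_groups:
        if legacy and strict:
            raise ValueError(
                "pass either instance_groups or %s, not both"
                % "/".join(sorted(legacy)))
        # Never send both kinds, since the API rejects that combination.
        return {"instance_groups": list(instance_groups)}

    return legacy
```

With the default strict=False, passing both instance_count=3 and a list of
groups returns only the instance_groups entry. With strict=True the same
call raises ValueError, which is the alternative behaviour mentioned above.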

mrjob

Although we start some jobflows directly from boto, one of our common
use cases is to start Python-bootstrapped streaming jobflows using mrjob.
Getting spot instances working with it requires the updated boto library,
the updated mrjob library, and a few extra lines in your mrjob.conf YAML
file:

Here in the emr section of the file, we have added an “emr_instance_groups”
stanza describing two instance groups analogous to the ones in the
boto example above.
After you’ve added this section, you can either create a jobflow just by
running a mrjob script with “-r emr”, or you can create a persistent one
by doing:

(after sourcing your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY into
your environment, of course).

Last remarks

By far the easiest way to verify that your spot instances are working is
using the AWS console — you’ll see spot requests under the EC2 tab,
initially in an “open” state, and you’ll also see a new jobflow in
the Elastic MapReduce tab, along with a description of your two
instance groups in the instance group tab. (Sorry, I’d put a screenshot
here, but posterous doesn’t let you do inline images in Markdown. Hmmf).

One more note: in my limited experience, spot instance jobflows do seem
to take a little bit longer to come up than on-demand jobflows.

Appendix

If you’re looking to test these libraries without installing them
everywhere, Ian Bicking’s virtualenv offers an easy way: