Thursday, October 27, 2011

Importing a custom Python package in Amazon MapReduce


Amazon's Elastic MapReduce makes it difficult to use custom, application-specific modules in Python applications. Without special configuration, MapReduce only loads the mapper and reducer files into its streaming jobs. I'm a heavy user of custom modules, so my scripts were failing with import errors and no clear solution.

The solution is relatively simple (and automatable)! To import custom modules, we need to get them into the working directory of the streaming job, then add that directory to the Python path (sys.path).

I'm using Amazon's elastic-mapreduce command line tool to start my job flows. elastic-mapreduce gives you the option to easily push a single file to the working directory using the --cache parameter. However, it only allows you to push one file (and you can't add more than one --cache parameter. I tried). We'll have to use its friend, --cache-archive.

The Plan - quick summary:
  1. Create a package out of the application modules you want to import.
  2. Create a tar archive from the package.
  3. Push the archive to S3.
  4. Use the sys module to add the library directory to your job's Python path.
  5. Add the tar to your streaming job with --cache-archive.
Broken down:
1. Create a package
Packages in Python are created by putting the modules together in one folder with a file named __init__.py. See the Python docs for more information. Here's my directory structure:
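Something along these lines (mapper.py, reducer.py, and word_utils.py are placeholder names for this sketch - only the helper_classes folder and __init__.py are fixed):

    scrabble-bot/
        mapper.py           (streaming mapper)
        reducer.py          (streaming reducer)
        helper_classes/     (the package to import)
            __init__.py
            word_utils.py   (hypothetical helper module)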

2. Tar the package
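A command along these lines does it (the archive name helper_classes.tar.gz is a placeholder - just keep it consistent in steps 3 and 5):

    # Archive the contents of helper_classes/ so they sit at the top level of the tar
    tar -czf helper_classes.tar.gz -C helper_classes/ .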
Note that I'm using -C to change into my helper_classes directory before creating the archive. This ensures that the files aren't put into a folder inside the archive, but live at the top level instead. Once I've created the tar, my directory structure looks like this:
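(Same placeholder file names as in step 1.)

    scrabble-bot/
        mapper.py
        reducer.py
        helper_classes/
            __init__.py
            word_utils.py
        helper_classes.tar.gz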

3. Push the archived package to S3
I'm using s3cmd. How you push the archive to S3 doesn't matter, as long as you get it up there.
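With s3cmd it's a one-liner (the bucket name my-bucket is a placeholder):

    # Upload the archive; any S3 path works as long as step 5 points at it
    s3cmd put helper_classes.tar.gz s3://my-bucket/helper_classes.tar.gz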

4. Add the package to your job's Python path
To temporarily add the folder containing the application modules to the Python path, use sys.path.append(). This must be done before we attempt to import any of our custom modules. Here's an example:
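This is a minimal sketch - word_utils is the same placeholder module name as in step 1, and helper_classes is whatever directory name you pick after the "#" in step 5:

    # Top of mapper.py / reducer.py, before any custom imports.
    import sys

    # The cache archive gets unpacked into ./helper_classes in the task's
    # working directory, so add that folder to the module search path.
    sys.path.append('./helper_classes')

    # Now the helper modules import as usual.
    import word_utils  # hypothetical module from the package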


5. Add the archive to your streaming job with --cache-archive
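Here's roughly what the full command looks like (bucket, input/output paths, and script names are placeholders - the --cache-archive parameter at the end is the part this post is about):

    elastic-mapreduce --create --stream \
      --input   s3n://my-bucket/input/ \
      --output  s3n://my-bucket/output/ \
      --mapper  s3n://my-bucket/mapper.py \
      --reducer s3n://my-bucket/reducer.py \
      --cache-archive s3n://my-bucket/helper_classes.tar.gz#helper_classes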
NOTE: What you put after "#" will be the directory name! This is where your tar will be unpacked - it must match the directory name you added to the Python path in step 4.

That's it! You can see a fully working (and automated) example of this method in my scrabble-bot project. In particular, check out add_streaming.sh and its helper upload_to_s3.sh.

There may be better ways to get helper modules into MapReduce jobs (perhaps using bootstrap actions?), but this one has worked well for me.

1 comment:

  1. A year and a half later this post solves a problem I've been stuck on for days. I don't know if you'll ever see this, but thanks. :)
