Thursday, October 27, 2011

Importing a custom Python package in Amazon MapReduce


Amazon's MapReduce makes it difficult to use custom, application specific modules in Python applications. Without special configuration, MapReduce only loads the two map/reduce files into its streaming jobs. I'm a heavy of custom modules - my scripts were failing with import errors without a clear solution.

The solution is relatively simple (and automatable)! To import custom modules, we will need to get them into the working directory of the streaming job. Then, we need to add that directory to our PATH.

I'm using Amazon's elastic-mapreduce command line tool to start my job flows. elastic-mapreduce gives you the option to easily push a single file to the working directory using the --cache parameter. However, it only allows you to push one file (and you can't add more than one --cache parameter. I tried). We'll have to use its friend, --cache-archive.

The Plan - quick summary:
  1. Create a package out of the application modules you want to import.
  2. Create an archive from the package
  3. Push the archive to S3
  4. Use the sys module to add the library directory your job's PATH
  5. Add the tar to your streaming job with --cache-archive
Broken down:
1. Create a package
Packages in Python are created by putting the modules together in one folder with a file named __init__.py. See the Python docs for more information. Here's my directory structure:

2. Tar the package
Note that I'm using -C to change into my helper_classes directory before creating the archive. This ensures that the files aren't put into a folder inside the archive, but live in the top level instead. Once I've created the tar, my directory structure looks like this:

3. Push the archived package to S3
I'm using s3cmd. How you push the archive to S3 doesn't matter, as long as you get it up there.

4. Add the package to your job's PATH
To temporarily add the folder with the application modules to our PATH, use sys.path.append(). This must be done before we attempt to import any of our custom modules. Here's an example:


5. Add the archive to your streaming job with --cache-archive
NOTE: What you put after "#" will be the directory name! This is where your tar will be unpacked to - it must match the directory name you used added to your PATH.

That's it! You can see a fully working (and automated) example of this method in my scrabble-bot project. In particular, check out add_streaming.sh and its helper upload_to_s3.sh.

There may be better ways to get helper modules into MapReduce jobs (perhaps using bootstrap actions?), but this one has worked well for me.

Saturday, August 6, 2011

Zeromq benchmarking with large objects

At work, we're in the process of adding zeromq to our architecture. We plan to use it to make our C++ application multi-process (and eventually multi-computer).

We currently move the data between modules using pointers to Protocol Buffer messages. We know that a change to copying this data around using zeromq will take longer; we want to know how much of a slowdown we will see.

Our data starts out as protobuf messages, so it must be serialized before sending. The protos serialize into std::strings. From there, they are copied into zeromq messages and sent on a zeromq socket. The process is reversed on the receive side. 

Here is an example of our sending/receiving benchmarks on a large piece of data we use. Serialized, it is around 45MB in size.

It takes us 168.92ms to pass this one proto message. For comparison, the non-zeromq method (where a DataManager passes pointers around) takes 2.76ms.

Veteran zeromq users may notice that we're copying the serialized strings into the zeromq messages. These steps are the ones in red (and you thought the chart was colorful just because I like colors). This copying can be avoided in some use cases with zeromq's zero-copy functionality. If we were to take these actions out of our chain, the minimum total time would be 111.85ms

Using these benchmarks, we can decide which data passing paths can be replaced with zeromq, and which are time critical enough to require staying pointers. A ~0.1 second slowdown is a significant amount of time for our processing, but the benefits zeromq provides (multi-process communication!) outweigh the drawbacks in many cases.

    Thursday, July 28, 2011

    Python scripting to access Android GPS

    One of my long-term ideas has been to record myself playing Ultimate frisbee. I think visualizing a player's movement around the field (as well as the analytics you could get from the movement) would be awesome.

    I thought my old G1 would be the perfect device for the job. It is a small, self contained unit with a GPS. Using SL4A it even runs Python- what more could I want? I could poll the GPS for the player position and record their lat/lon at every tick. From that, it would be easy enough to make a graph of movement.

    Android is very straightforward with getting Location data from the GPS, and the Python scripting layer is easy to use.


    This code sets up our GPS listener and records the first event it spits back at us. From there, it's easy to record a bunch of events, parse out the lat/lon, and plot the points with matplotlib.


    That's a lat/lon graph of me running around a block of houses- ouch! I made some zig-zags (the waves on the top) and even backtracked (the big bump on the bottom) for a half block, but the frequency is nowhere near being good enough to be useful. Even though it is easy to access, the G1's GPS just doesn't update fast enough (with enough precision) to record a player's movement.

    I had a small amount of hope that the accelerometer would be viable option, but as this stackoverflow answer (and the accompanying video) shows, this isn't possible. The error introduced by going from acceleration to position is too great for navigation the navigation I want to do.

    Even though I failed in my planned goal, the code to record the GPS is very nice. It's available on github.

    As for next steps, I asked around and got some good ideas. First (and easiest), I plan to try to use Network location data along with the GPS data on the phone to try get a more accurate location. I don't think it will improve the update frequency, but it may. Long term, the best way to get my desired data (track someone running around a field) probably rests with short range triangulation. Something like putting radio transmitters on the sides of the field and determine the position of a device in the middle, or perhaps using cameras (Kinect style) to triangulate a person's position visually and with high frequency.

    Wednesday, June 15, 2011

    New toy! Rooting a Nook Touch

    So last week I made my first impulse buy in a while; I bought a brand new Nook Simple Touch. Why? I spend way too much time looking at LCD screens, I would love to change some of that to e-Ink (going outside just isn't part of the equation). Also, it's running Android - what's not to love?

    The main goals I have are as follows:
    • Read books (easy!)
    • Read Instapaper articles (harder)
    • Get on Google Reader (even harder)
    One solution is to sync all of these things to Calibre and load them manually. But what's the point of having a wireless e-reader if you're plugging it in all the time? And manual operation? Screw that. The only solution was to root it and get some side-loaded apps working to solve my problems.



    The root process is well spelled out here (thanks JesusFreke!). I've already added my clarifying contributions to the wiki; where I can help is what you do with your Nook after rooting. Here's what I did...



    Yes, that's right. Angry Birds, first priority. It's obligatory.

    Now that that's out of the way, some real stuff. All apps are loaded by making a adb connection (this should have been the last step of rooting). So, you run your adb connect, and can immediately load apps with adb install {app}. It's spelled out for the Nook Color here; same process.

    Installing is easy- getting access to the apps once they're on the device is slightly more complicated. For this, you need two apps. One is ADWLauncher, an Android homescreen replacement. The B&N one the Nook ships with just won't cut it. Unfortunately, the only time you can access your home screen is on Nook startup; if you leave it, you can't get back. For that, you need SoftKeys. This allows you to reach your desired home screen whenever you like. You should set it up as explained for the Nook Color here. I've also heard people having success with Button Savior, but I haven't tried it (yet!).

    Now you have a launcher/home screen, and a home key so you can always get back. Now you can load whichever apps you want. Here's what I've found that works and doesn't work so far...

    Working:
    • Nook Color Tools
      • very helpful- provides access to system settings. Necessary to uninstall apps.
    • Dolphin browser
      • good browser that seems to work with only minor hiccups
    • Dropbox
      • file sharing! yippee!
    Not working (yet):

    Here's someone else's list: much more comprehensive.

    Note that two of my original goals are on the not working list. Some people seem have had success with both; I'll provide an update when I figure out how to get them working.

    One of the biggest issues I've had is simply finding the apks to load. Has anyone had success with Android Market? Or found a reliable way to download apps? My next plan is to try downloading them to my Nexus S and grabbing them from there.

    By the way, there have been some great resources online recording the progress of rooting the Nook. This thread on xda-devs has been very prolific. Mike Cane has had nearly daily updates on his blog.

    Friday, May 6, 2011

    Dice games confuse me -- testing a strategy in Python

    My friend (let's call him Coleman) told me about a dice game today, and how using probability you could increase your win percentage. I didn't believe him, and I set out to prove him wrong.

    First, let me explain the game. Two people each roll a die. Each can look at their opponent's roll, but not their own. They can have no communication once the die have been rolled (but can collaborate beforehand). The object of the game is for each to guess what their own die is; they both win if they each guess correctly, otherwise they lose.

    I assumed that without any knowledge of what their die is, each person has a 1/6 chance of guessing correctly. This means that the pair have a 1/36 chance of both guessing correctly and winning the game.

    Coleman tried to tell me he could do better. If the two people agree to each guess what their opponent's die is showing, they have a 1/6 chance of winning. This didn't make sense - what your opponent rolled has zero to do with what you rolled, and there is no way that choosing their roll would give you a better outcome.

    Time to test it - Monte Carlo style! I wrote a script that plays the game; it first plays guessing randomly, then guesses "intelligently" by following the strategy above. My gut told me that the two would have the same outcome, but my head was telling my gut to shutup and go have a sandwich.

    Here's the result of playing randomly:
    We get 2.8% (around 1/36); just as expected!

    Now let's try playing by the strategy. My gut says it will also be 1/36...
    16.7% -- around 1/6! What... how... ???

    So Coleman was right. Here is how I convinced myself that the math was actually true - let's see if it makes sense outside of my head. By agreeing to play by the strategy, you're changing the game. Before, the game was two people each guessing a die roll correctly; this has a 1/36 chance of working. With the strategy, the game becomes two people roll the same number. This happens every 1/6 games. So you change the game with your strategy, and get a much better win percentage.

    You can check out the Monte Carlo script I used here. I'm gonna go find some dice, and hit Coleman with a sandwich.

    Thursday, May 5, 2011

    Relearning Data Structures with Python

    I realized the other day that I had learned about tons of data structures in class, but I've never used some of them in production. So, I set out to write simple implementations of a hash table and binary tree. After all, after I've made one in practice I'm more likely to make one when it's actually useful.

    The focus in this exercise was clean implementation. Keep the code as short as possible, not sacrificing for readability. I think I've struck a good balance.

    I've found that since I've learned Clojure, I'm more likely to use functional style coding in other languages. For example, here's my hash_function in Python:
    Note the use of map. I was even using reduce for a while, but Python's sum led to a cleaner implementation. Also note that I'm using the decorator @staticmethod. Functional style calls for functions that have well defined inputs and outputs without side effects - declaring this method to be static explicitly shows that I am following that principle.

    Compare that to how I would have written this six months ago:
    The new, functional way is much cleaner. 

    There are definitely many things I could add for my next steps. Some structure changing methods (dynamically allocated hash table? self balancing tree?) could be pretty cool.

    You can check out the source on github. There is a tester file that shows how to use the class (and even provides some profiling code I used). To enable a tester, call it from the main function in tester.py

      Friday, April 15, 2011

      RSA-Cryptography in Clojure

      I've been playing around with some simple RSA cryptography in Clojure. It was inspired by this problem on Programming Praxis.

      The problem is to create a working RSA key generator. Then, functions need to be made that take these generated keys and use them to encrypt and decrypt messages.

      Clojure is a great tool for this project; Java's BigInteger library seems to be made specifically for cryptography. The function probablePrime() is especially awesome- I didn't have to do any work for prime generation or checking. Here is my find-prime function:

      Using Java libraries in Clojure looks slightly ugly, but it works great once you've figured it out. And since Clojure is a Lisp, I can abstract out the ugliness into a macro and keep it far from my computation code.

      I was playing around with running messages round-trip (encrypt, decrypt, and compare), and I noticed that they often didn't match up. I thought I had a huge bug, and made a questioning post on this tutorial. Turns out that I was missing the little fact that the message (m) must be less than the modulus (n). This makes a lot of sense in hindsight, but I racked my brain for a while on that problem.

      Now that I've confirmed that my math is working, I've started working on ways to attack the algorithm. The standard way to determine decrypt key D is to factor modulus N. This doesn't use any of the code I just wrote, so I've been attacking by using sequential D's to decrypt a ciphertext with a known message. If the decrypted message matches what we started with, I know D is a potential decrypt key. This attack is easily parallelized in Clojure.

      The source for solving the Programming Praxis problem is here. There's a README that should get you started playing around with your own RSA crypto. The next step would be to add some padding for security or conversion to/from ASCII text.

        Friday, March 25, 2011

        Cursor-10 Clojure bot

        Usually when I say I love using programs to solve problems for me, I'm referring to some kind of machine learning. Today, I mean I'm using the raw power of automated mouse control beat a flash game that was pissing me off.

        Cursor-10 is a strange and unpolished Japanese game. Try it out here.

        Since its been my latest toy, I decided to write the bot in Clojure. I could have chosen Python, but manipulating the mouse is much easier in Clojure due to Java's Robot class. To be honest, the entire thing could have been written in something like AutoIt, but I wanted to take advantage of the abstractions available with Clojure's Lisp syntax to make everything as clean as possible.

        Like I said, my choice of Clojure meant that I could take advantage of Java's Robot class- awesome. However, mouse-click actions are a side effect in Clojure land, meaning that its lazy evaluation made my Robot calls never happen. I needed to sprinkle doseq and dorun calls all over the place to force the side effects to actually be resolved. Not a huge deal, but it was the source of most of my debugging headaches.


        So now I know how I'm going to play this game- clicking using Clojure/Java's Robot class. Next I need to know where to click. For some rapid prototyping, I started by manually entering coordinates for interesting game elements: crystals, ladders and boxes. This quickly proved an inadaquite solution... there must be a better way!


        The better way was to use ImageMagick to programmatically find the coordinates of elements. Check out this blog post to see the details of how I did that.

        It would have been awesome to use ImageMagick in real time to make my program more dynamic, but the computation took WAY too long for that to be possible. So I created lists in Clojure of everything I wanted to click on on each floor. These lists are currently contained inside the main module of the project, but it may have been cleaner to put them in their own text file and load that file on program startup.

        After all that, I had a working bot. Check it out:



        I win! The video starts at the final cursor (the only one that grabs crystals and gets points), but start at the beginning to see them all in action.

        Note that I did give the bot some manual interaction; I started each cursor at the appropriate time. I could figure out the time between each level and make the bot fully autonomous, but that sounds like a lot of headache that I could devote to something more deserving.

        The project source is available on github. There's a helpful README to get you started.

        Thursday, March 24, 2011

        ImageMagick subimage search

        As part of a larger project, I needed to find the coordinates of any instances of a small image inside a larger image. Here are the two images:


        I think you can guess out what my project was. Anyway, using ImageMagick's subimage-search to get the desired coordinates (all those pyramid shaped crystal things) was pretty easy. Here is the command I used to analyze floor #11:

        This outputs a new grey image, with a white spot corresponding to the location of every match. Check it out:
        Note the white spot in the center of each crystal- those are the matches we're looking for! If we tell ImageMagick to output a text file, we get a list of coordinates:

        Note the coordinate with the color of "white." Now we're getting somewhere- a numerical coordinate corresponding to a match. Here is a quick one-liner I used to parse the coordinates out of the file:

        This grabs all the lines of the file that contain 'white.' It then uses awk to parse the raw output into something more useable in Clojure:

        Done! Now we have a coordinate list of all the crystals in the image. Check out how I used it in a future blog post.

        Tuesday, March 22, 2011

        Inno Setup checkedonce flag in Code section

        When using Inno Setup, you can set flags to any of the optional [Tasks] your installer might be preforming. We were using the 'checkedonce' flag for one of our tasks at work to make sure the Task was set by default only the first time we ran the installer on a machine. When we brought this Task into our [Code] section to be run by a custom Wizard page, we lost our checkedonce functionality.

        The issue at hand is that we want to set the flag to be True if we are installing for the first time on a user's machine, and set to False if we are performing an upgrade. If we are preforming an upgrade on a machine, we should be able to find the Inno Setup app ID in the registry, stored here:
        Software\Microsoft\Windows\CurrentVersion\Uninstall\{AppId}\
        If we find this key, we know the installer has been run before, and we are preforming an upgrade. If we can't find it, it's a new install. Here's the code to check if it exists:

        Note that we check both HKLM and HKCU, because [I think] the application install information could live in either one.

        References
        http://stackoverflow.com/questions/2000296/innosetup-how-to-automatically-uninstall-previous-installed-version

        Sunday, March 13, 2011

        Genetic "Hello World"- Clojure Edition

        Last post I talked about implementing the Genetic Algorithm equivalent of "Hello World" in Python. Since I was learning Clojure around the same time, I decided to implement the algorithm the newer language as well.

        Already having the iterative Python program written, it was strange to move to the functional Clojure. I made a point not to write a port of my previous program; I used the knowledge and experience from writing the algorithm in Python, but made sure to "think" in Clojure while I was writing.

        In Python, my script's main logic was contained in one large function with a while loop.


        In Clojure, this became a small recursive function which called started a chain of calls to other small functions.


        While the Python script could have been written with more helper functions like the ones used in Clojure, the language does not push you in that direction the same way you get with functional languages (obviously). Therefore, the large function with the while loop seems natural in Python, just like how lots of small functions seems completely natural in Clojure.


        One of the major benefits I've realized while playing with Clojure is how easy unit testing becomes when the language pushes you to write small functions with no side effects. Every function has a suite of obvious tests. This concept mostly failed for this particular project however, since the reliance on using random selection in many functions made unit testing mostly impossible for them. I tested every function when it seemed natural to do so, but my experiments on testing functions with regular but randomized output were too convoluted to actually use.


        There is one problem I have with my Clojure solution that I have been unable to solve. It runs much slower than the equivalent Python version. I have hooked it up to Java's VisualVM to do some profiling, but I was unable to get helpful results. It showed (I think) that most of my time was spent sorting the solution set. While this does make sense, the algorithm is much cleaner with the sort- and in any case, the Python version has the same sort in the same spot. I plan to look in to this issue more (following these suggestions) once I have more experience with Clojure.


        The full source is available on github. It includes a README to help you get started playing around, and explains more about the algorithm. 

        Friday, March 11, 2011

        Python Genetic "Hello World"

        I've always enjoyed using computers to solve problems I'm too lazy to solve myself (see my Monte Carlo attack on an old math joke). That's why I got really excited when I read Howard Yeend's post on Genetic Algorithms, which made everything seem complex but accessible. My proven method for trying to understand something to write a program to do it for me (this dates back to learning the quadratic formula by writing a program for the TI-83 in high school), I figured I'd try my own Genetic "Hello World."
        Here we can see the output from running the Genetic Hello World program. It took 76 generations to reach the target string.
        There's many great resources for Genetic Algorithms, but here's a high level overview. A random sample of chromosomes is created. These chromosomes are combined, and the ones that score the best on a fitness function are kept for the next round of breeding. If you have a good fitness function, your solutions get better and better until you have an acceptable answer.

        The source for this project is available on github. There's a README included that explains how to get set up and how to modify it to play with it yourself. I'm embedding the main script here so you can take a look: 

        Thursday, March 10, 2011

        Boids Sandbox

        In an effort to discuss past projects that I've enjoyed working on, I'm going to start with one from my senior year of college. As part of my elective Artificial Intelligence course, we had to end the semester with a demonstration project from any area of AI. Having a double threat of recently reading about Boids as well as learning Python in my internship, my path was clear. I was going to develop my own bird flocking simulator.
        Boids operate as an autonomous group, flying together in flocks without a defined "leader." Each boid is affected by neighboring boids according to three rules:

        1. Separation- ensures the boids don't get too close together and collide
        2. Cohesion- keeps the boids together in a loose flock
        3. Alignment- makes the boids point in the same direction as their neighbors, making the flock head the same way
        As you can tell, the forces of Separation and Cohesion are direct opposites. However, each force is given a operation distance as well as a weight during calculation. Here are the values my model currently uses:
          separation_distance = 20**2
          cohesion_distance = 50**2
          alignment_distance = 50**2
          separation_weight = 2
          cohesion_weight = 1.5
          alignment_weight = 1.5
        The Separation force doesn't kick in until a closer distance than Cohesion, but it operates with a greater weight.

        You can check out the complete source code (and images) from Bitbucket (this project was before I discovered the wonder of git). It has a README to help you get set up (it's easy!), and start manipulating parameters to play with the sandbox yourself. The main module is embedded here so you can take a look:

        Sunday, March 6, 2011

        An infinite number of mathematicians walk into a bar...

        The first one orders a beer. The second orders half a beer. The third, a quarter of a beer. Before the fourth mathematician can place his order, the bartender says "You're all idiots", and pours two beers. 
        Yes, dumb math joke. But how easy is this to prove? We could form a mathematical proof that shows this must be true (I think I did this in class at one point), but where's the fun in that? Monte Carlo methods don't really prove much, but I think they're a fun way to solve1 this kind of math riddle.


        So, how fast can we write a Clojure function that figures this out for us? Pretty fast- it's a one liner2.
        (reduce + (pmap #(/ 1 (Math/pow 2 %)) (take 200000 (iterate inc 0))))
        I'm going to use this post to telly you how this works. It's best to try to read Clojure code from the inside out. Let's start with
        (take 20000 (iterate inc 0))
        This will generate a lazy sequence of the numbers 0 - 19999. The (iterate inc 0) part is actually an infinite (but luckily, lazy) sequence. Pretty cool, huh?


        The next block is the map function
        (pmap #(/ 1 (Math/pow 2 %)) (take 200000 (iterate inc 0)))
        The map documentation shows that it takes a function and a sequence of numbers. We've already generated our vector, so what about this function? 


        The #( syntax means we will be using an anonymous function. The % becomes the variable we pass in to the function; in this case we're passing in each number from 1-19999. This uses Java's pow function from java.lang.Math to raise 2 to that power, then divide into 1. Notice how easy it is to use Java libraries within Clojure code.


        So now we have our two arguments to pmap. As the documentation says, this function is exactly equivalent to map, but has the added benefit of the map tasks in parallel. Since our functions are pure and don't depend on each other for computation, we can parallelize them. So, we use pmap instead of map, and boom3! Instant parallelization. I love this game.


        Now we have a 20000 element long sequence kinda like this
        [1 1/2 1/4 1/8 1/16 1/32 ...]
        Each of these corresponds to a drink that one of our infinite mathematicians ordered back in our joke4. The last thing we need to do is sum all of these fractions up.
        (reduce + (pmap #(/ 1 (Math/pow 2 %)) (take 200000 (iterate inc 0))))
        And that brings us back to our original one liner. Now that we understand the inside, we can see that reduce takes a function (yes, + is just a function) and a sequence. It returns whatever it gets by "folding" the answer from each call of the function to the next one (check the docs or Wikipedia for a real explanation), leading us to get a single number from our sequence. 


        The answer is 2 by the way. Probably could have saved you a bunch of time if I just said that in the beginning.


        How could this be made cooler? By using 20000 elements, we're ensuring that we reach a point where the rounding inaccuracies of floating point numbers show us an answer of 2. Of course, the actual answer never reaches two. However, we could write our own recursive function (which would replace the reduce call) that stopped execution when it noticed that the input sum hadn't changed after calculation, meaning we'd reached our maximum precision.


        1 Using "solve" in the loosest sense of the word.
        2 Blogs have magical properties that turn amazing one liners into ugly multi-line messes. At one point it was one line, and in a 80 character window too.
        3 This is the one and only time this blog will link to Steve Jobs. Get your fill while it lasts.
        4 Were you aware that analyzing a bad joke to death doesn't make me like it any more?