GNU Parallel and Block Size(s)

I’ve been a fan of GNU Parallel for a while but until recently have only used it occasionally. That’s a shame, because it’s often the simplest solution for quickly solving embarrassingly parallel problems.

My recent usage of it has centered around database export/import operations where I have a file that contains a list of primary keys and need to fetch the matching rows from some number of tables and do something with the data. The database servers are sufficiently powerful that I can run N copies of my script to get the job done far faster (where N is a value like 10 or 20).

A typical usage might look like this:

cat ids.txt | parallel -j24 --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"

However, I recently found myself scratching my head because parallel was only running 3 jobs rather than the 24 I had specified. After trying various experiments I finally went back and re-read the very complete manual page.

And, finally, I put the pieces together when I came across the notion of “blocks” and then saw this in the section about piping:

Spread input to jobs on stdin (standard input). Read a block of data from stdin (standard input) and give one block of data as input to one job. The block size is determined by --block.

The default block size is 1MB. How big was my input file? Even though it contained hundreds of thousands of primary keys, it was only about 2.5MB in size.

Ah ha! That explained why parallel only bothered to fire up 3 subprocesses for me: at 1MB per block, a 2.5MB input yields just three blocks, and each block goes to one job. So a bit of tweaking was in order and I ended up with this:

cat ids.txt | parallel -j24 --block-size=32K --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"

That, as we like to say at work, runs good. My 15-minute task now completes in less than 2 minutes.
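
The arithmetic behind both runs can be sketched in a few lines of shell. This is only a back-of-the-envelope estimate: it assumes one job per block, that 1M means 1,048,576 bytes, and that the input is exactly 2.5MB (mine was only roughly that). The function name blocks_needed is made up for illustration and is not part of parallel.

```shell
#!/bin/sh
# Estimate how many blocks --pipe will produce for a given input size.
# Assumption: one job per block; record-boundary adjustments are ignored.
blocks_needed() {
    total=$1   # input size in bytes
    block=$2   # block size in bytes
    # Integer ceiling division: a partial final block still counts as a block.
    echo $(( (total + block - 1) / block ))
}

input_bytes=$(( 5 * 1024 * 1024 / 2 ))   # hypothetical 2.5MB input file

echo "1M blocks:  $(blocks_needed "$input_bytes" $(( 1024 * 1024 )))"
echo "32K blocks: $(blocks_needed "$input_bytes" $(( 32 * 1024 )))"
```

Under those assumptions the default block size gives 3 blocks, while 32K gives 80, which is more than enough to keep all 24 job slots busy.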

While parallel is a useful tool, it also has A LOT of options. This is about the 4th or 5th time I’ve had to read the manual page and I find that I’m still learning things each time I do. Hopefully this will save someone else a bit of head scratching when they can’t figure out why GNU Parallel isn’t running the number of jobs they asked for.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in craigslist, mysql, programming, tech.

7 Responses to GNU Parallel and Block Size(s)

  1. SQ says:

    I find that xjobs is simpler and plenty effective:
    http://www.maier-komor.de/xjobs.html

    • oletange says:

      Simpler is a matter of taste, but I like

      parallel mpg123 -s {} '>' {.}.wav ::: *.mp3

      better than:

      ls -1 *.mp3 | sed 's/\(.*\)\.mp3/"\1.mp3" > "\1.wav"/' | xjobs -- mpg123 -s

  2. The input is in text lines, so don’t you want the blocks handed to the subprocesses to be split on line boundaries too?

  3. oletange says:

    Have a look at --pipepart:

    --pipepart
    Pipe parts of a physical file. --pipepart works similar to --pipe, but is much faster.

    If --block is left out, --pipepart will use a block size that will result in 10 jobs per jobslot, except if run with --round-robin in which case it will result in 1 job per jobslot.
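
To get a feel for what that default means for the file in this post, here is a rough sketch that applies the rule quoted from the manual: file size divided by (job slots times 10), or by job slots alone for the round-robin case. The function name pipepart_block_estimate and the exact 2.5MB file size are assumptions for illustration, and the real tool may round differently or adjust to record boundaries.

```shell
#!/bin/sh
# Rough estimate of the block size --pipepart would choose when --block is omitted.
# Based only on the manual text quoted above; treat the result as an approximation.
pipepart_block_estimate() {
    size=$1              # file size in bytes
    slots=$2             # number of job slots (the -j value)
    per_slot=${3:-10}    # jobs per jobslot; pass 1 to model --round-robin
    echo $(( size / (slots * per_slot) ))
}

file_bytes=$(( 5 * 1024 * 1024 / 2 ))   # hypothetical 2.5MB ids.txt

echo "default:     $(pipepart_block_estimate "$file_bytes" 24) bytes per block"
echo "round-robin: $(pipepart_block_estimate "$file_bytes" 24 1) bytes per block"

# A possible invocation (not run here); --pipepart reads the file given with -a
# instead of stdin:
#   parallel -j24 --pipepart -a ids.txt "bin/munge-data.pl --db live >> {#}.out"
```

With 24 job slots that works out to blocks of roughly 10K by default, so a small input file still gets spread across every slot without tuning --block by hand.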
