I’ve been a fan of GNU Parallel for a while but until recently have only used it occasionally. That’s a shame, because it’s often the simplest solution for quickly solving embarrassingly parallel problems.
My recent usage of it has centered around database export/import operations where I have a file that contains a list of primary keys and need to fetch the matching rows from some number of tables and do something with the data. The database servers are sufficiently powerful that I can run N copies of my script to get the job done far faster (where N is a value like 10 or 20).
A typical usage might look like this:
cat ids.txt | parallel -j24 --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"
However, I recently found myself scratching my head because parallel was only running 3 jobs rather than the 24 I had specified. After trying various experiments I finally went back and re-read the very complete manual page.
And, finally, I put the pieces together when I came across the notion of “blocks” and then saw this in the section about piping:
Spread input to jobs on stdin (standard input). Read a block of data from stdin (standard input) and give one block of data as input to one job. The block size is determined by --block.
The default block size is 1MB. How big was my input file? Even though it contained hundreds of thousands of primary keys, it was about 2.5MB in size.
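The effect is easy to reproduce. Here's a minimal sketch (the `/tmp/ids-demo.txt` file is a synthetic stand-in for my real ids.txt): feed a couple of megabytes of lines through `--pipe` with a high `-j`, and count how many jobs actually start by counting the outputs, one per block.

```shell
# Generate roughly 2.7MB of fake keys (a stand-in for the real ids.txt).
seq 400000 > /tmp/ids-demo.txt

# Each job runs `wc -l` on its block and prints one line, so counting the
# output lines counts the jobs parallel actually started. With the default
# --block of 1M, ~2.7MB of input yields only about 3 blocks -- far fewer
# than the 24 job slots requested.
jobs=$(parallel -j24 --pipe wc -l < /tmp/ids-demo.txt | wc -l)
echo "jobs started: $jobs"
```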
Ah ha! That explained why parallel only bothered to fire up 3 subprocesses for me. So a bit of tweaking was in order and I ended up with this:
cat ids.txt | parallel -j24 --block-size=32K --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"
That, as we like to say at work, runs good. My 15 minute task now completes in less than 2 minutes.
While parallel is a useful tool, it also has A LOT of options. This is about the 4th or 5th time I’ve had to read the manual page and I find that I’m still learning things each time I do. Hopefully this will save someone else a bit of head scratching when they can’t figure out why GNU Parallel isn’t running the number of jobs they asked for.
I find that xjobs is simpler and plenty effective:
http://www.maier-komor.de/xjobs.html
Simpler is a matter of taste, but I like
parallel mpg123 -s {} '>' {.}.wav ::: *.mp3
better than:
ls -1 *.mp3 | sed 's/\(.*\)\.mp3/"\1.mp3" > "\1.wav"/' | xjobs -- mpg123 -s
The input is text lines, so don't you want the distribution to subprocesses to be split on line boundaries too?
It is. And the --max-lines parameter controls how many are fed to STDIN at a time.
Not entirely true: --max-lines gives the record size. So if you do --max-lines 4, GNU Parallel may read 4, 8, 12, 16 … lines at a time, but up to the --block size. If you want to control how many lines are fed, you should use -N, which is the number of records to read (a record defaults to a single line).
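A small illustration of the difference (assuming the default record of one line): with --pipe and -N2, each job receives exactly two records, so six input lines become exactly three jobs of two lines each.

```shell
# -N2: exactly two records (lines) per job, so six lines -> three jobs,
# each reporting a line count of 2.
seq 6 | parallel -j4 --pipe -N2 wc -l
```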
Ah, thanks for the clarification…
Have a look at --pipepart:
--pipepart
Pipe parts of a physical file. --pipepart works similar to --pipe, but is much faster.
If --block is left out, --pipepart will use a block size that will result in 10 jobs per jobslot, except if run with --round-robin in which case it will result in 1 job per jobslot.