I’ve been a fan of GNU Parallel for a while but until recently have only used it occasionally. That’s a shame, because it’s often the simplest solution for quickly solving embarrassingly parallel problems.
My recent usage of it has centered around database export/import operations where I have a file that contains a list of primary keys and need to fetch the matching rows from some number of tables and do something with the data. The database servers are sufficiently powerful that I can run N copies of my script to get the job done far faster (where N is a value like 10 or 20).
A typical usage might look like this:
cat ids.txt | parallel -j24 --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"
However, I recently found myself scratching my head because parallel was only running 3 jobs rather than the 24 I had specified. After trying various experiments I finally went back and re-read the very complete manual page.
And, finally, I put the pieces together when I came across the notion of “blocks” and then saw this in the section about piping:
Spread input to jobs on stdin (standard input). Read a block of data from stdin (standard input) and give one block of data as input to one job. The block size is determined by --block.
The default block size is 1MB. How big was my input file? Even though it contained hundreds of thousands of primary keys, it was about 2.5MB in size.
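The effect is easy to reproduce. Here's a minimal sketch (the `/tmp/ids-demo.txt` file is a synthetic stand-in for my real ids.txt): feed a couple of megabytes of lines through `--pipe` with a high `-j`, and count how many jobs actually start by counting the outputs, one per block.

```shell
# Generate roughly 2.7MB of fake keys (a stand-in for the real ids.txt).
seq 400000 > /tmp/ids-demo.txt

# Each job runs `wc -l` on its block and prints one line, so counting the
# output lines counts the jobs parallel actually started. With the default
# --block of 1M, ~2.7MB of input yields only about 3 blocks -- far fewer
# than the 24 job slots requested.
jobs=$(parallel -j24 --pipe wc -l < /tmp/ids-demo.txt | wc -l)
echo "jobs started: $jobs"
```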
Ah ha! That explained why parallel only bothered to fire up 3 subprocesses for me. So a bit of tweaking was in order and I ended up with this:
cat ids.txt | parallel -j24 --block-size=32K --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"
That, as we like to say at work, runs good. My 15 minute task now completes in less than 2 minutes.
While parallel is a useful tool, it also has A LOT of options. This is about the 4th or 5th time I’ve had to read the manual page and I find that I’m still learning things each time I do. Hopefully this will save someone else a bit of head scratching when they can’t figure out why GNU Parallel isn’t running the number of jobs they asked for.
I find that xjobs is simpler and plenty effective:
http://www.maier-komor.de/xjobs.html
Simpler is a matter of taste, but I like
parallel mpg123 -s {} '>' {.}.wav ::: *.mp3
better than:
ls -1 *.mp3 | sed 's/\(.*\)\.mp3/"\1.mp3" > "\1.wav"/' | xjobs -- mpg123 -s
The input is text lines, so don't you want the distribution to subprocesses to be split on line boundaries too?
It is. And the --max-lines parameter controls how many are fed to STDIN at a time.
Not entirely true: --max-lines gives the record size. So if you do --max-lines 4, GNU Parallel may read 4, 8, 12, 16 … lines at a time, but up to the --block size. If you want to control how many lines are fed, you should use -N, which is the number of records to read (a record defaults to a single line).
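A small illustration of the difference (assuming the default record of one line): with --pipe and -N2, each job receives exactly two records, so six input lines become exactly three jobs of two lines each.

```shell
# -N2: exactly two records (lines) per job, so six lines -> three jobs,
# each reporting a line count of 2.
seq 6 | parallel -j4 --pipe -N2 wc -l
```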
Ah, thanks for the clarification…
Have a look at --pipepart:
--pipepart
Pipe parts of a physical file. --pipepart works similar to --pipe, but is much faster.
If --block is left out, --pipepart will use a block size that will result in 10 jobs per jobslot, except if run with --round-robin in which case it will result in 1 job per jobslot.