GNU parallel --ungroup is weird

[ Date | 2017-03-27 01:51 -0400 ]


GNU parallel, without any options, will start as many jobs as there are cores and will, importantly, buffer so that each job's output appears as a unit, without mixing among jobs. (The order in which each job's output is printed is generally non-deterministic: the first to terminate wins.) So the following command (asking parallel to run seq 0 3 three times in parallel) will reliably print the exact same thing every time, not mixing the 0-1-2-3 sequences:

$ yes 'seq 0 3' | head -n3 | parallel

Those are reasonable defaults.

Temporary space usage

Trouble may arise when instead of seq 0 3 the command is, for example, zcat, and the input is a bunch of files that expand to gigabytes and gigabytes.

For example, this asks parallel to run zcat | adjust on each file matching the pattern data-in/*.tsv.gz (decompressing each input and filtering it through a hypothetical command adjust), and finally compresses the combined output into a single file. The {} marks where each file name goes; without it, parallel appends the file name to the end of the command line, handing it to adjust instead of zcat:

$ parallel zcat {} \| adjust ::: data-in/*.tsv.gz | gzip >summary.tsv.gz

If the output from zcat | adjust is large, this may well fill the filesystem where temporary files are stored: parallel will buffer an entire file's worth of output before passing it to the rest of the pipeline (here, gzip >summary.tsv.gz). Suppose the input files are all roughly M bytes in size (so they take about the same time to process), each turns into N bytes after going through zcat | adjust, and j jobs run in parallel: the temporary storage could then reach j·N bytes. Take M = 1GB, N = 10GB, and j = 40 cores (this would be the case where adjust expands each input line into several, perhaps producing canonical TSV from an input that is a mix of TSV and comma-separated values within a field). We would use up to j·N = 400GB of temporary storage, whereas the final disk space used for those processed files, once recompressed, may be a tenth of that, depending on entropy.
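
The worst-case estimate is simple enough to compute in the shell. This is only a sketch using the example figures from this section; the variable names are made up for illustration, and the "one tenth" ratio is the same rough guess as in the text, not a measurement.

```shell
# Example figures from the text: N = 10GB of output per job, j = 40 jobs.
n_gb=10   # output of one zcat | adjust job, in GB
jobs=40   # jobs running at the same time (one per core)

# With grouped output, every running job may be holding its whole
# output in temporary storage at the same moment.
tmp_gb=$(( jobs * n_gb ))

# Rough guess from the text: the recompressed result may be a tenth of that.
final_gb=$(( tmp_gb / 10 ))

echo "worst-case temporary storage: ${tmp_gb}GB"
echo "recompressed result, roughly: ${final_gb}GB"
```

Plugging in your own job count and per-file output size gives a quick idea of whether the temporary filesystem can take it.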

This is an unfortunate situation on a system that lacks the necessary amount of temporary storage, especially in the likely case where the order of records (lines) does not matter: we would much rather have parallel print processed lines as they become ready, perhaps in small batches. Keeping gigantic blocks of data around just to ship them off all at once doesn't make sense in this case.

The default behavior could also cause subsequent members of a longer pipeline to starve while blocks of their input are being prepared, causing inefficient use of computing resources by leaving some of them idle when they could be used.

An obvious (wrong) solution

The man page from GNU parallel 20120422 (shame on me for using such outdated systems) says the following:

--group  Group output. Output from each jobs is grouped together and is
         only printed when the command is finished. stderr (standard
         error) first followed by stdout (standard output). This takes
         some CPU time. In rare situations GNU parallel takes up lots
         of CPU time and if it is acceptable that the outputs from
         different commands are mixed together, then disabling grouping
         with -u can speedup GNU parallel by a factor of 10.


-u       Ungroup output.  Output is printed as soon as possible. This
         may cause output from different commands to be mixed. GNU
         parallel runs faster with -u. Can be reversed with --group.

Those make it seem as though the solution to our problem is to just use --ungroup, or -u, when calling parallel:

$ parallel -u zcat {} \| adjust ::: data-in/*.tsv.gz | gzip >summary.tsv.gz

This even mostly seems to work when running tests on tiny samples of data, but there is a catch, acknowledged by the author in public forums, including the announcement of version 20130822:

--line-buffer will buffer output on line basis. --group keeps the output together for a whole job. --ungroup allows output to mixup with half a line coming from one job and half a line coming from another job. --line-buffer fits between these two; it prints a full line, but will allow for mixing lines of different jobs.

As documented above, --ungroup will merrily run parts of records (lines) together. I can't think of a non-convoluted situation where this would be useful. This is fairly easy to observe, for example:

## Without -u, the two input lines `aa` and `bb` stay separate (they
## could be printed in any order with respect to each other, so `bb`
## followed by `aa` is possible as well):
$ parallel cat ::: <(echo -n a; sleep 2; echo a) \
                   <(sleep 1; echo -n b; sleep 2; echo b)

## With -u, we risk having the lines mixed:
$ parallel -u cat ::: <(echo -n a; sleep 2; echo a) \
                      <(sleep 1; echo -n b; sleep 2; echo b)
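
Eyeballing the output works for two lines, but a small filter makes the check mechanical. The helper below is hypothetical (find_mixed is a name made up here): it prints every line that is not one of the whole records we expect, and it is fed sample input with printf so that it runs without parallel.

```shell
# Hypothetical helper: print every input line that is not exactly one of
# the whole records `aa` or `bb`. Like grep, it returns 0 when at least
# one such line was found and 1 when the input was clean.
find_mixed() {
    grep -v -x -e 'aa' -e 'bb'
}

# Whole records only: nothing is printed by the filter.
printf 'aa\nbb\n' | find_mixed || echo "clean"

# The kind of output -u can produce with the timings above:
# both lines are reported.
printf 'aba\nb\n' | find_mixed
```

Piping the output of either parallel command above into find_mixed applies the same check to a real run.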

Actual solution

The right way, if we do not want mixed partial records, is to use --line-buffer (in non-ancient versions of parallel):

$ parallel --line-buffer cat ::: <(echo -n a; sleep 2; echo a) \
                                 <(sleep 1; echo -n b; sleep 2; echo b)
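
Since "non-ancient" matters here, it can be worth checking the installed version first. The sketch below rests on two assumptions: that --line-buffer first shipped with release 20130822 (the announcement quoted earlier), and that the first line of `parallel --version` looks like "GNU parallel 20161222". The function name supports_line_buffer is made up for illustration.

```shell
# Assumption: --line-buffer first shipped with release 20130822.
# GNU parallel versions are YYYYMMDD dates, so they compare as integers.
supports_line_buffer() {
    # $1: a version such as 20120422
    [ "$1" -ge 20130822 ] 2>/dev/null
}

# Assumption: the first line of `parallel --version` reads
# "GNU parallel <version>", so the third field is the version.
if command -v parallel >/dev/null 2>&1; then
    version=$(parallel --version 2>/dev/null | head -n 1 | awk '{print $3}')
    if supports_line_buffer "$version"; then
        echo "parallel $version: --line-buffer should be available"
    else
        echo "parallel $version: probably too old for --line-buffer"
    fi
fi
```

With a recent enough parallel, the pipeline from the beginning would then become: parallel --line-buffer zcat {} \| adjust ::: data-in/*.tsv.gz | gzip >summary.tsv.gz (with {} marking where each file name goes).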
