Date: 2017-03-27 01:51 -0400
GNU parallel, without any options, will start as many jobs as there are cores and will, importantly, buffer so that each job's output appears as a unit, without mixing among jobs. (The order in which each job's output is printed is generally non-deterministic: the first to terminate wins.) So the following command (asking parallel to run seq 0 3 three times in parallel) will reliably print the exact same thing every time, not mixing the 0-1-2-3 sequences:
$ yes 'seq 0 3' | head -n3 | parallel
0
1
2
3
0
1
2
3
0
1
2
3
Those are reasonable defaults.
Trouble may arise when, instead of seq 0 3, the command is, for example, zcat, and the input is a bunch of files that expand to gigabytes and gigabytes. For example, this asks parallel to pipe each file matching the pattern data-in/*.tsv.gz into the command zcat | adjust (therefore decompressing each input and filtering it through the hypothetical command adjust) and finally compressing the compound output into a single file:
$ parallel zcat \| adjust ::: data-in/*.tsv.gz | gzip >summary.tsv.gz
If the output from zcat | adjust is large, this may well fill the filesystem where temporary files are stored: parallel will buffer an entire file's worth of output before passing it to the rest of the pipeline (here, gzip >summary.tsv.gz). If the input files are all roughly M bytes in size, thus taking about the same time to process, they each turn into N bytes after being zcat|adjusted, and there are j jobs run in parallel, then the temporary storage could reach j·N bytes. Taking the example of M = 1GB, N = 10GB, and j = 40 cores (this would be the case where adjust expands the input lines into several, perhaps into canonical TSV from an input that is a mix of TSV and comma-separated values within a field), we would use up to j·N = 400GB of temporary storage, whereas the final disk space used for those processed files, once recompressed, may be a tenth of that, depending on entropy.
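That back-of-envelope figure is easy to sanity-check with a trivial shell calculation (treating GB as 10^9 bytes; the j and N values are the illustrative ones from above, not measurements):

```shell
## Worked check of the temporary-storage estimate: j jobs, each
## buffering roughly N bytes of decompressed, adjusted output.
j=40                      ## parallel jobs, one per core
N=$((10 * 1000000000))    ## ~10GB of output per job
echo "peak temporary storage: $((j * N)) bytes"
```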
This is an unfortunate situation on a system that lacks the necessary amount of temporary storage, especially in the likely case where the order of records (lines) does not matter: we would much rather have parallel print processed lines as they become ready, perhaps in small batches; keeping gigantic blocks of data around just to be able to ship them off all at once doesn't make sense in this case.
The default behavior can also starve subsequent members of a longer pipeline while blocks of their input are being prepared, using computing resources inefficiently by leaving some of them idle when they could be doing work.
The man page from GNU parallel 20120422 (shame on me for using such outdated systems) says the following:
--group Group output. Output from each jobs is grouped together and is
only printed when the command is finished. stderr (standard
error) first followed by stdout (standard output). This takes
some CPU time. In rare situations GNU parallel takes up lots
of CPU time and if it is acceptable that the outputs from
different commands are mixed together, then disabling grouping
with -u can speedup GNU parallel by a factor of 10.
And:
--ungroup
-u Ungroup output. Output is printed as soon as possible. This
may cause output from different commands to be mixed. GNU
parallel runs faster with -u. Can be reversed with --group.
Those make it seem as though the solution to our problem is to just use --ungroup, or -u, when calling parallel:
$ parallel -u zcat \| adjust ::: data-in/*.tsv.gz | gzip >summary.tsv.gz
This even mostly seems to work when running tests on tiny samples of data! But there is a catch, acknowledged by the author in public forums, including the announcement of version 20130822:
--line-buffer will buffer output on line basis. --group keeps the output together for a whole job. --ungroup allows output to mixup with half a line coming from one job and half a line coming from another job. --line-buffer fits between these two; it prints a full line, but will allow for mixing lines of different jobs.
As documented above, --ungroup will merrily run parts of records (lines) together. I can't think of a non-convoluted situation where this would be useful. This is fairly easy to observe, for example:
## Without -u, the two input lines `aa` and `bb` stay separate (they
## could be printed in any order with respect to each other, so `bb`
## followed by `aa` is possible as well):
$ parallel cat ::: <(echo -n a; sleep 2; echo a) \
<(sleep 1; echo -n b; sleep 2; echo b)
aa
bb
## With -u, we risk having the lines mixed:
$ parallel -u cat ::: <(echo -n a; sleep 2; echo a) \
<(sleep 1; echo -n b; sleep 2; echo b)
aba
b
The right way, if we do not want mixed partial records, is to use --line-buffer (in non-ancient versions of parallel):
$ parallel --line-buffer cat ::: <(echo -n a; sleep 2; echo a) \
<(sleep 1; echo -n b; sleep 2; echo b)
aa
bb