[ Date | 2021-03-31 21:37 -0400 ] [ Mod. | 2021-05-14 15:25 -0400 ]

## Problem statement

In some "Big Data" circles, trying to do parallel computation on single, large files is a known cause of bottlenecks: unless the job of decompression can be naturally split over several computation units, decompression has to happen without parallelism.

On clusters that may have hundreds of computation units, it is wasteful to let the cluster mostly idle as only N units are able to concurrently read N input files, when N is much smaller than the number of available computation units.

File formats designed for parallelism allow extensive concurrency: it is possible to quickly find indices to the start of each record, or to small groups of records, and to spread the actual job of deserializing over as many workers as needed.

For various possible reasons, interoperability among them, it often happens that the input is simply a text-based format (perhaps JSON, perhaps TSV) compressed with a standard general-purpose algorithm: gzip, bzip2, xz. Sadly, all of those but bzip2 are impossible to read in parallel using Apache Hadoop.

How does Hadoop "split" bzip2 to allow parallel decompression?

## Skippability

Does the bzip2 file format have a neat index of the starting offset of all blocks? Or, instead, does each block begin with a size marker, allowing blocks to be skipped efficiently?

Given that bzip2 is indeed "splittable" in Hadoop, and that this checks out experimentally (switching a large, single input from gzip to bzip2 increases concurrency from 1 to as many computation units as are available), I would have expected, before even looking into internals, the on-disk format to be something like the following, where block-length would allow skipping:

```
symbol := expression

block-length := some encoding for one integer
```
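With a hypothetical length-prefixed format like that, skipping would be trivial: reaching block k is a loop of reads and seeks that touches only the length fields, never the payloads. A minimal Java sketch, where the 8-byte length field and the format itself are invented for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class LengthPrefixedSkip {
    /**
     * Returns the byte offset of block k in a hypothetical format where
     * each block is an 8-byte big-endian length followed by that many
     * payload bytes. Only the length fields are ever read.
     */
    static long offsetOfBlock(RandomAccessFile f, int k) throws IOException {
        f.seek(0);
        for (int i = 0; i < k; i++) {
            long len = f.readLong();              // hypothetical block-length field
            f.seek(f.getFilePointer() + len);     // skip the payload entirely
        }
        return f.getFilePointer();
    }
}
```

Seeking to block k costs k small reads instead of decompressing (or even scanning) everything before it.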

Unfortunately, bzip2 block headers do not include a size marker; parallel decompressors instead search for the 48-bit magic numbers that mark block boundaries (the 6-byte BCD encodings of $\pi$ and $\sqrt\pi$, incidentally). The actual format is¹:

```
block-header := block-magic block-crc randomized orig-ptr
```

None of the fields within the block header is a length that would allow skipping straight to the next block.
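As an aside, the BCD claim is easy to check: reading the block magic's hex digits as decimal digits recovers π to twelve digits, and the end-of-stream magic (0x177245385090, from the same specification) similarly encodes √π. A small sketch:

```java
public class MagicDigits {
    public static final long BLOCK_MAGIC = 0x314159265359L; // block header magic
    // End-of-stream magic, per the bzip2 format notes referenced below.
    public static final long EOS_MAGIC = 0x177245385090L;

    /** For a BCD-encoded value, the hex digits are the decimal digits. */
    static String bcdDigits(long magic) {
        return String.format("%012X", magic);
    }
}
```

Comparing against Math.PI confirms the block magic is π rounded to twelve digits, and the end-of-stream magic is √π truncated to twelve digits.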

This still allows some amount of "skipping", in the sense that a reader can find block boundaries without decompressing, but it is not true skipping in the sense of never reading irrelevant bytes: the reader still walks the input linearly, touching most or all bytes, and only saves the time that would have been spent decompressing.

The Hadoop reader implementation² has a skipToNextMarker method that looks for a bit pattern within a stream; it is used to split the bzip2 stream into blocks by searching for the block start marker:

```java
public class CBZip2InputStream extends InputStream implements BZip2Constants {

  public static final long BLOCK_DELIMITER = 0X314159265359L; // start of block
  // ...
  private static final int DELIMITER_BIT_LENGTH = 48;
```

As far as I can tell, the process is straightforward: read DELIMITER_BIT_LENGTH bits from the stream, test them for equality with BLOCK_DELIMITER, and shift in additional bits one at a time until the pattern is matched or the end of the buffer is reached.
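That search is easy to sketch outside Hadoop. Because bzip2 is a bit stream, the 48-bit magic can start at any bit offset, so the scan keeps a 48-bit window and slides it one bit at a time. This is my own minimal illustration with invented names, not the Hadoop code:

```java
public class BlockMagicScanner {
    static final long BLOCK_DELIMITER = 0x314159265359L;
    static final int DELIMITER_BIT_LENGTH = 48;

    /**
     * Returns the bit offset of the first occurrence of the block magic at
     * or after fromBit, or -1 if the buffer contains none.
     */
    static long findBlockMagic(byte[] data, long fromBit) {
        long totalBits = (long) data.length * 8;
        if (fromBit + DELIMITER_BIT_LENGTH > totalBits) return -1;
        long mask = (1L << DELIMITER_BIT_LENGTH) - 1;
        long window = 0;
        for (long b = fromBit; b < fromBit + DELIMITER_BIT_LENGTH; b++) {
            window = (window << 1) | bitAt(data, b);     // prime first 48 bits
        }
        for (long pos = fromBit; ; pos++) {
            if ((window & mask) == BLOCK_DELIMITER) return pos;
            long next = pos + DELIMITER_BIT_LENGTH;
            if (next >= totalBits) return -1;
            window = (window << 1) | bitAt(data, next);  // slide by one bit
        }
    }

    /** Extracts one bit, most-significant bit of each byte first. */
    static int bitAt(byte[] data, long bit) {
        return (data[(int) (bit >>> 3)] >>> (7 - (int) (bit & 7))) & 1;
    }
}
```

A splitter can hand each worker a byte range and let it scan forward from the start of its range to the first magic it finds, which is exactly the "linear read, no decompression" trade-off described above.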

## References

• Joe Tsai. Bzip2: Format Specification.

This document carefully presents itself as a non-official specification, and is based on reverse-engineering bzip2-1.0.6.

• Mike Haertel. Why GNU grep is fast. Excerpts:

"GNU grep [...] [avoids looking at every input byte]"

"GNU grep uses the well-known Boyer-Moore algorithm, which [...] uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character."

• The Knuth–Morris–Pratt algorithm, from c. 1970, searches for a fixed pattern within a string in linear time: $\Theta(m + n)$ worst-case, where $m$ and $n$ are the lengths of the pattern and of the string being searched, respectively.

• The Boyer–Moore algorithm, from 1977, allows searching in better than linear time in the best case: the longer the pattern, the further each skip can jump, down to roughly $n/m$ character inspections.
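To make the grep quote concrete, here is a minimal Boyer–Moore–Horspool sketch: the simplified variant that uses only the bad-character lookup table the quote describes. It is my own illustration, not grep's code:

```java
public class Horspool {
    /**
     * Returns the index of the first occurrence of pat in text, or -1.
     * On a mismatch, the table says how far the window can jump, based on
     * the last byte of the current window.
     */
    static int indexOf(byte[] text, byte[] pat) {
        int m = pat.length;
        int[] skip = new int[256];
        java.util.Arrays.fill(skip, m);           // byte not in pattern: jump m
        for (int i = 0; i < m - 1; i++) {
            skip[pat[i] & 0xFF] = m - 1 - i;      // distance from last position
        }
        int i = 0;
        while (i + m <= text.length) {
            int j = m - 1;
            while (j >= 0 && text[i + j] == pat[j]) j--;
            if (j < 0) return i;                  // full match
            i += skip[text[i + m - 1] & 0xFF];    // bad-character skip
        }
        return -1;
    }
}
```

When the window's last byte does not occur in the pattern at all, the search advances by the full pattern length without inspecting the bytes in between, which is how a longer pattern makes the search faster.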

1. Adapted from Joe Tsai, Bzip2: Format Specification.

2. As of 2021-03-31.

www.kurokatta.org
