Everyone knows grep is faster in the C locale

Date: 2019-03-30 01:07 -0400
Modified: 2021-07-16 09:53 -0400

It is a well-known "fact", in some circles, that running grep in the C locale is much faster than in UTF-8 locales, the latter being a common default on current client systems.

Indeed, just the other day, a colleague of mine was running something akin to:

grep '"event-type-id":4727' app.log

where app.log is a multi-gigabyte file of JSON lines¹, the goal being to quickly see whether a newly-added type of event is seen. Almost immediately, someone jumped in to recommend running grep in the C locale, instead of the default UTF-8 locale, for Guaranteed Extra Speediness™.

I wondered whether this was indeed good advice. (TL;DR: using the C locale does not help performance in general; skip ahead to the Conclusion section for the details.)

Searching for a fixed string

This is my most common real-world use case at the moment: search for a small fragment of JSON within a file of JSON lines, both of them in practice ASCII.

In the following benchmark runs:

the file being searched, json10g, consists of 10GB of JSON lines, each about a kilobyte in length;
the search string is just "event-type-id":4727;
the search string does not match anywhere in the input;
grep is the native version of grep for the system used; GNU grep version 2.25 from Ubuntu 16.04 LTS;
the host has an Intel Kaby Lake processor (i7-7500U), simultaneous multithreading disabled, with 16GB of RAM;
the storage is an SSD, which does not matter since the input fits in the buffer cache.
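
The actual generator ships in the archive listed under Reference material. The following is only a minimal sketch of how a comparable input could be produced; the helper name gen_json_lines, the seq and payload fields, and the padding scheme are assumptions made for illustration.

```shell
# Hypothetical generator: emit N JSON lines of roughly 1 KB each.
# Event ids stay within 1000..1999, so "event-type-id":4727 never matches.
gen_json_lines() {
    awk -v n="$1" 'BEGIN {
        pad = sprintf("%960s", "")      # 960 spaces ...
        gsub(/ /, "x", pad)             # ... turned into filler text
        for (i = 1; i <= n; i++)
            printf "{\"seq\":%d,\"event-type-id\":%d,\"payload\":\"%s\"}\n",
                   i, 1000 + i % 1000, pad
    }'
}

# About ten million such lines add up to roughly 10 GB:
# gen_json_lines 10000000 > json10g
gen_json_lines 3 | wc -l
```

The last line only checks that three lines come out; the commented command above it is the one that would write a full-size file.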

On to the results:

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep`	3.156 ± 0.136	3.098…3.542
`LC_ALL=en_US.UTF-8 grep`	3.106 ± 0.007	3.091…3.113

This is how hyperfine summarizes these results:

Summary
  'LC_ALL=en_US.UTF-8 grep -f json-fixed.pat json10g' ran
    1.02 ± 0.04 times faster than 'LC_ALL=C grep ...'

Section conclusion: setting LC_ALL=C did not help improve performance.
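
For readers who want to reproduce this kind of comparison, here is a sketch of a hyperfine invocation consistent with the summary above. The command lines and the pattern file name come from that summary; the choice of --warmup and --ignore-failure is an assumption on my part, and the exact scripts are in the archive listed under Reference material.

```shell
# Store the pattern in a file so that shell quoting cannot alter it.
printf '%s\n' '"event-type-id":4727' > json-fixed.pat

# Run the comparison only where hyperfine and the input are available.
# --warmup primes the buffer cache; --ignore-failure is needed because
# grep exits with status 1 when nothing matches, which is the case here.
if command -v hyperfine >/dev/null 2>&1 && [ -f json10g ]; then
    hyperfine --warmup 1 --ignore-failure \
        'LC_ALL=C grep -f json-fixed.pat json10g' \
        'LC_ALL=en_US.UTF-8 grep -f json-fixed.pat json10g'
fi
```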

Searching using a simple non-fixed regexp string

Of course, things may get hairier when one uses more of grep's power: it accepts full regular expressions as search patterns, not only fixed strings. Let us put this to the test; the search string is now the non-fixed regular expression event-type-id":4[0-7].[0-9]:

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep`	3.835 ± 0.015	3.809…3.859
`LC_ALL=en_US.UTF-8 grep`	3.845 ± 0.006	3.837…3.857

Summary
  'LC_ALL=C grep  -f json-simple-re.pat json10g' ran
    1.00 ± 0.00 times faster than 'LC_ALL=en_US.UTF-8 grep ...'

Section conclusion: setting LC_ALL=C did not help improve performance.

Caveat: the regular expression used here is fixed-length and does not use very many of grep's features; perhaps more complex examples would fare differently.
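
To make the pattern concrete: it matches the key, a 4, one digit from 0 to 7, any single character, then one digit. A small sketch on three made-up sample lines (not taken from the real input):

```shell
pat='event-type-id":4[0-7].[0-9]'

# Three invented sample lines; only the first one should match.
printf '%s\n' \
    '{"event-type-id":4727}' \
    '{"event-type-id":4827}' \
    '{"event-type-id":47}' |
    LC_ALL=C grep -c -e "$pat"
```

This prints 1: the second line fails on the 8, and the third is too short for the four-character tail.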

Searching using a less simple non-fixed regexp string

The performance section from the GNU grep documentation says the following²:

Generally speaking grep operates more efficiently in single-byte locales, since it can avoid the special processing needed for multi-byte characters. If your patterns will work just as well that way, setting LC_ALL to a single-byte locale can help performance considerably. Setting LC_ALL='C' can be particularly efficient, as grep is tuned for that locale.

Outside the C locale, case-insensitive search, and search for bracket expressions like [a-z] and [[=a=]b], can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements.

Wow! The blanket LC_ALL=C advice appears a bit misleading since, as far as I can tell, when the locale does not make a difference to the pattern, neither does it make a difference to performance. It is however interesting to note that this snippet points us directly to a kind of pattern on which grep should not perform the same with varying locales, namely one containing an alphabetic character range:

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep`	12.491 ± 0.047	12.455…12.597
`LC_ALL=en_US.UTF-8 grep`	180.482 ± 1.476	179.123…182.781

Summary
  'LC_ALL=C grep -f json-complex-re.pat json10g' ran
   14.45 ± 0.13 times faster than 'LC_ALL=en_US.UTF-8 grep ...'

Section conclusion: setting LC_ALL=C did help improve performance, tremendously.

Conclusion

In the use cases I tested, setting LC_ALL=C either:

did not make any performance difference at all, when the pattern was a fixed string, or a "simple" regular expression; or
made a notable performance improvement, when the regular expression contained an alphabetic character range.

Note that, in the second case, changing the locale also changes the results, so that one cannot just blindly set it to C. It is fine to use the C locale if, say, [a-f] is meant to be exactly [abcdef], but the range could include more characters in different locales:

$ echo 012@épaulées%DEF | grep -o '[a-z]*'
épaulées

$ echo 012@épaulées%DEF | LC_ALL=C grep -o '[a-z]*'
paul
es
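
Before switching a given search to the C locale, one can check that both locales select the same lines on a representative input. A minimal sketch: same_matches is a hypothetical helper, it assumes the en_US.UTF-8 locale is installed (otherwise the comparison is meaningless), and it compares selected lines only, not the output of -o.

```shell
# Hypothetical helper: succeed if grep selects the same lines under the
# C locale and under another locale (default: en_US.UTF-8, assumed to be
# installed). The two outputs are compared through their checksums.
same_matches() {
    pattern=$1 file=$2 other=${3:-en_US.UTF-8}
    a=$(LC_ALL=C grep -e "$pattern" -- "$file" | cksum)
    b=$(LC_ALL=$other grep -e "$pattern" -- "$file" | cksum)
    [ "$a" = "$b" ]
}

# ASCII-only sample input, invented for the example.
printf '%s\n' '{"event-type-id":4727}' '{"event-type-id":12}' > sample.jsonl
same_matches 'event-type-id":4[0-7].[0-9]' sample.jsonl && echo same
```

On this ASCII-only sample the two runs agree and the snippet prints same. On input containing accented letters and a pattern such as ^[a-z]*$, the answer depends on the locale data of the system.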

Theories

Why would the apparently rarely-relevant LC_ALL=C advice be so prevalent?

Is it a matter of old vs current versions of GNU grep?

Perhaps older versions³ of GNU grep behaved differently from the one that comes with my current system, and they were indeed helped by running in the C locale, regardless of pattern?

All runs below use the same 10GB JSON input json10g as in the previous sections, and the same two search strings (fixed, and the simple bona fide regexp, labeled "simple-re" or "re" in the tables).

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep-2.0` / fixed	4.850 ± 0.152	4.701…5.037
`LC_ALL=en_US.UTF-8 grep-2.0` / fixed	4.848 ± 0.143	4.661…5.008
`LC_ALL=C grep-2.25` / fixed	3.114 ± 0.008	3.106…3.131
`LC_ALL=en_US.UTF-8 grep-2.25` / fixed	3.115 ± 0.008	3.097…3.123
`LC_ALL=C grep-2.28` / fixed	3.132 ± 0.010	3.106…3.141
`LC_ALL=en_US.UTF-8 grep-2.28` / fixed	3.136 ± 0.011	3.123…3.157
`LC_ALL=C grep-3.0` / fixed	3.132 ± 0.008	3.119…3.145
`LC_ALL=en_US.UTF-8 grep-3.0` / fixed	3.134 ± 0.009	3.119…3.153
`LC_ALL=C grep-3.3` / fixed	14.628 ± 0.475	14.465…15.978
`LC_ALL=en_US.UTF-8 grep-3.3` / fixed	14.476 ± 0.008	14.462…14.490

`LC_ALL=C grep-2.0` / simple-re	5.422 ± 0.133	5.173…5.530
`LC_ALL=en_US.UTF-8 grep-2.0` / simple-re	5.316 ± 0.149	5.159…5.476
`LC_ALL=C grep-2.25` / simple-re	3.834 ± 0.015	3.818…3.858
`LC_ALL=en_US.UTF-8 grep-2.25` / simple-re	3.838 ± 0.010	3.816…3.849
`LC_ALL=C grep-2.28` / simple-re	3.890 ± 0.008	3.881…3.904
`LC_ALL=en_US.UTF-8 grep-2.28` / simple-re	3.903 ± 0.007	3.887…3.912
`LC_ALL=C grep-3.0` / simple-re	3.894 ± 0.014	3.876…3.918
`LC_ALL=en_US.UTF-8 grep-3.0` / simple-re	3.918 ± 0.019	3.900…3.962
`LC_ALL=C grep-3.3` / simple-re	14.473 ± 0.012	14.457…14.499
`LC_ALL=en_US.UTF-8 grep-3.3` / simple-re	14.479 ± 0.014	14.462…14.499

Looking at pairs of runs of each version of GNU grep with the locale changing from C to UTF-8, we can see that setting LC_ALL=C did not help improve performance.

We can also see that GNU grep version 3.3 was much slower than all other versions tested; this could be a bona fide regression, a fix for a long-standing correctness bug with performance implications, or something else; I did not investigate the matter.
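
The ten command lines behind each half of the table in this subsection can be enumerated mechanically. A sketch, assuming the binaries are installed on PATH under the names used in the table (grep-2.0 through grep-3.3); bench_commands is a hypothetical helper and takes the pattern file name as its argument.

```shell
# Print one benchmark command per grep version and locale.
# Binary names follow the table; they are assumed to be on PATH.
bench_commands() {
    for version in 2.0 2.25 2.28 3.0 3.3; do
        for locale in C en_US.UTF-8; do
            printf 'LC_ALL=%s grep-%s -f %s json10g\n' \
                "$locale" "$version" "$1"
        done
    done
}

bench_commands json-fixed.pat
```

Each printed line can then be handed to hyperfine as one command argument.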

Is it a matter of BSD grep vs GNU grep?

I ran a FreeBSD 12.0-RELEASE virtual machine on the same hardware as previously⁴, same search strings and input:

Command	Mean [s]	Min…Max [s]
`LC_ALL=C bsdgrep` / fixed	8.076 ± 0.040	8.034…8.177
`LC_ALL=en_US.UTF-8 bsdgrep` / fixed	8.049 ± 0.024	8.004…8.088
`LC_ALL=C bsdgrep` / re	8.258 ± 0.041	8.216…8.362
`LC_ALL=en_US.UTF-8 bsdgrep` / re	8.254 ± 0.042	8.197…8.315

Again, setting LC_ALL=C did not help improve performance here.

Side note: on this system, GNU grep is much faster than BSD grep. A quick win is therefore to favor the GNU version of grep, on systems where it is not already the default.
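
To find out which implementation a plain grep resolves to, its version banner is usually enough. A sketch: grep_flavor is a hypothetical helper, and the banner formats it looks for, "(GNU grep)" and "(BSD grep", are assumptions based on the systems I have seen.

```shell
# Hypothetical helper: classify the grep found first on PATH from the
# first line of its version banner.
grep_flavor() {
    case $(grep -V 2>/dev/null | head -n 1) in
        *'(GNU grep)'*) echo gnu ;;
        *'(BSD grep'*)  echo bsd ;;
        *)              echo other ;;
    esac
}

grep_flavor
```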

Is this a matter of a string that matches versus a string that does not match?

No:

Summary
  'LC_ALL=C grep "event-type-id":4727472747 json1g' ran
    1.01 ± 0.02 times faster than
      'LC_ALL=en_US.UTF-8 grep "event-type-id":4727472747 json1g'
    1.24 ± 0.01 times faster than
      'LC_ALL=en_US.UTF-8 grep "boundingPoly":{"vertices": json1g'
    1.26 ± 0.02 times faster than
      'LC_ALL=C grep "boundingPoly":{"vertices": json1g'

While searching for a matching string ran more slowly than searching for a string that does not match, presumably because of the increased amount of output to be written, the locale setting did not make a difference.

Is this a matter of short search string vs long search string?

In all previous benchmarks, I looked for relatively short strings. Would the locale influence performance when the search string is longer?

In the following, the search string $s is abcdefghijklmnopqrstuvwxyz012345 repeated four times, for a total of 128 characters.
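
For reference, one way to build that string in the shell and confirm its length:

```shell
# Build the search string: the 32-character unit repeated four times.
# printf reuses its format once per argument; %.0s prints nothing of it.
s=$(printf 'abcdefghijklmnopqrstuvwxyz012345%.0s' 1 2 3 4)
echo "${#s}"
```

This prints 128.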

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep`	2.104 ± 0.006	2.098…2.119
`LC_ALL=en_US.UTF-8 grep`	2.101 ± 0.010	2.091…2.126

Summary
  'LC_ALL=en_US.UTF-8 grep $s json10g' ran
    1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'

Setting LC_ALL=C did not help improve performance.

Is this a matter of ASCII encoding versus multibyte?

In the following, the search string $s is 任天堂株式会社／ソニー株式会社／キヤノン株式会社.
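
This string has 24 characters, each taking three bytes in UTF-8, so grep sees a 72-byte pattern. The sketch below, which assumes the text is stored as UTF-8 and uses an invented sample line, shows the byte count and that a fixed-string search under LC_ALL=C still finds it:

```shell
s='任天堂株式会社／ソニー株式会社／キヤノン株式会社'

# Size of the search string in bytes, assuming UTF-8 encoding.
printf '%s' "$s" | wc -c

# Under LC_ALL=C the pattern is compared byte for byte, so a line that
# contains the same byte sequence is still found (-F: fixed string).
printf 'prefix %s suffix\n' "$s" | LC_ALL=C grep -c -F -e "$s"
```

The first command reports 72 and the second reports 1.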

Command	Mean [s]	Min…Max [s]
`LC_ALL=C grep`	1.988 ± 0.005	1.982…1.995
`LC_ALL=en_US.UTF-8 grep`	1.980 ± 0.009	1.968…1.999

Summary
  'LC_ALL=en_US.UTF-8 grep $s json10g' ran
    1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'

Setting LC_ALL=C did not help improve performance.

Reference material

Files used to generate inputs and run benchmarks: grepbench.tar.xz

¹ Or, alternatively, a stream from a logging service; either way, grep is CPU-bound, and we want to make the best use of our resources.
² As of March 2019. This is excerpted from the GNU grep version 3.3 manual.
³ I chose which versions of GNU grep to compile and test as follows: the earliest and latest in the version 2 and version 3 (current) series, plus version 2.25 to match what my system packages. These versions cover the 25-year time span from May 1993 to December 2018.
⁴ Fun fact: the default grep on FreeBSD 12 appears to be GNU grep. On my system, this was version 2.5.1-FreeBSD. BSD grep was version 2.6.0-FreeBSD.

www.kurokatta.org