Everyone knows grep is faster in the C locale

[ Date | 2019-03-30 01:07 -0400 ]
[ Mod. | 2021-07-16 09:53 -0400 ]

It is a well-known "fact", in some circles, that running grep in the C locale is much faster than in UTF-8 locales, the latter being a common default on current client systems.

Indeed, just the other day, a colleague of mine was running something akin to:

grep '"event-type-id":4727' app.log

where app.log is a multi-gigabyte file of JSON lines [1], the goal being to quickly check whether a newly added type of event shows up. Almost immediately, someone jumped in to recommend running grep in the C locale, instead of the default UTF-8 locale, for Guaranteed Extra Speediness™.

I wondered whether this was indeed good advice. (TL;DR: using the C locale does not help performance in general; or jump to the Conclusion section.)

Searching for a fixed string

This is my most common real-world use case at the moment: search for a small fragment of JSON within a file of JSON lines, both of them in practice ASCII.

In the following benchmark runs:

On to the results:

Command                   Mean [s]        Min…Max [s]
LC_ALL=C grep             3.156 ± 0.136   3.098…3.542
LC_ALL=en_US.UTF-8 grep   3.106 ± 0.007   3.091…3.113

This is how hyperfine summarizes these results:

Summary
  'LC_ALL=en_US.UTF-8 grep -f json-fixed.pat json10g' ran
    1.02 ± 0.04 times faster than 'LC_ALL=C grep ...'

Section conclusion: setting LC_ALL=C did not help improve performance.
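For readers who want to try this at a smaller scale, here is a minimal sketch of such a comparison. It is not the original benchmark setup: the input is a tiny synthetic stand-in for the 10 GB file, the file names under the temporary directory are made up for illustration, and the en_US.UTF-8 locale is assumed to be installed. The hyperfine step is skipped when the tool is absent.

```shell
# Hypothetical small-scale reproduction; the data is synthetic,
# not the input used for the numbers above.
workdir=$(mktemp -d)
input="$workdir/json-sample"
pattern="$workdir/json-fixed.pat"

# 100000 JSON lines whose event ids cycle through 4000..4999.
awk 'BEGIN { for (i = 0; i < 100000; i++)
    printf "{\"seq\":%d,\"event-type-id\":%d}\n", i, 4000 + i % 1000 }' > "$input"

# The search string goes into a pattern file, as in the summary above.
printf '%s\n' '"event-type-id":4727' > "$pattern"

# Sanity check first: both locales should select the same number of lines.
count_c=$(LC_ALL=C grep -c -f "$pattern" "$input")
count_utf8=$(LC_ALL=en_US.UTF-8 grep -c -f "$pattern" "$input")
echo "C: $count_c matches, UTF-8: $count_utf8 matches"

# Timing comparison, only if hyperfine is available.
if command -v hyperfine >/dev/null 2>&1; then
    hyperfine --warmup 1 \
        "LC_ALL=C grep -f $pattern $input" \
        "LC_ALL=en_US.UTF-8 grep -f $pattern $input"
fi
```

On an input this small, run-to-run noise can easily dominate, so the timings are only indicative; a larger input gives steadier numbers.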

Searching using a simple non-fixed regexp string

Of course, things may get hairier when one uses more of grep's power: it accepts full regular expressions as search patterns, not only fixed strings. Let us put this to the test; the search pattern is now the regular expression event-type-id":4[0-7].[0-9]:

Command                   Mean [s]        Min…Max [s]
LC_ALL=C grep             3.835 ± 0.015   3.809…3.859
LC_ALL=en_US.UTF-8 grep   3.845 ± 0.006   3.837…3.857

Summary
  'LC_ALL=C grep  -f json-simple-re.pat json10g' ran
    1.00 ± 0.00 times faster than 'LC_ALL=en_US.UTF-8 grep ...'

Section conclusion: setting LC_ALL=C did not help improve performance.

Caveat: the regular expression used here is fixed-length and does not use very many of grep's features; perhaps more complex examples would fare differently.

Searching using a less simple non-fixed regexp string

The performance section of the GNU grep documentation says the following [2]:

Generally speaking grep operates more efficiently in single-byte locales, since it can avoid the special processing needed for multi-byte characters. If your patterns will work just as well that way, setting LC_ALL to a single-byte locale can help performance considerably. Setting LC_ALL='C' can be particularly efficient, as grep is tuned for that locale.

Outside the C locale, case-insensitive search, and search for bracket expressions like [a-z] and [[=a=]b], can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements.

Wow! The blanket LC_ALL=C advice appears a bit misleading since, as far as I can tell, when the locale does not make a difference to the pattern, neither does it make a difference to performance. It is however interesting to note that this snippet points us directly to a pattern on which grep should not perform the same with varying locales:

Command                   Mean [s]          Min…Max [s]
LC_ALL=C grep             12.491 ± 0.047    12.455…12.597
LC_ALL=en_US.UTF-8 grep   180.482 ± 1.476   179.123…182.781

Summary
  'LC_ALL=C grep -f json-complex-re.pat json10g' ran
   14.45 ± 0.13 times faster than 'LC_ALL=en_US.UTF-8 grep ...'

Section conclusion: setting LC_ALL=C did help improve performance, tremendously.

Conclusion

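In the use cases I tested, setting LC_ALL=C either:

  1. made no measurable difference to performance (fixed string, simple regexp); or
  2. improved performance tremendously (less simple regexp).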

Note that, in the second case, changing the locale also changes the results, so that one cannot just blindly set it to C. It is fine to use the C locale if, say, [a-f] is meant to be exactly [abcdef], but the range could include more characters in different locales:

$ echo 012@épaulées%DEF | grep -o '[a-z]*'
épaulées

$ echo 012@épaulées%DEF | LC_ALL=C grep -o '[a-z]*'
paul
es

Theories

Why would the apparently rarely-relevant LC_ALL=C advice be so prevalent?

Is it a matter of old vs current versions of GNU grep?

Perhaps older versions [3] of GNU grep behaved differently from the one that comes with my current system, and they were indeed helped by running in the C locale, regardless of pattern?

All runs below use the same 10 GB JSON input json10g as in the previous sections, and the same two search strings (the fixed string, and the simple regexp, labeled "simple-re").

Command                                    Mean [s]         Min…Max [s]
LC_ALL=C grep-2.0 / fixed                  4.850 ± 0.152    4.701…5.037
LC_ALL=en_US.UTF-8 grep-2.0 / fixed        4.848 ± 0.143    4.661…5.008
LC_ALL=C grep-2.25 / fixed                 3.114 ± 0.008    3.106…3.131
LC_ALL=en_US.UTF-8 grep-2.25 / fixed       3.115 ± 0.008    3.097…3.123
LC_ALL=C grep-2.28 / fixed                 3.132 ± 0.010    3.106…3.141
LC_ALL=en_US.UTF-8 grep-2.28 / fixed       3.136 ± 0.011    3.123…3.157
LC_ALL=C grep-3.0 / fixed                  3.132 ± 0.008    3.119…3.145
LC_ALL=en_US.UTF-8 grep-3.0 / fixed        3.134 ± 0.009    3.119…3.153
LC_ALL=C grep-3.3 / fixed                  14.628 ± 0.475   14.465…15.978
LC_ALL=en_US.UTF-8 grep-3.3 / fixed        14.476 ± 0.008   14.462…14.490
LC_ALL=C grep-2.0 / simple-re              5.422 ± 0.133    5.173…5.530
LC_ALL=en_US.UTF-8 grep-2.0 / simple-re    5.316 ± 0.149    5.159…5.476
LC_ALL=C grep-2.25 / simple-re             3.834 ± 0.015    3.818…3.858
LC_ALL=en_US.UTF-8 grep-2.25 / simple-re   3.838 ± 0.010    3.816…3.849
LC_ALL=C grep-2.28 / simple-re             3.890 ± 0.008    3.881…3.904
LC_ALL=en_US.UTF-8 grep-2.28 / simple-re   3.903 ± 0.007    3.887…3.912
LC_ALL=C grep-3.0 / simple-re              3.894 ± 0.014    3.876…3.918
LC_ALL=en_US.UTF-8 grep-3.0 / simple-re    3.918 ± 0.019    3.900…3.962
LC_ALL=C grep-3.3 / simple-re              14.473 ± 0.012   14.457…14.499
LC_ALL=en_US.UTF-8 grep-3.3 / simple-re    14.479 ± 0.014   14.462…14.499

Looking at pairs of runs of each version of GNU grep with the locale changing from C to UTF-8, we can see that setting LC_ALL=C did not help improve performance.

We can also see that GNU grep version 3.3 was much slower than all other versions tested; this could be a bona fide regression, a fix for a long-standing correctness bug with performance implications, or something else; I did not investigate the matter.
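A comparison matrix like the one above is tedious to type by hand. The sketch below generates one command line per pair of grep binary and locale. The function name bench_commands is hypothetical, and the binary names simply reuse the labels from the table, on the assumption that each compiled version is installed on the PATH under that name.

```shell
# Hypothetical helper: print one benchmark command per (grep binary, locale) pair.
# Binary names follow the table labels and are assumed to be on the PATH.
bench_commands() {
    pattern_file=$1
    input_file=$2
    for grep_bin in grep-2.0 grep-2.25 grep-2.28 grep-3.0 grep-3.3; do
        for locale_name in C en_US.UTF-8; do
            printf 'LC_ALL=%s %s -f %s %s\n' \
                "$locale_name" "$grep_bin" "$pattern_file" "$input_file"
        done
    done
}

# Ten command lines, ready to be handed to a benchmarking tool one by one.
bench_commands json-fixed.pat json10g
```

Keeping the C and UTF-8 runs of each version adjacent makes the pairs easy to compare in the resulting table.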

Is it a matter of BSD grep vs GNU grep?

I ran a FreeBSD 12.0-RELEASE virtual machine on the same hardware as previously [4], with the same search strings and input:

Command                              Mean [s]        Min…Max [s]
LC_ALL=C bsdgrep / fixed             8.076 ± 0.040   8.034…8.177
LC_ALL=en_US.UTF-8 bsdgrep / fixed   8.049 ± 0.024   8.004…8.088
LC_ALL=C bsdgrep / re                8.258 ± 0.041   8.216…8.362
LC_ALL=en_US.UTF-8 bsdgrep / re      8.254 ± 0.042   8.197…8.315

Again, setting LC_ALL=C did not help improve performance here.

Side note: on this system, GNU grep is much faster than BSD grep. A quick win is therefore to favor the GNU version of grep, on systems where it is not already the default.

Is this a matter of string that matches versus string that does not match?

No:

Summary
  'LC_ALL=C grep "event-type-id":4727472747 json1g' ran
    1.01 ± 0.02 times faster than
      'LC_ALL=en_US.UTF-8 grep "event-type-id":4727472747 json1g'
    1.24 ± 0.01 times faster than
      'LC_ALL=en_US.UTF-8 grep "boundingPoly":{"vertices": json1g'
    1.26 ± 0.02 times faster than
      'LC_ALL=C grep "boundingPoly":{"vertices": json1g'

While searching for a matching string ran more slowly than searching for a string that does not match, presumably because of the increased amount of output to be written, the locale setting did not make a difference.
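The presumption that the matching search is slower because it writes more output can be probed by measuring how many bytes each search actually writes. The helper below is a sketch under assumptions: the name output_bytes is hypothetical, and it uses -F to treat the search string as a fixed string, which the commands in the summary above do not do.

```shell
# Hypothetical helper: number of bytes grep writes for a fixed-string search.
output_bytes() {
    # -F: fixed string; -e: also works for strings starting with '-'.
    # tr strips the padding some wc implementations add around the count.
    grep -F -e "$1" "$2" | wc -c | tr -d ' '
}

# Tiny demonstration on a three-line file.
demo=$(mktemp)
printf 'aaa\nbbb\naab\n' > "$demo"
echo "matching string:     $(output_bytes aa "$demo") bytes written"
echo "non-matching string: $(output_bytes zz "$demo") bytes written"
```

Comparing these byte counts against the timings gives a rough idea of whether output volume is a plausible explanation; it does not prove it.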

Is this a matter of short search string vs long search string?

In all previous benchmarks, I looked for relatively short strings. Would the locale influence performance when the search string is longer?

In the following, the search string $s is abcdefghijklmnopqrstuvwxyz012345 repeated four times, for a total of 128 characters.

Command                   Mean [s]        Min…Max [s]
LC_ALL=C grep             2.104 ± 0.006   2.098…2.119
LC_ALL=en_US.UTF-8 grep   2.101 ± 0.010   2.091…2.126

Summary
  'LC_ALL=en_US.UTF-8 grep $s json10g' ran
    1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'

Setting LC_ALL=C did not help improve performance.
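For reference, the long search string can be built in the shell as follows. This is only a sketch of one way to construct it; the variable name chunk is made up for illustration.

```shell
# Build the 128-character search string from its 32-character building block.
chunk=abcdefghijklmnopqrstuvwxyz012345
s=$chunk$chunk$chunk$chunk

# Print the length as a sanity check.
printf '%s' "$s" | wc -c
```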

Is this a matter of ASCII encoding versus multibyte?

In the following, the search string $s is 任天堂株式会社/ソニー株式会社/キヤノン株式会社.

Command                   Mean [s]        Min…Max [s]
LC_ALL=C grep             1.988 ± 0.005   1.982…1.995
LC_ALL=en_US.UTF-8 grep   1.980 ± 0.009   1.968…1.999

Summary
  'LC_ALL=en_US.UTF-8 grep $s json10g' ran
    1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'

Setting LC_ALL=C did not help improve performance.

Reference material

Files used to generate inputs and run benchmarks: grepbench.tar.xz


  1. Or, alternatively, a stream from a logging service; either way, grep is CPU-bound, and we want to make the best use of our resources.

  2. As of March, 2019. This is excerpted from the GNU grep version 3.3 manual.

  3. I chose which versions of GNU grep to compile and test as follows: the earliest and latest in the version 2 and version 3 (current) series, plus version 2.25 to match what my system packages. These versions cover the 25-year time span from May, 1993 to December, 2018.

  4. Fun fact: the default grep on FreeBSD 12 appears to be GNU grep. On my system, this was version 2.5.1-FreeBSD. BSD grep was version 2.6.0-FreeBSD.

www.kurokatta.org

