Date: 2019-03-30 01:07 -0400
Mod.: 2021-07-16 09:53 -0400
It is a well-known "fact", in some circles, that running grep in the C locale is much faster than in UTF-8 locales, the latter being a common default on current client systems.
Indeed, just the other day, a colleague of mine was running something akin to:
grep '"event-type-id":4727' app.log
where app.log is a multi-gigabyte file of JSON lines[1], the goal being to quickly see whether a newly-added type of event is seen. Almost immediately, someone jumped in to recommend running grep in the C locale, instead of the default UTF-8 locale, for Guaranteed Extra Speediness™.
I wondered whether this was indeed good advice. (TL;DR: using the C locale does not help performance in general. Feel free to jump to the conclusions.)
This is my most common real-world use case at the moment: search for a small fragment of JSON within a file of JSON lines, both of them in practice ASCII.
In the following benchmark runs:
the file being searched, json10g, is made of 10GB of JSON lines, each about a kilobyte in length;
the search string is just "event-type-id":4727;
the search string does not match anywhere in the input;
grep is the native version of grep for the system used: GNU grep version 2.25 from Ubuntu 16.04 LTS;
the host has an Intel Kaby Lake processor (i7-7500U), simultaneous multithreading disabled, with 16GB of RAM;
the storage is SSD, which does not matter since the input fits in the buffer cache.
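For readers who want to try this setup at a smaller scale, here is a minimal sketch of an input generator. It is not the generator shipped in grepbench.tar.xz: the file name json-sample, the seq and payload fields, and the default line count are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: a hypothetical stand-in for the real input generator.
# The file name, field layout and default size are assumptions.
lines=${1:-1000}        # about 10 million lines would approach the 10GB input
out=json-sample

# Emit JSON lines of roughly one kilobyte each; ids stay within 1000..3999,
# so the search string "event-type-id":4727 never matches.
awk -v n="$lines" 'BEGIN {
    pad = sprintf("%900s", "")
    gsub(/ /, "x", pad)
    for (i = 1; i <= n; i++)
        printf "{\"seq\":%d,\"event-type-id\":%d,\"payload\":\"%s\"}\n", i, 1000 + i % 3000, pad
}' > "$out"

# Count matching lines under each locale; grep prints 0 and exits with
# status 1 when nothing matches, hence the "|| true".
for loc in C en_US.UTF-8; do
    printf '%s: ' "$loc"
    LC_ALL=$loc grep -c '"event-type-id":4727' "$out" || true
done
```

Passing a larger line count as the first argument scales the file up; the input should still fit in the buffer cache for the timings to be comparable with the ones in this article.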
On to the results:
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep | 3.156 ± 0.136 | 3.098…3.542 |
LC_ALL=en_US.UTF-8 grep | 3.106 ± 0.007 | 3.091…3.113 |
This is how hyperfine summarizes these results:
Summary
'LC_ALL=en_US.UTF-8 grep -f json-fixed.pat json10g' ran
1.02 ± 0.04 times faster than 'LC_ALL=C grep ...'
Section conclusion: setting LC_ALL=C did not help improve performance.
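A comparison like this one can be set up along the following lines. This is a sketch under assumptions, not the exact invocation used for the article: the pattern file name comes from the hyperfine summary, but its content, the warmup count and the export file name fixed.md are guesses. The options --warmup, --ignore-failure and --export-markdown are regular hyperfine options.

```shell
#!/bin/sh
# Sketch only: the article does not show its exact hyperfine invocation.
input=${1:-json10g}

# json-fixed.pat is named in the hyperfine summary; its content is assumed
# to be the fixed search string on a single line.
printf '%s\n' '"event-type-id":4727' > json-fixed.pat

if command -v hyperfine >/dev/null 2>&1 && [ -r "$input" ]; then
    # --ignore-failure: grep exits with status 1 when nothing matches.
    hyperfine --warmup 1 --ignore-failure --export-markdown fixed.md \
        "LC_ALL=C grep -f json-fixed.pat $input" \
        "LC_ALL=en_US.UTF-8 grep -f json-fixed.pat $input"
else
    echo "hyperfine or $input not available; nothing measured" >&2
fi
```

The warmup run is there so that both commands read the input from the buffer cache rather than from storage.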
Of course, things may get hairier when one uses more of grep's power: it accepts full regular expressions as search patterns, not only fixed strings. Let us put this to the test; the search string is now the non-fixed regular expression event-type-id":4[0-7].[0-9]:
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep | 3.835 ± 0.015 | 3.809…3.859 |
LC_ALL=en_US.UTF-8 grep | 3.845 ± 0.006 | 3.837…3.857 |
Summary
'LC_ALL=C grep -f json-simple-re.pat json10g' ran
1.00 ± 0.00 times faster than 'LC_ALL=en_US.UTF-8 grep ...'
Section conclusion: setting LC_ALL=C did not help improve performance.
Caveat: the regular expression used here is fixed-length and does not use very many of grep's features; perhaps more complex examples would fare differently.
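To make the behaviour of this regular expression concrete, the small sketch below runs it over three made-up lines under both locales. Only the pattern comes from the text; the sample lines are invented for illustration.

```shell
#!/bin/sh
# Three made-up sample lines; only the pattern comes from the article.
sample='{"event-type-id":4727}
{"event-type-id":4827}
{"event-type-id":40a1}'

for loc in C en_US.UTF-8; do
    # Lines 1 and 3 match: "." accepts any character, and 8 is outside [0-7],
    # so each iteration prints 2.
    printf '%s\n' "$sample" | LC_ALL=$loc grep -c 'event-type-id":4[0-7].[0-9]'
done
```

On ASCII input like this, the digit ranges select the same lines whichever locale is in effect, so the locale only has a chance to influence speed, not results.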
The performance section from the GNU grep documentation says the following[2]:
Generally speaking grep operates more efficiently in single-byte locales, since it can avoid the special processing needed for multi-byte characters. If your patterns will work just as well that way, setting LC_ALL to a single-byte locale can help performance considerably. Setting LC_ALL='C' can be particularly efficient, as grep is tuned for that locale.
Outside the C locale, case-insensitive search, and search for bracket expressions like [a-z] and [[=a=]b], can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements.
Wow! The blanket LC_ALL=C advice appears a bit misleading since, as far as I can tell, when the locale does not make a difference to the pattern, neither does it make a difference to performance. It is however interesting to note that this snippet points us directly to a pattern on which grep should not perform the same with varying locales:
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep | 12.491 ± 0.047 | 12.455…12.597 |
LC_ALL=en_US.UTF-8 grep | 180.482 ± 1.476 | 179.123…182.781 |
Summary
'LC_ALL=C grep -f json-complex-re.pat json10g' ran
14.45 ± 0.13 times faster than 'LC_ALL=en_US.UTF-8 grep ...'
Section conclusion: setting LC_ALL=C did help improve performance, tremendously.
In the use cases I tested, setting LC_ALL=C either:
did not make any performance difference at all, when the pattern was a fixed string, or a "simple" regular expression; or
made a notable performance improvement, when the regular expression contained an alphabetic character range.
Note that, in the second case, changing the locale also changes the results, so that one cannot just blindly set it to C. It is fine to use the C locale if, say, [a-f] is meant to be exactly [abcdef], but the range could include more characters in different locales:
$ echo 012@épaulées%DEF | grep -o '[a-z]*'
épaulées
$ echo 012@épaulées%DEF | LC_ALL=C grep -o '[a-z]*'
paul
es
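One way to guard against this before switching locales is to compare what grep extracts in the current locale and in C. The helper below is hypothetical (its name, same_matches_in_c, and the sample file are not from this article) and only vouches for the input it was given.

```shell
#!/bin/sh
# Hypothetical helper, not from the article: succeeds only when the pattern
# extracts the same matches from the file in the current locale and in C.
same_matches_in_c() {
    pattern=$1
    file=$2
    here=$(grep -o -e "$pattern" "$file" | cksum)
    in_c=$(LC_ALL=C grep -o -e "$pattern" "$file" | cksum)
    [ "$here" = "$in_c" ]
}

# Made-up ASCII sample on which [a-f] means the same thing in both locales.
printf '%s\n' 'deadbeef 0123 cafe' > hex-sample
if same_matches_in_c '[a-f][a-f]*' hex-sample; then
    echo "same matches: LC_ALL=C is safe for this pattern and input"
else
    echo "matches differ: keep the current locale"
fi
```

Running the check on a representative slice of the real input costs two extra passes over that slice, which is cheap compared with silently missing matches.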
Why would the apparently rarely-relevant LC_ALL=C advice be so prevalent?
Perhaps older versions[3] of GNU grep behaved differently from the one that comes with my current system, and they were indeed helped by running in the C locale, regardless of pattern?
All runs below use the same 10GB JSON input json10g as in the previous section, and the same two search strings (fixed, and bona-fide regexp "re").
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep-2.0 / fixed | 4.850 ± 0.152 | 4.701…5.037 |
LC_ALL=en_US.UTF-8 grep-2.0 / fixed | 4.848 ± 0.143 | 4.661…5.008 |
LC_ALL=C grep-2.25 / fixed | 3.114 ± 0.008 | 3.106…3.131 |
LC_ALL=en_US.UTF-8 grep-2.25 / fixed | 3.115 ± 0.008 | 3.097…3.123 |
LC_ALL=C grep-2.28 / fixed | 3.132 ± 0.010 | 3.106…3.141 |
LC_ALL=en_US.UTF-8 grep-2.28 / fixed | 3.136 ± 0.011 | 3.123…3.157 |
LC_ALL=C grep-3.0 / fixed | 3.132 ± 0.008 | 3.119…3.145 |
LC_ALL=en_US.UTF-8 grep-3.0 / fixed | 3.134 ± 0.009 | 3.119…3.153 |
LC_ALL=C grep-3.3 / fixed | 14.628 ± 0.475 | 14.465…15.978 |
LC_ALL=en_US.UTF-8 grep-3.3 / fixed | 14.476 ± 0.008 | 14.462…14.490 |
LC_ALL=C grep-2.0 / simple-re | 5.422 ± 0.133 | 5.173…5.530 |
LC_ALL=en_US.UTF-8 grep-2.0 / simple-re | 5.316 ± 0.149 | 5.159…5.476 |
LC_ALL=C grep-2.25 / simple-re | 3.834 ± 0.015 | 3.818…3.858 |
LC_ALL=en_US.UTF-8 grep-2.25 / simple-re | 3.838 ± 0.010 | 3.816…3.849 |
LC_ALL=C grep-2.28 / simple-re | 3.890 ± 0.008 | 3.881…3.904 |
LC_ALL=en_US.UTF-8 grep-2.28 / simple-re | 3.903 ± 0.007 | 3.887…3.912 |
LC_ALL=C grep-3.0 / simple-re | 3.894 ± 0.014 | 3.876…3.918 |
LC_ALL=en_US.UTF-8 grep-3.0 / simple-re | 3.918 ± 0.019 | 3.900…3.962 |
LC_ALL=C grep-3.3 / simple-re | 14.473 ± 0.012 | 14.457…14.499 |
LC_ALL=en_US.UTF-8 grep-3.3 / simple-re | 14.479 ± 0.014 | 14.462…14.499 |
Looking at pairs of runs of each version of GNU grep with the locale changing from C to UTF-8, we can see that setting LC_ALL=C did not help improve performance.
We can also see that GNU grep version 3.3 was much slower than all other versions tested; this could be a bona fide regression, a fix for a long-standing correctness bug with performance implications, or something else; I did not investigate the matter.
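The version-by-locale matrix above lends itself to a loop. The sketch below builds the list of commands for the fixed-string case; the binary names follow the table labels (grep-2.0 and so on), while their location in the current directory, the pattern file name and the commands.txt record are assumptions.

```shell
#!/bin/sh
# Sketch only: binary names follow the table labels; the ./ location and
# the pattern file name are assumptions.
input=${1:-json10g}

# Collect one benchmark command per (version, locale) pair.
set --
for v in 2.0 2.25 2.28 3.0 3.3; do
    for loc in C en_US.UTF-8; do
        set -- "$@" "LC_ALL=$loc ./grep-$v -f json-fixed.pat $input"
    done
done

printf '%s\n' "$@" > commands.txt     # keep a record of what is compared

# Only measure when hyperfine and the compiled binaries are present.
if command -v hyperfine >/dev/null 2>&1 && [ -x ./grep-3.3 ]; then
    hyperfine --ignore-failure "$@"
fi
```

Keeping each locale pair adjacent in the list makes the resulting table easy to read two rows at a time.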
I ran a FreeBSD 12.0-RELEASE virtual machine on the same hardware as previously[4], same search strings and input:
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C bsdgrep / fixed | 8.076 ± 0.040 | 8.034…8.177 |
LC_ALL=en_US.UTF-8 bsdgrep / fixed | 8.049 ± 0.024 | 8.004…8.088 |
LC_ALL=C bsdgrep / re | 8.258 ± 0.041 | 8.216…8.362 |
LC_ALL=en_US.UTF-8 bsdgrep / re | 8.254 ± 0.042 | 8.197…8.315 |
Again, setting LC_ALL=C did not help improve performance here.
Side note: on this system, GNU grep is much faster than BSD grep. A quick win is therefore to favor the GNU version of grep, on systems where it is not already the default.
Does the locale make a difference when the search string does match? No:
Summary
'LC_ALL=C grep "event-type-id":4727472747 json1g' ran
1.01 ± 0.02 times faster than
'LC_ALL=en_US.UTF-8 grep "event-type-id":4727472747 json1g'
1.24 ± 0.01 times faster than
'LC_ALL=en_US.UTF-8 grep "boundingPoly":{"vertices": json1g'
1.26 ± 0.02 times faster than
'LC_ALL=C grep "boundingPoly":{"vertices": json1g'
While searching for a matching string ran more slowly than searching for a string that does not match, presumably because of the increased amount of output to be written, the locale setting did not make a difference.
In all previous benchmarks, I looked for relatively short strings. Would the locale influence performance when the search string is longer?
In the following, the search string $s is abcdefghijklmnopqrstuvwxyz012345 repeated four times, for a total of 128 characters.
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep | 2.104 ± 0.006 | 2.098…2.119 |
LC_ALL=en_US.UTF-8 grep | 2.101 ± 0.010 | 2.091…2.126 |
Summary
'LC_ALL=en_US.UTF-8 grep $s json10g' ran
1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'
Setting LC_ALL=C did not help improve performance.
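For reference, the 128-character search string of this section can be built as follows; the variable name unit is an assumption, $s is the name used in the text.

```shell
#!/bin/sh
unit=abcdefghijklmnopqrstuvwxyz012345   # 26 letters + 6 digits = 32 characters
s=$unit$unit$unit$unit                  # four repetitions
echo "${#s}"                            # prints 128
```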
In the following, the search string $s is 任天堂株式会社/ソニー株 式会社/キヤノン株式会社.
Command | Mean [s] | Min…Max [s] |
---|---|---|
LC_ALL=C grep | 1.988 ± 0.005 | 1.982…1.995 |
LC_ALL=en_US.UTF-8 grep | 1.980 ± 0.009 | 1.968…1.999 |
Summary
'LC_ALL=en_US.UTF-8 grep $s json10g' ran
1.00 ± 0.01 times faster than 'LC_ALL=C grep ...'
Setting LC_ALL=C did not help improve performance.
Files used to generate inputs and run benchmarks: grepbench.tar.xz
[1] Or, alternatively, a stream from a logging service; either way, grep is CPU-bound, and we want to make the best use of our resources.
[2] As of March, 2019. This is excerpted from the GNU grep version 3.3 manual.
[3] I chose which versions of GNU grep to compile and test as follows: the earliest and latest in the version 2 and version 3 (current) series, plus version 2.25 to match what my system packages. These versions cover the 25-year time span from May, 1993 to December, 2018.
[4] Fun fact: the default grep on FreeBSD 12 appears to be GNU grep. On my system, this was version 2.5.1-FreeBSD. BSD grep was version 2.6.0-FreeBSD.