While awk is indeed under-appreciated, there are many instances where using grep...

voltagex_ · on July 4, 2017

I have found ack and ag to be much faster than grep (I should actually time this).

petdance · on July 5, 2017

Creator of ack here.

The three biggest reasons to use grepalikes is not necessarily the time to execute the search, but

1) The reduced amount of typing. It's much faster to type "ack foo --ruby" than it is to type "find . -name '* .rb' | xargs grep foo"

2) Better filetype matching, because ack and ag check things beyond just file extensions. ack's --ruby flag checks for *.rb and Rakefile and "ruby" appearing in the shebang line, for example.

3) More features that grep doesn't include, like better highlighting and grouping, or full Perl regular expressions (in ack) and a ton of other features.

Here's a command line to find all the header files in your C source. It takes advantage of the ability to use Perl's regular expressions to specify output:

ack --cc '#include <(.+?)>' -H --output='$1' | sort -u

Here's an article I wrote a while ago on comparing the features of ack and ag. https://blog.newrelic.com/2015/01/28/grep-ack-ag/

Even though I created ack, I don't care which tool you use, so long as it suits your needs and makes you happy, but if all you're using your grep/ack/ag for is "grep -R function_name ." then you're only scratching the surface of what the tools can do for you.

thesmallestcat · on July 4, 2017

They are perceived as faster because they automatically skip binary and VCS-ignored files. grep is faster.

lathiat · on July 4, 2017

They are faster for practical usage, though I'm fairly sure ag also executes the search in parallel no?

I also seem to recall the ripgrep author recently trying to optimise towards grep in another HN post.

Love ag ever since I discovered it, though you have to be careful every so often it doesn't look in a filetype that mattered.

petdance · on July 5, 2017

> they automatically skip binary and VCS-ignored files

If it does less work, and takes less time, then isn't that faster?

TheDong · on July 4, 2017

See http://blog.burntsushi.net/ripgrep/ for a quite nice comparison which is counter to your experience; ack/ag/pt are all slower than either grep or ripgrep.

burntsushi · on July 4, 2017

"ack/ag/pt are all slower than either grep or ripgrep"

You need to be a touch careful with blanket statements, as this conclusion isn't quite what the data in my blog says. What it says is that ag is generally slower than GNU grep on large files, primarily because GNU grep's core search code is quite a bit more optimized than ag's. However, ag can outclass GNU grep when searching across entire directories primarily by culling the set of files that is searched, and also by parallelizing search if you want to exclude running GNU grep in a simple `find ... | xargs -P`-like command. (This is thesmallestcat's point.) This is why the first set of benchmarks in my blog don't actually benchmark GNU grep, because there is an impedance mismatch. Instead, ag/pt are benchmarked against `git grep`, where the comparison is quite a bit closer. (You'll want to check out the benchmarks on my local machine, which avoid penalizing ag for its use of memory maps.[1]) The second set of benchmarks basically swap out `git grep` for GNU grep and compare the tools on very large files.

ack isn't actually included in the benchmarks because it was incredibly slow when I tried it, although there may be something funny going on.[2] To be honest, it isn't terribly surprising to me, since every other tool is compiled down to native code.

The pt/sift story is interesting, and my current hypothesis is that the tools have received a misleading reputation for being fast. In particular, when using the tools with simple literals, it's likely that search will benefit from an AVX2 optimization when searching for the first byte. As soon as you start feeding more complex patterns (including case insensitivity), these tools slow down quite a bit because 1) the underlying regex engine needs more optimization work and 2) the literal analysis is lacking, which means the tools rely even more on the regex engine than other tools do.

The short summary of ripgrep is that it should outclass all the tools on all the benchmarks in my blog. I should update them at some point as ripgrep has gotten a bit faster since that blog post. Primarily, a parallel recursive directory iterator, which is present in sift, pt and ucg as well. Secondarily, its line counting has been sped up with more specialized SIMD routines. (I should hedge a bit here. It is possible to build patterns that a FSM-like engine will do very poorly on, but where a backtracking engine may not degrade as much. The prototypical example is large repetitions, e.g., `(foo){100}`. Of course, the reverse is true as well, since FSMs don't have exponential worst case time. Also, my benchmarks aren't quite exhaustive. For example, they don't benchmark GNU grep/ripgrep's `-f` flag for searching many regexes.)

[1] - https://github.com/BurntSushi/ripgrep/blob/master/benchsuite...

[2] - https://github.com/petdance/ack3/issues/42

yorwba · on July 4, 2017

Regarding your '(foo){100}' example, couldn't you expand this into a literal search for 100 repetitions of 'foo'? I guess this could interact poorly with the regex engine, and you'd be expected to cover more complex instances like '(foo){50,100}', but I think it might be worth the effort in some cases.

burntsushi · on July 4, 2017

Yes, my example was bad. Consider `\pL{100}` instead. :-)

With respect to expanding repetitions on literals, that already works today. e.g.,

    $ rg '(foo){100}' /dev/null --debug
    ...
    DEBUG:grep::literals: required literal found: "foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo"

The truly eagle-eyed would notice that there are actually only 83 instances of `foo` in this string. In fact, literal detection is actually quite a tricky task on arbitrary regular expressions, and one must be careful to bound the number of and size of literals you produce. For example, the regex `\w+` matches an ~infinite number of literals, but `\w` is really just a normal character class, and some character classes are good to include in your literal analysis.

The 83 comes from the fact that 83 times 3 = 249, which buts up against a hard-coded limit in the regex engine for literal detection: https://github.com/rust-lang/regex/blob/d894c631cb6c9a062c13...

In this case, the literal detector knows that the literal is a prefix. That means a prefix match must still be confirmed by entering the regex engine. If you pass a smaller literal, e.g., `(foo){4}`, then you'll see slightly different output:

    DEBUG:grep::literals: literal prefixes detected: Literals { lits: [Complete(foofoofoofoo)], limit_size: 250, limit_class: 10 }

In this case, the regex compiler sees that a literal match corresponds to an overall match of the regex, and can therefore stay out of the regex engine entirely. The "completion" analysis doesn't stop at simple regexes either, for example, `(foo){2}|(ba[rz]){2}` yields:

    DEBUG:grep::literals: literal prefixes detected: Literals { lits: [Complete(foofoo), Complete(barbar), Complete(bazbar), Complete(barbaz), Complete(bazbaz)], limit_size: 250, limit_class: 10 }

This is important, because this particular pattern will probably wind up using a special SIMD multi-pattern matcher. Failing that, it will use the "advanced" version of Aho-Corasick, which is a DFA that is computed ahead-of-time without extra failure transitions (as opposed to the more general lazy DFA used by the regex engine).

Literal detection is a big part of the secret sauce of ripgrep. Pretty much every single regex you feed to a tool like ripgrep will contain a literal somewhere, and this will greatly increase search speed when compared to tools that don't do literal extraction.

Of course, literal extraction has downsides. If your literal extractor picks out a prefix that is a very very common string in your haystack, then it would probably be better to just use the regex engine instead of ping-ponging back-and-forth between the prefix searcher and the regex engine. Of course, there are probably ways to detect and stop this ping-ponging, but I haven't invested much effort in that yet. Alas, the performance improvement in the common case (literals make things faster) seems to wind up being more beneficial for the general use case.

There is more in my blog post: http://blog.burntsushi.net/ripgrep/#literal-optimizations