More precisely, it is arbitrarily deep nested clauses that cannot be parsed, on account of true REs not being able to either use recursion or keep count of how deep they are.
Can regex engines actually process arbitrarily deep nestings themselves though?
It seems to me it would start getting computationally expensive very quickly to search for those kinds of regexes, so is 'arbitrarily deep' a practical concern for todays hardware?
Yes, regex engines that are designed to do so run in linear time on the size of the regex (as well as linear time with respect to the length of the input).
It can process arbitrary deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow it to do it. Can you point to syntax that allows it?
In general regex's that run in linear time are based on DFA's (as opposed to NFA's - which is what most use).
A DFA is part of the Chomsky hierarchy of languages that has DFA's at the bottom (aka regex's, least powerful - can't handle recursive structures), a DFA coupled with an infinite stack for memory (aka LR parsers, can't handle context sensitive grammars, also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka Turing Machine, can do any sort of computation).
In each of these cases the DFA (regex) is essentially a computer program and all they are doing is allowing it to store stuff on various kinds of memory (none, stack, tape with arbitrary movements). An interesting corollary of this is any computer program can be expressed as a DFA.
The GP is presumably citing it as an example of a linear time regex, and not as something that can parse arbitrarily deep nestings. Note that "linear time regex engines" is a characterization that holds the size of the regex constant, such that "linear time" is with respect to the input only. The actual worst case time complexity of such regex engines is O(m * n), where m ~ len(regex) and n ~ len(input). So even with a linear time regex engine, if you bloat the size of the regex, you'll make searching slower. But, it will still be done linearly with respect to the size of the input.
> In general regex's that run in linear time are based on DFA's (as opposed to NFA's - which is what most use).
No. Please stop spreading the misuse of these terms. DFAs and NFAs have very precise meanings. It's one thing to call things regexes that aren't actually regular, but DFAs and NFAs should retain their precise theoretical meaning. A linear time regex engine such as Rust's `regex` crate uses both DFAs and NFAs in its implementation.
The Perl and PCRE folks (along with Jeffrey Friedl) have bastardized this terminology. PCRE for example claims to provide a "DFA" regex engine in its API, but it's actually an NFA. The main implementation inside PCRE (and Perl) is backtracking oriented, and carries with it much more power than an NFA. If it didn't, then the NFA could be executed in linear time, because NFAs and DFAs have equivalent power.
> An interesting corollary of this is any computer program can be expressed as a DFA.
No they can't. You're confusing terminology quite a bit here.
There is an importance to theoretical considerations that goes beyond their immediate practical applications. Ultimately, it comes down to how well one understands what is going on.
Having said that, I think there are practical applications. Suppose you were unaware of this result, and thought a true RE could validate your input? A more knowledgeable attacker might be able to craft an input that exploits the inherent limitations of your validator.
Also in practice, attempting to write a true RE to parse a language that allows restricted recursion rapidly becomes complicated. It is the wrong tool for the job.
Just considering the language of balanced parentheses, the only memory needed is a single integer to track depth. If we see a "(", increment. If we see a ")", decrement. If we never go negative and end up with zero, that's a balanced parentheses expression. (To be clear, the parser I've described is not a regex parser.)
Adding the rest of the regex machinery makes this a bit more complex, but in general, if the computer can store the string, it can check whether it's regular.
As you say, this is not a regex parser, but some readers may not know the theorem that the regular languages are exactly those which can be parsed by a Turing machine with constant memory, and parenthesis count can have log n bits.
In fact, any Turing machine using o(log log n) space recognizes a regular language (so there is an equivalent machine using O(1) space).
As typically defined academically, yes. In the real world things might be more complex, but I think it would hold true for many implementations.
Academically a regex is a string of of:
- character literals
- "epsilon" characters (meaning empty string)
- "+" characters (which means either what we see before the + or what we see after the +, in real-world regex these are usually represented as | instead of + as well as used implicitly in constructs like [a-zA-Z])
- "*" characters (same meaning as in real-world regex)
- And parentheses to allow controlling the order of operations
In other words, "if we build a state machine which is different from a regex, it will perform differently than a regex would." That's trivially true, but is it interesting beyond restating "there is no Silver Bullet"?
Technically we aren't discussing the regex itself here but the syntax for defining a regex.
You could use a different syntax for defining the same regex, and that syntax could have different properties. For instance you could use parantheses like racket where "(...)" and "[...]" have the same meaning, but "(...]" is not well formed syntax (if I remember my racket correctly). Using such a syntax to define regex a counter would no longer suffice to decide if a string is a regex.
You could have a stack of (paren-type, integer) pairs. If the paren has a different type than the top pair on the stack (or the stack is empty), push a new pair with the integer set to one. If the integer goes to zero, pop the pair. At the end of the expression, you have to have an empty stack.
When I check brackets by hand or manually I do actually get my hands out and when I say open I uncurl a finger, and when I say close, I furl a finger. "Open, open, open, close, open, close, close, close". Provided I have no fingers on display at the end, then the brackets are balanced. They might not be in the right place but at least they are balanced. If I find myself starting with close then I close the lid and go and do something else.
I'm sure some sort of rather cheap algorithm falls out of the above. It wont guarantee correctness in what the balanced brackets actually contain but they will at least be balanced.
https://blogs.msdn.microsoft.com/jaredpar/2008/10/15/regular...