Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

More precisely, it is arbitrarily deep nested clauses that cannot be parsed, on account of true REs not being able to either use recursion or keep count of how deep they are.

https://blogs.msdn.microsoft.com/jaredpar/2008/10/15/regular...



Can regex engines actually process arbitrarily deep nestings themselves though?

It seems to me it would start getting computationally expensive very quickly to search for those kinds of regexes, so is 'arbitrarily deep' a practical concern for todays hardware?


Yes, regex engines that are designed to do so run in linear time on the size of the regex (as well as linear time with respect to the length of the input).

One such engine is rust's https://github.com/rust-lang/regex


> One such engine is rust's https://github.com/rust-lang/regex

It can process arbitrary deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow it to do it. Can you point to syntax that allows it?

In general regex's that run in linear time are based on DFA's (as opposed to NFA's - which is what most use).

A DFA is part of the Chomsky hierarchy of languages that has DFA's at the bottom (aka regex's, least powerful - can't handle recursive structures), a DFA coupled with an infinite stack for memory (aka LR parsers, can't handle context sensitive grammars, also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka Turing Machine, can do any sort of computation).

In each of these cases the DFA (regex) is essentially a computer program and all they are doing is allowing it to store stuff on various kinds of memory (none, stack, tape with arbitrary movements). An interesting corollary of this is any computer program can be expressed as a DFA.


> It can process arbitrary deep nestings?

The GP is presumably citing it as an example of a linear time regex, and not as something that can parse arbitrarily deep nestings. Note that "linear time regex engines" is a characterization that holds the size of the regex constant, such that "linear time" is with respect to the input only. The actual worst case time complexity of such regex engines is O(m * n), where m ~ len(regex) and n ~ len(input). So even with a linear time regex engine, if you bloat the size of the regex, you'll make searching slower. But, it will still be done linearly with respect to the size of the input.

> In general regex's that run in linear time are based on DFA's (as opposed to NFA's - which is what most use).

No. Please stop spreading the misuse of these terms. DFAs and NFAs have very precise meanings. It's one thing to call things regexes that aren't actually regular, but DFAs and NFAs should retain their precise theoretical meaning. A linear time regex engine such as Rust's `regex` crate uses both DFAs and NFAs in its implementation.

The Perl and PCRE folks (along with Jeffrey Friedl) have bastardized this terminology. PCRE for example claims to provide a "DFA" regex engine in its API, but it's actually an NFA. The main implementation inside PCRE (and Perl) is backtracking oriented, and carries with it much more power than an NFA. If it didn't, then the NFA could be executed in linear time, because NFAs and DFAs have equivalent power.

> An interesting corollary of this is any computer program can be expressed as a DFA.

No they can't. You're confusing terminology quite a bit here.


Does "linear wrt A and linear wrt B" mean O(A*B) or O(A+B)?


It means O(AB), but is carefully worded to not say (or suggest) it's not possible to do O(A+B) which is a subset of O(AB) ;)

I'm not sure if it's possible to match regex with that time complexity, it looks like the implementation I referenced is O(AB). https://docs.rs/regex/1.3.1/regex/#untrusted-input.

(star symbol removed because hn decided to turn them into a block of italics)


There is an importance to theoretical considerations that goes beyond their immediate practical applications. Ultimately, it comes down to how well one understands what is going on.

Having said that, I think there are practical applications. Suppose you were unaware of this result, and thought a true RE could validate your input? A more knowledgeable attacker might be able to craft an input that exploits the inherent limitations of your validator.

Also in practice, attempting to write a true RE to parse a language that allows restricted recursion rapidly becomes complicated. It is the wrong tool for the job.


Just considering the language of balanced parentheses, the only memory needed is a single integer to track depth. If we see a "(", increment. If we see a ")", decrement. If we never go negative and end up with zero, that's a balanced parentheses expression. (To be clear, the parser I've described is not a regex parser.)

Adding the rest of the regex machinery makes this a bit more complex, but in general, if the computer can store the string, it can check whether it's regular.


As you say, this is not a regex parser, but some readers may not know the theorem that the regular languages are exactly those which can be parsed by a Turing machine with constant memory, and parenthesis count can have log n bits.

In fact, any Turing machine using o(log log n) space recognizes a regular language (so there is an equivalent machine using O(1) space).


This only works if you have a single type of parenthesis. Is that applicable to regex expressions in general that only ()’s can nest?


As typically defined academically, yes. In the real world things might be more complex, but I think it would hold true for many implementations.

Academically a regex is a string of of:

- character literals

- "epsilon" characters (meaning empty string)

- "+" characters (which means either what we see before the + or what we see after the +, in real-world regex these are usually represented as | instead of + as well as used implicitly in constructs like [a-zA-Z])

- "*" characters (same meaning as in real-world regex)

- And parentheses to allow controlling the order of operations


In other words, "if we build a state machine which is different from a regex, it will perform differently than a regex would." That's trivially true, but is it interesting beyond restating "there is no Silver Bullet"?


Technically we aren't discussing the regex itself here but the syntax for defining a regex.

You could use a different syntax for defining the same regex, and that syntax could have different properties. For instance you could use parantheses like racket where "(...)" and "[...]" have the same meaning, but "(...]" is not well formed syntax (if I remember my racket correctly). Using such a syntax to define regex a counter would no longer suffice to decide if a string is a regex.


You could have a stack of (paren-type, integer) pairs. If the paren has a different type than the top pair on the stack (or the stack is empty), push a new pair with the integer set to one. If the integer goes to zero, pop the pair. At the end of the expression, you have to have an empty stack.


)))((()(()


When I check brackets by hand or manually I do actually get my hands out and when I say open I uncurl a finger, and when I say close, I furl a finger. "Open, open, open, close, open, close, close, close". Provided I have no fingers on display at the end, then the brackets are balanced. They might not be in the right place but at least they are balanced. If I find myself starting with close then I close the lid and go and do something else.

I'm sure some sort of rather cheap algorithm falls out of the above. It wont guarantee correctness in what the balanced brackets actually contain but they will at least be balanced.


Good edge case. I’d say to immediately return an ‘unbalanced’ result once the count goes negative.


((((()


doesn't end up with zero at the end, rejected?


I was just trying to suggest a possible edge case, a failing test.


"never go negative"




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: