I think the general problem with any automated solution is that there is so much room to game it. For instance, I could selectively replace a few visible ASCII characters with non-ASCII look-alikes. Then the investigators just need to see which characters are missing. Even with the OCR option you could selectively add typos.
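To make the look-alike trick concrete, here's a minimal sketch (not from the thread; the sample string and function name are my own) that flags any character outside printable ASCII along with its Unicode name, which is roughly what an investigator's first pass might look like:

```python
import unicodedata

# A hedged sketch: flag characters outside ASCII and report their Unicode
# names, so a reviewer can spot look-alike substitutions such as a
# Cyrillic "А" (U+0410) masquerading as a Latin "A".
def flag_non_ascii(text):
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

print(flag_non_ascii("\u0410pple"))  # Cyrillic A smuggled into "Apple"
```

Of course, this is exactly the check that selective substitution is designed to defeat: the attacker only needs to know which characters such a scan reports.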
Using RegEx is generally frowned upon; RegEx is a bad language to write anything in apart from prototypes. Furthermore, this will not work for most text, as even US English text contains special characters. Think quotations from other languages, names and imported words.
I think people dislike [Regex] for some of the same reasons the Principle of Least Power makes sense.
If you can do the same manipulation in three lines of code it's more likely to be correct and stay correct. And when you look at it again in six months you won't have to stare at it. All those little time sucks add up as the code grows.
Firstly, for relatively simple regex expressions, any competent developer should be able to grok them very quickly - at least as quickly as the equivalent C#/Java/whatever code.
Secondly, it may be that regex is the most performant solution, and sometimes that matters quite a bit.
Honestly, I just don't get why some people are intimidated by regex.
Regex is seldom the most performant code. I mean, the engines themselves are fantastic pieces of engineering, but in tight loops I've found I can get - sometimes significant - performance improvements by replacing regex matches or substitutions with purpose-written string manipulation. Obviously the results depend massively on several big variables:
1/ the regex engine
2/ host language
3/ problem you're trying to solve
But I've found generally I was better off not using regex for performance critical code.
HOWEVER (!!!) where regex consistently wins is development time. Not just writing the code, but testing (it's trivially easy to test regex) and updating the pattern matching (vs. updating the equivalent character matching in an imperative language).
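The trade-off above can be sketched in a few lines. This is a hedged illustration (the homoglyph table is illustrative, not a real confusables mapping): the regex version is quick to write and test, while the purpose-written version (here `str.translate`) is the kind of replacement that tends to win in tight loops:

```python
import re

# Illustrative Cyrillic -> Latin look-alike table (not exhaustive).
HOMOGLYPHS = {"\u0410": "A", "\u0415": "E", "\u041e": "O"}

# Regex style: one pattern, trivially easy to test and to extend.
PATTERN = re.compile("|".join(map(re.escape, HOMOGLYPHS)))

def fix_regex(text):
    return PATTERN.sub(lambda m: HOMOGLYPHS[m.group(0)], text)

# Purpose-written style: a translation table built once, applied per call.
TABLE = str.maketrans(HOMOGLYPHS)

def fix_translate(text):
    return text.translate(TABLE)
```

Both produce identical output; which one is faster on your workload is exactly the kind of thing that depends on the engine, the host language, and the problem, as the list above says.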
Yeah regex can get ugly quickly, but then so can any language if misused.
As the article you link argues, use of regular expressions when they are inappropriate is bad. This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.
(I'm not sure what you mean by DoS attacks - are you referring to the exponential case of backtracking? If so, don't use a regex engine with that problem, and don't use lookbehind/lookahead assertions, which aren't needed to solve this problem.)
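For what it's worth, the kind of pattern this problem needs is a plain character class, which is a hedged sketch of why the backtracking worry doesn't apply here: with no nested quantifiers and no assertions, even a backtracking engine processes it in linear time.

```python
import re

# A single character class: matches any byte outside the ASCII range.
# No nested quantifiers, no lookaround - nothing to backtrack over.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def replace_non_ascii(text, repl="?"):
    # Illustrative placeholder replacement; a real fix would map each
    # look-alike to its intended ASCII character.
    return NON_ASCII.sub(repl, text)

print(replace_non_ascii("caf\u00e9"))
```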
> This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.
No, it is not a good solution to the problem; you're ignoring my earlier comments. English or Latin-script text is not composed solely of the ASCII character set; it contains characters outside this set (quotations from other languages, names, imported words, for example).
Honestly, I do get your point about inappropriate use of regex, but this kind of simple text manipulation is well suited to regex. The biggest argument against using regex for this kind of problem is performance versus writing the same code programmatically in the host language (assuming you're using a fast AOT-compiled language). However, even that is a non-issue given the small quantities of text you're decoding.
Also, I'd bet the regex in this instance would actually work out more readable, because the transformations are basic: you're localising the text manipulation to simple rules rather than to multiple lines of byte-array reading, which could also mean having to manually build in your own rudimentary Unicode support.
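The "simple local rules" style might look something like this hedged sketch (the mappings are illustrative): each transformation is one line, sitting next to a comment saying what it does, with the Unicode handling left entirely to the regex engine.

```python
import re

# Each rule is a (pattern, replacement) pair - readable in isolation,
# easy to add to or delete from. Mappings here are illustrative.
RULES = [
    (re.compile("[\u2018\u2019]"), "'"),   # curly single quotes -> apostrophe
    (re.compile("[\u201C\u201D]"), '"'),   # curly double quotes -> straight
    (re.compile("\u2026"), "..."),         # ellipsis -> three dots
]

def normalize(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Compare that with hand-decoding a byte stream: you'd be re-implementing UTF-8 handling before you even got to the substitution logic.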