I think the general problem with any automated solution is that there is so much room to game it. For instance, I could selectively replace a few visible ASCII characters with non-ASCII look-alikes. Then the investigators just need to see which characters are missing. Even with the OCR option you could selectively add typos.
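To make the look-alike trick concrete, here's a minimal sketch (not from the thread; the sample string and function name are my own) that flags any character outside printable ASCII along with its Unicode name, which is roughly what an investigator's first pass might look like:

```python
import unicodedata

# A hedged sketch: flag characters outside ASCII and report their Unicode
# names, so a reviewer can spot look-alike substitutions such as a
# Cyrillic "А" (U+0410) masquerading as a Latin "A".
def flag_non_ascii(text):
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

print(flag_non_ascii("\u0410pple"))  # Cyrillic A smuggled into "Apple"
```

Of course, this is exactly the check that selective substitution is designed to defeat: the attacker only needs to know which characters such a scan reports.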
Using RegEx is generally frowned upon; RegEx is a bad language to write anything in apart from prototypes. Furthermore, this will not work for most text, as even US English text contains special characters. Think quotations from other languages, names and imported words.
I think people dislike [Regex] for some of the same reasons the Principle of Least Power makes sense.
If you can do the same manipulation in three lines of code it's more likely to be correct and stay correct. And when you look at it again in six months you won't have to stare at it. All those little time sucks add up as the code grows.
Firstly, for relatively simple regex expressions, any competent developer should be able to grok them very quickly - at least as quickly as the equivalent C#/Java/whatever code.
Secondly, it may be that regex is the most performant solution, and sometimes that matters quite a bit.
Honestly, I just don't get why some people are intimidated by regex.
Regex is seldom the most performant code. I mean, the engines themselves are fantastic pieces of engineering, but in tight loops I've found I can get - sometimes significant - performance improvements by replacing regex matches or substitutions with purpose-written string manipulation. Obviously the results depend massively on several big variables:
1/ the regex engine
2/ host language
3/ problem you're trying to solve
But I've found generally I was better off not using regex for performance critical code.
HOWEVER (!!!) where regex consistently wins is development time. Not just writing the code, but testing (it's trivially easy to test regex) and updating the pattern matching (vs. updating the equivalent character matching in an imperative language).
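The trade-off above can be sketched in a few lines. This is a hedged illustration (the homoglyph table is illustrative, not a real confusables mapping): the regex version is quick to write and test, while the purpose-written version (here `str.translate`) is the kind of replacement that tends to win in tight loops:

```python
import re

# Illustrative Cyrillic -> Latin look-alike table (not exhaustive).
HOMOGLYPHS = {"\u0410": "A", "\u0415": "E", "\u041e": "O"}

# Regex style: one pattern, trivially easy to test and to extend.
PATTERN = re.compile("|".join(map(re.escape, HOMOGLYPHS)))

def fix_regex(text):
    return PATTERN.sub(lambda m: HOMOGLYPHS[m.group(0)], text)

# Purpose-written style: a translation table built once, applied per call.
TABLE = str.maketrans(HOMOGLYPHS)

def fix_translate(text):
    return text.translate(TABLE)
```

Both produce identical output; which one is faster on your workload is exactly the kind of thing that depends on the engine, the host language, and the problem, as the list above says.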
Yeah regex can get ugly quickly, but then so can any language if misused.
As the article you link argues, use of regular expressions when they are inappropriate is bad. This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.
(I'm not sure what you mean by DoS attacks - are you referring to the exponential case of backtracking? If so, don't use a regex engine with that problem, and don't use lookbehind/lookahead assertions, which aren't needed to solve this problem.)
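For what it's worth, the kind of pattern this problem needs is a plain character class, which is a hedged sketch of why the backtracking worry doesn't apply here: with no nested quantifiers and no assertions, even a backtracking engine processes it in linear time.

```python
import re

# A single character class: matches any byte outside the ASCII range.
# No nested quantifiers, no lookaround - nothing to backtrack over.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def replace_non_ascii(text, repl="?"):
    # Illustrative placeholder replacement; a real fix would map each
    # look-alike to its intended ASCII character.
    return NON_ASCII.sub(repl, text)

print(replace_non_ascii("caf\u00e9"))
```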
> This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.
No, it is not a good solution to the problem; you're ignoring my earlier comments. English or Latin-script text is not composed solely of the ASCII character set; it contains characters outside this set (quotations from other languages, names, imported words, for example).
Honestly, I do get your point about inappropriate use of regex, but this kind of simple text manipulation is well suited to regex. The biggest argument against using regex for this kind of problem is performance versus writing the same code programmatically in the host language (assuming you're using a fast AOT-compiled language). However, even that is a non-issue given the small quantities of text you're decoding.
Also, I'd bet the regex in this instance would actually work out more readable, because the transformations are basic: you're localising the text manipulation to simple rules rather than to multiple lines of byte-array reading, which could also mean having to manually build in your own rudimentary Unicode support.
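The "simple local rules" style might look something like this hedged sketch (the mappings are illustrative): each transformation is one line, sitting next to a comment saying what it does, with the Unicode handling left entirely to the regex engine.

```python
import re

# Each rule is a (pattern, replacement) pair - readable in isolation,
# easy to add to or delete from. Mappings here are illustrative.
RULES = [
    (re.compile("[\u2018\u2019]"), "'"),   # curly single quotes -> apostrophe
    (re.compile("[\u201C\u201D]"), '"'),   # curly double quotes -> straight
    (re.compile("\u2026"), "..."),         # ellipsis -> three dots
]

def normalize(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Compare that with hand-decoding a byte stream: you'd be re-implementing UTF-8 handling before you even got to the substitution logic.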