Pretty hard. There's no true way of knowing when you're right. You can make educated guesses based on statistical likelihood of certain patterns, but nothing stops people from constructing text that happens to look more "normal" when interpreted as another character set.
If I see the bytes 0xE5 0xD1 0xCD 0xC8 0xC7, then it is a decent bet that it's intended to be مرحبا in DOS 708. But there's no definitive reason why it couldn't be σ╤═╚╟ (DOS 437) or åÑÍÈÇ (Windows-1252) or Еямхг (KOI8-RU). Especially if you don't know for sure that the data is intended to be natural language.
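You can watch those competing interpretations fall out of Python's codec machinery. (Python has no cp708 codec, so `iso8859_6`, which assigns the same Arabic letters to these byte values, stands in for it; likewise `koi8_r`, which matches KOI8-RU in the Cyrillic letter range, stands in for KOI8-RU. Both substitutions are mine.)

```python
# The same five bytes, decoded under four different single-byte code pages.
data = bytes([0xE5, 0xD1, 0xCD, 0xC8, 0xC7])

for codec in ("iso8859_6", "cp437", "cp1252", "koi8_r"):
    print(f"{codec:>10}: {data.decode(codec)}")

# iso8859_6: مرحبا
#     cp437: σ╤═╚╟
#    cp1252: åÑÍÈÇ
#    koi8_r: Еямхг
```

All four decodes succeed without error, which is exactly the problem: validity alone can't pick a winner here.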
I can be pretty reasonably certain that it isn't, say, UTF-8. 0xE5 in UTF-8 is the lead byte of a three-byte sequence, so the next two bytes would both have to be continuation bytes in the range 0x80-0xBF, and 0xD1 does not fall within that range. In other words, it simply isn't valid UTF-8. (Shift JIS is a near miss: 0xE5 starts a double-byte sequence there too, but valid trailing bytes run 0x40-0x7E and 0x80-0xFC, so 0xD1 passes, and the whole sequence decodes to a level-2 kanji followed by half-width katakana. Valid, but wildly implausible Japanese.) So you can narrow down the possibilities... but you're still making an educated guess. And you have to figure out a way to program that fuzzy, judgment-y analysis into software.
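The hard-elimination step, at least, is easy to mechanize: attempt a strict decode under each candidate and keep the ones that survive. A minimal sketch (the candidate list is my own, not exhaustive):

```python
def surviving_encodings(data: bytes, candidates: tuple[str, ...]) -> list[str]:
    """Return the candidate encodings under which `data` decodes without error."""
    survivors = []
    for enc in candidates:
        try:
            data.decode(enc)  # strict by default; raises on any invalid sequence
        except UnicodeDecodeError:
            continue
        survivors.append(enc)
    return survivors

data = bytes([0xE5, 0xD1, 0xCD, 0xC8, 0xC7])
print(surviving_encodings(data, ("utf-8", "shift_jis", "iso8859_6", "cp1252")))
```

UTF-8 gets eliminated outright, while Shift JIS and the single-byte code pages all survive on pure validity, so the remaining ranking has to come from plausibility heuristics, not correctness.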
I've written code to do this. If you're lucky there will be a BOM (byte order mark) or a MIME charset declaration to indicate the encoding, in which case you know it outright. If you don't have that information, then you must guess the encoding, and guessing may not produce accurate results, especially if you don't have many bytes to go by.
The program I wrote to guess the encoding would scan the bytes in multiple passes. Each pass would check whether the bytes formed valid characters in one specific encoding. After a pass completed I would assign it a score based on how many characters decoded successfully (and how many didn't). After all passes completed I'd pick the highest score and assume that was the encoding. This approach ended up being reasonably reliable, provided there were enough bytes to work with.
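That scoring loop might look something like this in Python. To be clear, this is my own sketch of the idea, not the original program; the default candidate list and the clean-character ratio used as the score are choices I'm making here:

```python
import codecs

def guess_encoding(data: bytes, candidates=("utf-8", "cp1252", "shift_jis")) -> str:
    """One pass per candidate encoding: decode with replacement, score by
    the fraction of characters that decoded cleanly, keep the best."""
    best_enc, best_score = candidates[0], -1.0
    for enc in candidates:
        decoder = codecs.getincrementaldecoder(enc)("replace")
        text = decoder.decode(data, final=True)
        clean = sum(1 for ch in text if ch != "\ufffd")  # U+FFFD marks a decode failure
        score = clean / len(text) if text else 0.0
        if score > best_score:  # ties go to the earlier (listed-first) candidate
            best_enc, best_score = enc, score
    return best_enc

print(guess_encoding("héllo".encode("utf-8")))   # utf-8
print(guess_encoding(b"caf\xe9 au lait"))        # cpl252-style bytes lose the utf-8 pass
```

A real detector would also weigh how plausible the decoded characters are (letter frequencies, script runs, control characters), which is where the "enough bytes to go by" caveat bites hardest.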