The Mailocalypse Is Upon Us: Why Isn’t All Mail UTF-8? (lamsonproject.org)
46 points by uggedal on June 15, 2009 | hide | past | favorite | 26 comments


I may be overly conservative here, but IMHO an MTA should not touch the body of the message.

It should not even touch the headers besides adding a Received-header to the top.

Re-encoding all mail that passes through is definitely not what I call "not touching".

I know that there are some exceptions (e.g. 8BITMIME), but I still think that mail servers should keep their hands off what is passing through them.

Like mailmen who are not supposed to open the envelope, read the letters and reprint them using a nicer font on nicer paper :-)


I don't see a problem with it if the translation is 1-to-1 (and that's a pretty big if). Disregarding federal law for the moment, what is the harm in the mailman reprinting your letters with nicer font on nicer paper? It seems to me that if the mailman wants to do that, as long as they don't lose any of the information of the original letter I would benefit from having an easier to read letter.


But if it becomes normal for a mailman to do this, there is now an expected man-in-the-middle. Sure the mailman might be benign, but what about if there is another man-in-the-middle (less benign), now that I expect my communication to be tampered with I won't notice anything suspicious. [Note: Zed's proposed solution is to forward on the original mail as well so it can be verified]


I'll confess now to definitely not being an expert on mail systems but what is to stop this tampering from happening now by a man-in-the-middle? I don't really see any new avenue of attack that doesn't already exist with current systems (encrypted email excluded).


Nothing stops tampering from happening.

But the SMTP standard and the whole culture around internet mail mandate that messages are not changed in transit (with the exception of the Received header and the stuff around 8BITMIME).

UTF-8-encoding mails just because it's "cleaner" doesn't feel like the right thing to do, especially when you consider that there are still old systems around that can't handle UTF-8-encoded messages.

Also, standards are there to be adhered to - like HTML and all the others.


What is lamson doing with mails that makes just leaving them in their original charset unworkable?


Processing them in Python. :) No, really. Python is horrible at dealing with strings of different encodings. You could generalize that to any complex app with lots of data sources: the only sane way to do it is to convert everything to a single encoding at the door.
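The "convert everything at the door" approach is easy to sketch in Python. Note that `decode_at_the_door` and its fallback order are my own illustration, not anything from lamson itself:

```python
# A sketch of "decode at the door": turn incoming bytes into Unicode text
# once, at the boundary, so internal code never juggles mixed encodings.
def decode_at_the_door(raw: bytes, declared_charset: str = "utf-8") -> str:
    # Try the charset the message declares, then UTF-8, then Latin-1.
    # Latin-1 accepts any byte sequence, so this always returns a str.
    for charset in (declared_charset, "utf-8", "latin-1"):
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue

print(decode_at_the_door(b"caf\xc3\xa9"))            # café (UTF-8 input)
print(decode_at_the_door(b"caf\xe9", "iso-8859-1"))  # café (Latin-1 input)
```

The Latin-1 fallback never loses bytes, only meaning — which is exactly the trade-off the thread is arguing about.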

(edit) whups, I hadn't thought about PGP/etc signatures. sigh


Yeah, charset conversion would also break DKIM/DomainKeys.
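The mechanics of that breakage: DKIM's bh= tag is a base64 SHA-256 hash over the body bytes, so any charset conversion yields a different hash. A simplified sketch (real DKIM canonicalizes the body before hashing, which this skips):

```python
# Why re-encoding breaks DKIM: the bh= tag is a base64 SHA-256 hash of the
# exact body bytes, so changing the charset changes the hash.
import base64
import hashlib

def dkim_body_hash(body: bytes) -> str:
    # Simplified: real DKIM canonicalizes the body before hashing.
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")

original = "Grüße".encode("iso-8859-1")   # what the signer hashed
reencoded = "Grüße".encode("utf-8")       # what arrives after conversion
print(dkim_body_hash(original) == dkim_body_hash(reencoded))  # False
```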


Here be dragons. Be afraid. Be very afraid.


Everything you say and more. Internationalization, ick -- anyone who thinks this is easy has no clue how deep this rabbit hole goes.

On Han unification: the Japanese reluctance to this is partly because they're being told "Some of your national literature needs to die so that our data standard can live. Deal." and partly because they're being told "What's with all the resistance, you xenophobic bastards, get with the effing program already.", generally by people who they perceive as not quite getting the issue.

All the educated Americans in the room have read Romeo and Juliet, right? Remember the balcony scene? Remember the word in the balcony scene that you have never heard in any other context?

O Romeo, Romeo, Wherefore art thou Romeo?

Imagine being told "For technical reasons, we're standardizing computers away from being able to accept 'Wherefore' as input or output. As a workaround, we suggest using "why", or perhaps putting the word in an image file and pasting it in when it is required. Most people don't use "wherefore" anyhow and, if you routinely do, you can modify your editing software to accommodate it, as long as it doesn't have to interface with any other computer ever. Oh, by the way, some other words you know are also going to stop working. It's nothing major. Well, OK, 'Gertrudes' might find it somewhat annoying but we've got a nice selection of names from Aluicious to Xavier and, if all else fails, you can spell it phonetically because your language is capable of that, too, and don't pretend otherwise."


"On Han unification: the Japanese reluctance to this is partly because they're being told "Some of your national literature needs to die so that our data standard can live. Deal." and partly because they're being told "What's with all the resistance, you xenophobic bastards, get with the effing program already.", generally by people who they perceive as not quite getting the issue."

As I understand it, the story of Han unification is quite a bit more complex than that, and not nearly so one-sided as you make it out to be. See this article (which seems rather balanced in its treatment) for an example:

http://www.jbrowse.com/text/unij.html


They should have an encoding standard for characters that are not in the standard character set. I've seen base64-encoded images in CSS for a long time. Sure, it takes up a lot of space, but isn't this the best of both worlds? Of course, the images could be SVG or a sort of typeface definition. It doesn't matter because it's base64 encoded.

Of course, I'm not trivializing the problems with internationalization. I'm simply pointing out that there are simple solutions to some of the problems, such as this one. I think the biggest problem with character encoding is getting people to agree on a standard.
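The commenter's idea can be sketched in a few lines; the "SVG" payload here is a placeholder, not a real glyph definition:

```python
# Ship a glyph the character set lacks as a base64 payload: the encoded
# form is pure ASCII, so it survives any 7-bit-safe transport.
import base64

glyph_image = b'<svg viewBox="0 0 16 16"><!-- glyph outline --></svg>'
encoded = base64.b64encode(glyph_image)   # ASCII-safe, roughly 33% larger

print(encoded.isascii())                          # True
print(base64.b64decode(encoded) == glyph_image)   # True -- lossless
```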


"I think the biggest problem with character encoding is getting people to agree on a standard."

Prior to the Unicode standard springing into being, Japan had relatively little problem with agreeing on standards, by the simple expedient of agreeing on enough of them such that most stakeholders left the table with at least one they liked. People are sort of reluctant to leave these and the tangible benefits they offered, hence the lukewarm reception to Unicode. (If I hear tsk-tsking from any Americans on this point, just try to imagine how much adoption UTF-8 would have in the US if it didn't have the happy-and-totally-not-accidental property of working exactly like ASCII for the language American programmers and legacy systems/documents care about.)

Sidenote: I was working on a translation of features of a particular software package today and had to think a little about how to explain mojibake, and why you'd want a servlet container that can detect and correct it, to an American audience. (It's when some program between the user and the server's application layer applies a heuristic incorrectly, and one or the other ends up with complete gibberish.) The problem is every bit as fun to deal with, as a developer and a user, as you would expect it to be.
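Mojibake is easy to demonstrate, and the classic UTF-8-as-Latin-1 case also shows why it is sometimes mechanically reversible:

```python
# Mojibake in three lines: UTF-8 bytes mistakenly decoded as Latin-1.
original = "文字化け"  # "mojibake" in Japanese
garbled = original.encode("utf-8").decode("latin-1")

# Because Latin-1 maps every byte to a code point, the damage here is
# reversible -- which is what lets a server detect and undo it.
repaired = garbled.encode("latin-1").decode("utf-8")
print(garbled == original)   # False (gibberish)
print(repaired == original)  # True
```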


To be fair, the Unicode folks totally screwed over the Japanese. Now there are many people who can't even write their name on a computer.

Needless to say, these people would rather use the "legacy" character sets that didn't have this problem.


Do what other languages do: have a glyph represent a letter, not a word.

So take all those old words, and add a prefix character which means the next glyph is an old word. Unicode already has the concept of combining characters (for diacritics usually). Make a "diacritic" character that encodes the regional flavor, or variant of the character.
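Unicode later grew something quite close to this proposal: variation selectors. An Ideographic Variation Sequence pairs a base CJK character with a selector code point that requests a specific glyph variant. 辻, a Japanese surname character with one-dot and two-dot variants of its radical, is the commonly cited example; the actual selector registrations live in the Unicode IVD, so treat U+E0100 below as illustrative:

```python
# An Ideographic Variation Sequence: base character + variation selector.
base = "\u8fbb"             # 辻, a character with competing glyph shapes
ivs = base + "\U000E0100"   # same character, specific variant requested

print(len(base))  # 1
print(len(ivs))   # 2 -- the selector is its own code point
```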


I should start an 'ocalypse of the week site.


Among the problems this has: it will silently break PGP and S/MIME signatures.


Ding ding ding! You win the thread. It'll break DKIM/DomainKeys too if you have 8-bit characters in your headers.

Paws off my mail, Zed.


Is this guy MIME encoding everything in base64 or quoted-printable?

Last time I checked, email could only be 7-bit ASCII because of many legacy servers.
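That's roughly right: classic SMTP is 7-bit, which is exactly why MIME transfer encodings such as quoted-printable exist. Python's stdlib shows the mechanics:

```python
# Quoted-printable makes 8-bit text safe for 7-bit transports: each
# non-ASCII byte becomes an =XX escape, and the result is reversible.
import quopri

body = "Grüße aus München".encode("utf-8")
qp = quopri.encodestring(body)

print(all(b < 128 for b in qp))          # True -- pure 7-bit output
print(quopri.decodestring(qp) == body)   # True -- lossless round trip
```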


I'd be interested in seeing estimates on how many such servers are still being used. I've never come across one.


Indeed. This was a problem in 1990, but amazingly, it's not 1990 anymore.


Drop badly encoded email on the floor because you hope it's probably spam? That deliberately violates Postel's Law, which is what keeps this mess mostly working.

And doesn't multipart/signed rely on knowing the actual charset the signer was using?


Umm, no. Drop badly encoded email because most of it is spam. Or, at least, that's the hypothesis that he's asking you to help verify. Did you even read the article?


He said himself that "a quick eye-ball sample says those failures mail are mostly spam that wasn’t classified right". Mostly? That means he already knows badly-encoded but legitimate email is not only possible but actually exists, so what's left to verify? Either you're okay with losing messages or you aren't.


I was responding to your use of the term 'hope', which made it sound like you thought this was just some blind stab in the dark. He has data, he's asking for more, and he's asking people to shoot him down if it's a dumb idea. It's hardly a faith-based effort.

And now you're parsing his sentence for subtle evidence that he secretly knows he's dropping some legitimate messages? Dude, he posted the fucking numbers himself. It's obvious that he's willing to drop some messages. He's asking how many he actually would be dropping and trying to work out whether the benefits of "everything as UTF-8" outweigh the costs of dropping a message every now and then.

"Either you're okay with losing messages or you aren't"

Yeah, brilliant. There are already several reasons to drop messages (size, mangled headers, etc). He's pushing for another one.


If I hear about one more trivial issue that's called a something-pocalypse...



