Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Rfc 4180 definitely lays out rules for how to escape double quotes, commas, and newlines.


That's not at all helpful unless the document comes with a statement that it complies with RFC4180. His point wasn't that there are no standards for doing so, more that there are so many standards for doing so, and it's near impossible to determine which standard a document uses by itself.


RFC 4180 just tried to write down what the most common quirks of CSVs were in the wild at the time. It's a reference more than a specification.

It's really a missed opportunity. Had they had the balls to actually specify CSV properly this wouldn't be nearly as much of a problem. This would have probably left a lot of the existing CSV writers non-conformant, but it would have likely improved the future situation considerably.

It doesn't even need to be complex. Just a few rules:

  (basic CSV spec here)
  1. Reserved characters are , " \
  2. If a reserved character appears in a field, escape it with \
  3. Alternatively you can double quote a field, in which case you only need to escape " characters in the field
  4. Line endings are either CR, CR/LF, or LF.  Parsers must suppport all line endings.
  5. Character set is UTF-8, however the " , and \ characters must be 0x22, 0x2C, and 0x5C.  Alternative language lookalike characters will not be treated as control characters.
  6. Whitespace adjacent to a comma is ignored by the parser


RFC4180 is worse than worthless; Do not implement it and claim you support "CSV files". Almost every paragraph is useless, wrong, or dangerous; it's few citations offer conflicting and sometimes clearly-stupid advice where they advice the use of CSV at all. Some examples:

§ 2.1 says that lines end in CRLF. This is in opposition to every UNIX-based system out there (which outnumber systems that use CRLF as a line delimiter by between 2:1 and 8:1 depending on how you choose to estimate this) and means that CSV files either don't exist on Mac OS X or Linux, or aren't "text files" by the standard definition of that term -- both absolutely silly conclusions! Nevertheless, following RFC4180-logic, § 2.6 thus suggests that a bare line-feed can appear unquoted.

§ 2.3 says the header is optional and that a client knows if it is present because the MIME type says "text/csv; header" then § 3 admits that this isn't very helpful and clients will have to "make their own decision" anyway.

§ 2.7 requires fortran-style doubling of double-quote marks, like COBOL and SQL, and ignores that many "CSV" systems use backslash to escape quotes.

§ 3 says that the character set is in the mime type. Operating systems which don't use MIME types for their file system (i.e. almost all of them) thus cannot support any character set other than "US-ASCII".

None of these "suggestions" are true of any operating system I'm aware of, nor are they true of any popular CSV consumer; If a conforming implementation of RFC4180 exists, it's definitely not useful. In fact, one of the citations (ESR) says:

The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).

A better spec would be honest about these special cases. A "good" implementation of CSV:

• needs a flag or switch indicating whether there is a header or not

• needs to be explicitly told the character set

• needs a flag to specify the escaping method \ or ""

• needs the line-ending specified (CR, LF, CRLF, [\r\n]{1,}, or \r{0,}\n{1,})

... and it needs all of these flags and switches "user-accessible". RFC4180 doesn't mention any of this, and so anyone who picks it up looking for guidance is going to be deluded into thinking that there are rules for "escaping double quotes" or "commas" or "newlines" that will help them consume and produce CSV files. Anyone writing specifications for developers who tries to use RFC4180 for guidance to implement the "import CSV files" feature is going to be left to dry.

The devil has demanded I support CSV, so the advice I can give to anyone who has received a similar requirement:

• parse using a state machine (and not by recursive splitting or a regex-with-backtracking).

• use heuristics to "guess" the delimiter by observing that a file is likely rectangular -- tricky, since some implementations don't include trailing delimiters if the trailing values in a row are missing. I use a bag of delimiters (, ; and tab) and choose the one that produces the most columns, but has no rows with more columns than the first.

• use heuristics to "guess" the character set especially if the import might have come from Windows. For web applications I use header-hints to guess the operating system and language settings to adjust my weights.

• use heuristics to "guess" the line-ending method. I normally use whatever the first-line ends in unless [\r\n]{1,} produces a better rectangle and no subsequent lines extend beyond the first one.

A successful and fast implementation of all of these tricks is a challenge for any programmer. If you guess first, then let the user guide changes with switches, your users will be happiest, but this is very relative: Users think parsing CSV is "easy" so they're sure to complain about any issues. I have saved per-user coefficients for my heuristic guesses to try and make users happier. I have wasted more time on CSV than I have any other file format.

My advice is that we should ignore RFC4180 completely and, given the choice, avoid CSV files for any real work. XLSX is unambiguous and easy enough that a useful implementation takes less space than my criticism of CSV takes- and weirdly enough- many uses of "CSV" are just "I want to get my data out of Excel" anyway.


To me the issue isn't how we should read in CSV (as you stated, we should use a different format that is less ambiguous) but rather getting vendors to provide the data in that new format. I've found that most vendors I work with are using a cobbled together solution built on top of a mainframe and getting them to add/remove a field from CSV is a monumental task, there is no way they will move to a different format like XLSX or JSON or XML. Until there is an open, industry-wide standard, I just don't see CSV going away


Which part of that do you think I disagreed with?


None? You seemed to have touched mostly on consuming CSV and safe ways to do that and suggested XLSX - I'm just adding that digesting CSV safely isn't our biggest issue, it's getting vendors to use something other than CSV and unfortunately that is typically out of most people's control.


Amen.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: