Good CSV parsers reach 200MB/s, a formatted datetime is under 40 bytes so assumi...

fnord123 · on Oct 13, 2016

> at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like)

+1

This is why a little ELT goes a long way.

>Good CSV parsers reach 200MB/s

By good (and open source) we're talking about libcsv, rust-csv, and rust quick-csv[1]. If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time. If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes. But at that point, maybe write data to disk as protobuf or msgpack or avro, or whatever.

[1] https://bitbucket.org/ewanhiggs/csv-game

masklinn · on Oct 13, 2016

> If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time.

The 200MB/s, at least for rust-csv, is for "raw" parsing (handling CSV itself) not field parsing and conversions, so those would be additional costs.

> If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes.

Yes if you can have fixed-size fields and remove things like escaping and quoting and the like things get much faster.

userbinator · on Oct 13, 2016

It depends what exactly you mean by CSV parsing, but I've done record scanning on CSVs at >1GB/s on a 3GHz CPU.

btilly · on Oct 13, 2016

It is a lot faster to split lines on , than to handle quoting with embedded commas, returns and so on.

logicallee · on Oct 13, 2016

Interesting. I'd wait to hear their answer - if it's a CSV bottlenecking at 0.2 MB/sec instead of 200 MB/sec it would be interesting (which is the three orders of magnitude you state). But with all that said, they reported a 62x improvement, so 1.8 orders of magnitude.

fnord123 · on Oct 13, 2016

0.2MB/s parsing logs or basically any file is absolutely terrible. Hence why thomas-st wrote his own parser. Parsing at 2MB/s (rounding up the 1.8 to 2 order of magnitude) is also pretty poor.