The PCAP – single file with 50 different protocols

ncmncm · on March 15, 2020

Pcaps are remarkably versatile. The choice to store two lengths in the packet header means that you can record both the actual wire length--the number of bytes that came in--and a bigger number that says where the next record starts, and any metadata you like in the gap.

(This wasn't the original purpose of the second number: the original thought was that you would not be able to afford to store all the bytes of every packet, so the second number would say where you cut it off.)

The 32-bit seconds / 32-bit microseconds timestamp in the packet header has evolved to use 32-bit nanoseconds as we got better at timekeeping, with just a change of magic number. Meanwhile, the 32-bit unsigned seconds count (since 1970, UTC) takes us all the way to C.E.2106 while the time_t crybabies are running in circles about 2038.
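For readers who have not looked at the format, here is a minimal Python sketch of the classic pcap headers discussed above. The two magic numbers and the 24-byte file header / 16-byte record header layout are the standard ones. The helper names, the sample data, and the idea of treating bytes past the wire length as metadata are illustrative assumptions that mirror the trick described in this comment, which standard tools may not expect.

```python
import struct
from datetime import datetime, timedelta, timezone

MAGIC_MICROS = 0xA1B2C3D4  # fraction field counts microseconds
MAGIC_NANOS = 0xA1B23C4D   # fraction field counts nanoseconds
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_file_header(buf):
    """Return (struct byte-order prefix, fraction units per second)."""
    for endian in ("<", ">"):
        (magic,) = struct.unpack_from(endian + "I", buf, 0)
        if magic == MAGIC_MICROS:
            return endian, 1_000_000
        if magic == MAGIC_NANOS:
            return endian, 1_000_000_000
    raise ValueError("not a classic pcap file")


def read_record(buf, offset, endian):
    """Split one record using the stored-length / wire-length scheme from the comment."""
    ts_sec, ts_frac, stored_len, wire_len = struct.unpack_from(endian + "IIII", buf, offset)
    data = buf[offset + 16 : offset + 16 + stored_len]
    packet, metadata = data[:wire_len], data[wire_len:]
    next_offset = offset + 16 + stored_len  # the stored length alone locates the next record
    return ts_sec, ts_frac, packet, metadata, next_offset


def sample_capture():
    """Hypothetical one-record file: 60 wire bytes followed by 8 bytes of metadata."""
    header = struct.pack("<IHHiIII", MAGIC_NANOS, 2, 4, 0, 0, 65535, 1)
    packet, metadata = bytes(60), b"seq=0001"
    record = struct.pack("<IIII", 1584230400, 123456789, 68, 60) + packet + metadata
    return header + record


# Last second an unsigned 32-bit counter can represent, versus a signed one.
LAST_UNSIGNED = EPOCH + timedelta(seconds=2**32 - 1)
LAST_SIGNED = EPOCH + timedelta(seconds=2**31 - 1)
```

`LAST_UNSIGNED` falls in February 2106 and `LAST_SIGNED` in January 2038, which is the gap the comment is pointing at.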

The whole fintech industry rides on lz4 pcaps of multicast packet streams, with essential file metadata recorded in the file names. (There is a very complicated "NG" pcap format, but nobody uses it.)

Just for the record, a single 1U server with a disk array can record the complete activity of all the New York exchanges plus CME, CFE, Impact, and OPRA with enough headroom to handle >2x peak daily traffic of previous years, as happened in each of the last three weeks: but only if programmed with C++, ring buffers, and kernel bypass network drivers.

This has been a PSA.

p0cc · on March 15, 2020

> The choice to store two lengths in the packet header means that you can record both the actual wire length--the number of bytes that came in--and a bigger number that says where the next record starts, and any metadata you like in the gap.

> There is a very complicated "NG" pcap format, but nobody uses it.

This is the reason that pcapng was created - to formalize the structure of this metadata. pcapng "Pcap Next Generation" (2006) is an update to the pcap format (which originated with libpcap/tcpdump in the 90s). It's used in the networking industry as it's the default output format of Wireshark and tshark (upstream tcpdump still writes classic pcap by default). If you use these tools, chances are that you have pcapng files.

The biggest reason to use it over pcap (in networking) is that you can save packets from multiple network interfaces. If you have a wireless access point and want to capture both ethernet and 802.11 traffic in the same capture, pcapng would be necessary.

touisteur · on March 15, 2020

Oh come on. Ported my pcap parser in 3 hours, and that included reading the docs and looking up an example capture.

SHB is the 'file header' packet. IDB is the 'interface description' packet. EPB is the 'data' packet.

The rest you should ignore except if you need to look through other block kinds, and in this case you're quite happy to have pcap-ng. No custom block in pcap, except forking wireshark...

All the ethernet/ppp/802.11 after that is the same as with pcap. What do people find so complicated there?

Also pcap-ng is quite the generic data-recording format. Merging internal logs, network capture, interface stats, system configuration state, whatever you might want... In a /simple/ unique format. Easy to write, easy to have it read by wireshark... Even if it's not only network related.

The real strength of pcap-ng is that you can read it in both directions. Size is at the start /and/ end of each block. Seems stupid but it's very useful for some kinds of analysis. Used to build indexes for pcap files... Not so useful now. And anyway, if you want to build an index, you can just add it as a custom block at the end.
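As a rough illustration of the "size at the start and end" point, here is a Python sketch that walks pcapng blocks forward and backward. The block type codes are the standard ones for the section header, interface description, and enhanced packet blocks. The function names and the tiny synthetic capture are hypothetical, and the sketch assumes a single little-endian section, whereas a real reader would take the byte order from the section header's byte-order magic.

```python
import struct

SHB, IDB, EPB = 0x0A0D0D0A, 0x00000001, 0x00000006


def make_block(block_type, body):
    """Wrap a body as: type, total length, padded body, total length again."""
    body += b"\x00" * (-len(body) % 4)  # block bodies are padded to 32 bits
    total = 12 + len(body)
    return struct.pack("<II", block_type, total) + body + struct.pack("<I", total)


def walk_forward(buf):
    """Yield (offset, block type) from the start, using the leading length."""
    offset = 0
    while offset < len(buf):
        block_type, total = struct.unpack_from("<II", buf, offset)
        yield offset, block_type
        offset += total


def walk_backward(buf):
    """Yield (offset, block type) from the end, using the trailing length."""
    end = len(buf)
    while end > 0:
        (total,) = struct.unpack_from("<I", buf, end - 4)
        start = end - total
        (block_type,) = struct.unpack_from("<I", buf, start)
        yield start, block_type
        end = start


def sample_capture():
    """Hypothetical three-block file: section header, one interface, one packet."""
    shb = make_block(SHB, struct.pack("<IHHq", 0x1A2B3C4D, 1, 0, -1))
    idb = make_block(IDB, struct.pack("<HHI", 1, 0, 65535))
    packet = bytes(61)  # odd length, to exercise the padding
    epb = make_block(EPB, struct.pack("<IIIII", 0, 0, 0, len(packet), len(packet)) + packet)
    return shb + idb + epb
```

Walking backward needs no index at all: the last four bytes of the file give the size of the final block, and each step repeats that from the new end.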

ncmncm · on March 15, 2020

The reason it is not used in fintech is that it is extremely complex.

The greatest failing of wireshark et al today is not understanding lz4 and zstd compression.

viraptor · on March 15, 2020

> the original thought was that you would not be able to afford to store all the bytes of every packet

I would phrase that differently. The idea was that you can explicitly capture shorter packets, because often you're interested in the flow/metadata only. That means you can still capture all the protocol and application headers/commands without storing the whole 1.5KB packet.

This is very handy when you capture on non-dedicated hardware for ad-hoc analysis. Especially if you only have a vanilla kernel and capture using tcpdump.
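A small Python sketch of what that truncation looks like on disk, assuming a little-endian classic pcap record. The `make_record` helper is hypothetical. With tcpdump, the snapshot length is what the `-s` option controls.

```python
import struct


def make_record(ts_sec, ts_usec, packet, snaplen):
    """Build one pcap record that keeps at most `snaplen` bytes of the packet."""
    kept = packet[:snaplen]
    # Captured length says how many bytes follow; wire length remembers the real size.
    header = struct.pack("<IIII", ts_sec, ts_usec, len(kept), len(packet))
    return header + kept


# A 1500-byte packet captured with a 96-byte snapshot length.
record = make_record(1584230400, 0, bytes(1500), 96)
captured_len, wire_len = struct.unpack_from("<II", record, 8)
```

The record costs 112 bytes instead of 1516, and a reader can still tell from `wire_len` how large the original packet was.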

lmeyerov · on March 15, 2020

Interesting, thanks!

RE:NYSE, any sense of avg+peak bytes/s?

(Context being we work on interactive GPU visual analytics for related stuff, so always love fun targets to drive benchmarks!)

ncmncm · on March 15, 2020

Daily OPRA traffic has lately exceeded 6 TB, compressed. So, probably 15 TB. Figure 58 bytes of header + ~50 of payload per packet.

Note the "all NY exchanges" includes a whole hell of a lot more than NYSE.

lmeyerov · on March 15, 2020

Ah neat -- 15TB/s or /d? Single GPUs are at TB/s levels, so interesting locality + pricing implications here!

ncmncm · on March 15, 2020

Per day. But OPRA is only a (large) part of the traffic.
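For a feel of the scale, here is the back-of-the-envelope arithmetic in Python, using only the figures from this subthread (roughly 15 TB per day uncompressed, 58 bytes of header plus about 50 of payload per packet). It assumes 1 TB = 10^12 bytes and averages over a full 24 hours. Real traffic is bunched into trading hours, so peaks will be well above these averages.

```python
BYTES_PER_DAY = 15e12          # ~15 TB/day uncompressed, per the estimate above
BYTES_PER_PACKET = 58 + 50     # header + approximate payload
SECONDS_PER_DAY = 24 * 60 * 60

packets_per_day = BYTES_PER_DAY / BYTES_PER_PACKET
avg_bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
avg_packets_per_second = packets_per_day / SECONDS_PER_DAY

print(f"{packets_per_day:.3g} packets/day")
print(f"{avg_bytes_per_second / 1e6:.0f} MB/s averaged over 24 h")
print(f"{avg_packets_per_second / 1e6:.2f} M packets/s averaged over 24 h")
```

That works out to about 139 billion packets a day, or around 174 MB/s and 1.6 million packets per second as a 24-hour average.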

It is hard to see how GPUs figure here. This is a wide-area, latency-sensitive data distribution system. In fintech we say a microsecond is an eon, a millisecond an eternity.

Musk's LEO satellite network will cut several ms from the Europe - US transmission delay, which will be worth billions to certain organizations. Musk might collect additional billions for denying it to certain others.

lmeyerov · on March 16, 2020

Most relevant to the thread is the case where someone wants to do easy analytics on pcaps, with regular dataframe / SQL / dashboarding code working on them. Not interesting to traders, but for IT/sec/ops it seems so, for forensics, monitoring, and controls alike.

For financial use cases, the most sensitive autotrading, yeah, I'd expect ASICs for serving simple models to win for stable algs, and I'm sure all sorts of other interesting things. At the same time, I'd expect others would enjoy something more manageable when less sensitive. Improving the whole Excel / Bloomberg / Factset etc. experience is probably a saner starting point though. I'd turn it around - where would you expect the most useful starting points to be?

p0cc · on March 15, 2020

If you work in the networking industry, pcaps are used to troubleshoot networks. I see guys walking around with t-shirts with "pcaps or it didn't happen". The behavior of protocols or the data in their fields can point exactly to what is failing, and a pcap captures this.

If you want to search for protocols in packet captures, I created tshark.dev/search/pcaptable/ for this exact purpose. Search 1000+ protocols from 6000+ packet captures.

---

Per capinfos, the author merged a bunch of pcap files with `mergecap` from Wireshark's sample captures. It has 38 interfaces, which is the highest I've ever seen!

  $ tshark -r ultimate.pcapng -T fields -e frame.protocols | sed -e 's/:/\n/g' | sort | uniq | wc -l
  69

Looks like it's actually 69 protocols, which makes it quite novel as a packet capture.
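The same count can be sketched in Python, assuming each input line is one `frame.protocols` value, i.e. a colon-separated protocol stack. The function name and the sample lines are made up for illustration. Unlike the shell pipeline, this version ignores empty fields, so the two could differ by one if the output contains blank lines.

```python
def unique_protocols(lines):
    """Collect every distinct protocol name from colon-separated stacks."""
    protocols = set()
    for line in lines:
        # Skip empty fields so a blank line does not count as a protocol.
        protocols.update(name for name in line.strip().split(":") if name)
    return protocols


# Hypothetical frame.protocols values for three frames.
sample_lines = [
    "eth:ethertype:ip:tcp\n",
    "eth:ethertype:ip:udp:dns\n",
    "eth:ethertype:arp\n",
]
found = unique_protocols(sample_lines)
print(len(found), sorted(found))
```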

saagarjha · on March 15, 2020

Packet captures are helpful even if you’re not personally involved in the networking layer, but still talk to the internet and would like to keep logs of it. My university organizes capture-the-flag competitions and everything that goes through the game network is captured, both by us and by many of the teams. The captures let us monitor the state of the game, quickly detect and respond to denial of service attacks, and serve as a sanity check of the event in case some critical infrastructure goes down and we lose game data. Good teams log all their traffic so they can reverse-engineer and replay exploits that people shoot at them, of course ;)

sbmthakur · on March 15, 2020

That's an interesting use case. Is there a blog/article talking more about this?

saagarjha · on March 15, 2020

I’m not sure if we ever actually wrote up anything specific about our architecture (CTF people tend to hate write-ups), but I did find that we did a post when we disqualified LC/BC for a DDoS attack against another team, which we detected using network logs: https://ictf.cs.ucsb.edu/pages/the-2016-2017-ictf-ddos.html

strictfp · on March 15, 2020

I've wanted a protocol reference site for years, listing different network protocol dumps and examples. In fact, I bought pdumps.com and set up a wiki for it, but then the site got hammered with spam so I shut it down.

I have been thinking about setting up a static site instead, having pull requests as a gating procedure to fight off spam.

If more people here are interested, I'm willing to set it up again. Help doing the basic static webpage setup and reviewing PRs would be appreciated. Please reply in thread if you're interested.

jlgaddis · on March 15, 2020

Many years ago, Jeremy Stretch created a series of (networking-centric) "cheat sheets" [0] which may be of interest.

As boring as they can be to read, the RFCs are still the best single reference WRT network protocols and such.

---

[0]: https://packetlife.net/library/cheat-sheets/

cxr · on March 15, 2020

It would probably make more sense to use the Archive Team's existing wiki meant to "solve the file format problem".

http://fileformats.archiveteam.org

jusob · on March 16, 2020

I use Pcapr to look for pcaps: https://pcapr.net/home

simcop2387 · on March 15, 2020

Static site with something like github or gitlab pages sounds perfect for this. I'd be willing to help review things in my spare time.

qqqturing1 · on March 15, 2020

The idea is really cool. Would have loved to learn protocols from a website/interactive format instead of networking books at university.

mercora · on March 15, 2020

I would probably enjoy your republished site :)

p0cc · on March 15, 2020

DM me. Also, take a look at tshark.dev.

tcbawo · on March 16, 2020

My biggest issue with pcap files is lack of random access capability. If you wanted to search a large file for a packet by time or contents, you can't just hop around. You have to scan sequentially. Some people chunk a capture into individual files, but this can also get unwieldy.
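One common workaround is a side index built in a single sequential pass, after which lookups by time are a binary search. A minimal Python sketch, assuming a little-endian microsecond pcap held in memory with non-decreasing timestamps. The function names and the synthetic capture are hypothetical.

```python
import bisect
import struct


def build_index(buf):
    """Scan once; return parallel lists of timestamps (in microseconds) and record offsets."""
    times, offsets = [], []
    offset = 24  # skip the file header
    while offset + 16 <= len(buf):
        ts_sec, ts_usec, captured_len, _wire_len = struct.unpack_from("<IIII", buf, offset)
        times.append(ts_sec * 1_000_000 + ts_usec)
        offsets.append(offset)
        offset += 16 + captured_len
    return times, offsets


def first_record_at_or_after(times, offsets, ts_sec, ts_usec=0):
    """Binary-search the index; return a file offset, or None if past the end."""
    i = bisect.bisect_left(times, ts_sec * 1_000_000 + ts_usec)
    return offsets[i] if i < len(offsets) else None


def sample_capture():
    """Hypothetical file: five records, one per second, with growing payloads."""
    buf = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for i in range(5):
        payload = bytes(10 + i)
        buf += struct.pack("<IIII", 100 + i, 0, len(payload), len(payload)) + payload
    return buf
```

The index still costs one full scan to build, so it only pays off when the same file is searched more than once. Searching by contents would need a different index.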

rwmj · on March 15, 2020

Is there some reason/significance for collecting all the protocol captures into a single file? I would have thought that a tarball containing separate captures, one per protocol, would be more useful, wouldn't it?

zamadatix · on March 15, 2020

As a network guy this was my first thought as well. Putting everything and the kitchen sink in a single file just serves to obfuscate the relations, interactions, and dependencies of many protocols. For instance good luck opening this up to get an understanding that certain multicast protocols require certain information in IGMPv3 fields to come up the way it shows in the pcap.

Simply filtering for the high level protocol tells you A LOT less than having a barebones capture which shows all of the things that happen for that high level protocol to work and nothing more.

p0cc · on March 15, 2020

Not really. It's kind of like going to Project Gutenberg and zipping together the first chapter of every work. And then advertising it as "This is a sampling of literature!"

I mean it is, but most people wouldn't read literature that way.

peter_d_sherman · on March 15, 2020

Great idea!

But leads to an observation/question:

Why do so many protocols exist in the first place?

I think it would be a great idea for a protocol engineer in the future to try to look at all known protocols, and try to compact and integrate them all into a

Single universal protocol, eliminating all redundant functionality...

Maybe someone has already done this, or is working on this... I don't know. If anyone has any ideas or knowledge in this direction, please feel free to post what you know...

p0cc · on March 15, 2020

This is like asking for a "universal vehicle". Some people want a truck while others want a forklift. It would also not be economical to make a truck extensible so that forklift components could be later tacked on. What vehicle you use depends on the goods you carry.

Likewise, the protocols do different things at different layers. Some are nested like Ethernet/IP/TCP/HTTP. TCP ensures delivery while UDP does not.

wpietri · on March 15, 2020

This is not how you should think about making something new.

What you're optimizing for here is your experience. Your desire for tidiness.

To make a real change in the world, you need to optimize for the experience of the people who will adopt the new thing. For example, how would people using the web benefit right now from replacing 802.11, IP, TCP, and HTTP with one monster protocol that just did the same stuff? They wouldn't. And neither would most protocol implementers, because a) they have to throw out everything they know and learn it all again, and b) they'd have to rebuild everything that currently exists.

And even if we push past that, it still won't work, because it's based on what I think of as the Architect's Illusion, the notion that a sufficiently smart person can just look at things and think real hard and have a perfect solution appear. The truth is that real products evolve over time as part of a dialog between and among users and makers. Even the simple paperclip, for example, evolved heavily. [1] If smart people can't get the paperclip right on a first try, there is no way they'll get a massive protocol stack right. And even if they did, next week somebody would come up with something new that doesn't fit, and we'd be right back where we were.

[1] https://www.amazon.com/Evolution-Useful-Things-Artifacts-Zip...

zamadatix · on March 15, 2020

Such a protocol would be so dynamic you'd just end up with nodes that are all unique in that no two nodes that aren't running the same OS/software run the same capabilities of "universal protocol". As a result you have the same problem now without the ability to categorize nodes by which features they actually support (currently covered by "protocols" which rely on simpler protocol abstractions to be passed by intermediate nodes that don't understand them).

bobbiechen · on March 15, 2020

A "single universal protocol" sounds to me like you'd need at a minimum, a blob of data, with a header for length and how to interpret the data. Maybe if you flesh out the problem more you'd end up with something like this: https://tools.ietf.org/html/rfc791#section-3.1

What would it even mean to have a single universal protocol beyond IP? Layering on top of IP (TCP, UDP, and higher-level protocols like HTTP, DNS) is the way things already work. Not layering sounds like every application would just invent and parse their own protocols, which is even worse.
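To make the "length plus how to interpret the data" point concrete, here is a Python sketch that pulls exactly those fields out of an IPv4 header as laid out in RFC 791 section 3.1. The function name, the short protocol table, and the sample packet are illustrative assumptions. The checksum is left at zero and not verified, and IP options are skipped rather than parsed.

```python
import struct

PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}  # a few well-known protocol numbers


def parse_ipv4_header(packet):
    """Extract the length and 'how to interpret the payload' fields."""
    (version_ihl, _tos, total_length, _ident, _frag, _ttl,
     protocol, _checksum, src, dst) = struct.unpack_from("!BBHHHBBH4s4s", packet, 0)
    header_len = (version_ihl & 0x0F) * 4  # IHL counts 32-bit words
    return {
        "version": version_ihl >> 4,
        "header_len": header_len,
        "total_length": total_length,
        "protocol": PROTOCOL_NAMES.get(protocol, str(protocol)),
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
        "payload": packet[header_len:total_length],
    }


def sample_packet():
    """Hypothetical 28-byte datagram between two documentation addresses."""
    payload = b"8 bytes!"
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), 0, 0, 64, 17, 0,
                         bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
    return header + payload
```

The protocol field is what hands the blob to the next layer up, which is the layering described above.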

tomc1985 · on March 15, 2020

I'd pity the fools that would be stuck implementing such a complex protocol

integricho · on March 15, 2020

Obligatory response: https://xkcd.com/927/