Pcaps are remarkably versatile. The choice to store two lengths in the packet header means that you can record both the actual wire length--the number of bytes that came in--and a bigger number that says where the next record starts, and any metadata you like in the gap.
(This wasn't the original purpose of the second number: the original thought was that you would not be able to afford to store all the bytes of every packet, so the second number would say where you cut it off.)
The 32-bit seconds / 32-bit microseconds timestamp in the packet header has evolved to use 32-bit nanoseconds as we got better at timekeeping, with just a change of magic number. Meanwhile, the 32-bit unsigned seconds count (since 1970, UTC) takes us all the way to 2106 CE while the time_t crybabies are running in circles about 2038.
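A minimal sketch of the record layout described above, assuming the classic pcap format (24-byte file header, 16-byte per-record header). The function names `read_pcap` and `to_utc` are made up for illustration, and treating bytes past the wire length as metadata is the convention from this comment, not something standard readers are guaranteed to accept.

```python
import struct
from datetime import datetime, timedelta, timezone

MAGIC_MICROS = 0xA1B2C3D4  # classic pcap, microsecond fractions
MAGIC_NANOS = 0xA1B23C4D   # same layout, nanosecond fractions


def read_pcap(data):
    """Return (ticks_per_second, records) for a classic pcap byte string."""
    for endian in "<>":  # the magic number reveals the writer's byte order
        (magic,) = struct.unpack_from(endian + "I", data, 0)
        if magic in (MAGIC_MICROS, MAGIC_NANOS):
            break
    else:
        raise ValueError("not a classic pcap file")
    ticks = 1_000_000 if magic == MAGIC_MICROS else 1_000_000_000

    records = []
    offset = 24  # size of the global file header
    while offset + 16 <= len(data):
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        body = data[offset + 16 : offset + 16 + incl_len]
        records.append({
            "ts_sec": ts_sec,
            "ts_frac": ts_frac,
            "packet": body[:orig_len],  # bytes that were on the wire
            "extra": body[orig_len:],   # assumed convention: metadata in the gap
        })
        offset += 16 + incl_len  # incl_len says where the next record starts
    return ticks, records


def to_utc(ts_sec):
    """Convert an unsigned seconds count to a UTC datetime."""
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=ts_sec)


# One nanosecond-resolution record: 4 wire bytes plus 2 bytes of metadata.
file_header = struct.pack("<IHHiIII", MAGIC_NANOS, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 2**32 - 1, 999_999_999, 6, 4) + b"\xde\xad\xbe\xef" + b"MD"
ticks, records = read_pcap(file_header + record)
```

The largest unsigned 32-bit seconds value lands in 2106, while the largest signed one lands in 2038, which is the difference the comment is pointing at.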
The whole fintech industry rides on lz4 pcaps of multicast packet streams, with essential file metadata recorded in the file names. (There is a very complicated "NG" pcap format, but nobody uses it.)
Just for the record, a single 1U server with a disk array can record the complete activity of all the New York exchanges plus CME, CFE, Impact, and OPRA with enough headroom to handle >2x peak daily traffic of previous years, as happened in each of the last three weeks: but only if programmed with C++, ring buffers, and kernel bypass network drivers.
> The choice to store two lengths in the packet header means that you can record both the actual wire length--the number of bytes that came in--and a bigger number that says where the next record starts, and any metadata you like in the gap.
> There is a very complicated "NG" pcap format, but nobody uses it.
This is the reason that pcapng was created - to formalize the structure of this metadata. pcapng "Pcap Next Generation" (2006) is an update to the pcap format (which originated with libpcap/tcpdump in the 90s). It's used in the networking industry as it's the default output format of Wireshark, tshark, and dumpcap (upstream tcpdump still writes classic pcap by default). If you use the Wireshark tools, chances are that you have pcapng files.
The biggest reason to use it over pcap (in networking) is that you can save packets from multiple network interfaces. If you have a wireless access point and want to capture both ethernet and 802.11 traffic in the same capture, pcapng would be necessary.
Oh come on. Ported my pcap parser in 3 hours, and that included reading the docs and looking up an example capture.
SHB is the 'file header' packet.
IDB is the 'interface description' packet.
EPB is the 'data' packet.
The rest you should ignore except if you need to look through other block kinds, and in this case you're quite happy to have pcap-ng. No custom block in pcap, except forking wireshark...
All the ethernet/ppp/802.11 after that is the same as with pcap. What do people find so complicated there?
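For a sense of how little is needed, here is a rough sketch of a block walker, assuming the generic pcapng block layout (type, total length, body padded to 4 bytes, total length again) and a single section with one byte order. The helper names are hypothetical, and the custom block's enterprise number is only an example value. Only the Section Header Block (SHB), Interface Description Block (IDB), and Enhanced Packet Block (EPB) are named; everything else is skipped.

```python
import struct

BLOCK_NAMES = {0x0A0D0D0A: "SHB", 0x00000001: "IDB", 0x00000006: "EPB"}
BYTE_ORDER_MAGIC = 0x1A2B3C4D


def make_block(block_type, body):
    """Build a little-endian pcapng block (used here only to make test data)."""
    body += b"\x00" * (-len(body) % 4)  # bodies are padded to 32 bits
    total = 12 + len(body)
    return struct.pack("<II", block_type, total) + body + struct.pack("<I", total)


def walk_blocks(data):
    """Yield (name, body) for each block; assumes a single section."""
    # The SHB body starts with the byte-order magic, at file offset 8.
    endian = "<" if data[8:12] == struct.pack("<I", BYTE_ORDER_MAGIC) else ">"
    offset = 0
    while offset + 12 <= len(data):
        block_type, total = struct.unpack_from(endian + "II", data, offset)
        if total < 12:
            raise ValueError("corrupt block length")
        yield BLOCK_NAMES.get(block_type, "other"), data[offset + 8 : offset + total - 4]
        offset += total


def parse_epb(body):
    """Split a little-endian EPB body into (interface, timestamp, packet, wire_len)."""
    iface, ts_hi, ts_lo, cap_len, orig_len = struct.unpack_from("<IIIII", body, 0)
    return iface, (ts_hi << 32) | ts_lo, body[20 : 20 + cap_len], orig_len


capture = (
    make_block(0x0A0D0D0A, struct.pack("<IHHq", BYTE_ORDER_MAGIC, 1, 0, -1))
    + make_block(0x00000001, struct.pack("<HHI", 1, 0, 65535))  # Ethernet link type
    + make_block(0x00000006, struct.pack("<IIIII", 0, 0, 1234, 3, 3) + b"abc")
    + make_block(0x00000BAD, struct.pack("<I", 32473) + b"note")  # custom block, example PEN
)
names = [name for name, _ in walk_blocks(capture)]
epb_body = [body for name, body in walk_blocks(capture) if name == "EPB"][0]
```

Options, multiple sections, and the simple packet block are left out on purpose; a reader that ignores them still gets every packet from a typical capture.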
Also pcap-ng is quite the generic data-recording format. Merging internal logs, network capture, interface stats, system configuration state, whatever you might want... In a /simple/ unique format. Easy to write, easy to have it read by wireshark... Even if it's not only network related.
The real strength of pcap-ng is that you can read it in both directions. Size is at the start /and/ end of each block. Seems stupid but it's very useful for some kind of analysis. Used to build indexes for pcap files... Not so useful now. And anyway, if you want to build an index, you can just add it as custom packet at the end.
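Reading from the end can be sketched like this, assuming the byte order is already known (in a real file it would come from the section header at the front). `walk_backwards` and `make_block` are illustrative names, and the block types in the sample are just placeholders.

```python
import struct


def make_block(block_type, body):
    """Build a little-endian pcapng block (used here only to make test data)."""
    body += b"\x00" * (-len(body) % 4)
    total = 12 + len(body)
    return struct.pack("<II", block_type, total) + body + struct.pack("<I", total)


def walk_backwards(data, endian="<"):
    """Yield (start_offset, block_type) from the last block to the first."""
    end = len(data)
    while end >= 12:
        # The trailing copy of the total length tells us where this block began.
        (total,) = struct.unpack_from(endian + "I", data, end - 4)
        start = end - total
        if total < 12 or start < 0:
            raise ValueError("corrupt trailing length")
        block_type, leading = struct.unpack_from(endian + "II", data, start)
        if leading != total:
            raise ValueError("leading and trailing lengths disagree")
        yield start, block_type
        end = start


blocks = [
    make_block(0x0A0D0D0A, struct.pack("<IHHq", 0x1A2B3C4D, 1, 0, -1)),
    make_block(0x00000006, struct.pack("<IIIII", 0, 0, 1, 2, 2) + b"hi"),
    make_block(0x00000BAD, b"index goes here!"),
]
data = b"".join(blocks)
seen = list(walk_backwards(data))
last_start, last_type = seen[0]  # an index stored as the final block is one read away
```

Checking the leading length against the trailing one is a cheap sanity test that the walk is still aligned on block boundaries.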
> the original thought was that you would not be able to afford to store all the bytes of every packet
I would phrase that differently. The idea was that you can explicitly capture shorter packets, because often you're interested in the flow/metadata only. That means you can still capture all the protocol and application headers/commands without storing the whole 1.5KB packet.
This is very handy when you capture on a non-dedicated hardware for ad-hoc analysis. Especially if you only have a vanilla kernel and capture using tcpdump.
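As a small illustration of that truncation (what `tcpdump -s 96` asks for), a record writer might look like the sketch below. It assumes the classic little-endian pcap record header; `pcap_record` is a made-up helper name.

```python
import struct


def pcap_record(ts_sec, ts_usec, packet, snaplen):
    """Build one pcap record, keeping at most snaplen bytes of the packet."""
    captured = packet[:snaplen]
    # incl_len is what we stored; orig_len is what was actually on the wire.
    header = struct.pack("<IIII", ts_sec, ts_usec, len(captured), len(packet))
    return header + captured


frame = bytes(1500)  # stand-in for a full-size Ethernet frame
record = pcap_record(1_600_000_000, 0, frame, snaplen=96)
incl_len, orig_len = struct.unpack_from("<IIII", record)[2:]
```

With a 96-byte snap length, the Ethernet, IP, and TCP headers usually survive, and the original length is still available for byte counting.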
Per day. But OPRA is only a (large) part of the traffic.
It is hard to see how GPUs figure here. This is a wide-area, latency-sensitive data distribution system. In fintech we say a microsecond is an eon, a millisecond an eternity.
Musk's LEO satellite network will cut several ms from the Europe - US transmission delay, which will be worth billions to certain organizations. Musk might collect additional billions for denying it to certain others.
Most relevant to the thread: if someone wants to do easy analytics on pcap, this enables regular dataframe / SQL / dashboarding code to work on that. Not interesting to traders, but to IT/sec/ops, it seems so! Both for forensics, monitoring, & controls.
For financial use cases, the most sensitive autotrading, yeah, I'd expect ASICs for serving simple models to win for stable algs, and I'm sure all sorts of other interesting things. At the same time, I'd expect others would enjoy something more manageable when less sensitive. Improving the whole Excel / Bloomberg / Factset etc. experience is probably a saner starting point though. I'd turn it around - where would you expect the most useful starting points to be?
If you work in the networking industry, pcaps are used to troubleshoot networks. I see guys walking around with t-shirts with "pcaps or it didn't happen". The behavior of protocols or the data in their fields can point exactly to what is failing, and a pcap captures this.
If you want to search for protocols in packet captures, I created tshark.dev/search/pcaptable/ for this exact purpose. Search 1000+ protocols from 6000+ packet captures.
---
Per capinfos, the author merged a bunch of pcap files with `mergecap` from Wireshark's sample captures. It has 38 interfaces, which is the highest I've ever seen!
Packet captures are helpful even if you’re not personally involved in the networking layer, but still talk to the internet and would like to keep logs of it. My university organizes capture-the-flag competitions and everything that goes through the game network is captured, both by us and by many of the teams. The captures let us monitor the state of the game, quickly detect and respond to denial of service attacks, and serve as a sanity check of the event in case some critical infrastructure goes down and we lose game data. Good teams log all their traffic so they can reverse-engineer and replay exploits that people shoot at them, of course ;)
I’m not sure if we ever actually wrote up anything specific about our architecture (CTF people tend to hate write-ups), but I did find that we did a post when we disqualified LC/BC for a DDoS attack against another team, which we detected using network logs: https://ictf.cs.ucsb.edu/pages/the-2016-2017-ictf-ddos.html
I've wanted a protocol reference site for years, listing different network protocol dumps and examples. In fact, I bought pdumps.com and set up a wiki for it, but then the site got hammered with spam so I shut it down.
I have been thinking about setting up a static site instead, having pull requests as a gating procedure to fight off spam.
If more people here are interested, I'm willing to set it up again. Help doing the basic static webpage setup and reviewing PRs would be appreciated. Please reply in thread if you're interested.
My biggest issue with pcap files is lack of random access capability. If you wanted to search a large file for a packet by time or contents, you can't just hop around. You have to scan sequentially. Some people chunk a capture into individual files, but this can also get unwieldy.
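One common workaround is a side index built in a single sequential pass, after which lookups by time can use binary search. The sketch below assumes a little-endian classic pcap with non-decreasing timestamps; `build_index` and `seek` are hypothetical names, and it does nothing for searching by contents.

```python
import bisect
import struct


def build_index(data):
    """Scan once and return parallel lists of timestamps and record offsets."""
    timestamps, offsets = [], []
    offset = 24  # skip the global file header
    while offset + 16 <= len(data):
        ts_sec, ts_frac, incl_len, _ = struct.unpack_from("<IIII", data, offset)
        timestamps.append((ts_sec, ts_frac))
        offsets.append(offset)
        offset += 16 + incl_len
    return timestamps, offsets


def seek(timestamps, offsets, when):
    """Return the offset of the first record at or after `when`, or None."""
    i = bisect.bisect_left(timestamps, when)
    return offsets[i] if i < len(offsets) else None


def make_record(ts_sec, payload):
    """Build a sample record (test data only)."""
    return struct.pack("<IIII", ts_sec, 0, len(payload), len(payload)) + payload


data = (
    struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    + make_record(100, b"aaaa")
    + make_record(200, b"bbbbbbbb")
    + make_record(300, b"cc")
)
timestamps, offsets = build_index(data)
```

The index costs one full scan, so it pays off only when the same file is searched repeatedly.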
Is there some reason/significance for collecting all the protocol captures into a single file? I would have thought that a tarball containing separate captures one per protocol would be more useful wouldn't it?
As a network guy this was my first thought as well. Putting everything and the kitchen sink in a single file just serves to obfuscate the relations, interactions, and dependencies of many protocols. For instance good luck opening this up to get an understanding that certain multicast protocols require certain information in IGMPv3 fields to come up the way it shows in the pcap.
Simply filtering for the high level protocol tells you A LOT less than having a barebones capture which shows all of the things that happen for that high level protocol to work and nothing more.
Not really. It's kind of like going to Project Gutenberg and zipping together the first chapter of every work. And then advertising it as "This is a sampling of literature!"
I mean it is, but most people wouldn't read literature that way.
Why do so many protocols exist in the first place?
I think it would be a great idea for a protocol engineer in the future to try to look at all known protocols, and try to compact and integrate them all into a single universal protocol, eliminating all redundant functionality...
Maybe someone has already done this, or is working on this... I don't know. If anyone has any ideas or knowledge in this direction, please feel free to post what you know...
This is like asking for a "universal vehicle". Some people want a truck while others want a forklift. It would also not be economical to make a truck extensible so that forklift components could be later tacked on. What vehicle you use depends on the goods you carry.
Likewise, the protocols do different things at different layers. Some are nested like Ethernet/IP/TCP/HTTP. TCP ensures delivery while UDP does not.
This is not how you should think about making something new.
What you're optimizing for here is your experience. Your desire for tidiness.
To make a real change in the world, you need to optimize for the experience of the people who will adopt the new thing. For example, how would people using the web benefit right now from replacing 802.11, IP, TCP, and HTTP with one monster protocol that just did the same stuff? They wouldn't. And neither would most protocol implementers, because a) they'd have to throw out everything they know and learn it all again, and b) they'd have to rebuild everything that currently exists.
And even if we push past that, it still won't work, because it's based on what I think of as the Architect's Illusion, the notion that a sufficiently smart person can just look at things and think real hard and have a perfect solution appear. The truth is that real products evolve over time as part of a dialog between and among users and makers. Even the simple paperclip, for example, evolved heavily. [1] If smart people can't get the paperclip right on a first try, there is no way they'll get a massive protocol stack right. And even if they did, next week somebody would come up with something new that doesn't fit, and we'd be right back where we were.
Such a protocol would be so dynamic you'd just end up with nodes that are all unique in that no two nodes that aren't running the same OS/software run the same capabilities of "universal protocol". As a result you have the same problem now without the ability to categorize nodes by which features they actually support (currently covered by "protocols" which rely on simpler protocol abstractions to be passed by intermediate nodes that don't understand them).
A "single universal protocol" sounds to me like you'd need at a minimum, a blob of data, with a header for length and how to interpret the data. Maybe if you flesh out the problem more you'd end up with something like this: https://tools.ietf.org/html/rfc791#section-3.1
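The header at that link already has both pieces: a Total Length field and a Protocol field that says how to interpret the payload. A rough sketch of pulling them out, assuming a 20-byte IPv4 header with no options (`parse_ipv4` is an illustrative name, and the addresses are documentation examples):

```python
import ipaddress
import struct


def parse_ipv4(header):
    """Extract a few fields from a 20-byte IPv4 header (no options)."""
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_len": total_len,              # the "length" part
        "protocol": proto,                   # the "how to interpret" part
        "ttl": ttl,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 28, 0, 0, 64, 17, 0,  # protocol 17 is UDP; checksum left at 0
    ipaddress.IPv4Address("192.0.2.1").packed,
    ipaddress.IPv4Address("192.0.2.2").packed,
)
fields = parse_ipv4(sample)
```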
What would it even mean to have a single universal protocol beyond IP? Layering on top of IP (TCP, UDP, and higher-level protocols like HTTP, DNS) is the way things already work. Not layering sounds like every application would just invent and parse their own protocols, which is even worse.
This has been a PSA.