The network interface on most operating systems seems kind of absurd from a server's point of view. You've got data coming in from N clients. The data from all those clients gets physically multiplexed into a single data stream that comes in over the wire.
Then in the kernel you've got something that demultiplexes it back into N separate streams, which it accumulates in N different buffers.
The server software is aware of all N of these streams. Through some mechanism the kernel notifies the server software that data is available on some subset of the N buffers, and the server software has to ask for that data.
If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!
Why not get rid of some of this demultiplexing and re-multiplexing? Let me have a single file descriptor, and everything that comes in destined to my port comes through that single file descriptor, as a message with a header that tells what remote IP address and port it came from and the length, followed by that many bytes of data.
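To make the proposal concrete, here's a sketch of what that per-message header might look like. The field layout (4-byte IPv4 address, 2-byte remote port, 4-byte payload length) is purely an assumption for illustration; no kernel exposes this interface today.

```python
import socket
import struct

# Hypothetical wire format for the proposed single-fd interface:
# 4-byte IPv4 address, 2-byte remote port, 4-byte payload length,
# then that many bytes of payload. Field sizes are illustrative
# assumptions, not an existing kernel API.
HEADER = struct.Struct("!4sHI")

def frame(remote_ip: str, remote_port: int, payload: bytes) -> bytes:
    """Prepend the (ip, port, length) header to a payload."""
    return HEADER.pack(socket.inet_aton(remote_ip),
                       remote_port, len(payload)) + payload

def unframe(buf: bytes):
    """Split one framed message back into (ip, port, payload)."""
    ip_raw, port, length = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    return socket.inet_ntoa(ip_raw), port, payload
```

Each read from the single descriptor would then hand back one such message, and the server keys its per-client state off the (ip, port) pair instead of off a file descriptor.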
Instead of N separate buffers in the kernel, just have one big buffer per port. You'd still have to keep track of how many bytes in that buffer are associated with each remote connection, for purposes of dealing with TCP flow control, but I think this could be made to work reasonably (yes, I'm hand waving right now).
Now you don't need poll or epoll. Heck, you don't even need to use select. I'll just have a blocking thread whose job is to just read that one file descriptor using blocking reads in a loop to pull in data and put it on my queue.
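The single-reader design above can be sketched in a few lines: one blocking thread drains a lone descriptor and pushes whatever arrives onto a queue for workers. Here an os.pipe() stands in for the imagined per-port descriptor, since no OS provides one.

```python
import os
import queue
import threading

# One blocking reader thread feeding a work queue: with a single fd
# there is nothing to select/poll/epoll over.
work = queue.Queue()
rfd, wfd = os.pipe()   # stand-in for the hypothetical per-port fd

def reader():
    while True:
        data = os.read(rfd, 4096)   # blocking read
        if not data:                # writer closed: shut down
            break
        work.put(data)

t = threading.Thread(target=reader, daemon=True)
t.start()

os.write(wfd, b"request from client A")
os.write(wfd, b"request from client B")
os.close(wfd)
t.join()
```

Worker threads would then pull from `work` with `work.get()`; in the real proposal each item would carry the framing header identifying the client.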
Looking at this slightly differently--everything destined for, say, port 80 is going to the web server, so why does the web server want this to come in through N different file descriptors? Especially when you consider that the code it's going to run is the same no matter which descriptor the data comes in on? All the multiple file descriptors really accomplish is providing a way to tell which bytes came from which client. Is that really the best use of file descriptors, as opposed to just having the kernel tell you which client the data came from when you do your read system call?
I can see that the N file descriptor model made sense in the days before threads, when to get concurrency you had to use multiple processes. You then need to have data coming in on port X from different clients not necessarily all going to the same process.
But for servers that are going to use the "single process with multiple threads" model to handle concurrency, a single-stream network model would seem to be a lot nicer and more efficient.
> If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!
This would indeed make no sense were it not for the fact that only a subset of the data received on the network card is actually intended for your server app.
You want a single file descriptor on which you can read data only destined for your port. At that point the data must already be demultiplexed by the kernel or you'd get the data destined for all possible ports.
I'm not sure skipping the demultiplexing will win you that much, because the kernel has to demultiplex the data anyway. But if you want to implement this and try it out, by all means go ahead.
Wait a minute, let me get this right... you want to write your own TCP congestion handling code?? Sounds like you're interested in raw sockets, which are already available.
It may sound inefficient, but I've never seen it being a bottleneck. It needs to be separated out at some point, and the kernel is the best place to do that.
Maybe if you're trying to serve 50,000 video streams from a single machine or something, but then you'll probably hit the limitations of your connection before you see any CPU usage from the networking code.
This whole discussion, as I've said before, seems like a classic case of premature optimization.
Networking code doesn't eat CPU. It's negligible.
Concentrate on all the BS dynamic functional language database crap people seem to insist on using at the higher levels. That's where all the wastage is.
Technically nothing stops you from opening the raw device and reading packets yourself. Sooner or later you'll need to figure out when your packets can be forwarded to the next layer for processing, so work can be done for a particular connection.
And that's what the kernel did for you all along. The abstraction of file descriptors for clients is a useful one; it allows you to decide what to do. The decision here is not 'do I want all this data in one stream' or 'do I want it demultiplexed': the kernel does a fairly good job of the demultiplexing that needs to be done anyway (re-assembling a stream of packets into functioning connections is non-trivial), and it exports that interface to all user processes to avoid having to recreate that useful abstraction in each of them.
It's possible there is some gain to be had by 'rolling your own' and by getting rid of the kernel abstraction, collapsing all these layers into a single user-mode program.
But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.
I once ran into a situation like the one you describe. I was the programmer in charge of writing a driver for a high-speed serial card (high speed at the time; nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two separate pieces of software talking to each other using message passing on a weird but wonderful little OS called QNX: one process to handle the data link layer, one to handle the X.25 protocol.
After a series of benchmarks with 8 such cards in a single ISA industrial enclosure, together with a (for the time) very expensive Micronics 33 MHz motherboard, I figured we might gain some performance by merging the two processes to save on the IPC. The end result? Identical performance. In spite of the message-passing overhead, the bit-twiddling, checksumming and actual IO were such a large portion of the work done that the IPC didn't matter at all performance-wise.
But it did matter to the code, the second version was decidedly harder to debug and maintain, and eventually that 'branch' was scrapped in favour of the one where the layers were clearly separated.
poll and epoll are communication mechanisms: they ask the kernel about the state of a bunch of fds. From a programmer's point of view both implement the same functionality (as long as you don't use epoll's extra event modes). Each has slightly different use cases, and in some situations it may be advantageous to use one or the other. There may be a performance difference, but on the whole it probably won't be a large one once you factor in all the other work that needs doing.
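That equivalence is visible in Python's selectors module, which wraps whichever readiness mechanism the platform provides (epoll on Linux, poll or kqueue elsewhere) behind one interface. A minimal sketch, using a socketpair so it runs without a network:

```python
import selectors
import socket

# poll and epoll answer the same question -- "which of these fds are
# readable?" -- so one abstraction covers both. DefaultSelector picks
# the best mechanism available on the platform.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ)

a.sendall(b"ping")                       # make b readable
events = sel.select(timeout=1)
readable = [key.fileobj for key, _ in events]
```

The program logic is identical regardless of which selector backend ends up being used, which is the point: the choice between poll and epoll is a performance detail, not a design decision.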
> But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.
You're not understanding OP's proposal. He's not talking about implementing TCP in user-space. He said the kernel would still be handling TCP flow control, and would only be sending data destined for a specific port to this fd. So TCP is still in the kernel.
He's just proposing that the payloads for all connections come in on a single fd instead of one fd per connection. But each read from the fd tells you what remote host it came from.
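Worth noting that UDP already has exactly this shape: one fd per port, and every read hands back the payload together with the remote address. What OP wants is roughly these semantics with TCP's reliability and flow control kept in the kernel. A quick loopback demonstration:

```python
import socket

# UDP's recvfrom() already provides the "one fd, sender address per
# read" interface the proposal describes (minus TCP's reliability
# and flow control, of course).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS picks a free port
port = server.getsockname()[1]
server.settimeout(2)

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", port))

# One read: payload plus the (ip, port) it came from.
data, (remote_ip, remote_port) = server.recvfrom(4096)
```

The server distinguishes clients by the returned address tuple rather than by file descriptor, which is precisely the bookkeeping OP is proposing for TCP payloads.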
Probably the biggest argument against doing what you describe is that the status quo gives you control over scheduling which fds to service (of the ones that are active).
If you processed incoming network data in strict FIFO order, a single host on your LAN could totally DoS your system by flooding you with incoming data. That host would never get TCP flow-controlled, because if you're processing anything, you're processing his obnoxiously voluminous input.
If he was just one FD among many, you would service him once in a while, but then you would service other clients. While you're servicing other clients, he gets TCP flow-controlled, so he can't singlehandedly demand all your attention.
It totally makes sense. You should also be able to choose to use more than one socket (but fewer than one per client), in case you still want to take advantage of all the available cores.
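Something close to this "a few sockets, fewer than one per client" idea exists on Linux (3.9+) as SO_REUSEPORT: several sockets bind the same port and the kernel load-balances incoming traffic across them, typically one socket per worker/core. A minimal sketch (assumes Linux; shown with UDP to keep it self-contained):

```python
import socket

# SO_REUSEPORT lets multiple sockets share one port; the kernel
# distributes incoming traffic among them. Servers typically open
# one such socket per core.
def bound_socket(port: int = 0) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

s1 = bound_socket()
shared_port = s1.getsockname()[1]
s2 = bound_socket(shared_port)   # second socket, same port: no EADDRINUSE
```

Without SO_REUSEPORT on both sockets, the second bind() would fail with EADDRINUSE; with it, each worker thread or process can own its own socket on the shared port.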