The network interface on most operating systems seems kind of absurd from a server's point of view. You've got data coming in from N clients. The data from all those clients gets physically multiplexed into a single data stream that comes in over the wire.
Then in the kernel you've got something that demultiplexes it back into N separate streams, which it accumulates in N different buffers.
The server software is aware of all N of these streams. Through some mechanism the kernel notifies the server software that data is available on some subset of the N buffers, and the server software has to ask for that data.
If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!
Why not get rid of some of this demultiplexing and re-multiplexing? Let me have a single file descriptor, and everything that comes in destined to my port comes through that single file descriptor, as a message with a header that tells what remote IP address and port it came from and the length, followed by that many bytes of data.
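To make the proposal concrete, here's a sketch of what that per-message header might look like. The field layout (4-byte IPv4 address, 2-byte remote port, 4-byte payload length) is purely an assumption for illustration; no kernel exposes this interface today.

```python
import socket
import struct

# Hypothetical wire format for the proposed single-fd interface:
# 4-byte IPv4 address, 2-byte remote port, 4-byte payload length,
# then that many bytes of payload. Field sizes are illustrative
# assumptions, not an existing kernel API.
HEADER = struct.Struct("!4sHI")

def frame(remote_ip: str, remote_port: int, payload: bytes) -> bytes:
    """Prepend the (ip, port, length) header to a payload."""
    return HEADER.pack(socket.inet_aton(remote_ip),
                       remote_port, len(payload)) + payload

def unframe(buf: bytes):
    """Split one framed message back into (ip, port, payload)."""
    ip_raw, port, length = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    return socket.inet_ntoa(ip_raw), port, payload
```

Each read from the single descriptor would then hand back one such message, and the server keys its per-client state off the (ip, port) pair instead of off a file descriptor.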
Instead of N separate buffers in the kernel, just have one big buffer per port. You'd still have to keep track of how many bytes in that buffer are associated with each remote connection, for purposes of dealing with TCP flow control, but I think this could be made to work reasonably (yes, I'm hand waving right now).
Now you don't need poll or epoll. Heck, you don't even need to use select. I'll just have a blocking thread whose job is to just read that one file descriptor using blocking reads in a loop to pull in data and put it on my queue.
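The single-reader design above can be sketched in a few lines: one blocking thread drains a lone descriptor and pushes whatever arrives onto a queue for workers. Here an os.pipe() stands in for the imagined per-port descriptor, since no OS provides one.

```python
import os
import queue
import threading

# One blocking reader thread feeding a work queue: with a single fd
# there is nothing to select/poll/epoll over.
work = queue.Queue()
rfd, wfd = os.pipe()   # stand-in for the hypothetical per-port fd

def reader():
    while True:
        data = os.read(rfd, 4096)   # blocking read
        if not data:                # writer closed: shut down
            break
        work.put(data)

t = threading.Thread(target=reader, daemon=True)
t.start()

os.write(wfd, b"request from client A")
os.write(wfd, b"request from client B")
os.close(wfd)
t.join()
```

Worker threads would then pull from `work` with `work.get()`; in the real proposal each item would carry the framing header identifying the client.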
Looking at this slightly differently--everything destined for, say, port 80 is going to the web server, so why does the web server want this to come in through N different file descriptors? Especially when you consider that the code it's going to run is the same no matter which descriptor the data comes in on? All the multiple file descriptors really accomplish is providing a way to tell which bytes came from which client. Is that really the best use of file descriptors, as opposed to just having the kernel tell you which client the data came from when you do your read system call?
I can see that the N file descriptor model made sense in the days before threads, when to get concurrency you had to use multiple processes. You then need to have data coming in on port X from different clients not necessarily all going to the same process.
But for servers that are going to use the "single process with multiple threads" model to handle concurrency, a single-stream network model would seem to be a lot nicer and more efficient.
> If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!
This would indeed make no sense were it not for the fact that only a subset of the data received on the network card is actually intended for your server app.
You want a single file descriptor on which you can read data only destined for your port. At that point the data must already be demultiplexed by the kernel or you'd get the data destined for all possible ports.
I'm not sure skipping the demultiplexing will win you that much, because the kernel has to demultiplex the data anyway. But if you want to implement this and try it out, by all means go ahead.
Wait a minute, let me get this right... you want to write your own TCP congestion handling code?? Sounds like you're interested in raw sockets, which are already available.
It may sound inefficient, but I've never seen it being a bottleneck. It needs to be separated out at some point, and the kernel is the best place to do that.
Maybe if you're trying to serve 50,000 video streams from a single machine or something, but then you'll probably hit the limitations of your connection before you see any CPU usage from the networking code.
This whole discussion, as I've said before, seems like a classic case of premature optimization.
Networking code doesn't eat CPU. It's negligible.
Concentrate on all the BS dynamic functional language database crap people seem to insist on using at the higher levels. That's where all the wastage is.
Technically nothing stops you from opening the raw device and reading packets yourself. Sooner or later you'll need to figure out when your packets can be forwarded to the next layer for processing, so work can be done for a particular connection.
And that's what the kernel did for you all along. The abstraction of file descriptors for clients is a useful one; it allows you to decide what to do. The decision here is not 'do I want all this data in one stream' or 'do I want it demultiplexed': the kernel does a fairly good job of the demultiplexing that needs to be done anyway (re-assembling a stream of packets into functioning connections is non-trivial), and it exports that interface to all user processes to avoid having to recreate that useful abstraction in each of them.
It's possible there is some gain to be had by 'rolling your own' and by getting rid of the kernel abstraction, collapsing all these layers into a single user-mode program.
But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.
I once ran into a situation like the one you describe. I was the programmer in charge of writing a driver for a high-speed serial card (high speed at the time; nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two separate pieces of software talking to each other using message passing on a weird but wonderful little OS called QNX: one process to handle the data link layer, one to handle the X.25 protocol.
After a series of benchmarks with 8 such cards in a single ISA industrial enclosure, together with a (for the time) very expensive Micronics 33 MHz motherboard, I figured we might gain some performance by merging the two processes to save on the IPC. The end result? Identical performance. In spite of the message-passing overhead, the bit-twiddling, checksumming and actual IO were such a large portion of the work done that the IPC didn't matter at all performance-wise.
But it did matter to the code, the second version was decidedly harder to debug and maintain, and eventually that 'branch' was scrapped in favour of the one where the layers were clearly separated.
poll and epoll are communication mechanisms: they ask the kernel about the state of a bunch of fds. From a programmer's point of view both implement the same functionality (as long as you don't use epoll's extra event modes). Each has slightly different use cases, and in some situations it may be advantageous to use one or the other. There may be a performance difference, but on the whole it probably won't be a large one once you factor in all the other work that needs doing.
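That equivalence is visible in Python's selectors module, which wraps whichever readiness mechanism the platform provides (epoll on Linux, poll or kqueue elsewhere) behind one interface. A minimal sketch, using a socketpair so it runs without a network:

```python
import selectors
import socket

# poll and epoll answer the same question -- "which of these fds are
# readable?" -- so one abstraction covers both. DefaultSelector picks
# the best mechanism available on the platform.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ)

a.sendall(b"ping")                       # make b readable
events = sel.select(timeout=1)
readable = [key.fileobj for key, _ in events]
```

The program logic is identical regardless of which selector backend ends up being used, which is the point: the choice between poll and epoll is a performance detail, not a design decision.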
> But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.
You're not understanding OP's proposal. He's not talking about implementing TCP in user-space. He said the kernel would still be handling TCP flow control, and would only be sending data destined for a specific port to this fd. So TCP is still in the kernel.
He's just proposing that the payloads for all connections come in on a single fd instead of one fd per connection. But each read from the fd tells you what remote host it came from.
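Worth noting that UDP already has exactly this shape: one fd per port, and every read hands back the payload together with the remote address. What OP wants is roughly these semantics with TCP's reliability and flow control kept in the kernel. A quick loopback demonstration:

```python
import socket

# UDP's recvfrom() already provides the "one fd, sender address per
# read" interface the proposal describes (minus TCP's reliability
# and flow control, of course).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS picks a free port
port = server.getsockname()[1]
server.settimeout(2)

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", port))

# One read: payload plus the (ip, port) it came from.
data, (remote_ip, remote_port) = server.recvfrom(4096)
```

The server distinguishes clients by the returned address tuple rather than by file descriptor, which is precisely the bookkeeping OP is proposing for TCP payloads.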
Probably the biggest argument against doing what you describe is that the status quo gives you control over scheduling which fds to service (of the ones that are active).
If you processed incoming network data in strict FIFO order, a single host on your LAN could totally DoS your system by flooding you with incoming data. That host would never get TCP flow-controlled, because if you're processing anything, you're processing his obnoxiously voluminous input.
If he was just one FD among many, you would service him once in a while, but then you would service other clients. While you're servicing other clients, he gets TCP flow-controlled, so he can't singlehandedly demand all your attention.
It totally makes sense. You should also be able to choose to use more than one socket (but fewer than one per client), in case you still want to take advantage of all the available cores.
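Something close to this "a few sockets, fewer than one per client" idea exists on Linux (3.9+) as SO_REUSEPORT: several sockets bind the same port and the kernel load-balances incoming traffic across them, typically one socket per worker/core. A minimal sketch (assumes Linux; shown with UDP to keep it self-contained):

```python
import socket

# SO_REUSEPORT lets multiple sockets share one port; the kernel
# distributes incoming traffic among them. Servers typically open
# one such socket per core.
def bound_socket(port: int = 0) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

s1 = bound_socket()
shared_port = s1.getsockname()[1]
s2 = bound_socket(shared_port)   # second socket, same port: no EADDRINUSE
```

Without SO_REUSEPORT on both sockets, the second bind() would fail with EADDRINUSE; with it, each worker thread or process can own its own socket on the shared port.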