Technically nothing stops you from opening the raw device and starting to read packets. But sooner or later you'll need to figure out when a packet can be forwarded to the next layer for processing, so that work can be done for a particular connection.

And that's what the kernel has been doing for you all along. The abstraction of a file descriptor per client is a useful one: it lets you decide what to do with each connection. The decision here is not 'do I want all this data in one stream' or 'do I want it demultiplexed'. The demultiplexing needs to be done anyway (re-assembling a stream of packets into functioning connections is non-trivial), the kernel does a fairly good job of it, and it exports that interface to all user processes so that they don't each have to recreate the abstraction.
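To give a feel for what 'demultiplexing' means here, this is a toy sketch in Python. The packet layout (a tuple of addresses, ports, a byte offset and a payload) and the function name `demultiplex` are made up for illustration; a real TCP stack also has to deal with handshakes, retransmits, windows, timers and a lot more, none of which is shown.

```python
from collections import defaultdict

def demultiplex(packets):
    """Group toy packets by connection and reassemble each byte stream.

    Each packet is (src_ip, src_port, dst_ip, dst_port, seq, payload).
    This layout is an assumption for illustration, not a real wire format;
    seq is the byte offset of the payload within its stream.
    """
    segments = defaultdict(dict)
    for src_ip, src_port, dst_ip, dst_port, seq, payload in packets:
        key = (src_ip, src_port, dst_ip, dst_port)
        # Keep the first copy of a segment and ignore duplicates.
        segments[key].setdefault(seq, payload)

    streams = {}
    for key, by_seq in segments.items():
        data = b""
        expected = 0
        for seq in sorted(by_seq):
            if seq != expected:
                break  # hole in the stream: later bytes have to wait
            data += by_seq[seq]
            expected += len(by_seq[seq])
        streams[key] = data
    return streams

# Two interleaved connections, out of order, with a duplicate and a gap.
packets = [
    ("10.0.0.1", 40000, "10.0.0.9", 80, 4, b"o a"),
    ("10.0.0.2", 40001, "10.0.0.9", 80, 0, b"hi b"),
    ("10.0.0.1", 40000, "10.0.0.9", 80, 0, b"hell"),
    ("10.0.0.1", 40000, "10.0.0.9", 80, 0, b"hell"),  # duplicate segment
    ("10.0.0.2", 40001, "10.0.0.9", 80, 9, b"lost"),  # arrives past a gap
]
streams = demultiplex(packets)
```

Even this toy version needs per-connection state, ordering and gap handling, and that is before any of the hard parts. The per-connection file descriptor is the kernel handing you the result of all that work.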

It's possible there is some gain to be had by 'rolling your own' and getting rid of the kernel abstraction by collapsing all these layers into a single user mode program.

But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean interface (poll or epoll, either is fairly clean) you'd be bogged down in maintaining your own TCP stack.

I once ran into a situation like the one you describe. I was the programmer in charge of coding up a driver for a high speed serial card (high speed at the time, nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two separate pieces of software talking to each other using message passing on a weird but wonderful little OS called QNX: one process to handle the data link layer, one to handle the X.25 protocol.

After a series of benchmarks with 8 such cards in a single ISA industrial enclosure, together with a (for the time) very expensive Micronics 33 MHz motherboard, I figured that maybe we could gain some performance by merging the two processes, to save on the IPC. The end result? Identical performance. In spite of the message passing overhead, the bit-twiddling, checksumming and actual IO were such a large portion of the work done that the IPC didn't matter at all performance-wise.

But it did matter to the code: the second version was decidedly harder to debug and maintain, and eventually that 'branch' was scrapped in favour of the one where the layers were clearly separated.

Poll and epoll are communication mechanisms: they communicate with the kernel about the state of a bunch of fds. From a programmer's point of view both implement the same functionality (as long as you don't use the actual events of epoll). They have slightly different use cases, and for some situations it may be advantageous to use one or the other. There may be a performance difference, but on the whole it will probably not be a very large one once you factor in all the other stuff that needs doing.
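As a rough illustration of 'same functionality', here is a minimal sketch that asks the same question ('which of these fds are readable?') through both interfaces, using Python's `select` module on Linux. The helper names `ready_with_poll` and `ready_with_epoll` are made up for this example, and epoll is used in its default level-triggered mode.

```python
import select
import socket

def ready_with_poll(fds, timeout_ms=1000):
    """Return the set of fds that poll() reports as readable."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll.poll() takes its timeout in milliseconds.
    return {fd for fd, ev in poller.poll(timeout_ms) if ev & select.POLLIN}

def ready_with_epoll(fds, timeout_ms=1000):
    """Return the set of fds that epoll reports as readable (Linux only)."""
    ep = select.epoll()
    try:
        for fd in fds:
            ep.register(fd, select.EPOLLIN)  # level-triggered by default
        # epoll.poll() takes its timeout in seconds.
        return {fd for fd, ev in ep.poll(timeout_ms / 1000) if ev & select.EPOLLIN}
    finally:
        ep.close()

# Two connected socket pairs; only the first one has data waiting.
a_read, a_write = socket.socketpair()
b_read, b_write = socket.socketpair()
a_write.send(b"ping")

watched = [a_read.fileno(), b_read.fileno()]
poll_ready = ready_with_poll(watched)
epoll_ready = ready_with_epoll(watched)
```

Both calls report the same single readable fd. The interesting differences are in how registration is amortised over many calls and many fds, not in what the program learns from the kernel.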