TCP is a streaming protocol. It is most efficient at sending large amounts of “stream” type data, but the best thing about TCP is that it is reliable: if you send data over TCP, then unless there is a catastrophic network failure you can be sure it will arrive at the other end, in order. UDP, on the other hand, is better suited to small packet-based messages, such as request/response communication. However, UDP messages can be dropped anywhere along the path, or they may arrive out of order, so if you send a UDP message, don’t count on it arriving at its destination. There are many ways to “fix” UDP to make it reliable, and most of them involve implementing some TCP-like framework on top of UDP. Unfortunately, re-implementing TCP’s reliability in application code tends to be inefficient and error-prone.
Below I will talk about a simple way to frame TCP messages and still get high throughput.
A simpler approach
It is much easier to implement a message framer on TCP than it is to implement TCP on UDP. Here are the problem areas that need to be handled for a complete solution:
- Frame the transmitted messages, remembering that a single send call does not have to send all of the bytes it is given.
- Recv may not return a full message, so be sure to call it until a full message has been received.
- Recv may also return MORE than a full message; in other words, it may receive a full message plus part of the next message(s).
To enable message framing, the easiest approach is to put a constant-size header on every message that indicates how many bytes of payload follow. The receiver reads the byte count from the header and then knows how many bytes to receive for that message. This is a simple approach, but if done incorrectly it can have a pretty bad impact on throughput.
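As a minimal sketch of such a header, assume it is nothing more than a 4-byte unsigned length in network byte order. That layout, and the helper names `frame` and `payload_length`, are my own choices for illustration; any fixed-size header would do. The concatenation in `frame` copies the payload, which is fine for showing the wire format but is exactly the copy that scatter/gather avoids.

```python
import struct

# Assumed header layout: one 4-byte unsigned length, network byte order.
HEADER = struct.Struct("!I")

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a fixed-size header carrying its length."""
    return HEADER.pack(len(payload)) + payload

def payload_length(header: bytes) -> int:
    """Read the payload length back out of one complete header."""
    (length,) = HEADER.unpack(header)
    return length
```

Because the header size never changes, the receiver always knows how many bytes to read before it can learn the size of the rest of the message.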
The following provides some detailed notes on each of the above situations with ideas on how to get the best performance.
Transmitting is pretty simple: keep calling the send function until all bytes that need to be sent have been sent. The only trick is to adjust pointers and byte counters to account for any bytes that have already been sent. The best approach is to use the scatter/gather facilities available in all major operating systems (or, better still, a cross-platform socket library such as Boost.Asio, which also supports scatter/gather).
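The send loop might look roughly like this in Python. The name `send_all` is hypothetical, and the sketch assumes a blocking socket. Slicing a `memoryview` plays the role of the pointer adjustment, since it does not copy the remaining bytes.

```python
import socket

def send_all(sock: socket.socket, data: bytes) -> None:
    """Keep calling send() until every byte has been accepted."""
    view = memoryview(data)      # slicing a memoryview does not copy
    sent = 0
    while sent < len(view):
        # send() may accept fewer bytes than offered; it returns the count.
        sent += sock.send(view[sent:])
```

Python's own `socket.sendall` performs the same loop for you; it is spelled out here only to make the bookkeeping visible.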
The idea is that you have a message queue that your application code adds messages to when it wants to send. A separate transmit task then pulls the queued messages off the queue and feeds them into the send call with scatter/gather. Conveniently, the messages are already in a list, so if done properly the list will not need much processing to make it ready for sending with scatter/gather.
Because this system uses scatter/gather, an application that needs to send a message comprising a message header, followed by 200 bytes of data at location X, followed by 800 more bytes of data at location Y, has no reason to memory copy all of that data into one contiguous packet before sending. It can simply add the header to the Tx queue (the header indicates that the message payload is 1000 bytes total), followed by a memory pointer to X, followed by a memory pointer to Y. The receiving side will then receive a 1000 byte payload as one message, and the sending application never made a single memory copy to assemble it.
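Here is a rough sketch of the header, X, Y example using Python's `socket.sendmsg`, which takes a list of buffers and is available on Unix-like systems. The 4-byte length header and the name `send_gathered` are assumptions of mine, and the parts are assumed to be plain bytes-like objects. A gathered send can also be partial, so the loop drops the buffers that went out whole and trims the one that was cut short.

```python
import socket
import struct

HEADER = struct.Struct("!I")   # assumed 4-byte length prefix

def send_gathered(sock, parts):
    """Send one framed message whose payload lives in several buffers."""
    total = sum(len(p) for p in parts)
    # Empty parts are skipped so every pending buffer has bytes to send.
    pending = [memoryview(HEADER.pack(total))]
    pending += [memoryview(p) for p in parts if len(p)]
    while pending:
        sent = sock.sendmsg(pending)
        while sent and pending:
            if sent >= len(pending[0]):
                sent -= len(pending[0])   # this buffer went out whole
                pending.pop(0)
            else:
                pending[0] = pending[0][sent:]   # partial: keep the rest
                sent = 0
```

Calling `send_gathered(sock, [x, y])` with a 200 byte `x` and an 800 byte `y` puts a header announcing 1000 bytes on the wire, followed by both pieces, without the application ever joining them.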
When a partial message is received, the receive call needs to be invoked again until a full message has been received, keeping in mind that a single receive call might not even return a full message header. So the process for this case is as follows:
- Call receive until at least a full message header has been received.
- Once a full header has been received, continue calling receive until all bytes in the message have been received.
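The two steps above can be sketched as follows, again assuming a 4-byte length header. The names `recv_exact` and `recv_message` are hypothetical helpers, not part of any library.

```python
import socket
import struct

HEADER = struct.Struct("!I")   # assumed 4-byte length prefix

def recv_exact(sock, count):
    """Call receive repeatedly until exactly `count` bytes have arrived."""
    buf = bytearray(count)
    view = memoryview(buf)
    got = 0
    while got < count:
        n = sock.recv_into(view[got:])   # may return fewer bytes than asked
        if n == 0:
            raise ConnectionError("peer closed the connection mid-message")
        got += n
    return bytes(buf)

def recv_message(sock):
    """Read one full header, then the number of payload bytes it names."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

This version never reads past the end of the current message, which keeps it easy to follow but costs at least two receive calls per message.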
Simple, right? Not quite so simple if you care about performance. For receive to work well, it is best not to limit the number of bytes that can be received at once. Here too it is best to split the receive function out into its own thread, call receive with a large buffer (say 2-4 kbytes per call), and have your message framer do the real work of parsing out the data.
So an improved method for receive is as follows:
- Constantly call receive with a large buffer.
- Pass any received data to a framer for parsing.
- The framer moves pointers, and does memory copies when necessary as described below.
Ideally you also want to avoid or reduce memory copies, so the Rx message framer should do little more than move pointers around, but that doesn’t always work. Take the case where 20 kbytes of memory are reserved for receiving data. Every time receive gets some data, the memory pointer is advanced by the number of bytes received. Eventually you reach the end of the reserved memory, and you may have an incomplete message right at the end of it. In this case you have no choice but to take some drastic action and use a memory copy; maybe you just copy the incomplete message to the start of the 20 kbyte buffer. However, care must be taken not to overwrite data that is still needed by the application.
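A framer along those lines might be sketched like this. The class name `Framer`, its method names, and the 4-byte length header are all assumptions for illustration, and every message is assumed to fit in the reserved buffer. For simplicity the sketch copies each completed message out of the buffer, which sidesteps the overwrite problem at the price of one copy per message.

```python
import struct

HEADER = struct.Struct("!I")   # assumed 4-byte length prefix

class Framer:
    """Parses length-prefixed messages out of one fixed receive buffer."""

    def __init__(self, capacity=20 * 1024):
        self.buf = bytearray(capacity)
        self.start = 0   # first byte not yet handed to the application
        self.end = 0     # one past the last byte received

    def writable(self):
        """Return the free region that the next receive call should fill."""
        if self.end == len(self.buf):
            # Out of room: copy the incomplete tail to the buffer's start.
            tail = self.end - self.start
            self.buf[:tail] = self.buf[self.start:self.end]
            self.start, self.end = 0, tail
        return memoryview(self.buf)[self.end:]

    def received(self, nbytes):
        """Record nbytes of new data; return every message now complete."""
        self.end += nbytes
        messages = []
        while self.end - self.start >= HEADER.size:
            (length,) = HEADER.unpack_from(self.buf, self.start)
            if HEADER.size + length > len(self.buf):
                raise ValueError("message is larger than the receive buffer")
            body = self.start + HEADER.size
            if self.end - body < length:
                break                    # body incomplete, wait for more
            messages.append(bytes(memoryview(self.buf)[body:body + length]))
            self.start = body + length
        if self.start == self.end:
            self.start = self.end = 0    # drained: rewind without copying
        return messages
```

A receive loop would call `n = sock.recv_into(framer.writable())` and then `framer.received(n)`. The only copy of raw received data happens when the write position hits the end of the buffer with a partial message still pending.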
Frequently what I do is keep a memory pool of buffers (using a thread safe queue). Each buffer is the size of the largest message that might ever be sent or received, assuming the largest message is somewhere on the order of kilobytes. I then frame each buffer as described above, with any incomplete data copied into the next buffer I get from the pool. The application calls an application level recv function to get framed buffers that are full of data, and when the application is done with a buffer it is returned to the pool. The application level recv function does little more than pull a fully framed message off another thread safe queue.
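A bare-bones version of such a pool can be built on Python's `queue.Queue`, which is already thread safe. The class name `BufferPool` and its `acquire`/`release` methods are hypothetical, and the sketch leaves out the automatic return of buffers.

```python
import queue

class BufferPool:
    """A fixed set of reusable buffers handed out via a thread safe queue."""

    def __init__(self, count, size):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def acquire(self, timeout=None):
        """Take a free buffer, blocking while all of them are in use."""
        return self._free.get(timeout=timeout)

    def release(self, buf):
        """Hand a buffer back once the application is done with it."""
        self._free.put(buf)
```

Blocking in `acquire` also gives a crude form of back pressure: if the application stops returning buffers, the Rx task stops reading from the socket instead of allocating without bound.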
If your application needs high throughput with TCP, then size matters. Don’t worry so much about the MTU: exceed it, ignore it, and let the TCP stack split the stream into segments. Use scatter/gather and large buffers with as much data in them as you can fit, and you will be amazed when your network monitor shows ~98% utilization.
Just a quick recap of the overall data flow:
- Application queues buffers to be transmitted into a thread safe queue, avoiding memory copies by taking advantage of scatter/gather.
- Tx task pulls messages off queue, and sends them until everything is sent.
- Tx task waits for more buffers from the application.
- Rx task gets a (large) buffer from a buffer pool.
- Rx task receives data from socket into buffer.
- Rx task passes the newly received data to the message framer.
- Message framer places fully framed buffer into another thread safe queue for delivery to application.
- Application calls function to get any framed buffers that may be available.
- When application is done with the buffer it is returned to the buffer pool for reuse.
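To tie the recap together, here is a deliberately simplified end-to-end sketch with a Tx task, an Rx task, and two thread safe queues. All names are hypothetical, the header is assumed to be a 4-byte length, and a `None` on a queue is my own shutdown convention. To stay short it sends with two plain `sendall` calls instead of scatter/gather, and it frames out of a growing `bytearray` instead of pooled buffers.

```python
import queue
import socket
import struct
import threading

HEADER = struct.Struct("!I")   # assumed 4-byte length prefix

def tx_task(sock, tx_queue):
    """Pull payloads off the Tx queue and send each behind its header."""
    while True:
        payload = tx_queue.get()           # waits for the application
        if payload is None:                # hypothetical shutdown sentinel
            sock.shutdown(socket.SHUT_WR)
            return
        sock.sendall(HEADER.pack(len(payload)))
        sock.sendall(payload)

def rx_task(sock, rx_queue, chunk_size=4096):
    """Receive large chunks, frame them, and queue each full message."""
    pending = bytearray()
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:                      # peer has finished sending
            rx_queue.put(None)
            return
        pending += chunk
        while len(pending) >= HEADER.size:
            (length,) = HEADER.unpack_from(pending)
            end = HEADER.size + length
            if len(pending) < end:
                break                      # wait for the rest of the body
            rx_queue.put(bytes(pending[HEADER.size:end]))
            del pending[:end]

def run_demo(messages):
    """Push messages through a local socket pair; return what arrives."""
    a, b = socket.socketpair()
    tx_queue, rx_queue = queue.Queue(), queue.Queue()
    tasks = [threading.Thread(target=tx_task, args=(a, tx_queue)),
             threading.Thread(target=rx_task, args=(b, rx_queue))]
    for t in tasks:
        t.start()
    for m in messages:
        tx_queue.put(m)
    tx_queue.put(None)
    received = []
    while (m := rx_queue.get()) is not None:
        received.append(m)
    for t in tasks:
        t.join()
    a.close()
    b.close()
    return received
```

The application side only ever touches the two queues; everything that deals with partial sends and partial receives is confined to the two tasks.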
Future topics: an easy to use and extend thread safe queue, and how to write a RAII buffer pool with automatic buffer return.