TCP incast collapse:
- high-bandwidth, low-latency networks with small buffers in the switches
- clients issue barrier-sync requests in parallel
- servers respond with a small amount of data per request
By enabling microsecond-granularity retransmission timeouts (RTO), the authors aim to solve the incast problem commonly seen in data centers:
- modify the TCP implementation to use high-resolution kernel timers, with timeout = 2^backoff * (RTO + rand(0.5) * RTO)
- prevent TCP incast collapse for up to 47 concurrent senders
- recovery in data centers does not affect performance
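
The randomized timeout formula in the list above can be sketched in a few lines of Python. This is only an illustration, not the authors' kernel code: the function name `retransmission_timeout`, the microsecond unit, and the reading of rand(0.5) as a uniform draw from [0, 0.5] are my assumptions.

```python
import random


def retransmission_timeout(rto_us, backoff, rng=random):
    """Return a randomized, exponentially backed-off timeout in microseconds.

    Computes 2^backoff * (RTO + rand(0.5) * RTO), where rand(0.5) is
    assumed to be a uniform draw from [0, 0.5].
    """
    if rto_us <= 0 or backoff < 0:
        raise ValueError("rto_us must be positive and backoff non-negative")
    # Random jitter of up to half the RTO, so that senders that timed out
    # together do not all retransmit at the same instant.
    jitter = rng.uniform(0.0, 0.5) * rto_us
    return (2 ** backoff) * (rto_us + jitter)


if __name__ == "__main__":
    # Hypothetical base RTO of 200 microseconds, four successive timeouts.
    for backoff in range(4):
        print(backoff, round(retransmission_timeout(200.0, backoff), 1))
```

Each call returns a value between 2^backoff * RTO and 2^backoff * 1.5 * RTO, so the spread between senders grows with every successive timeout.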
Comment: This is a different approach to solving the problems in data centers. The other papers we read strove for backward compatibility, while this one modifies the TCP stack. Since a data center is usually confined to a fixed location and managed by a single company, I guess it is a feasible way to solve TCP incast collapse.
Although modifying the TCP stack might be OK for data centers, it seems that one condition for a solution to the incast problem is that it should generalize to the wide area network, which makes backward compatibility more important.