CS 268 Course Blog of Ko-Yang: 15. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

This paper addresses the TCP incast problem, in which inbound data overflows the switch buffers, by using high-resolution timers for TCP retransmission timeouts. When a client sends queries to multiple servers in parallel, it cannot make forward progress until it has received a response from every server; the paper refers to these as barrier-synchronized requests.

TCP incast collapse occurs under the following conditions:

high-bandwidth, low-latency networks with small buffers in the switches
clients issue barrier-synchronized requests in parallel
servers respond with a small amount of data per request
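
To see why one lost response hurts so much under these conditions, here is a minimal sketch of a barrier-synchronized request. The function names and all the numbers (a 1 ms normal response, a 200 ms coarse timeout, a 1 ms fine-grained timeout) are my own illustrative assumptions, not measurements from the paper. The only point is that the request finishes when the slowest server does, so a single timeout sets the completion time for everyone.

```python
def barrier_completion_time(response_times):
    """A barrier-synchronized request completes only when the slowest response arrives."""
    return max(response_times)


def request_time(n_servers, normal_s, rto_s, n_timeouts):
    """Completion time when `n_timeouts` servers each wait one RTO before resending.

    Hypothetical model: every server normally answers in `normal_s` seconds,
    and a server whose packet was dropped answers after `normal_s + rto_s`.
    """
    times = [normal_s] * n_servers
    for i in range(min(n_timeouts, n_servers)):
        times[i] = normal_s + rto_s
    return barrier_completion_time(times)


if __name__ == "__main__":
    # Assumed values for illustration only.
    servers, normal = 8, 0.001
    for label, rto in (("coarse 200 ms RTO", 0.200), ("fine 1 ms RTO", 0.001)):
        t = request_time(servers, normal, rto, n_timeouts=1)
        print(f"{label}: request completes in {t * 1000:.1f} ms")
```

In this toy model the link sits idle for almost the whole coarse timeout, which is the collapse in goodput that the paper is trying to avoid.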

By enabling microsecond-granularity retransmission timeouts (RTO), the authors aim to solve the incast problem commonly seen in data centers:

modify the TCP implementation to use high-resolution kernel timers, with timeout = (RTO + rand(0.5) * RTO) * 2^backoff
prevent TCP incast collapse for up to 47 concurrent senders
the fine-grained recovery does not hurt performance in data centers
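
The randomized timeout in the list above can be sketched in a few lines. This is only my reading of the formula: I assume rand(0.5) means a uniform random value between 0 and 0.5, and the function name and parameters are hypothetical, not taken from the paper's kernel patch.

```python
import random


def retransmission_timeout(rto_s, backoff, rng=None):
    """Compute timeout = (RTO + rand(0.5) * RTO) * 2^backoff.

    Assumption: rand(0.5) is drawn uniformly from [0, 0.5].
    `rto_s` is the current RTO estimate in seconds, and `backoff` is the
    number of consecutive timeouts so far (exponential backoff exponent).
    """
    rng = rng or random
    jitter = rng.uniform(0.0, 0.5) * rto_s
    return (rto_s + jitter) * (2 ** backoff)


if __name__ == "__main__":
    rng = random.Random(268)  # seeded only so that runs are repeatable
    rto = 200e-6  # assumed 200 microsecond RTO estimate, for illustration
    for backoff in range(4):
        t = retransmission_timeout(rto, backoff, rng)
        print(f"backoff={backoff}: timeout = {t * 1e6:.0f} us")
```

The random term spreads out the retransmissions, so that senders that timed out together do not all resend at the same instant and overflow the switch buffer again.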

Comment: This takes a different approach to solving problems in data centers. The other papers we read strove for backward compatibility, while this one modifies the TCP stack. Since a data center is usually confined to a single location and typically managed by one company, I think this is a feasible way to solve TCP incast collapse.


Wednesday, September 23, 2009
