This paper is a quantitative study of Border Gateway Protocol (BGP) misconfigurations, based on an analysis of configuration errors over a three-week period. The authors conclude that although only about 1 in 25 misconfigurations actually affects connectivity, most errors could be prevented through better router design or protocol revision.
BGP is a policy-based protocol in which autonomous systems (ASes) exchange routing information with their neighbors. The authors divide misconfigurations into two categories: origin misconfiguration and export misconfiguration. An example of the former is an AS injecting a wrong prefix into the global BGP tables; in the latter, a router exports a route that its policy should have filtered out.
The effects of misconfiguration are discussed in three respects: routing load, connectivity disruption, and policy violation. New routes lasting less than a day are treated as candidate misconfigurations, and external vantage points and email surveys of administrators are used to check whether each one actually caused any of these three effects. The authors address selection bias by comparing the results of the email surveys with those from the vantage points, which keeps it to a minimum.
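The "lasting less than a day" filter can be illustrated with a minimal sketch. It is a toy model, not the authors' tooling: the event tuple layout, the function name `short_lived_routes`, and the example prefixes and AS numbers (taken from the ranges reserved for documentation) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Threshold from the review: routes that last less than a day are candidates.
SHORT_LIVED = timedelta(days=1)

def short_lived_routes(events, threshold=SHORT_LIVED):
    """Return (prefix, origin_as, lifetime) for routes withdrawn before `threshold`.

    `events` is a time-ordered iterable of (timestamp, kind, prefix, origin_as),
    where kind is "A" (announce) or "W" (withdraw). This layout is hypothetical.
    """
    active = {}       # prefix -> (first announcement time, origin AS)
    candidates = []
    for ts, kind, prefix, origin in events:
        if kind == "A":
            # Keep the first announcement time if the prefix is already active.
            active.setdefault(prefix, (ts, origin))
        elif kind == "W" and prefix in active:
            start, seen_origin = active.pop(prefix)
            lifetime = ts - start
            if lifetime < threshold:
                candidates.append((prefix, seen_origin, lifetime))
    return candidates
```

A route flagged this way is only a candidate; in the paper's method, the survey and vantage-point checks decide whether it was a real misconfiguration.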
The physical links and business relationships between ASes are usually commercial secrets, so Gao's algorithm was used to infer them. It is based on three rules: 1) every valid AS path is valley-free, meaning that once routing information starts going downward (e.g. from provider to customer) it does not go back up, 2) an AS path can have at most one peer-to-peer edge, 3) ASes with more neighbors are more likely to be providers.
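The first two rules can be read as a validity check on a path whose edges are already labeled. The sketch below shows only that check, not Gao's inference algorithm itself (it does not use the neighbor-count heuristic of rule 3). The labels `c2p`, `p2p`, `p2c` and the function name `is_valley_free` are assumptions chosen for illustration.

```python
# Edge labels, read in the order the path is traversed:
#   c2p = customer-to-provider (uphill)
#   p2p = peer-to-peer (flat)
#   p2c = provider-to-customer (downhill)
UP, PEER, DOWN = "c2p", "p2p", "p2c"

def is_valley_free(edges):
    """Check that a labeled path is: uphill*, at most one peer edge, downhill*."""
    past_peak = False  # True once a peer or downhill edge has been crossed
    for edge in edges:
        if edge == UP:
            if past_peak:
                return False   # going back up after the peak forms a valley
        elif edge == PEER:
            if past_peak:
                return False   # a second peer edge, or a peer edge after downhill
            past_peak = True
        elif edge == DOWN:
            past_peak = True
        else:
            raise ValueError(f"unknown edge label: {edge!r}")
    return True
```

For example, `[UP, UP, PEER, DOWN]` passes, while `[DOWN, UP]` and `[PEER, PEER]` are rejected.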
Results: Each day, misconfigurations affect the equivalent of 0.2-1.0% of the global BGP table size, which means that up to 3 in 4 new route announcements per day are misconfigurations. Connectivity was surprisingly robust: it was disrupted by only 4% of the misconfigured announcements, or 13% of the misconfiguration incidents. The extra load on routers is more noticeable: up to 2% of the time, misconfigurations account for more than 10% of the total update load. In an extreme case, they exceeded 60% of the update load when averaged over 15 minutes.
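To make the "averaged over 15 minutes" figure concrete, here is a minimal sketch of how a per-window load share could be computed. The input layout (a Unix timestamp plus a flag marking updates attributed to misconfiguration) and the function name `misconfig_load_share` are assumptions for illustration, not the paper's actual measurement pipeline.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # 15-minute averaging window

def misconfig_load_share(updates, window=WINDOW_SECONDS):
    """Map each window start time to the fraction of updates due to misconfiguration.

    `updates` is an iterable of (unix_seconds, is_misconfig) pairs (hypothetical layout).
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for ts, is_misconfig in updates:
        start = ts - ts % window   # align the timestamp to its window start
        total[start] += 1
        if is_misconfig:
            bad[start] += 1
    return {start: bad[start] / total[start] for start in sorted(total)}
```

The fraction of windows whose share exceeds 0.10 would then correspond to the "percent of the time" style of figure quoted above.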
The paper investigates the causes of misconfiguration and classifies the human errors into slips (errors in the execution of a correct plan, e.g. typos) and mistakes (errors in the plan itself, e.g. logic bugs). Noting that human operator error affects other systems as well, such as phone networks, aircraft, and bank databases, the authors suggest that BGP router configuration is far from error-free because of its command-based editing and its lack of high-level languages and checking. They suggest protocol revisions and extensions, and also propose actively alerting operators to wrong routing information.
Afterthoughts:
- This is a careful paper in which almost all statistical errors and biases are considered, which makes its conclusion that BGP is far from perfect very convincing.
- It's interesting that they conducted this experiment after Christmas, over the New Year. There's no explanation of why they chose this specific period, despite their careful, scientific setup. Could there be fewer or more routing updates during this period?