The Case of the Flapping BGP Session

Checking the health of your Internet Router

A BGP speaker on one router sets up a session with a BGP speaker on another router to exchange internet routing information. The prefixes are exchanged, imported, analysed and, if necessary, adjusted to provide good cost-to-quality connectivity. Sometimes this process goes wrong, and customers experience BGP session ‘flapping’ or instability. Every now and then we receive questions about this, so we thought we would share a recent case, along with our advice.

The Case of the Flapping BGP Session
The following example, based on a recent customer case, gives an insight into what can go wrong:

  • The customer was experiencing problems while loading prefixes, and although time-outs were happening, they were no longer being flagged. Traffic was re-routing in a loop between half-started sessions and alternative paths, causing longer loading times and occasional failures.
  • The network had 3 upstreams, with peering and sub-networks connected.
  • The customer’s sub-networks had 90 IPv4 and 10 IPv6 prefixes.

Let’s analyse what this means in terms of the number of prefixes that the network’s router has to process:

  • NL-ix peering route servers currently serve around 130,000 IPv4 and 35,000 IPv6 prefixes. To ensure redundancy, the network connects to both route servers.
  • Our global BGP connectivity (Joint Transit) provides all the prefixes worldwide and consists of around 840,000 IPv4 and 110,000 IPv6 prefixes. For redundancy, a user is offered sessions to two individual transit routers.
  • For additional quality routes, a session to a Partial Transit (Open Peering) service is configured. Let’s assume around 420,000 IPv4 and 53,000 IPv6 prefixes.
  • The remaining upstreams are global BGP connectivity, i.e. 840,000 IPv4 and 110,000 IPv6 prefixes each.

Based on these figures we can calculate the total number of prefixes:

  • Joint Transit 1: 950,000
  • Joint Transit 2: 950,000
  • Open Peering: 473,000
  • NL-ix RS1: 165,000
  • NL-ix RS2: 165,000
  • Upstream x: 950,000
  • Upstream y: 950,000
  • Sub-networks: 100
  • Total: 4,603,100

We looked at the router and found the problem. Many popular enterprise and service-provider routers can only handle 4 million entries in the RIB (routing information base) and 1 million entries in the FIB (forwarding information base). Because the customer’s last BGP session pushed the total past the 4 million entry RIB capacity, the router’s RIB was being overloaded.
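
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every received prefix is counted as one RIB entry, and the sessions are assumed to come up in the order listed, which the case itself does not state. The names `SESSIONS`, `RIB_LIMIT` and `rib_budget` are made up for this example.

```python
# Prefix counts per session, taken from the figures above (IPv4 + IPv6).
# Assumption: sessions come up in this order; the real order is not known.
SESSIONS = [
    ("Joint Transit 1", 840_000 + 110_000),
    ("Joint Transit 2", 840_000 + 110_000),
    ("Open Peering", 420_000 + 53_000),
    ("NL-ix RS1", 130_000 + 35_000),
    ("NL-ix RS2", 130_000 + 35_000),
    ("Upstream x", 840_000 + 110_000),
    ("Upstream y", 840_000 + 110_000),
    ("Sub-networks", 90 + 10),
]

RIB_LIMIT = 4_000_000  # the 4 million entry RIB figure quoted above


def rib_budget(sessions, limit):
    """Return the total prefix count and the first session that exceeds the limit.

    The second value is None when the running total never exceeds the limit.
    """
    total = 0
    first_over = None
    for name, prefixes in sessions:
        total += prefixes
        if first_over is None and total > limit:
            first_over = name
    return total, first_over


total, first_over = rib_budget(SESSIONS, RIB_LIMIT)
print(f"Total prefixes: {total:,}")
print(f"Over the RIB limit by: {total - RIB_LIMIT:,}")
print(f"First session over the limit: {first_over}")
```

With the figures from this case the running total stays under 4 million until the last full-table upstream is added, which matches what we saw on the router.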

Because of the RIB overload, the BGP session was continuously resetting while trying to reload all the prefixes, resulting in an unstable ‘flapping’ BGP session.

As an interim fix, we were able to limit the number of prefixes so that the total did not exceed 4 million. For Joint Transit we were able to change the session to a default-only route. In the medium term, this allowed the customer to upgrade to a more modern router platform with higher capacity.
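
To see why a default-only session helps, the sketch below recomputes the total after reducing Joint Transit to default-only. It is an illustration, not a record of the exact change made: it assumes a default-only session carries one IPv4 and one IPv6 default route, and it shows both the one-session and two-session variants because the number of sessions changed is an assumption here. The names `full_tables`, `DEFAULT_ONLY_ROUTES` and `with_default_only` are hypothetical.

```python
RIB_LIMIT = 4_000_000  # the 4 million entry RIB figure quoted in the case

# Per-session prefix counts from the case (IPv4 + IPv6 combined).
full_tables = {
    "Joint Transit 1": 950_000,
    "Joint Transit 2": 950_000,
    "Open Peering": 473_000,
    "NL-ix RS1": 165_000,
    "NL-ix RS2": 165_000,
    "Upstream x": 950_000,
    "Upstream y": 950_000,
    "Sub-networks": 100,
}

# Assumption: a default-only session sends one IPv4 and one IPv6 default route.
DEFAULT_ONLY_ROUTES = 2


def with_default_only(sessions, names):
    """Return a copy of sessions with the named sessions reduced to default-only."""
    reduced = dict(sessions)
    for name in names:
        if name not in reduced:
            raise KeyError(f"unknown session: {name}")
        reduced[name] = DEFAULT_ONLY_ROUTES
    return reduced


one = with_default_only(full_tables, ["Joint Transit 1"])
both = with_default_only(full_tables, ["Joint Transit 1", "Joint Transit 2"])

for label, table in (("full tables", full_tables),
                     ("one default-only", one),
                     ("both default-only", both)):
    total = sum(table.values())
    status = "within" if total <= RIB_LIMIT else "over"
    print(f"{label}: {total:,} entries ({status} the RIB limit)")
```

Under these assumptions, switching even one Joint Transit session to default-only brings the total back under 4 million, at the cost of less specific routing information from that session.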

Global Prefix Growth
This issue is becoming more common due to the sharp recent growth in the number of prefixes for global connectivity. This is largely due to a shortage of IPv4 address space. More and smaller address blocks are being announced with longer prefixes, and the IP space is becoming more and more fragmented. At the same time, the number of IPv6 prefixes is growing overall as IPv6 is deployed more widely.

Other Reasons for Flapping
There are many reasons for flapping, including:

  • prefix limits (configuration)
  • MTU size (configuration)
  • unreliable Ethernet connection (link up/down)
  • errors on the line (packet loss)
  • congestion (packet loss)
  • too many BGP sessions (software/hardware limits)
  • CPU processing and memory issues (software/hardware specification)
  • software behaviour/decisions with certain communities, attributes and/or next-hop (software bug, interoperability issue)

To diagnose, we start with troubleshooting based on logging/counters from both sides. This often helps reveal the root of the problem.
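
As a rough illustration of working from logs, the sketch below counts how often each peer leaves the Established state. The event list is invented sample data using documentation addresses, not output from a real router, and the tuple layout `(seconds, peer, state)` is an assumption; real logs would first need to be parsed into this shape.

```python
from collections import Counter

# Invented sample events: (seconds since start, peer address, new BGP state).
EVENTS = [
    (0, "192.0.2.1", "Established"),
    (10, "198.51.100.1", "Established"),
    (40, "192.0.2.1", "Idle"),
    (75, "192.0.2.1", "Established"),
    (120, "192.0.2.1", "Idle"),
    (160, "192.0.2.1", "Established"),
]


def count_flaps(events):
    """Count, per peer, the transitions from Established to any other state."""
    last_state = {}
    flaps = Counter()
    for _, peer, state in sorted(events):  # process in time order
        if last_state.get(peer) == "Established" and state != "Established":
            flaps[peer] += 1
        last_state[peer] = state
    return flaps


flaps = count_flaps(EVENTS)
for peer in sorted({peer for _, peer, _ in EVENTS}):
    print(f"{peer}: {flaps[peer]} flap(s)")
```

A peer with a high count relative to the others is a good place to start comparing counters and log messages from both sides of the session.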

Working around the problem
Setting up new BGP sessions that are certain to exceed the maximum RIB capacity of a router can have unpredictable results. RIB and FIB capacities are generally set out in product and service specifications. If not, you can raise a trouble ticket with your supplier/vendor to ascertain the RIB and FIB specifications.

If you are unsure about this and want to avoid flapping BGP sessions, we suggest you get in touch with us to ask what we can do to limit the size of your RIB. There are things we can do to mitigate the problem while you make a case or wait for the CAPEX needed for new router platforms.