fixup! net-tcp-bbr1: for testing, a copy of BBRv1 #10

eiffel-fl · 2024-08-28T13:28:42Z

Hi!

First, thank you for all the work on BBR, as well as the improvements coming with this version!

I had some trouble compiling the two BBR algorithms as modules and had to tweak the commit that adds the first version a bit.
If you feel this is relevant, just take the commit and squash it with the corresponding one.

Best regards.

This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd. Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68
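A minimal sketch of the check described above, run at the start of sendmsg, ACK, or timer handling. The `Conn` class and its field names are hypothetical stand-ins rather than the kernel's data structures, and the exact set of conditions is a simplifying assumption:

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical, simplified connection state (not the kernel's tcp_sock)."""
    mss: int                   # bytes per segment
    cwnd: int                  # congestion window, in packets
    inflight: int              # packets currently in flight
    unsent_bytes: int          # data queued by the application but not yet sent
    lost_unretransmitted: int  # lost packets still waiting for a retransmit
    delivered: int = 0         # packets delivered so far
    app_limited: int = 0       # 0 means "not app-limited"


def check_app_limited(c: Conn) -> None:
    """Call BEFORE the event handler changes cwnd, so that the comparison
    of inflight against cwnd still reflects the old cwnd."""
    bubble = (
        c.unsent_bytes < c.mss           # less than one segment left to send
        and c.inflight < c.cwnd          # inflight is not fully using the cwnd
        and c.lost_unretransmitted == 0  # no retransmits are pending
    )
    if bubble:
        # Remember the point in the flow up to which samples are app-limited.
        c.app_limited = (c.delivered + c.inflight) or 1
```

Calling `check_app_limited()` first in the handler is the design point: once the handler has adjusted `cwnd`, the `inflight < cwnd` comparison no longer says anything about the window that was in effect while the data was sent.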

…ree up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c
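For reference, the arithmetic behind the limit can be checked with a short sketch. The helper names are illustrative only. The exact ceiling is 2^32 µs, about 4295 seconds; the "2^12" figure comes from approximating 10^6 by 2^20:

```python
U32_MASK = 0xFFFFFFFF  # a 32-bit unsigned counter wraps modulo 2**32


def max_interval_seconds() -> float:
    """Longest interval a 32-bit microsecond counter can represent."""
    return (1 << 32) / 1_000_000


def elapsed_us(now_us: int, then_us: int) -> int:
    """Difference of two 32-bit microsecond stamps, tolerant of one wrap."""
    return (now_us - then_us) & U32_MASK
```

The masked subtraction stays correct across a single wrap of the counter, which is why intervals shorter than the ceiling are unaffected by the narrower field.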

… in rate_sample CC algorithms may want to snapshot the number of packets in flight at transmit time and pass in rate_sample, to understand the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking. We split out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a
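A rough sketch of the snapshot idea: record, per skb, how many segments were in flight once that skb was sent. The function name and the 20-bit cap are assumptions made for illustration, not taken from the patch:

```python
ASSUMED_IN_FLIGHT_MAX = (1 << 20) - 1  # assumed field width, for illustration


def snapshot_tx_in_flight(packets_in_flight: int, pcount: int,
                          max_value: int = ASSUMED_IN_FLIGHT_MAX) -> int:
    """Segments in flight after sending an skb that carries `pcount` segments.

    The result is clamped so that it fits a fixed-width per-skb field.
    """
    return min(packets_in_flight + pcount, max_value)
```

Keeping this in one helper means a "fake send" path, which queues skbs without transmitting them, can record the same value as the normal transmit path.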

For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5

For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3

…ck API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost. This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time. Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a

When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb). Normally having a SACKed skb with contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc.

Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit:

- skb N is SACKed
- skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
- tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is
- so prev skb can have pcount > tx.in_flight
- RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost
- find pcount of skb N is greater than its tx.in_flight

I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire in production machines using bbr2.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55
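The contradictory state can be reproduced with a toy model. The `Skb` class and both helpers below are hypothetical, and the adjustment shown is only one plausible way to keep the two fields consistent; it is not claimed to be the exact change in this commit:

```python
from dataclasses import dataclass


@dataclass
class Skb:
    """Toy stand-in for an skb: segment count plus the inflight snapshot."""
    pcount: int        # segments carried by this skb
    tx_in_flight: int  # segments in flight right after this skb was sent


def shift_without_adjustment(prev: Skb, skb: Skb, n: int) -> None:
    """Move n segments from skb into prev, leaving tx_in_flight untouched."""
    prev.pcount += n
    skb.pcount -= n


def shift_with_adjustment(prev: Skb, skb: Skb, n: int) -> None:
    """Same shift, but move the inflight accounting along with the segments."""
    shift_without_adjustment(prev, skb, n)
    skb.tx_in_flight = max(skb.tx_in_flight - n, 0)
    prev.tx_in_flight += n


def is_contradictory(skb: Skb) -> bool:
    """An skb cannot carry more segments than were in flight after it was sent."""
    return skb.pcount > skb.tx_in_flight
```

With skb N sent first (`pcount=1`, `tx_in_flight=1`) and skb N+1 sent next (`pcount=1`, `tx_in_flight=2`), shifting one segment without adjustment leaves skb N with `pcount=2` but `tx_in_flight=1`, which is the state described in the sequence above.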

When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect. Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Origin-9xx-SHA1: a0eb099690af net-tcp_bbr: v2: fix tcp_fragment() tx.in_flight recomputation [prod feb 8 2021; use as a fixup] Origin-9xx-SHA1: 885503228153ff0c9114e net-tcp_bbr: v2: introduce tcp_skb_tx_in_flight_is_suspicious() helper for warnings Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d
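To see why an inflated tx.in_flight leads to an inflated inflight_hi, consider the following sketch. It assumes the estimate is derived by finding the inflight level at which the loss rate would have crossed a threshold; the formula, the function signature, and the 2% threshold used in the example are assumptions for illustration, not a copy of bbr2_inflight_hi_from_lost_skb():

```python
def inflight_hi_from_lost_skb(tx_in_flight: int, pcount: int,
                              lost: int, loss_thresh: float) -> float:
    """Estimate the inflight level at which losses crossed loss_thresh.

    inflight_prev and lost_prev describe the flight just before the lost
    skb was sent. lost_prefix solves
        (lost_prev + x) / (inflight_prev + x) == loss_thresh
    for x, the portion of the lost skb that was still "acceptable".
    """
    inflight_prev = tx_in_flight - pcount
    lost_prev = lost - pcount
    lost_prefix = (loss_thresh * inflight_prev - lost_prev) / (1.0 - loss_thresh)
    return inflight_prev + max(lost_prefix, 0.0)
```

For example, with `pcount=2`, `lost=2`, and a threshold of 0.02, a correct `tx_in_flight` of 102 gives an estimate just above 102, while a stale value of 104 gives an estimate just above 104: any error in tx.in_flight passes straight through to inflight_hi.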

Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24
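The intended split between the two flags can be sketched as follows. The bit values are assumed for illustration, and `receives_ce_events()` / `changes_ecn_behavior()` are hypothetical helpers, not kernel functions:

```python
from enum import IntFlag


class CongFlags(IntFlag):
    """Illustrative flag bits; the numeric values are assumptions."""
    NEEDS_ECN = 0x2
    WANTS_CE_EVENTS = 0x4


def receives_ce_events(flags: CongFlags) -> bool:
    """CE events are delivered if either bit is set."""
    return bool(flags & (CongFlags.NEEDS_ECN | CongFlags.WANTS_CE_EVENTS))


def changes_ecn_behavior(flags: CongFlags) -> bool:
    """Only NEEDS_ECN affects ECN behavior elsewhere in the stack."""
    return bool(flags & CongFlags.NEEDS_ECN)
```

A module that sets only `WANTS_CE_EVENTS` gets the notifications without opting in to the side effects tied to `NEEDS_ECN`.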

Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016 and the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function. This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus large pacing rate and large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea

…cp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4
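A simplified model of the decision described above. The real __tcp_ack_snd_check() considers further conditions (for example quick-ACK mode and out-of-order data) that are deliberately left out here, and all parameter names are hypothetical:

```python
def should_ack_now(unacked_bytes: int, rcv_mss: int,
                   window_check_passes: bool, fast_ack_mode: int) -> bool:
    """Decide whether the receiver sends an ACK immediately.

    Without fast_ack_mode, more than one RCV.MSS of unacknowledged data is
    required AND the receive-window check must pass. With fast_ack_mode == 1
    the receive-window check is skipped.
    """
    if unacked_bytes <= rcv_mss:
        return False
    return fast_ack_mode == 1 or window_check_passes
```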

When sending a TLP retransmit, record whether the outstanding flight of data is application limited. This is important for congestion control modules that want to respond to losses repaired by TLP retransmits. This is important because the following scenarios convey very different information: (1) a packet loss with a small number of packets in flight; (2) a packet loss with the maximum amount of data in flight allowed by the CC module; Effort: net-tcp_bbr Change-Id: Ic8ae567caa4e4bfd5fd82c3d4be12a5d9171655e

Before this commit, when there is a packet loss that creates a sequence hole that is filled by a TLP loss probe, then tcp_process_tlp_ack() only informs the congestion control (CC) module via a back-to-back entry and exit of CWR. But some congestion control modules (e.g. BBR) do not respond to CWR events. This commit adds a new CA event with which the core TCP stack notifies the CC module when a loss is repaired by a TLP. This will allow CC modules that do not use the CWR mechanism to have a custom handler for such TLP recoveries. Effort: net-tcp_bbr Change-Id: Ieba72332b401b329bff5a641d2b2043a3fb8f632

Introduce is_acking_tlp_retrans_seq into rate_sample. This bool will export to the CC module the knowledge of whether the current ACK matched a TLP retransmit. Note that when this bool is true, we cannot yet tell (in general) whether this ACK is for the original or the TLP retransmit. Effort: net-tcp_bbr Change-Id: I2e6494332167e75efcbdc99bd5c119034e9c39b4

Define and implement a new per-route feature, RTAX_FEATURE_ECN_LOW. This feature indicates that the given destination network is a low-latency ECN environment, meaning both that ECN CE marks are applied by the network using a low-latency marking threshold and also that TCP endpoints provide precise per-data-segment ECN feedback in ACKs (where the ACK ECE flag echoes the received CE status of all newly-acknowledged data segments). This feature indication can be used by congestion control algorithms to decide how to interpret ECN signals over the given destination network. This feature is appropriate for datacenter-style ECN marking, such as the ECN marking approach expected by DCTCP or BBR congestion control modules. Signed-off-by: David Morley <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Tested-by: David Morley <[email protected]> Change-Id: I6bc06e9c6cb426fbae7243fc71c9a8c18175f5d3

BBR v3 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.

BBR v3 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data.

BBR v3 makes several key enhancements:

o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC.

o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model.

o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively).

o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data).

o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple: (bandwidth, inflight).
  The three operating points are:

    o latest: the latest measurement from the current round trip
    o upper bound: robust, optimistic, long-term upper bound
    o lower bound: robust, conservative, short-term lower bound

  These are stored in the following state variables:

    o latest: bw_latest, inflight_latest
    o lo: bw_lo, inflight_lo
    o hi: bw_hi[2], inflight_hi

  To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine:

    BBR param      CUBIC param
    -----------    -------------
    latest      ~  cwnd
    lo          ~  ssthresh
    hi          ~  last_max_cwnd

  The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight).

o It uses the three-part model to adapt the magnitude of its bandwidth probing to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v3 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available.

o It has a slightly revised state machine, to achieve the goals above.
    BBR_BW_PROBE_UP: pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN: drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try refill the pipe again to 100%, leaving queue empty

o The estimated BDP: BBR v3 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v3 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v3 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path.

o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v3 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this release for research and debugging. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space.

o Code: BBR v3 reuses many pieces from BBR v1. But it omits the following significant pieces:

    o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight())
    o long-term bandwidth estimator ("policer mode")

  The code layout tries to keep BBR v3 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v3 code.
o Docs: See the following docs for more details and diagrams describing the BBR v3 algorithm:

    https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
    https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

o Internal notes: For this upstream rebase, Neal started from:

    git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c

  then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1

Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
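Two pieces of arithmetic from the description above can be made concrete with a small sketch: the inflight_probe schedule [0, 1, 2, 4, 8, 16, ...] that bounds probing at inflight_hi + inflight_probe, and the BDP estimate as the product of the bandwidth estimate and min_rtt. The function names and units (packets per second, microseconds) are assumptions for illustration, not taken from tcp_bbr.c:

```python
from typing import List


def inflight_probe_steps(count: int) -> List[int]:
    """First `count` values of the probing schedule: 0, 1, 2, 4, 8, 16, ..."""
    steps = []
    probe = 0
    for _ in range(count):
        steps.append(probe)
        probe = 1 if probe == 0 else probe * 2
    return steps


def probe_bound(inflight_hi: int, inflight_probe: int) -> int:
    """Upper bound on inflight while probing after hitting a buffer limit."""
    return inflight_hi + inflight_probe


def estimated_bdp_packets(bw_pkts_per_sec: int, min_rtt_us: int) -> int:
    """Bandwidth-delay product in packets, rounded up."""
    return -((-bw_pkts_per_sec * min_rtt_us) // 1_000_000)
```

Starting the schedule at 0 keeps the first probing step at the previously observed limit, and the doubling lets the bound grow quickly once several steps pass without congestion signals.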

Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set. This is necessary for BBR, which when using ECN expects ECT to be set even on retransmitted packets and ACKs. Previous to this addition of TCP_ECN_ECT_PERMANENT, CCAs which can use ECN but don't "need" it did not have a way to indicate that ECT should be set on retransmissions/ACKs. Signed-off-by: Adithya Abraham Philip <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Change-Id: I8b048eaab35e136fe6501ef6cd89fd9faa15e6d2

Analogous to other important ECN information, export TCPI_OPT_ECN_LOW in tcp_info tcpi_options field. Signed-off-by: Neal Cardwell <[email protected]> Change-Id: I08d8d8c7e8780e6e37df54038ee50301ac5a0320

…s enabled This commit provides a kernel config file for GCE. It builds most (all?) of the available congestion control modules and uses bbr2 as the default. Tested: On GCE. Effort: net-test Change-Id: Ibc4dfdc119c804f1ad2853b3ee2c1c503bca01a9

… GCE machine This commit adds a script to build an upstream Linux kernel and install it and boot it on a Google Cloud (GCE) virtual machine.

Usage:

    ./gce-install.sh -m <MACHINE_IP>

e.g.:

    ./gce-install.sh -m 1.2.3.4
    ssh 1.2.3.4

Tested: On GCE.
Effort: net-test
Change-Id: I149233b802202335af93183728050aadb52cca2c

- runs a small set of simple tests
- sets up netem to emulate a configured network scenario
- runs /usr/bin/netperf and /usr/bin/netserver to generate traffic
- writes pcaps and ss logs
- analyzes test results
- generates graphs

Usage:

    cd gtests/net/tcp/bbr/nsperf/
    ./configure.sh
    ./run_tests.sh
    ./graph_tests.sh

Thanks to Jason Xing <[email protected]> for an included bug fix: 0e156e9

Effort: net-test
Change-Id: I38662f554b3c905aa79947a2c52a2ecfe3943f8c

Change-Id: I418eb97552991b29723137392e9a6aebe66f8b82

Change-Id: Iac1ffdd9eb84452eff22d7575b48a805f5f42284

…mation This commit provides a .patch file that is intended to be applied to an iproute2 source tree fetched from: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git The patch provides support to allow the "ss" command line tool to print the information exported by BBRv3.

…oute feature This commit provides a .patch file that is intended to be applied to an iproute2 source tree fetched from: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git The patch provides support to allow a new ecn_low per-route feature. Change-Id: Ifb7e9a3071ec51f1f08c3e760b9323c380dcc8eb

…fo tcpi_options TCPI_OPT_ECN_LOW bit is set This commit provides a .patch file that is intended to be applied to an iproute2 source tree fetched from: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git The patch provides support to show a new ecn_low per-route feature.

Change-Id: I53254e30ada20dae7a4e68d6e6e6a9ebbf356dee

Signed-off-by: Francis Laniel <[email protected]>

nealcardwell and others added 28 commits July 14, 2023 03:15

tcp: export TCPI_OPT_ECN_LOW in tcp_info tcpi_options field

a1d32ad

Analogous to other important ECN information, export TCPI_OPT_ECN_LOW in tcp_info tcpi_options field. Signed-off-by: Neal Cardwell <[email protected]> Change-Id: I08d8d8c7e8780e6e37df54038ee50301ac5a0320

net-tcp-bbr1: for testing, a copy of BBRv1

d0d8043

Change-Id: I418eb97552991b29723137392e9a6aebe66f8b82

net-test: udpate config.gce to recent kernel and enable BBR1 for testing

fea8e5a

Change-Id: Iac1ffdd9eb84452eff22d7575b48a805f5f42284

net-tcp_bbr: v3: add a README.md for TCP BBR v3 release

7542cc7

Change-Id: I53254e30ada20dae7a4e68d6e6e6a9ebbf356dee

fixup! net-tcp-bbr1: for testing, a copy of BBRv1

69142bf

Signed-off-by: Francis Laniel <[email protected]>

nealcardwell force-pushed the v3 branch 2 times, most recently from 0dcb177 to 90210de Compare March 18, 2025 12:55
