Opened 7 months ago

Closed 7 months ago

#24665 closed defect (fixed)

sched: In KIST, the extra_space kernel value needs to be allowed to be negative

Reported by: dgoulet Owned by: dgoulet
Priority: Very High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version: Tor: 0.3.2.1-alpha
Severity: Normal Keywords: tor-sched
Cc: pastly Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

KIST, when updating the TCP socket information, computes a limit on the number of bytes we are allowed to write to the socket of the given active channel.

First, the tcp_space tells us how much TCP buffer space we have in the kernel for this socket. The computation is below; I encourage anyone to go read the comment in update_socket_info_impl() to learn more about the why:

tcp_space = (ent->cwnd - ent->unacked) * (int64_t)(ent->mss);

After that, we compute some extra_space used to give the kernel a bit more data so that, when the ACKs come back for the packets accounted for in tcp_space, it can pull from that extra space instead of waiting on the scheduler to feed it more data. Here is how it is computed:

extra_space =
  clamp_double_to_int64(
    (ent->cwnd * (int64_t)ent->mss) * sock_buf_size_factor) - ent->notsent;

It uses the notsent value, which is the size of the kernel queue holding data *not* yet sent; that data is not reflected in the unacked value because it hasn't been put on the wire yet.
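For reference, the cwnd, unacked, mss and notsent values all come from the kernel. Here is a hypothetical sketch of how they can be read on Linux; it assumes a kernel recent enough to expose tcpi_notsent_bytes in struct tcp_info, and Tor's actual update_socket_info_impl() may obtain notsent through a different interface:

#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Sketch: read the per-socket TCP state discussed in this ticket. */
static int
read_kernel_tcp_state(int fd, uint32_t *cwnd, uint32_t *unacked,
                      uint32_t *mss, uint32_t *notsent)
{
  struct tcp_info ti;
  socklen_t len = sizeof(ti);
  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
    return -1;
  *cwnd = ti.tcpi_snd_cwnd;     /* congestion window, in segments */
  *unacked = ti.tcpi_unacked;   /* segments sent but not yet ACKed */
  *mss = ti.tcpi_snd_mss;       /* sender maximum segment size, in bytes */
  /* Bytes queued in the kernel but not yet sent on the wire; requires a
   * recent kernel. Not counted in unacked. */
  *notsent = ti.tcpi_notsent_bytes;
  return 0;
}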

That queue can be large, sometimes bigger than the tcp_space we computed above, because the congestion window moves over time and the kernel can move as much as it wants from the congestion window into the output queue; that is the TCP stack's black magic. One minute the cwnd = 10 and the next it is 67.

If extra_space becomes negative because notsent is bigger than the current congestion window, it means the regular tcp_space needs to shrink accordingly. Right now, we only add extra_space if it is positive, but in reality the current TCP space needs to account for the notsent size as well.

Bottom line: if tcp_space + extra_space ends up < 0, the allowed limit needs to be 0, not tcp_space.
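For readers who want the shape of the fix, here is a minimal, self-contained sketch of the idea (not the actual patch in the bug24665_032_01 branch); the parameter names mirror the fields quoted above, and a plain cast stands in for Tor's clamp_double_to_int64() helper:

#include <stdint.h>

/* Sketch: compute the per-socket write limit, clamping the combined
 * tcp_space + extra_space value at zero. */
static int64_t
kist_write_limit(uint32_t cwnd, uint32_t unacked, uint32_t mss,
                 uint32_t notsent, double sock_buf_size_factor)
{
  int64_t tcp_space, extra_space;

  /* Space left in the congestion window; 0 if the window shrank below the
   * number of unacked segments. */
  tcp_space = (cwnd >= unacked) ?
    ((int64_t)cwnd - unacked) * (int64_t)mss : 0;

  /* Extra headroom minus whatever is already queued unsent in the kernel;
   * this can go negative when the notsent queue is large. */
  extra_space = (int64_t)((double)cwnd * mss * sock_buf_size_factor)
                - (int64_t)notsent;

  /* The fix: if the sum is negative, allow nothing this round instead of
   * falling back to tcp_space alone. */
  if (tcp_space + extra_space < 0)
    return 0;
  return tcp_space + extra_space;
}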

We found this issue while looking at very loaded relays that kept putting data in the outbuf while the connection socket was not ready to write. We realized that the notsent queue was huge, yet KIST kept allowing more bytes to be written over and over again, filling the outbuf, and thus memory, at a rapid rate.

Child Tickets

Change History (8)

comment:1 Changed 7 months ago by dgoulet

Status: assigned → needs_review

Branch: bug24665_032_01

comment:2 Changed 7 months ago by yawning

The branch looks sensible to me.

My inner pedant will say that "If a connection to a relay was unreliable meaning tor was struggling to flush bytes towards the relay" is misleading at best, since the congestion window shrinking (by quite a bit) is an expected part of how TCP/IP works and not particularly indicative of an overloaded condition on its own.

While you're here, assuming the scheduler is called significantly faster than the RTT of most links (read that as "If 10 ms is lower than the RTT of most if not all links"), you can/should reduce sock_buf_size_factor as well, because you aren't going to get a full congestion window worth of ACKs back between scheduler calls in common cases.

comment:3 in reply to:  2 ; Changed 7 months ago by dgoulet

Replying to yawning:

The branch looks sensible to me.

My inner pedant will say that "If a connection to a relay was unreliable meaning tor was struggling to flush bytes towards the relay" is misleading at best, since the congestion window shrinking (by quite a bit) is an expected part of how TCP/IP works and not particularly indicative of an overloaded condition on its own.

Ah, I think I failed to explain my comment correctly. The point of that line was to give a reason for KIST to actually consider the notsent queue size, *because* it could be that the connection is struggling towards the relay.

How would you phrase it in proper English?

While you're here, assuming the scheduler is called significantly faster than the RTT of most links (read that as "If 10 ms is lower than the RTT of most if not all links"), you can/should reduce sock_buf_size_factor as well, because you aren't going to get a full congestion window worth of ACKs back between scheduler calls in common cases.

Interesting... if the channel is quite active, yes the scheduler tick for it should be 10ms.

What is a reasonable size factor in your opinion? It seems we can get some RTT information with the getsockopt() call via struct tcp_info; maybe we could adjust a scaling factor based on those values? (If that sounds like a good idea, we should open a separate ticket so this one can be merged.)

comment:4 in reply to:  3 ; Changed 7 months ago by yawning

Replying to dgoulet:

Replying to yawning:

The branch looks sensible to me.

My inner pedant will say that "If a connection to a relay was unreliable meaning tor was struggling to flush bytes towards the relay" is misleading at best, since the congestion window shrinking (by quite a bit) is an expected part of how TCP/IP works and not particularly indicative of an overloaded condition on its own.

Ah, I think I failed to explain my comment correctly. The point of that line was to give a reason for KIST to actually consider the notsent queue size, *because* it could be that the connection is struggling towards the relay.

How would you phrase it in proper English?

"The KIST scheduler did not correctly account for data already enqueued in each connection's send socket buffer, particularly in cases when the TCP/IP congestion window was reduced between scheduler calls. This situation lead to excessive per-connection buffering in the kernel, and a potential memory DoS. Fixes bug 24665; bugfix on 0.3.2.1-alpha."

Maybe not human friendly enough.

While you're here, assuming the scheduler is called significantly faster than the RTT of most links (read that as "If 10 ms is lower than the RTT of most if not all links"), you can/should reduce sock_buf_size_factor as well, because you aren't going to get a full congestion window worth of ACKs back between scheduler calls in common cases.

Interesting... if the channel is quite active, yes the scheduler tick for it should be 10ms.

What is a reasonable size factor in your opinion? It seems we can get some RTT information with the getsockopt() call via struct tcp_info; maybe we could adjust a scaling factor based on those values? (If that sounds like a good idea, we should open a separate ticket so this one can be merged.)

There isn't a good "one size fits all" solution. Setting it too low will gimp performance on fast, low-latency links; setting it too high right now bloats the various buffers. I would personally opt more toward avoiding the latter, given all the Fun that's happening.

As you noted, tcpi_rtt gives the smoothed RTT estimate (and tcpi_rttvar the RTT variance if you need it), which is probably sufficient to give a better guess here. As a first pass, I would recommend doing something based on the scheduler interval to smoothed RTT ratio, with a hard maximum at 1.0, but as you noted this is probably best discussed in a separate ticket.
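To make that concrete, here is a hypothetical sketch of such a factor; the scheduler-interval parameter name is made up for illustration and this is not code from any Tor branch:

#include <stdint.h>
#include <netinet/tcp.h>

/* Sketch: derive sock_buf_size_factor from the scheduler interval to
 * smoothed-RTT ratio, with a hard maximum at 1.0. */
static double
rtt_based_sock_buf_size_factor(const struct tcp_info *ti,
                               uint64_t sched_interval_usec)
{
  double factor;
  if (ti->tcpi_rtt == 0)
    return 1.0;  /* No RTT estimate yet: allow a full extra window. */
  /* tcpi_rtt is the smoothed RTT estimate, in microseconds. */
  factor = (double)sched_interval_usec / (double)ti->tcpi_rtt;
  if (factor > 1.0)
    factor = 1.0;  /* Never ask for more than one extra congestion window. */
  return factor;
}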

comment:5 in reply to:  4 Changed 7 months ago by dgoulet

Ok I made the changes recommended by Yawning on the wording (THANKS!!!)

See branch: bug24665_032_01

I've also opened #24694 for the RTT discussion.

comment:6 Changed 7 months ago by pastly

Status: needs_review → needs_revision

I would change the following comment. Then it LGTM.

diff --git a/src/or/scheduler_kist.c b/src/or/scheduler_kist.c
index 0bd51bed7..a50be345b 100644
--- a/src/or/scheduler_kist.c
+++ b/src/or/scheduler_kist.c
@@ -266,8 +266,7 @@ update_socket_info_impl, (socket_table_ent_t *ent))

   /* These values from the kernel are uint32_t, they will always fit into a
    * int64_t tcp_space variable but if the congestion window cwnd is smaller
-   * than the unacked packets, the remaining TCP space is set to 0 so we don't
-   * write more on this channel. */
+   * than the unacked packets, the remaining TCP space is set to 0. */
   if (ent->cwnd >= ent->unacked) {
     tcp_space = (ent->cwnd - ent->unacked) * (int64_t)(ent->mss);
   } else {

comment:7 Changed 7 months ago by dgoulet

Status: needs_revision → merge_ready

Fixed!

Branch: bug24665_032_01

comment:8 Changed 7 months ago by nickm

Resolution: fixed
Status: merge_ready → closed

Merged to 0.3.2 and forward!
