Bw Auths should penalize nodes for circ extend failures

added MikePerryIterationFires20111106 actualpoints::3 component::core tor/torflow owner::mikeperry points::3 priority::medium resolution::fixed status::closed type::enhancement labels

We should also spend some time thinking about whether 0 is the right number here. It's not clear that there are any other good choices, though. We have no idea exactly how much load it takes to overload these nodes to the point of failure; all we can observe is a binary outcome.
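To make the trade-off concrete, here is a small sketch of how failures recorded as 0 pull down a node's average measurement. The function name and the sample figures are made up for illustration and are not taken from torflow.

```python
def mean_with_failures(successful_bw, failures, failure_value=0.0):
    """Average the successful measurements plus one failure_value per failure.

    Hypothetical helper: torflow's real averaging and filtering are more
    involved than a plain mean.
    """
    samples = list(successful_bw) + [failure_value] * failures
    if not samples:
        raise ValueError("no measurements recorded")
    return sum(samples) / len(samples)


# Made-up node: two fast successful measurements and eight failed extends.
fast_but_flaky = mean_with_failures([5000.0, 5000.0], failures=8)

# The same node if the failures were simply ignored.
ignoring_failures = mean_with_failures([5000.0, 5000.0], failures=0)
```

With these invented numbers the average drops from 5000 to 1000, so a value of 0 penalizes a node roughly in proportion to its failure rate without requiring any knowledge of how overloaded it actually is.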

Trac:
Actualpoints: N/A to N/A
Description: Right now we have about 50 extremely overloaded guard nodes (the Pandora* set) that are failing TLS connections, dir connections, and just about everything else.

However, on the rare occasions when they do manage to complete a circuit, they have huge bandwidth capacity available.

What we should do is assign a measurement of 0 every time we try to use a node as a first hop but it fails to accept our extend.

We can try to do this for the 2nd hop too, but that is less reliable, since it won't be clear whether that extend failed because the guard sucks or because the node is actually broken.

We could ensure that each exit is measured at least twice as an entry, or something.

to

Right now we have about 50 extremely overloaded guard nodes (the Pandora* set) that are failing TLS connections, dir connections, and just about everything else.

However, on the rare occasions when they do manage to complete a circuit, they have huge bandwidth capacity available.

What we should do is assign a measurement of 0 every time we try to use a node as a first hop but it fails to accept our extend.

We can try to do this for the 2nd hop too, but that is less reliable, since it won't be clear whether that extend failed because the 1st hop sucks or because the 2nd hop is actually broken. We could ensure that each exit is measured at least twice as an entry, or something, to improve this property (maybe).
Points: N/A to N/A

Trac:
Description: (previous text, unchanged from above)

to

Right now we have about 50 extremely overloaded guard nodes (the Pandora* set) that are failing TLS connections, dir connections, and just about everything else.

However, on the rare occasions when they do manage to complete a circuit, they have huge bandwidth capacity available.

What we should do is assign a measurement of 0 every time we try to use a node as a first hop but it fails to accept our extend.

We can try to do this for the 2nd hop too, but that is less reliable, since it won't be clear whether that extend failed because the 1st hop sucks or because the 2nd hop is actually broken. We could ensure that each exit is measured at least twice as an entry, or something, to improve this property (maybe).

We may want to ensure that each exit is measured at least N times as an entry anyway (for N=1 or 2).

Trac:
Description: (previous text, unchanged from above)

to

Right now we have about 50 extremely overloaded guard nodes (the Pandora* set) that are failing TLS connections, dir connections, and just about everything else, because crypto is maxing out their CPU.

However, on the rare occasions when they do manage to complete a circuit, they have huge bandwidth capacity available.

What we should do is assign a measurement of 0 every time we try to use a node as a first hop but it fails to accept our extend.

We can try to do this for the 2nd hop too, but that is less reliable, since it won't be clear whether that extend failed because the 1st hop sucks or because the 2nd hop is actually broken. We could ensure that each exit is measured at least twice as an entry, or something, to improve this property (maybe).

We may want to ensure that each exit is measured at least N times as an entry anyway (for N=1 or 2).

Trac:
Type: defect to enhancement

The first test here is to export failure rate information for first and second hops and take a look at it.

The bw auths do not currently use exits as the first hop, so if we only counted failures for the first hop we would never see exit overload of this nature.
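A rough sketch of what exporting per-hop failure rates could look like is below. The record format, the function name, and the node names are assumptions made for illustration; they do not reflect torflow's actual data structures.

```python
from collections import defaultdict


def failure_rates(attempts):
    """Compute extend failure rates keyed by (node_id, hop_position).

    `attempts` is an iterable of (node_id, hop_position, succeeded) tuples.
    This tuple layout is hypothetical, chosen only for the example.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for node_id, hop, succeeded in attempts:
        totals[(node_id, hop)] += 1
        if not succeeded:
            failed[(node_id, hop)] += 1
    return {key: failed[key] / totals[key] for key in totals}


# Made-up sample: a guard seen only as hop 1, an exit seen only as hop 2.
sample = [
    ("guardA", 1, False),
    ("guardA", 1, False),
    ("guardA", 1, True),
    ("exitB", 2, True),
    ("exitB", 2, False),
]
rates = failure_rates(sample)
```

Because the exit in this sample never appears at hop 1, a report restricted to first-hop failures would have no entry for it at all, which is the gap described above.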

Trac:
Cc: N/A to aagbsn

Yeah, we're going to have to fix this ASAP or #1976 (moved) is going to destroy the Tor network :/.

Trac:
Keywords: N/A deleted, MikePerryIterationFires2011106 added

Trac:
Keywords: MikePerryIterationFires2011106 deleted, MikePerryIterationFires20111106 added

My plan for this is to count circ failures against the node that was being extended to, and stream failures against the exit.

The current plan is to make each of these failures count as a 0 measurement for that node, and not count as a measurement at all for the other node in the path.
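The attribution rule in this plan might be sketched as follows. This is only an illustration under stated assumptions: the function name, its parameters, and the node names are invented, and crediting a successful measurement to every node in the path is an assumption rather than something stated in this ticket.

```python
from collections import defaultdict


def record_result(measurements, path, bandwidth=None,
                  failed_extend_to=None, stream_failed=False):
    """Record one measurement attempt over `path` (a list of node ids).

    Hypothetical helper illustrating the plan:
    - a circuit failure counts as a 0 for the node being extended to;
    - a stream failure counts as a 0 for the exit (last node in the path);
    - in both cases the other nodes in the path get no measurement at all.
    """
    if failed_extend_to is not None:
        measurements[failed_extend_to].append(0.0)
    elif stream_failed:
        measurements[path[-1]].append(0.0)
    else:
        # Assumption: a successful run is credited to every node in the path.
        for node in path:
            measurements[node].append(bandwidth)
    return measurements


measurements = defaultdict(list)
record_result(measurements, ["guardA", "exitB"], bandwidth=800.0)
record_result(measurements, ["guardA", "exitB"], failed_extend_to="exitB")
record_result(measurements, ["guardA", "exitB"], stream_failed=True)
record_result(measurements, ["guardC", "exitB"], failed_extend_to="guardC")
```

In this made-up run, guardA keeps its single successful measurement even though two attempts through it failed, because both failures are charged to exitB instead.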

The plumbing was already here for this one, but it was a little tricky to get the dampening right. We will still need to watch it (in #4425 (moved)).

Trac:
Resolution: N/A to fixed
Summary: "Bw Auths should assign 0 bw to first hops that fail" to "Bw Auths should penalize nodes for circ extend failures"
Status: new to closed
Actualpoints: N/A to 3
Points: N/A to 3

closed

changed time estimate to 24h

added 24h of time spent

mentioned in issue #5464 (moved)

mentioned in issue #7023 (moved)

moved to tpo/network-health/torflow#1984 (closed)

mentioned in issue tpo/network-health/torflow#7023 (closed)
