Opened 3 months ago

Last modified 4 weeks ago

#23662 assigned defect

prop224: Service edge-case where it re-uploads descriptor with same rev counter

Reported by: asn Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, prop224
Cc: Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor:

Description (last modified by asn)

There is an edge case where if a service's HSDir connection times out before getting a 200 back, Tor will open a second connection with the same outbuf as the previous one and retry the upload. See connection_ap_detach_retriable() which gets called from connection_ap_expire_beginning() which detaches the stream from the circuit and marks it as pending again.

That's a problem for v3 HSes, since that logic bypasses the HS subsystem and the rev counter does not get incremented for the second upload.

The problem is that if the first connections ends up succeeding (even tho it timed out), the second connection will fail because of rev counter issues.

This is not a reachability issue since one of the two conns will eventually succeed and the right descriptor will be uploaded. However it generates a log_warn on the service-side that might be annoying, and a log_info on the HSdir side.

The right fix here might involve intercepting that retry, and piping it through the HS subsystem. But this will require refactoring.

Child Tickets

Change History (10)

comment:1 Changed 3 months ago by asn

Description: modified (diff)

comment:2 Changed 3 months ago by dgoulet

Status: newneeds_review

See branch: ticket23662_032_01

First commit is a big refactoring of connection_ap_expire_beginning() so we can make sense out of it and introduce a function that checks if we should retry the connection on timeout.

Second commit adds a callback in the HS subsystem for any directory connection linked to the AP one that has timed out so the retry can be handled properly.

This whole thing might be over the top but if we think it is a good idea, lets put it back in needs_revision so I can add a unit test for connection_ap_expire_beginning() for which none exists right now. If that is too much for 032, I can try to do a minimal change but at the cost of technical debt of course :).

comment:3 Changed 3 months ago by asn

Status: needs_reviewneeds_revision

Hello, thanks for the patch. Here is a review:

  • The refactoring commit is indeed heavy and almost unreviewable if it's not split into three commits: one commit to create the stub funcs, one commit to just move code and one last commit that finalizes code. Right now it's far too hard for a reviewer (either me or nick). In the end we can squash all three commits together before merging.
  • Final commit looks nice. A question: Will a single descriptor upload time out, trigger an upload to all the HSDirs, or to just the hsdir that failed? IIUC the current patch will reupload to all HSDirs, right? Are we sure we want that? :X
  • Not a big fan of tor_assert(dir_conn->hs_ident);. Let's turn it into a BUG for less surprises.

Given the big amount of changes required for this not-very-relevant bugfix (no actual reachability/security issue), I wouldn't be opposed to delaying this big fix to 0.3.3, if we have a minimal fix that will silence the warns.

comment:4 Changed 2 months ago by dgoulet

Owner: set to dgoulet
Status: needs_revisionaccepted

comment:5 Changed 2 months ago by dgoulet

Status: acceptedneeds_revision

comment:6 Changed 7 weeks ago by dgoulet

Status: needs_revisionneeds_information

I've look back at this patch and it is indeed heavy for an 032 fix. The problem though is that this log will popped occasionally in service's log. I'm actually still hitting that sometimes on my service.

    log_warn(LD_REND, "Uploading hidden service descriptor: http "
                      "status 400 (%s) response from dirserver "
                      "'%s:%d'. Malformed hidden service descriptor?",

No real consequence so maybe something we could do is to move that warning to log_info() and adding a comment that we know what's causing it and once we fix this ticket, we put it back in log_warn()?

The problem is that even though we get a report from an operator that 400 was returned, without info logs, we can't do squat :S.

comment:7 Changed 4 weeks ago by asn

Status: needs_informationneeds_revision

We discussed this with David and came with the following plan:
1) In 0.3.2 we turn this warning to protocol_warn since we know of natural ways that it can be caused (i.e. this ticket)
2) In 0.3.3 we fix this ticket properly and any other related issues found, and turn the warning back to a real warning since it shouldn't be caused naturally anymore.

comment:8 Changed 4 weeks ago by asn

Status: needs_revisionneeds_review

Please see my branch bug23662_032 for the 0.3.2 part as described in comment:7 above.

After merging this, let's keep the ticket open and repurpose it to 0.3.3 to continue with the comment:7 plan.

comment:9 Changed 4 weeks ago by dgoulet

Reviewer: dgoulet
Status: needs_reviewmerge_ready

lgtm;

comment:10 Changed 4 weeks ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final
Status: merge_readyassigned

merged to 0.3.2 and forward. Putting this as "assigned" again in 0.3.3

Note: See TracTickets for help on using tickets.