Opened 2 years ago

Last modified 6 months ago

#22233 accepted defect

Reconsider behavior on .z URLs with Accept-Encoding header

Reported by: nickm Owned by: Hello71
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 034-triage-20180328, 034-removed-20180328, reviewer-was-teor-20190422
Cc: ahf Actual Points:
Parent ID: Points:
Reviewer: Sponsor: Sponsor4

Description (last modified by arma)

In proposal 278, I said:

  If a directory server receives a request to a document with the ".z"
  suffix, where the client does not include an "Accept-Encoding" header,
  the server should respond with the zlib compressed version of the
  document for backwards compatibility with client that does not support
  this proposal.

But on #22206 it became apparent that we've got a problem there: there are already tools (built e.g. on wget) that ask for the .z URL but which send "Accept-Encodings: Identity."

And onn #22206, Yawning says:

an error (or a double compressed payload) should be returned when the .z request contains an Accept-Encoding header that specifies anything other than identity/deflate.

We'd like the end result here to be something where new Tor clients can interoperate with older directory caches without breaking anything, and getting the new compression type if they support it. And we certainly don't want anybody doing two layers of compression: that's a waste of cycles. But we should see if there's a way where we can be more standards compliant without breaking anything we care about.

Child Tickets

Change History (18)

comment:1 Changed 2 years ago by ahf

Cc: ahf added

comment:2 Changed 2 years ago by arma

Description: modified (diff)

comment:3 Changed 2 years ago by arma

FYI, my wget didn't send any accept-encoding header. Neither did Sebastian's. Maybe Yawning's did? You can tell it to *add* an accept-encoding header, but then what do you expect.

I think the issue here is more that there are two ways to indicate you want compression -- adding a .z to the url, and saying so in the accept-encoding header -- and we should build the two by two decision matrix and do the smart thing for all four cases.

comment:4 in reply to:  3 ; Changed 2 years ago by yawning

Replying to arma:

FYI, my wget didn't send any accept-encoding header. Neither did Sebastian's. Maybe Yawning's did? You can tell it to *add* an accept-encoding header, but then what do you expect.

wget http://example.com on my system does this:

GET / HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.com
Connection: Keep-Alive

Python's HTTP client also includes the header with identity.

I think the issue here is more that there are two ways to indicate you want compression -- adding a .z to the url, and saying so in the accept-encoding header -- and we should build the two by two decision matrix and do the smart thing for all four cases.

Yes. The existing code tries to treat .z as Accept-Encoding: deflate, which is a shortcut, and not always correct. Assuming we do not want to double compress, what I would consider working behavior looks like:

File Accept-Encoding Action
foo N/A foo
foo identity Content-Encoding: identity, foo
foo deflate Content-Encoding: deflate, deflate(foo)
foo identity, deflate Content-Encoding: deflate, deflate(foo)
foo identity, gzip Content-Encoding: gzip, gzip(foo)
foo gzip Content-Encoding: gzip, gzip(foo)
foo deflate, gzip Content-Encoding: gzip, gzip(foo)
foo.z N/A deflate(foo)
foo.z identity Content-Encoding: identity, deflate(foo)
foo.z deflate 406 Not Acceptable
foo.z identity, deflate Content-Encoding: identity, deflate(foo)
foo.z identity, gzip Content-Encoding: identity, deflate(foo)
foo.z gzip 406 Not Acceptable
foo.z deflate, gzip 406 Not Acceptable

(gzip used as a placeholder algorithm for "Something that is supported that is not deflate)

The current code mishandles the cases in the table that should either double compress or return 406.

comment:5 Changed 2 years ago by yawning

ahf points out that prop 278 specifies sending 415 on negotiation failure, the proposal should be changed to 406, because it's the correct error code to use when a request is is impossible to fulfill due to Accept-* headers.

comment:6 Changed 2 years ago by ahf

Owner: set to ahf
Status: newassigned

comment:7 Changed 2 years ago by nickm

Milestone: Tor: 0.3.1.x-finalTor: 0.3.3.x-final

comment:8 Changed 20 months ago by ahf

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

comment:9 Changed 19 months ago by nickm

Keywords: 034-triage-20180328 added

comment:10 Changed 19 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:11 Changed 19 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:12 in reply to:  4 Changed 12 months ago by Hello71

Replying to yawning:

Replying to arma:

FYI, my wget didn't send any accept-encoding header. Neither did Sebastian's. Maybe Yawning's did? You can tell it to *add* an accept-encoding header, but then what do you expect.

wget http://example.com on my system does this:

GET / HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.com
Connection: Keep-Alive

Python's HTTP client also includes the header with identity.

I think the issue here is more that there are two ways to indicate you want compression -- adding a .z to the url, and saying so in the accept-encoding header -- and we should build the two by two decision matrix and do the smart thing for all four cases.

Yes. The existing code tries to treat .z as Accept-Encoding: deflate, which is a shortcut, and not always correct. Assuming we do not want to double compress, what I would consider working behavior looks like:

File Accept-Encoding Action
foo N/A foo
foo identity Content-Encoding: identity, foo
foo deflate Content-Encoding: deflate, deflate(foo)
foo identity, deflate Content-Encoding: deflate, deflate(foo)
foo identity, gzip Content-Encoding: gzip, gzip(foo)
foo gzip Content-Encoding: gzip, gzip(foo)
foo deflate, gzip Content-Encoding: gzip, gzip(foo)
foo.z N/A deflate(foo)
foo.z identity Content-Encoding: identity, deflate(foo)
foo.z deflate 406 Not Acceptable
foo.z identity, deflate Content-Encoding: identity, deflate(foo)
foo.z identity, gzip Content-Encoding: identity, deflate(foo)
foo.z gzip 406 Not Acceptable
foo.z deflate, gzip 406 Not Acceptable

(gzip used as a placeholder algorithm for "Something that is supported that is not deflate)

The current code mishandles the cases in the table that should either double compress or return 406.

I believe this is not consistent with modern HTTP and web client behavior. I am fairly sure that modern web clients do one of the following:

  1. send Accept-Encoding: deflate, gzip (or gzip, deflate)
  2. if the response is Content-Encoding: deflate or gzip, transparently decompress it.
  3. process the decompressed content as the type indicated in Content-Type.
  1. do not send Accept-Encoding, or send Accept-Encoding: identity
  2. do not decompress the content
  3. process the content as the type indicated in Content-Type.

Note that not sending any Accept-Encoding is identical to sending Accept-Encoding: identity, as specified in RFC 7231 (https://tools.ietf.org/html/rfc7231#section-5.3.4).

I am fairly sure that this behavior also does not depend on the file extension of the URL. Therefore, it is not correct to return 406 if the server thinks that compressing the content is stupid (note that this is not just the case for gzipped files. it also applies to image files, video files, font files, and so on; too many for the browser to even attempt to make a comprehensive list of file extensions). Instead, it should simply not compress the content, not send Content-Encoding: identity, and send it as is. You can see this behavior if you execute for example curl --compressed -v torproject.org. Compression is offered, but the server doesn't want to bother, so it just doesn't compress it. This is supported by https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding, which says "As long as the identity value, meaning no encoding, is not explicitly forbidden, by an identity;q=0 or a *;q=0 without another explicitly set value for identity, the server must never send back a 406 Not Acceptable error.".

Therefore, I think your table should look more like this:

File Accept-Encoding Action
foo none or identity no Content-Encoding, foo
foo deflate Content-Encoding: deflate, deflate(foo)
foo gzip Content-Encoding: gzip, gzip(foo)
foo deflate, gzip Content-Encoding: deflate or gzip, deflate(foo) or gzip(foo) respectively
foo.z none or identity no Content-Encoding, deflate(foo)
foo.z deflate no Content-Encoding, deflate(foo)
foo.z gzip no Content-Encoding, deflate(foo)
foo.z deflate, gzip no Content-Encoding, deflate(foo)

I doubt there exist any actual modern web clients than do not fit one of these. If there are, it's probably fine to send them whatever as long as they accept it, explicitly or implicitly.

Note that this guarantees that anybody who requests foo will see the actual contents of foo in their browser, or saved to their disk or whatever. Additionally, anybody who requests foo.z will always receive a deflated version of foo, and (theoretically) will not have their browser decompress it behind their backs. Also, we do not unnecessarily compress anything twice.

For what it's worth, my wget also sends Accept-Encoding: identity by default. I'm using wget 1.19.5.

comment:13 in reply to:  4 Changed 12 months ago by Hello71

Status: assignedneeds_review

In fact, I would argue for an implementation that simply doesn't send or accept ".z". An optimal implementation would support ".z", ".lzma", ".zstd", and whatever, but there's no need to. This interface is intended for programmatic clients only; any human access is for debugging purposes only.

https://github.com/torproject/tor/pull/416

comment:14 Changed 12 months ago by teor

Reviewer: teor
Status: needs_reviewneeds_revision

Please split your pull request: it contains two different changes that might end up in different releases.

Here's my review for the ".z" part:

Directory mirrors on 0.3.0 and earlier rely on ".z" to know when to send compressed content. And I don't know if we have tested later versions without ".z".

So we can't stop sending ".z" until:

  • we have tested 0.3.1 and later without ".z", and
  • all relays are 0.3.1 or later (0.2.9 is supported until 2020)

OR

  • we have tested 0.3.1 and later without ".z", and
  • we make 0.3.3 and later declare a new directory protocol version, and
  • we make tor test for the new protocol version (or 0.3.1 or later) before removing ".z"

We could put the second option into 0.3.6 if it's well-tested. The first option would have to wait until 2021 (~0.4.0).

comment:15 Changed 12 months ago by Hello71

Owner: changed from ahf to Hello71
Status: needs_revisionassigned

comment:16 Changed 12 months ago by Hello71

Status: assignedneeds_revision

comment:17 Changed 12 months ago by Hello71

Status: needs_revisionaccepted

There is technically a third option: don't send ".z" and accept that clients will sometimes get uncompressed content. It's not a great option because almost one third of relays are still on 0.3.0 or earlier.

The second option sounds too hard. It's mostly a theoretical correctness issue anyways, so if #28100 is merged, I think this can safely wait until 2021.

To that end, I updated my PR, but it doesn't need review until 2020. Unfortunately, there doesn't seem to be a "closed, later" option, so hopefully we can just stick this in accepted until then. Unless someone else wants to implement the version checking.

Last edited 12 months ago by Hello71 (previous) (diff)

comment:18 Changed 6 months ago by teor

Keywords: reviewer-was-teor-20190422 added
Reviewer: teor

If these tickets go back in to needs_review, and I am on leave, they will need another reviewer.

Note: See TracTickets for help on using tickets.