Opened 3 years ago

Last modified 3 days ago

#20025 new defect

document.characterSet leaks locale when HTML page does not specify its own encoding

Reported by: dcf
Owned by: tbb-team
Priority: Medium
Milestone:
Component: Applications/Tor Browser
Version:
Severity: Normal
Keywords: tbb-fingerprinting-locale
Cc: xfix
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by dcf)

At comment:18:ticket:10703, xfix reports on another means of discovering the browser's fallback character encoding, the document.characterSet property (and possibly its aliases document.charset and document.inputEncoding). There is a demo site here:

https://hsivonen.com/test/moz/check-charset.htm

Using tor-browser-linux64-6.5a2_en-US.tar.xz, I get the output

Your fallback charset is: windows-1252

But using tor-browser-linux64-6.0.4_ko.tar.xz, I get the output

Your fallback charset is: EUC-KR

This is a separate issue from #10703. I'll leave a comment with a demo page that shows both techniques, with the one in #10703 giving the same result and document.characterSet giving different results.

The really strange thing is that this only seems to be effective when the server has HSTS (a valid Strict-Transport-Security header). I couldn't reproduce the result of the hsivonen.com demo site with a local web server, nor with an onion service, even when copying the demo and its header exactly. Only when I put it on an HTTPS server with HSTS could I reproduce it. I'll leave a comment with two demo pages allowing you to compare.

Edit 2019-10-02: Ignore the above paragraph about HSTS. The difference is actually due to whether the document specifies its own encoding. See comment:7.

Attachments (5)

en-us-with-hsts.png (8.9 KB) - added by dcf 3 years ago.
tor-browser-linux64-6.5a2_en-US.tar.xz on https://people.torproject.org/~dcf/tor20025/check-charset.html (has HSTS)
en-us-without-hsts.png (8.9 KB) - added by dcf 3 years ago.
tor-browser-linux64-6.5a2_en-US.tar.xz on https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html (no HSTS)
ko-with-hsts.png (14.1 KB) - added by dcf 3 years ago.
tor-browser-linux64-6.0.4_ko.tar.xz on https://people.torproject.org/~dcf/tor20025/check-charset.html (has HSTS)
ko-without-hsts.png (8.9 KB) - added by dcf 3 years ago.
tor-browser-linux64-6.0.4_ko.tar.xz on https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html (no HSTS)
ohnoes.png (66.2 KB) - added by Thorin 3 weeks ago.
results in glorious technicolor

Change History (16)

Changed 3 years ago by dcf

Attachment: en-us-with-hsts.png added

tor-browser-linux64-6.5a2_en-US.tar.xz on https://people.torproject.org/~dcf/tor20025/check-charset.html (has HSTS)

Changed 3 years ago by dcf

Attachment: en-us-without-hsts.png added

tor-browser-linux64-6.5a2_en-US.tar.xz on https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html (no HSTS)

Changed 3 years ago by dcf

Attachment: ko-with-hsts.png added

tor-browser-linux64-6.0.4_ko.tar.xz on https://people.torproject.org/~dcf/tor20025/check-charset.html (has HSTS)

Changed 3 years ago by dcf

Attachment: ko-without-hsts.png added

tor-browser-linux64-6.0.4_ko.tar.xz on https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html (no HSTS)

comment:1 Changed 3 years ago by dcf

I set up a demo page on two servers, one with HSTS and one without. Only the one with HSTS shows a difference in document.characterSet. Note that neither of the servers specifies the encoding in the Content-Type header, so you get a warning in the browser console and the browser has to infer the encoding.

The technique from #10703 always finds iso-8859-1. (I think that technique has trouble distinguishing iso-8859-1 and windows-1252.)
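To illustrate why the fallback encoding is observable at all, here is a minimal Python sketch (not the code of either demo page) that decodes the same undeclared bytes under two of the fallback encodings mentioned in this ticket. The byte pair 0xB0 0xA1 and the name `decode_with_fallback` are assumptions chosen for illustration; the ambiguous-bytes technique from #10703 may use different bytes and a different way of observing the result.

```python
# Illustrative only: the same two bytes read differently depending on
# which fallback encoding the decoder assumes.
AMBIGUOUS_BYTES = b"\xb0\xa1"  # hypothetical sample, not taken from the demo page


def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Decode bytes that carry no declared encoding, using the given fallback."""
    return data.decode(fallback)


if __name__ == "__main__":
    # "cp1252" and "euc_kr" are Python's codec names for windows-1252 and EUC-KR.
    for fallback in ("cp1252", "euc_kr"):
        print(fallback, repr(decode_with_fallback(AMBIGUOUS_BYTES, fallback)))
```

Under windows-1252 the bytes become two Latin characters, while under EUC-KR they become a single Hangul syllable. As for the iso-8859-1 versus windows-1252 remark above: in the WHATWG Encoding Standard, iso-8859-1 is a label for windows-1252, which would be consistent with the two being hard to tell apart in a browser.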

with HSTS

HSTS demo page: https://people.torproject.org/~dcf/tor20025/check-charset.html

document.characterSet is windows-1252 for the en-US bundle and EUC-KR for the ko bundle.

[Screenshots: en-us-with-hsts.png (en-US bundle) and ko-with-hsts.png (ko bundle); see Attachments.]

without HSTS

non-HSTS demo page: https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html

document.characterSet is windows-1252 for both the en-US and ko bundles.

[Screenshots: en-us-without-hsts.png (en-US bundle) and ko-without-hsts.png (ko bundle); see Attachments.]

comment:2 Changed 3 years ago by dcf

I checked and the same HSTS weirdness happens with stock Firefox 45.3.0. To reproduce, go to Preferences → Content → Fonts & Colors → Advanced → Text Encoding for Legacy Content, and select Korean. Then the HSTS demo page https://people.torproject.org/~dcf/tor20025/check-charset.html will show EUC-KR for document.characterSet. The non-HSTS demo page https://people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html continues to show windows-1252.

Chromium 52.0.2743.116 doesn't appear to make a difference between HSTS and non-HSTS. Go to Settings → Web content → Customize fonts → Encoding and change to Korean. Both demo pages show EUC-KR.

comment:3 Changed 16 months ago by cypherpunks

Latest Tor Browser:
https://www.bamsoftware.com/people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html

Using ambiguous bytes (#10703) iso-8859-1
document.characterSet (#20025) UTF-8
document.charset UTF-8
document.inputEncoding UTF-8

Does anyone else get the same result?
(Firefox 61 with resistFingerprinting also has this value.)

comment:4 in reply to:  3 Changed 16 months ago by dcf

Replying to cypherpunks:

Latest Tor Browser:
https://www.bamsoftware.com/people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html

Using ambiguous bytes (#10703) iso-8859-1
document.characterSet (#20025) UTF-8
document.charset UTF-8
document.inputEncoding UTF-8

cypherpunks, please also try https://people.torproject.org/~dcf/tor20025/check-charset.html.

For me, with Tor Browser 8.0a8 en-US, I get:

https://www.bamsoftware.com/people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html

Using ambiguous bytes (#10703) iso-8859-1
document.characterSet (#20025) UTF-8
document.charset UTF-8
document.inputEncoding UTF-8

https://people.torproject.org/~dcf/tor20025/check-charset.html

Using ambiguous bytes (#10703) iso-8859-1
document.characterSet (#20025) windows-1252
document.charset windows-1252
document.inputEncoding windows-1252

I conjectured that the difference may be because of HSTS, but that appears not to be the case, because bamsoftware.com has HSTS.

Last edited 3 weeks ago by dcf

comment:5 Changed 3 weeks ago by Thorin

gk: can we change the keyword to tbb-fingerprinting-locale please? TIA :)

---

I am only going on previous comments about which sites have HSTS and which don't (those comments are contradictory, I think; I need coffee, so let me know if I have it the wrong way round). Either way, there are four test sites.

The thorin test page links to and opens the other three in a new tab.

Obligatory Pic

  • spreadsheet to follow

Results:

  • all tests done in 9.0a6
  • all 30 non en-US bundles tested were set to spoof
  • excluding the windows-1252 fallback, there are 12 buckets covering 14 languages
  • ko - not tested, waiting for #31886 , but reading above it would be windows-1252 anyway
  • mk - had to install the Macedonian language pack and set spoof etc, see #31725

Notes

  • Options>General>Languages>Fonts and Colors>Advanced>Text Encoding for Legacy Content
  • this sets the pref intl.charset.fallback.override if you change it from "Default for current locale"

Solution

  • Set intl.charset.fallback.override = windows-1252 when privacy.spoof_english = 2, and reset it when privacy.spoof_english !== 2
  • Do this upstream (not sure if #10703 also needs upstreaming)
  • thinking out loud: If they're requesting pages as en-US, etc (spoof = 2) .. then the breakage should be nothing more than a normal en-US bundle, right? IDK, does the override pref affect chrome? Does this impact users on non-English OSes?
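The first bullet of the proposal can be written down as a small decision function. This is only a Python model of the suggested behaviour, not Firefox or Tor Browser code: the function name `fallback_override_for` and the use of `None` to mean "reset the pref" are assumptions made for the sketch, while the pref names and the value 2 come from the bullet above.

```python
from typing import Optional

SPOOF_ENGLISH_ENABLED = 2          # privacy.spoof_english value named in the proposal
ENGLISH_FALLBACK = "windows-1252"  # fallback reported for the en-US bundle


def fallback_override_for(spoof_english: int) -> Optional[str]:
    """Model the proposed value of intl.charset.fallback.override.

    Returns the override string to set, or None to mean "reset the pref",
    which is an assumption of this sketch rather than a real pref value.
    """
    if spoof_english == SPOOF_ENGLISH_ENABLED:
        return ENGLISH_FALLBACK
    return None


if __name__ == "__main__":
    for value in (0, 1, 2):
        print(value, fallback_override_for(value))
```

Whether forcing the override affects browser chrome or users on non-English OSes is exactly the open question raised above; the sketch does not answer it.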

Class, discuss! :) .. pic to follow

Last edited 3 weeks ago by Thorin

Changed 3 weeks ago by Thorin

Attachment: ohnoes.png added

results in glorious technicolor

comment:6 Changed 3 weeks ago by pospeselr

Keywords: tbb-fingerprinting-locale added

comment:7 in reply to:  5 Changed 3 weeks ago by dcf

Keywords: tbb-fingerprinting removed

Replying to Thorin:

I am only going on previous comments about which sites have HSTS and which don't

You can forget about HSTS. That conjecture was wrong. bamsoftware.com has HSTS and it doesn't show the leak. The reason the previous results seem contradictory is that the page that in 2016 was at https://people.eecs.berkeley.edu/ (no HSTS) now redirects to a different server, https://www.bamsoftware.com/ (HSTS).

If the cause of the difference is not HSTS, what is it? My new guess is that it must have to do with the Content-Type header and whether it specifies an encoding or not.

site                          leaks  Content-Type
thorin-oakenpants.github.io   no     text/html; charset=utf-8
www.bamsoftware.com           no     text/html; charset=UTF-8
hsivonen.com                  yes    text/html
people.torproject.org         yes    text/html

You can check the Content-Type header yourself using the curl command.

$ curl --head https://thorin-oakenpants.github.io/testing/bug20025.html
HTTP/2 200 
server: GitHub.com
content-type: text/html; charset=utf-8
last-modified: Sun, 29 Sep 2019 15:29:53 GMT
etag: "5d90cdf1-7ec"
access-control-allow-origin: *
expires: Sun, 29 Sep 2019 16:52:42 GMT
cache-control: max-age=600
x-proxy-cache: MISS
x-github-request-id: XXX
accept-ranges: bytes
date: Sun, 29 Sep 2019 16:42:42 GMT
via: 1.1 varnish
age: 0
x-served-by: XXX
x-cache: MISS
x-cache-hits: 0
x-timer: S1569775362.340251,VS0,VE329
vary: Accept-Encoding
x-fastly-request-id: XXX
content-length: 2028

$ curl --head https://www.bamsoftware.com/people.eecs.berkeley.edu/~fifield/tor20025/check-charset.html
HTTP/1.1 200 OK
Date: Sun, 29 Sep 2019 16:41:16 GMT
Server: Apache/2.4.25 (Debian)
Vary: User-Agent,Referer,Accept-Encoding
Last-Modified: Thu, 01 Feb 2018 20:06:42 GMT
ETag: "5d2-5642c2265f880"
Accept-Ranges: bytes
Content-Length: 1490
Strict-Transport-Security: max-age=15768000
Content-Type: text/html; charset=UTF-8

$ curl --head https://hsivonen.com/test/moz/check-charset.htm
HTTP/2 200 
server: nginx/1.17.4
date: Sun, 29 Sep 2019 16:42:22 GMT
content-type: text/html
content-length: 353
last-modified: Mon, 25 Feb 2013 11:31:59 GMT
etag: "3998-161-4d68ae39709c0"
accept-ranges: bytes
vary: Accept-Encoding
strict-transport-security: max-age=31536000; includeSubDomains; preload

$ curl --head https://people.torproject.org/~dcf/tor20025/check-charset.html
HTTP/1.1 200 OK
Date: Sun, 29 Sep 2019 16:41:08 GMT
Server: Apache
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
X-Xss-Protection: 1
Referrer-Policy: no-referrer
Strict-Transport-Security: max-age=15768000; preload
Public-Key-Pins: pin-sha256="EfzQ7Gg2LG2mQyjStHmfD4yVzzi/30yyRnAKquPlPMQ="; pin-sha256="Tnmd19BxbL/grn2RdYAAyck34e1KeIq9n5CK6ZZVP1w="; max-age=5184000
Last-Modified: Tue, 30 Aug 2016 05:30:00 GMT
ETag: "5d2-53b4345990616"
Accept-Ranges: bytes
Content-Length: 1490
Vary: Accept-Encoding
Content-Type: text/html
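The guess above (a page leaks when its Content-Type carries no charset) can be checked against the four recorded headers with a short Python sketch. The helper names `declared_charset` and `predicted_to_leak` are hypothetical, and the sketch looks only at the HTTP header; it does not consider other ways a page can specify its encoding, such as a `<meta charset>` element or a byte order mark.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # the stdlib lower-cases the result


def predicted_to_leak(content_type: str) -> bool:
    """Under the guess in this comment, a missing charset means the fallback shows."""
    return declared_charset(content_type) is None


# Content-Type values copied from the curl output above.
OBSERVED = {
    "thorin-oakenpants.github.io": "text/html; charset=utf-8",
    "www.bamsoftware.com": "text/html; charset=UTF-8",
    "hsivonen.com": "text/html",
    "people.torproject.org": "text/html",
}

if __name__ == "__main__":
    for host, content_type in OBSERVED.items():
        print(host, "leaks" if predicted_to_leak(content_type) else "no leak")
```

For these four hosts the prediction matches the "leaks" column of the table: the two hosts that send a bare text/html are the two that leak.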

comment:8 Changed 3 weeks ago by Thorin

nvm .. i need more coffee

Last edited 3 weeks ago by Thorin

comment:9 Changed 2 weeks ago by Thorin

Edit: PS: can we change the title (replace HSTS with legacy encoding or something)? Thanks.

---
There's an error in my spreadsheet... hu and pl are the same, but I said they were different.. so that's one less bucket. But, I tested all the legacy fallback options available in Firefox, and ko returns EUC-KR, so I would expect that to be the same in TB.

There are 14 values in the UI legacy fallback combobox. They are:

arabic - windows-1256
baltic - windows-1257
central european, ISO - ISO-8859-2
central european, Microsoft - windows-1250
chinese, simplified - GBK
chinese, traditional - Big5
cyrillic - windows-1251
greek - ISO-8859-7
hebrew - windows-1255
japanese - Shift_JIS
korean - EUC-KR
thai - windows-874
turkish - windows-1254
vietnamese - windows-1258

as well as "default for current locale", which would cover any others, I guess (IANA expert). E.g. I am not sure what happens with Lithuanian or Malay, but Thai would leak as windows-874.
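The "buckets" counted in this thread are groups of locales that share one fallback value. A minimal Python sketch of that grouping follows; the function name `buckets` is hypothetical, and the data holds only locale/fallback pairs mentioned in this ticket (the th entry is the expectation stated above, not a measurement).

```python
from collections import defaultdict
from typing import Dict, List

# Only pairs mentioned in this ticket; every other locale is deliberately omitted.
FALLBACKS = {
    "en-US": "windows-1252",  # observed with the en-US bundle
    "ko": "EUC-KR",           # observed with the ko bundle and in Firefox's UI
    "th": "windows-874",      # expected, per the comment above
}


def buckets(observations: Dict[str, str]) -> Dict[str, List[str]]:
    """Group locales by the fallback charset a page would see."""
    grouped = defaultdict(list)
    for locale, charset in sorted(observations.items()):
        grouped[charset].append(locale)
    return dict(grouped)


if __name__ == "__main__":
    for charset, locales in buckets(FALLBACKS).items():
        print(charset, locales)
```

Locales that share a value, as hu and pl are said to, fall into one bucket and cannot be told apart by this property alone.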

Last edited 2 weeks ago by Thorin

comment:10 Changed 2 weeks ago by dcf

Description: modified
Summary: "document.characterSet enables fingerprinting of localization (only with HSTS?)" → "document.characterSet leaks locale when HTML page does not specify its own encoding"