Opened 2 years ago

Closed 14 months ago

Last modified 13 months ago

#13171 closed enhancement (fixed)

meek's reflector should forward the client's IP address/port to the bridge.

Reported by: yawning Owned by: dcf
Priority: Medium Milestone:
Component: Obfuscation/meek Version:
Severity: Normal Keywords:
Cc: isis Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

It would be nice to do this so the value passed to the ExtORPort was correct for better metrics. A few ways this could be done, off the top of my head:

  • Set X-Forwarded-For. The "standard" layout of this field doesn't include the port, but since it's unofficial, there's nothing stopping us from adding it. This would require us to secure the link between the reflector and the meek-server instance separately, which means TLS.
  • Set a custom header (Eg: Meek-Forwarded-For), with a encrypted/encoded IP/Port pair. Less overhead than bringing TLS into the picture. I would use something like a Base64 encoded NaCl crypto_secretbox. Key management here may be an issue, though it depends on who runs the bridge and reflector (The other method has cert management to deal with so this isn't a strict minus IMO).

Child Tickets

Attachments (4)

0001-Count-fraction-of-hosts-with-X-Forwarded-For.patch (2.2 KB) - added by dcf 15 months ago.
Test patch to check fraction of clients sending X-Forwarded-For.
meek-clients-2016-01-15.png (33.5 KB) - added by dcf 14 months ago.
Divergence of meek users and US users since this ticket.
meek-clients-2016-01-22.png (44.8 KB) - added by dcf 13 months ago.
meek-dirreqs-2016-02-01.png (81.9 KB) - added by dcf 13 months ago.

Download all attachments as: .zip

Change History (20)

comment:1 follow-ups: Changed 2 years ago by dcf

Can you be more specific about the metrics you're looking for? The one I can think of is user graphs that are broken down by pluggable transport and geoip, but there aren't graphs like that and they aren't planned: #10218.

If you're just looking for user counts, we already have that. The user graphs don't come from counting unique IPs (they used to); they come from counting directory requests: see #8462 and https://gitweb.torproject.org/metrics-web.git/blob/HEAD:/doc/users-q-and-a.txt and https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf.

I'm a little opposed to adding a custom header, for a few reasons:

  • It can be considered a feature that the bridge doesn't know client IPs. From the client point of view, "screw your metrics, I want my anonymity."
  • We don't actually control the reflector code on any platform other than Google. When we use a CDN, the CDN just adds whatever headers it wants. Actually it seems that CloudFront already adds X-Forwarded-For.
  • As I understand it, Tor could be modified to have its clients send a NETINFO cell containing the address. That's what we would want to do if metrics are really important, because it would generalize across other transports.

comment:2 in reply to: ↑ 1 Changed 2 years ago by dcf

Replying to dcf:

Actually it seems that CloudFront already adds X-Forwarded-For.

I made #13174 to keep an eye on this.

comment:3 in reply to: ↑ 1 ; follow-up: Changed 2 years ago by yawning

Replying to dcf:

Can you be more specific about the metrics you're looking for? The one I can think of is user graphs that are broken down by pluggable transport and geoip, but there aren't graphs like that and they aren't planned: #10218.

That was what I wanted, though karsten's concerns are still valid (See the PT component of Sponsor S for why I care about this all of a sudden).

If you're just looking for user counts, we already have that. The user graphs don't come from counting unique IPs (they used to); they come from counting directory requests: see #8462 and https://gitweb.torproject.org/metrics-web.git/blob/HEAD:/doc/users-q-and-a.txt and https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf.

I'm a little opposed to adding a custom header, for a few reasons:

  • It can be considered a feature that the bridge doesn't know client IPs. From the client point of view, "screw your metrics, I want my anonymity."
  • We don't actually control the reflector code on any platform other than Google. When we use a CDN, the CDN just adds whatever headers it wants. Actually it seems that CloudFront already adds X-Forwarded-For.
  • As I understand it, Tor could be modified to have its clients send a NETINFO cell containing the address. That's what we would want to do if metrics are really important, because it would generalize across other transports.

Hmm, all valid reasons for not using a custom header. I would guess that most CDN platforms would set X-Forwarded-For, and if we wanted to use that information in meek-server, adding the header in the GAE go code would be trivial. I'll think more about #10218.

comment:4 in reply to: ↑ 3 Changed 2 years ago by dcf

Replying to yawning:

Hmm, all valid reasons for not using a custom header. I would guess that most CDN platforms would set X-Forwarded-For, and if we wanted to use that information in meek-server, adding the header in the GAE go code would be trivial. I'll think more about #10218.

You're probably right that all CDNs make the information available somehow. You don't want to use the client port, even if it is available, because a stream is made of multiple HTTP requests and the port is changing all the time. You would want to derive the port from the session-ID somehow.

If you dig through Psiphon's history on meek-client,

you can see that they added support for sending an encrypted cookie through the CDN to the server, I believe for similar reasons:

comment:5 Changed 22 months ago by dcf

This is starting to look more important, now that meek users are a larger fraction of total bridge users. According to my eyeballing of this graph, it's around 20% now. (meek/(Any PT + Default OR protocol) = 5000/(17500+5000) = 22%.)
https://metrics.torproject.org/userstats-bridge-transport.png
That means up to 20% of bridge users are having their country recorded incorrectly, which will affect https://metrics.torproject.org/userstats-bridge-country.html.

comment:6 Changed 22 months ago by isis

  • Cc isis added

This is starting to cause the graph of Tor bridge users from the US to look a bit insane:

https://metrics.torproject.org/userstats-bridge-country.png

with direct correlation to the increase in meek users:

https://metrics.torproject.org/userstats-bridge-transport.png

comment:7 Changed 15 months ago by dcf

  • Severity set to Normal
  • Status changed from new to needs_review

Here's a branch for review:

https://gitweb.torproject.org/pluggable-transports/meek.git/log/?h=bug13171&id=758d7b7b45e96e00d46225fa26933bf15e76111a
https://gitweb.torproject.org/pluggable-transports/meek.git/diff/?h=bug13171&id=758d7b7b45e96e00d46225fa26933bf15e76111a&id2=ed3e8c9b0c34db38ffcd99a0d38d7d4fc1785a62

It adds support for reading X-Forwarded-For, which is set by the Amazon and Azure CDNs, to meek-server. Additionally it recognizes a new made-up header Meek-IP as a synonym for X-Forwarded-For, for App Engine which does not allow you to set X-Forwarded-For:

For security reasons, the following headers cannot be modified by the application: Content-Length, Host, Vary, Via, X-Appengine-Inbound-Appid, X-Forwarded-For, X-ProxyUser-IP.

In the absence of an X-Forwarded-For or Meek-IP header, we fall back (as before) on the client's source address (Request.RemoteAddr). If one of the headers is present but cannot be parsed, we do not fall back to Request.RemoteAddr, because in that case we do not know what the true client address is, but it is probably different from Request.RemoteAddr.

Last edited 15 months ago by dcf (previous) (diff)

comment:8 Changed 15 months ago by dcf

I have had this branch running on the meek-azure bridge for about a day. Here are the bridge stats. By both dirreq-v3-reqs and bridge-ips, us comes first, followed by ru and cn. It's surprising that there are so many meekers from the U.S.

@type bridge-extra-info 1.3
extra-info starman AA033EEB61601B2B7312D89B62AAA23DC3ED8A34
dirreq-stats-end 2015-12-15 11:08:49 (86400 s)
dirreq-v3-reqs us=4392,ru=1688,cn=1632,de=1376,gb=936,fr=576,br=456,ca=344,mx=336,it=296,in=288,au=264,tr=264,nl=192,sa=192,ir=160,es=152,at=144,ch=144,ua=128,co=120,kr=120,my=120,za=120,id=112,fi=104,be=88,cl=80,ae=72,by=72,hk=72,kz=72,ph=72,se=72,cz=64,ie=64,pl=64,dk=56,jo=56,pe=56,pt=56,eg=48,no=48,sg=48,uz=48,bg=40,gt=40,il=40,ng=40,ro=40,ve=40,bd=32,dz=32,ec=32,hu=32,ke=32,kw=32,pr=32,ps=32,py=32,sk=32,sv=32,th=32,vn=32,ar=24,cr=24,do=24,fj=24,jp=24,tm=24,tn=24,aw=16,bf=16,bo=16,cy=16,ee=16,gr=16,hn=16,iq=16,lb=16,lu=16,lv=16,mu=16,mv=16,ni=16,nz=16,si=16,sn=16,tw=16,uy=16,ye=16,af=8,al=8,am=8,ao=8,bh=8,cd=8,ci=8,et=8,hr=8,la=8,ly=8,ma=8,mc=8,md=8,ml=8,om=8,pa=8,pk=8,sy=8,tt=8,ug=8
bridge-stats-end 2015-12-15 11:08:55 (86400 s)
bridge-ips us=1696,ru=816,cn=656,de=552,gb=352,br=224,fr=160,in=128,it=128,es=104,id=104,ir=96,tr=96,au=88,mx=88,ca=80,nl=80,ua=72,at=64,co=64,my=64,ch=56,kr=56,sa=56,za=48,be=40,ie=40,pl=40,sk=40,cl=32,cz=32,jo=32,kz=32,ae=24,by=24,eg=24,fi=24,hk=24,il=24,ng=24,ro=24,se=24,ve=24,ar=16,bd=16,bf=16,cr=16,dk=16,dz=16,ec=16,gt=16,hu=16,jp=16,ke=16,no=16,pe=16,ph=16,pt=16,th=16,tn=16,ye=16,af=8,al=8,am=8,ao=8,aw=8,bg=8,bh=8,bo=8,cd=8,ci=8,cy=8,do=8,ee=8,et=8,fj=8,gh=8,gr=8,hn=8,hr=8,iq=8,kw=8,la=8,lb=8,lu=8,lv=8,ly=8,ma=8,mc=8,md=8,ml=8,mu=8,mv=8,ni=8,nz=8,om=8,pa=8,pk=8,pr=8,ps=8,py=8,qa=8,sg=8,si=8,sn=8,sv=8,sy=8,tm=8,tt=8,tw=8,ug=8,uy=8,uz=8,vn=8

Here are stats from before the patch. Note here the discrepancy between dirreq-v3-req and bridge-ips. The reason that bridge-ips is so low is that it deduplicates client IPs, and before the patch it appears that all connections come from a small number of CDN IP addresses.

@type bridge-extra-info 1.3
extra-info starman AA033EEB61601B2B7312D89B62AAA23DC3ED8A34
dirreq-stats-end 2015-12-12 08:27:37 (86400 s)
dirreq-v3-reqs us=13232,gb=1432,??=416,nl=400,at=384,au=304,kr=112,se=104,sg=88,fi=80,tw=40,be=8,ir=8
bridge-stats-end 2015-12-12 08:27:41 (86400 s)
bridge-ips us=40,??=8,at=8,au=8,be=8,fi=8,gb=8,ir=8,kr=8,nl=8,se=8,sg=8,tw=8

For comparison, here are the stats for the other default bridges. meek-google requests seem to come from a very small number of IP addresses, less than 8.

@type bridge-extra-info 1.3
extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
dirreq-stats-end 2015-12-12 21:16:05 (86400 s)
dirreq-v3-reqs us=17416
bridge-stats-end 2015-12-12 21:16:10 (86400 s)
bridge-ips kr=8,ru=8,us=8

meek-amazon strangely has a high number (1720) for unique client IP addresses. Maybe the Amazon CDN automatically rotates them or something?

@type bridge-extra-info 1.3
extra-info TorLandMeek F4AD82B2032EDEF6C02C5A529C42CFAFE516564D
dirreq-stats-end 2015-12-11 22:34:32 (86400 s)
dirreq-v3-reqs us=7768,dz=16,jp=8
bridge-stats-end 2015-12-11 22:34:35 (86400 s)
bridge-ips us=1720,dz=8,jp=8

Changed 15 months ago by dcf

Test patch to check fraction of clients sending X-Forwarded-For.

comment:9 Changed 15 months ago by dcf

I was curious as to what fraction of clients send X-Forwarded-For. On the meek-azure bridge, it is about 99.5%.

I temporarily applied this patch, which counts the number meek sessions that started with an X-Forwarded-For request. It prints stats after every 1000 new sessions. I let it run for 5000 sessions (it took 20 minutes to get there).

useraddr:  total 5000  X-Forwarded-For 4978 good 0 bad  RemoteAddr 22 good 0 bad  parse 5000 good 0 bad

This means, that out of 5000 requests, 4978 contained X-Forwarded-For or Meek-IP, and in all of those we were able to extract the client IP address from the header. For the remaining 22 requests, we fell back to the old behavior of using the source IP address (RemoteAddr). We were able to parse the resulting IP address in all 5000 cases.

comment:10 Changed 14 months ago by dcf

  • Resolution set to fixed
  • Status changed from needs_review to closed

Merged and tagged 0.21.

Last edited 14 months ago by dcf (previous) (diff)

Changed 14 months ago by dcf

Divergence of meek users and US users since this ticket.

comment:11 follow-up: Changed 14 months ago by dcf

Here's a graph showing how the graphs of meek users and US users begins to diverge after the merging of this patch.
Divergence of meek users and US users since this ticket.
The vertical lines are:

  • 2015-12-14: ticket merged into meek-azure
  • 2015-12-20: ticket merged into meek-google and meek-amazon.
library(ggplot2)
data <- read.csv("clients.csv", colClasses=c(date="Date"))
data$key <- ifelse(data$node=="bridge"&data$country=="us", "us",
        ifelse(data$node=="bridge"&data$transport=="meek", "meek", NA))
p <- ggplot(data[!is.na(data$key), ], aes(date, clients, color=key)) +
        geom_line() +
        geom_vline(xintercept=as.numeric(as.Date("2015-12-14"), "UTC")) +
        geom_vline(xintercept=as.numeric(as.Date("2015-12-20"), "UTC")) +
        coord_cartesian(xlim=as.Date(c("2015-10-01", "2016-01-05")), ylim=c(0, 7500))
ggsave("meek-clients-2016-01-15.png", p, width=6, height=4, dpi=90)
Last edited 14 months ago by dcf (previous) (diff)

comment:12 Changed 14 months ago by dcf

Should we also set X-Forwarded-For/Meek-IP in the nginx, php, and wsgi reflectors? These are the ones that people are meant to set up on their own.

I think hardly anyone is using these at the moment. But if they do, we'll get inaccurate stats. On the other hand, I suppose it is more likely that people will make mistakes setting them up (such as not using an HTTPS origin) and in that case we would want not to forward the IP address.

comment:13 in reply to: ↑ 11 Changed 14 months ago by dcf

Replying to dcf:

Here's a graph showing how the graphs of meek users and US users begins to diverge after the merging of this patch.
The vertical lines are:

  • 2015-12-14: ticket merged into meek-azure
  • 2015-12-20: ticket merged into meek-google and meek-amazon.

We recently found out that meek-amazon hadn't actually upgraded to a version including this patch, so it was still counting essentially all users as being from the U.S. Since yesterday it has been upgraded.

Changed 13 months ago by dcf

comment:14 Changed 13 months ago by dcf

Here is an updated graph, this time including cn in addition to us. The three vertical lines show the dates when the azure, google, and amazon bridges merged this patch in order.

I'm confused by why the cn line is pretty much flat since the merging of this patch. When I look at the dirreq stats for these bridges, about 12% of requests come from cn. Not sure why it's not reflected in the graph.


library(ggplot2)
data <- read.csv("clients.csv", colClasses=c(date="Date"))
data$key <- ifelse(data$node=="bridge"&data$country=="us", "us",
        ifelse(data$node=="bridge"&data$country=="cn", "cn",
        ifelse(data$node=="bridge"&data$transport=="meek", "meek", NA)))
p <- ggplot(data[!is.na(data$key), ], aes(date, clients, color=key)) +
        geom_line() +
        geom_vline(xintercept=as.numeric(as.Date("2015-12-14"), "UTC")) +
        geom_vline(xintercept=as.numeric(as.Date("2015-12-20"), "UTC")) +
        geom_vline(xintercept=as.numeric(as.Date("2016-01-11"), "UTC")) +
        coord_cartesian(xlim=c(as.Date("2015-10-01"), as.Date("2016-01-22")), ylim=c(0, 7500))
ggsave("meek-clients-2016-01-22.png", p, width=6, height=4, dpi=90)

comment:15 Changed 13 months ago by karsten

I think I found the bug that led to the confusion above. See #18167 for details.

Changed 13 months ago by dcf

comment:16 Changed 13 months ago by dcf


Here is a picture that graphs directory requests, as suggested in #18167. This graph makes more sense to me. Here, the countries are broken down per bridge. You can easily see the dates when each bridge merged this ticket, #13171, and switched from nearly 100% us to a mixture of countries. The top 5 countries are us, ru, de, gb, and cn.

Note: See TracTickets for help on using tickets.