Mobile phones have unstable network connections and their IP addresses change frequently. Hidden services don't work well on them.
Here are some things that can go wrong when the mobile phone (and hence the HS) loses network or changes its IP address:
The circuits to the intro points break, so the HS establishes new intro points and republishes its descriptor. Its clients are not aware of the new intro points and keep trying the old ones. This is #8239, which might be fixed soon.
The rendezvous circuits to current clients break, and the HS does not reestablish them. Clients then keep retrying the same broken rendezvous point over and over, instead of re-introducing themselves (or fetching a new descriptor entirely). We should verify that the behavior really is broken in this way, and think of better behaviors here.
I recently analyzed the behavior of Hidden Services (HS) when their IP address changes and identified the following problems.
A client connected to an HS doesn't notice that the established circuit is broken when the HS changes its address. This happens because the circuit isn't closed until the entry guard of the HS detects a TCP timeout and sends a destroy cell down the circuit. The application itself can handle the problem by using its own acknowledgments and closing the circuit when it detects a timeout. The circuit can be closed through the Tor Control Protocol (use GETINFO stream-status to find the circuit ID and CLOSECIRCUIT to close it).
An HS has to notice that its own connections are broken after an IP address change so that it can reestablish circuits to the introduction points. On Linux this takes until TCP reports a timeout, which can take quite long. On Android, connections get killed by the OS when an interface changes.
On Android I noticed that an HS chose new introduction points after an IP change because Tor thought the old ones were down. The issue seems to be addressed here: #8239.
Another related problem is described in #16966. It may result in a five-minute wait when an HS reuses its introduction points after a downtime.
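The control-port workaround from the first point above can be sketched as follows. This is only a minimal illustration, assuming the application already has an authenticated control connection and has extracted the data lines of a `GETINFO stream-status` reply (each of the form `StreamID StreamStatus CircuitID Target`). The helper names and the sample onion address are hypothetical.

```python
def parse_stream_status(data: str):
    """Parse the data lines of a GETINFO stream-status reply.

    Each line is expected to look like:
        StreamID SP StreamStatus SP CircuitID SP Target
    Lines that do not have at least four fields are skipped.
    """
    streams = []
    for line in data.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        stream_id, status, circ_id, target = parts[:4]
        streams.append(
            {"stream": stream_id, "status": status,
             "circuit": circ_id, "target": target}
        )
    return streams


def circuits_for_target(streams, target: str):
    """Return the IDs of circuits carrying streams to `target`.

    A circuit ID of "0" means the stream is not attached to any circuit,
    so it is ignored.
    """
    return sorted(
        {s["circuit"] for s in streams
         if s["target"] == target and s["circuit"] != "0"}
    )


def closecircuit_command(circ_id: str) -> str:
    """Build the control-port line that asks Tor to close one circuit."""
    return f"CLOSECIRCUIT {circ_id}\r\n"
```

The application would call this once its own acknowledgment timeout fires, send the resulting `CLOSECIRCUIT` line over the control connection, and then reconnect to the HS.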
Any other opinions on how these problems can be solved?
I'm currently working on [https://github.com/kit-tm/PTP] and am therefore interested in making Hidden Services work smoothly on mobile devices (particularly Android).
Trac: Username: timonh; Severity: N/A to Normal; Reviewer: N/A to N/A
I recently analyzed the behavior of Hidden Services (HS) when their IP address changes and identified the following problems.
Hello timonh! Thanks for doing this analysis!
I don't know much about mobile networking and programming, but it's definitely something we need to improve ASAP, so any pointers/feedback is welcome!
A client connected to an HS doesn't notice that the established circuit is broken when the HS changes its address. This happens because the circuit isn't closed until the entry guard of the HS detects a TCP timeout and sends a destroy cell down the circuit. The application itself can handle the problem by using its own acknowledgments and closing the circuit when it detects a timeout. The circuit can be closed through the Tor Control Protocol (use GETINFO stream-status to find the circuit ID and CLOSECIRCUIT to close it).
An HS has to notice that its own connections are broken after an IP address change so that it can reestablish circuits to the introduction points. On Linux this takes until TCP reports a timeout, which can take quite long. On Android, connections get killed by the OS when an interface changes.
Hm, I feel the two points above are connected. For example, if the client realizes that the rend circuit is broken before the HS reestablishes its intro circuits, then the client will try to introduce herself to a broken intro point. That's no good.
Since the client has to reintroduce herself when a rend circuit dies (right?), it probably makes sense to have the HS reestablish intro circuits as fast as possible.
On this topic, you mentioned that in Android connections get killed when the interface changes; do you think this behavior is something we could use? Or maybe Tor already uses this behavior implicitly, since it will notice the killed connections and try to reestablish its intro circuits? Can we do better here?
On Android I noticed that an HS chose new introduction points after an IP change because Tor thought the old ones were down. The issue seems to be addressed here: #8239.
This should be fixed now.
Another related problem is described in #16966. It may result in a five-minute wait when an HS reuses its introduction points after a downtime.
Looking at the ticket, this seems like something we will fix as part of "next gen hidden services" (proposal 224). Does this happen frequently enough for you that we should consider baking it into the current system as well?
Any other opinions on how these problems can be solved?
I'm currently working on [https://github.com/kit-tm/PTP] and am therefore interested in making Hidden Services work smoothly on mobile devices (particularly Android).
Interesting project :) Best of luck and keep in touch! You might also want to join our IRC channels on OFTC (#tor-dev).
Hm, I feel the two points above are connected. For example, if the client realizes that the rend circuit is broken before the HS reestablishes its intro circuits, then the client will try to introduce herself to a broken intro point. That's no good.
Since the client has to reintroduce herself when a rend circuit dies (right?), it probably makes sense to have the HS reestablish intro circuits as fast as possible.
I totally agree with you on that. If the HS reestablishes its intro circuits quickly, clients can just reconnect after they notice a timeout, without needing to fetch a new descriptor.
On this topic, you mentioned that in Android connections get killed when the interface changes; do you think this behavior is something we could use? Or maybe Tor already uses this behavior implicitly, since it will notice the killed connections and try to reestablish its intro circuits? Can we do better here?
The behavior of Android is good for us in the sense that we don't have to wait for the long TCP timeout as on Linux. I could confirm that Tor 0.2.8 notices that the circuits are broken after an IP change and tries to reestablish the intro circuits. But it seems that when I switch from wifi to the mobile network, Tor tries to reconnect too early, while the interface isn't up yet, and therefore concludes that the intro points aren't reachable anymore. This results in Tor choosing new intro points.
I attached the interesting part of the log. Looking at the log, Tor notices that the network is unreachable but draws the wrong conclusion from it (that the intro point isn't reachable anymore).
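One way to picture the distinction being asked for here is a check that separates "my own network is down" from "the intro point is down" before counting a failure against the intro point. The sketch below is purely illustrative and is not how Tor classifies failures. The function names are hypothetical, and the choice of errno values is an assumption.

```python
import errno

# Assumption for illustration: these errno values indicate that the local
# network (not the remote relay) is the problem.
LOCAL_NETWORK_ERRNOS = {errno.ENETUNREACH, errno.ENETDOWN}


def counts_against_intro_point(err: OSError) -> bool:
    """Return True if the failure should be blamed on the intro point."""
    return err.errno not in LOCAL_NETWORK_ERRNOS


def record_failure(retries: int, err: OSError) -> int:
    """Increase the retry counter only for failures blamed on the intro point."""
    return retries + 1 if counts_against_intro_point(err) else retries
```

With a check like this, a connection attempt made while the interface is still down would leave the retry counter untouched, so the existing intro points would not be discarded prematurely.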
Another problem here is that a client holding an old descriptor of an HS that has since chosen new intro points and published a new descriptor will try to reach the old intro points for a long time before fetching the descriptor again. The old intro points don't notice that the circuit to the HS is broken because of the long TCP timeout, so they still acknowledge the client's RELAY_COMMAND_INTRODUCE1 cells.
Looking at the ticket, this seems like something we will fix as part of "next gen hidden services" (proposal 224). Does this happen frequently enough for you that we should consider baking it into the current system as well?
I haven't noticed the problem during my tests yet. Is there a schedule when proposal 224 will be implemented/released?
I reran the test switching from wifi to the mobile network on Android. Looking at the log, it seems that Tor retries the intro points once but then decides to try other ones.
See attachment torlog2.
I'm using Tor 0.2.8 so #8239 should be fixed and Tor should retry each intro point three times.
Looking at the code (rend_consider_services_intro_points() in rendservice.c), the intro points to retry are determined by a call to remove_invalid_intro_points(). Tor then tries to establish a circuit to each of them. After that, Tor tries other intro points if there aren't enough yet. So if the retried points fail, Tor chooses other ones.
The code to retry an introduction point three times is contained in remove_invalid_intro_points() using MAX_INTRO_POINT_CIRCUIT_RETRIES.
So subsequent calls to remove_invalid_intro_points() will return an introduction point to retry up to three times.
But a single call to rend_consider_services_intro_points() will only retry each introduction point once and then try others.
Is this intended behavior?
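The behavior described above can be illustrated with a toy model. This is not Tor's code, only a simplified sketch of the logic as described in this comment. The constant name MAX_INTRO_POINT_CIRCUIT_RETRIES and its value of three come from the description above. The `toy_` functions, the `IntroPoint` class, and the `can_connect` and `pick_new` callbacks are hypothetical.

```python
from dataclasses import dataclass

MAX_INTRO_POINT_CIRCUIT_RETRIES = 3  # value as described in the comment above


@dataclass
class IntroPoint:
    name: str
    circuit_up: bool = False
    circuit_retries: int = 0


def toy_remove_invalid_intro_points(intro_points):
    """Toy stand-in: drop exhausted intro points, return the ones to retry."""
    retry = []
    for ip in list(intro_points):
        if ip.circuit_up:
            continue
        if ip.circuit_retries >= MAX_INTRO_POINT_CIRCUIT_RETRIES:
            intro_points.remove(ip)
        else:
            ip.circuit_retries += 1
            retry.append(ip)
    return retry


def toy_consider_intro_points(intro_points, wanted, can_connect, pick_new):
    """One pass: retry each broken intro point once, then top up with new ones."""
    for ip in toy_remove_invalid_intro_points(intro_points):
        ip.circuit_up = can_connect(ip)
    # If the single retry round left too few working points, new ones are
    # picked in the same pass, even though retries remain for the old ones.
    missing = wanted - sum(ip.circuit_up for ip in intro_points)
    added = [pick_new() for _ in range(max(0, missing))]
    intro_points.extend(added)
    return added
```

In this model, if all three intro points fail while the network is down, a single pass leaves each of them with one retry used and already adds three new intro points, which matches what the log shows.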
Another problem here is that a client holding an old descriptor of an HS that has since chosen new intro points and published a new descriptor will try to reach the old intro points for a long time before fetching the descriptor again. The old intro points don't notice that the circuit to the HS is broken because of the long TCP timeout, so they still acknowledge the client's RELAY_COMMAND_INTRODUCE1 cells.
Regarding this issue, the idea came up that an intro point could wait for an INTRODUCE2_ACK from the HS before sending an INTRODUCE_ACK to the client.
Then a client would notice that the HS didn't receive the RELAY_COMMAND_INTRODUCE2 cell using a timeout and wouldn't wait at the rendezvous point for a long time.
I'm not sure which other implications the change would have.
Another possibility would be to close ready rendezvous points earlier using a timeout and then fetch the descriptor again.
I reran the test switching from wifi to the mobile network on Android. Looking at the log, it seems that Tor retries the intro points once but then decides to try other ones.
See attachment torlog2.
I'm using Tor 0.2.8 so #8239 should be fixed and Tor should retry each intro point three times.
Looking at the code (rend_consider_services_intro_points() in rendservice.c), the intro points to retry are determined by a call to remove_invalid_intro_points(). Tor then tries to establish a circuit to each of them. After that, Tor tries other intro points if there aren't enough yet. So if the retried points fail, Tor chooses other ones.
The code to retry an introduction point three times is contained in remove_invalid_intro_points() using MAX_INTRO_POINT_CIRCUIT_RETRIES.
So subsequent calls to remove_invalid_intro_points() will return an introduction point to retry up to three times.
But a single call to rend_consider_services_intro_points() will only retry each introduction point once and then try others.
Is this intended behavior?
Hey timonh! Thanks for helping us track down this bug. I opened a ticket for it at #19522. I also CCed you in case you want to try to fix it, or maybe you want to help us test any patches.
It's worth noting that Tor clients also have some of these issues on mobile, whether accessing hidden services or exits. So general improvements that improve tor's response to broken circuits will also help with hidden services on mobile.
To detect broken circuits earlier, Tor could use TCP keepalive or its own keepalive messages. If Tor used keepalive messages to detect broken connections (and through them, broken circuits), it would be necessary to negotiate the interval, so that the other end knows when keepalive messages should arrive and when the connection has expired. Right now keepalive messages are only used to keep firewalls from expiring connections, and the interval is set by KeepalivePeriod.
I don't know how easy it is to use TCP keepalive in a platform-independent manner. Another question is whether this might harm a user's privacy: a user with a different interval than the majority might stand out.
Also, on Linux there is a nice option called TCP_USER_TIMEOUT, which allows setting a "maximum amount of time in milliseconds that transmitted data may remain unacknowledged before TCP will forcibly close the corresponding connection". This would improve the situation where an intro point sends a RELAY_COMMAND_INTRODUCE2 to an HS that isn't reachable anymore: the intro point would detect earlier (depending on TCP_USER_TIMEOUT) that the connection is broken.
For idle connections it would still be necessary to use a keepalive mechanism.
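As a rough illustration of the two socket-level mechanisms mentioned above, the sketch below enables TCP keepalive and, where the platform offers it, TCP_USER_TIMEOUT on a socket. It is not taken from Tor, and the helper name and all timing values are arbitrary assumptions for demonstration. Options that the platform does not expose are skipped, which also shows the portability concern raised above.

```python
import socket


def enable_fast_failure_detection(sock, idle_s=30, interval_s=10,
                                  probes=3, user_timeout_ms=60000):
    """Enable keepalive and (if available) TCP_USER_TIMEOUT on `sock`.

    Returns a dict of the option values the kernel reports after setting,
    containing only the options this platform supports.
    """
    applied = {}
    # Keepalive covers idle connections: probes are sent when no data flows.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_KEEPALIVE)

    tcp_options = (
        ("TCP_KEEPIDLE", idle_s),        # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),   # seconds between probes
        ("TCP_KEEPCNT", probes),         # unanswered probes before giving up
        # Limits how long sent data may stay unacknowledged (milliseconds).
        ("TCP_USER_TIMEOUT", user_timeout_ms),
    )
    for name, value in tcp_options:
        option = getattr(socket, name, None)
        if option is None:
            continue  # not exposed on this platform
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied
```

The privacy question raised above still applies: if such values were configurable per user, an unusual setting could make that user distinguishable.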