I assume it's a codepath rarely used, for when we give up on a circuit we're making because it's been a while since we sent a create cell, that is triggered more commonly during our "million new users" crisis.
The warning is nothing to worry about, but it's something we should track down and fix in the code.
Looks like the call chain is that command_process_relay_cell() receives an extend or extend2 cell, passes it to circuit_receive_relay_cell(), which passes it to connection_edge_process_relay_cell() (in the CELL_DIRECTION_IN case), which sees RELAY_COMMAND_EXTENDED or RELAY_COMMAND_EXTENDED2 and calls circuit_finish_handshake() which pulls out the 'created' cell and calls onion_skin_client_handshake(),
which returns -1 but doesn't say why.
So if I were to suggest a fix, I'd suggest that onion_skin_client_handshake() should have an *errmsg parameter which it sets if it wants to complain about something. Then the various subfunctions it calls can set it, or we can choose not to complain if a failure isn't worth telling the user about.
skruffy describes a plausible explanation for why these client-side warns are happening: the client is using an old ntor onion key, and in the new ntor design, there's no way for the relay to know what onion key the create request is intended for. So the relay answers anyway, and the client is shocked to find that it can't complete the handshake.
I guess it would be nice if the relay could realize that the handshake is going to fail, so it can help the client understand what's happening. But failing that, I guess we should make the client not complain when it fails?
Trac: Summary: Tor fails handshakes, onion_skin_client_handshake, circuit_finish_handshake, connection_edge_process_relay_cell to Tor clients warn when they use the wrong ntor onion key
skruffy describes a plausible explanation for why these client-side warns are happening: the client is using an old ntor onion key, and in the new ntor design, there's no way for the relay to know what onion key the create request is intended for. So the relay answers anyway, and the client is shocked to find that it can't complete the handshake.
rransom is correct that there is a perfectly good way (KEYID) for the relay to learn which ntor key was supposed to get used.
But, if for some weird reason, the client has an onion key so ancient that the relay doesn't recognize it any longer, then rather than send a DESTROY cell, the relay will pick a junk onion key and use that to complete the handshake anyway, on the theory that it's better not to leak anything in timing information. Could that be what's going on here?
But, if for some weird reason, the client has an onion key so ancient that the relay doesn't recognize it any longer, then rather than send a DESTROY cell, the relay will pick a junk onion key and use that to complete the handshake anyway, on the theory that it's better not to leak anything in timing information. Could that be what's going on here?
If the client has an invalid ntor onion key, then it looks like this would be a symptom of it because the AUTH value sent by the server will not match the client's computed auth_input. This seems possible, and the warning will be meaningless to the user, so a possible patch is to simply zero Y, before sending it back to the client, if the server chooses the junk key. This will be the point at infinity in the exponentiation, so current implementations will still fail and warn, but new implementations can check for this value and not warn on failure. Is it worth reducing the security level of the handshake just so we can handle this exception?
Alternatively, maybe just demote the message from warning to notice or info level. In either case, because there are subsequent log messages, we'll need to be able to inform the calling function that we had an error but not to emit the warning. We could return -2 and catch that up the stack, I guess. What's choice 3?
I've done a "better log message" branch as "bug9635". I think it's fine to merge it in 0.2.5.
We aren't actually seeing a lot of reports of these IRL, though, so I'll suggest that deferring to 0.2.6 could be fine too.
(Grepping an authority's logs isn't representative, since we'd expect an authority to see a larger-than-usual share of nodes that are trying to configure themselves.)
Seeing a few of these in my log:- [warn] Invalid result from curve25519 handshake: 4Always followed by:-Jun 19 07:51:24.000 [warn] onion_skin_client_handshake failed.Jun 19 07:51:24.000 [warn] circuit_finish_handshake failed.Jun 19 07:51:24.000 [warn] connection_edge_process_relay_cell (at origin) failed.
Is this the same bug? I haven't seen the Invalid result from... message before.
So, "4" in that context means the auth value was wrong, which could indeed correspond to the case discussed above where the client asks for an onion key the server doesn't believe in, so the server uses one that it does, knowing that it will generate a bogus result.
Oct 01 22:18:58.000 [warn] Invalid result from curve25519 handshake: 4
Oct 01 22:18:58.000 [warn] onion_skin_client_handshake failed.
Oct 01 22:18:58.000 [warn] circuit_finish_handshake failed.
Oct 01 22:18:58.000 [warn] connection_edge_process_relay_cell (at origin) failed.
on
Oct 01 22:12:46.000 [notice] Tor 0.2.5.8-rc (git-a64f3ab3ee5c433c) opening log file.
Turning these into info-level logs makes sense to me -- it's not anything the user can do anything about, and it does indeed happen often enough that people keep reporting it.
That said, is there a patch? I think it might be a bit complex to downgrade the log severity here without downgrading the general "a circuit build attempt failed" case.
My "if I were to suggest a fix" paragraph above still makes sense, but it's sure not a one-liner.
Ug. Yeah, that's not a one-liner at all. See branch bug9635_warnings_025. It's a 400-line diff touching over a dozen modules. It's against 0.2.5, but despite the name I think maybe this one should go into 0.2.6.
Of especial interest when reviewing this is the change in relay.c, which makes a circuit_mark_for_close happen earlier and changes a return value.
circuit_finish_handshake() +1284: adding log_warn(LD_BUG, "Couldn't initialize cpath crypto"); seems redondant because circuit_init_cpath_crypto() already has logging for every possible failure.
onion_skin_ntor_client_handshake() +291: So this means that if msg_out is NULL, there are no log at all. Seems reasonable for bad & 4 because it was downgraded but bad & 3 is still a warning thus should it be logged even if msg_out is NULL?