Opened 7 years ago

Closed 5 years ago

#6538 closed enhancement (fixed)

Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant

Reported by: nickm Owned by:
Priority: Medium Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: 023backport tor-client
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

See #6537 for discussion. There's a branch in the middle of our node selection algorithm, which is bad for time-invariance. Furthermore, a sufficiently clever compiler might decide to reinsert the break statement that 6537 so carefully removed. (I have not yet found a compiler this clever, but better safe than sorry!)

We can probably do better by using a bit-twiddling approach to setting i_chosen. For example, rransom wrote:

  i_chosen = 0;
  i_hasnt_been_chosen = ~0;
  for (...) {
     int64_t choose_this_i;
     tmp += int_bandwidths[i];
    choose_this_i = rand_bw - (tmp+1);
    choose_this_i = choose_this_i >> 63;
    /* choose_this_i = -1 if rand_bw < (tmp+1); choose_this_i = 0 otherwise */
    i_chosen |= (i & choose_this_i & i_hasnt_been_chosen);
    i_hasnt_been_chosen &= ~choose_this_i;
 }
 i = i_chosen;

This looks okay to me; more eyes are needed here, though.

Child Tickets

Change History (23)

comment:1 Changed 7 years ago by arma

Summary: Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invairantUse bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant

I think I'd argue this can wait until 0.2.4. (I might have argued #6537 could wait until 0.2.4 too.)

comment:2 Changed 7 years ago by nickm_mobile

I'm not sure I'm comfortable with that. Honestly, you may have to talk me out of doing this in 0.2.2.

That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.

comment:3 in reply to:  2 Changed 7 years ago by rransom

Replying to nickm_mobile:

That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.

That would work, although you'll still need to switch to using integer math (rather than operating on the floating-point bandwidth values directly).

comment:4 Changed 7 years ago by nickm

I'll bite; what's wrong with using DBL_MAX ?

comment:5 Changed 7 years ago by nickm

ah; because you can't set rand_bw to DBL_MAX, because rand_bw is a uint64.

comment:6 Changed 7 years ago by nickm

Status: newneeds_review

Okay. Branch "bug6538_023" in my public repository has the minimal fix for branch 0.2.3. Branch "bug6538" has additional code to refactor those functions a little more and extract the common logic into a testable function; it should be considered only for 0.2.4.

comment:7 Changed 7 years ago by nickm

Oops. Hadn't actually pushed "bug6538_023" to my public repo. Done. (If you started looking at "bug6538", please don't worry -- it contains all of "bug6538_023" and a little more.)

arma: I am still liable to merge bug6538_023 into maint-0.2.3 if nobody stops me.

comment:8 Changed 7 years ago by arma

Looking at bug6538_023:

     log_warn(LD_BUG, "Round-off error in computing bandwidth had an effect on "
              " which router we chose. Please tell the developers. "

is more likely to trigger now, since we make it all the way through the loop each time -- even though in nearly none of these cases does it actually have an effect on which router we chose? I recognize this is a potential problem with the original fix, not this one (unless I'm wrong).

"incancations"

More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.

comment:9 in reply to:  8 ; Changed 7 years ago by nickm

Replying to arma:

More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.

The risk is that somebody on your LAN (or a local eavesdropper, or your entry node), can figure out about what your path is. Conservative wisdom is that a fairly local attacker can detect 10-100ns timing leaks, and a remote one can detect 1-10ms timing leaks. I'm especially worried about slow mobile devices here.

Instead of calling the risk "low", do you mean "unproven" or something?

comment:10 in reply to:  9 ; Changed 7 years ago by arma

Replying to nickm:

Instead of calling the risk "low", do you mean "unproven" or something?

I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

comment:11 Changed 7 years ago by nickm

rransom points out that he has some integer scaling code that's better than what I wrote:

13:48 < rransom> Was there a reason to use llround (and add autocrap goo to 
                 check for it) instead of my approach to int64_t-izing the 
                 bandwidths?
13:50 < nickm> Argh.  Context-switching.  I had thought that my approach was to 
               int64_t-ize: that is, to move the bandwidth array to uint64_t 
               earlier in the computation and get "double" out of the picture 
               entirely by the time that we did the "choose one randomly by 
               weight" computation.
13:51 < nickm> The reason I want the computation to be (array of uint64_t, 
               length, entropy) -> index is that I trust myself to reasoning 
               about this stuff much better when all values are integers.
13:51 < nickm> *to reason
13:52 < rransom> You used llround on (approximately) the values that would have 
                 been in the array of doubles before. I rescaled them to a 
                 total bandwidth around 2^61 and justcasted to int64_t .
13:54 < nickm> A decent idea; maybe we should do that too.  Why 2^61?
13:55 < rransom> Because it's big, but not close enough to 2^63 that I would 
                 have to think carefully about whether the roundoff errors 
                 could overflow and really break things.
14:01 < nickm> mind  if I C&P the above to the bug report?
14:01 < rransom> Go ahead.

comment:12 in reply to:  10 Changed 7 years ago by nickm

Replying to arma:

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

So the offending code in Tor today is:

  for (i=0; i < (unsigned)smartlist_len(sl); i++) {
    tmp += bandwidths[i];
    if (tmp >= rand_bw && !i_has_been_chosen) {
      i_chosen = i;
      i_has_been_chosen = 1;
    }
  }

Note that in the part of the loop before we choose the router, the tmp >= rand_bw part will be false and the !i_has_been_chosen part will never be executed. When we choose the router, both tests are executed and we run the conditional block. After we choose the route, tmp >= rand_bw will be true and !i_has_been chosen will be executed and will turn out to be false.

I wouldn't expect any cache misses to be different in these cases. So what we need to worry about is branch mispredictions. In the worst case, let's say that in the first half of the loop, we correctly predict the first test, and in the second part of the loop, we mispredict both tests every time (because our architecture sucks hard). So, the biggest difference in timing will be on the order of N_ROUTERS * branch_misprediction_cost * 2. Something like 3-10 cycles is a reasonable order of magnitude for branch_misprediction_cost on a modern CPU. I don't know where you'll find an embedded device with less than 400 MHz running Tor these days, so let's say our minimum clock speed is 200 MHz. Let's say there are 5000 routers. So in the worst imaginable case, that's 0.25 ms, which will definitely be detectable locally, and might be detectable remotely depending on assumptions, prevailing conditions, and the phase of the moon.

Making some more favorable assumptions, and assuming no huge branch mispredictions, assuming that it's only one extra cycle to check !i_has_been_chosen, assuming that our clock speed is a nice hefty 1 GHz: that's still 5 microseconds, which should be detectable locally.

comment:13 in reply to:  10 Changed 7 years ago by rransom

Replying to arma:

Replying to nickm:

Instead of calling the risk "low", do you mean "unproven" or something?

I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).

I have been using the patch that I wrote to fix this issue for over a month, with not even a scary log message.

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

The current code (after #6537) loops at a different frequency after it selects a relay, which will be quite noticeable to someone picking up a smartphone's RF emissions.

comment:14 Changed 7 years ago by nickm

wrt the scale issue, I'm seeing typical values that suggest that the integer roundoff doesn't matter much here. Is their counterveiling evidence?

comment:15 in reply to:  14 Changed 7 years ago by rransom

Replying to nickm:

wrt the scale issue, I'm seeing typical values that suggest that the integer roundoff doesn't matter much here. Is their counterveiling evidence?

Whether integer roundoff matters or not depends on the current value of weight_scale.

Also, are you sure that the C comparison operator you want to leave in (tmp > rand_bw) will be implemented in constant time on 32-bit systems?

comment:16 in reply to:  10 Changed 7 years ago by arma

Replying to arma:

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

Ok. I'm fine having this go into 0.2.3 if you come up with something you still want to put into 0.2.3.

comment:17 Changed 7 years ago by nickm

Branch revised. Now targeting 0.2.4 with some possibility for backport.

comment:18 Changed 7 years ago by nickm

Still looks good to me. Still passes unit tests. Going to merge soon.

The one to look at is "bug6538" in my public repository. I think it answers all of rransom's notes.

comment:19 Changed 7 years ago by nickm

Keywords: 023backport added

Merged this to master. I'm going to stick it in 023backport too; if I do an 0.2.3-plus-extras series, then assuming a lack of bugs, it goes in.

comment:20 Changed 7 years ago by nickm

Keywords: tor-client added

comment:21 Changed 7 years ago by nickm

Component: Tor ClientTor

comment:22 Changed 7 years ago by arma

Looks like we're not doing an 0.2.3-plus-extra series.

Do you still think this should go in mainline 0.2.3 (i.e. is it a huge security flaw that must be fixed or we're screwing a large number of our users)?

comment:23 Changed 5 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final
Resolution: fixed
Status: needs_reviewclosed

Marking a batch of tickets that had been under consideration for 0.2.3 backport as fixed-in-0.2.4. There is no plan for further 0.2.3 releases.

Note: See TracTickets for help on using tickets.