Opened 7 years ago

Closed 7 years ago

#5397 closed task (fixed)

Understand why downloads are stalling in simulators, and fix it somehow

Reported by: arma
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords:
Cc: robgjansen, kevin
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In #5336 and #4486 Rob reports a significant fraction of failed downloads now that he's reduced the capacity of relays.

Do these happen because of some Tor bug? Or some simulation bug?

It is especially important to find the answer because a client that stops getting bytes but doesn't get an end cell will stop putting load on the network, thus influencing the rest of the simulation.

Child Tickets

Attachments (1)

5336taska-filtered.log.xz (1.9 KB) - added by robgjansen 7 years ago.
short message log

Change History (21)

comment:1 Changed 7 years ago by kevin

Interesting! I've been debugging what seems like the same issue with the latest Tor code in ExperimenTor (and hence I've been reluctant to post my obviously broken results).

comment:2 in reply to:  1 Changed 7 years ago by robgjansen

Replying to kevin:

Interesting! I've been debugging what seems like the same issue with the latest Tor code in ExperimenTor (and hence I've been reluctant to post my obviously broken results).

At least broken results make others aware, and hopefully dispel any notion that we're just using random graph generators :P I've been operating under the assumption that the usual caveat applies to unpublished work ;-)

comment:3 Changed 7 years ago by arma

In #5336 Rob asked "Did something change in a recent version of Tor?"

We could answer that question by comparing a run of the "maint-0.2.2" branch with a run of the "master" branch, yes?

comment:4 Changed 7 years ago by arma

Summary: Understand why downloads are stalling in Shadow, and fix it somehow → Understand why downloads are stalling in simulators, and fix it somehow

comment:5 in reply to:  3 Changed 7 years ago by arma

Replying to arma:

In #5336 Rob asked "Did something change in a recent version of Tor?"

We could answer that question by comparing a run of the "maint-0.2.2" branch with a run of the "master" branch, yes?

Actually, a better plan is "compare a branch that worked for you on some previous run with the master branch".

My guess is that the problem is introduced by squeezing down the capacity of relays, rather than by some recent change in Tor. But it would be great to rule out code changes.
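
For concreteness, a rough sketch of that comparison (the branch names are the ones mentioned above; the build and run steps are placeholders for however you normally drive your Shadow experiments):

# in an existing Tor checkout; build/run steps are placeholders
git checkout maint-0.2.2     # a branch that worked on an earlier run
#   ...build tor, run the Shadow experiment, note the download failure rate...
git checkout master
#   ...rebuild, re-run the identical experiment on the same network/config...
# stalls only on master -> a Tor code change is implicated
# stalls on both        -> suspect the reduced relay capacities instead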

comment:6 Changed 7 years ago by arma

If somebody can keep debug logs at a client where this is happening, and ideally also at the exit it's matched up with, I'll try to make some guesses about what's getting wedged (if it is in fact inside Tor).
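
One low-effort way to capture that, as a sketch (the file paths are assumptions; point them at wherever the simulated client's torrc actually lives, and do the same on the suspect exit if possible):

# append to the affected client's torrc before the run
cat >> client/torrc <<'EOF'
Log debug file debug.log
EOF
# after a download stalls, compress the (large) debug log before attaching it here
xz -9 client/debug.log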

comment:7 Changed 7 years ago by nickm

How much hardware/software would I need to reproduce this? If Rob and Kevin are the only ones who can reproduce this bug, it's going to be trickier to solve than if everybody could reproduce it.

Changed 7 years ago by robgjansen

Attachment: 5336taska-filtered.log.xz added

short message log

comment:8 in reply to:  6 Changed 7 years ago by robgjansen

Replying to arma:

If somebody can keep debug logs at a client where this is happening, and ideally with the exit that it's matched up with, I'll try to make some guesses about what's being wedged (if it is in fact inside of Tor).

I've uploaded a log from one of the clients from my original run of #5336 taska.

This doesn't contain much information. I'm re-running this now with a more verbose log level.

comment:9 in reply to:  7 Changed 7 years ago by robgjansen

Replying to nickm:

How much hardware/software would I need to reproduce this? If Rob and Kevin are the only ones who can reproduce this bug, it's going to be trickier to solve than if everybody could reproduce it.

I've been using the small-size network (found in scallion/resource). I've found this generally runs in about 10 GiB of RAM. You can run it on an m1.xlarge or m2.xlarge EC2 instance as described here.
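
A trivial pre-flight check before launching, in case it saves someone a wedged machine (the 10 GiB figure is the one above; the threshold is otherwise arbitrary):

# warn if the box has less total RAM than the small network typically needs
free -g | awk '/^Mem:/ { if ($2 < 10) print "warning: under 10 GiB RAM; the small network may swap or OOM" }'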

comment:10 Changed 7 years ago by nickm

Hm. "tiny" won't work here? I'd like to reproduce this on my desktop if possible. (I guess that buying more RAM is also an option.)

comment:11 in reply to:  10 Changed 7 years ago by robgjansen

Replying to nickm:

Hm. "tiny" won't work here? I'd like to reproduce this on my desktop if possible. (I guess that buying more RAM is also an option.)

Ideally, tiny would work just as well as the others. I've had bad luck in the past consistently getting experiments to bootstrap properly on tiny; the bootstrapping issues tend to go away on medium. I've not yet had time to dig into the details.

Try tiny first if you wish. If issues still exist in Tor/Shadow on tiny, I'd be more than happy to have a competent Tor developer distinguish and properly categorize them.

comment:12 Changed 7 years ago by nickm

Hm. It ran for a while, then said:

fopen(): No such file or directory
fopen(): No such file or directory
fopen(): No such file or directory
./bulk.dl: No such file or directory
**
ERROR:/home/nickm/src/shadow/shadow/src/utility/shd-cdf.c:178:cdf_free: assertion failed: (cdf && (cdf->magic == MAGIC_VALUE))
[2012-06-11 15:50:55.800677] Shadow wrapper script: run returned -6
[2012-06-11 15:50:56.094233] scallion: Shadow returned 250 in 0:05:49.170612 seconds

Perhaps we can take this "Nick tries to do shadow" stuff to email, and maybe even schedule a time for you to teach me how to do this stuff properly.

comment:13 in reply to:  12 Changed 7 years ago by robgjansen

Perhaps we can take this "Nick tries to do shadow" stuff to email, and maybe even schedule a time for you to teach me how to do this stuff properly.

Sounds good, I'll email you shortly. This means Shadow needs a "getting started" guide, in my copious free time :)

comment:14 in reply to:  8 Changed 7 years ago by robgjansen

This doesn't contain much information. I'm re-running this now with a more verbose log level.

I do not see the problem in my second run of #5336 taska. The only thing that changed is a slightly updated version of Shadow, but after looking back through my commits, I don't think any of them addressed this issue.

I'll run #5336 taska a few more times and see if I notice the problem again.

comment:15 in reply to:  14 Changed 7 years ago by robgjansen

Replying to robgjansen:

I'll run #5336 taska a few more times and see if I notice the problem again.

Still have not noticed it after 3 more experiments.

comment:16 Changed 7 years ago by nickm

Do you think it's worth trying to bisect what might have caused this, or trying to reproduce the conditions where it occurred, or should we just shrug our shoulders and move on?
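
For the record, the mechanical side of a bisect is cheap; the expensive part is re-running a Shadow experiment at each step. A sketch, using the endpoints discussed earlier in this ticket as guesses:

git bisect start master maint-0.2.2   # bad endpoint first, then a known-good one
# at each step: build tor, re-run the Shadow experiment, then report the result
git bisect bad       # downloads stalled with this build
git bisect good      # downloads completed normally
git bisect reset     # when finished (or to abandon the bisect)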

comment:17 in reply to:  16 Changed 7 years ago by robgjansen

Replying to nickm:

Do you think it's worth trying to bisect what might have caused this, or trying to reproduce the conditions where it occurred, or should we just shrug our shoulders and move on?

Shrug. But I'll keep an eye out for it as I re-run experiments for my thesis over the next few months.

I don't know enough about how I should be using Trac to know what status to assign this, so I'm leaving it as is. Probably something like 'defer'?

comment:18 Changed 7 years ago by karsten

This ticket might be related to #6271.

comment:19 Changed 7 years ago by robgjansen

In #6341, we debugged this problem and drastically improved bootstrapping in Shadow (thanks to this patch and this patch). Given that, and that #6271 fixed some Tor behavior, is it safe to close this ticket?

comment:20 Changed 7 years ago by arma

Resolution: fixed
Status: new → closed

Sounds good. We can open a new one if we find new surprises.

Thanks all!
