Opened 12 months ago

Closed 12 months ago

Last modified 11 months ago

#28061 closed defect (fixed)

Check prioritisation, it should make 2 measurements that are 24 hours apart

Reported by: juga Owned by:
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version:
Severity: Normal Keywords: sbws-1.0-nice-closed-moved-20181128
Cc: pastly, juga, teor Actual Points:
Parent ID: #28042 Points:
Reviewer: Sponsor:

Child Tickets

Attachments (3)

count (185.5 KB) - added by juga 12 months ago.
donotcount (3.3 KB) - added by juga 12 months ago.
log (794 bytes) - added by juga 12 months ago.


Change History (27)

comment:1 Changed 12 months ago by juga

Summary: Check prioritisation, it should make 2 measurments that are 24 appart → Check prioritisation, it should make 2 measurements that are 24 hours apart

Fix typo in ticket title

comment:2 Changed 12 months ago by pastly

There's no parameter that directly controls the frequency with which sbws will scan an arbitrary relay. If you want to increase (or decrease) how often sbws gets around to scanning relay Foo, then you can

  • increase (or decrease) the number of measurement threads (def: 3, I use 6 when I test sbws)
  • decrease (or increase) the number of downloads to make per measurement (def: 10, I leave alone)
  • decrease (or increase) the duration a download must last in order to be considered valid (def: documented here)

If you're against changing those parameters and still can't manage to accumulate the desired number of measurements for relay Foo within the validity period, then you can increase the validity period.
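
For a rough sense of how those knobs trade off, here's some back-of-envelope arithmetic (a sketch with assumed, illustrative numbers; this is not sbws code):

relays = 7000            # approximate network size (see comment:5)
threads = 3              # measurement threads (default)
num_downloads = 10       # downloads per measurement (default)
secs_per_download = 5    # assumed average download duration

secs_per_relay = num_downloads * secs_per_download
hours_full_pass = relays * secs_per_relay / threads / 3600.0
print("full-network pass: ~%.0f hours" % hours_full_pass)  # ~32 hours

Doubling the threads or halving the downloads per measurement each roughly halve that figure.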

Given the added code complexity, I'm against adding a notion to the prioritizer that it should aim to make a measurement of relay Foo every X hours, or that measurements should be performed differently when we have a lot of "catching up" to do.

comment:3 Changed 12 months ago by juga

Thanks for the list of options. I'm not against any.
I'm going to use trial and error to check whether some parameter combination increases the number of measurements per relay to 2.
There's no concrete relay Foo to be prioritized; they all need to get the 2 measurements.

comment:4 Changed 12 months ago by teor

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

Can you please tell us:

  • how long sbws has been running for?
  • the number of relays measured each day?
  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?
  • the exact measurement times for a relay that passes the test?
  • the exact measurement times for a relay with multiple measurements that fails the test?

comment:5 Changed 12 months ago by pastly

In case it helps.

I've been running sbws from home for just over 24 hours.

(venv) matt@spacecow:~/.sbws/datadir$ sbws --log-level info stats --error-types
2018-10-17 10:24:18,814 INFO MainThread sbws.py:73 - main - sbws 0.8.1-dev0 with python 3.5.3 on Linux-4.9.0-8-amd64-x86_64-with-debian-9.5, stem 1.7.0, and requests 2.19.1
7031 relays have recent results
Mean 0.80 successful measurements per relay
For Result: min=1 q1=1 med=2 q3=2 max=4
For ResultSuccess: min=0 q1=0 med=1 q3=1 max=3
For ResultErrorCircuit: min=0 q1=0 med=0 q3=1 max=4
For ResultErrorStream: min=0 q1=0 med=0 q3=0 max=3
12240 total results, and 45.7% are successes
5594 success results and 6646 error results
The fastest download was 1901.14 KiB/s
Results come from 2018-10-16 13:58:27 to 2018-10-17 14:24:15 over a period of 1 day, 0:25:48
1128/12240 (9.22%) results were error-stream
5518/12240 (45.08%) results were error-circ

Months ago when I ran sbws I would see a ~80% measurement success rate. The < 50% success rate is very concerning to me. That was probably with different parameters, and it was definitely on a different machine than my home desktop. So I will now begin to run it on a server instead.

comment:6 in reply to:  4 ; Changed 12 months ago by juga

Replying to teor:

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

It also seems weird to me, so I've been debugging prioritization today; so far I haven't found the reason.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt
1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

On 2018-10-16 around 14:00 I increased the number of threads, which does not seem to have increased the speed much, probably because of https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n358

Today, around 14:00, I commented out the RTT measurements. Judging by the number of results in today's file, it seems faster now...
I also noticed that today every prioritization loop (~300 relays) happens every ~45 min. The days before, it was happening every 6 h.
I still don't understand why eliminating that code makes such a difference.

With the data from today, I now get:

  • number of successfully measured relays with the restriction: 1319
  • number of measured relays: 5933
  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

  • the exact measurement times for a relay that passes the test?

Attachment

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

  • Today, number of relays with 1 result: 4610, with 2: 1308, with 3: 15

Changed 12 months ago by juga

Attachment: count added

Changed 12 months ago by juga

Attachment: donotcount added

comment:7 in reply to:  5 Changed 12 months ago by juga

Replying to pastly:

In case it helps.

I've been running sbws from home for just over 24 hours.
[...]
Months ago when I ran sbws I would see ~80% success measurement rate. The < 50% success rate is very concerning to me. That was probably with different parameters, and was definitely on a different machine than my home desktop. So I will now begin to run it on a server instead.

I'm running it on a server; I get ~80% success.

comment:8 in reply to:  6 ; Changed 12 months ago by teor

Replying to juga:

Replying to teor:

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

It also seems weird to me, so I've been debugging prioritization today; so far I haven't found the reason.

Is there some documentation for relay priority?
If we prioritise failed relays, then most of our circuits will fail. In some cases, we will overload relays that are already failing.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt

If it takes 5 days for sbws to measure 7000 relays, then maybe sbws should use results from the last 10 days.

1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

...

  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

Ok, so there is a bug in this code.

The original task was:

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file
  2. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

Here are some results and the outputs they should produce:
0h produces None
0h, 24h produces 0h, 24h
0h, 1h, 24h produces 0h, 1h, 24h (but I think the current function produces 0h, 24h)
0h, 24h, 36h produces 0h, 24h, 36h (but I think the current function produces 0h, 24h)

Maybe these functions need some unit tests?
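
For example, the intended behaviour and the cases above could be encoded like this (a sketch; filter_results is a hypothetical stand-in, not the real sbws function):

from datetime import datetime, timedelta

DAY = timedelta(hours=24)

# Keep a relay's results only if there are at least 2 of them and
# they span at least 24 hours; otherwise exclude the relay entirely.
def filter_results(times, min_num=2, away=DAY):
    if len(times) < min_num:
        return None              # less than 2 measured bandwidths
    if max(times) - min(times) < away:
        return None              # all measurements within 24h of each other
    return times                 # keep every result, remove none

t0 = datetime(2018, 10, 1)
h = lambda n: t0 + timedelta(hours=n)
assert filter_results([h(0)]) is None
assert filter_results([h(0), h(24)]) == [h(0), h(24)]
assert filter_results([h(0), h(1), h(24)]) == [h(0), h(1), h(24)]
assert filter_results([h(0), h(24), h(36)]) == [h(0), h(24), h(36)]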

  • the exact measurement times for a relay that passes the test?

Attachment

I don't understand this attachment.
It looks like a list of all the times that pass the test, with no relay ids.

But I wanted to see a list of times for one relay that passes the test.

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

Let's just focus on a single relay.

comment:9 in reply to:  8 Changed 12 months ago by juga


Replying to teor:
[...]

Is there some documentation for relay priority?

Relays are sorted by the "freshness" of their results, i.e. data_period minus the age of each result: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n65
Relays with error results get that delta multiplied by 0.5 or 0.4, so they will be measured earlier: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n76
Relays with more than 1 result sum those deltas: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n77
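
In simplified form, the idea is roughly this (a paraphrase of the linked code, not the exact sbws implementation; the Result record here is a hypothetical stand-in):

import time
from collections import namedtuple

# Hypothetical, simplified result record; sbws's Result classes are richer.
Result = namedtuple("Result", "time is_error")

def relay_priority(results, data_period_secs=7 * 24 * 3600, now=None):
    now = now if now is not None else time.time()
    priority = 0
    for r in results:
        freshness = data_period_secs - (now - r.time)  # newer => larger
        if r.is_error:
            freshness *= 0.5    # 0.5 or 0.4 in the real code, per above
        priority += freshness   # relays with more than 1 result sum deltas
    return priority             # relays with the lowest sums are measured first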

If we prioritise failed relays, then most of our circuits will fail. In some cases, we will overload relays that are already failing.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt

If it takes 5 days for sbws to measure 7000 relays, then maybe sbws should use results from the last 10 days.

That would work. Though after I commented out the RTT code, it seems sbws can measure almost all the network in 1 day:
6219 2018-10-17.txt

1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

...

  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

Ok, so there is a bug in this code.

The original task was:

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file

Is this not correct?

  1. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

Here are some results and the outputs they should produce:
0h produces None
0h, 24h produces 0h, 24h
0h, 1h, 24h produces 0h, 1h, 24h (but I think the current function produces 0h, 24h)
0h, 24h, 36h produces 0h, 24h, 36h (but I think the current function produces 0h, 24h)

OK, I understood this wrong, though fixing it does not change the number of relays.

Maybe these functions need some unit tests?

Yeah, I'm going to implement that.

  • the exact measurement times for a relay that passes the test?

Attachment

I don't understand this attachment.
It looks like a list of all the times that pass the test, with no relay ids.

But I wanted to see a list of times for one relay that passes the test.

Relay with 2 results that passes the test:

7044955A4D7B04CF70011D73B467B13CE5E69D41 ['2018-10-12T23:53:34', '2018-10-17T09:48:31']

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

Let's just focus on a single relay.

Relay with 3 results that discards 1 result:

241C0134A6F62F40D378562B348531E51B946A07 Results NOT away from each other at least 86400s: 2018-10-17T15:51:23, 2018-10-17T15:51:04
Final results ts: ['2018-10-13T22:02:00', '2018-10-17T15:51:23']

Relay with 2 results that does not pass the test:

D1352AECB24F050C54CA13A8BC84F9613E4E61FD Results NOT away from each other at least 86400s: 2018-10-17T11:58:17, 2018-10-17T11:57:47

None of the relays with 1 result pass the test

comment:10 Changed 12 months ago by juga

And from the last results, there is something wrong, because after a relay has been measured, it should not be measured again immediately.

Changed 12 months ago by juga

Attachment: log added

comment:11 Changed 12 months ago by juga

Attached is the part of the log where the last relay gets measured twice.

comment:12 in reply to:  8 Changed 12 months ago by juga

Replying to teor:
[...]

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file
  2. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

[...]

Should be fixed in #28103

comment:13 Changed 12 months ago by juga

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

comment:14 in reply to:  13 ; Changed 12 months ago by pastly

Replying to juga:

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

I don't think the measurement threads have different copies of, or ideas about, the RelayPrioritizer, because it is created in the main thread and never given to the worker threads. See where it is defined; it is never passed to the workers.

My untested theory is:

A thread will grab one of the last relays from the RelayPrioritizer. While it is measuring that relay, the RelayPrioritizer recalculates priority, finishes, and gives the same relay to another thread to measure as well.

This might be a good time to switch to using Python's PriorityQueue while figuring out the problem. I'll give it some thought this evening.
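
Roughly what I have in mind, as an untested sketch (not sbws code; measure is a placeholder for the real measurement function). Whether it's a plain Queue fed in priority order or a PriorityQueue, the important part is never re-queueing a relay a worker still holds:

import queue
import threading

work = queue.Queue()
in_flight = set()
lock = threading.Lock()

def refill(prioritized_relays):
    # Called from the main thread only, after recalculating priorities.
    with lock:
        for relay in prioritized_relays:
            if relay not in in_flight:    # skip relays a worker still holds
                in_flight.add(relay)
                work.put(relay)

def worker(measure):
    # Run in each measurement thread, e.g. threading.Thread(target=worker, ...).
    while True:
        relay = work.get()
        try:
            measure(relay)
        finally:
            with lock:
                in_flight.discard(relay)  # eligible again on the next refill
            work.task_done()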

comment:15 in reply to:  14 ; Changed 12 months ago by juga

Replying to pastly:

Replying to juga:

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

I don't think the measurement threads have different copies of, or ideas about, the RelayPrioritizer, because it is created in the main thread and never given to the worker threads. See where it is defined; it is never passed to the workers.

I agree, I saw that; by "vision" I didn't mean a copy.

My untested theory is:

A thread will grab one of the last relays from the RelayPrioritizer. While it is measuring that relay, the RelayPrioritizer recalculates priority, finishes, and gives the same relay to another thread to measure as well.

by "vision" i was thinking something like that.

This might be a good time to switch to using Python's PriorityQueue while figuring out the problem. I'll give it some thought this evening.

multiprocessing also has a Queue class: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue. However, I thought that if threads have to wait for 1 measurement to finish, that will slow down the process.
What about just getting a list of relays to measure (instead of an iterator), and not calling best_priority again until the list is almost finished?
If you're busy, I can try that.
Thanks for looking at it.

comment:16 in reply to:  15 Changed 12 months ago by pastly

Replying to juga:

What about just getting a list of relays to measure (instead of an iterator), and not calling best_priority again until the list is almost finished?
If you're busy, I can try that.
Thanks for looking at it.

I think we'll have the same problem unless, when we're measuring the last relay the prioritizer returned (either via an iterator or a list), we don't give the other measurement threads anything until it's done. When it is done, we recalculate priority and start handing out relays to the worker threads again.

I don't think having all the measurement threads stop and wait once in a while for the slowest of them to finish up so we can regenerate a new set of relays to measure is going to significantly impact how long it takes to scan the entire network.
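
As an untested sketch of that loop (best_priority and measure are placeholders for the real prioritizer and measurement code):

from concurrent.futures import ThreadPoolExecutor, wait

def scan_loop(best_priority, measure, threads=3):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            batch = list(best_priority())  # recalculate priorities once per batch
            futures = [pool.submit(measure, relay) for relay in batch]
            # No relay is handed out again until the whole batch, including
            # the slowest measurement, has finished.
            wait(futures)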

comment:18 Changed 12 months ago by juga

I had that branch running for 1 day and checked the dates of the results that do not pass the restrictions. They were hours apart, not minutes apart as happened before.

comment:19 Changed 12 months ago by juga

Status: new → needs_review

comment:20 Changed 12 months ago by pastly

Status: needs_review → needs_revision

I don't think this patch addresses the problem and I've left my review comment stating why. Let me know what you think.

comment:21 Changed 12 months ago by juga

Status: needs_revision → needs_review

oops

comment:22 Changed 12 months ago by juga

Compare the timestamps of the results being measured from yesterday:

2018-10-26 13:35:05,392 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 86400s: ['2018-10-25T11:54:17', '2018-10-26T07:45:14']
2018-10-26 13:35:05,395 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 86400s: ['2018-10-25T15:47:16', '2018-10-26T04:04:18']
...

To the ones from days before:
2018-10-26 14:59:16,270 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 60s: ['2018-10-25T12:58:22', '2018-10-25T12:58:40']
2018-10-26 14:59:18,271 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 60s: ['2018-10-24T12:42:44', '2018-10-24T12:43:18']

Am I missing something here?

comment:23 Changed 12 months ago by juga

Resolution: fixed
Status: needs_review → closed

Solved as suggested. Merged.

comment:24 Changed 11 months ago by teor

Keywords: sbws-1.0-nice-closed-moved-20181128 added
Milestone: sbws 1.0 (MVP nice) → sbws: 1.0.x-final

Move all closed sbws 1.0 nice tickets to sbws 1.0.x-final
