Opened 12 months ago

Closed 12 months ago

Last modified 11 months ago

#28061 closed defect (fixed)

Check prioritisation, it should make 2 measurements that are 24 hours apart

Reported by: juga Owned by:
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version:
Severity: Normal Keywords: sbws-1.0-nice-closed-moved-20181128
Cc: pastly, juga, teor Actual Points:
Parent ID: #28042 Points:
Reviewer: Sponsor:

Child Tickets

Attachments (3)

count (185.5 KB) - added by juga 12 months ago.
donotcount (3.3 KB) - added by juga 12 months ago.
log (794 bytes) - added by juga 12 months ago.


Change History (27)

comment:1 Changed 12 months ago by juga

Summary: Check prioritisation, it should make 2 measurments that are 24 appart → Check prioritisation, it should make 2 measurements that are 24 hours apart

Fix typo in ticket title

comment:2 Changed 12 months ago by pastly

There's no parameter that directly controls the frequency with which sbws will scan an arbitrary relay. If you want to increase (or decrease) how often sbws gets around to scanning relay Foo, then you can

  • increase (or decrease) the number of measurement threads (def: 3, I use 6 when I test sbws)
  • decrease (or increase) the number of downloads to make per measurement (def: 10, I leave alone)
  • decrease (or increase) the duration a download must last in order to be considered valid (def: documented here)

If you're against changing those parameters and still can't manage to accumulate the desired number of measurements for relay Foo within the validity period, then you can increase the validity period.
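
For a rough sense of how those knobs trade off, here's some back-of-envelope arithmetic (a sketch with assumed, illustrative numbers; this is not sbws code):

relays = 7000            # approximate network size (see comment:5)
threads = 3              # measurement threads (default)
num_downloads = 10       # downloads per measurement (default)
secs_per_download = 5    # assumed average download duration

secs_per_relay = num_downloads * secs_per_download
hours_full_pass = relays * secs_per_relay / threads / 3600.0
print("full-network pass: ~%.0f hours" % hours_full_pass)  # ~32 hours

Doubling the threads or halving the downloads per measurement each roughly halve that figure.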

Given the added code complexity, I'm against adding a notion to the prioritizer that it should aim to make a measurement of relay Foo every X hours, or that measurements should be performed differently when we have a lot of "catching up" to do.

comment:3 Changed 12 months ago by juga

Thanks for the list of options. I'm not against any.
I'm going to use trial and error to check whether some parameter combination increases the number of measurements per relay to 2.
There's no concrete relay Foo to be prioritized; they all need to get the 2 measurements.

comment:4 Changed 12 months ago by teor

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

Can you please tell us:

  • how long sbws has been running for?
  • the number of relays measured each day?
  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?
  • the exact measurement times for a relay that passes the test?
  • the exact measurement times for a relay with multiple measurements that fails the test?

comment:5 Changed 12 months ago by pastly

In case it helps.

I've been running sbws from home for just over 24 hours.

(venv) matt@spacecow:~/.sbws/datadir$ sbws --log-level info stats --error-types
2018-10-17 10:24:18,814 INFO MainThread sbws.py:73 - main - sbws 0.8.1-dev0 with python 3.5.3 on Linux-4.9.0-8-amd64-x86_64-with-debian-9.5, stem 1.7.0, and requests 2.19.1
7031 relays have recent results
Mean 0.80 successful measurements per relay
For Result: min=1 q1=1 med=2 q3=2 max=4
For ResultSuccess: min=0 q1=0 med=1 q3=1 max=3
For ResultErrorCircuit: min=0 q1=0 med=0 q3=1 max=4
For ResultErrorStream: min=0 q1=0 med=0 q3=0 max=3
12240 total results, and 45.7% are successes
5594 success results and 6646 error results
The fastest download was 1901.14 KiB/s
Results come from 2018-10-16 13:58:27 to 2018-10-17 14:24:15 over a period of 1 day, 0:25:48
1128/12240 (9.22%) results were error-stream
5518/12240 (45.08%) results were error-circ

Months ago when I ran sbws I would see a ~80% measurement success rate. The < 50% success rate is very concerning to me. That was probably with different parameters, and it was definitely on a different machine than my home desktop. So I will now begin to run it on a server instead.

comment:6 in reply to:  4 ; Changed 12 months ago by juga

Replying to teor:

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

It also seems weird to me, so I've been debugging prioritization today; so far I haven't found the reason.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt
1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

On 2018-10-16 around 14:00 I increased the number of threads, which does not seem to have increased the speed much, probably because of https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n358

Today, around 14:00, I commented out the RTT measurements. Judging by the number of results in today's file, it seems faster now...
I also noticed that today every prioritization loop (~300 relays) happens every ~45 min. The days before, it was happening every 6 h.
I still don't understand why eliminating that code makes such a difference.

With the data from today, I now get:

  • number of successfully measured relays with the restriction: 1319
  • number of measured relays: 5933
  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

  • the exact measurement times for a relay that passes the test?

Attachment

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

  • Today, number of relays with 1 result: 4610, with 2: 1308, with 3: 15

Changed 12 months ago by juga

Attachment: count added

Changed 12 months ago by juga

Attachment: donotcount added

comment:7 in reply to:  5 Changed 12 months ago by juga

Replying to pastly:

In case it helps.

I've been running sbws from home for just over 24 hours.
[...]
Months ago when I ran sbws I would see ~80% success measurement rate. The < 50% success rate is very concerning to me. That was probably with different parameters, and was definitely on a different machine than my home desktop. So I will now begin to run it on a server instead.

I'm running it on a server; I get ~80% success.

comment:8 in reply to:  6 ; Changed 12 months ago by teor

Replying to juga:

Replying to teor:

I am very confused by these results. It seems strange to me that sbws can't make 2 measurements of a relay over a few days.

It also seems weird to me, so I've been debugging prioritization today; so far I haven't found the reason.

Is there some documentation for relay priority?
If we prioritise failed relays, then most of our circuits will fail. In some cases, we will overload relays that are already failing.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt

If it takes 5 days for sbws to measure 7000 relays, then maybe sbws should use results from the last 10 days.

1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

...

  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

Ok, so there is a bug in this code.

The original task was:

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file
  2. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

Here are some results and the outputs they should produce:
0h produces None
0h, 24h produces 0h, 24h
0h, 1h, 24h produces 0h, 1h, 24h (but I think the current function produces 0h, 24h)
0h, 24h, 36h produces 0h, 24h, 36h (but I think the current function produces 0h, 24h)

Maybe these functions need some unit tests?
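
For example, the intended behaviour and the cases above could be encoded like this (a sketch; filter_results is a hypothetical stand-in, not the real sbws function):

from datetime import datetime, timedelta

DAY = timedelta(hours=24)

# Keep a relay's results only if there are at least 2 of them and
# they span at least 24 hours; otherwise exclude the relay entirely.
def filter_results(times, min_num=2, away=DAY):
    if len(times) < min_num:
        return None              # less than 2 measured bandwidths
    if max(times) - min(times) < away:
        return None              # all measurements within 24h of each other
    return times                 # keep every result, remove none

t0 = datetime(2018, 10, 1)
h = lambda n: t0 + timedelta(hours=n)
assert filter_results([h(0)]) is None
assert filter_results([h(0), h(24)]) == [h(0), h(24)]
assert filter_results([h(0), h(1), h(24)]) == [h(0), h(1), h(24)]
assert filter_results([h(0), h(24), h(36)]) == [h(0), h(24), h(36)]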

  • the exact measurement times for a relay that passes the test?

Attachment

I don't understand this attachment.
It looks like a list of all the times that pass the test, with no relay ids.

But I wanted to see a list of times for one relay that passes the test.

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

Let's just focus on a single relay.

comment:9 in reply to:  8 Changed 12 months ago by juga


Replying to teor:
[...]

Is there some documentation for relay priority?

Relays are sorted by the "freshness" of their results, i.e. data_period minus the age of each result: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n65
Relays with error results get that delta multiplied by 0.5 or 0.4, so they will be measured earlier: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n76
Relays with more than 1 result sum those deltas: https://gitweb.torproject.org/sbws.git/tree/sbws/lib/relayprioritizer.py#n77
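
In simplified form, the idea is roughly this (a paraphrase of the linked code, not the exact sbws implementation; the Result record here is a hypothetical stand-in):

import time
from collections import namedtuple

# Hypothetical, simplified result record; sbws's Result classes are richer.
Result = namedtuple("Result", "time is_error")

def relay_priority(results, data_period_secs=7 * 24 * 3600, now=None):
    now = now if now is not None else time.time()
    priority = 0
    for r in results:
        freshness = data_period_secs - (now - r.time)  # newer => larger
        if r.is_error:
            freshness *= 0.5    # 0.5 or 0.4 in the real code, per above
        priority += freshness   # relays with more than 1 result sum deltas
    return priority             # relays with the lowest sums are measured first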

If we prioritise failed relays, then most of our circuits will fail. In some cases, we will overload relays that are already failing.

Can you please tell us:

  • how long sbws has been running for?

Forever :) Results older than 7 days are not taken into account.
Those results were produced without the most recent day's changes, but still.

  • the number of relays measured each day:

This is the number of results each day, not relays:

1555 2018-10-10.txt
1485 2018-10-11.txt
1363 2018-10-12.txt
1574 2018-10-13.txt
1665 2018-10-14.txt

If it takes 5 days for sbws to measure 7000 relays, then maybe sbws should use results from the last 10 days.

That would work. Though after I commented out the RTT code, it seems sbws can measure almost all the network in 1 day:
6219 2018-10-17.txt

1707 2018-10-15.txt
1819 2018-10-16.txt
3332 2018-10-17.txt

...

  • the exact code that you are using to do the "2 measurements at least 24 hours apart" test?

https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n278

Ok, so there is a bug in this code.

The original task was:

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file

Is this not correct?

  1. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

Here are some results and the outputs they should produce:
0h produces None
0h, 24h produces 0h, 24h
0h, 1h, 24h produces 0h, 1h, 24h (but I think the current function produces 0h, 24h)
0h, 24h, 36h produces 0h, 24h, 36h (but I think the current function produces 0h, 24h)

OK, I understood this wrong, though fixing it does not change the number of relays.

Maybe these functions need some unit tests?

Yeah, I'm going to implement that.

  • the exact measurement times for a relay that passes the test?

Attachment

I don't understand this attachment.
It looks like a list of all the times that pass the test, with no relay ids.

But I wanted to see a list of times for one relay that passes the test.

Relay with 2 results that passes the test:

7044955A4D7B04CF70011D73B467B13CE5E69D41 ['2018-10-12T23:53:34', '2018-10-17T09:48:31']

  • the exact measurement times for a relay with multiple measurements that fails the test?

Attachment
All from today, hmm?

Let's just focus on a single relay.

Relay with 3 results that discards 1 result:

241C0134A6F62F40D378562B348531E51B946A07 Results NOT away from each other at least 86400s: 2018-10-17T15:51:23, 2018-10-17T15:51:04
Final results ts: ['2018-10-13T22:02:00', '2018-10-17T15:51:23']

Relay with 2 results that does not pass the test:

D1352AECB24F050C54CA13A8BC84F9613E4E61FD Results NOT away from each other at least 86400s: 2018-10-17T11:58:17, 2018-10-17T11:57:47

None of the relays with 1 result pass the test

comment:10 Changed 12 months ago by juga

And from the last results, there is something wrong, because after a relay has been measured, it should not be measured again immediately.

Changed 12 months ago by juga

Attachment: log added

comment:11 Changed 12 months ago by juga

Attached is the part of the log where the last relay gets measured twice.

comment:12 in reply to:  8 Changed 12 months ago by juga

Replying to teor:
[...]

If any of these things are true, do not put the relay in the bandwidth file:

  • there are less than 2 sbws measured bandwidths
  • all the sbws measured bandwidths are within 24 hours of each other

But the code implements:

  1. If there are less than 2 results, do not put the relay in the bandwidth file
  2. If any result is within 24 hours of the previous result, remove it. Then, if there is only one result, do not put the relay in the bandwidth file.

[...]

Should be fixed in #28103

comment:13 Changed 12 months ago by juga

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

comment:14 in reply to:  13 ; Changed 12 months ago by pastly

Replying to juga:

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

I don't think the measurement threads have different copies of, or ideas about, the RelayPrioritizer, because it is created in the main thread and never given to the worker threads. See where it is defined; it is never passed to the workers.

My untested theory is:

A thread will grab one of the last relays from the RelayPrioritizer. While it is measuring that relay, the RelayPrioritizer recalculates priority, finishes, and gives the same relay to another thread to measure as well.

This might be a good time to switch to using Python's PriorityQueue while figuring out the problem. I'll give it some thought this evening.
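
Roughly what I have in mind, as an untested sketch (not sbws code; measure is a placeholder for the real measurement function). Whether it's a plain Queue fed in priority order or a PriorityQueue, the important part is never re-queueing a relay a worker still holds:

import queue
import threading

work = queue.Queue()
in_flight = set()
lock = threading.Lock()

def refill(prioritized_relays):
    # Called from the main thread only, after recalculating priorities.
    with lock:
        for relay in prioritized_relays:
            if relay not in in_flight:    # skip relays a worker still holds
                in_flight.add(relay)
                work.put(relay)

def worker(measure):
    # Run in each measurement thread, e.g. threading.Thread(target=worker, ...).
    while True:
        relay = work.get()
        try:
            measure(relay)
        finally:
            with lock:
                in_flight.discard(relay)  # eligible again on the next refill
            work.task_done()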

comment:15 in reply to:  14 ; Changed 12 months ago by juga

Replying to pastly:

Replying to juga:

pastly, I suspect that the reason the same relay gets measured twice within minutes is that the threads are not sharing the same "vision" of the best_priority iterator. Would you mind having a look at it?

I don't think the measurement threads have different copies of, or ideas about, the RelayPrioritizer, because it is created in the main thread and never given to the worker threads. See where it is defined; it is never passed to the workers.

I agree, I saw that; by "vision" I didn't mean a copy.

My untested theory is:

A thread will grab one of the last relays from the RelayPrioritizer. While it is measuring that relay, the RelayPrioritizer recalculates priority, finishes, and gives the same relay to another thread to measure as well.

by "vision" i was thinking something like that.

This might be a good time to switch to using Python's PriorityQueue while figuring out the problem. I'll give it some thought this evening.

multiprocessing also has a Queue class: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue. However, I thought that if threads have to wait for 1 measurement to finish, that will slow down the process.
What about just getting a list of relays to measure (instead of an iterator), and not calling best_priority again until the list is almost finished?
If you're busy, I can try that.
Thanks for looking at it.

comment:16 in reply to:  15 Changed 12 months ago by pastly

Replying to juga:

What about just getting a list of relays to measure (instead of an iterator), and not calling best_priority again until the list is almost finished?
If you're busy, I can try that.
Thanks for looking at it.

I think we'll have the same problem unless, when we're measuring the last relay the prioritizer returned (either via an iterator or a list), we don't give the other measurement threads anything until it's done. When it is done, we recalculate priority and start handing out relays to the worker threads again.

I don't think having all the measurement threads stop and wait once in a while for the slowest of them to finish up so we can regenerate a new set of relays to measure is going to significantly impact how long it takes to scan the entire network.
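
As an untested sketch of that loop (best_priority and measure are placeholders for the real prioritizer and measurement code):

from concurrent.futures import ThreadPoolExecutor, wait

def scan_loop(best_priority, measure, threads=3):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            batch = list(best_priority())  # recalculate priorities once per batch
            futures = [pool.submit(measure, relay) for relay in batch]
            # No relay is handed out again until the whole batch, including
            # the slowest measurement, has finished.
            wait(futures)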

comment:18 Changed 12 months ago by juga

I had that branch running for 1 day and checked the dates of the results that do not pass the restrictions. They were hours apart, not minutes apart as happened before.

comment:19 Changed 12 months ago by juga

Status: new → needs_review

comment:20 Changed 12 months ago by pastly

Status: needs_review → needs_revision

I don't think this patch addresses the problem and I've left my review comment stating why. Let me know what you think.

comment:21 Changed 12 months ago by juga

Status: needs_revision → needs_review

oops

comment:22 Changed 12 months ago by juga

Compare the timestamps of the results being measured from yesterday:

2018-10-26 13:35:05,392 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 86400s: ['2018-10-25T11:54:17', '2018-10-26T07:45:14']
2018-10-26 13:35:05,395 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 86400s: ['2018-10-25T15:47:16', '2018-10-26T04:04:18']
...

To the ones from days before:
2018-10-26 14:59:16,270 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 60s: ['2018-10-25T12:58:22', '2018-10-25T12:58:40']
2018-10-26 14:59:18,271 DEBUG MainThread v3bwfile.py:349 - results_away_each_other - Results are NOT away from each other in at least 60s: ['2018-10-24T12:42:44', '2018-10-24T12:43:18']

Am I missing something here?

comment:23 Changed 12 months ago by juga

Resolution: fixed
Status: needs_review → closed

Solved as suggested. Merged.

comment:24 Changed 11 months ago by teor

Keywords: sbws-1.0-nice-closed-moved-20181128 added
Milestone: sbws 1.0 (MVP nice) → sbws: 1.0.x-final

Move all closed sbws 1.0 nice tickets to sbws 1.0.x-final
