I think the following keys should probably be restarted, to make less confusing what appears in the bandwidth file. For instance, after restarting longclaw's sbws, this is what the state file contains, consensus count and priority lists are reset, but not the others:
The header seems correct except for the failures, that is counting things that are not really failures:
when loading the results, the relays that changed ip are excluded. I could add a patch to count all the results first, which reduces the number of failures in around 1000.
when sbws is restarted, all the queued relays to measure that didn't start to be measured yet, are not saved in the results, but the attempt has already been count.
We could also add a patch to count the attempt in measure_relay, instead of main_loop, though then we won't know the number of attempts counting from the moment they were queued.
real failures would happen when result_putter_error is triggered, or some bug makes sbws stall, which then will print the TICKET_MSG error.
In the 1st case, we could actually count an error.
I don't think these fixes are critical, the main problem right now is that the description of the failures KeyValue in the spec doesn't really reflect what is happening.
Regarding the relay lines, i'm not sure relay_recent_measurement_attempt_count and relay_recent_priority_list_count are correct, i need to check that.
I see an unrelated bug in them that i'll comment in #33350 (moved).
After running these patches in a server for more than 5 days, these are the KeyValues (modified by this patch) in the bandwidth file:
{{{
recent_consensus_count=120
recent_measurement_attempt_count=30688
recent_measurement_failure_count=1216
recent_measurements_excluded_error_count=885
recent_measurements_excluded_few_count=705
recent_measurements_excluded_near_count=112
recent_measurements_excluded_old_count=0
recent_priority_list_count=80
recent_priority_relay_count=30688
}}}
{{{
bw=140000 bw_mean=702333 bw_median=731577 consensus_bandwidth=130000000 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=79380480 desc_bw_obs_mean=79442187 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=9nj0+BXKfPg90QYZ21M55VTXYIy6QOvjk4vmDa63hhk nick=LittleFoxRahja node_id=$708A968F3644F8A547156368FEA3DB664110E631 relay_in_recent_consensus_count=105 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=4 time=2020-03-23T02:52:25
bw=775934 bw_mean=758382 bw_median=775934 consensus_bandwidth=0 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=0 desc_bw_obs_mean=1 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=GzKLDvvdnSshuAQA1o8aXPl07e/OMDrIRuh3icZ19EM nick=harleyz98 node_id=$598EA0E595C541EA91B4434BC643F01110862AC5 relay_in_recent_consensus_count=33 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=3 time=2020-03-22T18:59:14
}}}
The header seems correct except for the failures, that is counting things that are not really failures:
when loading the results, the relays that changed ip are excluded. I could add a patch to count all the results first, which reduces the number of failures in around 1000.
when sbws is restarted, all the queued relays to measure that didn't start to be measured yet, are not saved in the results, but the attempt has already been count.
We could also add a patch to count the attempt in measure_relay, instead of main_loop, though then we won't know the number of attempts counting from the moment they were queued.
real failures would happen when result_putter_error is triggered, or some bug makes sbws stall, which then will print the TICKET_MSG error.
In the 1st case, we could actually count an error.
I don't think these fixes are critical, the main problem right now is that the description of the failures KeyValue in the spec doesn't really reflect what is happening.
Yes, I think, too, this is not critical. Maybe open a new ticket starting with mentioning the findings above.
Regarding the relay lines, i'm not sure relay_recent_measurement_attempt_count and relay_recent_priority_list_count are correct, i need to check that.
Please do, this looks a bit suspicious here to me (too).
I see an unrelated bug in them that i'll comment in #33350 (moved).