Opened 17 months ago

Closed 7 months ago

#30905 closed defect (fixed)

Maybe monitoring values in the state file should be reset when sbws is restarted

Reported by: juga Owned by: juga
Priority: Medium Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version: sbws: 1.1.0
Severity: Normal Keywords:
Cc: juga, gk Actual Points: 9
Parent ID: #30719 Points: 1
Reviewer: ahf, gk Sponsor:

Description

I think the following keys should probably be restarted, to make less confusing what appears in the bandwidth file. For instance, after restarting longclaw's sbws, this is what the state file contains, consensus count and priority lists are reset, but not the others:

    "recent_measurement_attempt_count": 671,
    "recent_consensus_count": 1,
    "recent_priority_relay_count": 984,
    "recent_priority_list_count": 3,
    "min_perc_reached": "2019-05-06T06:20:49",

Child Tickets

Change History (13)

comment:1 Changed 17 months ago by teor

Milestone: sbws: unspecifiedsbws: 1.1.x-final

Yes, this is a bug we should fix in 1.1.0, but it is not a blocker.

We should only do it if we need accurate statistics to fix one of the blockers.

comment:2 Changed 17 months ago by teor

Maybe these keys should only include measurements from the last 5 days?

comment:3 Changed 8 months ago by juga

Parent ID: #30719

Make it child of #30719, because it will make easier to understand what's going on and it seems important to deploy sbws in all bwauths.

comment:4 Changed 8 months ago by juga

Owner: set to juga
Status: newassigned

comment:5 Changed 7 months ago by juga

Reviewer: ahf, gk
Status: assignedneeds_review

I've created #33691 to make reviews or migrate to gitlab.tpo, in the meanwhile i've pushed to GH: https://github.com/torproject/sbws/pull/372.

See last commit documentation for some explanation about all changes.

I think all these KeyValues were incorrect:

In the header:

  • recent_consensus_count
  • recent_measurement_attempt_count
  • recent_priority_list_count
  • recent_priority_relay_count

In the bandwidth lines:

  • relay_in_recent_consensus_count
  • relay_recent_measurement_attempt_count
  • relay_recent_priority_list_count

I have not created different tickets for each of them since some of the changes touch the same parts of the code, there would be merge conflicts.

comment:6 Changed 7 months ago by gk

Cc: gk added

comment:8 Changed 7 months ago by juga

After running these patches in a server for more than 5 days, these are the KeyValues (modified by this patch) in the bandwidth file:

recent_consensus_count=120
recent_measurement_attempt_count=30688
recent_measurement_failure_count=1216
recent_measurements_excluded_error_count=885
recent_measurements_excluded_few_count=705
recent_measurements_excluded_near_count=112
recent_measurements_excluded_old_count=0
recent_priority_list_count=80
recent_priority_relay_count=30688
bw=140000 bw_mean=702333 bw_median=731577 consensus_bandwidth=130000000 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=79380480 desc_bw_obs_mean=79442187 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=9nj0+BXKfPg90QYZ21M55VTXYIy6QOvjk4vmDa63hhk nick=LittleFoxRahja node_id=$708A968F3644F8A547156368FEA3DB664110E631 relay_in_recent_consensus_count=105 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=4 time=2020-03-23T02:52:25
bw=775934 bw_mean=758382 bw_median=775934 consensus_bandwidth=0 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=0 desc_bw_obs_mean=1 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=GzKLDvvdnSshuAQA1o8aXPl07e/OMDrIRuh3icZ19EM nick=harleyz98 node_id=$598EA0E595C541EA91B4434BC643F01110862AC5 relay_in_recent_consensus_count=33 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=3 time=2020-03-22T18:59:14

The header seems correct except for the failures, that is counting things that are not really failures:

  • when loading the results, the relays that changed ip are excluded. I could add a patch to count all the results first, which reduces the number of failures in around 1000.
  • when sbws is restarted, all the queued relays to measure that didn't start to be measured yet, are not saved in the results, but the attempt has already been count. We could also add a patch to count the attempt in measure_relay, instead of main_loop, though then we won't know the number of attempts counting from the moment they were queued.
  • real failures would happen when result_putter_error is triggered, or some bug makes sbws stall, which then will print the TICKET_MSG error. In the 1st case, we could actually count an error.

I don't think these fixes are critical, the main problem right now is that the description of the failures KeyValue in the spec doesn't really reflect what is happening.

Regarding the relay lines, i'm not sure relay_recent_measurement_attempt_count and relay_recent_priority_list_count are correct, i need to check that.
I see an unrelated bug in them that i'll comment in #33350.

comment:9 in reply to:  8 Changed 7 months ago by gk

Replying to juga:

After running these patches in a server for more than 5 days, these are the KeyValues (modified by this patch) in the bandwidth file:

recent_consensus_count=120
recent_measurement_attempt_count=30688
recent_measurement_failure_count=1216
recent_measurements_excluded_error_count=885
recent_measurements_excluded_few_count=705
recent_measurements_excluded_near_count=112
recent_measurements_excluded_old_count=0
recent_priority_list_count=80
recent_priority_relay_count=30688
bw=140000 bw_mean=702333 bw_median=731577 consensus_bandwidth=130000000 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=79380480 desc_bw_obs_mean=79442187 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=9nj0+BXKfPg90QYZ21M55VTXYIy6QOvjk4vmDa63hhk nick=LittleFoxRahja node_id=$708A968F3644F8A547156368FEA3DB664110E631 relay_in_recent_consensus_count=105 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=4 time=2020-03-23T02:52:25
bw=775934 bw_mean=758382 bw_median=775934 consensus_bandwidth=0 consensus_bandwidth_is_unmeasured=False desc_bw_avg=1073741824 desc_bw_bur=1073741824 desc_bw_obs_last=0 desc_bw_obs_mean=1 error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0 master_key_ed25519=GzKLDvvdnSshuAQA1o8aXPl07e/OMDrIRuh3icZ19EM nick=harleyz98 node_id=$598EA0E595C541EA91B4434BC643F01110862AC5 relay_in_recent_consensus_count=33 relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1 success=3 time=2020-03-22T18:59:14

The header seems correct except for the failures, that is counting things that are not really failures:

  • when loading the results, the relays that changed ip are excluded. I could add a patch to count all the results first, which reduces the number of failures in around 1000.
  • when sbws is restarted, all the queued relays to measure that didn't start to be measured yet, are not saved in the results, but the attempt has already been count. We could also add a patch to count the attempt in measure_relay, instead of main_loop, though then we won't know the number of attempts counting from the moment they were queued.
  • real failures would happen when result_putter_error is triggered, or some bug makes sbws stall, which then will print the TICKET_MSG error. In the 1st case, we could actually count an error.

I don't think these fixes are critical, the main problem right now is that the description of the failures KeyValue in the spec doesn't really reflect what is happening.

Yes, I think, too, this is not critical. Maybe open a new ticket starting with mentioning the findings above.

Regarding the relay lines, i'm not sure relay_recent_measurement_attempt_count and relay_recent_priority_list_count are correct, i need to check that.

Please do, this looks a bit suspicious here to me (too).

I see an unrelated bug in them that i'll comment in #33350.

comment:10 Changed 7 months ago by juga

Revised, pushed fixups for most of the comments

comment:11 Changed 7 months ago by gk

Status: needs_reviewmerge_ready

maint-1.1_bug30905_KeyValues_count_squashed1_reworded looks good to me.

comment:12 Changed 7 months ago by juga

Merged

comment:13 Changed 7 months ago by juga

Actual Points: 9
Resolution: fixed
Status: merge_readyclosed
Note: See TracTickets for help on using tickets.