Opened 3 years ago
Last modified 2 years ago
#6443 needs_information enhancement
Graph cdf of probability of selecting among the biggest k exits
Reported by: | arma | Owned by: | |
---|---|---|---|
Priority: | Medium | Milestone: | |
Component: | Analysis | Version: | |
Severity: | | Keywords: | |
Cc: | gsathya, robgjansen, phw, amj703, nikita@… | Actual Points: | |
Parent ID: | #6460 | Points: | |
Sponsor: | | | |
Description
Now that we have our scripts in #6232 extracting useful data, here's another graph that would be useful.
On the x axis is our set of 900 Exit relays, ordered by chance of being chosen. f(x) is the chance that the user's selected exit is in the first (biggest) x relays.
We'll likely find that we should zoom in on just the x \in [0..50] range or something, since otherwise the graph will just shoot up to 1.0 and stay there.
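A minimal sketch of how f(x) could be computed, assuming each exit's chance of being chosen is simply proportional to a single per-relay weight. This ignores the position-dependent weighting Tor applies (for example to exits that also have the Guard flag). The function name `exit_cdf` and the sample weights are hypothetical:

```python
from itertools import accumulate


def exit_cdf(weights):
    """Return [f(1), ..., f(n)]: chance that the chosen exit is among the x biggest.

    Assumption: selection probability is proportional to the given weight.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Biggest relays first, then a running sum normalised by the total.
    ordered = sorted(weights, reverse=True)
    return [running / total for running in accumulate(ordered)]


# Hypothetical weights for four exits, not real consensus data.
weights = [10, 40, 20, 30]
cdf = exit_cdf(weights)

# Zoom in on the first 50 relays only, as suggested for the x axis.
top_50 = cdf[:50]
```

Plotting `top_50` against x = 1..50 would give the zoomed-in curve; the full list reaches 1.0 at x = n.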
Auto-generating this graph for the current consensus, and sticking it near consensus-health, might be wise.
Once we've done the basic graph, we might find that graphing f(10) over time tells us something interesting about #6232 (for various values of 10).
Child Tickets
Attachments (11)
Change History (28)
comment:1 Changed 3 years ago by arma
comment:2 Changed 3 years ago by arma
Another interesting visualization would be "graph x such that f(x) = .5" over time. In this case higher x means a safer network.
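Since f only takes values at whole numbers of relays, an exact f(x) = .5 rarely exists. A sketch of one way to pick the data point, assuming we take the smallest x with f(x) >= .5 and that probabilities are proportional to a single per-relay weight (`relays_needed` is a hypothetical name):

```python
from itertools import accumulate


def relays_needed(weights, target=0.5):
    """Smallest x such that the x biggest relays cover at least `target` of the total.

    Assumption: selection probability is proportional to the given weight.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    ordered = sorted(weights, reverse=True)
    for x, running in enumerate(accumulate(ordered), start=1):
        if running >= target * total:
            return x
    return len(ordered)
```

Computing this once per consensus and plotting the result over time gives the suggested graph, where a higher value means the probability is spread over more relays.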
comment:3 Changed 3 years ago by arma
Here is an early graph from Karsten:
https://trac.torproject.org/projects/tor/attachment/ticket/6443/exit-probability-cdf-2012-07-23-2.png
5 relays were 30% of the network on July 20, 2012. Two of them gained the Guard flag since then, cutting in half their chance of being chosen in the exit position.
The top 10 relays were 45% of the network; top 20 relays were 60% of the network; and top 40 relays were 80% of the network.
Changed 3 years ago by karsten
Changed 3 years ago by karsten
comment:4 Changed 3 years ago by karsten
comment:5 Changed 3 years ago by arma
- Parent ID changed from #6232 to #6460
comment:6 Changed 3 years ago by arma
Out of these three, the cdf is way way easier to read.
As another attempt to graph progress over time, how about a cdf graph with four curves: a) today, b) a week ago, c) a month ago, and d) a year ago.
We should also ponder some sort of smoothing or averaging, since I don't want to know how things were on June 24 2012 at 19:00, I want to know how things were "in June 2012". I fear most such approaches will quickly turn into garbage science though.
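One possible averaging approach, sketched under stated assumptions: compute the top-k cdf for every consensus in a calendar month and take the pointwise mean at each rank. This averages by rank, so the relay at position x may differ between consensuses, and whether that is statistically sound is exactly the open question raised above. The names `top_k_cdf` and `monthly_mean_cdf` and the input layout (a dict from consensus time to a list of weights) are hypothetical:

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate


def top_k_cdf(weights, k):
    """cdf over the k biggest relays, padded with 1.0 if there are fewer than k."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    ordered = sorted(weights, reverse=True)[:k]
    cdf = [running / total for running in accumulate(ordered)]
    return cdf + [1.0] * (k - len(cdf))


def monthly_mean_cdf(snapshots, k=50):
    """Map (year, month) to the pointwise mean of that month's top-k cdfs."""
    by_month = defaultdict(list)
    for valid_after, weights in snapshots.items():
        by_month[(valid_after.year, valid_after.month)].append(top_k_cdf(weights, k))
    return {
        month: [sum(column) / len(column) for column in zip(*cdfs)]
        for month, cdfs in by_month.items()
    }


# Hypothetical snapshots, not real consensus data.
snapshots = {
    datetime(2012, 6, 24, 19): [3, 1],
    datetime(2012, 6, 25, 19): [1, 1],
}
june = monthly_mean_cdf(snapshots, k=2)[(2012, 6)]
```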
Changed 3 years ago by karsten
Changed 3 years ago by karsten
Changed 3 years ago by karsten
comment:7 follow-up: ↓ 10 Changed 3 years ago by karsten
- Cc karsten removed
- Owner set to karsten
- Status changed from new to assigned
R and I spent the afternoon together and painted three new bitmaps for U:
- This graph shows the CDF with five curves (your four plus one more for "3 months before").
- This graph and that one are the timeplots from last time, but with 1 data point per week instead of per day, and with fewer lines overall.
Next steps:
- Compute exit probabilities based on advertised bandwidths as suggested in the first comment. Make new graphs to compare probabilities based on consensus weights and advertised bandwidths.
- Wait until we have a final decision on which graphs we'd want to be auto-generated, if any. Then automate generating them and add them to the metrics website.
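The comparison in the first step could be sketched as follows, assuming each relay record carries both a consensus weight and an advertised bandwidth and that probabilities are proportional to whichever measure is used. The field names, the helper `cdf_by`, and the sample values are hypothetical:

```python
from itertools import accumulate


def cdf_by(relays, key):
    """cdf over relays ordered by `key`, biggest first.

    Assumption: selection probability is proportional to the value under `key`.
    """
    values = sorted((relay[key] for relay in relays), reverse=True)
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive value")
    return [running / total for running in accumulate(values)]


# Hypothetical relays, not real data.
relays = [
    {"nickname": "A", "consensus_weight": 60, "advertised_bandwidth": 40},
    {"nickname": "B", "consensus_weight": 30, "advertised_bandwidth": 40},
    {"nickname": "C", "consensus_weight": 10, "advertised_bandwidth": 20},
]

by_weight = cdf_by(relays, "consensus_weight")
by_bandwidth = cdf_by(relays, "advertised_bandwidth")

# Positive values mean consensus weights concentrate more probability
# in the x biggest relays than advertised bandwidths do.
gap = [w - b for w, b in zip(by_weight, by_bandwidth)]
```

Each curve is ordered by its own measure, so the relay at rank x is not necessarily the same in both.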
Changed 3 years ago by karsten
comment:8 Changed 3 years ago by karsten
Here's another graph that visualizes exit probabilities. The blank space is reserved for relays coming after the top-50. I think the plot is called "mosaic plot" or "tree map". We could group relays belonging to the same family, country, or AS together and assign a different colour to each group. We could also label rects with nicknames instead of probabilities, at least for the top-10 or top-20 relays. We could also add the remaining relays which didn't make it into the top-50. But this is just a quick prototype to discuss whether the graph type would be useful or not.
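The grouping idea could start from an aggregation like the sketch below, which sums exit probabilities per group before any drawing happens. It assumes probabilities proportional to a single weight; the record fields (`weight`, `country`) and the function name are hypothetical:

```python
from collections import defaultdict


def group_probabilities(relays, group_key):
    """Sum exit probability per group, biggest group first.

    Assumption: selection probability is proportional to relay["weight"].
    """
    total = sum(relay["weight"] for relay in relays)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    groups = defaultdict(float)
    for relay in relays:
        groups[relay[group_key]] += relay["weight"] / total
    return sorted(groups.items(), key=lambda item: item[1], reverse=True)


# Hypothetical relays, not real data.
relays_by_country = [
    {"nickname": "A", "weight": 50, "country": "de"},
    {"nickname": "B", "weight": 30, "country": "us"},
    {"nickname": "C", "weight": 20, "country": "de"},
]
grouped = group_probabilities(relays_by_country, "country")
```

The same call with a family or AS field as `group_key` would give the other groupings.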
Changed 3 years ago by karsten
Changed 3 years ago by karsten
Changed 3 years ago by karsten
comment:9 Changed 3 years ago by karsten
- Status changed from assigned to needs_information
comment:10 in reply to: ↑ 7 Changed 3 years ago by arma
Replying to karsten:
- This graph shows the CDF with five curves (your four plus one more for "3 months before").
This graph would be more readable if we sorted the curves in the legend (so the first curve listed is the highest curve in the graph). Perhaps sorting them by the value at f(20) is a good approximation?
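A sketch of that legend ordering, assuming each curve is stored as a non-empty list of cdf values indexed from x = 1. The function name and the sample curves are hypothetical:

```python
def order_for_legend(curves, x=20):
    """Return curve labels sorted by their value at f(x), highest first."""
    def value_at(label):
        cdf = curves[label]
        # Fall back to the last point if a curve has fewer than x values.
        return cdf[min(x, len(cdf)) - 1]

    return sorted(curves, key=value_at, reverse=True)


# Hypothetical three-point curves, sorted here at x=2 for brevity.
curves = {
    "today": [0.30, 0.50, 0.60],
    "a week ago": [0.35, 0.55, 0.65],
    "a year ago": [0.20, 0.40, 0.50],
}
legend_order = order_for_legend(curves, x=2)
```

Curves that cross can still end up in an order that does not match the graph everywhere, which is why sorting at a single x is only an approximation.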
Changed 3 years ago by karsten
comment:11 follow-up: ↓ 12 Changed 3 years ago by karsten
Reordering legend entries doesn't work so well for automatically generated graphs. It's technically possible, but it might be confusing for viewers. Here's a new graph that uses different shades of green, ordered by displayed date. Is that more readable?
comment:12 in reply to: ↑ 11 Changed 3 years ago by arma
Replying to karsten:
Reordering legend entries doesn't work so well for automatically generated graphs. It's technically possible, but it might be confusing for viewers. Here's a new graph that uses different shades of green, ordered by displayed date. Is that more readable?
Wow, that's subtle. It wasn't until today (when I'm prepping graphs for a meeting with the funder) that I realized that the shades of green corresponded to time. So yes, now that I've realized it, it is better -- but before I realized it, I just thought they were horrible color choices. I guess that means the answer is 'no, not more readable'. :/
I still think it would be really useful to have some version of this graph on the fast-exits metrics page. But I can't figure out which one. Anybody else have suggestions on how to visualize this data usefully?
comment:13 Changed 3 years ago by amj703
- Cc amj703 added
comment:14 Changed 3 years ago by karsten
- Owner karsten deleted
- Status changed from needs_information to assigned
I'm running out of ideas. If someone has suggestions for a visualization, I can try to implement that in R/ggplot2. Unassigning this ticket from me, because I'm currently not working on it.
comment:15 Changed 3 years ago by karsten
- Status changed from assigned to needs_information
And this ticket still needs information.
comment:16 Changed 3 years ago by mo
I do like cdf-a and cdf-b the most. If I had to decide on one (can't we have at least these two?), it would be cdf-b.
comment:17 Changed 2 years ago by nikita
- Cc nikita@… added
Graphing cdf of exit probabilities using consensus weights, and also cdf using descriptor bandwidths, could be a good way of visualizing the tradeoff we're making by concentrating traffic onto the faster relays.