Opened 4 years ago

Closed 3 years ago

#18967 closed enhancement (wontfix)

Add UUID to families in Onionoo

Reported by: seansaito Owned by:
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords: persistence,
Cc: virgil Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

This is an enhancement of the implementation proposed in #16599

Some services that depend onOnionoorequire persistence in family data. For instance, some of the proposed features of Roster such as replacing Tor Weather require knowing when a certain relay goes down. Analogous to Tor spec proposal #242,Onionoo should implement such a scheme.

Put simply, the implementation would be as follows:

  • Each family will have some UUID, which would be tagged on all member relays (like an extra fingerprint).
  • A new relay will be tagged by its family's UUID by looking up the ID of older relatives.
  • There are two schemes for storing key-value based data. One is for looking up families via UUID, the other for looking up a UUID via relay fingerprint.

Unlike #16599, this implementation does not require any a priori information about the family. The UUIDs are guaranteed to be unique. Currently, Roster has a half-baked implementation of the above. Despite the simplicity of the implementation, the benefits are potentially great, as querying for and storing persistent data of families would become possible.

[1] https://trac.torproject.org/projects/tor/ticket/16599 

[2] https://gitweb.torproject.org/torspec.git/tree/proposals/242-better-families.txt 

On uniqueness of UUIDs

[3] https://stackoverflow.com/questions/703035/when-are-you-truly-forced-to-use-uuid-as-part-of-the-design/786541#786541 

Child Tickets

Change History (12)

comment:1 Changed 4 years ago by karsten

How would the Onionoo server generate such a UUID? The part that I'm worried about is that two distinct Onionoo servers should ideally produce the exact same data, and that would require only using data like relay fingerprints as input for generating UUIDs rather than data the Onionoo server generates (randomly, using its system time, etc.). The idea behind different Onionoo servers producing the same data is that Onionoo clients would be able to round-robin between multiple Onionoo servers. That would fail badly if different Onionoo servers produced different family UUIDs. There are currently unrelated problems with data from different Onionoo servers not being exactly the same (encoding problems and the like), but those are bugs we can fix, whereas a new server-specific UUID would make this load balancing impossible. (Note that this problem is different from UUID not being unique which I'm not worried about here. It's rather the opposite, with UUIDs by different servers being unique when they shouldn't be.)

Assuming we can work around the problem above, how would you treat the following scenarios? 1) Families AB and CD merge to ABCD; which family UUID do they use? 2) Family ABCD splits up into AB and CD; do they both keep their previous UUID? And there might be even more complicated cases than those. How would you handle these cases?

I'm not saying that we cannot solve these issues, but I don't see an easy solution right now. Maybe you have some ideas?

comment:2 Changed 4 years ago by seansaito

Hi Karsten,

Apologies for the late reply. Is there a centralized data store where we can keep the UUIDs for each fingerprint? That way only one of the servers has to generate the UUID once.

As for 1), do you think there is a problem with the larger family subsuming the smaller family? So the smaller family's UUID would be overwritten. For 2), based on my current scheme, there would need to be an additional component conducting sanity checks of the UUIDs (i.e. making sure no two families have the same UUID). 

comment:3 in reply to:  2 Changed 4 years ago by karsten

Replying to seansaito:

Hi Karsten,

Apologies for the late reply. Is there a centralized data store where we can keep the UUIDs for each fingerprint? That way only one of the servers has to generate the UUID once.

I'd really want to avoid relying on a centralized data store, because that's a single point of failure. Let's try to find a model where that is not required.

As for 1), do you think there is a problem with the larger family subsuming the smaller family? So the smaller family's UUID would be overwritten. For 2), based on my current scheme, there would need to be an additional component conducting sanity checks of the UUIDs (i.e. making sure no two families have the same UUID). 

This is not an implementation issue that can be fixed by performing sanity checks, but a conceptual one. Consider the following series of family UUID assignments by two separate Onionoo instances, where the second misses an update in the middle and receives it at a later time:

 AB=1, CDE=2 -> ABCDE=2 -> ABC=2
 AB=1, CDE=2 ->         -> ABC=1 -> ABCDE=1

Both instances did the right thing according to your algorithm, yet they come up with a different identifier for this family in the end. This is bad.

Here's an alternative algorithm. We define the largest fingerprint ever seen in an extended family as family identifier. If we assume that A < B < C < D < E, then we'd have the following identifiers in the situation above:

 AB=B, CDE=E -> ABCDE=E -> ABC=E
 AB=B, CDE=E ->         -> ABC=C -> ABCDE=E

This model doesn't have the issue where the order of processing descriptors leads to different identifiers, but it has another minor (?) issue: there's no way to split a family. Once a relay said it's part of a family and one member of the family says the same, they get the same family identifier. No divorce, no annulment possible. The smaller families will always be shown together when a user looks up their family identifier.

Note that the implementation would ideally only rely on server descriptors, regardless when those were referenced by a consensus. When a relay A publishes a descriptor saying that it considers B to be part of its family, we store that forever. And when B publishes a descriptor saying that A is part of its family, then they're considered part of the same family, here B.

How does that sound? Would you want to implement this algorithm and feed a month of recent descriptors into it to see what identifiers you'd come up with? This is something we should test a bit, ideally in a stand-alone Python script or small Java program, before building it into Onionoo. Thanks!

comment:4 Changed 4 years ago by virgil

If an Onionoo is allowed to skip some consensuses, doesn't that mean the family UUIDs must be a function of *solely* the most recent consensus?

Not to say that's impossible, but that's a huge constraint.

Even if you require a complete record going back k steps, you're going to lose any persistence going back >k steps. Instead of accepting that the Onionoo server will have gaps in its consensus history, wouldn't it be better simply to put them on a blockchain of some sort?

Last edited 4 years ago by virgil (previous) (diff)

comment:5 in reply to:  4 Changed 4 years ago by karsten

Replying to virgil:

If an Onionoo is allowed to skip some consensuses, doesn't that mean the family UUIDs must be a function of *solely* the most recent consensus?

Not to say that's impossible, but that's a huge constraint.

Not skip but reorder, but you're right, it's a constraint, maybe a too big one.

Even if you require a complete record going back k steps, you're going to lose any persistence going back >k steps. Instead of accepting that the Onionoo server will have gaps in its consensus history, wouldn't it be better simply to put them on a blockchain of some sort?

Let's not over-engineer this, but I could imagine us designing something based on your suggestion before editing your comment:

Another solution would be that the UUID is a function of the last $k$ consensuses. And for an Onionoo server to update the family, it must have the a k-length sequence of consecutive consensuses.

Consensuses are the wrong descriptor type, because we're only interested in mutual agreements that two relays A and B are in the same family, regardless of what the directory authorities think. And consensuses don't contain family information, so we'd have to follow references to server descriptors, which makes this harder to implement. We could simply look at server descriptors published in the past k hours or days. And it's probably okay if an Onionoo server misses a single server descriptor, because that kind of inconsistency will heal by itself once that server descriptor is older than k hours.

Would you two want to suggest a UUID algorithm based only on server descriptors published in the past k hours?

comment:6 Changed 4 years ago by virgil

Would you two want to suggest a UUID algorithm based only on server descriptors published in the past k hours?

I looked into candidate designs for this. In short, the k hours model entails there will be no persistence beyond k hours. And even for values like k=48, I presume want UUIDs to last longer than 2 days. It just becomes a 48-hour delay before "use the largest fingerprint".

We could in fact do this, but whenever a the highest fingerprint of a family changes, it's going to have its points reset.

I thought about this a bit more. But if we aren't going back to an established beginning, everytime a new Onionoo server joins, it will have a different perception of what the families are. I see no way around this without a central-server or blockchain-like infrastructure.

It may be that we simply have to go with the highest fingerprint. This is unfortunate.

comment:7 Changed 4 years ago by teor

What about the fingerprint with the oldest claimed time of appearance (first descriptor - uptime)?
That would likely be more stable than the highest fingerprint.
It wouldn't be perfect if new onionoo servers joined, but it would be close.

comment:8 Changed 4 years ago by virgil

teor, this sounds like a good idea. But I need a little handholding---can you be more specific?

It sounds like you're saying this a single snap-shot (never looking backward). And then, for each family, we sort all of the relays by claimed first-appearance, and then, the uuid of the family the fingerprint of the relay with the longest claimed-uptime?

comment:9 Changed 4 years ago by teor

Unless all onionoo instances use exactly the same descriptor archive, you will never get matching family identifiers. But you can get close:

Find the relay in each family that appeared on the network first, by:

  • finding the oldest descriptor you have for each relay, and
  • using something in that descriptor to deduce the age of the relay when that descriptor was posted
    • uptime is a clue, but not ideal;
    • relays with older tor releases are likely older;

or maybe certificates or something else in the descriptor can tell you how old the relay is.

comment:10 Changed 4 years ago by virgil

I hoping to find a solution here, but I'm still struggling.

Under the suggestion to use the fingerprint of the first-seen relay, that's going to be different when a new Onionoo instance starts up. For the newly started Onionoo instance, *all of the relays* are going to be "first seen".

I see no clear way around this. Right now the best I got (and I really don't like it) is simply using the largest fingerprint in the family. Maybe just implement Prop 242 and be done with it? Tomorrow I'm meeting with SeanSaito. Maybe we'll think of something.

comment:11 Changed 4 years ago by seansaito

Hi Karsten,

I'm wondering whether it's possible to create a Family Document scheme for Onionoo. This Document will aggregate every relay within an extended family, including stats such as bandwidth, consensus weight, country, etc. It will also be able to fetch Family Documents from the past (where the largest / oldest fingerprint becomes the identifier). This way we don't need family IDs at all, though we still need to decide how to deal with the issues you mentioned above.

A sample implementation of aggregating relays can be found here (it's a script I use for Roster):
https://github.com/seansaito/Roster/blob/master/app/models/family_aggregator.py

To summarize the script, each relay is grouped into a family dictionary object via the extended_family field. The family dictionary object aggregates most of the relay's attributes, including bandwidth, consensus_weight, countries. For fields such as last_seen and maximum_uptime, the family object takes the earliest / longest record. Each family is initialized as a dictionary with the following keys:

family = {"observed_bandwidth": 0,
           "exit_bandwidth": 0,
           "consensus_weight_fraction": 0,
           "consensus_weight": 0,
           "families": [], # This stores each relay document within that document. 
           "contact": [],
           "middle_probability": 0,
           "exit_probability": 0, "bitcoin_addr": "None",
           "bandwidth_points": 0, "consensus_points": 0,
           "last_seen": str(datetime.datetime.now()), 
           "maximum_uptime": str(datetime.datetime.now()),
           "countries": [], "exit": 0, "guard": 0,
           "t_shirts": [],
           "eligible_for_tshirt": False}

Here is a sample output of aggregating relays into families:
https://github.com/seansaito/Roster/blob/master/sample_family.json

The scheme I am proposing will hash these family objects by oldest/largest fingerprint, and add persistence to the data.

If a creation of a Family Document seems too time/effort consuming, I can first implement a similar scheme specifically for Roster and see how it goes. What do you think?

Last edited 4 years ago by seansaito (previous) (diff)

comment:12 Changed 3 years ago by karsten

Resolution: wontfix
Status: newclosed

None of the suggestions above seem practical, and this ticket has stalled for 10 months now. I don't see why we'd have to keep it open at this point. If we ever want to build something like this, we'll have to design it from scratch. Closing.

Note: See TracTickets for help on using tickets.