Opened 3 years ago

Closed 3 years ago

#19118 closed enhancement (wontfix)

Add organization name to each relay

Reported by: virgil
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords: hardening
Cc: karsten, seansaito, twim@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Roster quantifies things like "organization diversity" as something more stringent than mere AS diversity, i.e., AS-diversity is necessary but not sufficient for organization diversity.

To do this, we are leveraging data from CAIDA.org. Particularly this data set:

Here's my python code for downloading the most recent AS-organizations data

This python generates a JSON file. Here's the one for April 2016:

http://dl.dropbox.com/u/3308162/as2orgs.json.gz

This JSON file has the organization name for each AS number. For example, here's the entry for the AS number 44925 on which torproject.org is hosted:

"44925": {"aut_name": "THE-1984-AS","changed": "","org_name": "1984 ehf","source": "RIPE"}

The next step is to have this data inserted in Onionoo. Particularly, every Onionoo relay entry should include the 'aut_name' and 'org_name' as given in the as2orgs.json file.

Now when Onionoo clients request information on a relay, they will receive the "aut_name" and "org_name" of the AS it belongs to.

That's it---just run download_latest_as2orgname_data.py once a month, weave the relevant JSON entries into the Onionoo data, and you're gold.
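
A minimal sketch of the weaving step, assuming as2orgs.json maps AS-number strings to records shaped like the entry quoted above, and that a relay entry carries an "as_number" of the form "AS44925". The function name `annotate_relay` and the relay dict layout are hypothetical; this is not Onionoo's actual code.

```python
import json

# Sample shaped like the as2orgs.json entry quoted in the description.
AS2ORGS = json.loads(
    '{"44925": {"aut_name": "THE-1984-AS", "changed": "", '
    '"org_name": "1984 ehf", "source": "RIPE"}}'
)


def annotate_relay(relay, as2orgs):
    """Return a copy of `relay` with aut_name/org_name added when known.

    `relay` is a hypothetical dict with an "as_number" key such as "AS44925".
    """
    annotated = dict(relay)
    as_number = relay.get("as_number", "")
    # Strip the "AS" prefix to get the bare number used as the JSON key.
    key = as_number[2:] if as_number.startswith("AS") else as_number
    record = as2orgs.get(key)
    if record is not None:
        annotated["aut_name"] = record["aut_name"]
        annotated["org_name"] = record["org_name"]
    return annotated


relay = {"nickname": "example", "as_number": "AS44925"}
print(annotate_relay(relay, AS2ORGS))
```

Relays whose AS number has no entry are returned unchanged, so a missing record never removes existing fields.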


Attachments (3)

collapsed_pairs.txt (116.0 KB) - added by virgil 3 years ago.
organizations that are the same in CAIDA yet merely highly similar in MaxMind
semisame_in_mm-same_in_caida.log (56.8 KB) - added by virgil 3 years ago.
same in CAIDA, semi-similar in MaxMind
MMs_very_diff.txt (259.3 KB) - added by virgil 3 years ago.
same in CAIDA


Change History (14)

comment:1 Changed 3 years ago by karsten

Status: new → needs_information

I'm not yet clear on what we'd gain by adding CAIDA.org data. We're using MaxMind's GeoLite ASN file which contains the following entry for 1984 Hosting:

1566564352,1566566399,"AS44925 1984 ehf AS number"

Onionoo would include that as follows in a relay details document:

"as_number":"AS44925","as_name":"1984 ehf AS number"

(Admittedly, the "AS number" part in that string doesn't make much sense and looks like a data import problem on MaxMind's side. But we can probably expect similar problems with CAIDA.org's data, just not with this particular entry.)
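
To make the mapping from the MaxMind row to the Onionoo fields concrete, here is a rough sketch. It assumes the first two columns are inclusive IPv4 range bounds written as integers and that the quoted label is the AS number followed by a free-form name. The helper names `parse_asn_rows` and `lookup` are made up for illustration, and the linear scan stands in for whatever lookup structure Onionoo really uses.

```python
import csv
import ipaddress

# Row quoted above from MaxMind's GeoLite ASN file.
SAMPLE = '1566564352,1566566399,"AS44925 1984 ehf AS number"\n'


def parse_asn_rows(text):
    """Parse rows of (range start, range end, "ASnnn name") into tuples."""
    rows = []
    for row in csv.reader(text.splitlines()):
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        start, end, label = row
        # The label is the AS number followed by a free-form name.
        as_number, _, as_name = label.partition(" ")
        rows.append((int(start), int(end), as_number, as_name))
    return rows


def lookup(rows, address):
    """Linear scan for the range containing an IPv4 address string."""
    value = int(ipaddress.IPv4Address(address))
    for start, end, as_number, as_name in rows:
        if start <= value <= end:
            return {"as_number": as_number, "as_name": as_name}
    return None


rows = parse_asn_rows(SAMPLE)
print(lookup(rows, "93.95.224.1"))
```

Because the lookup starts from an IP address, only ASNs that appear in this range file can ever be attached to a relay.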

But let's also look at a bigger AS/organization that hosts a lot of relays: OVH. Here's what CAIDA.org says about OVH:

ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE

And here's what MaxMind's ASN file says about OVH:

86441984,86474751,"AS16276 OVH SAS"
92733440,92798975,"AS16276 OVH SAS"
96731136,96796671,"AS16276 OVH SAS"
134738944,134739199,"AS16276 OVH SAS"
135430144,135430399,"AS16276 OVH SAS"
135432192,135434239,"AS16276 OVH SAS"
135441408,135441663,"AS16276 OVH SAS"
135556608,135556863,"AS16276 OVH SAS"
135604480,135604735,"AS16276 OVH SAS"
135792640,135794687,"AS16276 OVH SAS"
135945728,135945983,"AS16276 OVH SAS"
136175616,136175871,"AS16276 OVH SAS"
136237056,136239103,"AS16276 OVH SAS"
136404992,136407039,"AS16276 OVH SAS"
136413184,136415743,"AS16276 OVH SAS"
624623616,624689151,"AS16276 OVH SAS"
624701440,624705535,"AS16276 OVH SAS"
633012224,633077759,"AS16276 OVH SAS"
635305984,635338751,"AS16276 OVH SAS"
635371520,635437055,"AS16276 OVH SAS"
778633216,778698751,"AS16276 OVH SAS"
1056243712,1056251903,"AS16276 OVH SAS"
1466073088,1466105855,"AS16276 OVH SAS"
1532647424,1532649471,"AS16276 OVH SAS"
1534656512,1534722047,"AS16276 OVH SAS"
1558052864,1558118399,"AS16276 OVH SAS"
1578565632,1578631167,"AS16276 OVH SAS"
1728384000,1728385023,"AS16276 OVH SAS"
1841168384,1841233919,"AS35540 OVH SAS"
2382675968,2382684159,"AS16276 OVH SAS"
2809266176,2809331711,"AS16276 OVH SAS"
2954821632,2954887167,"AS16276 OVH SAS"
2988441600,2988572671,"AS16276 OVH SAS"
3001868288,3001872383,"AS16276 OVH SAS"
3104444672,3104444927,"AS16276 OVH SAS"
3104579584,3104580095,"AS16276 OVH SAS"
3164930048,3164985007,"AS16276 OVH SAS"
3164985009,3164995583,"AS16276 OVH SAS"
3227451392,3227467775,"AS16276 OVH SAS"
3227713536,3227779071,"AS16276 OVH SAS"
3244823296,3244823551,"AS16276 OVH SAS"
3245162240,3245162495,"AS16276 OVH SAS"
3278773760,3278774271,"AS16276 OVH SAS"
3287738368,3287738879,"AS16276 OVH SAS"
3323674624,3323691007,"AS16276 OVH SAS"
3325198336,3325231103,"AS16276 OVH SAS"
3328479232,3328483327,"AS16276 OVH SAS"
3337957376,3337961471,"AS16276 OVH SAS"
3585744896,3585753087,"AS16276 OVH SAS"
3590029312,3590045695,"AS16276 OVH SAS"

Wouldn't we include the exact same output after switching to CAIDA.org data?

I'm hesitant to add another data source, because I expect inconsistencies between the two data sources where we don't have the exact same AS numbers in the two files and similar issues.

Another (minor) issue is the additional overhead for Onionoo server operators.

Stated differently, I'd want us to have a good reason for adding another data source. Can you maybe give a counterexample where using CAIDA.org data in addition to MaxMind data would enhance Onionoo data notably?

comment:2 Changed 3 years ago by virgil

As far as I can tell, the OVH example you gave is, by happenstance, an instance where the CAIDA data is superior.

Let's look at:

ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE

For these entries, CAIDA says that *the same organization* with id=ORG-OS3-RIPE, owns both AS16276 and AS35540.

When you look up OVH on MaxMind, you get only AS16276. In this case I believe the issue is that when you look at the raw records, AS16276 has the as-name OVH while AS35540 has the as-name OVH-TELECOM. Ergo on a plain-text matching they are not the same. The CAIDA data squashes these two distinct strings into the same organization id, and thus into the same organization name.
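
The grouping can be sketched as follows, using only the three records quoted above. The field layout (AS number, changed, aut_name, org id, source for AS records; org id, changed, org_name, country, source for organization records) is inferred from those samples and from the JSON field names in the description, so treat it as an assumption. `group_by_org` is a hypothetical helper.

```python
from collections import defaultdict

# The three records quoted above, in CAIDA's pipe-delimited layout.
SAMPLE = """\
ORG-OS3-RIPE||OVH SAS|FR|RIPE
16276||OVH|ORG-OS3-RIPE|RIPE
35540||OVH-TELECOM|ORG-OS3-RIPE|RIPE
"""


def group_by_org(text):
    """Return (org_names, asns_by_org) parsed from as2org-style records."""
    org_names = {}
    asns_by_org = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("|")
        if len(fields) != 5:
            continue  # skip anything not shaped like the samples
        if fields[0].isdigit():
            # AS record (assumed): number, changed, aut_name, org_id, source.
            asns_by_org[fields[3]].append(int(fields[0]))
        else:
            # Organization record (assumed): org_id, changed, org_name, ...
            org_names[fields[0]] = fields[2]
    return org_names, dict(asns_by_org)


org_names, asns_by_org = group_by_org(SAMPLE)
for org_id, asns in asns_by_org.items():
    print(org_names.get(org_id, "?"), asns)
```

Both AS numbers end up under the single organization id ORG-OS3-RIPE, regardless of their differing aut_name strings.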

Going beyond this specific case, the CAIDA data does some cleverness (see http://www.caida.org/research/topology/as2org/ for the methodology) to determine who the "real organization" is who owns the AS-number. This is helpful when firm A purchases firm B, and then firm A becomes an upstream provider of firm B (making firm B part of firm A's "cone"). The MaxMind data would list firm B, but CAIDA would list firm A.

These sorts of measures are important to ensure we're getting "real organizational diversity".


comment:3 Changed 3 years ago by karsten

Hang on, the MaxMind data that I quoted above does include this row:

1841168384,1841233919,"AS35540 OVH SAS"

So, in this case the CAIDA data looks about as good as MaxMind's.

Can you give an example, or better a handful of examples, where CAIDA data is obviously better than MaxMind's?

comment:4 Changed 3 years ago by virgil

Share the MaxMind data with me and I will do the comparison.

Changed 3 years ago by virgil

Attachment: collapsed_pairs.txt added

organizations that are the same in CAIDA yet merely highly similar in MaxMind

comment:6 Changed 3 years ago by virgil

Comparing the two data sets...

#ASNs in CAIDA: 73,256
#ASNs in MaxMind: 53,354

#ASNs in both: 52,810
#ASNs only in CAIDA: 20,446
#ASNs only in MaxMind: 544
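
A comparison like this boils down to set arithmetic over the two ASN lists. The sketch below uses a hypothetical `compare_asn_sets` helper with tiny made-up inputs, and then checks that the counts reported here are internally consistent.

```python
def compare_asn_sets(caida_asns, maxmind_asns):
    """Count ASNs shared by both sources and unique to each one."""
    caida, maxmind = set(caida_asns), set(maxmind_asns)
    return {
        "in_both": len(caida & maxmind),
        "only_caida": len(caida - maxmind),
        "only_maxmind": len(maxmind - caida),
    }


# Toy input, not real data: two ASNs overlap, one is unique to each side.
print(compare_asn_sets([1, 2, 3], [2, 3, 4]))

# Consistency check on the figures reported in this comment.
assert 52_810 + 20_446 == 73_256  # in both + only CAIDA == all CAIDA
assert 52_810 + 544 == 53_354     # in both + only MaxMind == all MaxMind
```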

(1) On the above alone, the CAIDA data is more comprehensive.

(2) I also attach a list of 849 pairs of entries that closely resemble each other within MaxMind (between 90% and 99% similar) yet are merged into a single entity in CAIDA. If we want an organization diversity measure, we need the entities to match.

We can go deeper here, but I feel the point has been made. The next point would be that if firm A buys firm B, our org-diversity measure needs to look at firm A, not firm B. This is what CAIDA does and MaxMind does not. http://www.caida.org/research/topology/as2org/
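
One way to find such near-duplicate pairs is sketched below. The ticket does not say which similarity measure was used, so `difflib.SequenceMatcher` is a stand-in, and the AS numbers, names, and organization ids are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_but_distinct(names_by_asn, org_by_asn, low=0.90, high=0.99):
    """Find AS pairs whose names differ slightly but share one org id."""
    pairs = []
    for (asn_a, name_a), (asn_b, name_b) in combinations(
        sorted(names_by_asn.items()), 2
    ):
        if name_a == name_b:
            continue  # identical names already match without extra data
        org_a, org_b = org_by_asn.get(asn_a), org_by_asn.get(asn_b)
        if org_a is None or org_a != org_b:
            continue  # not merged into one organization
        ratio = SequenceMatcher(None, name_a, name_b).ratio()
        if low <= ratio <= high:
            pairs.append((asn_a, asn_b, round(ratio, 3)))
    return pairs


# Invented sample data (AS numbers from the documentation-reserved range).
names = {
    64500: "Example Hosting Ltd",
    64501: "Example Hosting Ltd.",
    64502: "Other Networks",
}
orgs = {64500: "ORG-EX", 64501: "ORG-EX", 64502: "ORG-OTHER"}
result = similar_but_distinct(names, orgs)
print(result)
```

The pairwise loop is quadratic in the number of ASNs, so a run over tens of thousands of entries would need some pre-grouping to finish in reasonable time.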


Changed 3 years ago by virgil

Attachment: semisame_in_mm-same_in_caida.log added

same in CAIDA, semi-similar in MaxMind

comment:7 Changed 3 years ago by virgil

For fun I added another 492 pairs of ASs that CAIDA treats as the same yet MaxMind has as different. Inspecting the names of these pairs, as well as the BGP relationships among those ASs, CAIDA is right to merge them.

comment:8 Changed 3 years ago by karsten

Okay, let's evaluate the pros and cons of adding CAIDA data. I'm counting pros as +1, neutrals as 0, and cons as -1. Let's see whether we'll get a positive number here.

  • The CAIDA data doesn't contain IP address ranges, so we'll have to keep using MaxMind data in addition to CAIDA data. Okay. But that means that CAIDA's comprehensiveness in terms of number of ASNs is meaningless to us, because we're limited to whatever ASNs are in MaxMind data. (0)
  • MaxMind contains 67 of its 2833 ASNs (not sure where your 53k number comes from) that CAIDA does not know about. Right now we'd have organization names for these ASNs, but once we switch over to using CAIDA's organization names we'd provide less information there. And I'm not willing to provide MaxMind data if CAIDA doesn't have anything for a given ASN, because nobody will understand that, nor do I want to provide both organization names. This is a serious problem that I don't know how to work around cleanly. (-1)
  • CAIDA data is only updated every three months, MaxMind provides a new update every month. It already happens that people ping me because MaxMind's data is old, and that's only going to get worse with CAIDA. Somewhat related, MaxMind has been providing ASN data for many years now without major issues whereas CAIDA apparently started providing data only 2 years ago. (-1)
  • We'd still need to write, review, and test code to handle CAIDA's data format. This could become a neutral if somebody submits a good patch, but please only do that if that makes the overall sum positive, or that patch might not get accepted. (-1)
  • Operating an Onionoo server becomes a bit harder with an additional data source to update. We want more people to run Onionoo servers at some point, so we should make that process easier not harder. (-1)
  • MaxMind indeed contains similar but not equivalent organization names which should be exactly the same. However, the actual number is lower than what your pairwise comparison implies, and somebody measuring organization diversity could always use a similarity metric as yours when looking at these strings. Anyway, CAIDA is indeed better here than MaxMind. (1)

I'm calculating -3 as the sum here. That means no. Sorry. Leaving this ticket open for a few more days in case you have convincing arguments why the cons I'm listing above are actually neutrals or can be turned into pros.

Changed 3 years ago by virgil

Attachment: MMs_very_diff.txt added

same in CAIDA

comment:9 Changed 3 years ago by virgil

The CAIDA data doesn't contain IP address ranges, so we'll have to keep using MaxMind data in addition to CAIDA data. Okay. But that means that CAIDA's comprehensiveness in terms of number of ASNs is meaningless to us, because we're limited to whatever ASNs are in MaxMind data. (0)

You're right. We still only start with the IP#, and it would be a pain to implement a method to learn the AS numbers. Okay, that kills any utility of CAIDA having more ASs.

MaxMind contains 67 of its 2833 ASNs (not sure where your 53k number comes from) that CAIDA does not know about. Right now we'd have organization names for these ASNs, but once we switch over to using CAIDA's organization names we'd provide less information there. And I'm not willing to provide MaxMind data if CAIDA doesn't have anything for a given ASN, because nobody will understand that, nor do I want to provide both organization names. This is a serious problem that I don't know how to work around cleanly. (-1)

CAIDA data is only updated every three months, MaxMind provides a new update every month. It already happens that people ping me because MaxMind's data is old, and that's only going to get worse with CAIDA. Somewhat related, MaxMind has been providing ASN data for many years now without major issues whereas CAIDA apparently started providing data only 2 years ago. (-1)

The 53k figure is actually correct. Additionally, I would never wholly replace MaxMind data with CAIDA---the fields convey very different things. MaxMind says which organization is the registered owner, while CAIDA does some cleverness to learn the parent organization. These are very, very different. I would propose that there be a new field, called something like `parent_organization`, for each relay, which is populated by CAIDA [when it exists]. I claim this sets both of the above (-1)s to (0).

We'd still need to write, review, and test code to handle CAIDA's data format. This could become a neutral if somebody submits a good patch, but please only do that if that makes the overall sum positive, or that patch might not get accepted. (-1)

The CAIDA format is a standard CSV. https://commons.apache.org/proper/commons-csv/  (0)

Operating an Onionoo server becomes a bit harder with an additional data source to update. We want more people to run Onionoo servers at some point, so we should make that process easier not harder. (-1)

This is indeed an issue. It seems entirely reasonable to me that if someone doesn't want to use the CAIDA data, they simply won't have the `parent_organization` field. Totally cool with that. (0?)
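
The optional-field idea could look roughly like this. It is only a sketch: `details_with_parent` and the mapping keyed by "AS…" strings are hypothetical, and nothing here reflects how Onionoo actually builds details documents.

```python
def details_with_parent(relay, parent_org_by_asn=None):
    """Add `parent_organization` only when CAIDA-derived data has an entry.

    `parent_org_by_asn` is None when the operator has not loaded CAIDA data.
    """
    details = dict(relay)
    if parent_org_by_asn:
        parent = parent_org_by_asn.get(relay.get("as_number"))
        if parent:
            details["parent_organization"] = parent
    return details


relay = {"as_number": "AS35540", "as_name": "OVH SAS"}
# Hypothetical mapping derived from the OVH records quoted earlier.
caida_parents = {"AS16276": "OVH SAS", "AS35540": "OVH SAS"}

print(details_with_parent(relay, caida_parents))  # field present
print(details_with_parent(relay))                 # no CAIDA data: unchanged
```

The existing MaxMind-derived fields are never touched, so a server without CAIDA data serves the same documents as today.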

MaxMind indeed contains similar but not equivalent organization names which should be exactly the same. However, the actual number is lower than what your pairwise comparison implies, and somebody measuring organization diversity could always use a similarity metric as yours when looking at these strings. Anyway, CAIDA is indeed better here than MaxMind. (1)

So I actually low-balled this for you.

Here are the actual numbers.

  • # of ASNs for which MM's organizations are different, yet CAIDA's 'parent organization' is the same: 3299
  • # of ASNs for which MM's organizations are _very_ different, yet CAIDA's 'parent organization' is the same: 1935

I attach a list of those 1935 pairs as MMs_very_diff.txt .

Two AS-ORG names being similar is neither sufficient nor necessary for two ASs to be correctly grouped under the same parent organization. We totally tried to learn these relationships from the MaxMind data, and failed. I was in the process of deriving my own method from the academic literature until I found the CAIDA data, which did everything I needed.

I have no stake in this. We tried to use something like MaxMind for Roster, failed, but then discovered CAIDA worked. You then requested that we move as much functionality into Onionoo as possible. So this is me trying to do that. It's of course totally fine to say that this is too niche a need to be worth including in Onionoo. In which case, Roster will just continue to use its own database for this---which is totally cool. I'm just trying to, as you requested, upload the goods we found to the Onionoo Mothership. This is me exerting effort to be a good uploader of candidate good things to Onionoo.

comment:10 Changed 3 years ago by twim

Cc: twim@… added

comment:11 Changed 3 years ago by karsten

Resolution: wontfix
Status: needs_information → closed

Still not convinced that this is worth our time. Closing. Sorry!
