Opened 22 months ago

Last modified 8 months ago

#22026 new enhancement

Create new service to retrieve raw documents

Reported by: irl
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Ideas
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Given a fingerprint, return the latest raw server descriptor for that server (be it a relay or bridge). This could be linked from Atlas for advanced users (it probably would have saved me about 20 minutes just now debugging a thing).

Change History (12)

comment:1 Changed 22 months ago by iwakeh

Hmm, that's an interesting request.

Unfortunately, Onionoo 'forgets' about descriptor files once the data is read and aggregated in its datastore.
This seems out of Onionoo's scope to me.

Or, did I misinterpret the request?

comment:2 Changed 22 months ago by karsten

If I understand the request correctly, the ask is to introduce a new document type for raw server descriptors (and possibly other descriptors related to a relay or bridge). Right now we're not storing raw descriptors, but if we were to implement this, we'd start doing that.

The scope question is not an easy one. I do agree that it's not obviously in scope in the sense that most Atlas users (or users of other Onionoo clients) wouldn't care. But having this feature could indeed help debugging the network. See #13424 and #13425 for related requests. Thinking somewhat long-term I'd want us to build better tools that support debugging the network. The question is whether it's easier to extend Onionoo's scope or whether it's better to build a new tool just for that.

comment:3 Changed 22 months ago by iwakeh

Yes, such a tool for retrieving descriptors by certain properties would be useful.

I think that would be a CollecTor module rather than part of Onionoo. Maybe turn this into a CollecTor ticket (and add the other two tickets as children)?

comment:4 Changed 22 months ago by irl

I have no objection to this being a CollecTor ticket instead. All I really wanted was a link to something served as text/plain, so these could even just be static files named by fingerprint.
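As a rough illustration of the "static files named by the fingerprint" idea, here is a minimal Python sketch. The function name `publish_descriptor`, the output directory layout, and the assumption that fingerprints are 40 hex characters used directly as file names are all hypothetical; nothing like this exists in CollecTor or Onionoo.

```python
from pathlib import Path


def publish_descriptor(out_dir, fingerprint, raw_descriptor):
    """Write one raw descriptor to out_dir/<FINGERPRINT> (hypothetical layout)."""
    fp = fingerprint.strip().upper()
    # Reject anything that is not a 40-character hex fingerprint, so the
    # value can safely be used as a file name.
    if len(fp) != 40 or any(c not in "0123456789ABCDEF" for c in fp):
        raise ValueError("expected a 40-character hex fingerprint")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / fp
    # Write to a temporary name first, then rename, so a web server never
    # serves a half-written file.
    tmp = out_dir / (fp + ".tmp")
    tmp.write_text(raw_descriptor, encoding="utf-8")
    tmp.replace(target)
    return target
```

A plain web server could then serve that directory as text/plain, one file per relay or bridge.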

comment:5 Changed 22 months ago by karsten

Component: Metrics/Onionoo → Metrics

I'm opposed to making this a CollecTor ticket/module for the following reasons:

  • The main purpose of CollecTor is to reliably collect data from the Tor network, not to provide convenient access to that data. If both things can be achieved at the same time, great, but I see the risk of adding quite some complexity for such a descriptor query module that would at least shift focus away from descriptor completeness.
  • A good descriptor retrieval tool would also retrieve descriptors from weeks, months, or even years ago. But CollecTor does not have a database with all descriptors in its archive, nor does it need one for its main purpose stated above.
  • CollecTor's current web interface is pretty dumb, which I consider a feature. It's good that we don't need a Tomcat for CollecTor and can serve all contents using Apache.

I'd prefer a solution where we either extend Onionoo to provide raw descriptor contents (contained in JSON documents) or build a new descriptor retrieval tool that takes descriptors from CollecTor, imports them into a database, and provides a query interface for that. I don't have a clear preference for either of the two solutions.
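To make the second option more concrete, here is a minimal sketch of what the database and query side of such a retrieval tool might look like, using Python's built-in `sqlite3`. The table name, column names, and function names are assumptions made for illustration only; the step of actually parsing descriptors from CollecTor is left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical descriptor store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptors ("
        " fingerprint TEXT NOT NULL,"
        " published TEXT NOT NULL,"  # 'YYYY-MM-DD HH:MM:SS', sorts as text
        " raw TEXT NOT NULL,"
        " PRIMARY KEY (fingerprint, published))"
    )
    return conn


def add_descriptor(conn, fingerprint, published, raw):
    """Store one raw descriptor, replacing an identical (fingerprint, published) row."""
    conn.execute(
        "INSERT OR REPLACE INTO descriptors VALUES (?, ?, ?)",
        (fingerprint.upper(), published, raw),
    )


def latest_descriptor(conn, fingerprint, not_after=None):
    """Return the newest raw descriptor, optionally as of a past point in time."""
    query = "SELECT raw FROM descriptors WHERE fingerprint = ?"
    params = [fingerprint.upper()]
    if not_after is not None:
        query += " AND published <= ?"
        params.append(not_after)
    query += " ORDER BY published DESC LIMIT 1"
    row = conn.execute(query, params).fetchone()
    return row[0] if row else None
```

Because every descriptor is kept rather than only the latest one, the same query can answer "what did this relay publish months ago", which is the case CollecTor's archive alone does not make convenient.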

I'm moving this ticket to the Metrics supercomponent, admitting that it's somewhat out of scope for existing subcomponents and hence might not happen in the near future.

comment:6 in reply to:  5 Changed 22 months ago by iwakeh

Replying to karsten:

> I'm opposed to making this a CollecTor ticket/module for the following reasons:
>
>   • The main purpose of CollecTor is to reliably collect data from the Tor network, not to provide convenient access to that data.

That's a fine definition/specification of CollecTor's purpose. It should be added somewhere prominent.

> ... If both things can be achieved at the same time, great, but I see the risk of adding quite some complexity for such a descriptor query module that would at least shift focus away from descriptor completeness.
>
>   • A good descriptor retrieval tool would also retrieve descriptors from weeks, months, or even years ago. But CollecTor does not have a database with all descriptors in its archive, nor does it need one for its main purpose stated above.
>   • CollecTor's current web interface is pretty dumb, which I consider a feature. It's good that we don't need a Tomcat for CollecTor and can serve all contents using Apache.

Agreed. I actually suggested CollecTor because of a lack of matching components in the Metrics product list.
In total, it really is a new retrieval tool and way more than a component that can be attached to an existing Metrics product.

> I'd prefer a solution where we either extend Onionoo to provide raw descriptor contents (contained in JSON documents) or build a new descriptor retrieval tool that takes descriptors from CollecTor, imports them into a database, and provides a query interface for that. I don't have a clear preference for either of the two solutions.

I vote for the new retrieval tool.

> I'm moving this ticket to the Metrics supercomponent, admitting that it's somewhat out of scope for existing subcomponents and hence might not happen in the near future.

Could we create a Trac component for tickets/ideas like this one? Maybe 'Metrics Ideas' (there are surely better naming options).

comment:7 Changed 22 months ago by karsten

Component: Metrics → Metrics/Ideas

Created a Metrics/Ideas subcomponent and moved this ticket there.

comment:8 Changed 18 months ago by karsten

Summary: Add new raw document types → Create new service to retrieve raw documents

comment:9 Changed 16 months ago by irl

I've thought some more about this. In fact, the use case is almost satisfied by the directory protocol. To fetch the latest server descriptor or extrainfo descriptor, /tor/{server,extrainfo}/fp/<FP> provides this.
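For illustration, a minimal Python sketch of building and fetching such a URL follows. The host and port are placeholders for whichever directory server is queried, the function names are hypothetical, and normalising the fingerprint to upper case is an assumption made for tidiness rather than something the ticket requires.

```python
from urllib.request import urlopen


def descriptor_url(dir_host, dir_port, fingerprint, kind="server"):
    """Build a /tor/{server,extrainfo}/fp/<FP> URL for one relay."""
    if kind not in ("server", "extrainfo"):
        raise ValueError("kind must be 'server' or 'extrainfo'")
    fp = fingerprint.replace(" ", "").upper()
    if len(fp) != 40 or any(c not in "0123456789ABCDEF" for c in fp):
        raise ValueError("expected a 40-character hex fingerprint")
    return f"http://{dir_host}:{dir_port}/tor/{kind}/fp/{fp}"


def fetch_descriptor(dir_host, dir_port, fingerprint, kind="server", timeout=30):
    """Fetch the descriptor text from a directory server (needs network access)."""
    url = descriptor_url(dir_host, dir_port, fingerprint, kind)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `descriptor_url("127.0.0.1", 9030, fp, kind="extrainfo")` would point at a directory port on the local machine, assuming one is listening there.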

I can see some problems with this approach, though, so I still think it needs another layer in between:

  1. We should probably run a relay dedicated for the purpose of serving these descriptors, probably reverse proxy it behind nginx/Apache and give it a DNS name and an SSL certificate.
  2. As part of the reverse proxying, we should (maybe? not sure if actually required) rewrite the content type to text/plain.
  3. As another part of the reverse proxying, we should add an Access-Control-Allow-Origin: * header.
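Points 2 and 3 amount to a small header rewrite in the proxy layer. In practice this would be a few lines of nginx or Apache configuration; the Python sketch below only shows the intended transformation, and the function name `rewrite_headers` is hypothetical.

```python
def rewrite_headers(upstream_headers):
    """Return response headers with a text/plain type and a permissive CORS header.

    upstream_headers is a list of (name, value) pairs as received from the
    relay's directory port.
    """
    replaced = {"content-type", "access-control-allow-origin"}
    # Keep every upstream header except the two we are going to set ourselves.
    headers = [(n, v) for n, v in upstream_headers if n.lower() not in replaced]
    headers.append(("Content-Type", "text/plain"))
    headers.append(("Access-Control-Allow-Origin", "*"))
    return headers
```

With the `Access-Control-Allow-Origin: *` header in place, a browser-based client served from a different origin, such as Atlas, could fetch the descriptor with a plain cross-origin request.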

comment:10 Changed 16 months ago by karsten

Interesting idea. Some thoughts:

  • We shouldn't have to run a relay in order to provide a Metrics service. Nor should we simply forward requests to an existing relay. Ideally we'd handle requests ourselves.
  • This approach won't support retrieval of descriptors that are older than a few days, depending on type.
  • There is no easy way to extend the protocol on our side. For example, we couldn't serve descriptors other than relay descriptors (like bridge descriptors).
  • Another limitation is that we'd really only support lookups by fingerprint or descriptor digest. However, we should expect the second or third feature request to be something like a lookup by some other field which the directory protocol does not (need to) support.

comment:11 Changed 14 months ago by teor

Here's a stop-gap measure:

This makes the current descriptor available for some relays on Relay Search.

comment:12 Changed 8 months ago by hiro

Hi, I have been working on a small Elasticsearch experiment that can be used to analyse raw logs.
It can be accessed here: https://0xacab.org/spaghetti/metrics-search
At the moment it only downloads tpf logs from onionperf, but it can be configured to do a lot more.
The experiment runs via docker-compose so that it can be tested on your own machine.
I'd be happy to experiment with different data if we'd like to explore this idea further.
