Develop database schema to support Onionoo's search parameter efficiently
The current way for handling incoming client requests is to load all search-relevant data about relays and bridges into memory and handle requests from there. This has two major downsides: it's difficult to extend and it requires us to limit searches to relays and bridges that have been running in the past seven days. We would like to overcome both.
After some experimenting with database schemas it seems that supporting the search
parameter efficiently will be most difficult. Here's what it does (from https://onionoo.torproject.org/protocol.html):
Return only (1) relays with the parameter value matching (part of a) nickname, (possibly $-prefixed) beginning of a hex-encoded fingerprint, beginning of a base64-encoded fingerprint without trailing equal signs, or beginning of an IP address, (2) bridges with (part of a) nickname or (possibly $-prefixed) beginning of a hashed hex-encoded fingerprint, and (3) relays and/or bridges matching a given qualified search term. Searches by relay IP address include all known addresses used for onion routing and for exiting to the Internet. Searches for beginnings of IP addresses are performed on textual representations of canonical IP address forms, so that searches using CIDR notation or non-canonical forms will return empty results. Searches are case-insensitive, except for base64-encoded fingerprints. If multiple search terms are given, separated by spaces, the intersection of all relays and bridges matching all search terms will be returned. Complete hex-encoded fingerprints should always be hashed using SHA-1, regardless of searching for a relay or a bridge, in order to not accidentally leak non-hashed bridge fingerprints in the URL. [...]
Before providing my experimental code (which I'd have to clean up anyway), I'd want to keep this discussion as open as possible and only present my general ideas how this could be implemented:
-
In the following I'll assume that we're going to use PostgreSQL as database. I already have some experience with it from other Tor-related projects, and our sysadmins like it more than other SQL databases. If NoSQL turns out to be superior for this use case based on some actual performance evaluations, I'm happy to consider that.
-
We'll have to support three comparison modes for the
search
paramter: "starts with", "starts with ignore case", and "contains as substring". -
PostgreSQL does not support substring searches (
LIKE '%foo%'
) out of the box, at least not efficiently, but there's a package calledpg_trgm
that can "determine the similarity of text based on trigram matching". It's contained in Debian'spostgresql-contrib
package, so it should be available to us. -
Right now, search terms are supported starting at a minimum length of 1 character. I could imagine raising that to 3 characters if it has major benefits to search efficiency. Though if it doesn't, let's keep supporting searches for 1 or 2 characters.
-
I briefly experimented with a normalized database schema with a
servers
table containing one row per relay or bridge, anaddresses
table with one or more addresses per server, and afingerprints
table with original and hashed fingerprint per server. The performance was not very promising, because searches would have to happen in all three tables. Happy to try again if somebody has hints what I could have done wrong. -
I also considered (but did not test) a schema with a single
servers
table that encodes all fields that are relevant for thesearch
parameter in a single string with format"lower-case-nickname#base64-fingerprint|lower-case-hex-fingerprint|lower-case-hashed-hex-fingerprint|lower-case-address1|lower-case-address2"
. For example, Tonga would have the combined string"tonga#SgzNLdx5lQg9c/XWZxAMilgx8W0|4a0ccd2ddc7995083d73f5d667100c8a5831f16d|e654ae16b76cf002bd26adaf060f8a9c5d333cc9|82.94.251.203"
, and searches forTonga
would use the following condition:WHERE search LIKE '%tonga%#' OR search LIKE '%#Tonga%' OR search LIKE '%|tonga%'
. -
There may be variants of these two schemas that have advantages that I didn't think of yet. Suggestions very welcome.
If we can find a good database schema for the search
parameter, implementing the other parameters should be relatively easy.
Here's Tonga's search data for a very first sample:
{
"t": true,
"f": "4A0CCD2DDC7995083D73F5D667100C8A5831F16D",
"n": "Tonga",
"ad": [
"82.94.251.203"
],
"cc": "nl",
"as": "AS3265",
"fs": "2007-10-27 12:00:00",
"ls": "2015-04-18 13:00:00",
"rf": [
"Authority",
"Fast",
"HSDir",
"Running",
"Stable",
"V2Dir",
"Valid"
],
"cw": 20,
"r": true,
"c": "4096/fd3428b4 lucky green <shamrock@cypherpunks.to>"
}
I also uploaded more sample search data in case that helps the discussion.