The current way for handling incoming client requests is to load all search-relevant data about relays and bridges into memory and handle requests from there. This has two major downsides: it's difficult to extend and it requires us to limit searches to relays and bridges that have been running in the past seven days. We would like to overcome both.
After some experimenting with database schemas it seems that supporting the search parameter efficiently will be most difficult. Here's what it does (from https://onionoo.torproject.org/protocol.html):
Return only (1) relays with the parameter value matching (part of a) nickname, (possibly $-prefixed) beginning of a hex-encoded fingerprint, beginning of a base64-encoded fingerprint without trailing equal signs, or beginning of an IP address, (2) bridges with (part of a) nickname or (possibly $-prefixed) beginning of a hashed hex-encoded fingerprint, and (3) relays and/or bridges matching a given qualified search term. Searches by relay IP address include all known addresses used for onion routing and for exiting to the Internet. Searches for beginnings of IP addresses are performed on textual representations of canonical IP address forms, so that searches using CIDR notation or non-canonical forms will return empty results. Searches are case-insensitive, except for base64-encoded fingerprints. If multiple search terms are given, separated by spaces, the intersection of all relays and bridges matching all search terms will be returned. Complete hex-encoded fingerprints should always be hashed using SHA-1, regardless of searching for a relay or a bridge, in order to not accidentally leak non-hashed bridge fingerprints in the URL. [...]
Before providing my experimental code (which I'd have to clean up anyway), I'd want to keep this discussion as open as possible and only present my general ideas how this could be implemented:
In the following I'll assume that we're going to use PostgreSQL as database. I already have some experience with it from other Tor-related projects, and our sysadmins like it more than other SQL databases. If NoSQL turns out to be superior for this use case based on some actual performance evaluations, I'm happy to consider that.
We'll have to support three comparison modes for the search parameter: "starts with", "starts with ignore case", and "contains as substring".
PostgreSQL does not support substring searches (LIKE '%foo%') out of the box, at least not efficiently, but there's a package called pg_trgm that can "determine the similarity of text based on trigram matching". It's contained in Debian's postgresql-contrib package, so it should be available to us.
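For illustration, a rough sketch of how that could look (table and column names are made up here):

  -- pg_trgm ships with postgresql-contrib
  CREATE EXTENSION IF NOT EXISTS pg_trgm;

  -- a GIN trigram index lets LIKE '%foo%' use the index instead of scanning every row
  CREATE INDEX relays_nickname_trgm_idx
    ON relays USING gin (nickname gin_trgm_ops);

  -- substring search that can use the trigram index (PostgreSQL 9.1 or newer)
  SELECT fingerprint, nickname
    FROM relays
   WHERE nickname LIKE '%tonga%';

Whether such an index is fast enough on our data, and how large it gets, is something we'd have to measure.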
Right now, search terms are supported starting at a minimum length of 1 character. I could imagine raising that to 3 characters if it has major benefits to search efficiency. Though if it doesn't, let's keep supporting searches for 1 or 2 characters.
I briefly experimented with a normalized database schema with a servers table containing one row per relay or bridge, an addresses table with one or more addresses per server, and a fingerprints table with original and hashed fingerprint per server. The performance was not very promising, because searches would have to happen in all three tables. Happy to try again if somebody has hints what I could have done wrong.
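For reference, the rough shape of that normalized schema (simplified, with made-up column names rather than exactly what I tried):

  CREATE TABLE servers (
    server_id BIGSERIAL PRIMARY KEY,
    is_relay  BOOLEAN NOT NULL,
    nickname  TEXT
  );

  CREATE TABLE addresses (
    server_id BIGINT REFERENCES servers (server_id),
    address   TEXT NOT NULL          -- one row per address
  );

  CREATE TABLE fingerprints (
    server_id          BIGINT REFERENCES servers (server_id),
    fingerprint        TEXT,         -- hex-encoded fingerprint (or hashed fingerprint for bridges)
    hashed_fingerprint TEXT          -- SHA-1 of the above, hex-encoded
  );

A single search request then has to hit all three tables and combine the results, which is where the performance problems showed up.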
I also considered (but did not test) a schema with a single servers table that encodes all fields that are relevant for the search parameter in a single string with format "lower-case-nickname#base64-fingerprint|lower-case-hex-fingerprint|lower-case-hashed-hex-fingerprint|lower-case-address1|lower-case-address2". For example, Tonga would have the combined string "tonga#SgzNLdx5lQg9c/XWZxAMilgx8W0|4a0ccd2ddc7995083d73f5d667100c8a5831f16d|e654ae16b76cf002bd26adaf060f8a9c5d333cc9|82.94.251.203", and searches for Tonga would use the following condition: WHERE search LIKE '%tonga%#' OR search LIKE '%#Tonga%' OR search LIKE '%|tonga%'.
There may be variants of these two schemas that have advantages that I didn't think of yet. Suggestions very welcome.
If we can find a good database schema for the search parameter, implementing the other parameters should be relatively easy.
Here's Tonga's search data for a very first sample:
How important is disk/memory space usage?
Can you afford to denormalise and blow out the usage slightly, say one row per address, rather than one row per server?
For example, you could have one table server_address with:
one row per relay or bridge address,
duplicated server information against each address, and
a column for the original fingerprint and another column for the hashed fingerprint
Then you could search in the one table, but still have the advantages of searching separate columns by the start of a string (this works well with indexes).
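A sketch of what I mean (names are made up):

  CREATE TABLE server_address (
    server_address_id  BIGSERIAL PRIMARY KEY,  -- identity column, see below
    nickname           TEXT,
    fingerprint        TEXT,   -- original hex-encoded fingerprint
    hashed_fingerprint TEXT,   -- SHA-1 of the fingerprint, hex-encoded
    address            TEXT    -- one row per address, server columns duplicated
  );

  -- plain b-tree indexes cover the "starts with" comparisons
  CREATE INDEX ON server_address (nickname text_pattern_ops);
  CREATE INDEX ON server_address (fingerprint text_pattern_ops);
  CREATE INDEX ON server_address (hashed_fingerprint text_pattern_ops);
  CREATE INDEX ON server_address (address text_pattern_ops);

(text_pattern_ops is only needed if the database collation isn't C; it makes LIKE 'foo%' able to use the index.)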
One assumption that I'm making here is that you're focusing on the speed of a single-string search. Doing an intersection is then O(n) by the number of terms, or perhaps better if you feed the results of the previous search into your next search.
Oh, do you have a unique indexed / primary key / integer identity column on the table?
That will help with creating unions and looking up rows, and other indexes.
The problem is the LIKE '%foo%'. That's the only problem. All the other queries will be fast because they are index-able using prefix search, LIKE 'foo%'. Looking for a schema or de-normalizing from 3NF won't fix it. It's not going to be solved by a trigram-match either.
Check out the documentation and look at how the trigram-match is implemented. The trigram is faster only because it allows a form of indexing, but the implementation is computation heavy with a potentially massive search-space. I was wondering--what is the importance of a similarity metric? The trigram-match computes similarity for ranking results. How important (or relevant) is this if searching for example in fingerprints? If similarity isn't that important maybe a user-defined function would work better.
The performance of LIKE '%foo%' is like compiling a regex in Java for every query, every time. It would completely lack any VM runtime optimization unlike the case for substring or contains.
Another possibility would be to opportunistically try to complete the query using a prefix search first.
You might also do for nickname, and ip, what you do for fingerprint. Make it a separate query type in the protocol and do search_fp, search_nn, search_ip. Not ideal, I know, but it would avoid a multi-table search combined with a scan of each. This would also make possible pattern length adjustments. I mean a single character search might make sense for nickname search, but does it equally make sense for ip or fingerprint?
I'm really interested in the user-defined function possibility because that wouldn't require any compromise. Especially if similarity isn't that important a metric. I'm sure there are other well known algorithms that are suitable replacements.
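To illustrate, such a function could be as small as this sketch (no similarity ranking, just position(); names invented):

  -- case-insensitive "contains", returns true if needle occurs anywhere in haystack
  CREATE OR REPLACE FUNCTION contains_ci(haystack TEXT, needle TEXT)
  RETURNS BOOLEAN AS
  $$ SELECT position(lower($2) in lower($1)) > 0 $$
  LANGUAGE SQL IMMUTABLE;

  -- example use
  SELECT nickname FROM server_address WHERE contains_ci(nickname, 'tonga');

That still means a sequential scan over the table, of course; the open question is whether such a scan is fast enough at the sizes we care about.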
Oh, and, did you see, in 9.4, improvements to HSTORE and JSONB :D So it doesn't have to be a choice between NoSQL and PostgreSQL.
But before going into the details, let's briefly talk about some simplifications:
The only reason we need LIKE '%foo%' is to support substring searches in nicknames. But if it's too hard to implement this in a database search, then maybe we'll have to sacrifice this feature. I could imagine that we're only supporting starts-with searches if this makes this easier to build and more likely to stay efficient over time.
We might as well talk about giving up case-sensitive searches for base64-encoded fingerprints and just make them case-insensitive like everything else. Again, let's not do this lightly, but if it reduces complexity and improves performance, let's consider it.
And let me repeat my suggestion from above to require search terms to be at least 3 characters long. It's not the case right now, but I could imagine adding that requirement if it helps us.
However, note that there's another parameter, contact, that I didn't mention before, because it's simpler than search. But it depends even more on substring searches, so even if we'd simplify the search parameter, we would still have to support substring searches in the contact parameter for it to be useful. Unless there are better ideas for supporting searches in contact lines?
What do you think, how would these three simplifications affect our chances to get this implemented cleanly and in an efficient way?
And what would you say if we'd not only try to support searches for the 15k relays and bridges that have been running in the past week, but for the currently 600k relays and bridges that have existed in the past 10 years and another 1.4M that will exist in the next years?
I like leeroy's suggestion of a substring(foo) (or, perhaps more efficiently, a position(foo)) over a LIKE %foo%. Some database engines can recognize their equivalence in the simple case of %text%, but you don't want to rely on this. And you don't really need people to be able to do %foo%bar%, do you?
Allowing people to insert unquoted metacharacters, even into a LIKE string, can open you up to DoS issues. It depends on how powerful postgres LIKE patterns are.
For comparing the starts of strings, sometimes left(identity, len(search_string)) = search_string can be faster than identity LIKE search_string + '%' for similar reasons.
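To make that concrete, the variants in PostgreSQL syntax (placeholder table and column names):

  -- substring match without any LIKE metacharacters
  SELECT nickname FROM server_address WHERE position('tonga' in nickname) > 0;

  -- prefix match, LIKE style
  SELECT nickname FROM server_address WHERE nickname LIKE 'tonga' || '%';

  -- prefix match, left() style
  SELECT nickname FROM server_address WHERE left(nickname, length('tonga')) = 'tonga';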
The database may be more efficient if its collation is case-insensitive overall. And it will be easier to program if that is the default. Individual columns and comparisons can still be case-sensitive (or, even more efficiently, compared exactly, which is equivalent for search purposes here).
You might also want to consider storing the base64 encoded string in its decoded binary form in a bit varying column, and then doing the decoding for each search term before comparing it using position(). The combined decoding and comparison might be faster, and the storage and indexing will definitely be faster and smaller. This strategy might apply to some of the other fingerprints (store in binary form, convert before search). This would also avoid some of your case-insensitive matching. It would also give you a canonical form for storage and comparison. You'd have to encode them for display, or, as a speed-space tradeoff, store a lookup table of binary->encoded for each fingerprint in the database.
You can even compare partial bytes, which may be a feature, or a bug if you're looking at the middle of strings. You might need to restrict it to position() > 0 && position() % 4 == 0 for hex, and position() > 0 && position() % 6 == 0 for base64, so you only compare whole encoded character positions, and not sub-character bit positions. Otherwise users might complain that they put certain characters in and don't get any out.
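A sketch of the binary-storage idea, using bytea rather than bit varying so that comparisons stay on whole bytes (which also means only even-length hex prefixes can be matched this way; names are placeholders):

  CREATE TABLE fingerprints_bin (
    fingerprint BYTEA NOT NULL   -- 20 raw bytes instead of 40 hex characters
  );

  -- prefix search: decode the (even-length) hex search term, compare raw bytes
  SELECT encode(fingerprint, 'hex')
    FROM fingerprints_bin
   WHERE position(decode('4a0ccd', 'hex') in fingerprint) = 1;

decode() raises an error on an odd number of hex digits, so odd-length prefixes would have to fall back to a text comparison (or be rejected).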
By the way, full-text search won't help, because in postgres it does words, not substrings.
And what would you say if we'd not only try to support searches for the 15k relays and bridges that have been running in the past week [...]
A database is an important transition to remove the restrictions on searchable period-of-time. You've mentioned you only get a week with the onionoo resources as provisioned. It's reasonable to expect a database to scale to the entirety of available history. It's the substring search on the entire history that'll get you. Your limit would be the storage space used by the database, and the i/o of requests. Storage isn't so much an issue, and neither is transaction i/o which can scale with pgpool.
What do you think, how would these three simplifications affect our chances to get this implemented cleanly and in an efficient way?
1a) If you're only interested in substring search on nickname and contact then an alternative to LIKE '%foo%' must be considered. Some real substring performance tests should be performed. What are the candidates? There's the position function of PostgreSQL, as teor mentioned, user-defined functions, and a comparison with trigram-match indexing. In particular this testing would apply to substring search, since you defined the rest of the queries as prefix-search capable (best case).
1b) I can think of ways around the inherent limits imposed by an unbounded data set. A key consideration is hiding the latency of a big query without hurting performance for everyone else. Ideally it should scale and be fair. A simplification of this thinking is:
spawn n-helpers in a pool, dynamically-sized based on load and resources available
helpers are distributed across requests to hide latency of big requests
receive a request on the front-end
associate a request with n-helper
each helper is associated with a request and a range of data
helpers are scheduled to prevent overloading the server
for each helper apply function on range of data
helper returns results then re-joins the pool
repeat for n-helper associated with a request until all results gathered
1c) A thought about introducing descriptive notation to the search pattern itself. A marker of sorts to encode the search type. In the absence of any marker, perform a most general search. @foo@, or something like it, could transform into a contact-specific search. This would avoid changes to parametrization and keep backwards compatibility. There wouldn't be much difference in processing a request; either way you have to parse the parameters. The thinking here is that the current parametrization will become more painful as Onionoo is extended.
2a) I do agree it would simplify things to have a case insensitive search of base64 fingerprints. However, you've specified this type of search as prefix search capable. So it would be best case performance. A query that spans multiple tables can use a combination of substring-search and prefix-search to maximize performance of the query. It's even conceivable to perform the query in separate transactions rather than one.
2b) A potential improvement here might be to split the fingerprint and index only part. This could apply to any of the other fields as well. In particular, this asks a question about partial fingerprint searches.
A general change to a minimum of 3 characters would cause problems for nickname searches, wouldn't it? A nickname is allowed to be a single character. Let's suppose for a moment that the minimum is 1. Well then only one character needs to match in order to return a result. Compare this to 3 characters where all 3 need to match. I do, however, think there's value in the careful consideration of a "single unit" with respect to an entity type. It influences the returned data for a particular query. A single character for nicknames, a single octet for an ip, a hex value for a fingerprint, etc. This may produce more relevant results for a given pattern.
I don't think there's any doubt a clean and efficient implementation is doable, as long as it's understood that it's unlikely to ever be as fast as Java. The Java VM offers runtime optimization; PostgreSQL offers a transactional data store. Even in a best case (PostgreSQL, n-workers, compiled user-defined substring search, and Java) there's still an overhead from the transaction. I think the perception that a transition to a database is a performance optimization requires clarification. Performance depends on perspective.
A database makes possible efficient storage, and management of a large set of data.
It will enable writes and reads to overlap. Importing new data while providing existing data becomes possible.
Built in replication means creating copies of the current state is less of a hassle.
Enables a highly extensible protocol. Effectively turning Onionoo into a sort of ORM for Tor data.
Onionoo can prepare the response while waiting on the data.
The Java deployment is a pain to scale; PostgreSQL is designed to scale independently from the Java-powered engine behind the scenes. It's a compromise.
For the purpose of testing it makes sense to define what to watch for in the current worst-case. What defines the worst case and what are the impacts of poor performance?
Slow client connections (slowloris), big requests.
Long duration requests tie up resources and reduce availability for other requests.
If the record that matches a search is at the end of a returned data set, there might be a lot of data in between. A mitigation, in the case of having the match at the beginning, is to cancel early.
The reality, in an extensible Onionoo protocol, is that some queries will take longer. It's impossible to make every database query conform. What impact will this have on the current Onionoo deployment? Whatever concerns this may raise need to be addressed before the database transition.
In consideration of ways to investigate the impact of performance here is a list thus far.
Improvements to substring search
Decouple response creation from the data
Indexing parts of a field
Increasing relevance of results
Currently deployed worst-case analysis
And in addition to the above.
Limiting searchable period of time per request. You might not want a single request to search the past several years, at least not all at once. On the other hand, several searches, each covering n months, might be more acceptable.
Complex queries involving multiple joins, conditions, predicates to simulate extending the protocol. For example a search involving a range in first-seen or last-seen.
HSTORE-like fields, for example, like adding whois data. It can be added to the database as retrieved. Doesn't need to be retrieved immediately. Doesn't need any sort of search. Can be compressed. You know it's going to happen, eventually, because this kind of data can change over time.
Bin the data. Like CollecTor (not exactly). Implicitly restricts size of data to recent and longer interval.
What other tests will help with the transition?
Perhaps we could look at possible 3NF forms and consider what types of problems might occur from complex queries involving joins, conditions, and predicates. I'm not a huge fan of de-normalization. It's a bad idea to optimize prematurely for a type of query. As mentioned above I think there are lots of other performance problems to consider (not lightly). I digress; what forms and issues will be encountered with the summary document data in 3NF?
An id is included as a preliminary key for generality. It's there to make relationships clear between tables that may be added later (for other documents). It's much easier to index a long integer than, say, an entire fingerprint (or in base64). At least that's the premise. That being said the router fingerprint will end up indexed by the query optimizer anyway.
This form doesn't consider the possibility of storing changes to the flags for a given router (determined by fingerprint) over time. It only stores the current flags for a router fingerprint. Maybe a change such as having an HSTORE in Router map an id (of Router_flag) to a hash of the flags. Router_flag would have the id and encoded flags in a single record. Upon detecting changes in flags, create a new record in Router_flag and update the HSTORE of Router. A date field in Router_flag might also be used to keep track of a last-seen.
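To make that a bit more concrete, one possible shape (very much a sketch; names invented, and it assumes a Router table already exists):

  CREATE EXTENSION IF NOT EXISTS hstore;

  CREATE TABLE router_flag (
    id        BIGSERIAL PRIMARY KEY,
    flags     TEXT NOT NULL,   -- encoded flag set, e.g. 'Fast,Running,Stable'
    last_seen DATE
  );

  -- router keeps a map from router_flag.id to a hash of that flag set
  ALTER TABLE router ADD COLUMN flag_history HSTORE;

  -- on a flag change: insert a new router_flag row, then something like
  -- UPDATE router SET flag_history = flag_history || hstore(new_id::text, flag_hash) WHERE ...;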
A realistic schema would mean the router family needs to be filled in during a second pass.
So how does this look for the search query? I'll run some EXPLAIN (ANALYZE, BUFFERS) to add to the discussion after filling in the tables with some data. While I'm at it I'll also start some indexing and substring tests. I'm not sure how to set the bar for performance. What should I look for when performing a query with servers (router), addresses (ip), and fingerprints? As-fast-as-possible comes to mind :D
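For instance, something along the lines of:

  EXPLAIN (ANALYZE, BUFFERS)
    SELECT fingerprint
      FROM server_address
     WHERE nickname LIKE '%tonga%';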
Please feel free to recommend other forms, or to make adjustments.
edit 05/10: readability, improvements, and metric gathering
Sorry, it's long, and some of this might merit its own ticket, but not being sure I'll put it here for now.
Since this appears to be a discussion thread, I will throw in my two cents as a newbie here.
Has it been considered to let a search engine take care of the data instead, and let the raw data stay as is (in files)? Both Solr and Lucene can be embedded into the code just like Jetty. Painful queries like "%a%" will always be a bit slow but are still manageable. Lucene/Solr's filter support will make sure that queries for "fresh data" will go as fast as ever, even with 10 years of data as mentioned earlier.
Ueland, it might be that search engines can help us here, but AFAIK, nobody has investigated that yet. Note that both our data and possible searches are highly structured, so that databases might be a better fit than search engines. That's why I'm more optimistic about databases than search engines. But I'm not ruling out anything here, and if you want to look into search engines for this purpose, that might be quite interesting. If you're interested, please take a look at the protocol specification for the possible searches that the search engine would have to support. Also note the possible simplifications to that protocol mentioned on this ticket.
teor, leeroy, thanks a lot for your comments. I started writing a response a couple of times now, but was overwhelmed by the amount of detail in this thread and gave up. But I really want to respond, because I feel we're getting somewhere. I hope the following response makes sense, even though it doesn't address all your suggestions. But all the good ideas remain in this thread, and whenever we start writing code, we should revisit this entire thread.
We don't need to support %foo%bar%. We should check all user input and avoid searches including % or other special characters before passing it to the database.
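That check could happen in the Java frontend, or the term could be escaped when building the query; a sketch of the latter (placeholder table and column, assuming standard_conforming_strings and backslash as the LIKE escape character):

  PREPARE nickname_search (text) AS
    SELECT nickname
      FROM server_address
     WHERE nickname LIKE '%' ||
           replace(replace(replace($1, '\', '\\'), '%', '\%'), '_', '\_')
           || '%' ESCAPE '\';

  EXECUTE nickname_search('50%_off');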
If substring searches are just too hard to implement, let me suggest another simplification: how about we sacrifice the contact parameter together with substring searches in the nickname field? I think that people are using the contact parameter to obtain all relays run by the same entity, but they could as well use the family parameter for that. If the choice is to either keep supporting that parameter or to offer searches over the entire history, I'd prefer the latter. Again, not something to decide quickly, but worth considering. I should ask people on tor-dev@ before we make a final decision though.
At the same time, maybe we shouldn't restrict searches to 3+ characters for the reasons you state above. Scratch that idea, it probably wouldn't help that much anyway.
Example: a search for "u" would still return all relays called "Unnamed" and "u", but not "default".
Regarding performance, I'm not concerned about mean time but much more about variance. If a request takes 100 ms on average, that's fine by me, but if some requests take 60 s, then something's very wrong. I also worry about potential lack of scalability, either to thousands of concurrent clients or to twice or three times as many entries in the database. I'd prefer a solution that has reasonable performance for most use cases, rather than one that is highly optimized for one use case and that might perform badly for use cases we didn't think of. Of course it's no hard requirement that the database performs better than the current in-memory search.
How do we proceed? leeroy, did you want to write some database code and run some performance measurements? If so, here's some sample data (the same that I mentioned above).
There are a couple things to look into if you decide to test the possibility of a search engine. Consider a search of nickname, fingerprint, ip, and contact.
Onionoo's file-based i/o represents a virtual document view. It's usually delegated to the client to render it. A search engine makes the document view more concrete on the Onionoo server. This can lead to duplicate copies of data across multiple files, which may be more difficult, and resource-consuming, to update or modify. Compression may also be harder to implement efficiently on a document basis instead of on a field basis.
The documents must be created/updated in a separate atomic step from searching/indexing/reading.
It's a challenge to define a normalization of the searchable content. You need a normalization for nicknames, fingerprints, ip, and contact. In the case of ip, a tokenization leads to the problem of recognizing which octet is being searched. In the case of contact you need to be careful to not eliminate stop-words, and also take into consideration the impact of multiple languages (stemming).
To perform substring search, or even the equivalent prefix-search within words requires using a dynamically created regex.
Full document indexing may induce undesired tradeoffs similar to the week-long limit currently seen.
If substring searches are just too hard to implement, let me suggest another simplification: how about we sacrifice the contact parameter together with substring searches in the nickname field?
It occurs to me the contact field doesn't even show up in summary documents. So that field wouldn't apply to a request for summary data? Eliminating substring searches on the server would help with those big requests. I don't think it would hurt terribly for small requests though. Maybe if it were limited in its search range it wouldn't be a concern? At least theoretically it can be as efficient as Java--maybe even more if compiled.
Example: a search for "u" would still return all relays called "Unnamed" and "u", but not "default"
That's a great point. Why would a substring search for a single character be useful anyway? If the search pattern is short it should be reasonable to say only prefix search is available. That, or some other limits are imposed.
If a request takes 100 ms on average, that's fine by me, but if some requests take 60 s, then something's very wrong.
Would that be just the server processing the query, or does that include the possibility that a client has a slow connection? Absolutely, if a client has to wait 60 s to get any data, something's definitely not working properly. Instrumentation should be built in from the start--to keep track of events like this.
I also worry about potential lack of scalability, either to thousands of concurrent clients or to twice or three times as many entries in the database.
Are there any situations like this with the currently deployed code? Besides the data in memory. Are there any circumstances that induce worst-case performance? Does the worst-case have any other potential consequences? I'm just concerned that the scalability question lacks any consideration of the rest of the system.
I'd prefer a solution that has reasonable performance for most use cases, rather than one that is highly optimized for one use case and that might perform badly for use cases we didn't think of.
Agreed. Designing around some well-known use cases leads to code which relies too much on the optimizations. This can make additions more difficult, due to code that needs rewriting, and a database which you're reluctant to change. First the database should be clean and efficient, and then the rest of the system should be checked for bottlenecks. Amdahl's law, isn't it?
Of course it's no hard requirement that the database performs better than the current in-memory search.
I think it's perfectly reasonable to expect the database transition should result in a system that performs comparably to the currently deployed code. That might be a good starting point. A data-set representative of the current in-memory search, only using a database, would make for a good comparison.
How do we proceed?
Some thoughts about that:
Maybe by coming up with a better normal form. One which starts with the details view of the data. Summary data might be better represented as a derivation. I noticed a couple of problems with the one I posted above. It includes a lot of information not seen in a summary document. The ASN and Country fields might make more sense as part of the IP table. It also doesn't consider ip changes over time or other geolocation information.
History objects and aggregate data need a database representation too. PostgreSQL's NoSQL features and compression might come in handy here. These fields are created and read by the client for rendering, but aren't really searched as far as I can tell.
Determine which data needs to be tracked over time. For example, flags, and ip associated with the lifetime of a router determined by fingerprint. Are there others? What about consensus weight, and position probabilities?
Functions for importing new data, or even importing the initial data set
Functions to update last-modified response on receiving an update notification from the database
I'm doing it again aren't I? The long response thing :D
leeroy, did you want to write some database code and run some performance measurements? If so, here's some sample data (the same that I mentioned above).
I do have that data. I guess I could get some more from CollecTor or Onionoo. Which Java is deployed? 7? Besides that, do you have something in particular in mind for code? A wishlist?
There are a couple things to look into if you decide to test the possibility of a search engine. Consider a search of nickname, fingerprint, ip, and contact. [...]
It seems like switching to a search engine would be a pretty major change even compared to switching to a database. I'd say if somebody wants to look into this, that's cool, but the database switch seems like the much lower-hanging fruit for now.
If substring searches are just too hard to implement, let me suggest another simplification: how about we sacrifice the contact parameter together with substring searches in the nickname field?
It occurs to me the contact field doesn't even show up in summary documents. So that field wouldn't apply to a request for summary data? Eliminating substring searches on the server would help with those big requests. I don't think it would hurt terribly for small requests though. Maybe if it were limited in its search range it wouldn't be a concern? At least theoretically it can be as efficient as Java--maybe even more if compiled.
You can use the contact parameter for any document type, even one that doesn't contain the contact field. The way this works is that all requests are answered using a node index that can handle all possible parameters and that returns a list of fingerprints. In the next step, all documents matching these fingerprints are written to the response. For example, right now you can use the contact parameter when asking for weights documents.
Example: a search for "u" would still return all relays called "Unnamed" and "u", but not "default"
That's a great point. Why would a substring search for a single character be useful anyway? If the search pattern is short it should be reasonable to say only prefix search is available. That, or some other limits are imposed.
I'm not too worried about this simplification, because nicknames became less important a while ago when the Named flag was discontinued. They are a convenient way to refer to relays, but there is no guarantee that you'll get the relay you're hoping to get.
If a request takes 100 ms on average, that's fine by me, but if some requests take 60 s, then something's very wrong.
Would that be just the server processing the query, or does that include the possibility that a client has a slow connection? Absolutely, if a client has to wait 60 s to get any data, something's definitely not working properly. Instrumentation should be built in from the start--to keep track of events like this.
Well, clients having slow connections should be treated separately. We already have performance statistics in the deployed code:
Request statistics (2015-05-13 18:52:48, 3600 s):
  Total processed requests: 285066
  Most frequently requested resource: details (283763), summary (547), bandwidth (372)
  Most frequently requested parameter combinations: [lookup, fields] (278974), [flag, type, running] (2531), [lookup] (2393)
  Matching relays per request: .500<2, .900<2, .990<2048, .999<16384
  Matching bridges per request: .500<1, .900<1, .990<1, .999<8192
  Written characters per response: .500<256, .900<512, .990<2097152, .999<4194304
  Milliseconds to handle request: .500<8, .900<16, .990<32, .999<64
  Milliseconds to build response: .500<4, .900<4, .990<256, .999<512
If we switch request handling from the current in-memory approach to a database approach, we should watch out for the last but one line there, "Milliseconds to handle request". That's the time from receiving a request to returning a list of fingerprints of documents to be returned. You read that line like this: 50% of requests were handled in less than 8 milliseconds, ..., and 99.9% in less than 64 milliseconds.
Slow clients would show up in the next line, "Milliseconds to build response", which includes reading documents from disk as well as writing them to the response. (We could split up that line, but then we'd have to buffer complete responses in memory, which might be bad.)
Happy to add new statistics there, if they help us investigate potential bottlenecks.
I also worry about potential lack of scalability, either to thousands of concurrent clients or to twice or three times as many entries in the database.
Are there any situations like this with the currently deployed code? Besides the data in memory. Are there any circumstances that induce worst-case performance? Does the worst-case have any other potential consequences? I'm just concerned that the scalability question lacks any consideration of the rest of the system.
There are indeed other parts of the system that would be affected by an increase in concurrent clients or entries in the database. As I described above, once the database would return a list of fingerprints, those documents would have to be looked up and written to the response. Right now, all documents are read from disk, but we might want to put them into the database as well. And yes, we would want to make sure that those parts of the system scale as well.
I'd prefer a solution that has reasonable performance for most use cases, rather than one that is highly optimized for one use case and that might perform badly for use cases we didn't think of.
Agreed. Designing around some well-known use cases leads to code which relies too much on the optimizations. This can make additions more difficult, due to code that needs rewriting, and a database which you're reluctant to change. First the database should be clean and efficient, and then the rest of the system should be checked for bottlenecks. Amdahl's law, isn't it?
Agreed.
Of course it's no hard requirement that the database performs better than the current in-memory search.
I think it's perfectly reasonable to expect the database transition should result in a system that performs comparably to the currently deployed code. That might be a good starting point. A data-set representative of the current in-memory search, only using a database, would make for a good comparison.
That would be a fine comparison to start with, but even if the result is that the database is slower by a factor 1.5 or so, its other advantages might outweigh that.
How do we proceed?
Some thoughts about that:
Maybe by coming up with a better normal form. One which starts with the details view of the data. Summary data might be better represented as a derivation. I noticed a couple of problems with the one I posted above. It includes a lot of information not seen in a summary document. The ASN and Country fields might make more sense as part of the IP table. It also doesn't consider ip changes over time or other geolocation information.
Looking at details documents is a fine start. But let's be careful with adding new parameters that would make it harder to make the database fast enough. One problem at a time.
History objects and aggregate data need a database representation too. PostgreSQL's NoSQL features and compression might come in handy here. These fields are created and read by the client for rendering, but aren't really searched as far as I can tell.
History documents are not searched, and that's perfectly fine.
Determine which data needs to be tracked over time. For example, flags, and ip associated with the lifetime of a router determined by fingerprint. Are there others? What about consensus weight, and position probabilities?
Let's be careful with adding historical data. There are over 7.5 years of data in CollecTor, and adding anything of that to Onionoo might not scale very well. Onionoo started out by giving out details about relays and bridges that are currently running. Adding non-running relays and bridges that have been running in the last week was just a convenience. Removing that 1-week limitation would be even more convenient, but we need to be very careful how to do that. For example, it might be possible to search by any IP address assigned to a relay in the past. But allowing users to look up which relays had the Stable flag on January 1, 2009 might be a little too much.
Functions for importing new data, or even importing the initial data set
Functions to update last-modified response on receiving an update notification from the database
Yes, this code needs to be written. I just thought it might be easier to start with writing some fine new SQL code to do performance experiments and not mess with the Onionoo code base.
I'm doing it again aren't I? The long response thing :D
At least this one is easier to respond to. :)
leeroy, did you want to write some database code and run some performance measurements? If so, here's some sample data (the same that I mentioned above).
I do have that data. I guess I could get some more from CollecTor or Onionoo. Which Java is deployed? 7? Besides that, do you have something in particular in mind for code? A wishlist?
I'd say ignore CollecTor for the moment and only look at Onionoo. The document I pasted above is the current input to the request-handling code. You could also fetch the latest details documents for additional input. But feel free to mention any other data here first, and I might comment on possible scalability issues from including those in the search.
The deployed Java version is indeed 7. I don't have anything particular in mind for the code, but I'm happy to comment on patches. Release early, release often.
Thank you for all the clarifications. To start with:
Create an updated normal form with tables to ensure document coverage.
Create related SQL for document queries.
Deploy a test database, and a test Onionoo instance, using a restricted data-set for comparison.
Benchmark them for similar queries.
Iteratively make adjustments to database to achieve desired quality of service.
With updates, refinements, and patches in this ticket along the way.
Slow clients would show up in the next line, "Milliseconds to build response", which includes reading documents from disk as well as writing them to the response. (We could split up that line, but then we'd have to buffer complete responses in memory, which might be bad.)
Happy to add new statistics there, if they help us investigate potential bottlenecks.
It might be useful to look at how much time is spent just building the JSON part of the response. In particular, the data is serialized into JSON using GsonBuilder in ResponseBuilder. That class appears to contain non-trivial branching and processing logic. PostgreSQL can serialize the response into JSON too. I heard it's shiny.
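For example, something along these lines (hypothetical table and columns, assuming a reasonably recent PostgreSQL with row_to_json/json_agg):

  -- let the database build the JSON array of detail objects directly
  SELECT json_agg(row_to_json(d))
    FROM (SELECT nickname, fingerprint, running
            FROM details
           WHERE running) AS d;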
But now that I look at the rest of the code for serializing/caching JSON it looks like it won't be much use in the deployed code. At least not without something to compare it to. Since I'll be running a modified Onionoo for comparison I'll make that part of (4).
If it is wanted, I can take some time and play around with the data and get it indexed to see how quickly and effectively the data can be queried, as a first step. Then put it up on a server so you guys can test it for yourselves. Then if it looks good, implement it within the Onionoo code itself.
I mentioned both Solr and Lucene earlier, but as my work experience is with Solr, I will set up a demo using that.
Is it something I should dig into, or are you guys more or less set on a database engine at this time? I suspect there might be some other/better things for me to do if that is the case. :)
If I should dig into it, I would like to get hold of as much historic data as possible, to see that it can be handled as effectively as we want. Is there a way for me to get hold of more than the 4MB sample data mentioned earlier?
Note that the use of GsonBuilder in ResponseBuilder is a dirty hack that I would like to get rid of if possible. It's only used if the user asks for certain fields in details documents, in which case we'll have to parse the full details document and return one containing only those fields. If there's a better way to implement that, let's look into that. Though I'm not sure if storing all details fields in the database and compiling all details documents on the fly is the best solution. However, should we save this problem for later?
Trying out a search engine would indeed be interesting, I just don't know enough about it to be very helpful there. But if you want to dig into that, I'd be curious to learn what you find out.
I just uploaded the full search history (28M). But note that it's in a different data format than the other file. Maybe this parsing code is useful to understand the format.
[..] However, should we save this problem for later?
Allow me to clarify. The test Onionoo described is only modified to fit both the PostgreSQL and Onionoo test servers onto the same AWS server, and to enforce a proxy on Onionoo. It will use only the first update (roughly half the data), and run without the cron updater. Not with any particular improvements. I just wanted to point out that PostgreSQL has JSON serialization built in, so once the comparison begins, and granted favorable prior results, it would be worth checking this out.
If anyone else wants to post table designs for testing I could simplify the update code of Onionoo for use with a database. When that time arises, that is.
Trying out a search engine would indeed be interesting, I just don't know enough about it to be very helpful there. But if you want to dig into that, I'd be curious to learn what you find out.
I just uploaded the full search history (28M). But note that it's in a different data format than the other file. Maybe this parsing code is useful to understand the format.
Thank you Karsten, I will look into it and see what I find out.
I've been familiarizing myself with the code in an effort to understand the properties of the various datatypes. I would like to ensure I'm correct in my understanding: Bridges can only be located by sha1 of fingerprint? Relays can be located by fingerprint, sha1 of fingerprint, or base64 encoding of fingerprint? A function index using digest() and encode() will ensure all three are fast with the advantage of only storing one.
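Something like the following sketch is what I have in mind (it assumes the raw 20-byte fingerprint is stored as bytea in a relay table; pgcrypto provides digest()):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  -- store only the raw fingerprint; expression indexes cover the derived forms
  CREATE INDEX ON relay ((encode(fingerprint, 'hex')) text_pattern_ops);
  CREATE INDEX ON relay ((encode(digest(fingerprint, 'sha1'), 'hex')) text_pattern_ops);
  CREATE INDEX ON relay ((encode(fingerprint, 'base64')) text_pattern_ops);

Whether this works out with partial search terms still needs checking.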
I would also like to ask: how strict is the protocol wrt the ordering of keys in JSON responses? For example using Atlas as a client. I wouldn't want to break anything when using pgnosql. The RFC defines the set as unordered for interoperability.
I've been familiarizing myself with the code in an effort to understand the properties of the various datatypes. I would like to ensure I'm correct in my understanding: Bridges can only be located by sha1 of fingerprint?
That, plus the sha1 of the sha1 of the fingerprint. Let me explain: when CollecTor sanitizes bridge descriptors it puts in the sha1 of the bridge fingerprint. So, Onionoo doesn't even learn the original fingerprint. But Onionoo also returns that bridge if you ask it for the sha1 of the sha1 of the fingerprint, because Onionoo clients are encouraged to sha1 any fingerprint they receive from clients in order not to accidentally include a non-sha1 fingerprint.
Feel free to ignore the background and go by these rules:
relays can be looked up by fingerprint and sha1 of fingerprint,
bridges can be looked up by sha1 of fingerprint and sha1 of sha1 of fingerprint.
Relays can be located by fingerprint, sha1 of fingerprint, or base64 encoding of fingerprint?
Yep.
A function index using digest() and encode() will ensure all three are fast with the advantage of only storing one.
Keep in mind that you'll have to handle partial fingerprints as input, so your plan might not work. Similarly, converting partial fingerprints between hex and base64 might be problematic because of 4 vs. 6 bits per character. I'd say never mind the storage and just put in everything you want into the database.
I would also like to ask: how strict is the protocol wrt the ordering of keys in JSON responses? For example using Atlas as a client. I wouldn't want to break anything when using pgnosql. The RFC defines the set as unordered for interoperability.
Ordering matters if users ask for a specific ordering using the order parameter. Of course, if they don't pass that, you're free to return results in whatever order you want.
Interestingly a database (or full-text search) may not be the only solution here. Another solution is to use in-memory compression (like zram) with off-heap, disk-based, data structures for the updater. The cpu is underused during updates. If the server has SSD storage even better.
(June 17) Okay so I've had more time to look into this. First, it's a given that a schema exists (semi-structured works nice). Now, don't hate me for this, but it doesn't look like the database is the solution it appears. Here's why I'm skeptical.
The two JVMs negatively affect the rest of the server. The documentation encourages the use of 4GB of heap for each of the updater and the web server. I understand why it's done. The current implementation relies on heap-allocated objects. The problem is that, in general, this means the two processes combined will use 8GB. This is very wasteful compared to the actual requirements. The JVM will happily hog this memory and only invoke the GC if forced. Which means the GC may not be invoked until it's too late for server performance. Imagine having Postgres and Apache running (they both depend on the OS for memory management).
Torproject is agile (isn't it?). It isn't a simple hack to add a database. To effectively use the database would require rewriting a lot of code. If you don't rewrite a lot of code the database won't perform well, and quality suffers.
Onionoo already has a database. The current implementation can be improved. It already provides a schemaless, semi-structured db. It already provides the best you can expect from a separate db process. Simply moving to disk based data structures provides relief from heap dependency. It also enables a merging of data processing steps. Fewer steps is less time spent importing and updating. Onionoo can do what JSONB does for postgres.
I have started to write a Solr schema and a script to import all the data which Karsten has linked to. But some unexpected events came up in real life, so unfortunately my work halted for a while. I hope that I get back on track soon and get this up and working so I can see how it performs.
Any input on how I should proceed if/when I have a setup and data available for testing? Is simply pushing both the Solr setup and data to a GitHub repo acceptable, or is something else preferred?
Ueland: I don't like speaking for others but, yes, a git repo somewhere is preferred, from my interactions with Karsten.
Karsten: I've forked Onionoo too. After the complications from working with metrics-lib it looks like the best idea. This way I don't need to keep commenting here, and I can still fix bugs, and implement improvements (feature/performance), without having to use a task-branch for some bug/feature. The two forks (of metrics-lib, and onionoo) are meant to demonstrate the results I mention here. Although I'm sure a DB can still be kludged in (and full text search with Solr is still an interesting result, Ueland). The search will be reimplemented using a formal language. This is another application of grammar recognition (besides parsing and semantic analysis improvements for metrics-lib).
Unfortunately this also means documentation improvements will (probably) focus on the forks. Only because that's where the documentation is expected to benefit the community the most. In their current states, both onionoo and metrics-lib are too unpredictable to expect such improvement to be a benefit...
I now have a Solr instance up and running locally with the data from the status-summary file, around 600K rows. The first issue I noticed is figuring out what to do with each node's port list, but I will not dig more into that unless this is something worth working further on, which I'll let you guys decide. (This is only an issue if you want to list all relays that allow port 49, for example.)
The Solr index is around 200MB in size with this data, let me know if any of you want a zipped copy to play around in.
Here are a few example queries and their response time on first(uncached) query:
"nickname:h3x" - 215ms (looking for my relays with prefix and suffix wildcard)
"nickname:h3x*" - 6 ms (suffix wildcard)
"nickname:*h3x" - 209 ms (prefix wildcard)
"countryCode:no" - 22 ms (countrycode query)
"id:0" - 600ms (prefix and suffix wildcard, one char)
"contact:gmail.com" - 28ms (all contacts mentioning gmail)
"consensusWeight:[100 TO 9000]" - 13ms (all relays within given consensu weight)