Opened 3 years ago

Closed 3 years ago

#14011 closed enhancement (implemented)

Implement lazy parsing for stem

Reported by: phw
Owned by: phw
Priority: High
Milestone:
Component: Core Tor/Stem
Version:
Severity:
Keywords: descriptor, zoossh
Cc: atagar
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Damian and I had a small discussion regarding lazy parsing (see below) and how it could speed up dealing with descriptor data. This might not be an awful lot of work for zoossh, so it might be worth implementing it.

18:28 <atagar>  phw: Side note concerning zoossh, another option could be lazy parsing for descriptors. If I was to do stem's parsers again that's what I'd opt for to make them more performant. That would be a fair bit of work, but would both benefit all stem users and have performance just as fast as any Go solution (time would all be IO).
18:29 <atagar>  That said though, Zoossh seems like a great way of learning the language so if that's the goal have fun. :)
18:36 <phw>     atagar: that's actually a good idea, thanks
18:39 <atagar>  phw: Oh! If you're interested then please open a ticket under the Stem component. This is something I've idly given some thought to for over a year but never bothered to actually jot down the idea. ;P
18:39 <atagar>  Didn't expect you to actually think about opting for this route.
18:41 <atagar>  Thought was that reading a descriptor dumps to a simple object that's a {keyword: [lines...]} dictionary. The getter methods then parse the actual content and cache the results. Upside: far, far faster since you only parse the fields you care about, downside: no upfront validation is done so malformed content would be acceptable.
18:42 <atagar>  That said, validation is a far, far smaller concern for our users than performance in practice so this is a tradeoff I'd be fine with.
18:42 <atagar>  We could then have a validate() method that simply calls all the getters to achieve the same thing we do now.
18:45 <atagar>  Previously I thought that doing this would break backward compatibility which made me a little less keen on it (since we'd then need 'descriptor v2' objects) but on reflection it doesn't. We could slip this in transparently. The only difference users would see would be a tremendous speedup if they opt to not have validation.
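The design atagar describes above can be sketched in a few lines of Python. This is an illustrative toy, not Stem's actual code: the class and method names (LazyDescriptor, nickname, validate) are made up for the example. Reading dumps the raw lines into a {keyword: [values...]} dict; getters parse and cache on first access; validate() just forces every getter to run.

```python
class LazyDescriptor(object):
    def __init__(self, raw_contents):
        self._entries = {}  # keyword => list of value strings
        self._cache = {}    # attribute => parsed value

        # "parsing" on read is just splitting lines into keyword/value pairs
        for line in raw_contents.splitlines():
            if not line:
                continue
            keyword, _, value = line.partition(' ')
            self._entries.setdefault(keyword, []).append(value)

    @property
    def nickname(self):
        # parse on first access, then reuse the cached result
        if 'nickname' not in self._cache:
            values = self._entries.get('router', [])
            if not values:
                raise ValueError("'router' line is missing")
            self._cache['nickname'] = values[0].split(' ')[0]
        return self._cache['nickname']

    def validate(self):
        # upfront validation is simply exercising every lazy getter,
        # matching what eager parsing does today
        self.nickname
```

Malformed content only surfaces when (and if) the relevant getter is called, which is exactly the validation tradeoff discussed above.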

Child Tickets

Change History (6)

comment:1 Changed 3 years ago by atagar

Keywords: descriptor added; parsing removed
Priority: normal → major

This might not be an awful lot of work for zoossh, so it might be worth implementing it.

Up to you of course regarding zoossh. Personally I can't say I'm thrilled about a third descriptor parser. Much of my work in the metrics space (such as replacing the java implementation of DocTor) has been so we can eventually deprecate metrics-lib. Dropping the java descriptor parser would mean one fewer thing for Karsten to maintain. Descriptor parsing is a moving target, and hence requires a small amount of continual upkeep.

For Stem I don't think this would be a terribly hard project. It would just be a lot of refactoring. If I didn't have a day job I'd guess around a month, but as-is it would probably take me around a month per descriptor type.

Best approach for this would be to take a descriptor type (say, server descriptors) and try it out. Rather than the massive parse loop we presently have parsing behaviour would be moved to a series of getter methods with an @lru_cache() annotation.

Not hard, just time consuming. :)
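The getter-based refactoring described above might look roughly like this. It is a hypothetical sketch, not Stem's implementation: ServerDescriptor and the bandwidth field layout here are simplified stand-ins, and functools.lru_cache requires Python 3.2+ (Stem at the time also supported Python 2, where a homegrown cache would be needed).

```python
import functools

class ServerDescriptor(object):
    parse_calls = 0  # counter just to demonstrate that parsing runs once

    def __init__(self, entries):
        # entries is the {keyword: [values...]} dict from the initial read
        self._entries = entries

    @property
    @functools.lru_cache(maxsize=None)
    def bandwidth(self):
        # parsing happens lazily, on first attribute access only
        ServerDescriptor.parse_calls += 1
        avg, burst, observed = self._entries['bandwidth'][0].split(' ')
        return int(avg), int(burst), int(observed)
```

Each field's parsing logic moves out of the big parse loop into a getter like this, so only the fields a caller touches are ever parsed.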

comment:2 Changed 3 years ago by atagar

Summary: Implement lazy parsing for zoossh (and maybe Stem) → Implement lazy parsing for stem

Did a proof of concept of this...

https://gitweb.torproject.org/user/atagar/stem.git/log/?h=lazyload

It improved the speed at which we can read server descriptors by ~20% (the high parse times with validation are mostly due to cryptographic validation)...

Before:

% python scrap.py 
read 6679 descriptors with validation, took 13.67 seconds
read 6679 descriptors without validation, took 5.47 seconds

After:

% python scrap.py 
read 6679 descriptors with validation, took 13.77 seconds
read 6679 descriptors without validation, took 4.41 seconds

Script used to test...

import time

from stem.descriptor import parse_file

start_time = time.time()
fingerprints = []

for desc in parse_file('/home/atagar/.tor/cached-descriptors', validate = True):
  fingerprints.append(desc.fingerprint)

print('read %i descriptors with validation, took %0.2f seconds' % (len(fingerprints), time.time() - start_time))

start_time = time.time()
fingerprints = []

for desc in parse_file('/home/atagar/.tor/cached-descriptors', validate = False):
  fingerprints.append(desc.fingerprint)

print('read %i descriptors without validation, took %0.2f seconds' % (len(fingerprints), time.time() - start_time))

We'll probably see a similar ~20% improvement for other descriptor types. There's no point in having a zoossh ticket in the Stem component, so I'm making this ticket just about this change.

comment:3 Changed 3 years ago by phw

Not directly relevant for this ticket but FYI: I also implemented lazy parsing for zoossh over the last two days and here are the performance numbers (the tests with an "L" in the name did lazy parsing):

BenchmarkConsensusParsing                       121986572 ns/op
BenchmarkLConsensusParsing                       58939870 ns/op
BenchmarkConsensusParsingAndGetting             122703967 ns/op
BenchmarkLConsensusParsingAndGetting            139462761 ns/op
BenchmarkDescriptorParsing                       27764446 ns/op
BenchmarkLDescriptorParsing                      16322695 ns/op
BenchmarkDescriptorParsingAndGetting             27638051 ns/op
BenchmarkLDescriptorParsingAndGetting            37836873 ns/op

As expected, lazy parsing is much faster when the content of router statuses or descriptors is not accessed. However, when the content of all statuses/descriptors is accessed, lazy parsing is slower as it involves more overhead.

comment:4 Changed 3 years ago by atagar

What does "ns/op" stand for? For benchmarking I plan to point Stem at some CollecTor archives (https://collector.torproject.org/) to get descriptors/s.

I've been working on this for Stem over the last couple weeks and I'm very, very close to being done (finished with everything except the router status entries). I'll be sending an email this weekend with the results.

As expected, lazy parsing is much faster when the content of router statuses or descriptors is not accessed. However, when the content of all statuses/descriptors is accessed, lazy parsing is slower as it involves more overhead.

Gotcha. For Stem lazy loading means a little more memory usage (since we're storing an extra 'keyword => values' dict) but I wouldn't expect accessing all attributes to be appreciably slower since it's pretty much an identical codepath as eager loading.

comment:5 in reply to:  4 Changed 3 years ago by phw

Replying to atagar:

What does "ns/op" stand for? For benchmarking I plan to point Stem at some CollecTor archives (https://collector.torproject.org/) to get descriptors/s.

"ns/op" means nanoseconds per loop. One loop iteration consists of parsing a consensus or server descriptor file. The benchmarks are based on some random consensus and server descriptor file from CollecTor.

comment:6 Changed 3 years ago by atagar

Resolution: implemented
Status: new → closed