#12676 closed defect (fixed)

Bridge descriptors in CollecTor's recent/ directory contain many duplicates

The recent/ directory should only contain new descriptors, and ideally no duplicates. I just found that the latter is not the case:

$ grep -c "@type" recent/bridge-descriptors/server-descriptors/2014-07-22-07-04-02-server-descriptors 
$ grep -c "@type" recent/bridge-descriptors/extra-infos/2014-07-22-07-04-02-extra-infos 

Compare this to relay descriptors:

$ grep -c "@type" recent/relay-descriptors/server-descriptors/2014-07-22-07-05-52-server-descriptors 
$ grep -c "@type" recent/relay-descriptors/extra-infos/2014-07-22-07-05-52-extra-infos 
$ grep -c "@type" recent/relay-descriptors/microdescs/micro/2014-07-22-07-05-52-micro 

The reason is that only novel relay descriptors are downloaded and stored to disk, whereas the bridge descriptor tarballs we parse are full snapshots of Tonga's cached descriptor files. We need to add a check for whether we already have a sanitized bridge descriptor, and store it only if we don't.
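
A rough sketch of such a check, for illustration only. It assumes that each descriptor in a file starts at a line beginning with `@type` (as the grep commands above suggest) and identifies a descriptor by the SHA-1 digest of its text. The function names are hypothetical, and this is not CollecTor's actual code.

```python
import hashlib


def split_descriptors(text):
    """Split a file's contents into descriptors, each starting at an '@type' line."""
    descriptors, current = [], []
    for line in text.splitlines(keepends=True):
        # A new '@type' line closes the descriptor collected so far.
        if line.startswith("@type") and current:
            descriptors.append("".join(current))
            current = []
        current.append(line)
    if current:
        descriptors.append("".join(current))
    return descriptors


def keep_new(descriptors, seen_digests):
    """Return only descriptors not seen before; record their digests in seen_digests."""
    new = []
    for descriptor in descriptors:
        # Assumption: identical text means identical descriptor.
        digest = hashlib.sha1(descriptor.encode("utf-8")).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            new.append(descriptor)
    return new
```

Only the descriptors returned by `keep_new` would then be written to recent/; everything already recorded in `seen_digests` is skipped.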

Priority is minor: the only effect is some additional load on clients, which end up parsing the same descriptors more than once. Other than that it's mostly harmless.

comment:1 Changed 5 years ago by karsten

Fixed here, I think. Deployed on yatei now. Will resolve in a few hours if nothing breaks horribly.

comment:2 Changed 5 years ago by karsten

comment:3 Changed 5 years ago by isis

Perhaps related to #15707?

comment:4 in reply to:  3 Changed 5 years ago by karsten

Replying to isis:

> Perhaps related to #15707?

That might have amplified the problem, but the problem I fixed was duplication between hourly runs. Even if Tonga didn't duplicate any descriptors in the files it provides, we'd still duplicate them between runs. That's the part that I fixed.
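
To illustrate the between-runs part: a duplicate check only helps across hourly runs if the record of already-stored descriptors outlives a single run. A minimal sketch, assuming a hypothetical plain-text state file with one descriptor digest per line. The file layout and function names are made up for illustration and are not taken from the actual fix.

```python
import os


def load_seen(path):
    """Read digests recorded by earlier runs; an absent file means no history yet."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def save_seen(path, seen):
    """Write the digest set via a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for digest in sorted(seen):
            f.write(digest + "\n")
    # Replace the old state file only after the new one is fully written.
    os.replace(tmp, path)


def run_once(path, snapshot_digests):
    """One hourly run: return the digests that are new, then persist the updated set."""
    seen = load_seen(path)
    new = []
    for digest in snapshot_digests:
        if digest not in seen:
            seen.add(digest)
            new.append(digest)
    save_seen(path, seen)
    return new
```

With this, a second run over a full snapshot that overlaps the first returns only the digests that were not present an hour earlier.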

