Hi, I'm presently trying to move DocTor from yatei to its new VM (cappadocicum). On yatei DocTor was owned by the metrics user and located at /srv/doctor.torproject.org. I've copied it over to cappadocicum but lack permission to move it into /srv (which is owned by root). Ideally I suspect we'd want a doctor user with /srv/doctor.
Once this is done it should be fine to revoke my permission to run as the metrics user. Also, yatei's /srv/doctor.torproject.org will be safe to delete (I've commented out DocTor's crontab there).
Thanks! -Damian
Karsten re-enabled the crontab on yatei, so scratch the part about its /srv/doctor.torproject.org being safe to delete. When this ticket is addressed I'll re-transfer the contents to cappadocicum and disable the copy on yatei again.
There is now an /srv/doctor.torproject.org on cappadocicum, owned by a doctor uid and gid.
Great! Disabled DocTor on yatei and spun it up on cappadocicum. Next steps are to...
1. Remove my access to yatei and the metrics user.
2. Remove DocTor from yatei (rm -rf /srv/doctor.torproject.org). I've removed it from yatei's crontab so that copy is now unused. The only reason I didn't delete it myself is that I wasn't sure if there was an automated backup process or something else acting on /srv that I should be wary of.
After that this can be resolved.
Sorry this fell through the cracks. Ping me if you need anything else.
No worries. I thought about pinging you but decided this wasn't very important (DocTor was still chugging along happily on yatei).
Hmmm, on reflection it looks like it might not be all roses. I saw one successful run, but since then I've started getting cron emails saying simply that consensus_health_checker.py was killed. I'll need to look into this tomorrow.
DocTor runs three checkers: descriptor validation, sybil checks, and consensus health checks. The first two are running fine since they only operate on a single consensus at a time. The last, however, is triggering the OOM killer.
DocTor's consensus health checker downloads both the vote and consensus from each authority, then runs a series of checks. This means 18 network status documents in memory at a time (2 documents x 9 authorities). This was fine for yatei, but cappadocicum only has 496 MB.
I've sunk an hour into seeing if there's a quick change I can make to sidestep this, but I'm not spotting anything. Can we boost cappadocicum's memory? If not then this might need to swap back to yatei.
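To make the memory footprint concrete, here's a rough sketch of that fetch pattern using stem's remote module. This is just an illustration rather than DocTor's actual code, and the DescriptorDownloader / Authority usage below is my assumption about how you'd reproduce the pattern:

# Rough sketch (not DocTor's code) of fetching each authority's copy of the
# current consensus plus the vote it cast, holding everything in memory.

from stem.descriptor import DocumentHandler
from stem.descriptor.remote import DescriptorDownloader
from stem.directory import Authority

downloader = DescriptorDownloader()
documents = []

for authority in Authority.from_cache().values():
  if authority.v3ident is None:
    continue  # skip authorities that don't vote

  # this authority's copy of the current consensus...
  documents += downloader.get_consensus(
    document_handler = DocumentHandler.DOCUMENT,
    endpoints = [(authority.address, authority.dir_port)],
  ).run()

  # ... and the vote it cast
  documents += downloader.get_vote(
    authority,
    document_handler = DocumentHandler.DOCUMENT,
  ).run()

print('%i network status documents held in memory' % len(documents))

Each of those documents is on the order of a couple megabytes of raw text, and the parsed NetworkStatusDocumentV3 objects are considerably larger, which is presumably what pushes a 496 MB VM into the OOM killer's sights.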
Re-enabled the consensus health checks on yatei, though the other two checkers are still running on cappadocicum. Peter suggested checking with Andrew about bumping the memory, so making it so.
atagar@cappadocicum:~$ free
             total       used       free     shared    buffers     cached
Mem:       1026504     492732     533772          0     100312     229212
-/+ buffers/cache:     163208     863296
Swap:       724988          0     724988
I gave our consensus tracker a whirl and it worked great (actually, it barely dipped into swap, which I found surprising). We should be good to go: I disabled DocTor's cron on yatei and reactivated it on cappadocicum.
Please wait a couple days so I can be sure DocTor is chugging along happily on cappadocicum. If I don't report any further issues then we can move forward with removing my access to metrics/yatei and cleaning DocTor off it.