Opened 3 years ago

Closed 3 years ago

#12457 closed enhancement (fixed)

Memory request for cappadocicum

Reported by: atagar
Owned by: phobos
Priority: Low
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity:
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Hi Andrew. Over the last day or so I've received 21 notices about cappadocicum running out of memory...

Traceback (most recent call last):
  File "/srv/doctor.torproject.org/doctor/consensus_health_checker.py", line 636, in <module>
    util.send("Script Error", body_text = msg, destination = util.ERROR_ADDRESS)
  File "/srv/doctor.torproject.org/doctor/util.py", line 97, in send
    stderr = subprocess.PIPE,
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1153, in _execute_child
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Strange. This indicates an OOM while sending the email at the end, not while fetching the descriptors (which should be the memory-intensive bit). Oh well...
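For what it's worth, the ENOMEM comes from os.fork(): on Python 2.7, subprocess forks the whole checker process before exec'ing the mail command, so the kernel briefly has to account for a copy-on-write duplicate of the checker's entire address space even though the child only runs a tiny mail binary. A minimal sketch of a workaround on our side (the wrapper name is hypothetical, nothing like it exists in util.py today) would be to catch the error and retry once after freeing what we can:

# Hypothetical helper, not part of util.py: retry a Popen call once if fork()
# fails with ENOMEM, after asking the garbage collector to release what it can.
import errno
import gc
import subprocess

def popen_with_retry(args, **kwargs):
    try:
        return subprocess.Popen(args, **kwargs)
    except OSError as exc:
        if exc.errno != errno.ENOMEM:
            raise
        gc.collect()  # drop cached descriptors and other garbage before retrying
        return subprocess.Popen(args, **kwargs)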

We ran into memory issues before and you said that it's trivial to add more, but that you want to keep the virt as lean as possible. Mind adding more? Unfortunately the above stacktrace doesn't give me any hint about how much is necessary, so I'll leave that to your judgment.

Child Tickets

Change History (10)

comment:2 Changed 3 years ago by phobos

Is this still happening?

comment:3 in reply to:  1 Changed 3 years ago by phobos

Replying to weasel:

Hmm. https://tor-guest@munin.torproject.org/torproject.org/cappadocicum.torproject.org/memory.html

I believe weasel is noticing that there is around 800 MB of free memory on average. Perhaps something else is hitting an internal limit which isn't tied to the kernel OOM limit.
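One thing that might be worth checking (speculation on my part): munin showing ~800 MB free doesn't rule out fork() failures, since fork() has to reserve the parent's full address space and a strict overcommit policy can reject that even while memory is free. A quick Linux-only look at the relevant /proc entries, just as a sketch:

# Print the overcommit policy plus the commit accounting from /proc/meminfo.
def first_line(path):
    with open(path) as f:
        return f.readline().strip()

print("overcommit_memory: " + first_line("/proc/sys/vm/overcommit_memory"))
print("overcommit_ratio: " + first_line("/proc/sys/vm/overcommit_ratio"))

with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("CommitLimit", "Committed_AS", "MemFree", "SwapFree")):
            print(line.strip())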

comment:4 Changed 3 years ago by atagar

Resolution: worksforme
Status: new → closed

Is this still happening?

Nope. The last occurrence was on 6/25. I'll tentatively close this for now and we'll see if it reoccurs.

comment:5 Changed 3 years ago by atagar

Component: Tor Sysadmin Team → DocTor
Resolution: worksforme
Status: closed → reopened
Summary: Memory request for cappadocicum → consensus_health_checker repeatedly failing

Reopening as this has happened sixteen more times over the last couple days. One provided the aforementioned stacktrace but most simply say 'killed'. I'm watching the host a bit right now. First thing that I noticed is that there were multiple python processes running in parallel. Cron should be spacing them out to avoid that, but it looks like the consensus checker is taking a lot longer to run of late...

grep "Checks finished" /srv/doctor.torproject.org/doctor/logs/consensus_health_checker | tail -n 15
07/03/2014 14:49:02 [DEBUG] Checks finished, runtime was 241.29 seconds
07/03/2014 15:51:28 [DEBUG] Checks finished, runtime was 385.82 seconds
07/03/2014 16:48:15 [DEBUG] Checks finished, runtime was 193.36 seconds
07/03/2014 17:48:54 [DEBUG] Checks finished, runtime was 232.95 seconds
07/03/2014 18:51:13 [DEBUG] Checks finished, runtime was 371.43 seconds
07/03/2014 19:46:21 [DEBUG] Checks finished, runtime was 79.41 seconds
07/04/2014 03:46:56 [DEBUG] Checks finished, runtime was 113.61 seconds
07/04/2014 04:46:16 [DEBUG] Checks finished, runtime was 75.24 seconds
07/04/2014 05:51:23 [DEBUG] Checks finished, runtime was 381.72 seconds
07/04/2014 06:52:03 [DEBUG] Checks finished, runtime was 421.76 seconds
07/04/2014 08:00:32 [DEBUG] Checks finished, runtime was 930.34 seconds
07/04/2014 15:01:49 [DEBUG] Checks finished, runtime was 1006.77 seconds
07/04/2014 17:00:36 [DEBUG] Checks finished, runtime was 934.79 seconds
07/04/2014 17:58:50 [DEBUG] Checks finished, runtime was 827.66 seconds
07/04/2014 19:02:07 [DEBUG] Checks finished, runtime was 1025.27 seconds

Changing the title and reassigning to my component as I'm not really sure if this is actually memory related or not.
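If the slow runs keep overlapping, one option (just a sketch; the lock path below is made up) would be for the checker to bail out when a previous run still holds an advisory lock, rather than relying on cron spacing alone:

import fcntl
import sys

# Hypothetical lock path; any location writable by the doctor user would do.
LOCK_PATH = "/srv/doctor.torproject.org/doctor/consensus_health_checker.lock"

lock_file = open(LOCK_PATH, "w")

try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    print("previous run still in progress, exiting")
    sys.exit(0)

# ... run the checks; the lock is released automatically when the process exits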

comment:6 Changed 3 years ago by atagar

Component: DocTor → Tor Sysadmin Team
Summary: consensus_health_checker repeatedly failing → Memory request for cappadocicum

Ok, sending this back over to Andrew. The host has 1 GB of memory and the consensus checker is gobbling up almost all of it. The host is eating into swap, which is probably why the checker is taking quite a bit longer now and overlapping with the other checks.

Would it be prohibitively costly to bump the memory to 1.5 GB? If so then maybe I should go back to running this from my desktop here at home.

KiB Mem:   1026500 total,   977456 used,    49044 free,     2956 buffers
KiB Swap:   724988 total,   535648 used,   189340 free,    17528 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
20667 doctor    20   0 1168m 863m 1964 R   1.7 86.1   1:22.02 python
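To put a number on how much memory the checker actually needs (so the sizing isn't guesswork), the run could log its peak resident set size just before exiting; a minimal sketch:

import resource

# ru_maxrss is reported in kilobytes on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak resident set size: %.1f MB" % (peak_kb / 1024.0))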

comment:7 Changed 3 years ago by phobos

Owner: set to phobos
Status: reopened → accepted

comment:8 Changed 3 years ago by phobos

upgrade request submitted. sorry it took so long.

comment:9 Changed 3 years ago by atagar

No problem, and thanks!

comment:10 Changed 3 years ago by phobos

Resolution: fixed
Status: accepted → closed

upgraded to 2 GB.
