Opened 8 months ago

Last modified 8 months ago

#32692 new defect

meronense database server crashes on I/O errors

Reported by: anarcat
Owned by: tpa
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

meronense crashed today. Timestamps below are in local time (UTC-5):

08:35:18 <nsa> tor-nagios: [meronense] system - filesystem check is WARNING: /dev/mapper/vg_meronense-srv (/home) has 2 errors
09:03:53 <nsa> tor-nagios: [meronense] process - postgresql96 - master is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (postgres), command name postgres, args 
               /usr/lib/postgresql/9.6/bin/postgres
09:08:33 <nsa> tor-nagios: [meronense] system - all services running is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
09:22:43 <nsa> tor-nagios: [meronense] system - all services running is WARNING: degraded
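
The Nagios filesystem check presumably reflects the ext4 error counters on that LV; if so, they can be read straight from the superblock (a sketch, assuming the filesystem is ext4):

# show the recorded filesystem state and error count (ext4 assumed)
tune2fs -l /dev/mapper/vg_meronense-srv | grep -iE 'state|error'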

I restarted PostgreSQL when I noticed:

10:11:50 <anarcat> restarted postgres on meronense, unknown reason crash

But it crashed again.
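
The restart itself was nothing special; on this Debian setup it is the standard service restart (a sketch; the cluster name "main" is taken from the log path below):

# restart the 9.6/main cluster
systemctl restart postgresql@9.6-main
# equivalently, using postgresql-common directly:
pg_ctlcluster 9.6 main restart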

The errors actually started earlier (times in UTC), in /var/log/postgresql/postgresql-9.6-main.log:

2019-12-07 12:09:30 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
[repeats until]
2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
2019-12-07 12:44:28 GMT LOG:  using stale statistics instead of current ones because stats collector is not responding
2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
2019-12-07 12:47:26 GMT PANIC:  could not fsync file "base/16401/17372": Input/output error
2019-12-07 12:47:26 GMT LOG:  checkpointer process (PID 1284) was terminated by signal 6: Aborted
2019-12-07 12:47:26 GMT LOG:  terminating any other active server processes
2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash of another server process
2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-12-07 12:47:26 GMT LOG:  archiver process (PID 1288) exited with exit code 1
2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash of another server process
2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-12-07 12:47:26 GMT CONTEXT:  SQL statement "UPDATE bwhist
            SET read = array_merge(read, insert_read),
                written = array_merge(written, insert_written),
                dirread = array_merge(dirread, insert_dirread),
                dirwritten = array_merge(dirwritten, insert_dirwritten)
            WHERE fingerprint = insert_fingerprint AND date = insert_date"
        PL/pgSQL function insert_bwhist(character,date,bigint[],bigint[],bigint[],bigint[]) line 12 at SQL statement
2019-12-07 12:47:36 GMT LOG:  statistics collector process (PID 1289) was terminated by signal 9: Killed
2019-12-07 12:47:36 GMT LOG:  all server processes terminated; reinitializing
2019-12-07 12:47:58 GMT LOG:  database system was interrupted; last known up at 2019-12-07 11:02:32 GMT
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17372_fsm": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17372_vm": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17488": Input/output error
2019-12-07 12:48:17 GMT LOG:  could not fsync file "./pg_stat_tmp/global.stat": Input/output error
2019-12-07 12:48:17 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 12:48:17 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 12:48:17 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 13:49:37 GMT PANIC:  could not fsync file "base/16401/17488": Input/output error
2019-12-07 13:49:37 GMT LOG:  startup process (PID 20894) was terminated by signal 6: Aborted
2019-12-07 13:49:37 GMT LOG:  aborting startup due to startup process failure
2019-12-07 13:49:37 GMT LOG:  database system is shut down
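
A PANIC on fsync means the kernel returned an I/O error to PostgreSQL, so the underlying failure should also show up in the guest's kernel log; worth checking something like this (a sketch):

# look for block-layer / ext4 errors around the time of the PANIC
dmesg -T | grep -iE 'i/o error|ext4|aborting journal'
journalctl -k --since '2019-12-07 12:00' | grep -iE 'i/o error|ext4'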

Here's my attempt at restarting it:

2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
2019-12-07 15:11:34 GMT LOG:  could not stat file "./pg_stat_tmp/global.stat": Structure needs cleaning
2019-12-07 15:11:34 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
2019-12-07 15:11:36 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
2019-12-07 15:11:38 GMT FATAL:  the database system is starting up
2019-12-07 15:11:38 GMT LOG:  incomplete startup packet
2019-12-07 15:11:46 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 15:48:19 GMT PANIC:  could not fsync file "base/16401/17488": Input/output error
2019-12-07 15:48:19 GMT LOG:  startup process (PID 27916) was terminated by signal 6: Aborted
2019-12-07 15:48:19 GMT LOG:  aborting startup due to startup process failure
2019-12-07 15:48:19 GMT LOG:  database system is shut down
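
The "Structure needs cleaning" error above is ext4 reporting on-disk corruption (EUCLEAN), so the filesystem itself likely needs an offline fsck, roughly along these lines (a sketch, with PostgreSQL stopped and the filesystem unmounted; mount point per the Nagios check above):

systemctl stop postgresql
umount /home                               # the vg_meronense-srv LV, per the Nagios check
fsck.ext4 -f /dev/mapper/vg_meronense-srv
mount /home
systemctl start postgresql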

Then I shut the host down from kvm4 and restarted it. After that, PostgreSQL was able to recover:

2019-12-07 16:54:17 GMT LOG:  incomplete startup packet
2019-12-07 16:54:17 GMT LOG:  database system shutdown was interrupted; last known up at 2019-12-07 15:26:47 GMT
2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
2019-12-07 16:54:18 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 16:54:18 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
2019-12-07 16:54:23 GMT FATAL:  the database system is starting up
2019-12-07 16:54:23 GMT LOG:  incomplete startup packet
2019-12-07 16:54:23 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 16:57:28 GMT LOG:  MultiXact member wraparound protections are now enabled
2019-12-07 16:57:28 GMT LOG:  autovacuum launcher started
2019-12-07 16:57:28 GMT LOG:  database system is ready to accept connections
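
For the record, the shutdown/restart from kvm4 is the usual libvirt sequence (a sketch; the exact domain name on kvm4 is an assumption):

# on kvm4; guest domain name assumed to be "meronense"
virsh shutdown meronense
virsh list --all          # wait until the guest shows as "shut off"
virsh start meronense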

The disk arrays are fine according to mdadm on kvm4, the parent host:

root@kvm4:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda[2] sdb[3]
      3906887488 blocks super 1.2 [2/2] [UU]
      bitmap: 12/30 pages [48KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      479002944 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      20955136 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
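
mdadm only reports array-level state, so the member drives are worth a look as well (a sketch; device names taken from the mdstat output above):

# quick SMART health check of the array members (smartmontools assumed installed)
for d in /dev/sda /dev/sdb /dev/nvme0n1 /dev/nvme1n1; do smartctl -H "$d"; done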

This is the second time I've seen this happen. Not sure what's going on. I need to run now, but would love another pair of eyes on this.

Child Tickets

Change History (1)

comment:1 Changed 8 months ago by anarcat

There's a plan to decommission kvm4, the parent host of this machine, which might solve this problem in the long run; see #32802.
