Opened 8 months ago

Last modified 8 months ago

#32692 new defect

meronense database server crashes on I/O errors

Reported by: anarcat
Owned by: tpa
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

meronense crashed today. Timestamps below are in local time (UTC-5):

08:35:18 <nsa> tor-nagios: [meronense] system - filesystem check is WARNING: /dev/mapper/vg_meronense-srv (/home) has 2 errors
09:03:53 <nsa> tor-nagios: [meronense] process - postgresql96 - master is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (postgres), command name postgres, args 
               /usr/lib/postgresql/9.6/bin/postgres
09:08:33 <nsa> tor-nagios: [meronense] system - all services running is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
09:22:43 <nsa> tor-nagios: [meronense] system - all services running is WARNING: degraded
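
The Nagios filesystem check presumably reflects the ext4 error counters on that LV; if so, they can be read straight from the superblock (a sketch, assuming the filesystem is ext4):

# show the recorded filesystem state and error count (ext4 assumed)
tune2fs -l /dev/mapper/vg_meronense-srv | grep -iE 'state|error'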

I restarted PostgreSQL when I noticed:

10:11:50 <anarcat> restarted postgres on meronense, unknown reason crash

But it crashed again.
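
The restart itself was nothing special; on this Debian setup it is the standard service restart (a sketch; the cluster name "main" is taken from the log path below):

# restart the 9.6/main cluster
systemctl restart postgresql@9.6-main
# equivalently, using postgresql-common directly:
pg_ctlcluster 9.6 main restart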

The errors actually started earlier (times in UTC), in /var/log/postgresql/postgresql-9.6-main.log:

2019-12-07 12:09:30 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
[repeats until]
2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
2019-12-07 12:44:28 GMT LOG:  using stale statistics instead of current ones because stats collector is not responding
2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file "pg_stat_tmp/global.stat"
2019-12-07 12:47:26 GMT PANIC:  could not fsync file "base/16401/17372": Input/output error
2019-12-07 12:47:26 GMT LOG:  checkpointer process (PID 1284) was terminated by signal 6: Aborted
2019-12-07 12:47:26 GMT LOG:  terminating any other active server processes
2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash of another server process
2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-12-07 12:47:26 GMT LOG:  archiver process (PID 1288) exited with exit code 1
2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash of another server process
2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-12-07 12:47:26 GMT CONTEXT:  SQL statement "UPDATE bwhist
            SET read = array_merge(read, insert_read),
                written = array_merge(written, insert_written),
                dirread = array_merge(dirread, insert_dirread),
                dirwritten = array_merge(dirwritten, insert_dirwritten)
            WHERE fingerprint = insert_fingerprint AND date = insert_date"
        PL/pgSQL function insert_bwhist(character,date,bigint[],bigint[],bigint[],bigint[]) line 12 at SQL statement
2019-12-07 12:47:36 GMT LOG:  statistics collector process (PID 1289) was terminated by signal 9: Killed
2019-12-07 12:47:36 GMT LOG:  all server processes terminated; reinitializing
2019-12-07 12:47:58 GMT LOG:  database system was interrupted; last known up at 2019-12-07 11:02:32 GMT
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17372_fsm": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17372_vm": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17488": Input/output error
2019-12-07 12:48:17 GMT LOG:  could not fsync file "./pg_stat_tmp/global.stat": Input/output error
2019-12-07 12:48:17 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 12:48:17 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 12:48:17 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 13:49:37 GMT PANIC:  could not fsync file "base/16401/17488": Input/output error
2019-12-07 13:49:37 GMT LOG:  startup process (PID 20894) was terminated by signal 6: Aborted
2019-12-07 13:49:37 GMT LOG:  aborting startup due to startup process failure
2019-12-07 13:49:37 GMT LOG:  database system is shut down
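
A PANIC on fsync means the kernel returned an I/O error to PostgreSQL, so the underlying failure should also show up in the guest's kernel log; worth checking something like this (a sketch):

# look for block-layer / ext4 errors around the time of the PANIC
dmesg -T | grep -iE 'i/o error|ext4|aborting journal'
journalctl -k --since '2019-12-07 12:00' | grep -iE 'i/o error|ext4'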

Here's my attempt at restarting it:

2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
2019-12-07 15:11:34 GMT LOG:  could not stat file "./pg_stat_tmp/global.stat": Structure needs cleaning
2019-12-07 15:11:34 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
2019-12-07 15:11:36 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
2019-12-07 15:11:38 GMT FATAL:  the database system is starting up
2019-12-07 15:11:38 GMT LOG:  incomplete startup packet
2019-12-07 15:11:46 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 15:48:19 GMT PANIC:  could not fsync file "base/16401/17488": Input/output error
2019-12-07 15:48:19 GMT LOG:  startup process (PID 27916) was terminated by signal 6: Aborted
2019-12-07 15:48:19 GMT LOG:  aborting startup due to startup process failure
2019-12-07 15:48:19 GMT LOG:  database system is shut down
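
The "Structure needs cleaning" error above is ext4 reporting on-disk corruption (EUCLEAN), so the filesystem itself likely needs an offline fsck, roughly along these lines (a sketch, with PostgreSQL stopped and the filesystem unmounted; mount point per the Nagios check above):

systemctl stop postgresql
umount /home                               # the vg_meronense-srv LV, per the Nagios check
fsck.ext4 -f /dev/mapper/vg_meronense-srv
mount /home
systemctl start postgresql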

Then I shut the host down from kvm4 and restarted it. After that, PostgreSQL was able to recover:

2019-12-07 16:54:17 GMT LOG:  incomplete startup packet
2019-12-07 16:54:17 GMT LOG:  database system shutdown was interrupted; last known up at 2019-12-07 15:26:47 GMT
2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
2019-12-07 16:54:18 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2019-12-07 16:54:18 GMT LOG:  redo starts at 3D7/57E470D8
2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
2019-12-07 16:54:23 GMT FATAL:  the database system is starting up
2019-12-07 16:54:23 GMT LOG:  incomplete startup packet
2019-12-07 16:54:23 GMT LOG:  redo done at 3D7/5BB04860
2019-12-07 16:57:28 GMT LOG:  MultiXact member wraparound protections are now enabled
2019-12-07 16:57:28 GMT LOG:  autovacuum launcher started
2019-12-07 16:57:28 GMT LOG:  database system is ready to accept connections
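
For the record, the shutdown/restart from kvm4 is the usual libvirt sequence (a sketch; the exact domain name on kvm4 is an assumption):

# on kvm4; guest domain name assumed to be "meronense"
virsh shutdown meronense
virsh list --all          # wait until the guest shows as "shut off"
virsh start meronense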

The disk arrays are fine according to mdadm on kvm4, the parent host:

root@kvm4:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda[2] sdb[3]
      3906887488 blocks super 1.2 [2/2] [UU]
      bitmap: 12/30 pages [48KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      479002944 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      20955136 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
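
mdadm only reports array-level state, so the member drives are worth a look as well (a sketch; device names taken from the mdstat output above):

# quick SMART health check of the array members (smartmontools assumed installed)
for d in /dev/sda /dev/sdb /dev/nvme0n1 /dev/nvme1n1; do smartctl -H "$d"; done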

This is the second time I've seen this happen. Not sure what's going on. I need to run now, but would love another pair of eyes on this.

Child Tickets

Change History (1)

comment:1 Changed 8 months ago by anarcat

There's a plan to decommission kvm4, the parent host of this machine, which might solve this problem in the long run; see #32802.
