ping fails as a regular user on new VMs

added component::internal services/tor sysadmin team owner::anarcat priority::medium resolution::fixed severity::normal status::closed type::defect labels

I'm not sure how this happened on loghost, but it did not happen on bacula-director-01, created in #31786 (which has the detailed command line used to create it).

```
root@bacula-director-01:~# /sbin/getcap /usr/bin/ping
/usr/bin/ping = cap_net_raw+ep
root@bacula-director-01:~# stat /usr/bin/ping
  File: /usr/bin/ping
  Size: 65272     	Blocks: 128        IO Block: 4096   regular file
Device: 801h/2049d	Inode: 137498      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-03 20:31:00.957049432 +0000
Modify: 2018-08-03 16:53:09.000000000 +0000
Change: 2019-10-03 20:30:49.057153189 +0000
 Birth: -
root@bacula-director-01:~#
```

On the other hand, gettor-01, created as part of #31785, does have the problem, so I'm not sure what's going on here.

```
root@gettor-01:~# getcap /usr/bin/ping
root@gettor-01:~# stat /usr/bin/ping
  File: /usr/bin/ping
  Size: 65272     	Blocks: 128        IO Block: 4096   regular file
Device: 801h/2049d	Inode: 393547      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-04 11:56:47.972263612 +0000
Modify: 2018-08-03 16:53:09.000000000 +0000
Change: 2019-10-02 17:16:26.600316666 +0000
 Birth: -
root@gettor-01:~#
```
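The manual comparison above (getcap prints `cap_net_raw+ep` on bacula-director-01 and nothing on gettor-01, while both binaries are mode 0755) can be wrapped in a small check. This is a minimal sketch, not part of the original procedure: `ping_cap_state` is a hypothetical name, and it only searches for the string `cap_net_raw` because the exact getcap output format can differ between libcap versions. It reports a setuid binary separately, since a setuid-root ping would also work for regular users without capabilities.

```shell
#!/bin/sh
# Hypothetical helper: classify how (or whether) a ping binary can open
# raw sockets for regular users. Prints one of:
#   caps    file capabilities include cap_net_raw (bacula-director-01 above)
#   setuid  no capability, but the setuid bit is set (only useful if owned by root)
#   broken  neither, so ping fails for non-root users (gettor-01 above)
ping_cap_state() {
    bin=${1:-/bin/ping}
    # getcap prints nothing for a file without capabilities. Its output
    # format can differ between libcap versions, so only look for the
    # capability name itself.
    if getcap "$bin" 2>/dev/null | grep -q 'cap_net_raw'; then
        echo caps
    elif [ -u "$bin" ]; then
        echo setuid
    else
        echo broken
    fi
}
```

Given the output above, `ping_cap_state /usr/bin/ping` should print `broken` on gettor-01 (no capability, no setuid bit) and `caps` on bacula-director-01.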

It seems to me we don't have a clear way of reproducing this. From here on, maybe we need VM creators to explicitly document exactly which steps were taken to create the box? Maybe there's a slight change in the way partitions are created or something...

Trac:
Status: new to needs_information

I reinstalled iputils-ping on gettor-01 and confirmed that it fixes the problem. To be clear, I have not done that on bacula-director-01.
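The reinstall can be made conditional, so it is harmless on hosts that are already fine. A minimal sketch, assuming a Debian host with apt-get and getcap (libcap2-bin) installed, run as root; `fix_ping_caps` and the `APT_GET` override are hypothetical names, and the override only exists so the package manager can be stubbed out when testing.

```shell
#!/bin/sh
# Hypothetical helper: reinstall iputils-ping only if ping lacks
# cap_net_raw, then check again. Assumes a Debian host with apt-get and
# getcap (libcap2-bin) installed, run as root.
APT_GET=${APT_GET:-apt-get}   # overridable so tests can stub it out

fix_ping_caps() {
    bin=${1:-/bin/ping}
    if getcap "$bin" 2>/dev/null | grep -q cap_net_raw; then
        echo "already ok: $bin"
        return 0
    fi
    # Reinstalling the package restored the capability on gettor-01.
    # Send apt-get's output to stderr so stdout only carries the verdict.
    "$APT_GET" install -y --reinstall iputils-ping >&2 || return 1
    if getcap "$bin" 2>/dev/null | grep -q cap_net_raw; then
        echo "fixed: $bin"
    else
        echo "still missing: $bin" >&2
        return 1
    fi
}
```

The function returns non-zero when the capability is still missing after the reinstall, so it can be chained in a larger script.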

I confirm the problem occurs when the filesystem is cached. I filed the bug in Debian at https://bugs.debian.org/942114 and wrote a patch that fixes the problem, which is now deployed on both nodes.

Trac:
Owner: tpa to anarcat
Status: needs_information to assigned

I tested this and it seems to work. The next steps are to have another person confirm the fix, get it merged in the Debian package, have it uploaded to unstable and then stable, and then deploy that Debian package on the nodes.

In the meantime, we should at least add a note in the new-machine docs before this ticket is marked as resolved.

Trac:
Status: assigned to needs_review

I audited all the machines accessible through Cumin and found the following machines without the proper capabilities set:

crm-ext-01.torproject.org
crm-int-01.torproject.org
gitlab-01.torproject.org
hetzner-hel1-01.torproject.org
hetzner-nbg1-01.torproject.org
oo-hetzner-03.torproject.org

It's strange, because they also didn't have the getcap binary: the libcap2-bin package was missing as well. So I ran this everywhere:

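```
# Install getcap, reinstall iputils-ping only if ping has no capabilities, then show the result.
apt install libcap2-bin; getcap /bin/ping | grep -q . || apt install --reinstall iputils-ping; getcap /bin/ping
```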
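The "ran this everywhere" step relied on Cumin. Where Cumin is not at hand, the detection half of the audit can be approximated with a plain SSH loop. This is a hypothetical sketch: `list_hosts_missing_caps` is an illustrative name, and it assumes root SSH access to every host given on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read hostnames on stdin, one per line, and print
# every host where /bin/ping does not carry cap_net_raw. Assumes root SSH
# access to each host. Hosts without getcap (no libcap2-bin) and
# unreachable hosts are listed too, since the check cannot pass there.
list_hosts_missing_caps() {
    while read -r host; do
        [ -n "$host" ] || continue
        # -n stops ssh from reading stdin, which holds the rest of the host list
        if ! ssh -n "root@$host" getcap /bin/ping 2>/dev/null | grep -q cap_net_raw; then
            echo "$host"
        fi
    done
}
```

For example, `list_hosts_missing_caps < hosts.txt`, where `hosts.txt` is a hypothetical file with one hostname per line; the hosts it prints are the candidates for the reinstall above.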

Particularly interesting are hetzner-hel1-01.torproject.org and hetzner-nbg1-01.torproject.org, because they were set up using the Hetzner Cloud install procedure, so our install procedure is broken there as well.

So, long story short, the remaining to-do items are:

document the workaround in the ganeti installer (done, in [ganeti](https://help.torproject.org/tsa/howto/ganeti/), will need to be reverted when the package is fixed)
ship the new package in Debian, get it to stable (in progress: announced intention to NMU)
fix the hetzner-cloud installer, or at least add a note to check it
check the other installers

Filed bug 944538 in the Debian BTS to get the fix into stable.

Trac:
Summary: ping on new VMs to ping fails as a regular user on new VMs

Pinged the BTS again. I will upload the hotfix to our Debian repository on Monday, because new Ganeti machines in the cluster are still missing the fix.

Uploaded the hotfix to our own Debian repository so that the fix gets deployed on new boxes automatically. Hot-fixed two new machines (tbb-nightlies-master and onionoo-backend-01) because they were deployed from fsn-node-03, which didn't have the fix at the time.

progress report:

~~document the workaround in the ganeti installer~~ (done, in ganeti, will need to be reverted when the package is fixed)

~~ship the new package in Debian~~, get it to stable (in progress: NMU'd to bullseye, blocked on the release managers, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538)

fix the hetzner-cloud installer ~~or at least add a note to check it~~: I could not figure out what is wrong with that installer, so I just added a note in the install procedure

~~check the other installers~~: the "robot", "linaro" and "ganeti" procedures have been confirmed as checked and fixed. The "cloud" procedure is still unclear, but it has been documented.

I think this ticket can be closed. Mitigations are in place: the workarounds will either expire with the bullseye release, or the problem will show up again on the next install, where it is flagged in the install procedures.

Trac:
Status: needs_review to closed
Resolution: N/A to fixed

closed

mentioned in issue #31786
