Opened 4 years ago

Closed 4 years ago

#18025 closed defect (fixed)

test_util_touch_file is racy and *does* fail

Reported by: weasel Owned by: nickm
Priority: Medium Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.7.6
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Tor's master has a broken test:

https://jenkins.torproject.org/job/tor-debian-master-nightly-binaries/630/ARCHITECTURE=i386,SUITE=wily/console

00:30:38 util/get_avail_disk_space: OK
00:30:38 util/touch_file: 
00:30:38   FAIL ../src/test/test_util.c:4615: assert(st.st_mtime OP_GE now): 1452299437 vs 1452299438
00:30:38   [touch_file FAILED]
00:30:38 util/pwdb: [forking] OK

from test_util.c:

  const time_t now = time(NULL);
  struct stat st;
  write_bytes_to_file(fname, "abc", 3, 1);
  tt_int_op(0, OP_EQ, stat(fname, &st));
  tt_i64_op(st.st_mtime, OP_GE, now);

The test is flawed to begin with, there's no guarantee for how much or little time is between the time() call and the write().

Child Tickets

Change History (13)

comment:1 Changed 4 years ago by nickm

Milestone: Tor: 0.2.8.x-final
Owner: set to nickm
Status: newassigned

comment:2 Changed 4 years ago by nickm

hang on, OP_GE means >=. It's supposed to be testing that the mtime on the file is no earlier than the time() before we created it. I am probably sleep deprived and missing something key

comment:3 Changed 4 years ago by weasel

Hmm. Very interesting.

comment:4 Changed 4 years ago by weasel

Seems like we're making bad assumptions of timestamp granularity of the underlying filesystem.

comment:5 Changed 4 years ago by weasel

weasel@valiant:/tmp$ rm -f x; date +%s.%N; echo fo > x  ; (stat -c %y x)      
1452352797.520460922
2016-01-09 16:19:57.518168621 +0100

comment:6 Changed 4 years ago by weasel

Note that this isn't limited to tmpfs. I observe the same on my ext4 home:

weasel@valiant:~$ rm -f x; date +'%Y-%m-%d %H:%M:%S.%N'; echo fo > x  ; (stat -c %y x)
2016-01-09 16:23:34.702175683
2016-01-09 16:23:34.696191451 +0100

comment:7 Changed 4 years ago by weasel

weasel@valiant:~$ while :; do rm -f x; date +'%Y-%m-%d %H:%M:%S.%N'; echo fo > x  ; (stat -c %y x); echo ;done | perl -e '$/=""; while (<>) { print if /\.00/ && /\.99/; }'
2016-01-09 16:28:51.002161899
2016-01-09 16:28:50.997310311 +0100

2016-01-09 16:28:53.002053462
2016-01-09 16:28:52.997292088 +0100

comment:8 Changed 4 years ago by weasel

ex #debian-devel:

<ron> weasel: apparently ext4 uses cached time from inside the kernel that only gets updated by the timer interrupt.
<ron> so it's probably getting truncated to the previous jiffy.
<ron> which sort of makes sense, you don't really want an expensive time lookup everytime fs metadata changes.
<ron> but is kinda evil for the use you have here.

comment:9 Changed 4 years ago by nickm

hrgh. okay; once I'm done with this plane flight and rested a bit, I'll figure out what I want to do with this check. "Wait half a second" sounds undesirable; so does "allow up to half a second of timewarp."

comment:10 Changed 4 years ago by weasel

<ron> weasel: oh ...  it could also be a small difference between the wall clock and the monotonic clock.
<ron> it looks like current_fs_time() reads the monotonic clock.
<ron> Mithrandir: so yeah, the kernel code seems to show it is updating the vsyscall cache too.  just not from the same clock :)
<Mithrandir> fun
<Mithrandir> so there's a kernel<->userspace clock skew. :-P
<ron> yeah, looks like someone needs to add an option to date(1) to read the monotonic clock ...
<ron> you should see the same thing if you pick different clock sources in a call to clock_gettime(3)

comment:11 Changed 4 years ago by weasel

<ron> weasel: http://paste.debian.net/363027/  third guess is the charm!
<ron> the difference between CLOCK_REALTIME and CLOCK_REALTIME_COARSE by the look of it.  MONOTONIC is way different.
<ron> but that's a trivial way to get a now() you can probably use

copy from that paste:


#include <time.h>
#include <stdio.h>

int main()
{
	struct timespec t1, t2, t3;

	clock_gettime(CLOCK_REALTIME, &t1);
	clock_gettime(CLOCK_REALTIME_COARSE, &t2);
	clock_gettime(CLOCK_MONOTONIC, &t3);

	printf("t1: %zu.%zu\n", t1.tv_sec, t1.tv_nsec);
	printf("t2: %zu.%zu\n", t2.tv_sec, t2.tv_nsec);
	printf("t3: %zu.%zu\n", t3.tv_sec, t3.tv_nsec);

	return 0;
}

$ ./bad_time 
t1: 1452359862.758024307
t2: 1452359862.755158473
t3: 2921765.65971100

comment:12 Changed 4 years ago by weasel

<bwh> see https://sources.debian.net/src/linux/4.3.3-5/kernel/time/time.c/#L240
<bwh> and that seems to end up reading cached time
<bwh> whereas gettimeofday() goes here: https://sources.debian.net/src/linux/4.3.3-5/arch/x86/entry/vdso/vclock_gettime.c/#L218
<bwh> and indeed can use the TSC or HPET to get high-res time

comment:13 Changed 4 years ago by nickm

Resolution: fixed
Status: assignedclosed

I've tried the stupidest fix possible as 16840e52e594e221069efc9bf7b65ebfb5e2d639

Note: See TracTickets for help on using tickets.