Opened 9 months ago

Closed 8 months ago

#27373 closed enhancement (implemented)

add UTF-8 validation code

Reported by: cyberpunks Owned by:
Priority: Medium Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: rust-wants, prop285
Cc: Actual Points:
Parent ID: #24033 Points:
Reviewer: Sponsor:

Description


Child Tickets

Change History (15)

comment:1 Changed 9 months ago by cyberpunks

Implemented in the unicode1 branch at ​https://gitgud.io/onionk/tor.git without adding an external library dependency.

comment:2 Changed 9 months ago by teor

Milestone: Tor: unspecifiedTor: 0.3.6.x-final
Status: newneeds_review

Thanks, I pushed your branch to my github, so the CI will run on it.

You can watch for the green tick at:
https://github.com/teor2345/tor/branches

Reviews might be delayed for the next few weeks, because the 0.3.5 release deadline is mid-September.

comment:3 Changed 9 months ago by teor

Status: needs_reviewneeds_revision

The tests are missing the following valid edge cases:

  • a zero-length string
  • the scalar value U+00
  • Unicode byte order mark (BOM, U+FEFF)
  • Byte-swapped BOMs (U+FFFE)

Note the serialisations of these values:
https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt#n104

Please also fix the code so it passes CI:

util/validate_utf8: =================================================================
==19689== ERROR: AddressSanitizer: global-buffer-overflow on address 0x556399c41968 at pc 0x556399a0db8b bp 0x7ffce271cd70 sp 0x7ffce271cd68
READ of size 1 at 0x556399c41968 thread T0
    #0 0x556399a0db8a (/home/travis/build/teor2345/tor/src/test/test+0x910b8a)
    #1 0x556399666958 (/home/travis/build/teor2345/tor/src/test/test+0x569958)
    #2 0x5563996bf00c (/home/travis/build/teor2345/tor/src/test/test+0x5c200c)
    #3 0x5563996bf2a5 (/home/travis/build/teor2345/tor/src/test/test+0x5c22a5)
    #4 0x5563996c018a (/home/travis/build/teor2345/tor/src/test/test+0x5c318a)
    #5 0x5563992fc40d (/home/travis/build/teor2345/tor/src/test/test+0x1ff40d)
    #6 0x1456be049f44 (/lib/x86_64-linux-gnu/libc-2.19.so+0x21f44)
    #7 0x55639930061a (/home/travis/build/teor2345/tor/src/test/test+0x20361a)
0x556399c41968 is located 56 bytes to the left of global variable '*.LC663 (src/test/test_util.c)' (0x556399c419a0) of size 10
0x556399c41968 is located 0 bytes to the right of global variable '*.LC662 (src/test/test_util.c)' (0x556399c41960) of size 8
  '*.LC662 (src/test/test_util.c)' is ascii string 'ascii
'
Shadow bytes around the buggy address:
  0x0aacf33802d0: 05 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
  0x0aacf33802e0: 07 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 06 f9
  0x0aacf33802f0: f9 f9 f9 f9 03 f9 f9 f9 f9 f9 f9 f9 00 00 00 00
  0x0aacf3380300: 00 00 07 f9 f9 f9 f9 f9 00 00 00 00 00 04 f9 f9
  0x0aacf3380310: f9 f9 f9 f9 00 00 00 00 00 04 f9 f9 f9 f9 f9 f9
=>0x0aacf3380320: 00 00 00 00 00 04 f9 f9 f9 f9 f9 f9 00[f9]f9 f9
  0x0aacf3380330: f9 f9 f9 f9 00 02 f9 f9 f9 f9 f9 f9 02 f9 f9 f9
  0x0aacf3380340: f9 f9 f9 f9 02 f9 f9 f9 f9 f9 f9 f9 03 f9 f9 f9
  0x0aacf3380350: f9 f9 f9 f9 04 f9 f9 f9 f9 f9 f9 f9 03 f9 f9 f9
  0x0aacf3380360: f9 f9 f9 f9 03 f9 f9 f9 f9 f9 f9 f9 04 f9 f9 f9
  0x0aacf3380370: f9 f9 f9 f9 05 f9 f9 f9 f9 f9 f9 f9 04 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:     fa
  Heap righ redzone:     fb
  Freed Heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack partial redzone: f4
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  ASan internal:         fe
==19689== ABORTING

https://travis-ci.org/teor2345/tor/jobs/421877257#L6314

perl ../scripts/maint/checkSpace.pl -C \
	../src/lib/*/*.[ch] \
	../src/core/*/*.[ch] \
	../src/feature/*/*.[ch] \
	../src/app/*/*.[ch] \
	../src/test/*.[ch] \
	../src/test/*/*.[ch] \
	../src/tools/*.[ch]
make[2]: Entering directory '/c/projects/tor/i686-w64-mingw32'
...
tp fn():../src/lib/string/util_string.c:476
tp fn():../src/lib/string/util_string.c:482
bash.exe : make[1]: *** [Makefile:15841: check-spaces] Error 1
At line:2 char:5
+     & $commandPath $args 2>&1
+     ~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (make[1]: *** [M...spaces] Error 1:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
make[1]: *** Waiting for unfinished jobs....

https://ci.appveyor.com/project/teor2345/tor/build/1.0.103/job/au99ya2f1vkqynas#L2466

We usually document each macro and function, describing what it does, the arguments, and the return value.

If you could write a changes file, that would be great. Otherwise, someone will do it eventually:
https://gitweb.torproject.org/tor.git/tree/doc/HACKING/CodingStandards.md#n95

comment:4 in reply to:  3 ; Changed 9 months ago by cyberpunks

Replying to teor:

Fixed the tests and check-spaces, sorry bout that.

This is just a helper function and it hasn't been used anywhere yet, so it has no impact on Tor for users on its own, does that still need a changes file?

comment:5 in reply to:  4 ; Changed 9 months ago by teor

Replying to cyberpunks:

Replying to teor:

Fixed the tests and check-spaces, sorry bout that.

Thanks!

Can you also check that Surrogate code points fail to validate?
I couldn't see any in there.
(I'd like to be able to use these tests to make sure that our C and Rust implementations match, so we should check all the code point types.)

This is just a helper function and it hasn't been used anywhere yet, so it has no impact on Tor for users on its own, does that still need a changes file?

That's an interesting question!

We usually commit helper functions and the code the uses them in the same branch, with a single changes file.

Adding a helper function and a unit test could potentially trigger compilation warnings or unit test failures on some platforms.

So I think this change fits at least one of these criteria:

Anything that changes internals, documentation, or the build
system enough that somebody could notice. ….
Anything about which somebody might plausibly wonder "when
did that happen, and/or why did we do that" 6 months down the line.

https://gitweb.torproject.org/tor.git/tree/doc/HACKING/CodingStandards.md#n144

comment:6 Changed 9 months ago by teor

Also, clang does partial validation of unicode strings characters:

src/test/test_util.c:4044:38: error: invalid universal character
  tt_int_op(0, OP_EQ, validate_utf8("\U00110000", 5));
                                     ^~~~~~~~~~
./src/ext/tinytest_macros.h:158:24: note: expanded from macro 'tt_int_op'
        tt_assert_test_type(a,b,#a" "#op" "#b,long,(val1_ op val2_), \
                              ^
./src/ext/tinytest_macros.h:144:28: note: expanded from macro
      'tt_assert_test_type'
        tt_assert_test_fmt_type(a,b,str_test,type,test,type,fmt,        \
                                  ^
./src/ext/tinytest_macros.h:117:16: note: expanded from macro
      'tt_assert_test_fmt_type'
        type val2_ = (b);                                               \
                      ^
1 error generated.
make: *** [src/test/src_test_test-test_util.o] Error 1
make: *** Waiting for unfinished jobs....

https://travis-ci.org/teor2345/tor/jobs/421941943#L3884

(The macOS gcc builds fail as well, but that's because gcc is an alias to clang on those systems. I'll fix that in #27252.)

Last edited 9 months ago by teor (previous) (diff)

comment:7 in reply to:  5 ; Changed 9 months ago by cyberpunks

Replying to teor:

Can you also check that Surrogate code points fail to validate?
I couldn't see any in there.

They're in there. That's what "\xed\xa0\x80" and "\xed\xbf\xbf" are. They could be Unicode literals to make it clearer, but now I wonder if clang will warn about those too...

comment:8 in reply to:  7 ; Changed 9 months ago by teor

Replying to cyberpunks:

Replying to teor:

Can you also check that Surrogate code points fail to validate?
I couldn't see any in there.

They're in there. That's what "\xed\xa0\x80" and "\xed\xbf\xbf" are. They could be Unicode literals to make it clearer, but now I wonder if clang will warn about those too...

Oh, right, in their encoded forms.

Please add comments to each test (or each block of tests), so that future developers know why the tests are failing.

comment:9 in reply to:  8 Changed 9 months ago by cyberpunks

Replying to teor:

Please add comments

Done.

comment:10 Changed 9 months ago by teor

Status: needs_revisionmerge_ready
Type: taskenhancement

Thanks!

CI is here:

Let's merge to 0.3.6 if it passes. (There doesn't seem to be much point in adding an unused function to 0.3.5 right before the code freeze.)

comment:11 Changed 9 months ago by teor

Let's merge this to 0.3.5, so that relays can check ContactInfo lines in the LTS release:
https://trac.torproject.org/projects/tor/ticket/27380#comment:4

comment:12 Changed 9 months ago by teor

Milestone: Tor: 0.3.6.x-finalTor: 0.3.5.x-final

comment:13 in reply to:  11 Changed 9 months ago by cyberpunks

See the squashed and rebased branch in #27428 for merging, rather than the one in this ticket.

comment:14 Changed 8 months ago by nickm

Merged as part of #27428

comment:15 Changed 8 months ago by nickm

Resolution: implemented
Status: merge_readyclosed
Note: See TracTickets for help on using tickets.