Opened 5 months ago

Closed 4 months ago

#22410 closed defect (implemented)

ensure that uint8_t is unsigned char

Reported by: catalyst Owned by: catalyst
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: review-group-18
Cc: Actual Points:
Parent ID: #6877 Points:
Reviewer: Sponsor:

Description

On a POSIX platform, uint8_t is required to exist, but not explicitly required to be identical to unsigned char. It's theoretically possible that a very unusual platform could have an extension integer type that it defines as uint8_t. Test for this at configure time.

This supports an effort to move toward using uint8_t as the "byte" type in preference to char (which is often a signed type and can cause bugs related to signed arithmetic).

Child Tickets

Change History (9)

comment:1 Changed 5 months ago by catalyst

Status: newneeds_review

comment:2 Changed 5 months ago by cypherpunks

This ticket seems to be trying to test whether two data types are actually the same. IMO this is the wrong approach to reach the goal of replacing the data type of some buffers.

Instead the code should be designed and changed around the new data type and not simply swap data types after verifying they're the same.

comment:3 Changed 5 months ago by catalyst

The "best" alternative is unsigned char, but that's an unwieldy 13 characters vs 7 characters for uint8_t or 6 for u_char. POSIX guarantees that uint8_t exists but not that it's identical to unsigned char. The best abbreviation for unsigned char is probably u_char, but no standard guarantees that it exists (though it usually does on BSD-derived platforms).

If uint8_t is an extended integer type rather than unsigned char (which is admittedly unlikely), it won't have the privileged aliasing properties of unsigned char so code that casts pointers to other types to pointers to uint8_t might violate the strict aliasing rules and produce undefined behavior.

If you think it's better to use u_char and define it ourselves on platforms where it doesn't exist, I think that's a reasonable alternative too. (I think uint8_t is marginally better on the portability front because I think we effectively require POSIX already.)

comment:4 in reply to:  3 ; Changed 5 months ago by cypherpunks

Replying to catalyst:

The "best" alternative is unsigned char, but that's an unwieldy 13 characters vs 7 characters for uint8_t or 6 for u_char. POSIX guarantees that uint8_t exists but not that it's identical to unsigned char. The best abbreviation for unsigned char is probably u_char, but no standard guarantees that it exists (though it usually does on BSD-derived platforms).

Comparing the length of the keywords is a bad way of choosing data types.

If uint8_t is an extended integer type rather than unsigned char (which is admittedly unlikely), it won't have the privileged aliasing properties of unsigned char so code that casts pointers to other types to pointers to uint8_t might violate the strict aliasing rules and produce undefined behavior.

I agree with the possibility of violating strict aliasing. However, i assume (yes, i know i should never) these pointers are dereferenced at some point which is always undefined behavior when the old and new type differs so the point is moot.

If you think it's better to use u_char and define it ourselves on platforms where it doesn't exist, I think that's a reasonable alternative too. (I think uint8_t is marginally better on the portability front because I think we effectively require POSIX already.)

I don't care about the data type names (any renaming can easily be done using typedef if preferred). IMO it's more important that the data type matches the type of data it holds and the code handling these data types is built around these data types in order to keep casting to a minimum (preferably none).

comment:5 in reply to:  4 Changed 5 months ago by catalyst

Replying to cypherpunks:

Comparing the length of the keywords is a bad way of choosing data types.

In the abstract, I agree. I also think the length of a type name does affects readability by humans and we should consider that in our decisions.

Replying to catalyst:

If uint8_t is an extended integer type rather than unsigned char (which is admittedly unlikely), it won't have the privileged aliasing properties of unsigned char so code that casts pointers to other types to pointers to uint8_t might violate the strict aliasing rules and produce undefined behavior.

I agree with the possibility of violating strict aliasing. However, i assume (yes, i know i should never) these pointers are dereferenced at some point which is always undefined behavior when the old and new type differs so the point is moot.

C99 §6.5 paragraph 7 explicitly says that it's always valid to use a character lvalue to access the stored value of any object. This means it's always valid to dereference a pointer to a character type as long as it points into an object. If we detect that uint8_t is a character type, then we know that it will also have these privileged aliasing properties.

I don't care about the data type names (any renaming can easily be done using typedef if preferred). IMO it's more important that the data type matches the type of data it holds and the code handling these data types is built around these data types in order to keep casting to a minimum (preferably none).

I think the best data type for handling arbitrary byte data on a platform with CHAR_BIT==8 is unsigned char. This also has an advantage when handling encoding or decoding a larger type, because of the privileged aliasing properties of character types in C.

A lot of existing code in the tree uses uint8_t. It's easier to check at configure time whether uint8_t is a character type than to check each use of uint8_t for strict aliasing violations that could occur on (presumably rare) platforms where uint8_t is not a character type.

There will often need to be casting even when using unsigned char or uint8_t because they will promote to signed int on most platforms. This can cause problems with bitwise shifts if the appropriate casts aren't done. (Left-shifting a 1 bit into the sign bit is undefined behavior.)

comment:6 Changed 4 months ago by nickm

Keywords: review-group-18 added

comment:7 Changed 4 months ago by nickm

Status: needs_reviewneeds_revision

Code lgtm but needs a changes file

comment:8 Changed 4 months ago by catalyst

Status: needs_revisionneeds_review

Thanks. Added changes file and rebased on maint-0.3.1.

comment:9 Changed 4 months ago by nickm

Resolution: implemented
Status: needs_reviewclosed

Merged to master!

Note: See TracTickets for help on using tickets.