Discussion:
Cryptographic hash functions reloaded [was Interest in cryptographic functions within the standard library]
Markus Mayer
2014-08-23 20:14:24 UTC
Sorry for not responding for such a long time, but private life kept me
busy.

First of all, many thanks for all the valuable feedback you posted in the
'Interest in cryptographic functions within the standard library'
thread. As I will start with hash functions, I decided to start a new thread.

I have revamped (and simplified) my initial design and put it on the
list to get further feedback.

Design (heavily based on boost::crc):

class hash_function
{
public:
    typedef std::array<unsigned char, ALGORITHM_DEFINED> result_type;

    // Default constructible
    // Copyable
    // Movable

    hash_function& process_bytes(void const* buffer, std::size_t byte_count);

    void reset();

    const result_type& hash_value();
};

// I am not sure about this function yet...
template <class hash>
typename hash::result_type calculate_hash(void const* buffer,
                                          std::size_t byte_count);

The implemented algorithms will be (the class name is given in the list):
-md5
-sha_1
-sha_224
-sha_256
-sha_384
-sha_512
-sha3_224
-sha3_256
-sha3_384
-sha3_512
-Various flavors of crc
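
To make the intended use concrete, here is a minimal sketch of how the
interface above would be called (purely illustrative, assuming the
sha_256 class from the list; names are as proposed, not final):

#include <string>

int main()
{
    std::string msg = "The quick brown fox";

    sha_256 hasher;                                     // default constructed
    hasher.process_bytes(msg.data(), msg.size());      // feed bytes
    sha_256::result_type digest = hasher.hash_value(); // 32-byte array

    hasher.reset();                                    // reuse for the next message

    // Or, with the (tentative) free function:
    auto digest2 = calculate_hash<sha_256>(msg.data(), msg.size());
}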


Rationale:
-Why 'result_type'?
To be consistent with std::function.

-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8-bit basis. But as long as 'unsigned
char' is a multiple of 8 bits wide, the algorithms can still be applied,
so 'unsigned char' enables those architectures to implement the
functions. Architectures where 'unsigned char' is not a multiple of 8
bits wide will be excluded by the proposal.

-Why 'unsigned char' and not 'char' for result_type?
It is to prevent people from thinking that the result is text (a string).
Furthermore, if I interpret a raw byte, it is always non-negative. Or do
you interpret 0xFF as -1 when it appears in a raw byte stream?

-Why 'process_bytes' and not 'write', 'update', ...?
Well, naming is hard. I will stick to 'process_bytes' during design, but
I'm open to suggestions.

-Why not implement operator()?
Having a named function is more vocal (and clearer) than just
parentheses. IMHO operator() is only useful if the object will be used as
a functor (as in std::less), but the signature is too uncommon to be used
in any standard algorithm. Still, I'm willing to change if someone comes
up with a good example.

-Why not replace 'hash_value' with a conversion operator (operator result_type())?
What do you prefer?
auto result = myHash.hash_value();
or
auto result = static_cast<hash_function::result_type>(myHash);

-Why 'sha_512' and not 'sha2_512'?
'SHA-512' is the official name for the algorithm. I know it is
unfortunate, but better to be consistent with the official names.

-Why not add an iterator based process_bytes?
For now I consider it too complex.

-Why not add/delete algorithm XXX?
I think these are the most common. But I am open to suggestions.

-Why not use the naming of 'N3980: Types Don't Know #'?
This is already discussed above.

Open topics:
-How to handle the large state of a hash function?
Hash functions can have a large internal state (64 bytes for SHA-2, 200
bytes for SHA-3). Is it OK to put such objects on the stack, or do we
need to allocate them dynamically (using allocators)?

-How to hash a file?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How should
it be done now?

-Sync with 'Types Don't Know #'

-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt)
or a generic (templated) crc algorithm (like in boost::crc)?

-Add 'nothrow' where applicable

-More naming discussions

-Find a suitable header file (maybe functional or algorithm)


regards
Markus
Zhihao Yuan
2014-08-23 21:17:58 UTC
I used boost::detail::sha1 in two projects but later I dropped it,
because it has a serious flaw: the hasher object becomes
unusable after you get the digest (it is finalized), but you
have no way to tell whether such an object is still usable
if it is returned from a function.

Python's hashlib has no such problem, and the solution is
very simple: just copy the digest before you finalize it.

Now my design is a thin wrapper around OpenSSL, and provides all
the functionality of Python's hashlib at compile time:

https://github.com/lichray/cpp-deuceclient/blob/master/src/hashlib.h
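
The copy-before-finalize idea can be sketched like this (illustrative
only, not the actual hashlib.h interface; RawHasher stands for any
copyable hasher type with a destructive finalize()):

#include <cstddef>

template <class RawHasher>
class hasher
{
    RawHasher state_;
public:
    using result_type = typename RawHasher::result_type;

    hasher& update(void const* p, std::size_t n)
    {
        state_.update(p, n);
        return *this;
    }

    // hashlib-style: finalize a *copy* of the state, so *this stays
    // usable and can keep accepting input after the digest is read.
    result_type digest() const
    {
        RawHasher copy(state_);
        return copy.finalize();
    }
};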
Post by Markus Mayer
hash_function& process_bytes( void const *buffer, std::size_t
byte_count);
void* is a very bad interface. It destroys type safety. A user
may call process_bytes(my_std_string, a_shorter_length)
and think he is getting the hash of a prefix of the string.
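
For example (an illustrative fragment), with a void const* parameter
both of these compile, but the first silently hashes bytes of the string
*object* (pointer, size, capacity), not its characters:

std::string s = "hello world";
hash_function h;              // any hasher with the void const* interface
h.process_bytes(&s, 5);       // compiles, but hashes the object header
h.process_bytes(s.data(), 5); // what was actually meant: "hello"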
Post by Markus Mayer
const result_type& hash_value();
The issue I explained above: we need to return a value
to make the object less stateful.
Post by Markus Mayer
-Why 'result_type'?
To be consistent with std::function.
It's not a function, so why be consistent? Plus, the TR1
function interface no longer matters.
Post by Markus Mayer
-Why not add an iterator based process_bytes?
For now I consider it to complex.
Yes. Iterators are for type-generic algorithms, while message
digests have much more limited input types.
Post by Markus Mayer
-How to handle the large state of a hash function?
Hash function can have a large internal state (64 Byte for sha2, 200 Byte
for sha3) is it OK to put such objects on the stack, or do we need to
allocate them dynamically (using allocators)?
I don't think we need to. 200 bytes does not look large to me,
compared with the state of pseudo-random number generators.
Post by Markus Mayer
-How to hash a files?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How to do it
now?
I tried, but I found that the internal data flow of streambuf is
too complex to hook a hasher into.

Users can use an external buffer to do so, or use system
calls like read(2) directly. To me, it does not seem to be a
problem that must be solved in the standard.
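
For instance, the external-buffer approach might look like this (a
sketch, assuming the process_bytes/hash_value interface proposed above):

#include <cstddef>
#include <fstream>
#include <vector>

template <class Hasher>
typename Hasher::result_type hash_file(char const* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 16); // 64 KiB external buffer
    Hasher h;
    // Keep reading full buffers; the final, partial read still has
    // gcount() > 0 and is fed to the hasher before the loop exits.
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
        h.process_bytes(buf.data(), static_cast<std::size_t>(in.gcount()));
    return h.hash_value();
}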
--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd
Howard Hinnant
2014-08-24 01:25:38 UTC
Post by Markus Mayer
class hash_function
{
typedef std::array<unsigned char, ALGORITHM_DEFINED> result_type;
//Default contructable
//Copyable
//Moveable
hash_function& process_bytes( void const *buffer, std::size_t byte_count);
void reset();
const result_type& hash_value();
};
//I am not sure about this function yet...
template<class hash>
typename hash::result_type calculate_hash(void const *buffer, std::size_t byte_count);
-md5
-sha_1
-sha_224
-sha_256
-sha_384
-sha_512
-sha3_224
-sha3_256
-sha3_384
-sha3_512
-Various flavors of crc
-Why 'result_type'?
To be consistent with std::function.
Agreed with result_type, but see below.
Post by Markus Mayer
-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But as long as unsigned char has a multiple of 8 bits, the algorithms can still be applied. So 'unsigned char' enables those architectures to implement the functions. Architectures where 'unsigned char' is not a multiple of 8 bits will be excluded by the proposal.
Ditto for uint8_t, except that if you use uint8_t, you'll find out about those excluded architectures at compile time instead of at run time.
Post by Markus Mayer
-Why not implement operator()?
Having a function (with a name) is more vocal (and clear) then just braces. IMHO Operator() is only useful if the object will be used as a functor (as in std::less). But the signature is to uncommon to be used in any standard algorithm. But I'm willing to change if someone came up with a good example.
The "Types Don't Know #" proposal used operator() because it made it much easier to design a type-erased hasher that could be used for pimpl types.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#pimpl
Post by Markus Mayer
-Why not rename 'hash_value' to result_type()?
What do you prefer?
auto result = myHash.hash_value;
or
auto result = static_cast<hash_function::result_type>(myHash);
Or:

hash_function::result_type result{myHash};

auto is a double-edged sword. I personally love it. But I'm also finding it can be abused. Sometimes it is helpful to have the type of a variable be more explicit. But see below where it is shown that minimal code interfaces directly with hash algorithms. Only hash functors should call hash algorithms. And hash functors are what everybody else calls, such as unordered containers, or clients needing the result of a sha256. The distinction between hash functors and hash algorithms is what enables end-user-clients to easily swap out hash algorithms, simply by telling the hash functor to switch hash algorithms.
Post by Markus Mayer
-Sync with 'Types Don't Know #'
Here is a 'Types Don't Know #' version of sha256 based on the implementation at http://www.aarongifford.com/computers/sha.html

class sha256
{
    SHA256_CTX state_;
public:
    static constexpr std::endian endian = std::endian::big;
    using result_type = std::array<std::uint8_t, SHA256_DIGEST_LENGTH>;

    sha256() noexcept
    {
        SHA256_Init(&state_);
    }

    void
    operator()(void const* key, std::size_t len) noexcept
    {
        SHA256_Update(&state_, static_cast<uint8_t const*>(key), len);
    }

    explicit
    operator result_type() noexcept
    {
        result_type r;
        SHA256_Final(r.data(), &state_);
        return r;
    }
};

The types, constants, and functions starting with SHA256_ come from the C implementation at http://www.aarongifford.com/computers/sha.html. sha256 is copy constructible (and thus move constructible). It is not complicated -- a simple adaptor of existing C implementations. And it is easy and type-safe to use:

int
main()
{
    std::uhash<sha256> h;
    auto r = h(54);
}

r has type sha256::result_type, which is a 32-byte std::array. It hashes the int 54, converted to big endian prior to feeding it to the hash (if necessary). Little endian or native endian could just as easily have been chosen. One uses native endian if one doesn't care about the results being consistent across platforms with differing endianness. Note that the endian bits are lacking from N3980. They are a later refinement, strictly to aid fingerprinting applications such as sha256.

Note that neither clients such as the main() shown above nor unordered containers use sha256 directly, except to parameterize hash functions such as uhash. The hash functions such as uhash communicate directly with the hash algorithms such as sha256. This (compile-time) level of indirection is what makes it possible to very easily and quickly switch between sha256 and sha3_512, or siphash, etc.

std::uhash is an example, default, std-supplied hash_function, much like std::vector is an example std-supplied container. However std::uhash is far simpler than std::vector. Clients can very easily supply custom hash functors to replace std::uhash to do things such as seeding, padding and salting. For example, from:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#seeding

std::unordered_set<MyType, randomly_seeded_hash<acme::spooky>> my_set;

Using sha256 in this seeded context would fail at compile time, since sha256 is not seedable (at least not the implementation I used).

Finally note that if, instead of wanting to hash an int, I want to hash a string, it is simply:

int
main()
{
    std::uhash<sha256> h;
    auto r = h(std::string("54"));
}

The remarkable thing to note here is that neither int nor std::string has the slightest notion of sha256, nor even std::uhash. These types are hash-algorithm-agnostic. And this means if tomorrow, someone invents and implements super_sha, then all we have to do to use it is:

int
main()
{
    std::uhash<super_sha> h;
    auto r = h(std::string("54"));
}

This is in stark contrast to today's std::hash<T> model where one would have to revisit int, std::string, and all other types, to teach them about super_sha.

Howard
Miro Knejp
2014-08-24 03:38:06 UTC
Post by Howard Hinnant
Post by Markus Mayer
-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But as long as unsigned char has a multiple of 8 bits, the algorithms can still be applied. So 'unsigned char' enables those architectures to implement the functions. Architectures where 'unsigned char' is not a multiple of 8 bits will be excluded by the proposal.
Ditto for uint8_t, except that if you use uint8_t, you'll find out about those excluded architectures at compile time instead of at run time.
Wouldn't it be the job of the implementation on such platforms to make
the algorithm work on the machine? That an algorithm works on an 8 bit
basis does not mean it cannot be implemented on architectures with other
word sizes as long as the input size conforms to the algorithm's
requirements and instructions for the necessary bit fiddling are available.

Besides, such architectures would most likely have a freestanding
implementation (see 17.6.1.3 [compliance]) where most standard headers
are optional and excluded features must be documented, so I see no
reason to add this restriction to a proposal.

Miro
Howard Hinnant
2014-08-24 03:54:55 UTC
Post by Howard Hinnant
Post by Markus Mayer
-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But as long as unsigned char has a multiple of 8 bits, the algorithms can still be applied. So 'unsigned char' enables those architectures to implement the functions. Architectures where 'unsigned char' is not a multiple of 8 bits will be excluded by the proposal.
Ditto for uint8_t, except that if you use uint8_t, you'll find out about those excluded architectures at compile time instead of at run time.
Wouldn't it be the job of the implementation on such platforms to make the algorithm work on the machine? That an algorithm works on an 8 bit basis does not mean it cannot be implemented on architectures with other word sizes as long as the input size conforms to the algorithm's requirements and instructions for the necessary bit fiddling are available.
Besides, such architectures would most likely have a freestanding implementation (see 17.6.1.3 [compliance]) where most standard headers are optional and excluded features must be documented, so I see no reason to add this restriction to a proposal.
I misspoke. This sentence of mine is a side issue, not worthy of comment. My apologies for the noise.

Howard
Markus Mayer
2014-08-24 08:33:44 UTC
Post by Miro Knejp
Post by Howard Hinnant
Post by Markus Mayer
-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But as long as unsigned char has a multiple of 8 bits, the algorithms can still be applied. So 'unsigned char' enables those architectures to implement the functions. Architectures where 'unsigned char' is not a multiple of 8 bits will be excluded by the proposal.
Ditto for uint8_t, except that if you use uint8_t, you'll find out about those excluded architectures at compile time instead of at run time.
Wouldn't it be the job of the implementation on such platforms to make
the algorithm work on the machine? That an algorithm works on an 8 bit
basis does not mean it cannot be implemented on architectures with other
word sizes as long as the input size conforms to the algorithm's
requirements and instructions for the necessary bit fiddling are available.
Besides, such architectures would most likely have a /freestanding
implementation /(see 17.6.1.3 [compliance]) where most standard headers
are optional and excluded features must be documented, so I see no
reason to add this restriction to a proposal.
Miro
That's a really good point. I will use 'unsigned char' and let
implementations on odd-sized architectures decide for themselves
whether they implement it.
Ion GaztaƱaga
2014-08-24 13:15:44 UTC
Post by Markus Mayer
That's a really good point. I will use 'unsigned char' and let
implementations of odd-sized architectures decide themselves if they
implement it.
Although it's pretty similar, you can use uint_least8_t to express that
the result will be put in array storage of a type of which only 8 bits
are used, interpreted as an unsigned number.

That means that if SHA-256 is used, ALGORITHM_DEFINED will always be 32
on all platforms. Otherwise, on a platform with CHAR_BIT == 32 (say, a
32-bit DSP) one could store 32 bits in an unsigned char and
ALGORITHM_DEFINED would be 8 for SHA-256. This is important for the
user, as she has to correctly interpret how the result is stored.
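
In code, the distinction is simply (a sketch):

// With uint_least8_t the element count is fixed by the algorithm on
// every platform; only the low 8 bits of each element are used:
typedef std::array<std::uint_least8_t, 32> result_type; // SHA-256: always 32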

Best,

Ion
Jens Maurer
2014-08-24 06:44:57 UTC
Post by Howard Hinnant
Post by Markus Mayer
hash_function& process_bytes( void const *buffer, std::size_t byte_count);
I'd like to argue in favor of "unsigned char*", not "void*".
Post by Howard Hinnant
Post by Markus Mayer
-Why not implement operator()?
Having a function (with a name) is more vocal (and clear) then just braces. IMHO Operator() is only useful if the object will be used as a functor (as in std::less). But the signature is to uncommon to be used in any standard algorithm. But I'm willing to change if someone came up with a good example.
The "Types Don't Know #" proposal used operator() because it made it much easier to design a type-erased hasher that could be used for pimpl types.
Yes, operator() is better for an entity that does essentially just one thing.
Post by Howard Hinnant
Post by Markus Mayer
-Sync with 'Types Don't Know #'
Here is a 'Types Don't Know #' version of sha256 based on the implementation at http://www.aarongifford.com/computers/sha.html
class sha256
{
SHA256_CTX state_;
static constexpr std::endian endian = std::endian::big;
using result_type = std::array<std::uint8_t, SHA256_DIGEST_LENGTH>;
sha256() noexcept
{
SHA256_Init(&state_);
}
void
operator()(void const* key, std::size_t len) noexcept
{
SHA256_Update(&state_, static_cast<uint8_t const*>(key), len);
}
I'd like to have "unsigned char *" as the interface. Your implementation
makes it clear it doesn't work on non-8-bit-char machines, which is
fine, but that shouldn't constrain the interface.
Post by Howard Hinnant
int
main()
{
std::uhash<sha256> h;
auto r = h(54);
}
It's definitely the right approach to separate the core hash algorithm from
the preparation and adaptation of values of arbitrary type that get passed in.
Post by Howard Hinnant
r has type sha256::result_type, which is a 32 byte std::array. It hashes the int 54, converted to big endian prior to feeding it to the hash (if necessary). Little endian or native endian could just have easily been chosen. One uses native endian if you don't care about the results being consistent across platforms with differing endian. Note that the endian bits are lacking from N3980. They are a later refinement, strictly to aid fingerprinting applications such as sha256.
I'm not convinced the "endian" parts are well-designed as-is.

A hash function such as SHA256 should be defined to hash a sequence
of octets, accessed via "unsigned char *". At that level, endianness
doesn't matter. I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.

Some hash functions that process more than one octet in one step
could omit repeated endian conversion when accumulating multiple
"unsigned char"s into a step-unit (e.g. uint64_t). Specializations
of std::uhash<> that exploit that might be useful, possibly with an
optional
operator()(const uint64_t *, size_t len)
interface for the individual hash function that makes the dependency
explicit.
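
Such an optional word-oriented overload might be declared like this
(a hypothetical sketch, not proposed wording):

#include <cstddef>
#include <cstdint>

class word_hasher
{
public:
    // Byte-oriented interface, always available.
    void operator()(unsigned char const* p, std::size_t n);

    // Optional fast path: the input is already a sequence of
    // step-units, so per-byte accumulation (and repeated endian
    // conversion) can be skipped.
    void operator()(std::uint64_t const* p, std::size_t n);
};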

Jens
Olaf van der Spek
2014-08-24 14:30:16 UTC
Post by Markus Mayer
Post by Markus Mayer
hash_function& process_bytes( void const *buffer, std::size_t
byte_count);
I'd like to argue in favor of "unsigned char*", not "void*".
(const void*, size_t) is a pretty standard signature; why use const
unsigned char*? It'd disallow easily hashing a string, for example.
Perhaps we should have (array_view<>) instead?
Markus Mayer
2014-08-24 16:41:07 UTC
Post by Markus Mayer
Post by Markus Mayer
hash_function& process_bytes( void const *buffer, std::size_t
byte_count);
I'd like to argue in favor of "unsigned char*", not "void*".
(const void*, size_t) is a pretty standard type, why use const unsigned
char*?
It'd disallow easily hashing a string for example.
Perhaps we should have (array_view<>) instead?
I reverted to (const void*, size_t). It is used by boost::crc and
boost::asio.

I think a common usage is:
myHashFunc.process_bytes(&myStruct, sizeof(myStruct));

array_view<> would require creating an additional object and writing
more code, so I'll stick with (const void*, size_t).
Jens Maurer
2014-08-24 21:38:02 UTC
Post by Markus Mayer
I reverted to (const void*, size_t). It is used by boost::crc and
boost::asio.
That doesn't mean it's the best choice.
Post by Markus Mayer
myHashFunc.process_bytes(&myStruct, sizeof(myStruct));
which is totally bogus, because it hashes all the padding inside
the struct, which likely has unspecified values.

Jens
Jens Maurer
2014-08-24 21:29:41 UTC
Post by Jens Maurer
Post by Markus Mayer
hash_function& process_bytes( void const *buffer, std::size_t byte_count);
I'd like to argue in favor of "unsigned char*", not "void*".
(const void*, size_t) is a pretty standard type, why use const unsigned char*?
It'd disallow easily hashing a string for example.
And it disallows (directly) hashing

struct S {
    char c;
    int x;
};

S s[10] = { };
my_hash(s, sizeof(s)); // bad: allowed with "void *" proposal

which is a good thing, because all the padding between "c" and "x"
has unspecified value, so your hash is very unpredictable.

Also, simply hashing
int x[] = { 0, 1, 2, 3 };
my_hash(x, sizeof(x));

doesn't give you a consistent hash value across different machines,
because the hash operates on bytes, but the endianness of the "int"
(i.e. the mapping to bytes) differs.
Post by Jens Maurer
Perhaps we should have (array_view<>) instead?
Well, array_view<unsigned char> is probably equivalent to the
"unsigned char" proposal.

(I'm also weakly in favor of an iterator-style
const unsigned char * first, const unsigned char * last
interface, because it seems to save one register in some
hash functions compared to the "pointer, length" style,
but it's certainly possible to convert.)

Jens
Olaf van der Spek
2014-08-25 15:21:23 UTC
Post by Jens Maurer
(const void*, size_t) is a pretty standard type, why use const unsigned char*?
It'd disallow easily hashing a string for example.
And it disallows (directly) hashing ...
which is a good thing, because all the padding between "c" and "x"
has unspecified value, so your hash is very unpredictable.
True
Also, simply hashing
int x[] = { 0, 1, 2, 3 };
my_hash(x, sizeof(x));
doesn't give you a consistent hash value across different machines,
because the hash operates on bytes, but the endianess of the "int"
(i.e. the mapping to bytes) differs.
Post by Jens Maurer
Perhaps we should have (array_view<>) instead?
Well, array_view<unsigned char> is probably equivalent to the
"unsigned char" proposal.
Kinda. Shouldn't both array_view<> and string_view<> be supported?
--
Olaf
Jens Maurer
2014-08-25 20:40:05 UTC
Post by Olaf van der Spek
Post by Jens Maurer
Also, simply hashing
int x[] = { 0, 1, 2, 3 };
my_hash(x, sizeof(x));
doesn't give you a consistent hash value across different machines,
because the hash operates on bytes, but the endianess of the "int"
(i.e. the mapping to bytes) differs.
Post by Olaf van der Spek
Perhaps we should have (array_view<>) instead?
Well, array_view<unsigned char> is probably equivalent to the
"unsigned char" proposal.
Kinda. Shouldn't both array_view<> and string_view<> be supported?
The mapping of application-level data (structs, strings etc.)
should be left to an adaptation layer, independent of the actual
hash algorithm. Howard's proposal has this part right.

Jens
Thiago Macieira
2014-08-24 15:29:51 UTC
Post by Jens Maurer
Yes, operator() is better for an entity that does essentially just one thing.
This one does more than one thing.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
Howard Hinnant
2014-08-24 17:39:53 UTC
Post by Jens Maurer
Post by Howard Hinnant
class sha256
{
SHA256_CTX state_;
static constexpr std::endian endian = std::endian::big;
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
This is the only role of the endian specifier, which can be any one of:

static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;

std::endian::native will be equal to one of std::endian::little or std::endian::big. These enums are nothing more than the C++ version of the already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__.
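
On a GCC/Clang-style compiler the enum could be defined directly in
terms of those macros (a sketch of one possible definition, not
standard wording):

enum class endian
{
    little = __ORDER_LITTLE_ENDIAN__,
    big    = __ORDER_BIG_ENDIAN__,
    native = __BYTE_ORDER__   // equals one of the above on most targets
};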

And endian is really not even used directly by the hash functor uhash<>:

template <class Hasher = acme::siphash>
struct uhash
{
    using result_type = typename Hasher::result_type;

    template <class T>
    result_type
    operator()(T const& t) const noexcept
    {
        Hasher h;
        hash_append(h, t);
        return static_cast<result_type>(h);
    }
};

Instead there are now two traits:

template <class T> struct is_uniquely_represented;
template <class T, class HashAlgorithm> struct is_contiguously_hashable;

The first, is_uniquely_represented, is exactly what N3980 called is_contiguously_hashable. And is_contiguously_hashable has evolved into:

template <class T, class HashAlgorithm>
struct is_contiguously_hashable
    : public std::integral_constant<bool, is_uniquely_represented<T>{} &&
                                          (sizeof(T) == 1 ||
                                           HashAlgorithm::endian == endian::native)>
{};

That is, whether or not a type T is contiguously hashable depends not only on if the type is_uniquely_represented, but also on whether the HashAlgorithm needs to reverse the bytes of T before consuming it. If the HashAlgorithm's *requested endian* (HashAlgorithm::endian) is the same as the platform's *native endian* (endian::native), then the bytes do not need to be reversed prior to feeding T to the HashAlgorithm. And thus if T is also uniquely represented, then one can feed T (or an array of T's) directly to the HashAlgorithm. This is the job of this function:

template <class Hasher, class T>
inline
std::enable_if_t
<
    is_contiguously_hashable<T, Hasher>{}
>
hash_append(Hasher& h, T const& t) noexcept
{
    h(std::addressof(t), sizeof(t));
}

If we are dealing with a platform/HashAlgorithm disagreement in endian, then an alternative hash_append can be used for scalars:

template <class Hasher, class T>
inline
std::enable_if_t
<
    !is_contiguously_hashable<T, Hasher>{} &&
    (std::is_integral<T>{} || std::is_pointer<T>{} || std::is_enum<T>{})
>
hash_append(Hasher& h, T t) noexcept
{
    detail::reverse_bytes(t);
    h(std::addressof(t), sizeof(t));
}

Although it doesn't look like it in the source code, reverse_bytes is carefully crafted to compile down to x86 instructions such as bswapl, at least on clang/OSX, and so is maximally efficient.

So in summary, when uhash<HashAlgorithm> says:

hash_append(h, t);

A correct hash_append will be chosen (at compile time), which for scalar types that have endian issues, may or may not reverse the bytes of t depending on what the hash algorithm has requested, and what the platform's native endian is.

Most hash functions won't care about the endian of the scalars fed to them, and they can indicate this by requesting the native endian, whatever that is:

static constexpr std::endian endian = std::endian::native;

Given an underlying C implementation of a hash algorithm (such as SHA-256), it would be quite easy to write adaptors around that C code with varying endian requirements, so that different parts of your code could use SHA-256 but reverse, or not reverse scalars as required:

class sha256
{
    SHA256_CTX state_;
public:
    static constexpr xstd::endian endian = xstd::endian::native;
    // ...
};

class sha256_little
{
    SHA256_CTX state_;
public:
    static constexpr xstd::endian endian = xstd::endian::little;
    // ...
};

// ...

uhash<sha256> h1;        // don't worry about endian
uhash<sha256_little> h2; // ensure scalars are little endian prior to hashing

Finally note that the implementation of hash_append is made simpler by the use of the (const void*, size_t) interface, as opposed to a (const unsigned char*, size_t) interface. With the latter, one would have to code:

template <class Hasher, class T>
inline
std::enable_if_t
<
    is_contiguously_hashable<T, Hasher>{}
>
hash_append(Hasher& h, T const& t) noexcept
{
    h(reinterpret_cast<const unsigned char*>(std::addressof(t)), sizeof(t));
}

See https://github.com/HowardHinnant/hash_append for complete code.

Howard
Myriachan
2014-08-24 21:01:37 UTC
Post by Howard Hinnant
Post by Jens Maurer
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
Sadly, C++ still supports non-two's-complement architectures, so what happens when you hash a negative int?
Post by Howard Hinnant
static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;
std::endian::native will be equal to one of std::endian::little or std::endian::big. These enums are nothing more than the C++ version of the already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__.
What about PDP-endian?
Post by Howard Hinnant
Although it doesn't look like it in the source code, reverse_bytes is carefully crafted to compile down to x86 instructions such as bswapl, at least on clang/OSX, and so is maximally efficient.
On x86:

For 16-bit swaps, use "rol reg, 8". "xchg al, ah" and equivalent work, but may be slower.
For 32-bit swaps, use "bswap reg".
For 64-bit swaps on x86-32, use "bswap reg", but use the opposite registers for results.
For 64-bit swaps on x86-64, use "bswap reg".

On GCC and Clang, the intrinsics for the above are __builtin_bswap16, __builtin_bswap32 and __builtin_bswap64
On Visual C++, the intrinsics for the above are _byteswap_ushort, _byteswap_ulong and _byteswap_uint64 respectively.
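
Wrapped into a single portable helper, the 32-bit case might look like
this (a sketch using the intrinsics listed above, with a plain-C++
fallback):

#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h> // _byteswap_ulong
#endif

inline std::uint32_t bswap32(std::uint32_t v)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(v);   // Visual C++ intrinsic
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v); // GCC/Clang intrinsic
#else
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24); // portable fallback
#endif
}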
Post by Howard Hinnant
A correct hash_append will be chosen (at compile time), which for scalar
types that have endian issues, may or may not reverse the bytes of t
depending on what the hash algorithm has requested, and what the
platform's native endian is.
Does this mechanism support handling multiple scalars at once? On some systems, byte swapping may be more efficient if done as vector math.
Post by Howard Hinnant
See https://github.com/HowardHinnant/hash_append for complete code.
Its SHA-256 code is unsuitable for any platform for which "int" is larger than 32 bits. It will have undefined behavior in those cases due to signed integer overflow or shifting left a negative number.

Melissa
Howard Hinnant
2014-08-24 23:21:17 UTC
Post by Myriachan
Post by Howard Hinnant
Post by Jens Maurer
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
Sadly, C++ still supports non-two's-complement architectures, so what happens when you hash a negative int?
I presume the int's bytes would be hashed just like any other scalar.
Post by Myriachan
Post by Howard Hinnant
static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;
std::endian::native will be equal to one of std::endian::little or std::endian::big. These enums are nothing more than the C++ version of the already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__.
What about PDP-endian?
I stand corrected. On PDP-endian std::endian::native has a value that is not equal to either of std::endian::little or std::endian::big. If the HashAlgorithm specifies little or big endian, the implementation is required to map the bytes of the int (or whatever) to little or big endian prior to feeding it to the HashAlgorithm. For std::endian::big, this is the exact same thing as inserting a call to hton() prior to the HashAlgorithm.
Post by Myriachan
By the late 1990s, not only DEC but most of the New England computer industry which had been built around minicomputers similar to the PDP-11 collapsed in the face of microcomputer-based workstations and servers.
Post by Howard Hinnant
A correct hash_append will be chosen (at compile time), which for scalar
types that have endian issues, may or may not reverse the bytes of t
depending on what the hash algorithm has requested, and what the
platform's native endian is.
Does this mechanism support handling multiple scalars at once? On some systems, byte swapping may be more efficient if does as vector math.
An implementation could conceivably create a hash_append overload on arrays of int and optimize that.
Post by Myriachan
Post by Howard Hinnant
See https://github.com/HowardHinnant/hash_append for complete code.
Its SHA-256 code is unsuitable for any platform for which "int" is larger than 32 bits. It will have undefined behavior in those cases due to signed integer overflow or shifting left a negative number.
Thanks, I'll put in a static_assert.

Howard
'Geoffrey Romer' via ISO C++ Standard - Future Proposals
2014-08-26 18:26:15 UTC
Post by Howard Hinnant
Post by Myriachan
Post by Jens Maurer
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
Sadly, C++ still supports non-two's-complement architectures, so what
happens when you hash a negative int?
I presume the int's bytes would be hashed just like any other scalar.
OK, so the output of this hasher is implementation-defined, and not
suitable for use cases that require cross-binary reproducibility. I think
that's the only reasonable choice, but then why bother with the whole
endian rigamarole?
Post by Jens Maurer
Post by Myriachan
Post by Howard Hinnant
static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;
std::endian::native will be equal to one of std::endian::little or
std::endian::big. These enums are nothing more than the C++ version of the
already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__,
__ORDER_BIG_ENDIAN__.
Post by Myriachan
What about PDP-endian?
I stand corrected. On PDP-endian std::endian::native has a value that is
not equal to either of std::endian::little or std::endian::big. If the
HashAlgorithm specifies little or big endian, the implementation is
required to map the bytes of the int (or whatever) to little or big endian
prior to feeding it to the HashAlgorithm. For std::endian::big, this is
the exact same thing as inserting a call to hton() prior to the
HashAlgorithm.
Post by Myriachan
By the late 1990s, not only DEC but most of the New England computer
industry which had been built around minicomputers similar to the PDP-11
collapsed in the face of microcomputer-based workstations and servers.
Post by Myriachan
Post by Howard Hinnant
A correct hash_append will be chosen (at compile time), which for scalar
types that have endian issues, may or may not reverse the bytes of t
depending on what the hash algorithm has requested, and what the
platform's native endian is.
Does this mechanism support handling multiple scalars at once? On some
systems, byte swapping may be more efficient if does as vector math.
An implementation could conceivably create a hash_append overload on
arrays of int and optimize that.
Post by Myriachan
Post by Howard Hinnant
See https://github.com/HowardHinnant/hash_append for complete code.
Its SHA-256 code is unsuitable for any platform for which "int" is
larger than 32 bits. It will have undefined behavior in those cases due to
signed integer overflow or shifting left a negative number.
Thanks, I'll put in a static_assert.
Howard
Howard Hinnant
2014-08-26 20:02:38 UTC
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
OK, so the output of this hasher is implementation-defined, and not suitable for use cases that require cross-binary reproducibility. I think that's the only reasonable choice, but then why bother with the whole endian rigamarole?
Because 98% of the platforms that actually have a modern C++ compiler have a pretty standard scalar layout that differs only by endian (or word size). That is probably useful to somebody. No need to take that functionality away just because it wouldn't have worked on a platform that stopped being sold two decades ago.

Howard
--
---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+***@isocpp.org.
To post to this group, send email to std-***@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.
'Geoffrey Romer' via ISO C++ Standard - Future Proposals
2014-08-26 21:28:17 UTC
On Aug 26, 2014, at 2:26 PM, 'Geoffrey Romer' via ISO C++ Standard -
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
OK, so the output of this hasher is implementation-defined, and not
suitable for use cases that require cross-binary reproducibility. I think
that's the only reasonable choice, but then why bother with the whole
endian rigamarole?
Because 98% of the platforms that actually have a modern C++ compiler have
a pretty standard scalar layout that differs only by endian (or word
size).
Good point about word size, but I don't see where your proposal handles it.
That is probably to be useful to somebody. No need to take that
functionality away just because it wouldn't have worked on a platform that
stopped being sold two decades ago.
OK, let's assume that you're right, and in practice everyone will use the
same byte representation for int (up to endianness and maybe word size).
Can you say the same for an arbitrary type T? In other words, when I define
a hash_append overload, am I required to guarantee that the byte
representation it produces will not vary across platforms, or over time?

You are no doubt right that this functionality will "probably be useful to
somebody", but that's a fairly low bar to clear. Can you point to a
specific use case that would benefit from a generic fingerprinting (as
opposed to hashing) API?
Howard
Howard Hinnant
2014-08-26 22:13:40 UTC
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
Post by Howard Hinnant
OK, so the output of this hasher is implementation-defined, and not suitable for use cases that require cross-binary reproducibility. I think that's the only reasonable choice, but then why bother with the whole endian rigamarole?
Because 98% of the platforms that actually have a modern C++ compiler have a pretty standard scalar layout that differs only by endian (or word size).
Good point about word size, but I don't see where your proposal handles it.
One can hash a std::int32_t (for example) if you want to specify integral size. Technically, std::int32_t is optional in the standard. But lots of platforms have a std::int32_t, and the client doesn't have to care if that is a short, int or long.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
Post by Howard Hinnant
That is probably to be useful to somebody. No need to take that functionality away just because it wouldn't have worked on a platform that stopped being sold two decades ago.
OK, let's assume that you're right, and in practice everyone will use the same byte representation for int (up to endianness and maybe word size). Can you say the same for an arbitrary type T? In other words, when I define a hash_append overload, am I required to guarantee that the byte representation it produces will not vary across platforms, or over time?
I missed the smiley on the end of that sentence. You're kidding me, right?!

If we restrict T to the set of scalars, and if we say that hash_append is part of the std::lib, then the std::lib implementors will be required to provide a hash_append as the standard specifies for all scalars, and hopefully for many std-defined types as well. That's all a standard could possibly do.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
You are no doubt right that this functionality will "probably be useful to somebody", but that's a fairly low bar to clear. Can you point to a specific use case that would benefit from a generic fingerprinting (as opposed to hashing) API?
I can well imagine that an application that implements a secure financial P2P protocol consisting of a history of public ledgers might use SHA-256 as a means of securing the integrity of those ledgers. And I can imagine that it might be desirable to have that application run on platforms of differing endian. Say PPC and x86. When a 64-bit integral type is part of a message that gets hashed by the SHA-256 hash algorithm, it will be important that the PPC machine and the x86 machine compute the same result. Otherwise the two machines will not be able to agree upon the contents of the public ledger. Think bitcoin. Or perhaps ripple: https://ripple.com.

Such applications will also have to limit their porting to that tiny slice of platforms that use ASCII/Unicode as well. There will be no porting to EBCDIC-based platforms. :-)

Howard
'Geoffrey Romer' via ISO C++ Standard - Future Proposals
2014-08-27 15:30:09 UTC
Post by Howard Hinnant
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
OK, so the output of this hasher is implementation-defined, and not
suitable for use cases that require cross-binary reproducibility. I think
that's the only reasonable choice, but then why bother with the whole
endian rigamarole?
Because 98% of the platforms that actually have a modern C++ compiler
have a pretty standard scalar layout that differs only by endian (or word
size).
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
Good point about word size, but I don't see where your proposal handles it.
One can hash a std::int32_t (for example) if you want to specify integral
size. Technically, std::int32_t is optional in the standard. But lots of
platforms have a std::int32_t, and the client doesn't have to care if that
is a short, int or long.
That is probably to be useful to somebody. No need to take that
functionality away just because it wouldn't have worked on a platform that
stopped being sold two decades ago.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
OK, let's assume that you're right, and in practice everyone will use
the same byte representation for int (up to endianness and maybe word
size). Can you say the same for an arbitrary type T? In other words, when I
define a hash_append overload, am I required to guarantee that the byte
representation it produces will not vary across platforms, or over time?
I missed the smiley on the end of that sentence. You're kidding me, right?!
If we restrict T to the set of scalars. And if we say that hash_append is
part of the std::lib. Then the std::lib implementors will be required to
provide a hash_append as the standard specifies for all scalars. Hopefully
for many std-defined types as well. That's all a standard could possibly
do.
Imposing a contract on users of a standard-defined extension point is well
within the standard's purview; see std::hash, to name a particularly
salient example. I agree that the standard shouldn't impose this particular
requirement, but only because it would be exceedingly onerous, not because
the standard can't do it even in principle.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
You are no doubt right that this functionality will "probably be useful
to somebody", but that's a fairly low bar to clear. Can you point to a
specific use case that would benefit from a generic fingerprinting (as
opposed to hashing) API?
Post by Howard Hinnant
I can well imagine that an application that implements a secure financial
P2P protocol consisting of a history of public ledgers might use SHA-256 as
a means of securing the integrity of those ledgers. And I can imagine that
it might be desirable to have that application run on platforms of
differing endian. Say PPC and x86. When a 64 byte integral type is part
of a message that gets hashed by the SHA-256 hash algorithm, it will be
important that when this is computed by both the PPC machine and the x86
machine, that they get the same result. Otherwise the two machines will
not be able to agree upon the contents of the public ledger. Think
bitcoin. Or perhaps ripple: https://ripple.com.
How does this use case benefit from a _generic_ fingerprinting API? How
does it benefit from using the same hash_append overloads for both
fingerprinting and in-memory hashing?

Don't get me wrong, this design sounds plausible for this particular use
case; I'm concerned that it may couple the protocol too closely to the C++
implementation, which may create rigidities as the two evolve, but no doubt
the engineers implementing the system have thought that through (or would
have, since this is of course entirely hypothetical ;-) ). However, what
makes this case plausible is the fact that cryptographic fingerprinting is
at the very core of the application, and so every class that's being
fingerprinted has been designed to support efficient, reliable
fingerprinting. That being the case, reusing the existing fingerprint
infrastructure for ordinary in-memory hashing makes sense.

However, if this API is standardized, it seems self-evident that the vast
majority of classes, and in particular their hash_append overloads, will be
written without much attention to fingerprinting, since in-memory hashing
is a much more common use case than persistent fingerprinting.
Consequently, hash_append authors are very likely to do things that are
inappropriate for fingerprinting, such as ignoring the endian flags, using
types with implementation-defined sizes, or changing hash_append to reflect
changes in the underlying class.

In an environment where fingerprinting is not a central concern, using
hash_append for both hashing and fingerprinting seems very likely to lead
to bugs. These would be the worst sort of bugs, too: bugs where the code
initially works just fine, and then breaks months or years later, after
you've had plenty of time for your faulty assumptions to become deeply
embedded in both the codebase and the protocol/serialization format that it
implements.
Post by Howard Hinnant
Such applications will also have to limit their porting to that tiny slice
of platforms that use ASCII/Unicode as well. There will be no porting to
EBCDIC-based platforms. :-)
This is doubtless the right choice for this application, but I don't think
the standard can afford to be so cavalier. Or maybe it can, but if so it
should start with the core language, not with library extensions (I for one
would be perfectly happy for C++ to standardize on Unicode, and for that
matter on two's complement little-endian LP64 integers and IEEE 754
floating-point, but I get the feeling there are good reasons that hasn't
happened). We can't just write off marginal platforms in the library while
claiming to support them in the language, _especially_ if our degraded
support comes in the form of correctness bugs, rather than build failures
or degraded performance.
Howard
'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
2014-08-27 16:05:32 UTC
Permalink
On Wed, Aug 27, 2014 at 8:30 AM, 'Geoffrey Romer' via ISO C++ Standard
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
On Aug 26, 2014, at 5:28 PM, 'Geoffrey Romer' via ISO C++ Standard -
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
On Tue, Aug 26, 2014 at 1:02 PM, Howard Hinnant
On Aug 26, 2014, at 2:26 PM, 'Geoffrey Romer' via ISO C++ Standard -
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
OK, so the output of this hasher is implementation-defined, and not
suitable for use cases that require cross-binary reproducibility. I think
that's the only reasonable choice, but then why bother with the whole endian
rigamarole?
Because 98% of the platforms that actually have a modern C++ compiler
have a pretty standard scalar layout that differs only by endian (or word
size).
Good point about word size, but I don't see where your proposal handles it.
One can hash a std::int32_t (for example) if you want to specify integral
size. Technically, std::int32_t is optional in the standard. But lots of
platforms have a std::int32_t, and the client doesn't have to care if that
is a short, int or long.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
That is probably going to be useful to somebody. No need to take that
functionality away just because it wouldn't have worked on a platform that
stopped being sold two decades ago.
OK, let's assume that you're right, and in practice everyone will use
the same byte representation for int (up to endianness and maybe word size).
Can you say the same for an arbitrary type T? In other words, when I define
a hash_append overload, am I required to guarantee that the byte
representation it produces will not vary across platforms, or over time?
I missed the smiley on the end of that sentence. You're kidding me, right?!
If we restrict T to the set of scalars. And if we say that hash_append is
part of the std::lib. Then the std::lib implementors will be required to
provide a hash_append as the standard specifies for all scalars. Hopefully
for many std-defined types as well. That's all a standard could possibly
do.
Imposing a contract on users of a standard-defined extension point is well
within the standard's purview; see std::hash, to name a particularly salient
example. I agree that the standard shouldn't impose this particular
requirement, but only because it would be exceedingly onerous, not because
the standard can't do it even in principle.
Post by 'Geoffrey Romer' via ISO C++ Standard - Future Proposals
You are no doubt right that this functionality will "probably be useful
to somebody", but that's a fairly low bar to clear. Can you point to a
specific use case that would benefit from a generic fingerprinting (as
opposed to hashing) API?
I can well imagine that an application that implements a secure financial
P2P protocol consisting of a history of public ledgers might use SHA-256 as
a means of securing the integrity of those ledgers. And I can imagine that
it might be desirable to have that application run on platforms of differing
endian. Say PPC and x86. When a 64 bit integral type is part of a message
that gets hashed by the SHA-256 hash algorithm, it will be important that
when this is computed by both the PPC machine and the x86 machine, they
get the same result. Otherwise the two machines will not be able to agree
upon the contents of the public ledger. Think bitcoin. Or perhaps ripple:
https://ripple.com.
How does this use case benefit from a _generic_ fingerprinting API? How does
it benefit from using the same hash_append overloads for both fingerprinting
and in-memory hashing?
Don't get me wrong, this design sounds plausible for this particular use
case; I'm concerned that it may couple the protocol too closely to the C++
implementation, which may create rigidities as the two evolve, but no doubt
the engineers implementing the system have thought that through (or would
have, since this is of course entirely hypothetical ;-) ). However, what
makes this case plausible is the fact that cryptographic fingerprinting is
at the very core of the application, and so every class that's being
fingerprinted has been designed to support efficient, reliable
fingerprinting. That being the case, reusing the existing fingerprint
infrastructure for ordinary in-memory hashing makes sense.
However, if this API is standardized, it seems self-evident that the vast
majority of classes, and in particular their hash_append overloads, will be
written without much attention to fingerprinting, since in-memory hashing is
a much more common use case than persistent fingerprinting. Consequently,
hash_append authors are very likely to do things that are inappropriate for
fingerprinting, such as ignoring the endian flags, using types with
implementation-defined sizes, or changing hash_append to reflect changes in
the underlying class.
In an environment where fingerprinting is not a central concern, using
hash_append for both hashing and fingerprinting seems very likely to lead to
bugs. These would be the worst sort of bugs, too: bugs where the code
initially works just fine, and then breaks months or years later, after
you've had plenty of time for your faulty assumptions to become deeply
embedded in both the codebase and the protocol/serialization format that it
implements.
Such applications will also have to limit their porting to that tiny slice
of platforms that use ASCII/Unicode as well. There will be no porting to
EBCDIC-based platforms. :-)
This is doubtless the right choice for this application, but I don't think
the standard can afford to be so cavalier. Or maybe it can, but if so it
should start with the core language, not with library extensions (I for one
would be perfectly happy for C++ to standardize on Unicode, and for that
matter on two's complement little-endian LP64 integers and IEEE 754
floating-point, but I get the feeling there are good reasons that hasn't
happened). We can't just write off marginal platforms in the library while
claiming to support them in the language, _especially_ if our degraded
support comes in the form of correctness bugs, rather than build failures or
degraded performance.
I am suspicious that I don't know of any other platform that provides
cryptographic hash functions as a generic traversal over user-defined
data types. They all require byte-string inputs instead. If we did
that for the C++ library, users would need to explicitly serialize
their types before feeding them into the cryptographic hash, which
doesn't really seem that bad, especially if we have a way to stream
the serialization into the fingerprint function so there never needs
to be a large string allocated in memory.
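For concreteness, a minimal sketch of such a streaming sink (every name here is hypothetical, nothing proposed): the serializer writes chunks to a sink object, and the hashing sink forwards each chunk straight into the hasher, so no intermediate string is ever materialized.

#include <cstddef>

template <class Hasher>
struct hashing_sink
{
    Hasher& h;
    void write(void const* p, std::size_t n) { h(p, n); }  // forward, don't buffer
};

// Usage, given a user-written serialize(obj, sink) that calls
// sink.write(...) chunk by chunk:
//
//   sha256 h;
//   serialize(obj, hashing_sink<sha256>{h});
//   auto digest = static_cast<sha256::result_type>(h);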

I went back to check
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#serialization,
and the problem Howard pointed out is that hashing wants to treat 0.0
and -0.0 as equivalent. But I don't see a reason to think that
*fingerprinting* needs to treat them as equivalent, especially if the
serialization code wants to treat them as distinct behaviors. If
you're generating a signature of your current state, and that state
depends on the difference between 0.0 and -0.0, I think you'd want to
incorporate that into the signature.
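As a sketch of that distinction (assuming the Hasher call operator used elsewhere in this thread; illustration only, not proposed wording): an in-memory hashing overload can canonicalize the zeroes, while a fingerprinting overload would simply omit the first two lines and hash the raw bit pattern.

#include <cstring>

template <class Hasher>
void hash_append(Hasher& h, double d)
{
    if (d == 0.0)   // true for both +0.0 and -0.0
        d = 0.0;    // canonicalize away the sign of zero
    unsigned char bytes[sizeof d];
    std::memcpy(bytes, &d, sizeof d);   // the object's raw bytes
    h(bytes, sizeof bytes);
}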

Jeffrey
Jens Maurer
2014-08-27 21:38:44 UTC
Permalink
Post by 'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
I am suspicious that I don't know of any other platform that provides
cryptographic hash functions as a generic traversal over user-defined
data types. They all require byte-string inputs instead. If we did
that for the C++ library, users would need to explicitly serialize
their types before feeding them into the cryptographic hash, which
doesn't really seem that bad, especially if we have a way to stream
the serialization into the fingerprint function so there never needs
to be a large string allocated in memory.
But hash_append() is exactly that: A "serialization" that streams
directly into the cryptographic hash. You have to define hash_append()
for your own types (thereby defining the serialization format, and
which parts are salient for the hash and which ones aren't), and you
use hash_append() on scalar types and strings as defined by the standard
library.
Post by 'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
I went back to check
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#serialization,
and the problem Howard pointed out is that hashing wants to treat 0.0
and -0.0 as equivalent.
It will be hard to specify the details of hash_append(double) ,
given that C++ says nearly nothing about floating-point behavior,
in general.
Post by 'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
But I don't see a reason to think that
*fingerprinting* needs to treat them as equivalent, especially if the
serialization code wants to treat them as distinct behaviors. If
you're generating a signature of your current state, and that state
depends on the difference between 0.0 and -0.0, I think you'd want to
incorporate that into the signature.
I think/hope hash_append() provides enough customization points to deal
with that. For example, you could define your own SHA256 as a thin
wrapper around the library-provided one. Now you have a customization
point for hash_append(sha256, double) to do what you want.
You can also distinguish that crypto-hashing from other kinds
of hashing, if you're so inclined.

Yes, it's rather brittle if you switch your hash from sha256 to
something else. If you want to avoid that, we should ensure the
design is flexible enough that I can have my own
my_crypto_hash_append() that does what I want, without too much
repetition of code. Yep, I can't use std::uhash<> any more, but
that should contain only trivial code, anyway.
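Concretely, the wrapper could be as thin as this (sha256's update interface is stubbed to match the thread's design; my_sha256 and its double policy are made-up names, not proposals):

#include <cstddef>
#include <cstring>

struct sha256   // stand-in for the library-provided hasher
{
    void operator()(void const* p, std::size_t n) noexcept { /* update state */ }
};

struct my_sha256 : sha256       // distinct type, so overload resolution
{                               // can pick our own customizations
    using sha256::operator();
};

// Our fingerprinting policy: keep the raw bit pattern, so +0.0 and
// -0.0 stay distinct.
void hash_append(my_sha256& h, double d)
{
    unsigned char bytes[sizeof d];
    std::memcpy(bytes, &d, sizeof d);
    h(bytes, sizeof bytes);
}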

Jens
Jens Maurer
2014-08-24 21:35:48 UTC
Permalink
Post by Howard Hinnant
Post by Jens Maurer
Post by Howard Hinnant
class sha256
{
SHA256_CTX state_;
static constexpr std::endian endian = std::endian::big;
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;
std::endian::native will be equal to one of std::endian::little or std::endian::big. These enums are nothing more than the C++ version of the already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__.
We need a better abstraction for this. C++ makes no assumption on
endianess, and there is certainly more than big and little
(VAX-endian springs to mind).
Post by Howard Hinnant
template <class Hasher = acme::siphash>
struct uhash
{
using result_type = typename Hasher::result_type;
template <class T>
result_type
operator()(T const& t) const noexcept
{
Hasher h;
hash_append(h, t);
return static_cast<result_type>(h);
}
};
Uh, is there some incremental functionality (without finalizing)
as well, and a variadic template overload?

Jens
Howard Hinnant
2014-08-24 23:21:42 UTC
Permalink
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
Post by Howard Hinnant
class sha256
{
SHA256_CTX state_;
static constexpr std::endian endian = std::endian::big;
I agree there is a choice for std::uhash<> when
converting a, say, "int" to a sequence of "unsigned char"s whether to
perform endian conversion or not, and that influences the cross-platform
reproducibility of the hash when e.g. hashing a sequence of "int"
values.
static constexpr std::endian endian = std::endian::big;
static constexpr std::endian endian = std::endian::little;
static constexpr std::endian endian = std::endian::native;
std::endian::native will be equal to one of std::endian::little or std::endian::big. These enums are nothing more than the C++ version of the already popular macros: __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__.
We need a better abstraction for this. C++ makes no assumption on
endianess, and there is certainly more than big and little
(VAX-endian springs to mind).
I have a hard time getting excited about hardware that you can't find outside of a museum. I do have fond memories of learning the VAX operating system, but that was a very long time ago. However, on such a machine std::endian::native has a value that is not equal to either of std::endian::little or std::endian::big. If the HashAlgorithm specifies little or big endian, the implementation is required to map the bytes of the int (or whatever) to little or big endian prior to feeding it to the HashAlgorithm. For std::endian::big, this is the exact same thing as inserting a call to hton() prior to the HashAlgorithm.
Post by Jens Maurer
Post by Howard Hinnant
template <class Hasher = acme::siphash>
struct uhash
{
using result_type = typename Hasher::result_type;
template <class T>
result_type
operator()(T const& t) const noexcept
{
Hasher h;
hash_append(h, t);
return static_cast<result_type>(h);
}
};
Uh, is there some incremental functionality (without finalizing)
as well, and a variadic template overload?
Yes to both, for example:

struct S
{
char c;
int x;
};

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, const S& s)
{
using std::hash_append;
hash_append(h, s.c, s.x);
}

One cannot hash S without writing this hash_append. And once written, S can be hashed with any hash algorithm conforming to HashAlgorithm. I've demonstrated FNV-1A, Jenkins1, Spooky, Murmur2, SipHash, and SHA-256, none of which require any changes to hash_append(HashAlgorithm& h, const S& s). To choose an algorithm, one simply specifies it as the template parameter to uhash at the point of use:

std::unordered_set<S, uhash<fnv1a>> s;

Within hash_append, the HashAlgorithm is being updated incrementally, first with the char c, then with the int x (not both at once, because of the padding issues you have referred to). That is, one could have written hash_append for S as:

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, const S& s)
{
using std::hash_append;
hash_append(h, s.c);
hash_append(h, s.x);
}

These are exercising the incremental hash functionality.

This is composable. For example, consider another type Y that has S as a member:

struct Y
{
S s;
std::string name;
};

hash_append for this could be written:

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, const Y& y)
{
using std::hash_append;
hash_append(h, y.s, y.name);
}

And Y can then be hashed:

std::unordered_set<Y, uhash<fnv1a>> y;

or:

std::cout << uhash<siphash>{}(y) << '\n';

Y doesn't have to know how to hash S, nor std::string. And Y doesn't have to use anything like hash_combine on the results of hashing S or std::string. This is because *all* of the hashing is happening incrementally, with only the hash functor (uhash in this case) doing the hash algorithm initialization and finalization. And uhash is only used at the top level of a data structure when you actually need a hash code.

Individual types don't really know how to create a hash code. They only know how to present themselves to a generic hash algorithm so as to incrementally append their state to the state of the hash algorithm. Types Don't Know #.

Hash functors initialize a generic hash algorithm, ask a type to append its state to the hash algorithm (which recursively asks its bases and members to append their state to the generic hash algorithm), and finalize the generic hash algorithm.

Top-level clients, such as unordered containers, combine a hash functor template with a specific hash algorithm, to create a hash functor that will hash any type for which hash_append has been implemented, using that specific hash algorithm as if all of the discontinuous bits of memory had been rearranged into one contiguous chunk of memory.

Howard
Myriachan
2014-08-25 15:16:40 UTC
Permalink
I have a hard time getting excited about hardware that you can't find
outside of a museum. I do have fond memories of learning the VAX
operating system, but that was a very long time ago. However on
such a machine std::endian::native has a value that is not equal to
either of std::endian::little or std::endian::big. If the HashAlgorithm
specifies little or big endian, the implementation is required to map the
bytes of the int (or whatever) to little or big endian prior to feeding it to
the HashAlgorithm. For std::endian::big, this is the exact same thing
as inserting a call to hton() prior to the HashAlgorithm.
Don't you mean std::endian::little, unless the hash algorithm is, say, MD5, or the little-endian version of one of the SHA series? (The latter of which would not make sense in the Standard since they're non-standard, and the former is deprecated for security reasons.)

Also, I would love to say ¡adiós! to such architectures as well, because it means signed integer overflow being undefined would have a chance in Hell of being removed instead of the current "no". Only overzealous compiler optimizers would be in the way, rather than architectures that'd be left behind. >.<

Does your design easily templatize into constructions like HMAC?

Melissa
Howard Hinnant
2014-08-25 15:45:57 UTC
Permalink
Post by Myriachan
Post by Howard Hinnant
I have a hard time getting excited about hardware that you can't find
outside of a museum. I do have fond memories of learning the VAX
operating system, but that was a very long time ago. However on
such a machine std::endian::native has a value that is not equal to
either of std::endian::little or std::endian::big. If the HashAlgorithm
specifies little or big endian, the implementation is required to map the
bytes of the int (or whatever) to little or big endian prior to feeding it to
the HashAlgorithm. For std::endian::big, this is the exact same thing
as inserting a call to hton() prior to the HashAlgorithm.
Don't you mean std::endian::little, unless the hash algorithm is, say, MD5, or the little-endian version of one of the SHA series? (The latter of which would not make sense in the Standard since they're non-standard, and the former is deprecated for security reasons.)
The intended semantics are:

struct MyHashAlgorithm
{
static constexpr std::endian endian = std::endian::big;
// ...
};

means: Attention hash_append function: Prior to feeding MyHashAlgorithm bytes from scalar types, map them (from native) into big endian. So if two platforms had scalars with identical layout except for endian, the two platforms could generate identical hash codes for identical scalar input.
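As a sketch (using the std::endian enumeration discussed earlier; reverse_bytes is a hypothetical helper, essentially hton() generalized to any scalar width):

#include <algorithm>
#include <cstring>
#include <memory>

template <class T>
T reverse_bytes(T t)                      // byte-swap a scalar
{
    unsigned char b[sizeof t];
    std::memcpy(b, &t, sizeof t);
    std::reverse(b, b + sizeof t);
    std::memcpy(&t, b, sizeof t);
    return t;
}

template <class Hasher, class T>
void hash_append(Hasher& h, T const& t) noexcept
{
    T tmp = t;
    if (Hasher::endian != std::endian::native)  // assumes little vs. big only
        tmp = reverse_bytes(tmp);               // map native to requested order
    h(std::addressof(tmp), sizeof tmp);
}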
Post by Myriachan
Also, I would love to say Ā”adiĆ³s! to such architectures as well, because it means signed integer overflow being undefined would have a chance in Hell of being removed instead of the current no. Only overzealous compiler optimizers would be in the way, rather than architectures that'd be left behind. >.<
Does your design easily templatize into constructions like HMAC?
I am unsure.

One can easily build hash functors that can be seeded, for example:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#seeding

I.e. you choose a seed (key?) and use that to initialize the HashAlgorithm. And then the hash functor runs the update and finalization stages in the same way as previously shown.

One could also prepend a message, by updating the HashAlgorithm with a key prior to updating it with the message. Or one could append a message by updating the HashAlgorithm just prior to running the finalization stage.

But I am unsure if the abilities of seeding, prepending and appending are sufficient to generate an HMAC.

Howard
'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
2014-08-25 18:01:42 UTC
Permalink
On Mon, Aug 25, 2014 at 8:45 AM, Howard Hinnant
Post by Howard Hinnant
Post by Myriachan
Post by Howard Hinnant
I have a hard time getting excited about hardware that you can't find
outside of a museum. I do have fond memories of learning the VAX
operating system, but that was a very long time ago. However on
such a machine std::endian::native has a value that is not equal to
either of std::endian::little or std::endian::big. If the HashAlgorithm
specifies little or big endian, the implementation is required to map the
bytes of the int (or whatever) to little or big endian prior to feeding it to
the HashAlgorithm. For std::endian::big, this is the exact same thing
as inserting a call to hton() prior to the HashAlgorithm.
Don't you mean std::endian::little, unless the hash algorithm is, say, MD5, or the little-endian version of one of the SHA series? (The latter of which would not make sense in the Standard since they're non-standard, and the former is deprecated for security reasons.)
struct MyHashAlgorithm
{
static constexpr std::endian endian = std::endian::big;
// ...
};
means: Attention hash_append function: Prior to feeding MyHashAlgorithm bytes from scalar types, map them (from native) into big endian. So if two platforms had scalars with identical layout except for endian, the two platforms could generate identical hash codes for identical scalar input.
Post by Myriachan
Also, I would love to say ¡adiós! to such architectures as well, because it means signed integer overflow being undefined would have a chance in Hell of being removed instead of the current "no". Only overzealous compiler optimizers would be in the way, rather than architectures that'd be left behind. >.<
Does your design easily templatize into constructions like HMAC?
I am unsure.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#seeding
I.e. you choose a seed (key?) and use that to initialize the HashAlgorithm. And then the hash functor runs the update and finalization stages in the same way as previously shown.
One could also prepend a message, by updating the HashAlgorithm with a key prior to updating it with the message. Or one could append a message by updating the HashAlgorithm just prior to running the finalization stage.
But I am unsure if the abilities of seeding, prepending and appending are sufficient to generate an HMAC.
I think you need a "block size" accessor on cryptographic hashers in
order to write the HMAC<> template, since HMAC involves padding the
key to the block size. You could write an HMAC_SHA256 without that by
hard-coding the block size.
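For what it's worth, the HMAC<> construction is short once block_size exists. A sketch only, under an assumed interface (none of this is settled API): H is default-constructible, exposes a static constexpr block_size in bytes, an update operator()(void const*, std::size_t), a result_type that is a std::array<unsigned char, N> no larger than a block, and finalization via the explicit conversion.

#include <cstddef>
#include <cstring>

template <class H>
typename H::result_type
hmac(unsigned char const* key, std::size_t keylen,
     void const* msg, std::size_t msglen)
{
    unsigned char k[H::block_size] = {};     // key, zero-padded to one block
    if (keylen > H::block_size)
    {
        H kh;                                // overlong keys are hashed first
        kh(key, keylen);
        auto d = static_cast<typename H::result_type>(kh);
        std::memcpy(k, d.data(), d.size());
    }
    else
        std::memcpy(k, key, keylen);

    unsigned char ipad[H::block_size], opad[H::block_size];
    for (std::size_t i = 0; i < H::block_size; ++i)
    {
        ipad[i] = static_cast<unsigned char>(k[i] ^ 0x36);
        opad[i] = static_cast<unsigned char>(k[i] ^ 0x5c);
    }

    H inner;                                 // H((K ^ ipad) || message)
    inner(ipad, sizeof ipad);
    inner(msg, msglen);
    auto ih = static_cast<typename H::result_type>(inner);

    H outer;                                 // H((K ^ opad) || inner digest)
    outer(opad, sizeof opad);
    outer(ih.data(), ih.size());
    return static_cast<typename H::result_type>(outer);
}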
Myriachan
2014-08-25 21:53:06 UTC
Permalink
Post by 'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
On Mon, Aug 25, 2014 at 8:45 AM, Howard Hinnant
Post by Howard Hinnant
I.e. you choose a seed (key?) and use that to initialize the
HashAlgorithm. And then the hash functor runs the update and finalization
stages in the same way as previously shown.
Post by Howard Hinnant
One could also prepend a message, by updating the HashAlgorithm with a
key prior to updating it with the message. Or one could append a message
by updating the HashAlgorithm just prior to running the finalization stage.
Post by Howard Hinnant
But I am unsure if the abilities of seeding, prepending and appending
are sufficient to generate a HMAC.
I think you need a "block size" accessor on cryptographic hashers in
order to write the HMAC<> template, since HMAC involves padding the
key to the block size. You could write an HMAC_SHA256 without that by
hard-coding the block size.
Yes; it seems like block_size would be a useful property to expose. Maybe
a has_block_size as well, so that it's not necessary to do the whole member
detection mess just to check. Or, perhaps, the block_size static constexpr
member could be mandatory, with the rule that it be 1 if no other value
makes sense for a given hash function.
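The detection itself is small. A sketch (the trait name is hypothetical; void_t is spelled out by hand, since this discussion predates C++17):

#include <type_traits>

template <class...> struct voider { using type = void; };
template <class... Ts> using void_type = typename voider<Ts...>::type;

template <class H, class = void>
struct has_block_size : std::false_type { };

template <class H>
struct has_block_size<H, void_type<decltype(H::block_size)>>
    : std::true_type { };   // matches when H::block_size names something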

So is this system going to be defined as only using the low 8 bits of each
unsigned char, and only for bytewise hash functions (or bitwise hash
functions defined in a certain bit order as multiples of 8)?

Melissa
Jens Maurer
2014-08-27 22:41:14 UTC
Permalink
Hi Howard!

I'll have to ask a few more questions here. If something gets
standardized in this area, I'd like to see a roadmap for how all
platforms supported by C++ can get portable (crypto) hash values,
even if not maximally efficient on "strange" environments.

I'm assuming that (in an abstract sense) a crypto hash algorithm
(and most others) hashes a sequence of octets (i.e. 8-bit quantities).

(SHA256 and others can do odd trailing bits, but let's ignore
this for now.)

I presume I'm passing these octets to the hash algorithm using an
array of (unsigned) char, right?

Is this assumption also true for platforms where (unsigned) char
is e.g. 32-bits (DSPs)? If so, the following implementation doesn't
work there, because it makes no effort to split e.g. T==int (suppose
it's 32-bit) into four individual (unsigned) char objects.

(Oh, and on such platforms, 1 == sizeof(char) == sizeof(int).)
Post by Howard Hinnant
hash_append(Hasher& h, T const& t) noexcept
{
h(std::addressof(t), sizeof(t));
}
So, the endianness is a boolean, not a three-way type? Either you're "native" or not
seems to be all that matters, judging from the code you presented.


I think we're mixing two slightly related, but distinct aspects here.

One aspect is the fact that hashing an array of "short" values with a portable
hash (such as sha256) should give the same value across all platforms.
(The cost of endianness normalization is negligible compared to the cost
of running sha256 at all.) Maybe we need to identify those hash algorithms
that are supposed to give portable results.

So, for example, hashing this:

short a[] = { 1, 2, 3, 4 };

with sha256 should give consistent cross-platform results.
Thus, when feeding octets to the hash algorithm, we must have a
(default) idea of how to map a "short" value into (two? four? more?)
octets. That might be big or little endian, or something entirely
different (e.g. a variable-length encoding, which might actually
turn out to be the most portable).
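To illustrate the variable-length option (an LEB128-style sketch, purely illustrative, not a proposal):

#include <cstdint>

template <class Hasher>
void hash_append_varint(Hasher& h, std::uint64_t v)
{
    do
    {
        unsigned char octet = v & 0x7f;   // 7 value bits per octet
        v >>= 7;
        if (v != 0)
            octet |= 0x80;                // high bit: more octets follow
        h(&octet, 1);
    } while (v != 0);
}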


The other aspect is the fact that hash algorithms such as sha256 like
to process e.g. 32-bits (= 4 octets) at once. When reading four
octets from memory, it's helpful to be able to simply read them
into a register on "suitable" platforms and only do the endianness
(or other) conversion on the remainder of the platforms. But, on
the abstract level, this is not a configuration option, it's a
question of correctness.

[Giving the user a way to opt-out of this endianness correctness is
fine for me (emphasis: opt-out).]


I don't think a single "endian" value captures both aspects.
Post by Howard Hinnant
uhash<sha256> h1; // don't worry about endian
uhash<sha256_little> h2; // ensure scalars are little endian prior to hashing
The simple name must result in the portable hash value.
Post by Howard Hinnant
template <class Hasher, class T>
inline
std::enable_if_t
<
    is_contiguously_hashable<T, Hasher>{}
>
hash_append(Hasher& h, T const& t) noexcept
{
    h(reinterpret_cast<const unsigned char*>(std::addressof(t)), sizeof(t));
}
I agree "void *" is simpler, but I continue to believe this is a more dangerous
interface, allowing to inadvertently pass stuff with padding in it. Note
that the user is not expected to write this code, but rely on the standard
library to hash scalar types.

(If hashing comes up in Urbana-Champaign, please grab me so that I can voice
a "strongly against" for this particular aspect.)

(You can use two static_casts via "void *" if the reinterpret_cast is too
dreadful.)
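Spelled out, that alternative reads:

h(static_cast<const unsigned char*>(static_cast<const void*>(std::addressof(t))), sizeof(t));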

Jens
Miro Knejp
2014-08-28 00:51:07 UTC
Permalink
Maybe this whole issue simply shows that algorithms with strict size
requirements should not be defined in terms of char, short, int but
int8_t, int16_t, and so on. If hash_append has (by default) only
overloads for exactly sized types then the compiler should pick the
correct one when the user feeds it unsized types like int. On machines
where no int8_t exists no hash_append overload for int8_t exists.

Where the same source code is to produce equal hashes for the same data
structures on different machines, the only portable way is to use
the explicitly sized types. That is honestly the *only* way to be sure
of consistent hash values, and should maybe be added as a note somewhere.

Though I think "char" (not "signed char" or "unsigned char", as those are
3 different types) should be treated as implementation-defined, as the
standard uses this type only for character values in strings. On a
machine where char has more than 8 bits, only the implementation knows
which of these bits are representative of the value of a string
character/codepoint. Same goes for wchar_t.

Regarding the void* question, I think Howard's is_contiguously_hashable
covers this nicely. If your type has no padding in it, specialize the
trait. If it does, provide your own hash_append overload and feed it the
required members. I think a good set of predefined overloads for
hash_append would be the following (a sketch of the decomposition
follows the list):

hash_append(Hasher&, const T&) // enable if is_contiguously_hashable<T, Hasher> is true
hash_append(Hasher&, array_view<T>) // enable if hash_append(Hasher, T) is well-formed
hash_append(Hasher&, basic_string_view<Char, Traits>) // enable if hash_append(Hasher, Char) is well-formed
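The two view overloads would decompose element-wise; a sketch (with array_view spelled as a pointer/length pair, since no such standard type exists yet):

#include <cstddef>

template <class Hasher, class T>
void hash_append_range(Hasher& h, T const* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        hash_append(h, p[i]);   // defer to the element overload
}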

Having is_contiguously_hashable<T, Hasher> predefined as true for the
types [u]int[8|16|32|64]_t, char, wchar_t and char[16|32]_t allows the
implementation to select which integral types are acceptable and can
then internally provide specializations with tag dispatching. Using
unsized types like short or int would pick the proper overload depending
on what int##_t typedefs alias to.

A further alternative would be to provide a third parameter in the form
hash_append(Hasher&, const T&, valid_bits<N>), so even on architectures
without direct int8_t support one could use ints for storage and only
mask the leading N bits as relevant for the hash.

This should make any need for a hash_append(void*, size_t) overload
obsolete. The Hasher itself needs a (void* p, size_t n) overload where n
denotes the number of valid OCTETS pointed to by p. hash_append() has
then already taken care of endianness and other details by applying
Hasher's traits. The nice thing here is that if a type has no padding and
the user *decided that endianness, etc. does not matter* then enabling
is_contiguously_hashable makes hash_append() feed the entire structure
to (void*, size_t). Appropriate warning signs should be positioned
around is_contiguously_hashable to make the user aware of its positive
and negative implications.

Does this make sense?
Jens Maurer
2014-08-28 14:25:04 UTC
Permalink
Post by Miro Knejp
Maybe this whole issue simply shows that algorithms with strict size
requirements should not be defined in terms of char, short, int but
int8_t, int16_t, and so on. If hash_append has (by default) only
overloads for exactly sized types then the compiler should pick the
correct one when the user feeds it unsized types like int. On machines
where no int8_t exists no hash_append overload for int8_t exists.
If we go that route, we should use "uint8_t" etc., not the signed variants,
which have more platform freedom.

And, for a hash algorithm that operates on octets, what does it mean
to hash a uint16_t value, call it x? I think the definition should be

hash the two octets obtained by x & 0xff and (x>>8) & 0xff,
in sequence

but that should be stated explicitly, and also how the user can get
that result if he chooses to directly call the hash algorithm.
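Stated as code, that definition might read (a sketch, assuming the octet-consuming Hasher interface used elsewhere in this thread):

#include <cstdint>

template <class Hasher>
void hash_append(Hasher& h, std::uint16_t x)
{
    unsigned char octets[2] = {
        static_cast<unsigned char>(x & 0xff),         // low octet first
        static_cast<unsigned char>((x >> 8) & 0xff)   // then the high octet
    };
    h(octets, sizeof octets);
}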
Post by Miro Knejp
Where the same source code is to produce equal hashes for the same data
structures on different machines then the only portable way is to use
the explicitly sized types. That is honestly the *only* way to be sure
of consistent hash values and should maybe be added as a note somewhere.
I don't think so. If you strictly think about portably serializing
values to octets, the types on the platforms (other than the
integer/floating-point distinction) become meaningless, and you should
only think about how to represent values in a portable manner, using
(for example) a value-dependent variable-length integer representation.

(I'm not saying this is the only choice, but we shouldn't claim to provide
hash algorithms such as sha256 if the results aren't really portable.)
Post by Miro Knejp
Though I think "char" (not "signed char" or "unsigned char" as those are
3 different types) should be treated implementation-defined as the
standard uses this type only for character values in strings.
This is not exactly true. Standard filestreams use these for (possibly)
binary data, too. (Unfortunately, in my opinion.)
Post by Miro Knejp
On a
machine where char has more than 8 bit only the implementation knows
which of these bits are representative of the value of a string
character/codepoint. Same goes for wchar_t.
Note that for "unsigned char", all bits contribute to the value
(3.9.1p1 basic.fundamental). (No padding allowed.) That also
holds if "char" happens to be unsigned.
Post by Miro Knejp
Regarding the void* question, I think Howard's is_contiguously_hashable
covers this nicely. If your type has no padding in it, specialize the
trait. If it does, provide your own hash_append overload and feed it the
required members. I think a good set of predefined overloads for
hash_append would be
hash_append(Hasher&, const T&) // enable if is_contiguously_hashable<T, Hasher> is true
Other than for nerd-ness, why is it easier to partially specialize
is_contiguously_hashable<my_T,Hasher> instead of overloading

template <class Hasher>
void hash_append(Hasher& h, const my_T& value) {
    std::hash_append_contiguous(h, value);
}

?

The latter seems easier to read, and makes it easier to provide a
different answer for a different Hasher.
Post by Miro Knejp
hash_append(Hasher&, array_view<T>) // enable if hash_append(Hasher, T) is well-formed
hash_append(Hasher&, basic_string_view<Char, Traits>) // enable if hash_append(Hasher, Char) is well-formed
Yes, this is basic decomposition. We should have something like that.
(I'd be happy to omit the enable_if dance; you'll simply get an error
if T doesn't have a hash_append(), which is good.)

Question: Do you hash the length of the string or array separately, or
just the elements? In the latter case, hashing two empty strings
is indistinguishable from hashing three empty strings. This seems
undesirable for crypto-hashes.
Post by Miro Knejp
Having is_contiguously_hashable<T, Hasher> predefined as true for the
types [u]int[8|16|32|64]_t, char, wchar_t and char[16|32]_t allows the
implementation to select which integral types are acceptable and can
then internally provide specializations with tag dispatching. Using
unsized types like short or int would pick the proper overload depending
on what int##_t typedefs alias to.
So would std::hash_append() overloads that each take one of the
scalar types mentioned above. (The standard library can avoid
duplicate overloads for the uintX_t typedefs conflicting with char etc.)

I don't see why we need an is_contiguously_hashable<T, Hasher> trait.
Post by Miro Knejp
A further alternative would be to provide a third parameter in the form
hash_append(Hasher&, const T&, valid_bits<N>), so even on architectures
without direct int8_t support one could use ints for storage and only
mask the leading N bits as relevant for the hash.
For scalar T, this should be the job of the standard library implementation.
I don't see a need for valid_bits<N> with a generic "T". That said, I'm
not opposed to adding

hash_append(Hasher&, const std::bitset<N>&);

where N is divisible by 8. This allows one to pass an octet without
concerns about C++ built-in types at all.
Post by Miro Knejp
This should make any need for a hash_append(void*, size_t) overload
obsolete. The Hasher itself needs a (void* p, size_t n) overload where n
denotes the number of valid OCTETS pointed to by p.
Then it's hard to call that function in portable code, because
sizeof(int) = 1 on some platforms where "int" is 32 bits (and "char", too).
Post by Miro Knejp
hash_append() has
then already taken care of endianess and other details by applying
Hasher's traits. The nice hing here is that if a type has no padding and
the user *decided that endianess, etc. does not matter* then enabling
is_contiguously_hashable makes hash_append() feed the entire structure
to (void*, size_t).
I believe it's a user-level policy decision for his struct type T to
determine whether it can be hashed contiguously, possibly depending
on the hasher (think two hash tables keyed off on different things,
both pointing to T objects). And that policy decision is best left
to the user's overload of hash_append(T).

Jens
Miro Knejp
2014-08-28 20:42:44 UTC
Permalink
Post by Jens Maurer
Post by Miro Knejp
Maybe this whole issue simply shows that algorithms with strict size
requirements should not be defined in terms of char, short, int but
int8_t, int16_t, and so on. If hash_append has (by default) only
overloads for exactly sized types then the compiler should pick the
correct one when the user feeds it unsized types like int. On machines
where no int8_t exists no hash_append overload for int8_t exists.
If we go that route, we should use "uint8_t" etc, not the signed variants,
which have more platform freedom.
Possibly
Post by Jens Maurer
And, for a hash algorithm that operates on octets, what does it mean
to hash a uint16_t value, call it x? I think the definition should be
hash the two octets obtained by x & 0xff and (x>>8) & 0xff,
in sequence
but that should be stated explicitly, and also how the user can get
that result if he chooses to directly call the hash algorithm.
I'd say it's the job of hash_append(), provided by the implementation for
the native builtin types, to take care of this. If the Hasher has endian
requirements, x gets swapped as necessary and its contiguous
(value-representing) *octets* fed to the Hasher in order of least to most
significant (or reversed). Every introductory course to programming I
have ever witnessed talks at length about bytes and how they make up
ints and what endianness is and so forth. I consider this basic
knowledge. The standardese has to describe this in detail of course (as
far as it can, considering the C++ abstract machine), but I think for the
sake of this discussion what happens is clear. The real challenge I
think is figuring out how to define it portably for machines not
operating on multiples of 8 bits (if this is deemed relevant).
Post by Jens Maurer
Post by Miro Knejp
Where the same source code is to produce equal hashes for the same data
structures on different machines then the only portable way is to use
the explicitly sized types. That is honestly the *only* way to be sure
of consistent hash values and should maybe be added as a note somewhere.
I don't think so. If you strictly think about portably serializing
values to octets, the types on the platforms (other than the
integer/floating-point distinction) becomes meaningless, and you should
only think about how to represent values in a portable manner, using
(for example) a value-dependent variable-length integer representation.
(I'm not saying this is the only choice, but we shouldn't claim
hash algorithms such as sha256 if the results aren't really portable.)
Then I wonder who you want to put the burden on: the implementation or
the user? I do think portability is possible (certainly for machines
with 8*n bits) by only providing hash_append for selected native builtin
(maybe only unsigned) types which are either equivalent on all
architectures supporting them, or don't exist at all. At least if you
are on an architecture without these typedefs the compiler will kindly
remind you that your data structure is incompatible, instead of
producing funny hashes.
Post by Jens Maurer
Post by Miro Knejp
Though I think "char" (not "signed char" or "unsigned char" as those are
3 different types) should be treated implementation-defined as the
standard uses this type only for character values in strings.
This is not exactly true. Standard filestreams use these for (possibly)
binary data, too. (Unfortunately, in my opinion.)
Well, sadly we have the sized typedefs officially only since C++11...
Post by Jens Maurer
Post by Miro Knejp
hash_append(Hasher&, array_view<T>) // enable if hash_append(Hasher, T)
is well-formed
hash_append(Hasher&, basic_string_view<Char, Traits>) // enable if
hash_append(Hasher, Char) is well-formed
Yes, this is basic decomposition. We should have something like that.
(I'd be happy to omit the enable_if dance; you'll simply get an error
if T doesn't have a hash_append(), which is good.)
Whether it's an enable_if dance or predefined overloads doesn't really
matter as long as it does the job. This was more of an exposition.
Post by Jens Maurer
Question: Do you hash the length of the string or array separately, or
just the elements? In the latter case, hashing two empty strings
is indistinguishable from hashing three empty strings. This seems
undesirable for crypto-hashes.
This is an issue that was discussed previously and I don't know if there
was a consensus. It's nested containers where it becomes headache-inducing.
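One remedy is to append the length as well, so that two empty strings and three empty strings feed different octet sequences. A sketch, assuming the scalar overloads exist (appending the size after the elements keeps single-pass traversal possible):

#include <vector>

template <class Hasher, class T, class Alloc>
void hash_append(Hasher& h, std::vector<T, Alloc> const& v)
{
    for (auto const& t : v)
        hash_append(h, t);
    hash_append(h, v.size());   // disambiguates sequences of sequences
}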
Post by Jens Maurer
Post by Miro Knejp
Having is_contiguously_hashable<T, Hasher> predefined as true for the
types [u]int[8|16|32|64]_t, char, wchar_t and char[16|32]_t allows the
implementation to select which integral types are acceptable and can
then internally provide specializations with tag dispatching. Using
unsized types like short or int would pick the proper overload depending
on what int##_t typedefs alias to.
So would std::hash_append() overloads that each take one of the
scalar types mentioned above. (The standard library can avoid
duplicate overloads for the uintX_t typedefs conflicting with char etc.)
is_contiguously_hashable for the scalar types was just a tool for the
enable_if exposition (which may or may not be used). The implementation
should know how to turn these types into octet streams.
Post by Jens Maurer
I don't see why we need an is_contiguously_hashable<T, Hasher> trait.
As far as I understand, it's primarily for optimization and boilerplate
reduction. If all bytes of my struct contribute to its value
representation without any padding, and endianness doesn't apply, I can
shove the entire struct (or array thereof) into the hungry maw of (void*,
size_t), and as has been mentioned earlier there are CPUs with hash/CRC
instructions, so there is optimization potential present. This is usually
true for scalar types. For compound types it is upon the author of the
type to decide if this is true or not. This is obviously not a good
solution if the hash has to be communicated to the outside world where
endianness may matter. The introduction of hash_append_contiguous() may
make this trait obsolete. The only question left is whether
defining a class partial specialization or a forwarding function
overload is less effort and less prone to error by the user. But I think
the trait approach makes it easier to apply the information
transitively. Plus you get a query for static_assert to check whether
your assumptions about the types of member variables hold:

// All three types have no padding, only value bytes, and endianness
// doesn't matter. Ensured with type_traits and whatnot.
struct A { ... };
struct B { ... };
struct C
{
    A a;
    B b;
    int i;
};

// Overload solution 1
template <class Hasher>
void hash_append(Hasher& h, const A& x) { ... }
template <class Hasher>
void hash_append(Hasher& h, const B& x) { ... }
template <class Hasher>
void hash_append(Hasher& h, const C& x)
{
    // Compiler may not figure out the entire struct can go in one
    // big gulp, making the optimization unlikely.
    hash_append(h, x.a);
    hash_append(h, x.b);
    hash_append(h, x.i);
}

// Overload solution 2
template <class Hasher>
void hash_append(Hasher& h, const C& x)
{
    // There is no way to ensure this is actually correct
    // because we cannot check it for A and B.
    hash_append_contiguous(h, x);
}

// Traits solution
template <class Hasher>
struct is_contiguously_hashable<A, Hasher> : true_type { };
template <class Hasher>
struct is_contiguously_hashable<B, Hasher> : true_type { };
template <class Hasher>
struct is_contiguously_hashable<C, Hasher> : true_type
{
    // Sprinkle code with magic static_assert dust to ensure
    // our assumptions about A and B are true
    // (especially if their traits are defined in some other place).
};
Post by Jens Maurer
Post by Miro Knejp
A further alternative would be to provide a third parameter in the form
hash_append(Hasher&, const T&, valid_bits<N>), so even on architectures
without direct int8_t support one could use ints for storage and only
mask the leading N bits as relevant for the hash.
For scalar T, this should be the job of the standard library implementation.
I don't see a need for valid_bits<N> with a generic "T". That said, I'm
not opposed to adding
hash_append(Hasher&, const std::bitset<N>&);
where N is divisible by 8. This allows to pass an octet without
concerns about C++ built-in types at all.
Good point about bitset. About my motivation: consider that you have to use
uint32_t for your data type because the machine has no uint8_t, but only
the lower 8 bits of the integer are to be hashed (the integer is
abstracted in a way as to behave like an 8-bit value); the upper 24 bits
must be discarded (not masked) in order to produce identical signatures
as on some remote machine that has direct 8-bit support. This
information cannot be provided using only the builtin scalar types.
bitset is a nice solution to this.
Post by Jens Maurer
Post by Miro Knejp
This should make any need for a hash_append(void*, size_t) overload
obsolete. The Hasher itself needs a (void* p, size_t n) overload where n
denotes the number of valid OCTETS pointed to by p.
Then it's hard to call that function in portable code, because
sizeof(int) = 1 on some platforms where "int" is 32 bits (and "char", too).
I didn't specify n in terms of sizeof() but in *octets*. The hasher has to
know how to extract the octets from the pointed-to contiguous octet
stream. Forget sizeof(char); this is very carefully defined as the
number of octets pointed to by the pointer. If sizeof(int) ==
sizeof(char) == 1, int has 32 bits, and all bits shall be hashed, then
it still holds that n == 4.
Post by Jens Maurer
Post by Miro Knejp
hash_append() has
then already taken care of endianness and other details by applying
Hasher's traits. The nice thing here is that if a type has no padding and
the user *decided that endianness, etc. does not matter* then enabling
is_contiguously_hashable makes hash_append() feed the entire structure
to (void*, size_t).
I believe it's a user-level policy decision for his struct type T to
determine whether it can be hashed contiguously, possibly depending
on the hasher (think two hash tables keyed off on different things,
both pointing to T objects). And that policy decision is best left
to the user's overload of hash_append(T).
The "user-level policy" you refer to is presented by Howard as
is_contiguously_hashable. But regardless of any traits one has always
the liberty of overloading hash_append() for cases like this or call
hash_append for the relevant members manually for a case like you
describe. I see is_contiguously_hashable primarily as a hint for
optimization (and reduction of boilerplate , and compile time checking
with static_assert).
Post by Jens Maurer
Jens
Post by Miro Knejp
Regarding the void* question, I think Howard's is_contiguously_hashable covers this nicely. If your type has no padding in it, specialize the trait. If it does, provide your own hash_append overload and feed it the required members. I think a good set of predefined overloads for hash_append would be
hash_append(Hasher&, const T&) // enable if is_contiguously_hashable<T, Hasher> is true
hash_append(Hasher&, array_view<T>) // enable if hash_append(Hasher, T) is well-formed
hash_append(Hasher&, basic_string_view<Char, Traits>) // enable if hash_append(Hasher, Char) is well-formed
Having is_contiguously_hashable<T, Hasher> predefined as true for the types [u]int[8|16|32|64]_t, char, wchar_t and char[16|32]_t allows the implementation to select which integral types are acceptable and can then internally provide specializations with tag dispatching. Using unsized types like short or int would pick the proper overload depending on what int##_t typedefs alias to.
A further alternative would be to provide a third parameter in the form hash_append(Hasher&, const T&, valid_bits<N>), so even on architectures without direct int8_t support one could use ints for storage and only mask the leading N bits as relevant for the hash.
This should make any need for a hash_append(void*, size_t) overload obsolete. The Hasher itself needs a (void* p, size_t n) overload where n denotes the number of valid OCTETS pointed to by p. hash_append() has then already taken care of endianness and other details by applying Hasher's traits. The nice thing here is that if a type has no padding and the user *decided that endianness, etc. does not matter* then enabling is_contiguously_hashable makes hash_append() feed the entire structure to (void*, size_t). Appropriate warning signs should be positioned around is_contiguously_hashable to make the user aware of its positive and negative implications.
hash_append(Hasher&, const T&)
for all arithmetic T, T*, and probably enums and nullptr_t as well. Additionally the spec should provide hash_append for containers, pair, tuple, etc. Anything that we today have a std::hash<T> for, we should definitely have a std::hash_append for. And more.
(...)
As soon as we start saying something like: "hash_append only exists for a small set of types", then we have completely changed the entire design, and completely missed the point of it.
hash_append is not about programmers stuffing bytes into a hash algorithm (that's the std::lib's job). hash_append is about building a system whereby programmers can write their hashing support just once, for all hashing algorithms, and with absolutely no need for a hash_combine step.
I agree, and I was merely talking about the atomic building blocks (the
std::lib job of stuffing bytes), which depend on the
implementation/machine, Hasher endianness, and whatnot, and in which the
arithmetic Ts (and the underlying values of enums) are exposed in the
interface using explicitly sized integer types. There should of course
be further overloads for the compound std types, containers, and so on.
Post by Jens Maurer
hash_append should be as ubiquitous as swap, and operator==.
Howard
Miro
Howard Hinnant
2014-08-28 18:00:12 UTC
Permalink
Maybe this whole issue simply shows that algorithms with strict size requirements should not be defined in terms of char, short, int but int8_t, int16_t, and so on. If hash_append has (by default) only overloads for exactly sized types then the compiler should pick the correct one when the user feeds it unsized types like int. On machines where no int8_t exists no hash_append overload for int8_t exists.
Agreed. We should think of hash_append as similar to swap. swap(int8_t&, int8_t&) exists only where int8_t exists. :-) But swap(signed char&, signed char&) exists everywhere. And so should hash_append(H&, signed char).
Where the same source code is to produce equal hashes for the same data structures on different machines then the only portable way is to use the explicitly sized types. That is honestly the *only* way to be sure of consistent hash values and should maybe be added as a note somewhere.
Though I think "char" (not "signed char" or "unsigned char" as those are 3 different types) should be treated implementation-defined as the standard uses this type only for character values in strings. On a machine where char has more than 8 bit only the implementation knows which of these bits are representative of the value of a string character/codepoint. Same goes for wchar_t.
This is pretty much the role of the is_uniquely_represented<T> trait. The implementation will have to decide if a scalar is uniquely represented.

A type T is uniquely represented if for all combinations of two values of a type, say x and y, if x == y, then it must also be true that memcmp(addressof(x), addressof(y), sizeof(T)) == 0. I.e. if x == y, then x and y have the same bit pattern representation. A 2's complement int satisfies this property because every bit pattern an int can have results in a distinct value (rule 2). And there are no "padding bits" which might take on random values.
Regarding the void* question, I think Howard's is_contiguously_hashable covers this nicely. If your type has no padding in it, specialize the trait. If it does, provide your own hash_append overload and feed it the required members. I think a good set of predefined overloads for hash_append would be
hash_append(Hasher&, const T&) // enable if is_contiguously_hashable<T, Hasher> is true
hash_append(Hasher&, array_view<T>) // enable if hash_append(Hasher, T) is well-formed
hash_append(Hasher&, basic_string_view<Char, Traits>) // enable if hash_append(Hasher, Char) is well-formed
Having is_contiguously_hashable<T, Hasher> predefined as true for the types [u]int[8|16|32|64]_t, char, wchar_t and char[16|32]_t allows the implementation to select which integral types are acceptable and can then internally provide specializations with tag dispatching. Using unsized types like short or int would pick the proper overload depending on what int##_t typedefs alias to.
A further alternative would be to provide a third parameter in the form hash_append(Hasher&, const T&, valid_bits<N>), so even on architectures without direct int8_t support one could use ints for storage and only mask the leading N bits as relevant for the hash.
This should make any need for a hash_append(void*, size_t) overload obsolete. The Hasher itself needs a (void* p, size_t n) overload where n denotes the number of valid OCTETS pointed to by p. hash_append() has then already taken care of endianness and other details by applying Hasher's traits. The nice thing here is that if a type has no padding and the user *decided that endianness, etc. does not matter* then enabling is_contiguously_hashable makes hash_append() feed the entire structure to (void*, size_t). Appropriate warning signs should be positioned around is_contiguously_hashable to make the user aware of its positive and negative implications.
The implementation should provide:

hash_append(Hasher&, const T&)

for all arithmetic T, T*, and probably enums and nullptr_t as well. Additionally the spec should provide hash_append for containers, pair, tuple, etc. Anything that we today have a std::hash<T> for, we should definitely have a std::hash_append for. And more.

hash_append is the API that everyday programmers will use to build their own hash_append overloads. For example:

class Customer
{
std::string firstName_;
std::string lastName_;
int age_;
public:
// ...
template <class HashAlgorithm>
friend
void
hash_append(HashAlgorithm& h, const Customer& c)
{
using std::hash_append;
hash_append(h, c.firstName_, c.lastName_, c.age_);
}
};

And this is further composable:

class Sale
{
Customer customer_;
Product product_;
public:
// ...
template <class HashAlgorithm>
friend
void
hash_append(HashAlgorithm& h, const Sale& s)
{
using std::hash_append;
hash_append(h, s.customer_, s.product_);
}
};
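The variadic form used in these examples could be provided as follows (a
minimal sketch, assuming the single-argument overloads for the members
exist):

// Append each argument in turn, recursing until the single-argument
// overloads take over.
template <class HashAlgorithm, class T, class U, class... Rest>
void
hash_append(HashAlgorithm& h, T const& t, U const& u, Rest const&... rest)
{
    hash_append(h, t);
    hash_append(h, u, rest...);
}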

As soon as we start saying something like: "hash_append only exists for a small set of types", then we have completely changed the entire design, and completely missed the point of it.

hash_append is not about programmers stuffing bytes into a hash algorithm (that's the std::lib's job). hash_append is about building a system whereby programmers can write their hashing support just once, for all hashing algorithms, and with absolutely no need for a hash_combine step.

hash_append should be as ubiquitous as swap, and operator==.

Howard
Howard Hinnant
2014-08-28 17:41:54 UTC
Permalink
Post by Jens Maurer
Hi Howard!
Hi Jens! :-)

I suspect you intended this as a private message, but it came through public. I'm responding publicly because I think you make some good points below and I thought everyone here could benefit from your points, and my response to some of them. Hope that's ok.
Post by Jens Maurer
I'll have to ask a few more questions here. If something gets
standardized in this area, I'd like to see a roadmap how all
platforms supported by C++ can get portable (crypto) hash values,
even if not maximally efficient on "strange" environments.
I'm assuming that (in an abstract sense) a crypto hash algorithm
(and most others) hash a sequence of octets (i.e. 8-bit quantities).
(SHA256 and others can do odd trailing bits, but let's ignore
this for now.)
I presume I'm passing these octets to the hash algorithm using an
array of (unsigned) char, right?
As currently coded, there are hash_append overloads for both unsigned char, C-arrays of unsigned char, std::array<unsigned char, N>, etc. There are also hash_append overloads for all other arithmetic types, and std-defined containers. For each type, the hash_append function is responsible for deciding how that type should present itself to a generic hashing algorithm. For example an unsigned char would just say: Consume this byte!
Post by Jens Maurer
Is this assumption also true for platforms where (unsigned) char
is e.g. 32-bits (DSPs)? If so, the following implementation doesn't
work there, because it makes no effort to split e.g. T==int (suppose
it's 32-bit) into four individual (unsigned) char objects.
The hash_append infrastructure makes no assumptions on the size of a byte. However concrete hashing algorithms (such as SHA256) most certainly will make such assumptions.
Post by Jens Maurer
(Oh, and on such platforms, 1 == sizeof(char) == sizeof(int).)
Post by Howard Hinnant
hash_append(Hasher& h, T const& t) noexcept
{
    h(std::addressof(t), sizeof(t));
}
So, the endianness is a boolean, not a three-way type? Either you're "native" or not
seems all that matters, from the code you presented.
As currently coded, a hashing algorithm would set a static constexpr enum member to one of three values:

static constexpr xstd::endian endian = xstd::endian::native;
static constexpr xstd::endian endian = xstd::endian::big;
static constexpr xstd::endian endian = xstd::endian::little;

The meaning of these is to ask the hash_append overloads for scalars influenced by endian (larger than char) to convert the value from native endian to the requested endian prior to sending the bytes into the hashing algorithm. Concretely, in the order shown above:

1. Map native endian to native endian (presumably this is always a no-op).
2. Map native endian to big endian. This will be a no-op on big endian machines.
3. Map native endian to little endian. This will be a no-op on little endian machines.

Non-fingerprinting hash applications will probably always use the native mapping (i.e. they don't care about endian).
Post by Jens Maurer
I think we're mixing two slightly related, but distinct aspects here.
One aspect is the fact that hashing an array of "short" values with a portable
hash (such as sha256) should give the same value across all platforms.
(The cost of endianness normalization is negligible compared to the cost
of running sha256 at all.) Maybe we need to identify those hash algorithms
that are supposed to give portable results.
E.g., hashing
short a[] = { 1, 2, 3, 4 };
with sha256 should give consistent cross-platform results.
Thus, when feeding octets to the hash algorithm, we must have a
(default) idea how to map a "short" value into (two? four? more?)
octets. That might be big or little endian, or something entirely
different (e.g. a variable-length encoding, which might actually
turn out to be the most portable).
<nod> This is the aspect addressed by the enum above.
Post by Jens Maurer
The other aspect is the fact that hash algorithms such as sha256 like
to process e.g. 32-bits (= 4 octets) at once. When reading four
octets from memory, it's helpful to be able to simply read them
into a register on "suitable" platforms and only do the endianness
(or other) conversion on the remainder of the platforms. But, on
the abstract level, this is not a configuration option, it's a
question of correctness.
[Giving the user a way to opt-out of this endianess correctness is
fine for me (emphasis: opt-out).]
I don't think a single "endian" value captures both aspects.
Agreed. I see the second aspect above as an implementation detail of the hashing algorithm. This detail does not impact the hashing algorithm's interface, except insofar as it impacts its results. I see no motivation to "leak" this implementation detail into a standard specification, unless the standard is to specify concrete hashing algorithms. In that event we could choose any number of options, about which I really have little opinion. For example: Sha256_output_little_endian as one hashing algorithm and Sha256_output_big_endian as another. Or perhaps Sha256<endian> is another solution.

<shrug>, the hash_append proposal is hashing algorithm neutral, and does not address concrete hashing algorithms. It only addresses a technique for easily switching among concrete hashing algorithms. And this is why hash_append does have to deal with the "input endian" to the hashing algorithm, but does not address the "output endian" aspect.
Post by Jens Maurer
Post by Howard Hinnant
uhash<sha256> h1; // don't worry about endian
uhash<sha256_little> h2; // ensure scalars are little endian prior to hashing
The simple name must result in the portable hash value.
My understanding from http://en.wikipedia.org/wiki/Sha256 is that SHA256 output endian is always big.
Post by Jens Maurer
Post by Howard Hinnant
template <class Hasher, class T>
inline
std::enable_if_t
<
    is_contiguously_hashable<T, Hasher>{}
>
hash_append(Hasher& h, T const& t) noexcept
{
    h(reinterpret_cast<const unsigned char*>(std::addressof(t)), sizeof(t));
}
I agree "void *" is simpler, but I continue to believe this is a more dangerous
interface, allowing to inadvertently pass stuff with padding in it. Note
that the user is not expected to write this code, but rely on the standard
library to hash scalar types.
(If hashing comes up in Urbana-Champaign, please grab me so that I can voice
a "strongly against" for this particular aspect.)
Will do.

Howard
Post by Jens Maurer
(You can use two static_casts via "void *" if the reinterpret_cast is too
dreadful.)
Jens
Jens Maurer
2014-08-28 22:11:26 UTC
Permalink
Post by Howard Hinnant
I suspect you intended this as a private message, but it came through public.
It was intended as a public message. Let's try something different:

Hi everybody!
Post by Howard Hinnant
Post by Jens Maurer
I'll have to ask a few more questions here. If something gets
standardized in this area, I'd like to see a roadmap how all
platforms supported by C++ can get portable (crypto) hash values,
even if not maximally efficient on "strange" environments.
I'm assuming that (in an abstract sense) a crypto hash algorithm
(and most others) hash a sequence of octets (i.e. 8-bit quantities).
(SHA256 and others can do odd trailing bits, but let's ignore
this for now.)
I presume I'm passing these octets to the hash algorithm using an
array of (unsigned) char, right?
As currently coded, there are hash_append overloads for both unsigned char, C-arrays of unsigned char, std::array<unsigned char, N>, etc. There are also hash_append overloads for all other arithmetic types, and std-defined containers. For each type, the hash_append function is responsible for deciding how that type should present itself to a generic hashing algorithm.
I agree that the standard library should provide hash_append overloads for
all scalar types and standard containers, including C-style arrays.
(Btw, does hash_append on a container also hash the size, or just the
contents, i.e. the sequence of elements?)

That's not what I'm concerned about here. I'm concerned about writing a
crypto hash sum that plays nicely with the framework, and is maximally
portable both in its implementation and its result value.
Post by Howard Hinnant
For example an unsigned char would just say: Consume this byte!
A byte is not (necessarily) an octet. I'm concerned about this
particular gap.
Post by Howard Hinnant
Post by Jens Maurer
Is this assumption also true for platforms where (unsigned) char
is e.g. 32-bits (DSPs)? If so, the following implementation doesn't
work there, because it makes no effort to split e.g. T==int (suppose
it's 32-bit) into four individual (unsigned) char objects.
The hash_append infrastructure makes no assumptions on the size of a byte. However concrete hashing algorithms (such as SHA256) most certainly will make such assumptions.
I want to write a portable implementation of a hash sum that also
works on a machine where 1 == sizeof(char) == sizeof(int) (= 32 bits).

If I've understood the interface correctly, both hashing an int x and
a char c will end up with a call to my hash algorithm h like this:

h(&x, 1);
h(&c, 1);

and the interface is type-erased (i.e. uses "void*" or "unsigned char*").

On a usual platform, the calls will end up like this (ignoring endianness
for now):

h(&x, 4);
h(&c, 1);

I don't think "h" will ever be able to produce the same hash sum on both
platforms, even if specifically tailored for the particular platform.
It seems too much information is lost on the sizeof(char) == sizeof(int)
platform.

One way to address this is to split the "int" into four octets and
assign a separate "unsigned char" for each octet in the hash_append
function on the DSP-style platform. Then both calls end up as

h(&x, 4);
h(&c, 1);
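Such a splitting hash_append might look like this on the DSP-style
platform (a hypothetical sketch; the low-octet-first order is an
arbitrary convention that the platform's std::lib would have to fix):

#include <cstddef>

// Hypothetical: CHAR_BIT == 32 and sizeof(int) == 1. Split the int into
// four octets, each carried in its own unsigned char constrained to the
// portable 0..255 range, so h sees the same four octets as an
// 8-bit-byte machine would send.
template <class Hasher>
void
hash_append(Hasher& h, int x) noexcept
{
    unsigned int v = static_cast<unsigned int>(x);
    unsigned char octets[4];
    for (int i = 0; i < 4; ++i)
    {
        octets[i] = static_cast<unsigned char>(v & 0xFFu);  // one octet
        v >>= 8;
    }
    h(octets, 4);
}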
Post by Howard Hinnant
Post by Jens Maurer
So, the endianness is a boolean, not a three-way type? Either you're "native" or not
seems all that matters, from the code you presented.
static constexpr xstd::endian endian = xstd::endian::native;
static constexpr xstd::endian endian = xstd::endian::big;
static constexpr xstd::endian endian = xstd::endian::little;
1. Map native endian to native endian (presumably this is always a no-op).
2. Map native endian to big endian. This will be a no-op on big endian machines.
3. Map native endian to little endian. This will be a no-op on little endian machines.
Non-fingerprinting hash applications will probably always use the native mapping (i.e. they don't care about endian).
Yes.

It seems to me that these choices are, strictly speaking, not a property of the
(crypto) hash algorithm (that is only concerned with octets coming in), but of
the preferences / situation in which it is used. As someone else pointed out,
we're essentially defining an ephemeral serialization format for purposes of
computing the hash.

I'd like to ask:

- that (core) hash algorithm implementations such as SHA256 do not
specify the "endian" thing (it doesn't mean anything at this level) and

- that there is a config option to do scalar endian conversions if so desired.

Example:

std::uhash<std::sha256> // unportable for scalars > char
std::uhash<std::sha256, std::endian::big> // convert scalars > char to "big endian" prior to feeding octets to hash algorithm

This also supports strange VAX endianness as the native endian
convention, I believe.
Post by Howard Hinnant
Post by Jens Maurer
The other aspect is the fact that hash algorithms such as sha256 like
to process e.g. 32-bits (= 4 octets) at once. When reading four
octets from memory, it's helpful to be able to simply read them
into a register on "suitable" platforms and only do the endianness
(or other) conversion on the remainder of the platforms. But, on
the abstract level, this is not a configuration option, it's a
question of correctness.
Agreed. I see the second aspect above as an implementation detail of the hashing algorithm. This detail does not impact the hashing algorithm's interface, except insofar as it impacts its results. I see no motivation to "leak" this implementation detail into a standard specification, unless the standard is to specify concrete hashing algorithms. In that event we could choose any number of options, about which I really have little opinion. For example: Sha256_output_little_endian as one hashing algorithm and Sha256_output_big_endian as another. Or perhaps Sha256<endian> is another solution.
Well, there is a lost optimization opportunity if you're hashing an
array of unsigned int (32 bits) on a platform and with an endianness
choice that is just "right". Otherwise, you get two endianness conversions
back-to-back: One for the scalar > char thing from above, and one when
sha256 tries to form its internal 32-bit chunks.
Post by Howard Hinnant
Post by Jens Maurer
The simple name must result in the portable hash value.
Let me retract that; see details above.
Post by Howard Hinnant
My understanding from http://en.wikipedia.org/wiki/Sha256 is that SHA256 output endian is always big.
The output is a sequence of octets. It might be that this sequence is interpreted in
big endian style for (some) display purposes, but that's a minor detail.

Jens
Howard Hinnant
2014-08-29 02:58:25 UTC
Permalink
Post by Jens Maurer
Post by Howard Hinnant
I suspect you intended this as a private message, but it came through public.
Hi everybody!
:-)
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
I'll have to ask a few more questions here. If something gets
standardized in this area, I'd like to see a roadmap how all
platforms supported by C++ can get portable (crypto) hash values,
even if not maximally efficient on "strange" environments.
I'm assuming that (in an abstract sense) a crypto hash algorithm
(and most others) hash a sequence of octets (i.e. 8-bit quantities).
(SHA256 and others can do odd trailing bits, but let's ignore
this for now.)
I presume I'm passing these octets to the hash algorithm using an
array of (unsigned) char, right?
As currently coded, there are hash_append overloads for both unsigned char, C-arrays of unsigned char, std::array<unsigned char, N>, etc. There are also hash_append overloads for all other arithmetic types, and std-defined containers. For each type, the hash_append function is responsible for deciding how that type should present itself to a generic hashing algorithm.
I agree that the standard library should provide hash_append overloads for
all scalar types and standard containers, including C-style arrays.
(Btw, does hash_append on a container also hash the size, or just the
contents, i.e. the sequence of elements?)
As currently coded, the run-time-sized containers append the size of the container. This is to prevent hash collisions between:

vector<vector<int>>{{}, {1}} and vector<vector<int>>{{1}, {}},

which, if they did not include a size() in their message, would both result in the same message to the hash algorithm, and thus both generate the same hash code.

Append was chosen as opposed to prepend so that forward_list<T> would not have to traverse twice during hash_append.

Markers such as "begin container" and "end container" were also considered, but appending the size was considered to be simpler and more economical than the alternatives.
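A sketch of what this looks like for a run-time-sized container
(assuming hash_append overloads exist for the element type and for
std::size_t; illustrative only):

#include <vector>

// Hash the elements, then append the size, so that
// vector<vector<int>>{{}, {1}} and {{1}, {}} produce distinct messages.
template <class HashAlgorithm, class T, class Alloc>
void
hash_append(HashAlgorithm& h, std::vector<T, Alloc> const& v)
{
    for (T const& e : v)
        hash_append(h, e);
    hash_append(h, v.size());  // appended, not prepended (see above)
}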
Post by Jens Maurer
That's not what I'm concerned about here. I'm concerned about writing a
crypto hash sum that plays nicely with the framework, and is maximally
portable both in its implementation and its result value.
Post by Howard Hinnant
For example an unsigned char would just say: Consume this byte!
A byte is not (necessarily) an octet. I'm concerned about this
particular gap.
I think designing for a portable implementation of (for example) SHA256 for non-8-bit-byte platforms is an over-reaching goal. Forgetting hash_append, and even C++, and designing whatever API you cared for, how would you write an implementation of SHA256 that was byte-size agnostic? And how would that impact the API of SHA256?

Today, the SHA256 implementations I've seen (written in C) are simply #ifdef'd on endian, and are not portable to non-8-bit-byte platforms. The hash_append proposal is not trying to impact this status quo. I consider all of these to be HashAlgorithm details, to be worked out by the HashAlgorithm author.
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
Is this assumption also true for platforms where (unsigned) char
is e.g. 32-bits (DSPs)? If so, the following implementation doesn't
work there, because it makes no effort to split e.g. T==int (suppose
it's 32-bit) into four individual (unsigned) char objects.
The hash_append infrastructure makes no assumptions on the size of a byte. However concrete hashing algorithms (such as SHA256) most certainly will make such assumptions.
I want to write a portable implementation of a hash sum that also
works on a machine where 1 == sizeof(char) == sizeof(int) (= 32 bits).
I anxiously await your prototype.
Post by Jens Maurer
If I've understood the interface correctly, both hashing an int x and
h(&x, 1);
h(&c, 1);
and the interface is type-erased (i.e. uses "void*" or "unsigned char*").
You are correct, assuming on this platform that all bits in both int and char participate in the type's representation. OTOH, if char is padded with 24 bits of random values, the hash_append will be prohibited (by the specification) from sending those random bits to the hash algorithm. hash_append might (for example) zero all the padding bits before sending the byte to the hashing algorithm.
Post by Jens Maurer
On a usual platform, the calls will end up like this (ignoring endianness
h(&x, 4);
h(&c, 1);
Correct again.
Post by Jens Maurer
I don't think "h" will ever be able to produce the same hash sum on both
platforms, even if specifically tailored for the particular platform.
It seems too much information is lost on the sizeof(char) == sizeof(int)
platform.
<shrug> My understanding is that hash algorithms consume bytes (and sometimes bits). They don't care what the type is. If we want to enforce that the same byte representation for two different types sends different byte streams to the hashing algorithm, hash_append is the right place to make that customization.
Post by Jens Maurer
One way to address this is to split the "int" into four octets and
assign a separate "unsigned char" for each octet in the hash_append
function on the DSP-style platform. Then both calls end up as
h(&x, 4);
h(&c, 1);
The hash algorithm 'h' is going to see bytes, no matter what we do (void* or unsigned char*). Whether hashing algorithms change those bytes into octets (or words) or not is completely within their implementation.

If the committee proclaims that the "message" sent by a 32-bit int should be different from the "message" sent by a 32-bit char, so be it. The current hash_append proposal purposefully does not dictate such details. My feeling is that dictating such details is bound to adversely impact efficiency, but I am happy to see those details worked out in committee.
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
So, the endianness is a boolean, not a three-way type? Either you're "native" or not
seems all that matters, from the code you presented.
static constexpr xstd::endian endian = xstd::endian::native;
static constexpr xstd::endian endian = xstd::endian::big;
static constexpr xstd::endian endian = xstd::endian::little;
1. Map native endian to native endian (presumably this is always a no-op).
2. Map native endian to big endian. This will be a no-op on big endian machines.
3. Map native endian to little endian. This will be a no-op on little endian machines.
Non-fingerprinting hash applications will probably always use the native mapping (i.e. they don't care about endian).
Yes.
It seems to me that these choices are, strictly speaking, not a property of the
(crypto) hash algorithm (that is only concerned with octets coming in),
but the algorithm will see bytes,
Post by Jens Maurer
but of
the preferences / situation in which it is used. As someone else pointed out,
we're essentially defining an ephemeral serialization format for purposes of
computing the hash.
- that (core) hash algorithm implementations such as SHA256 do not
specify the "endian" thing (it doesn't mean anything at this level) and
do you mean aspect 1 (input), or aspect 2 (output), or both?
Post by Jens Maurer
- that there is a config option to do scalar endian conversions if so desired.
std::uhash<std::sha256> // unportable for scalars > char
std::uhash<std::sha256, std::endian::big> // convert scalars > char to "big endian" prior to feeding octets to hash algorithm
This part sounds exactly like what I'm proposing, except that the specification is applied to std::sha256 (and all hashing algorithms in general), instead of to std::uhash (and all hash functors in general).

The hash functor is the wrong place to specify endian requests because the hash functor should be as simple as possible. It is only responsible for:

1. Initializing the hash algorithm.
2. Updating the hash algorithm with a single value.
3. Finalizing the hash algorithm.

template <class T>
result_type
operator()(T const& t) const noexcept
{
    Hasher h;
    hash_append(h, t);
    return static_cast<result_type>(h);
}

With as-simple-as-possible hash functor requirements, one maximizes the ability of the programmer to create a custom hash functor to do things such as seeding and/or salting.
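For example, a salting functor built on the same three steps might look
like this (a sketch; the salted_hash name and the salt handling are
illustrative, not part of any proposal):

#include <cstdint>

template <class Hasher>
struct salted_hash
{
    using result_type = typename Hasher::result_type;
    std::uint64_t salt = 0;

    template <class T>
    result_type
    operator()(T const& t) const noexcept
    {
        Hasher h;
        hash_append(h, salt);  // fold the salt in first
        hash_append(h, t);
        return static_cast<result_type>(h);
    }
};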

On the other hand, hash_append is the perfect place to respond to such a request, as hash_append knows both the type being hashed (which may or may not have endian concerns) and the hash algorithm it is dealing with. If the hash algorithm specifies the desired input endian, this is very easily taken care of. For example:

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, char c) noexcept
{
    h(&c, 1); // never any endian concerns
}

On the other hand, for scalars > 1:

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, int i) noexcept
{
    if (HashAlgorithm::endian != endian::native)
        i = convert_native_to(i, HashAlgorithm::endian);
    h(&i, sizeof(i));
}

The above does not have to be the literal implementation. I would prefer compile-time branching instead of run-time branching on the compile-time question of endian. But I'm just trying to simplify the presentation. And note that for platforms where sizeof(int) == sizeof(char) (and all bits are part of the representation), the std::lib implementation simplifies down to:

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, char c) noexcept
{
    h(&c, 1); // never any endian concerns
}

template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, int i) noexcept
{
    h(&i, 1); // never any endian concerns
}

I.e. the implementation of the std::lib isn't portable. But that's ok. The std::lib implementors write non-portable code so that we don't have to.

If all bits *are not* part of the representation, then the std::lib implementation must mask off or zero the padding bits prior to sending a message to the hash algorithm.
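For reference, the compile-time branching preferred above could be
spelled with enable_if (a sketch; xstd::endian and convert_native_to are
the names already used in this thread):

#include <type_traits>

// Native endian: no conversion and no run-time test.
template <class HashAlgorithm>
typename std::enable_if<HashAlgorithm::endian == xstd::endian::native>::type
hash_append(HashAlgorithm& h, int i) noexcept
{
    h(&i, sizeof(i));
}

// Non-native endian: convert, then send the bytes.
template <class HashAlgorithm>
typename std::enable_if<HashAlgorithm::endian != xstd::endian::native>::type
hash_append(HashAlgorithm& h, int i) noexcept
{
    i = convert_native_to(i, HashAlgorithm::endian);
    h(&i, sizeof(i));
}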
Post by Jens Maurer
This also supports strange VAX endianness as the native endian
convention, I believe.
Agreed.
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
The other aspect is the fact that hash algorithms such as sha256 like
to process e.g. 32-bits (= 4 octets) at once. When reading four
octets from memory, it's helpful to be able to simply read them
into a register on "suitable" platforms and only do the endianness
(or other) conversion on the remainder of the platforms. But, on
the abstract level, this is not a configuration option, it's a
question of correctness.
Agreed. I see the second aspect above as an implementation detail of the hashing algorithm. This detail does not impact the hashing algorithm's interface, except insofar as it impacts its results. I see no motivation to "leak" this implementation detail into a standard specification, unless the standard is to specify concrete hashing algorithms. In that event we could choose any number of options, about which I really have little opinion. For example: Sha256_output_little_endian as one hashing algorithm and Sha256_output_big_endian as another. Or perhaps Sha256<endian> is another solution.
Well, there is a lost optimization opportunity if you're hashing an
array of unsigned int (32 bits) on a platform and with an endianness
choice that is just "right". Otherwise, you get two endianness conversions
back-to-back: One for the scalar > char thing from above, and one when
sha256 tries to form its internal 32-bit chunks.
This is precisely the optimization that my hash_append proposal has gone to great lengths to preserve. And also for std::string, std::vector<int>, std::array<int>, std::pair<int, int>, std::vector<std::pair<int, int>>, std::vector<std::tuple<int, int, int, int>>, etc., etc.
Post by Jens Maurer
Post by Howard Hinnant
Post by Jens Maurer
The simple name must result in the portable hash value.
Let me retract that; see details above.
Post by Howard Hinnant
My understanding from http://en.wikipedia.org/wiki/Sha256 is that SHA256 output endian is always big.
The output is a sequence of octets. It might be that this sequence is interpreted in
big endian style for (some) display purposes, but that's a minor detail.
If by "display purposes" you also mean transmission across a network, I agree, though for me it is a major detail.

Howard
Myriachan
2014-08-29 03:45:28 UTC
Permalink
Post by Jens Maurer
Hi Howard!
I'll have to ask a few more questions here. If something gets
standardized in this area, I'd like to see a roadmap how all
platforms supported by C++ can get portable (crypto) hash values,
even if not maximally efficient on "strange" environments.
I'm assuming that (in an abstract sense) a crypto hash algorithm
(and most others) hash a sequence of octets (i.e. 8-bit quantities).
(SHA256 and others can do odd trailing bits, but let's ignore
this for now.)
I presume I'm passing these octets to the hash algorithm using an
array of (unsigned) char, right?
Is this assumption also true for platforms where (unsigned) char
is e.g. 32-bits (DSPs)? If so, the following implementation doesn't
work there, because it makes no effort to split e.g. T==int (suppose
it's 32-bit) into four individual (unsigned) char objects.
(Oh, and on such platforms, 1 == sizeof(char) == sizeof(int).)
That's kind of the wrong way to look at SHA-256 and related
algorithms--they are actually bitwise algorithms, not bytewise. For a
certain class of hash functions, naming off the top of my head RIPEMD-160,
MD5, SHA-0/1/224/256/384/512, your message is treated as some number of
input bits. These input bits are chopped up into a block size of some
number of bits, typically 512 or 1024. The internal hash state is updated
using a compression function with the previous value of the hash state and
the input block. This repeats until there is no more input.

At the end, a single "1" bit is added to the input bits. Then "0" bits are
added until the input size has reached (BLOCKSIZE - bitsizeof(SIZEFIELD))
mod BLOCKSIZE. The SIZEFIELD is how big the length indicator is; this is
64 bits for the 512-bit blocks and 128 for the 1024-bit blocks. This
length, which is counted in bits, again pointing out that these algorithms
work with bits and not bytes, is appended, which, given the modulo,
completes the final block. The compression function is applied one last
time, and you have a result.

Most implementations simply ignore the fact that these are bitwise
algorithms. They write an extra 0x80 byte--all these algorithms use
big-endian bit order, even the little-endian byte order algorithms, like
MD5--do the zero padding, then append the byte length multiplied by 8.
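The padding arithmetic described above can be captured in a tiny helper
(a sketch for the 512-bit-block, 64-bit-length-field family such as
SHA-256; the function name is illustrative):

#include <cstdint>

// Number of '0' bits inserted after the mandatory '1' bit, so that
// message + 1 + zeros + 64-bit length is a multiple of 512 bits.
constexpr std::uint64_t
sha256_zero_pad_bits(std::uint64_t message_bits)
{
    return (448 + 512 - (message_bits + 1) % 512) % 512;
}

static_assert(sha256_zero_pad_bits(0) == 447,
              "empty message: 1 + 447 + 64 = 512 bits");
static_assert(sha256_zero_pad_bits(448) == 511,
              "no room left for the length field: padding wraps to a second block");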
Post by Jens Maurer
I think designing for a portable implementation of (for example) SHA256
for non-8-bit-byte platforms is an over-reaching goal. Forgetting
hash_append, and even C++, and designing whatever API you cared for, how
would you write an implementation of SHA256 that was byte-size agnostic?
And how would that impact the API of SHA256?
Today, the SHA256 implementations I've seen (written in C) are simply
#ifdef'd on endian, and are not portable to non-8-bit-byte platforms. The
hash_append proposal is not trying to impact this status quo. I consider
all of these to be HashAlgorithm details, to be worked out by the
HashAlgorithm author.
Most are not portable to systems with INT_MAX > UINT32_MAX, either, as I
pointed out earlier...

Melissa
Jens Maurer
2014-08-29 21:48:59 UTC
Permalink
Post by Howard Hinnant
As currently coded, the run-time-sized containers append the size of the container.
Good.
Post by Howard Hinnant
Append was chosen as opposed to prepend so that forward_list<T> would not have to traverse twice during hash_append.
Good point.
Post by Howard Hinnant
I think designing for a portable implementation of (for example) SHA256 for non-8-bit-byte platforms is an over-reaching goal. Forgetting hash_append, and even C++, and designing whatever API you cared for, how would you write an implementation of SHA256 that was byte-size agnostic? And how would that impact the API of SHA256?
I'll take that as a challenge.
Post by Howard Hinnant
Today, the SHA256 implementations I've seen (written in C) are simply #ifdef'd on endian, and are not portable to non-8-bit-byte platforms.
I agree with that.
Post by Howard Hinnant
OTOH, if char is
padded with 24 bits of random values, the hash_append will be
prohibited (by the specification) from sending those random bits to
the hash algorithm. hash_append might (for example) zero all the
padding bits before sending the byte to the hashing algorithm.
This cannot happen for "char" because "char" has no padding.
See 3.9.1p1 "all bits of the object representation participate in the value
representation."

(It would be hard to reliably zero the padding, because random noise might
pop up again any time I read the "char".)

Padding can happen for "int" and other scalars.
Post by Howard Hinnant
<shrug> My understanding is that hash algorithms consume bytes (and sometimes bits). They don't care what the type is.
My understanding is that hash algorithms consume octets (and sometimes bits).
They don't care what the type is as long as they get values in the range
0-255 (one octet) at a time. Anything else would be non-portable on
the input side, so I cannot expect portable output values.
Post by Howard Hinnant
Post by Jens Maurer
One way to address this is to split the "int" into four octets and
assign a separate "unsigned char" for each octet in the hash_append
function on the DSP-style platform. Then both calls end up as
h(&x, 4);
h(&c, 1);
The hash algorithm 'h' is going to see bytes, no matter what we do (void* or unsigned char*). Whether hashing algorithms change those bytes into octets (or words) or not is completely within their implementation.
We could say that we pass "unsigned char *" and constrain each "unsigned char" to the minimum
value range that is portably representable in an "unsigned char", i.e. 0 .. 255.
Post by Howard Hinnant
If the committee proclaims that the "message" sent by a 32-bit int should be different from the "message" sent by a 32-bit char, so be it. The current hash_append proposal purposefully does not dictate such details. My feeling is that dictating such details is bound to adversely impact efficiency, but I am happy to see those details worked out in committee.
See you there :-)
Post by Howard Hinnant
Post by Jens Maurer
- that (core) hash algorithm implementations such as SHA256 do not
specify the "endian" thing (it doesn't mean anything at this level) and
do you mean aspect 1 (input), or aspect 2 (output), or both?
Aspect 1 (input, e.g. mapping of source-level "short" to bytes).
Post by Howard Hinnant
Post by Jens Maurer
- that there is a config option to do scalar endian conversions if so desired.
std::uhash<std::sha256> // unportable for scalars > char
std::uhash<std::sha256, std::endian::big> // convert scalars > char to "big endian" prior to feeding octets to hash algorithm
This part sounds exactly like what I'm proposing, except that the specification is applied to std::sha256 (and all hashing algorithms in general), instead of to std::uhash (and all hash functors in general).
std::uhash<> is one specific hash functor that uses ADL on hash_append() to do its job.
Post by Howard Hinnant
1. Initializing the hash algorithm.
2. Updating the hash algorithm with a single value.
3. Finalizing the hash algorithm.
template <class T>
result_type
operator()(T const& t) const noexcept
{
    Hasher h;
    hash_append(h, t);
    return static_cast<result_type>(h);
}
My suggestion would lead to

template <class T>
result_type
operator()(T const& t) const noexcept
{
    ConvertToBigEndian<Hasher> h;
    hash_append(h, t);
    return static_cast<result_type>(h);
}

in one of the three specializations (the other two are similar).
The hash functor requirements remain the same (and very simple).
Post by Howard Hinnant
With as-simple-as-possible hash functor requirements, one maximizes the ability of the programmer to create a custom hash functor to do things such as seeding and/or salting.
Thinking about it, I see the endian normalization for aspect 1 (input)
as a user-level customization similar to seeding or salting. This
should not be related to something like std::sha256.
Post by Howard Hinnant
template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, char c) noexcept
{
    h(&c, 1); // never any endian concerns
}
Agreed.
Post by Howard Hinnant
template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, int i) noexcept
{
    if (HashAlgorithm::endian != endian::native)
        i = convert_native_to(i, HashAlgorithm::endian);
    h(&i, sizeof(i));
}
My suggestion is simpler: Let's have a more specialized overload:

template <class HashAlgorithm>
void
hash_append(ConvertToBigEndian<HashAlgorithm>& h, int i) noexcept
{
    i = convert_native_to_big_endian(i);
    h.base()(&i, sizeof(i));
}

No fancy traits, no template metaprogramming.
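For completeness, the ConvertToBigEndian adaptor assumed by that overload
might look like this (a sketch; only the name and the base() accessor come
from the overload above, the rest is illustrative):

#include <cstddef>

template <class Hasher>
class ConvertToBigEndian
{
    Hasher h_;
public:
    using result_type = typename Hasher::result_type;

    // Forward raw byte ranges untouched; the specialized hash_append
    // overloads do the conversions before calling this.
    void operator()(void const* p, std::size_t n) noexcept { h_(p, n); }

    Hasher& base() noexcept { return h_; }

    explicit operator result_type() noexcept
    {
        return static_cast<result_type>(h_);
    }
};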
Sure. (Not that either option would produce different code with today's optimizers.)
Post by Howard Hinnant
I.e. the implementation of the std::lib isn't portable. But that's ok. The std::lib implementors write non-portable code so that we don't have to.
Fully agreed here. However, I see this interface framework as something that
could have been user-written, so I'd like to see how it breaks for users
trying to do the same thing.
Post by Howard Hinnant
Post by Jens Maurer
Well, there is a lost optimization opportunity if you're hashing an
array of unsigned int (32 bits) on a platform and with an endianness
choice that is just "right". Otherwise, you get two endianness conversions
back-to-back: One for the scalar > char thing from above, and one when
sha256 tries to form its internal 32-bit chunks.
This is precisely the optimization that my hash_proposal has gone to great lengths to preserve.
How so? Suppose we want to hash std::vector<int> and we're on a little endian machine.
Further, suppose the "portable hash value" convention, as defined by some user
community, says you should hash an "int" value as a big-endian sequence of bytes.
So, for "aspect 1" (endian conversion while forming bytes), you'll do a byte reversal
before passing the bytes through the "void *" interface border.

Then, according to http://tools.ietf.org/html/rfc6234#page-47 (bottom part),
we'll have an endian conversion for forming 32-bit "words" from input bytes for
SHA256 again. This is "aspect 2", and beyond the "void *" interface border.

I can't see how to avoid the useless double endian conversion for this specific
case with your design.
Post by Howard Hinnant
Post by Jens Maurer
Post by Howard Hinnant
My understanding from http://en.wikipedia.org/wiki/Sha256 is that SHA256 output endian is always big.
The output is a sequence of octets. It might be that this sequence is interpreted in
big endian style for (some) display purposes, but that's a minor detail.
If by "display purposes" you also mean transmission across a network, I agree, though for me it is a major detail.
Fine. As long as there is a well-defined sequence of octets coming
out of it, I'm happy to call it duck-endian if need be.

Jens
Markus Mayer
2014-08-24 08:52:53 UTC
Permalink
Post by Howard Hinnant
Post by Markus Mayer
-Why not implement operator()?
Having a function (with a name) is more vocal (and clear) than just braces. IMHO operator() is only useful if the object will be used as a functor (as in std::less). But the signature is too uncommon to be used in any standard algorithm. But I'm willing to change if someone comes up with a good example.
The "Types Don't Know #" proposal used operator() because it made it much easier to design a type-erased hasher that could be used for pimpl types.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#pimpl
That's a good reason to add operator(). Adding it to the next version.
Post by Howard Hinnant
Post by Markus Mayer
-Why not rename 'hash_value' to result_type()?
What do you prefer?
auto result = myHash.hash_value();
or
auto result = static_cast<hash_function::result_type>(myHash);
hash_function::result_type result{myHash};
With my version it is:
hash_function::result_type result{myHash.hash_value()};

I think my solution allows everything your solution allows plus
auto result = myHash.hash_value();

So I stick with my solution (for now)...
Post by Howard Hinnant
auto is a double-edged sword. I personally love it. But I'm also finding it can be abused. Sometimes it is helpful to have the type of a variable be more explicit. But see below where it is shown that minimal code interfaces directly with hash algorithms.
Only hash functors should call hash algorithms.
I'm not sure if I get your point, but I think others will also call hash
algorithms. Think about implementing a protocol which uses a checksum
for the header and the content.

First you load it into a buffer; then you want to check whether the
checksum matches.

vector<unsigned char> buffer;
myHash(buffer.data(), buffer.size());
//compare checksum...

As you don't know how the hasher for vector works (maybe it adds the size
to the hash), you have to use the algorithm directly.
Post by Howard Hinnant
And hash functors are what everybody else calls, such as unordered containers, or clients needing the result of a sha256. The distinction between hash functors and hash algorithms is what enables end-user-clients to easily swap out hash algorithms, simply by telling the hash functor to switch hash algorithms.
Post by Markus Mayer
-Sync with 'Types Don't Know #'
Here is a 'Types Don't Know #' version of sha256 based on the implementation at http://www.aarongifford.com/computers/sha.html
class sha256
{
    SHA256_CTX state_;
public:
    static constexpr std::endian endian = std::endian::big;
    using result_type = std::array<std::uint8_t, SHA256_DIGEST_LENGTH>;

    sha256() noexcept
    {
        SHA256_Init(&state_);
    }

    void
    operator()(void const* key, std::size_t len) noexcept
    {
        SHA256_Update(&state_, static_cast<uint8_t const*>(key), len);
    }

    explicit
    operator result_type() noexcept
    {
        result_type r;
        SHA256_Final(r.data(), &state_);
        return r;
    }
};
int
main()
{
    std::uhash<sha256> h;
    auto r = h(54);
}
r has type sha256::result_type, which is a 32-byte std::array. It hashes the int 54, converted to big endian prior to feeding it to the hash (if necessary). Little endian or native endian could just as easily have been chosen. One uses native endian if one doesn't care about the results being consistent across platforms with differing endianness. Note that the endian bits are lacking from N3980. They are a later refinement, strictly to aid fingerprinting applications such as sha256.
Note that neither clients such as the main() shown above nor unordered containers use sha256 directly, except to parameterize hash functors such as uhash. The hash functors such as uhash communicate directly with the hash algorithms such as sha256. This (compile-time) level of indirection is what makes it possible to very easily and quickly switch between sha256 and sha3_512, or siphash, etc.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#seeding
std::unordered_set<MyType, randomly_seeded_hash<acme::spooky>> my_set;
Using sha256 in this seeded context would fail at compile time, since sha256 is not seedable (at least not the implementation I used).
int
main()
{
    std::uhash<sha256> h;
    auto r = h(std::string("54"));
}
int
main()
{
    std::uhash<super_sha> h;
    auto r = h(std::string("54"));
}
This is in stark contrast to today's std::hash<T> model where one would have to revisit int, std::string, and all other types, to teach them about super_sha.
Howard
Jens Maurer
2014-08-24 06:30:46 UTC
Permalink
Post by Markus Mayer
-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But as long as unsigned
char has a multiple of 8 bits, the algorithms can still be applied. So
'unsigned char' enables those architectures to implement the functions.
Architectures where 'unsigned char' is not a multiple of 8 bits will be
excluded by the proposal.
If you specify that a hash function operates on octets, i.e. 8-bit
quantities (I believe all of them do), then it seems OK to have a 9-bit
unsigned char, whose top bit will always be zero.

Conversion from object representation to octets is a separate issue.
Post by Markus Mayer
-Why not implement operator()?
Having a function (with a name) is more vocal (and clear) than just
braces. IMHO operator() is only useful if the object will be used as a
functor (as in std::less). But the signature is too uncommon to be used
in any standard algorithm. But I'm willing to change if someone comes up
with a good example.
In my opinion, a hash is a function object and should have operator().
Post by Markus Mayer
-How to handle the large state of a hash function?
Hash function can have a large internal state (64 Byte for sha2, 200
Byte for sha3) is it OK to put such objects on the stack, or do we need
to allocate them dynamically (using allocators)?
No, you've got constant state size. Putting 200 bytes on the stack
is totally OK; implementations where it isn't are free to use
another strategy.
Post by Markus Mayer
-How to hash files?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How to do
it now?
It's certainly possible to provide a generic wrapper template that converts
from input iterators to byte buffers, i.e.

template<class H, class InputIterator>
void process(H&, InputIterator first, InputIterator last);
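A sketch of such a wrapper in terms of the process_bytes interface
proposed in this thread (the chunk size is arbitrary):

#include <cstddef>
#include <fstream>
#include <iterator>

// Buffer the input and feed the hasher in chunks.
template <class H, class InputIterator>
void
process(H& h, InputIterator first, InputIterator last)
{
    unsigned char buffer[512];
    std::size_t n = 0;
    for (; first != last; ++first)
    {
        buffer[n++] = static_cast<unsigned char>(*first);
        if (n == sizeof buffer)
        {
            h.process_bytes(buffer, n);
            n = 0;
        }
    }
    if (n != 0)
        h.process_bytes(buffer, n);
}

// Hashing a file then becomes, e.g.:
//   std::ifstream in("file.bin", std::ios::binary);
//   process(h, std::istreambuf_iterator<char>(in),
//              std::istreambuf_iterator<char>());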
Post by Markus Mayer
-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt)
or a generic (templated) crc algorithm (like in boost::crc)?
Yes, please add crc32c at least. Intel has a special instruction for it
that we should make easily usable.
Post by Markus Mayer
-Add 'nothrow' where applicable
noexcept

Jens
David Krauss
2014-08-24 07:05:35 UTC
Permalink
Post by Jens Maurer
If you specify that a hash function operates on octets, i.e. 8-bit
quantities (I believe all them do), then it seems ok to have a 9-bit
unsigned char, whose top bit will always be zero.
Conversion from object representation to octets is a separate issue.
9-bit systems use the extra bit for parity anyway, and the hash would just ignore it.

Anyway, portable hashes should be defined in terms of values, not representation bytes. For text, any system will give you a well-defined octet sequence. For anything else, it's better to follow the lines of Hinnant's research than to give up and reinterpret_cast< unsigned char * >( obj ), even for simple integers.

I'm even more leery of unsigned char * than void * as a generic interface. The user needs a template to handle anything besides text, which is covered by ambiguously-signed plain char *.
Post by Jens Maurer
In my opinion, a hash is a function object and should have operator().
+1. Function objects are amenable to metaprogramming facilities. Hashes are often used as filters, so they should work with higher-order functions. A method name inside a one-function class only introduces repetition. A method name inside a generic interface implemented by various classes reserves that name from the generic implementations.
Post by Jens Maurer
Post by Markus Mayer
-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt)
or a generic (templated) crc algorithm (like in boost::crc)?
Yes, please add crc32c at least. Intel has a special instruction for it
that we should make easily usable.
A generic template still allows a hardware-accelerated specialization. Other hardware supports CRC16 in its various flavors.

I think the random number engines ([rand.eng] and [rand.predef]) set a good precedent.
Markus Mayer
2014-08-24 09:05:59 UTC
Permalink
Thanks for all your feedback. Here comes the new version...

Changes:
-Add operator()
-Make hash_value() const and return a value
-Remove question whether an allocator is needed. It is not.
-Rename 'nothrow' to 'noexcept'
-Add a rationale for operator()
-Rewrite 'Why 'unsigned char' and not 'uint8_t' for result_type?'

New open questions:
-Should we add an overload for 'signed char' and 'char' also?

Design:

class hash_function
{
public:
    typedef std::array<unsigned char, ALGORITHM_DEFINED> result_type;
    //Default constructible
    //Copyable
    //Moveable

    hash_function& process_bytes(unsigned char const *buffer,
                                 std::size_t byte_count);

    hash_function& operator()(unsigned char const *buffer,
                              std::size_t byte_count);

    void reset();

    result_type hash_value() const;
};

//I am not sure about this function yet...
template<class hash>
typename hash::result_type calculate_hash(void const *buffer,
std::size_t byte_count);
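
To illustrate the intended usage, here is a sketch (assuming
process_bytes is meant to take a pointer and a byte count):

#include <cstddef>

//Usage sketch; sha_256 is one of the classes listed below.
sha_256::result_type digest_of(const unsigned char *data, std::size_t size)
{
    sha_256 hasher;
    hasher.process_bytes(data, size); //calls can be chained
    return hasher.hash_value();
    //single-shot alternative: return calculate_hash<sha_256>(data, size);
}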

The implemented algorithms will be (the class name is given in the list):
-md5
-sha_1
-sha_224
-sha_256
-sha_384
-sha_512
-sha3_224
-sha3_256
-sha3_384
-sha3_512
-Various flavors of crc


Rationale:
-Why 'result_type'?
To be consistent with std::function. And the name fits pretty well anyway.

-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But they can often be
implemented on odd-sized architectures as well. As such architectures
often use freestanding implementations, they are also free to not
implement it.

-Why 'unsigned char' and not 'char' for result_type?
It is to prevent people from thinking that the result is text (a
string). Furthermore, a raw byte, when interpreted, is always
non-negative. Or do you interpret 0xFF as -1 when it appears in a raw
byte stream?

-Why 'process_bytes' and not 'write', 'update', ...?
Well, naming is hard. I will stick to 'process_bytes' during design, but
I'm open to suggestions.

-Why operator()?
To allow the usage of hash functions as function objects.

-Why not rename 'hash_value' to result_type()?
What do you prefer?
auto result = myHash.hash_value();
or
auto result = static_cast<hash_function::result_type>(myHash);

-Why 'sha_512' and not 'sha2_512'?
'sha_512' is the official name for the algorithm. I know it is bad, but
it is better to be consistent.

-Why not add an iterator based process_bytes?
For now I consider it too complex.

-Why not add/delete algorithm XXX?
I think these are the most common. But I am open to suggestions.

-Why not use the naming of 'N3980: Types Don't Know #'?
This is already discussed above.

Open topics:

-How to hash a file?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How to do
it now?

-Sync with 'Types Don't Know #'

-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt)
or a generic (templated) crc algorithm (like in boost::crc)?

-Add 'noexcept' where applicable

-More naming discussions

-Find a suitable header file (maybe functional or algorithm)


regards
Markus
David Krauss
2014-08-24 09:41:55 UTC
Permalink
-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt) or a generic (templated) crc algorithm (like in boost::crc)?
Please use generic templates wherever possible. Provide typedefs for the common cases. Special cases and hardware acceleration can be done in template specializations, but that should be transparent to the user.

It's better to let the library worry about a generic solution than to leave users without a paddle. There are many weird permutations of CRC in the wild. And it shouldn't be much of a burden for the library, given that Boost CRC already exists.
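
For instance, the parameterization could follow Boost.CRC (a rough
sketch; the exact template parameters are up for discussion):

#include <cstddef>
#include <cstdint>

//Sketch: one generic template; hardware-accelerated variants become
//specializations that are invisible to the user.
template<std::size_t Bits, std::uintmax_t TruncPoly,
         std::uintmax_t InitRem, std::uintmax_t FinalXor,
         bool ReflectIn, bool ReflectRem>
class crc_optimal; //implementation elided

//Typedefs for the common cases:
typedef crc_optimal<32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF, true, true>
    crc_32; //the "zip" CRC
typedef crc_optimal<16, 0x1021, 0xFFFF, 0, false, false>
    crc_ccitt;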
Jens Maurer
2014-08-24 12:31:26 UTC
Permalink
Post by Markus Mayer
class hash_function
{
typedef std::array<unsigned char, ALGORITHM_DEFINED> result_type;
//Default constructible
//Copyable
//Moveable
hash_function& process_bytes( void unsigned char *buffer, std::size_t
byte_count);
hash_function& operator() ( void unsigned char *buffer, std::size_t
byte_count);
Remove the "void".

Remove "process_bytes()", it's redundant.
Post by Markus Mayer
void reset();
result_type hash_value() const;
I've got a slight preference towards a named "result" function,
but I can also live with the explicit conversion operator.
Post by Markus Mayer
-How to hash a file?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How to do
it now?
Provide a thin wrapper template that processes input iterators
in e.g. 1024-byte chunks.
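
Something along these lines (just a sketch; all names and the chunk
size are placeholders):

#include <array>
#include <cstddef>

//Sketch: feed an input-iterator range to a hasher in fixed-size chunks.
template<class InputIt, class Hasher>
Hasher& process_chunked(InputIt first, InputIt last, Hasher& h)
{
    std::array<unsigned char, 1024> buf;
    while (first != last) {
        std::size_t n = 0;
        while (first != last && n < buf.size())
            buf[n++] = static_cast<unsigned char>(*first++);
        h.process_bytes(buf.data(), n);
    }
    return h;
}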
Post by Markus Mayer
-Should we only add some crc classes (like crc_32, crc_16 or crc_ccitt)
or a generic (templated) crc algorithm (like in boost::crc)?
I've no objection against providing a fully general CRC capability,
but I'd like to see a typedef for at least crc32c, since Intel has
a hardware instruction for this that asks for specialization.
Post by Markus Mayer
-Find a suitable header file (maybe functional or algorithm)
I'd be inclined to ask for a separate header file.

Are there any legal restrictions on secure hash functions somewhere?
There are definitely some for encryption algorithms in some places.

Jens
Markus Mayer
2014-08-24 16:05:47 UTC
Permalink
Post by Jens Maurer
Are there any legal restrictions on secure hash functions somewhere?
There are definitely some for encryption algorithms in some places.
I have not found evidence that hash functions are export restricted, but
this will have to be checked by a lawyer eventually.
Olaf van der Spek
2014-08-24 14:36:26 UTC
Permalink
Post by Markus Mayer
-Why not add/delete algorithm XXX?
I think these are the most common. But I am open to suggestions.
bcrypt or something suitable for hashing passwords?
Post by Markus Mayer
hash_value()
The 'hash' part seems a bit redundant; isn't there a standard name for
returning the value from such an object?
Myriachan
2014-08-24 15:31:33 UTC
Permalink
I think "unsigned char *" would be a confusing interface, unless the documentation explicitly stated that only the low 8 bits of each char were used as input. Remember, the MD5 and the SHA family are formally defined as a bitwise algorithm, not bytewise, so on an architecture in which CHAR_BIT > 8, does that mean every bit gets hashed, or only the low 8? Keep in mind that there are weird 12-bit microcontrollers and such, though they may be too small to have C++ and these algorithms.

The American government agency NIST actually certifies algorithms as either supporting bitwise mode or only supporting bytewise.

Even if these *are* considered to be just the low 8 bits, I suppose we assume big-endian bit order in all cases? (MD5 is big-endian bit order, little-endian byte order; the SHA-1 and -2 series is big-endian for both; no idea about SHA-3.)

By the way, I would make sure that there is no built-in way in the interface to avoid the += step at the end of an MD5/SHA-1/SHA-256 series of rounds. Otherwise, it's trivial to turn these into block ciphers, and then it'd be a legal mess.

Melissa
Markus Mayer
2014-08-24 16:38:17 UTC
Permalink
Post by Markus Mayer
-Why not add/delete algorithm XXX?
I think these are the most common. But I am open to suggestions.
bcrypt or something suitable for hashing passwords?
I have added bcrypt and scrypt.
Post by Markus Mayer
Post by Markus Mayer
hash_value()
The 'hash' part seems a bit redundant; isn't there a standard name for
returning the value from such an object?
std::future uses get(). But it does not fit that well, imho.
Thiago Macieira
2014-08-24 15:32:37 UTC
Permalink
Post by Markus Mayer
//Default constructible
//Copyable
//Moveable
Do you mean trivially for any of the above?

I don't see how the types could be trivially constructible. And if you allow
the implementations to allocate heap for the state, they can't be trivially
copyable, movable or destructible either.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
Markus Mayer
2014-08-24 16:44:41 UTC
Permalink
Post by Thiago Macieira
Post by Markus Mayer
//Default constructible
//Copyable
//Moveable
Do you mean trivially for any of the above?
I don't see how the types could be trivially constructible. And if you allow
the implementations to allocate heap for the state, they can't be trivially
copyable, movable or destructible either.
No, I don't mean trivially. Maybe they will be noexcept, which also
forbids dynamic allocation. I'm not sure if the implementors need that
much flexibility.
Markus Mayer
2014-08-24 17:16:17 UTC
Permalink
Thanks for all your feedback. Here comes the new version...

Changes:
-change signature of process_bytes and operator() (I think I will keep
both of them)
-add more algorithms
-add draft of a function to process a range given by two iterators
blockwise (process_blockwise)

Design:

class hash_function
{
public:
typedef std::array<unsigned char, ALGORITHM_DEFINED> result_type;
//Default constructible
//Copyable
//Moveable

hash_function& process_bytes( const void *buffer, std::size_t
byte_count);

hash_function& operator() ( const void *buffer, std::size_t byte_count);

void reset();

result_type hash_value() const;

};

template<class InputIt, class Func>
process_blockwise(InputIt first, InputIt last, std::size_t block_size,
Func& processing_function);

//Maybe add process_blockwise_n
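
As a usage sketch of mine (not proposal text), this helper answers the
'how to hash a file' open topic below:

#include <fstream>
#include <iterator>

//Sketch: hash a whole file; sha_256 is one of the classes listed below.
sha_256::result_type hash_file(const char *path)
{
    std::ifstream file(path, std::ios::binary);
    sha_256 hasher;
    process_blockwise(std::istreambuf_iterator<char>(file),
                      std::istreambuf_iterator<char>(),
                      1024, hasher);
    return hasher.hash_value();
}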

The implemented algorithms will be (the class name is given in the list):
-md5
-sha_1
-sha_224
-sha_256
-sha_384
-sha_512
-sha3_224
-sha3_256
-sha3_384
-sha3_512
-crc as a generic implementation with typedefs for crc_32, crc_16,
crc_32c, crc_xmodem and crc_ccitt
-bcrypt
-scrypt



Rationale:
-Why 'result_type'?
To be consistent with std::function. And the name fits pretty well anyway.

-Why 'unsigned char' and not 'uint8_t' for result_type?
Most hash algorithms work on an 8 bit basis. But they can often be
implemented on odd-sized architectures as well. As such architectures
often use freestanding implementations, they are also free to not
implement it.

-Why 'unsigned char' and not 'char' for result_type?
It is to prevent people from thinking that the result is text (a
string). Furthermore, a raw byte, when interpreted, is always
non-negative. Or do you interpret 0xFF as -1 when it appears in a raw
byte stream?

-Why 'process_bytes' and not 'write', 'update', ...?
Well, naming is hard. I will stick to 'process_bytes' during design, but
I'm open to suggestions.

-Why operator()?
To allow the usage of hash functions as function objects.

-Why not rename 'hash_value' to result_type()?
What do you prefer?
auto result = myHash.hash_value();
or
auto result = static_cast<hash_function::result_type>(myHash);

-Why 'sha_512' and not 'sha2_512'?
'sha_512' is the official name for the algorithm. I know it is bad, but
it is better to be consistent.

-Why not add an iterator based process_bytes?
For now I consider it too complex.

-Why not add/delete algorithm XXX?
I think these are the most common. But I am open to suggestions.

-Why not use the naming of 'N3980: Types Don't Know #'?
This is already discussed above.

Open topics:

-How to hash a file?
Hashing a file is quite common. As the iterator interface was removed,
there is no easy way to hash a file (using istream_iterator). How to do
it now?

-Sync with 'Types Don't Know #'

-Add 'noexcept' where applicable

-More naming discussions

-Find a suitable header file (maybe functional or algorithm)


regards
Markus
Markus Mayer
2014-08-31 17:15:21 UTC
Permalink
Thanks for your valuable feedback.

I've checked the proposal for its feasibility on different byte orders
and on odd-sized architectures. Here are my findings and fixes:
- The result type is an array of std::uint_least8_t. Each cell contains
a value between 0 and 255 (including 0 and 255). This can be
represented on every architecture. Byte order is no problem because the
result is a sequence of single bytes (see the sketch below).

- process_bytes takes an address and a count. It hashes 'count' bytes
(which have CHAR_BIT bits each) starting at the given address, in the
order they are represented in memory. Every bit of this range
contributes to the result. No byte-order conversion is performed; that
is not the job of the hash function and must be handled by an upper
layer.
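
To make this concrete, here is a sketch of mine (not proposal text) of
such an upper layer:

#include <cstdint>

//Sketch: the caller normalizes the byte order before hashing, so the
//digest matches across architectures. The & 0xFF masks matter because
//a cell may be wider than 8 bits but still only holds values 0 to 255.
template<class Hasher>
void hash_u32_big_endian(Hasher& h, std::uint32_t v)
{
    unsigned char be[4] = {
        static_cast<unsigned char>((v >> 24) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >>  8) & 0xFF),
        static_cast<unsigned char>( v        & 0xFF),
    };
    h.process_bytes(be, sizeof be);
}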

With the current proposal you are unable to hash a range that is not a
multiple of a byte. It is possible to add a process_bits(unsigned char
byte, std::size_t bit_count), but I think it is not that useful. We can
add it once we have experience with how it is actually used.


Regarding 'const void* buffer vs. const unsigned char* buffer':
I think 'const unsigned char*' is clearer, because it requires an
explicit cast. This is like a warning sign: "Warning: you have to take
care of byte order, padding and other stuff yourself!".

The 'const void*' solution is more convenient when you want to hash a
struct or a single variable.

I came to the conclusion that it depends on how common the 'interpret
this as a byte range' case is. I tried to come up with some actual
numbers, but my google-fu isn't strong enough.

In about 80% of my usage I do not have an unsigned char*. But what
about your experience?


Other changes:
- Add a constexpr block_size. It specifies the smallest buffer length,
in bytes (having CHAR_BIT bits each), needed to achieve optimal
performance. It is implementation defined, because it is a property of
the specific implementation.
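
For example (a sketch; sha_256 stands in for any of the hash classes):

#include <cstddef>

//Sketch: block_size lets callers size I/O buffers as a whole number of
//blocks, so every process_bytes call hits the optimal path.
constexpr std::size_t io_buf_size = 16 * sha_256::block_size;
unsigned char io_buf[io_buf_size];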


Open questions:
- process_bytes vs. operator(), or keep both
- const void* buffer vs. const unsigned char* buffer

class hash_function
{
public:
typedef std::array<std::uint_least8_t, ALGORITHM_DEFINED> result_type;
static constexpr std::size_t block_size = IMPLEMENTATION_DEFINED;
//Default constructible
//Copyable
//Moveable

hash_function& process_bytes( const void *buffer, std::size_t
byte_count);

hash_function& operator() ( const void *buffer, std::size_t byte_count);

void reset();

result_type hash_value() const;

};

template<class InputIt, class Hasher>
process_blockwise(InputIt first, InputIt last, Hasher& hash_fn);

//Maybe add process_blockwise_n

regards
Markus
Jens Maurer
2014-09-10 19:32:01 UTC
Permalink
Post by Markus Mayer
I think 'const unsigned char*' is clearer, because it requires an
explicit cast. This is like a warning sign: "Warning: you have to take
care of byte order, padding and other stuff yourself!".
The 'const void*' solution is more convenient when you want to hash a
struct or a single variable.
I came to the conclusion that it depends on how common the 'interpret
this as a byte range' case is. I tried to come up with some actual
numbers, but my google-fu isn't strong enough.
In about 80% of my usage I do not have an unsigned char*. But what
about your experience?
My experience is that type-safety is one of the strongest
assets of C++. Also, I dislike an interface that makes it
easy to fall into a portability trap without warning.
(Yes, it's also bad that C++ allows narrowing conversions
on integers (which is a portability issue), but fortunately,
some compilers warn about these nowadays.)

Jens
Zhihao Yuan
2014-09-10 20:40:55 UTC
Permalink
Post by Jens Maurer
Post by Markus Mayer
In about 80% of my usage I do not have an unsigned char*. But what
about your experience?
My experience is that type-safety is one of the strongest
assets of C++. Also, I dislike an interface that makes it
easy to fall into a portability trap without warning.
(Yes, it's also bad that C++ allows narrowing conversions
on integers (which is a portability issue), but fortunately,
some compilers warn about these nowadays.)
Maybe we should standardize `storageof` as an `addressof`
returning `unsigned char*` for anything contiguously hashable.
The term `storage` comes from `aligned_storage`, which
gives you an `unsigned char[]`.
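
i.e. something like this sketch (storageof is hypothetical, not
standard):

#include <memory>
#include <type_traits>

//Sketch: yields the object representation as unsigned char* for
//trivially copyable ("contiguously hashable") types only.
template<class T>
const unsigned char* storageof(const T& t) noexcept
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "contiguously hashable types only");
    return reinterpret_cast<const unsigned char*>(std::addressof(t));
}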
--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://bit.ly/blog4bsd