Discussion:
[john-dev] SHA-1 H()
Solar Designer
2015-09-02 15:20:25 UTC
Permalink
magnum, Lei -

SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().

In opencl_sha1.h we have:

#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z) bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).

The same pattern is also seen in:

[***@super src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/gpg_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/sha1_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/salted_sha_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__
#define SHA1_H(x,y,z) \
tmp[i] = vcmov((x[i]),(y[i]),(z[i])); \
tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z) \
tmp[i] = vand((x[i]),(y[i])); \
tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (which needs 2
operations instead of 3).

2. The check for __XOP__ prevents this optimization from being used on
other archs where we have non-emulated vcmov(). Currently that's NEON
and AltiVec.

While we could simply drop the check for __XOP__ once we've optimized
the expression (with emulated vcmov() it'd be 4 operations, the same
count as the current #else branch), I suggest that we don't, because
the 4 operations in the #else branch include some parallelism whereas
our vcmov() emulation does not.

So I think we should either extend the check to also test for
__ARM_NEON__ and __ALTIVEC__, or introduce some generic way of checking
for non-emulated vcmov() (e.g., a macro defined in pseudo_intrinsics.h
that would indicate that vcmov() is emulated, so we'd avoid it then).

The same applies to rawSHA1_ng_fmt_plug.c where we have:

#define R3(W, A, B, C, D, E) do { \
E = vadd_epi32(E, K); \
E = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B))); \
E = vadd_epi32(E, W); \
B = vroti_epi32(B, 30); \
E = vadd_epi32(E, vroti_epi32(A, 5)); \
} while (false)

In fact, it's even worse there: as currently written, this expands to 5
operations when vcmov() is emulated, instead of 4. I think we should
put an #if around R3 and define it in two different ways: using the more
optimal 2-operation expression when vcmov() is non-emulated, and using
the 4-operation expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please? And test on GPUs and on XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?

Thanks,

Alexander
Solar Designer
2015-09-02 15:52:25 UTC
Permalink
magnum, Lei -
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
source files and are not always optimal:

[***@super src]$ grep -r 'define .*Maj' .
./unused/sha512_kernel.cl:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./rawSHA512_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_cryptsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl/pwsafe_kernel.cl:#define Maj(x,y,z) (bitselect(y, x,(z^y)))
./opencl/pwsafe_kernel.cl:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./opencl_sha512.h: #define Maj(x,y,z) bitselect(x, y, z ^ x)
./opencl_sha512.h: #define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./cuda_cryptsha256.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./escrypt/sha256.c:#define Maj(x, y, z) ((x & (y | z)) | (y & z))
./cuda_xsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl_sha2.h:#define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha2.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
./rawSHA256_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./opencl_sha256.h: #define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha256.h: #define Maj(x, y, z) ((x & y) ^ (x & z) ^ (y & z))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_pwsafe.h:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./cuda_rawsha256.h:#define Maj(x,y,z) ( (x & y) | (z & (x | y)) )
./cuda_rawsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))

As you can see, some of these use 5 operations instead of 4, and some
use the parallelism-lacking approach with possibly emulated vcmov().

I think we should standardize on the parallelism-enabled 4-operation
expression for when there's no native bitselect() or vcmov() - in the
same way for both SHA-1 and SHA-2.

This means we should probably check for AMD, or at least for NVIDIA
Maxwell, falling back to the 4-operation expression on older NVIDIA
GPUs. (Need to check which expression gets compiled to LOP3.LUT on
Maxwell, though.)

A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments. We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z re-ordered in
different ways. We could test all 6 possible orderings in different
contexts (SHA-1 vs. SHA-256 vs. SHA-512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).

Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).

Oh, and of course on Maxwell and AVX-512 MD5's I() (and all others) is
just one operation, if the compiler manages. We should check the
generated code for our kernels on Maxwell.

Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you? Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE and/or by
emulating them with macros.

Alexander
magnum
2015-09-02 19:31:34 UTC
Permalink
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
Perhaps Ch() too:

#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))

This is 3 vs. 4 ops, right?

magnum
Solar Designer
2015-09-03 04:56:53 UTC
Permalink
Post by magnum
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.

However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism. This brings us to:

Maybe we need to adjust our emulation of vcmov() to use the form with
AND-NOT when we know that AND-NOT is available - and since we're dealing
with intrinsics, we do know and it usually is. Not only for Ch(), but
in general.

This will need to be benchmarked. Along with higher parallelism comes
higher register pressure. It is possible that optimal interleaving
factors will become lower than they are now.

Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.

Alexander
magnum
2015-09-03 09:52:47 UTC
Permalink
Post by Solar Designer
Post by magnum
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism.
Maybe the and-not one is better on some GPU then? I need to test.
Apparently GCN has ANDN and NAND. Not sure about NVIDIA. I really hope
we don't need a '(~x) & z' and a 'z & (~x)' version too? Optimizers are
usually fascinating but sometimes very disappointing.
Post by Solar Designer
Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.
But if it varies by format, we need to decide outside pseudo_intrinsics.h.

BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).

magnum
Solar Designer
2015-09-03 18:40:03 UTC
Permalink
Post by magnum
Post by Solar Designer
Post by magnum
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism.
Maybe the and-not one is better on some GPU then? I need to test.
Yes, that's possible.
Post by magnum
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.

I see you've committed this:

+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif

but I think the check for amd_gcn(DEVICE_INFO) is wrong.

And why this change? -

-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
Post by magnum
Post by Solar Designer
Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.
But if it varies by format, we need to decide outside pseudo_intrinsics.h.
We could include several versions of the macro in pseudo_intrinsics.h
and decide in the format via setting another macro (WANT_XXX) before
including pseudo_intrinsics.h.
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?

The parallelism vs. register pressure tradeoff is in fact not obviously
beneficial. But on XOP there should be a speedup from doing 1 op fewer.

Alexander
magnum
2015-09-03 19:29:37 UTC
Permalink
Post by Solar Designer
Post by magnum
Post by Solar Designer
Post by magnum
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all. But perhaps it should be
(amd_gcn(DEVICE_INFO) && (V_WIDTH < 2)) then?
Post by Solar Designer
And why this change? -
-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
I saw definite speedup for PBKDF2 and RAR iirc, and perhaps md5crypt.
But later I saw contradicting figures for other formats so I'm not sure
about this and things are in a state of flux. It might be that we should
revert to initially setting it (for Maxwell) in opencl_misc.h, and later
conditionally undefine it in certain formats.

Is bitselect() expected to always generate a LOP3.LUT? Even if it is, I
figure the optimizer just might be able to do better when given
bitselect-free code.

Besides all this, I see I introduced a bug: Now OLD_NVIDIA is defined
for Maxwell and that was not the intention. I'll fix that right away.
Post by Solar Designer
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?
AVX and AVX2. My overall feeling is SHA256 got a slight boost while
SHA512 did not and sometimes the latter got a very slight regression.
But I haven't really gone systematic yet. All my tests are very
inconclusive as of yet; the fluctuations are larger than the
boosts/regressions.

magnum
Solar Designer
2015-09-03 19:40:48 UTC
Permalink
Post by magnum
Post by Solar Designer
Post by magnum
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all.
That's the SIMD vs. SIMT confusion again.

When talking ISA level:

By scalar, I mean the tiny scalar unit that is normally used for control
only. By vector, I mean the SIMD units.

Per the generated assembly code, there are no ANDN and NAND instructions
for the SIMD units at all. Trying to Google what their likely mnemonics
would be returns no hits. I think they just don't exist.

And it does not matter whether the kernel is vectorized or not. It uses
those same vector instructions either way. If vectorized, it gets
interleaved instructions, e.g. phpass-opencl:

v_add_i32 v43, vcc, v36, v43 // 00003D78: 4A565724
v_add_i32 v44, vcc, v37, v44 // 00003D7C: 4A585925
v_add_i32 v45, vcc, v38, v45 // 00003D80: 4A5A5B26
v_add_i32 v46, vcc, v35, v46 // 00003D84: 4A5C5D23
- v_not_b32 v51, v28 // 00003D88: 7E666F1C
- v_not_b32 v52, v29 // 00003D8C: 7E686F1D
- v_not_b32 v53, v30 // 00003D90: 7E6A6F1E
- v_not_b32 v54, v27 // 00003D94: 7E6C6F1B
- v_or_b32 v51, v43, v51 // 00003D98: 3866672B
- v_or_b32 v52, v44, v52 // 00003D9C: 3868692C
- v_or_b32 v53, v45, v53 // 00003DA0: 386A6B2D
- v_or_b32 v54, v46, v54 // 00003DA4: 386C6D2E
+ v_bfi_b32 v51, v28, v43, -1 // 00003D88: D2940033 0306571C
+ v_bfi_b32 v52, v29, v44, -1 // 00003D90: D2940034 0306591D
+ v_bfi_b32 v53, v30, v45, -1 // 00003D98: D2940035 03065B1E
+ v_bfi_b32 v54, v27, v46, -1 // 00003DA0: D2940036 03065D1B
v_xor_b32 v51, v36, v51 // 00003DA8: 3A666724
v_xor_b32 v52, v37, v52 // 00003DAC: 3A686925
v_xor_b32 v53, v38, v53 // 00003DB0: 3A6A6B26
v_xor_b32 v54, v35, v54 // 00003DB4: 3A6C6D23

(This also shows the effect of my MD5_I optimization.)
Post by magnum
But perhaps it should be (amd_gcn(DEVICE_INFO) && (V_WIDTH < 2)) then?
No.
Post by magnum
Post by Solar Designer
And why this change? -
-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
I saw definite speedup for PBKDF2 and RAR iirc, and perhaps md5crypt.
But later I saw contradicting figures for other formats so I'm not sure
about this and things are in a state of flux. It might be that we should
revert to initially setting it (for Maxwell) in opencl_misc.h, and later
conditionally undefine it in certain formats.
Is bitselect() expected to always generate a LOP3.LUT? Even if it is, I
figure the optimizer just might be able to do better when given
bitselect-free code.
Yes, we should review the generated code. It is unclear what source
code is more likely to result in optimal use of LOP3.LUT.
Post by magnum
Besides all this, I see I introduced a bug: Now OLD_NVIDIA is defined
for Maxwell and that was not the intention. I'll fix that right away.
Yes. Thanks.
Post by magnum
Post by Solar Designer
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?
AVX and AVX2. My overall feeling is SHA256 got a slight boost while
SHA512 did not and sometimes the latter got a very slight regression.
But I haven't really gone systematic yet. All my tests are very
inconclusive as of yet; the fluctuations are larger than the
boosts/regressions.
That's not surprising. I only expect much difference on XOP.

Alexander
magnum
2015-09-04 10:59:23 UTC
Permalink
Post by Solar Designer
Post by magnum
Post by Solar Designer
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all.
That's the SIMD vs. SIMT confusion again.
Oh, right :confused:. That macro wouldn't have any effect though, since
GCN has USE_BITSELECT. But I have now dropped it for GCN. Actually, CPUs
also get USE_BITSELECT, so any HAVE_ANDNOT sections are currently
unused, but I'd like to keep them there for the future.

magnum
magnum
2015-09-02 23:27:05 UTC
Permalink
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
(...)
As you can see, some of these use 5 operations instead of 4, and some
use the parallelism-lacking approach with possibly emulated vcmov().
I think we should standardize on the parallelism-enabled 4 operation
expression for when there's no native bitselect() or vcmov() - for both
SHA-1 and SHA-2 in the same way.
https://github.com/magnumripper/JohnTheRipper/commit/c5f50a9
https://github.com/magnumripper/JohnTheRipper/commit/5916a57

All done (I changed some 4-op Ch() to 3-op as well). I mostly see slight
boosts but some formats may show regression (they fluctuate). I
committed this but need to test more, and on other hardware than my laptop.
Post by Solar Designer
A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments. We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z re-ordered in
different ways. We could test all 6 possible orderings in different
contexts (SHA-1 vs. SHA-256 vs. SHA-512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).
Definitely, I've seen silly boosts/regressions just from doing that.
It's annoying that the compiler can't figure it out for us - especially
if it turns out that, e.g., different GPUs like different orderings.
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
Would you care to explain what it does/outputs or do I need to reverse
it? I don't quite get it.

magnum
Solar Designer
2015-09-03 19:46:37 UTC
Permalink
Post by magnum
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
Would you care to explain what it does/outputs or do I need to reverse
it? I don't quite get it.
I'm sorry for the confusing code; it just evolved that way. Hopefully
the new revision is slightly better in that its output is easier to use,
and it helps that I posted a message in here with an example, even
though the code itself might be even more confusing.

Alexander
Solar Designer
2015-09-03 15:15:29 UTC
Permalink
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. [...] I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
I've enhanced the program, and had better luck today:

$ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u
165
sel(ff, 55, 0f) = f5; 33 ^ f5 = c6;
sel(ff, 55, 0f) = f5; f5 ^ 33 = c6;

#define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y))

To remind, the original was:

#define I(x, y, z) ((y) ^ ((x) | ~(z)))

I think it's the first time MD5 I() has been shown to be implementable
with only 2 operations on architectures without OR-NOT (and without
XNOR, and of course also without "ternary" logic instructions such as
Maxwell's or AVX-512's, which would turn this into 1 operation with no
effort).

Now that I think of it, the expression is actually very simple and I
should have been able to arrive at it without a program. bitselect()
with the all-ones constant is directly usable to implement OR-NOT. :-)
Post by Solar Designer
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
It does now, and it also tests constants as you can see. New version
attached. And here's a table produced with it:

$ ./search3.sh
SEL XNOR ORN ANDN COUNT MD5_I
yes yes yes yes 190 yes
yes yes yes no 190 yes
yes yes no yes 190 yes
yes yes no no 178 yes
yes no yes yes 177 yes
yes no yes no 177 yes
yes no no yes 177 yes
yes no no no 165 yes
no yes yes yes 144 yes
no yes yes no 114 yes
no yes no yes 114 yes
no yes no no 72 no
no no yes yes 131 yes
no no yes no 95 yes
no no no yes 95 no
no no no no 59 no

This shows the number of different truth tables possible for 3 inputs
with at most 2 operations on different architectures. (It is assumed
that AND, OR, NOT and constants are always available, so only possible
instruction set extensions are listed in the table.) The theoretical
maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic
instructions) is 256.

The last column shows whether MD5's I() is implementable with 2
operations or not. Similar tables can be generated for other
expressions, such as those found in other MD4/MD5, SHA-1, and SHA-2
basic functions - but for those we readily knew seemingly optimal
expressions using bitselect(), so I focused on MD5's I() for now.

Unfortunately, the program became rather long. I'm sure it can be
greatly simplified - and perhaps it should in fact be rewritten and
tested against this version in order for us to have greater confidence
it actually finds all possible truth tables rather than only a (very
large) subset.

As to machine code improvements from this change to I(), here's GCN
assembly from our md5crypt kernel. Before the change:

v_not_b32 v5, v7 // 00003DA8: 7E0A6F07
v_or_b32 v5, v1, v5 // 00003DAC: 380A0B01
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901

After the change (surprisingly, register allocation and offsets look
unchanged):

v_bfi_b32 v5, v7, v1, -1 // 00003DA8: D2940005 03060307
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901

That's 7 instructions instead of 8. However, there's no code size
reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so
even when no immediate value operand is used).

The effect of this will need to be tested across multiple OpenCL kernels
and pieces of C+intrinsics code across multiple architectures. Besides
GPUs, also on AMD CPUs with XOP.

Alexander
Solar Designer
2015-09-03 20:06:59 UTC
Permalink
(post body hidden by the list archive)
Lukas Odzioba
2015-09-03 20:42:05 UTC
Permalink
Post by Solar Designer
For md5crypt-opencl, we're getting some slowdown - but that kernel is
weird (as discussed elsewhere, it currently uses global memory in its
inner loop). We need to optimize it some more, then re-test this change.
I did not notice a significant slowdown; after a couple of tests, the
results seem to be within measurement error, and the max seems to be
the same 1851K c/s.
When we reduce the memory bottleneck, it should work as expected.
Solar Designer
2015-09-03 21:02:56 UTC
Permalink
(post body hidden by the list archive)
Lukas Odzioba
2015-09-03 21:40:18 UTC
Permalink
Post by Solar Designer
Is this --test speed on a lower clocked Tahiti?
Yes it is -test result on my 7970, I did not modify clocks.
aticonfig --odgc shows 1050 MHz at peak.

Thanks,
Lukas
Solar Designer
2015-09-04 07:27:06 UTC
Permalink
Post by Lukas Odzioba
Post by Solar Designer
Is this --test speed on a lower clocked Tahiti?
Yes it is -test result on my 7970, I did not modify clocks.
aticonfig --odgc shows 1050 MHz at peak.
I guess you have a different Catalyst version, then. I am getting
~2150K for --test on super's -dev=2 now, so 1050 MHz with 15.7.

Alexander
Solar Designer
2015-09-04 08:00:33 UTC
Permalink
Post by Solar Designer
#define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y))
[...]
Post by Solar Designer
Now that I think of it, the expression is actually very simple and I
should have been able to arrive at it without a program. bitselect()
with the all-ones constant is directly usable to implement OR-NOT. :-)
SEL XNOR ORN ANDN COUNT MD5_I
[...]
Post by Solar Designer
yes no yes yes 177 yes
yes no yes no 177 yes
yes no no yes 177 yes
yes no no no 165 yes
Since one SEL is usable to implement ORN, there should be no increase in
the number of different functions achieved with the addition of ORN to
the instruction set on top of SEL. Yet the table shows an increase.
I think the same holds for ANDN. I think the correct number for SEL
alone should thus be at least 177.

This indicates that the program still does not try all combinations.
I think the problem might be that it doesn't try enough constants along
with SELs. While op() can return constants, uses of op() count towards
the number of operations, which is limited to 2, and there's just one
other place where constants are added (and one more where they might get
copied, but in limited ways).

The program should be re-designed, or some existing tool should be used.
I just felt it'd be quicker and more transparent to write the program
from scratch than to (re)learn how to use an existing tool.

Alexander
Lei Zhang
2015-09-08 07:16:11 UTC
Permalink
Post by Solar Designer
Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you?
I'll look into that.
Post by Solar Designer
Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.
I've been following the news of Skylake. It's disappointing that the desktop product line won't support AVX-512; only Xeon and Xeon Phi will have it, and those haven't launched yet. For the moment, Intel SDE is the only option...


Lei
Lei Zhang
2015-09-14 08:33:55 UTC
Permalink
Post by Lei Zhang
Post by Solar Designer
Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you?
I'll look into that.
Post by Solar Designer
Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.
I've been following the news of Skylake. It's disappointing that the desktop product line won't support AVX-512; only Xeon and Xeon Phi will have it, and those haven't launched yet. For the moment, Intel SDE is the only option...
Ok, I've now introduced the ternary logic instruction to JtR and got it to work on SDE.

In case it's helpful, here are some benchmark figures of JtR running on SDE:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 312712 c/s real, 309647 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 264240 c/s real, 264240 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 130772 c/s real, 129490 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 70304 c/s real, 70304 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 30875 c/s real, 30875 c/s virtual


BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.


Lei
Solar Designer
2015-09-14 12:27:25 UTC
Permalink
Post by Lei Zhang
Ok, I've now introduced the ternary logic instruction to JtR and got it to work on SDE.
Cool. Thanks!
Yes, can you show benchmarks before/after the addition of ternary logic
instructions? And, did you fully make use of them (turning all of the
3-input basic functions of MD4/MD5/SHA-1/SHA-2 into single instructions)
or only to define vcmov() for now? You should do both, so that we also
have vcmov() available for use where it's sufficient (e.g., MD5's F and
G) and in possible future code where we might not introduce proper
ternary logic expressions right away.
Post by Lei Zhang
BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.
You mean the program's interactive response time, not the c/s rates.
The c/s rates suggest there's pure emulation with no JIT. But that's
fine for our purposes.

Alexander
Lei Zhang
2015-09-14 14:16:54 UTC
Permalink
Post by Solar Designer
And, did you fully make use of them (turning all of the
3-input basic functions of MD4/MD5/SHA-1/SHA-2 into single instructions)
or only to define vcmov() for now?
I turned all 3-input functions into a single TERNLOG instruction each, except those already using a single CMOV (I thought one CMOV was good enough, but forgot it might be emulated). Now that you've mentioned it, I also used TERNLOG to emulate CMOV. Here are the latest results:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 219184 c/s real, 219184 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 138917 c/s real, 140293 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 99216 c/s real, 99216 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 48839 c/s real, 49328 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 22019 c/s real, 22019 c/s virtual

Compared to the previous figures (please refer to my last message), using TERNLOG to emulate CMOV makes JtR slower on SDE. Maybe SDE's emulation of TERNLOG is just not efficient.


And here's the results without using any TERNLOG instructions:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 444356 c/s real, 448800 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 225172 c/s real, 227424 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 212784 c/s real, 212784 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 63413 c/s real, 63413 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 27440 c/s real, 27168 c/s virtual

I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
Post by Solar Designer
Post by Lei Zhang
BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.
You mean the program's interactive response time, not the c/s rates.
That's exactly what I meant :D


Lei
Solar Designer
2015-09-14 14:20:35 UTC
Permalink
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.

Will you submit a pull request?

Thanks!

Alexander
Lei Zhang
2015-09-14 15:01:19 UTC
Permalink
Post by Solar Designer
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.
Will you submit a pull request?
Yeah, I already sent one, but one of the formats failed the online tests. Still working on it.


Lei
magnum
2015-09-14 17:03:57 UTC
Permalink
Post by Lei Zhang
Post by Solar Designer
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.
Will you submit a pull request?
Yeah, I already sent one, but one of the formats failed the online tests. Still working on it.
That failure was unrelated. The PR is merged now, good stuff!

magnum
Solar Designer
2015-09-14 17:40:47 UTC
Permalink
Post by magnum
That failure was unrelated. The PR is merged now, good stuff!
Yeah, looks good to me.

A minor detail, in this piece:

#if __AVX512F__
#undef vcmov
#define vcmov(x, y, z) vternarylogic(x, y, z, 0xE4)
#define vternarylogic _mm512_ternarylogic_epi32
#endif

I think we need to add:

#undef VCMOV_EMULATED

right after the "#undef vcmov". Since we have a one-instruction
implementation, it is as good as a native, non-emulated vcmov.

Alexander
magnum
2015-09-14 20:39:40 UTC
Permalink
Post by Solar Designer
Post by magnum
That failure was unrelated. The PR is merged now, good stuff!
Yeah, looks good to me.
#if __AVX512F__
#undef vcmov
#define vcmov(x, y, z) vternarylogic(x, y, z, 0xE4)
#define vternarylogic _mm512_ternarylogic_epi32
#endif
#undef VCMOV_EMULATED
right after the "#undef vcmov". Since we have a one-instruction
implementation, it is as good as a native, non-emulated vcmov.
Damn good catch. Fixed now.

BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts? Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?

magnum
Solar Designer
2015-09-14 21:06:41 UTC
Permalink
Post by magnum
BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts?
Yes, I thought of this too. We might want to check the generated code
first (it might already be using LOP3.LUT everywhere it should), or we
could just do the inline asm right away to ensure we'll always have
LOP3.LUT there no matter how the compiler might be changed.
Post by magnum
Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?
I saw no drawbacks from using inline PTX asm, since instruction
scheduling is performed in the PTX to ISA translation anyway.

This is very different from inline asm in C code compiled for a CPU,
where using inline asm for tiny pieces of code (such as for individual
instructions) breaks the C compiler's instruction scheduling.

Alexander
magnum
2015-10-06 00:32:17 UTC
Permalink
Post by Solar Designer
Post by magnum
BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts?
Yes, I thought of this too. We might want to check the generated code
first (it might already be using LOP3.LUT everywhere it should), or we
could just do the inline asm right away to ensure we'll always have
LOP3.LUT there no matter how the compiler might be changed.
I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
it's not enabled anywhere yet: I only tested md5crypt so far and it got
about 5% performance loss. I also tried only using it for one function
at a time but any of them results in performance loss - even F and G
which are both pure bitselects otherwise. I was expecting no difference
at all, at worst.
Post by Solar Designer
Post by magnum
Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?
I saw no drawbacks from using inline PTX asm, since instruction
scheduling is performed in the PTX to ISA translation anyway.
This is very different from inline asm in C code compiled for a CPU,
where using inline asm for tiny pieces of code (such as for individual
instructions) breaks the C compiler's instruction scheduling.
Something did not end up well. I'll compare resulting PTX and ISA and
try to figure out what happens.

magnum
magnum
2015-10-06 23:47:06 UTC
Permalink
Post by magnum
I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
it's not enabled anywhere yet: I only tested md5crypt so far and it got
about 5% performance loss. I also tried only using it for one function
at a time but any of them results in performance loss - even F and G
which are both pure bitselects otherwise. I was expecting no difference
at all, at worst.
Here's a PTX diff with *only* F changed from bitselect() to inline asm
(I replaced all register numbers to <num> for simpler diff):

@@ -190,142 +190,130 @@
add.s32 %r<num>, %r<num>, -117830708;
shf.l.wrap.b32 %r<num>, %r<num>, %r<num>, 12;
add.s32 %r<num>, %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, %r<num>;
- not.b32 %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, -271733879;
- or.b32 %r<num>, %r<num>, %r<num>;
+ mov.u32 %r<num>, -271733879;
+ // inline asm
+ lop3.b32 %r<num>, %r<num>, %r<num>, %r<num>, 228;
+ // inline asm
ld.local.u32 %r<num>, [%rd4+72];
add.s32 %r<num>, %r<num>, %r<num>;

So if I read it right we replace "and, not, and immediate, or" with "mov
immediate, lop3". I can't see why that would decrease speed by 1%.
Even if the version with no inline PTX does end up as LOP3 (it should) -
why does the explicit version get slower?

Since we don't have CUDA 7.5 installed on super I can't look at the
resulting ISA - ptxas won't assemble this one, for some reason not even
the version without inline lop3.lut. It does assemble some other
kernels, and I have seen separate logic instructions in PTX end up as
LOP3 in the ISA. But for this comparison I'll need to continue my
digging somewhere else, later.

magnum
magnum
2015-09-02 18:19:02 UTC
Permalink
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().
#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
#define F3(x, y, z) bitselect(x, y, (z) ^ (x))
and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).
(...)
magnum, will you take care of this, please? And test on GPUs and on XOP.
Good find, I totally missed this one. I'm glad I moved dozens of code
copies into opencl_sha1.h and similar headers: much less code to
maintain. I'll look into it (including the stuff I snipped from my quote).

magnum
magnum
2015-09-04 10:37:22 UTC
Permalink
Post by Solar Designer
#if __XOP__
#define SHA1_H(x,y,z) \
tmp[i] = vcmov((x[i]),(y[i]),(z[i])); \
tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z) \
tmp[i] = vand((x[i]),(y[i])); \
tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif
TL;DR: The changes were good but no big deal.

Pre and post 5916a57, here's Bull, OMP, complete list (of SHA formats
that can run OMP):

$ ../run/relbench -v o2 n2 | grep Ratio | sort
Ratio: 0.95438 real, 0.95335 virtual xsha, Mac OS X 10.4 - 10.6:Only one
salt
Ratio: 0.96187 real, 0.96203 virtual Raw-SHA512:Raw
Ratio: 0.98151 real, 0.98330 virtual SSHA512, LDAP:Only one salt
Ratio: 0.98306 real, 0.99574 virtual HMAC-SHA224:Many salts
Ratio: 0.98649 real, 0.98802 virtual sapg, SAP CODVN F/G (PASSCODE):Many
salts
Ratio: 0.98959 real, 0.99234 virtual xsha512, Mac OS X 10.7:Only one salt
Ratio: 0.98996 real, 1.00000 virtual Drupal7, $S$ (x16385):Raw

Manually re-testing Drupal7 (SHA-512) a couple of times shows it
probably got neither a boost nor a regression (or perhaps a very
slight boost: same real speed, virtual boosted by one (1) c/s, pretty
much consistently).

Ratio: 0.99001 real, 0.99875 virtual SybaseASE, Sybase ASE:Only one salt
Ratio: 0.99010 real, 0.99754 virtual Blackberry-ES10 (101x):Raw
Ratio: 0.99010 real, 1.00000 virtual Fortigate, FortiOS:Many salts
Ratio: 0.99050 real, 0.99650 virtual LastPass, sniffed sessions:Only one
salt
Ratio: 0.99055 real, 0.99534 virtual LastPass, sniffed sessions:Many salts
Ratio: 0.99085 real, 1.00077 virtual aix-ssha512, AIX LPA {ssha512}:Raw
Ratio: 0.99294 real, 0.99320 virtual HMAC-SHA256:Many salts
Ratio: 0.99383 real, 0.99884 virtual PBKDF2-HMAC-SHA256:Raw

Manually re-testing PBKDF2-HMAC-SHA256 a couple of times shows neither
a boost nor a regression.

Ratio: 0.99415 real, 0.99611 virtual Blockchain, My Wallet (x10):Raw
Ratio: 0.99504 real, 0.99646 virtual HMAC-SHA512:Only one salt
Ratio: 0.99594 real, 1.00429 virtual Fortigate, FortiOS:Only one salt
Ratio: 0.99615 real, 0.99667 virtual Raw-SHA512-ng:Raw
Ratio: 0.99617 real, 1.00486 virtual Salted-SHA1:Only one salt
Ratio: 0.99625 real, 0.99634 virtual xsha512, Mac OS X 10.7:Many salts
Ratio: 0.99913 real, 0.99918 virtual SSHA512, LDAP:Many salts
Ratio: 1.00000 real, 0.97838 virtual HMAC-SHA224:Only one salt
Ratio: 1.00000 real, 0.99748 virtual RAR5:Raw
Ratio: 1.00000 real, 1.00000 virtual aix-ssha256, AIX LPA {ssha256}:Raw
Ratio: 1.00000 real, 1.00000 virtual HMAC-SHA512:Many salts
Ratio: 1.00000 real, 1.00000 virtual keyring, GNOME Keyring:Raw
Ratio: 1.00000 real, 1.00000 virtual mssql12, MS SQL 2012/2014:Many salts
Ratio: 1.00000 real, 1.00000 virtual sha512crypt, crypt(3) $6$
(rounds=5000):Raw
Ratio: 1.00000 real, 1.00118 virtual pwsafe, Password Safe:Raw
Ratio: 1.00000 real, 1.00142 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X
10.8+:Raw
Ratio: 1.00000 real, 1.00154 virtual sha256crypt, crypt(3) $5$
(rounds=5000):Raw
Ratio: 1.00000 real, 1.00230 virtual eCryptfs (65536x):Raw
Ratio: 1.00000 real, 1.00233 virtual lp, LastPass offline:Raw
Ratio: 1.00000 real, 1.00233 virtual rar, RAR3 (4 characters):Raw
Ratio: 1.00000 real, 1.00250 virtual SybaseASE, Sybase ASE:Many salts
Ratio: 1.00143 real, 1.00143 virtual tc_aes_xts, TrueCrypt AES256_XTS:Raw
Ratio: 1.00153 real, 1.00000 virtual HMAC-SHA384:Many salts
Ratio: 1.00238 real, 1.00115 virtual Raw-SHA256:Raw
Ratio: 1.00285 real, 0.99916 virtual HMAC-SHA1:Only one salt
Ratio: 1.00286 real, 1.00286 virtual tc_sha512, TrueCrypt AES256_XTS:Raw
Ratio: 1.00536 real, 1.01673 virtual saph, SAP CODVN H (PWDSALTEDHASH)
(SHA1x1024):Many salts
Ratio: 1.00539 real, 1.00566 virtual Raw-SHA224:Raw
Ratio: 1.00587 real, 1.00583 virtual Oracle12C:Raw
Ratio: 1.00897 real, 1.01051 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Only
one salt
Ratio: 1.00931 real, 0.99772 virtual 7z, 7-Zip (512K iterations):Many salts
Ratio: 1.00965 real, 1.00000 virtual Django (x10000):Raw
Ratio: 1.00981 real, 1.02342 virtual keychain, Mac OS X Keychain:Raw
Ratio: 1.00996 real, 1.02035 virtual mscash2, MS Cache Hash 2 (DCC2):Raw
Ratio: 1.00997 real, 0.99876 virtual HMAC-SHA384:Only one salt
Ratio: 1.01000 real, 1.00000 virtual Bitcoin:Raw
Ratio: 1.01001 real, 1.00000 virtual Raw-SHA384:Raw
Ratio: 1.01005 real, 1.00457 virtual HMAC-SHA256:Only one salt
Ratio: 1.01193 real, 1.01197 virtual Salted-SHA1:Many salts
Ratio: 1.01314 real, 1.01314 virtual Citrix_NS10, Netscaler 10:Many salts
Ratio: 1.01546 real, 1.01677 virtual saph, SAP CODVN H (PWDSALTEDHASH)
(SHA1x1024):Only one salt
Ratio: 1.01617 real, 1.01439 virtual sha1crypt, NetBSD's sha1crypt:Raw
Ratio: 1.01626 real, 1.00000 virtual cloudkeychain, 1Password Cloud
Keychain:Raw
Ratio: 1.01628 real, 1.01764 virtual xsha, Mac OS X 10.4 - 10.6:Many salts
Ratio: 1.01634 real, 1.01507 virtual aix-ssha1, AIX LPA {ssha1}:Raw
Ratio: 1.01648 real, 1.01506 virtual sxc, StarOffice .sxc:Raw
Ratio: 1.01661 real, 1.01036 virtual sapg, SAP CODVN F/G (PASSCODE):Only
one salt
Ratio: 1.01695 real, 1.02453 virtual krb5-18, Kerberos 5 db etype 18:Raw
Ratio: 1.01699 real, 1.01871 virtual krb5pa-sha1, Kerberos 5 AS-REQ
Pre-Auth etype 17/18:Raw
Ratio: 1.01699 real, 1.02115 virtual EFS:Raw
Ratio: 1.01818 real, 1.01938 virtual STRIP, Password Manager:Raw
Ratio: 1.01858 real, 0.99275 virtual 7z, 7-Zip (512K iterations):Only
one salt
Ratio: 1.01898 real, 1.02020 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Many salts
Ratio: 1.01924 real, 1.01681 virtual dmg, Apple DMG:Raw
Ratio: 1.01924 real, 1.02404 virtual OpenBSD-SoftRAID (8192 iterations):Raw
Ratio: 1.01935 real, 1.02062 virtual EncFS:Raw
Ratio: 1.02013 real, 1.01613 virtual LUKS:Raw
Ratio: 1.02058 real, 1.02058 virtual ZIP, WinZip:Raw
Ratio: 1.02315 real, 1.02172 virtual HMAC-SHA1:Many salts
Ratio: 1.02499 real, 1.02473 virtual Office, 2007/2010/2013:Raw
Ratio: 1.02616 real, 1.02497 virtual agilekeychain, 1Password Agile
Keychain:Raw
Ratio: 1.02778 real, 1.02604 virtual fde, Android FDE:Raw
Ratio: 1.03001 real, 1.01997 virtual PBKDF2-HMAC-SHA1:Raw
Ratio: 1.03393 real, 1.03394 virtual mssql12, MS SQL 2012/2014:Only one salt
Ratio: 1.03590 real, 1.03066 virtual ODF:Raw
Ratio: 1.03826 real, 1.02821 virtual Citrix_NS10, Netscaler 10:Only one salt
Ratio: 1.04312 real, 1.03429 virtual wpapsk, WPA/WPA2 PSK:Raw

I think the summary makes very good sense in this case since only
relevant formats were included:

Number of benchmarks: 82
Minimum: 0.95438 real, 0.95335 virtual
Maximum: 1.04312 real, 1.03429 virtual
Median: 1.00411 real, 1.00192 virtual
Median absolute deviation: 0.01072 real, 0.00599 virtual
Geometric mean: 1.00588 real, 1.00577 virtual
Geometric standard deviation: 1.01514 real, 1.01412 virtual

magnum
Solar Designer
2015-09-04 11:10:43 UTC
Permalink
Post by magnum
TL;DR: The changes were good but no big deal.
Pre and post 5916a57, here's Bull, OMP, complete list (of SHA formats
[...]
Post by magnum
Number of benchmarks: 82
Minimum: 0.95438 real, 0.95335 virtual
Maximum: 1.04312 real, 1.03429 virtual
Median: 1.00411 real, 1.00192 virtual
Median absolute deviation: 0.01072 real, 0.00599 virtual
Geometric mean: 1.00588 real, 1.00577 virtual
Geometric standard deviation: 1.01514 real, 1.01412 virtual
Thanks!

Is there a code size reduction in the XOP build? Perhaps check with the
"size" command on the john binary and on individual *.o files.

Alexander
Lei Zhang
2015-09-08 07:04:57 UTC
Permalink
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
On AltiVec (4xOMP):

[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual

[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual


On NEON (2xOMP):

[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual

[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual


There's no significant change on Altivec, while SHA1 somehow gets slower on NEON.


Lei
Solar Designer
2015-09-08 08:47:25 UTC
Permalink
Lei,
Post by Lei Zhang
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
Is this 4 threads likely across different CPU cores? That's no good.
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).

Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.
Post by Lei Zhang
[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual
[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual
Thanks, but why are you testing these 3 hash types? I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).

We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov(). We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.

There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.
Post by Lei Zhang
[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Post by Lei Zhang
while SHA1 somehow gets slower on NEON.
It might need higher interleaving factor now. You haven't even tried
introducing interleaving for these archs, have you? (I don't recall.)

I think AltiVec probably won't need interleaving if we target modern
POWER chips with multiple hardware threads per core, but NEON will.

Also, as I suggested in the "MD5 on XOP, NEON, AltiVec" thread:

"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."

I think you should do that before benchmarking and before tuning of the
interleaving factors for MD5.

Thanks again,

Alexander
magnum
2015-09-08 10:32:11 UTC
Permalink
Is this 4 threads likely across different CPU cores? That's no good.
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.
Post by Lei Zhang
[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual
[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual
Thanks, but why are you testing these 3 hash types? I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).
We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov(). We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.
There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.
Lei, you could also run the testparas.pl script (with OMP_NUM_THREADS
and GOMP_CPU_AFFINITY) before and after. But we'd want normal benchmarks
(like you did but for pbkdf2-hmac-md4/5 and -sha1) too, so we can assess
the virtual figures.

magnum
Lei Zhang
2015-09-09 15:43:41 UTC
Permalink
Post by Solar Designer
Lei,
Post by Lei Zhang
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
Is this 4 threads likely across different CPU cores?
I think so. The benchmark results just fluctuated too badly when I utilized the maximum number of hardware threads, so I switched to a smaller number of threads, though without binding them to specific cores.
Post by Solar Designer
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
I'll post the updated results later.
Post by Solar Designer
Post by Lei Zhang
[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Why? Our ZedBoard has only two cores.
Post by Solar Designer
Post by Lei Zhang
while SHA1 somehow gets slower on NEON.
It might need higher interleaving factor now. You haven't even tried
introducing interleaving for these archs, have you? (I don't recall.)
No, I haven't. I'll put this on my todo list.
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
I just checked the manuals. NEON does support OR-NOT, but AltiVec seems to only support NOT-OR (~(a|b)). So perhaps only NEON can benefit from this optimization.


Lei
Solar Designer
2015-09-09 16:51:38 UTC
Permalink
Post by Lei Zhang
Post by Solar Designer
Is this 4 threads likely across different CPU cores?
I think so. The benchmark results just fluctuated too badly when I utilized the maximum number of hardware threads, so I switched to a smaller number of threads, though without binding them to specific cores.
Have you also tried e.g. staying just one core's worth (8 threads) below
the maximum? And things like GOMP_SPINCOUNT=10000 and
GOMP_CPU_AFFINITY=0-155 (or whatever is the full range of logical CPU
numbers). Of course, these things only make sense if the system is
otherwise idle. If you can actually see where other load comes from,
then you need to reduce your use of the cores accordingly.
Post by Lei Zhang
Post by Solar Designer
Post by Lei Zhang
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Why? Our ZedBoard has only two cores.
That comment of mine only applied to your "There's no significant change
on Altivec" - thus, it didn't apply to ZedBoard.
Post by Lei Zhang
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
I just checked the manuals. NEON does support OR-NOT, but AltiVec seems to only support NOT-OR (~(a|b)). So perhaps only NEON can benefit from this optimization.
Oh, OK. MD5_I is also implementable with XNOR, but I think AltiVec
lacks that too. That's not a big deal since it has bitselect.

Alexander
Lei Zhang
2015-09-10 16:04:21 UTC
Permalink
Post by Lei Zhang
Post by Solar Designer
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
I'll post the updated results later.
The format prefix 'pbkdf2-hmac-' is omitted below.

On ARM (2xOMP):

[before]
MD4: 2576 c/s real, 1288 c/s virtual
MD5: 1651 c/s real, 825 c/s virtual
SHA1: 578 c/s real, 290 c/s virtual

[after]
MD4: 2608 c/s real, 1304 c/s virtual
MD5: 1600 c/s real, 803 c/s virtual
SHA1: 501 c/s real, 250 c/s virtual

MD4 becomes a little bit faster; MD5 & SHA1 become slower.


On Power (8xOMP, bound to a single core)

[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual

[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual

On Power (1xOMP)

[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual

[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual

At least there's no performance drop on Power. BTW, it looks like Power's SMT performance is not very impressive.


Lei
Solar Designer
2015-09-10 16:26:47 UTC
Permalink
Lei,
Post by Lei Zhang
The format prefix 'pbkdf2-hmac-' is omitted below.
[before]
MD4: 2576 c/s real, 1288 c/s virtual
MD5: 1651 c/s real, 825 c/s virtual
SHA1: 578 c/s real, 290 c/s virtual
[after]
MD4: 2608 c/s real, 1304 c/s virtual
MD5: 1600 c/s real, 803 c/s virtual
SHA1: 501 c/s real, 250 c/s virtual
MD4 becomes a little bit faster; MD5 & SHA1 become slower.
For MD5, you can (and should) repair this with OR-NOT.

For SHA-1, please investigate: take a look at the generated code in
both cases.
Post by Lei Zhang
On Power (8xOMP, bound to a single core)
[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual
[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual
On Power (1xOMP)
[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual
[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual
At least there's no performance drop on Power. BTW, it looks like Power's SMT performance is not very impressive.
Why, a 2x+ speedup compared to 1 thread is very good. It's similar to
or even better than what we're seeing with interleaving on x86.

Thanks,

Alexander
Lei Zhang
2015-09-12 08:53:42 UTC
Permalink
Post by Solar Designer
Post by Lei Zhang
On Power (8xOMP, bound to a single core)
[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual
[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual
On Power (1xOMP)
[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual
[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual
At least there's no performance drop on Power. BTW, it looks like Power's SMT performance is not very impressive.
Why, a 2x+ speedup compared to 1 thread is very good. It's similar to
or even better than what we're seeing with interleaving on x86.
I don't really get it...

On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core. OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.


Lei
Solar Designer
2015-09-12 09:57:45 UTC
Permalink
Lei,
Post by Lei Zhang
On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
This happens, but it's not very common. Usually, speedup from running 2
threads/core is much less than 2x.
Post by Lei Zhang
OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
SMT isn't only a way to increase resource utilization of a core when
running many threads. It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests) and to allow CPU designers to use higher instruction latencies
and achieve higher clock rate. (Note that my two uses of the word
latency in the previous sentence refer to totally different latencies:
server response latency on the order of milliseconds may be improved,
but instruction latency on the order of nanoseconds may be harmed at the
same time.) Our workload uses relatively low latency instructions:
integer only, and with nearly 100% L1 cache hit rate. Some other
workloads like multiplication of large matrices (exceeding L1 data
cache) might benefit from more hardware threads per core (or explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support and possibly optimize for more hardware threads per core.

Finally, SMT provides middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as
by the need to maintain compatibility) and increasing the number of
rename registers. With SMT, there are sort of more ISA-visible CPU
registers: total across the many hardware threads. Those registers are
as good as ISA-visible ones for the purpose of replacing the need to
interleave instructions within 1 thread, yet they don't bump into
instruction size limitations.

I expect that on a CPU with more than 2 hardware threads the speed
growth with the increase of threads/core in use is spread over the 1 to
max threads range. So e.g. the speedup at only 2 threads on an 8
hardware threads CPU may very well be less than the speedup at 2 threads
on a 2 hardware threads CPU. I don't necessarily expect that the
speedup achieved at max threads is much or any greater than that
achieved at 2 threads on a CPU where 2 is the max. There's potential
for it to be greater (in the sense that the thread count doesn't limit
it to at most 2), but it might or might not be greater in practice.

Alexander
Solar Designer
2015-10-19 13:51:02 UTC
Permalink
Lei,

I just came across a recently posted article on this very topic:
performance scaling with POWER8's SMT (albeit in context of the
different reporting on AIX vs. Linux):

http://www.ibm.com/developerworks/library/l-processor-utilization-difference-aix-lop-trs/index.html

"Simultaneous multithreading (SMT) performance characterization shown in
Figure 6 is taken from the IBM POWER8 specification. This figure shows
that SMT8 provides 2.2 times better performance compared to single
threaded on POWER8."

The article also mentions that "a single-threaded application" run "on
an IBM POWER7 SMT4 system" "shows the core utilization as approximately
63% to 65%".

So the expected speedup when going from 1 thread/core to 8 threads/core
on POWER8 is 2.2 times, and the expected speedup when going from 1
thread/core to 4 threads/core on POWER7 is 1.5 to 1.6 times. Of course,
actual speedup will vary by application.
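
[Editor's note: the 1.5 to 1.6 times figure for POWER7 follows from reading the article's utilization numbers as the fraction of a core's throughput that one thread extracts; the arithmetic can be sanity-checked in two lines (this interpretation is the editor's, not stated in the article):]

```python
# Assumption: if one thread keeps a POWER7 core ~63-65% utilized, the
# best case for SMT4 is recovering the remaining headroom, i.e. a
# speedup of roughly 1/utilization.
for util in (0.63, 0.65):
    print(f"utilization {util:.0%} -> expected speedup ~{1 / util:.2f}x")
```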

Alexander

P.S. I don't normally top-post, but it's one of those rare cases where I
find this appropriate - needing to quote a lot of context, yet not
Post by Solar Designer
Post by Lei Zhang
On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
This happens, but it's not very common. Usually, speedup from running 2
threads/core is much less than 2x.
Post by Lei Zhang
OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
SMT isn't only a way to increase resource utilization of a core when
running many threads. It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests) and to allow CPU designers to use higher instruction latencies
and achieve higher clock rate. (Note that my two uses of the word
latency in the previous sentence refer to totally different latencies:
server response latency on the order of milliseconds may be improved,
but instruction latency on the order of nanoseconds may be harmed at the
same time.) Our workload uses relatively low latency instructions:
integer only, and with nearly 100% L1 cache hit rate. Some other
workloads like multiplication of large matrices (exceeding L1 data
cache) might benefit from more hardware threads per core (or explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support and possibly optimize for more hardware threads per core.
Finally, SMT provides middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as
by the need to maintain compatibility) and increasing the number of
rename registers. With SMT, there are sort of more ISA-visible CPU
registers: total across the many hardware threads. Those registers are
as good as ISA-visible ones for the purpose of replacing the need to
interleave instructions within 1 thread, yet they don't bump into
instruction size limitations.
I expect that on a CPU with more than 2 hardware threads the speed
growth with the increase of threads/core in use is spread over the 1 to
max threads range. So e.g. the speedup at only 2 threads on an 8
hardware threads CPU may very well be less than the speedup at 2 threads
on a 2 hardware threads CPU. I don't necessarily expect that the
speedup achieved at max threads is much or any greater than that
achieved at 2 threads on a CPU where 2 is the max. There's potential
for it to be greater (in the sense that the thread count doesn't limit
it to at most 2), but it might or might not be greater in practice.
Alexander
Lei Zhang
2015-09-10 16:04:19 UTC
Permalink
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
Experimented on ARM:

[use CMOV]
Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 1600 c/s real, 800 c/s virtual

[use OR-NOT]
Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 1651 c/s real, 825 c/s virtual
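
[Editor's note: the two variants Lei benchmarked compute the same MD5 I() round function, y ^ (x | ~z); on NEON/AltiVec the OR-NOT form maps to an OR-NOT plus an XOR instruction. A plain-Python sketch of the bitwise equivalence (the select-based form shown here is one possible cmov equivalent, not necessarily the exact code in simd-intrinsics.c):]

```python
import random

MASK = 0xFFFFFFFF

def md5_i_ornot(x, y, z):
    # "obvious" OR-NOT form: one OR-NOT plus one XOR
    return (y ^ (x | ~z)) & MASK

def md5_i_select(x, y, z):
    # select-based equivalent: x | ~z == bitselect(all-ones, x, z)
    sel = (MASK & ~z | x & z) & MASK
    return (y ^ sel) & MASK

random.seed(2)
for _ in range(1000):
    x, y, z = (random.getrandbits(32) for _ in range(3))
    assert md5_i_ornot(x, y, z) == md5_i_select(x, y, z)
print("MD5 I() forms agree")
```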


Lei