Solar Designer
2015-09-02 15:20:25 UTC
magnum, Lei -
SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().
In opencl_sha1.h we have:
#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
I've just tried changing this to:
#define F3(x, y, z) bitselect(x, y, (z) ^ (x))
and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).
The same pattern is also seen in:
[***@super src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/gpg_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/sha1_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/salted_sha_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
and maybe elsewhere, if written slightly differently.
In simd-intrinsics.c we have:
#if __XOP__
#define SHA1_H(x,y,z)                                           \
    tmp[i] = vcmov((x[i]),(y[i]),(z[i]));                       \
    tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z)                                           \
    tmp[i] = vand((x[i]),(y[i]));                               \
    tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif
This is suboptimal in two ways:
1. It doesn't use the more optimal expression above (can do 2 operations
instead of 3).
2. The check for __XOP__ prevents this optimization from being used for
other archs where we have non-emulated vcmov(). This is currently NEON
and AltiVec.
While we could simply drop the check for __XOP__ once we've optimized
the expression (it'd be 4 operations with emulated vcmov(), the same
count as the current #else branch), I suggest that we don't, because
the 4 operations in #else include some parallelism whereas our
vcmov() emulation does not.
So I think we should either extend the check to also test for
__ARM_NEON__ and __ALTIVEC__, or introduce some generic way of detecting
non-emulated vcmov() (e.g., a macro defined in pseudo_intrinsics.h
indicating that vcmov() is emulated, so we'd avoid it then).
The same applies to rawSHA1_ng_fmt_plug.c where we have:
#define R3(W, A, B, C, D, E) do {                               \
    E = vadd_epi32(E, K);                                       \
    E = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B)));     \
    E = vadd_epi32(E, W);                                       \
    B = vroti_epi32(B, 30);                                     \
    E = vadd_epi32(E, vroti_epi32(A, 5));                       \
} while (false)
In fact, it's even worse there: as currently written, this expands to 5
operations when vcmov() is emulated, instead of 4. I think we should
put an #if around R3 and define it in two different ways: using the more
optimal 2 operations expression when vcmov() is non-emulated, and using
the 4 operations expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.
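A sketch of that split (a macro fragment against JtR's intrinsics wrappers, not compilable standalone; VCMOV_EMULATED is again a hypothetical name). Since H(B, C, D) = Maj(B, C, D) and Maj is symmetric, vcmov(B, D, vxor(C, D)) picks B where C^D is set and D elsewhere, giving H in 2 operations:

```c
#ifndef VCMOV_EMULATED
#define R3(W, A, B, C, D, E) do {                               \
    E = vadd_epi32(E, K);                                       \
    E = vadd_epi32(E, vcmov(B, D, vxor(C, D)));                 \
    E = vadd_epi32(E, W);                                       \
    B = vroti_epi32(B, 30);                                     \
    E = vadd_epi32(E, vroti_epi32(A, 5));                       \
} while (false)
#else
#define R3(W, A, B, C, D, E) do {                               \
    E = vadd_epi32(E, K);                                       \
    E = vadd_epi32(E, vor(vand(D, B), vand(vor(D, B), C)));     \
    E = vadd_epi32(E, W);                                       \
    B = vroti_epi32(B, 30);                                     \
    E = vadd_epi32(E, vroti_epi32(A, 5));                       \
} while (false)
#endif
```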
magnum, will you take care of this, please? And test on GPUs and on XOP.
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
Thanks,
Alexander