I optimized yescrypt-opencl (tested on a GTX 960M) by copying one table to
private memory.
Before (with some optimizations):
***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 247 c/s real, 247 c/s virtual
Only one salt: 247 c/s real, 247 c/s virtual
Now (with the table copied to private memory):
***@none ~/Desktop/r/src $ m;r;GWS=1024 ./john --test --format=yescrypt-opencl
Make process completed.
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 409 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 407 c/s virtual
However, to make autotune benchmark GWS=256, 512 and 1024, I need to set only
a quarter of the needed memory in autotune
(I'm getting CL_MEM_OBJECT_ALLOCATION_FAILURE for GWS=2048):
***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual
If I set all of the needed memory, autotune only tries GWS=256:
***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 158 c/s 158 rounds/s 1.612s per crypt_all()!
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 230 c/s real, 230 c/s virtual
Only one salt: 237 c/s real, 237 c/s virtual
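To put the allocation failure in context, here is a rough sketch of the memory arithmetic. It assumes (this is my assumption, not something the logs confirm) that the format allocates one "memory per hash" region per work-item, so the total grows linearly with GWS. The 2.10 MB figure is taken from the benchmark output; the function name is just for illustration.

```python
# Per-hash memory as printed by the benchmark ("memory per hash : 2.10 MB").
MEMORY_PER_HASH_MB = 2.10

def total_memory_mb(gws, per_hash_mb=MEMORY_PER_HASH_MB):
    """Total memory if every work-item gets its own per-hash region.

    Assumption: memory scales linearly with GWS (one region per work-item).
    """
    return gws * per_hash_mb

# Print the estimated requirement for each GWS mentioned in this post.
for gws in (256, 512, 1024, 2048):
    print(f"GWS={gws}: about {total_memory_mb(gws):.1f} MB")
```

Under that assumption GWS=2048 would need roughly 4.3 GB and GWS=1024 roughly 2.15 GB, so a failure at 2048 would not be surprising on a laptop GPU. Whether the limit hit here is the total device memory or the per-allocation limit (CL_DEVICE_MAX_MEM_ALLOC_SIZE) is not visible in these logs.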
The other issue is that the benchmarks estimate the speed incorrectly: the
autotune figures do not match the speeds reported with a fixed GWS.
***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 407 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 409 c/s virtual
***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 358 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 360 c/s virtual
***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual
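A quick way to see the mismatch is to recompute the autotune figures. The sketch below assumes that the autotune c/s value is simply GWS divided by the time per crypt_all() (that matches the printed numbers, but I have not checked it against the autotune code) and compares it with the speeds from the fixed-GWS runs above. All numbers are copied from the logs in this post; the variable and function names are only for illustration.

```python
# (gws, seconds per crypt_all(), c/s printed by autotune) from the --v=4 output.
AUTOTUNE = [(256, 1.608, 159), (512, 3.176, 161), (1024, 7.029, 145)]

# "Many salts" real c/s from the fixed-GWS benchmark runs (GWS=512 and GWS=1024).
BENCHMARK = {512: 358, 1024: 407}

def rate(gws, seconds):
    """Candidates per second if one crypt_all() call processes gws candidates."""
    return gws / seconds

for gws, secs, printed in AUTOTUNE:
    line = f"gws={gws}: gws/time = {rate(gws, secs):.1f} c/s, autotune printed {printed}"
    if gws in BENCHMARK:
        # Ratio between the benchmark speed and the speed implied by autotune timing.
        ratio = BENCHMARK[gws] / rate(gws, secs)
        line += f", benchmark printed {BENCHMARK[gws]} ({ratio:.2f}x higher)"
    print(line)
```

The recomputed gws/time values agree with what autotune printed, so the autotune numbers are at least internally consistent. The benchmark speeds are more than twice as high at the same GWS, and autotune prefers GWS=512 while the benchmark is faster at GWS=1024.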