Discussion: [john-dev] PHC: Argon2 on GPU
Agnieszka Bielec
2015-08-03 21:32:01 UTC
argon2i before coalescing:
***@none ~/Desktop/rr/run $ ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 1163 c/s real, 1163 c/s virtual
Only one salt: 1163 c/s real, 1163 c/s virtual

argon2i after coalescing:
***@none ~/Desktop/rr_coal_2/run $ ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 1828 c/s real, 1828 c/s virtual
Only one salt: 1845 c/s real, 1845 c/s virtual

argon2i on CPU:
***@none ~/Desktop/rr/run $ ./john --test --format=argon2i
Will run 8 OpenMP threads
Benchmarking: argon2i [Blake2 AVX]... (8xOMP)
memory per hash : 1000.00 kB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Raw: 3408 c/s real, 433 c/s virtual
Agnieszka Bielec
2015-08-03 22:10:33 UTC
Post by Agnieszka Bielec
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 1828 c/s real, 1828 c/s virtual
Only one salt: 1845 c/s real, 1845 c/s virtual
Even faster after a small modification:

***@none ~/Desktop/rr_coal_2/run $ ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 2133 c/s real, 2118 c/s virtual
Only one salt: 2118 c/s real, 2133 c/s virtual

again the same problem as with yescrypt; it's auto-tuned properly after
I divided MEM_SIZE by 4 here:

opencl_init_auto_setup(SEED, 0, NULL,
warn, 4, self, create_clobj, release_clobj, MEM_SIZE/4, 0);
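
My guess at what goes wrong (just a sketch; I'm assuming auto-tune derives
its GWS ceiling from the per-hash buffer size and the device's maximum
allocation size, which I haven't verified in the shared code):

/* hypothetical: if auto-tune caps GWS so that gws * buffer_size stays
   below CL_DEVICE_MAX_MEM_ALLOC_SIZE, and the device reports max alloc
   as total memory / 4, the cap comes out 4x too small; passing
   MEM_SIZE / 4 would compensate, but only by accident */
size_t cap      = max_alloc / MEM_SIZE;       /* too small        */
size_t cap_div4 = max_alloc / (MEM_SIZE / 4); /* 4x larger, works */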
Solar Designer
2015-08-06 12:14:46 UTC
Agnieszka, magnum -
Post by Agnieszka Bielec
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 2133 c/s real, 2118 c/s virtual
Only one salt: 2118 c/s real, 2133 c/s virtual
Agnieszka - why are you testing this on your small GPU, and at 1 MB?
Why not tune it (both 2d and 2i) the same as we did for the yescrypt vs.
Lyra2 test, so likely 1.5 MB? I am interested in results that would be
directly comparable to those you had for yescrypt and Lyra2.
Post by Agnieszka Bielec
again the same problem as with yescrypt; it's auto-tuned properly after
opencl_init_auto_setup(SEED, 0, NULL,
warn, 4, self, create_clobj, release_clobj, MEM_SIZE/4, 0);
magnum - please comment on this, or have you commented on the issue in
another john-dev thread already?

Thanks,

Alexander
magnum
2015-08-06 23:49:56 UTC
Post by Solar Designer
Post by Agnieszka Bielec
again the same problem as with yescrypt; it's auto-tuned properly after
opencl_init_auto_setup(SEED, 0, NULL,
warn, 4, self, create_clobj, release_clobj, MEM_SIZE/4, 0);
magnum - please comment on this, or have you commented on the issue in
another john-dev thread already?
Does it stop too early because it thinks memory use will be too high?
This format's use of that argument (without the division) looks sane to me.

Division by 4 does ring a bell, though: maybe it's somehow related to the
fact that the maximum size of a single allocation in OpenCL is usually just
a quarter of the total main memory?
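
That's easy to check on a given device; a quick sketch ('device' being
whatever cl_device_id is already at hand):

cl_ulong total, max_alloc;
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(total), &total, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
printf("global %lu MB, max alloc %lu MB, ratio %.2f\n",
       (unsigned long)(total >> 20), (unsigned long)(max_alloc >> 20),
       (double)total / (double)max_alloc);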

Claudio, are you reading this? Any idea?

magnum
Agnieszka Bielec
2015-08-12 16:32:09 UTC
I wanted to run tests on super, but I discovered that the final speed
differs from the one reported while computing GWS; I don't know which is correct.

[***@super run]$ ./john --test --format=argon2d-opencl --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 964 c/s 964 rounds/s 265.514ms per crypt_all()!
gws: 512 1878 c/s 1878 rounds/s 272.497ms per crypt_all()+
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
Agnieszka Bielec
2015-08-12 16:45:02 UTC
Post by Agnieszka Bielec
I wanted to run tests on super, but I discovered that the final speed
differs from the one reported while computing GWS; I don't know which is correct.
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 964 c/s 964 rounds/s 265.514ms per crypt_all()!
gws: 512 1878 c/s 1878 rounds/s 272.497ms per crypt_all()+
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
I get the same results when using the same password everywhere.
magnum
2015-08-12 21:45:35 UTC
Post by Agnieszka Bielec
I wanted to run tests on super, but I discovered that the final speed
differs from the one reported while computing GWS; I don't know which is correct.
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 964 c/s 964 rounds/s 265.514ms per crypt_all()!
gws: 512 1878 c/s 1878 rounds/s 272.497ms per crypt_all()+
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
The benchmark figures (last two lines) are the correct ones. If you set
up auto-tune correctly, that speed should be similar to the benchmark.
For some formats/situations this is hard to achieve and it's just
cosmetic anyway.

magnum
Solar Designer
2015-08-12 21:51:32 UTC
Post by magnum
Post by Agnieszka Bielec
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
The benchmark figures (last two lines) are the correct ones. If you set
up auto-tune correctly, that speed should be similar to the benchmark.
For some formats/situations this is hard to achieve and it's just
cosmetic anyway.
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!

Alexander
magnum
2015-08-12 22:28:57 UTC
Post by Solar Designer
Post by magnum
Post by Agnieszka Bielec
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
The benchmark figures (last two lines) are the correct ones. If you set
up auto-tune correctly, that speed should be similar to the benchmark.
For some formats/situations this is hard to achieve and it's just
cosmetic anyway.
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test
vectors in auto-tune than the ones benchmarked, or auto-tune using just
one repeated plaintext in a format where length matters for speed (eg.
RAR), or something along those lines.

Another reason would be incorrect setup of autotune for split kernels.
For example, if auto-tune thinks we're going to call a split kernel 500
times but the real run does it 1000 times, we'll see inflated figures
from autotune.

A third reason (seen in early WPA-PSK) is when crypt_all() does
significant post-processing on CPU where auto-tune doesn't.
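
To illustrate the split-kernel case with made-up numbers:

/* made-up numbers: one split-kernel call takes 1 ms at gws = 1024 */
double t_call = 0.001;
int gws = 1024, assumed_calls = 500, real_calls = 1000;
printf("auto-tune prints: %.0f c/s\n",
       gws / (t_call * assumed_calls));  /* 2048 c/s, inflated 2x */
printf("real run shows:   %.0f c/s\n",
       gws / (t_call * real_calls));     /* 1024 c/s */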

magnum
Agnieszka Bielec
2015-08-13 07:52:34 UTC
Post by Solar Designer
Post by magnum
Post by Agnieszka Bielec
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
The benchmark figures (last two lines) are the correct ones. If you set
up auto-tune correctly, that speed should be similar to the benchmark.
For some formats/situations this is hard to achieve and it's just
cosmetic anyway.
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test vectors
in auto-tune than the ones benchmarked, or auto-tune using just one repeated
plaintext in a format where length matters for speed (eg. RAR), or something
along those lines.
Another reason would be incorrect setup of autotune for split kernels. For
example, if auto-tune thinks we're going to call a split kernel 500 times
but the real run does it 1000 times, we'll see inflated figures from
autotune.
A third reason (seen in early WPA-PSK) is when crypt_all() does significant
post-processing on CPU where auto-tune doesn't.
None of these. I printf'ed the plaintexts that are set during GWS
computation, modified bench.c to set the same values, and the result is
the same.
magnum
2015-08-13 09:23:05 UTC
Post by Agnieszka Bielec
Post by Solar Designer
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test vectors
in auto-tune than the ones benchmarked, or auto-tune using just one repeated
plaintext in a format where length matters for speed (eg. RAR), or something
along those lines.
Another reason would be incorrect setup of autotune for split kernels. For
example, if auto-tune thinks we're going to call a split kernel 500 times
but the real run does it 1000 times, we'll see inflated figures from
autotune.
A third reason (seen in early WPA-PSK) is when crypt_all() does significant
post-processing on CPU where auto-tune doesn't.
None of these. I printf'ed the plaintexts that are set during GWS
computation, modified bench.c to set the same values, and the result is
the same.
Then you might want to dig into it. The autotune code should be easy to
follow. Try to establish exactly what it comes up with and how it ends
up with the figures it prints for your format.

magnum
Agnieszka Bielec
2015-08-14 08:37:32 UTC
Unfortunately, I have another problem, only with --dev=5 on super. I
tested argon2d/i on 4 cards, and this problem occurs only on the NVIDIA
card on super.
I did some debugging

return 0;
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);

works
__
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);
return 0;

doesn't work (CL_OUT_OF_RESOURCES)
__
printf("%d\n",msglen);
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);
return 0;

works

so it looks like a driver bug to me
Solar Designer
2015-08-14 10:56:07 UTC
Post by Agnieszka Bielec
Unfortunately, I have another problem, only with --dev=5 on super. I
tested argon2d/i on 4 cards, and this problem occurs only on the NVIDIA
card on super.
I did some debugging
return 0;
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);
works
__
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);
return 0;
doesn't work (CL_OUT_OF_RESOURCES)
__
printf("%d\n",msglen);
blake2b_update(&BlakeHash, (const uchar*)msg, msglen);
return 0;
works
so it looks like a driver bug to me
Not necessarily. It could be that the addition of printf() affects
optimization, and thus resource usage.
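
One way to check is to compare what the compiler reports for the kernel in
both builds; roughly (using whatever kernel/device handles you already
have):

cl_ulong priv_mem, local_mem;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(priv_mem), &priv_mem, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_mem), &local_mem, NULL);
printf("private %lu bytes, local %lu bytes\n",
       (unsigned long)priv_mem, (unsigned long)local_mem);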

Alexander
Solar Designer
2015-08-14 13:31:33 UTC
Post by magnum
Post by Solar Designer
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test
vectors in auto-tune than the ones benchmarked, or auto-tune using just
one repeated plaintext in a format where length matters for speed (eg.
RAR), or something along those lines.
Another reason would be incorrect setup of autotune for split kernels.
For example, if auto-tune thinks we're going to call a split kernel 500
times but the real run does it 1000 times, we'll see inflated figures
from autotune.
A third reason (seen in early WPA-PSK) is when crypt_all() does
significant post-processing on CPU where auto-tune doesn't.
At least the first reason you listed may well result in suboptimal
auto-tuning. Perhaps it wouldn't with simple iterated schemes like
PBKDF2, but with memory-hard schemes like Argon2 the cost settings do
affect optimal LWS and GWS substantially.

So we shouldn't dismiss this without understanding of what exactly is
going on in a given case.

Alexander
Agnieszka Bielec
2015-08-14 14:37:06 UTC
Post by Solar Designer
Post by magnum
Post by Solar Designer
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test
vectors in auto-tune than the ones benchmarked, or auto-tune using just
one repeated plaintext in a format where length matters for speed (eg.
RAR), or something along those lines.
Another reason would be incorrect setup of autotune for split kernels.
For example, if auto-tune thinks we're going to call a split kernel 500
times but the real run does it 1000 times, we'll see inflated figures
from autotune.
A third reason (seen in early WPA-PSK) is when crypt_all() does
significant post-processing on CPU where auto-tune doesn't.
At least the first reason you listed may well result in suboptimal
auto-tuning. Perhaps it wouldn't with simple iterated schemes like
PBKDF2, but with memory-hard schemes like Argon2 the cost settings do
affect optimal LWS and GWS substantially.
So we shouldn't dismiss this without understanding of what exactly is
going on in a given case.
Cracking mode on my laptop with argon2d showed that at the beginning the
speed is the same as the one shown while computing GWS; after some time I
get a speed closer to the one shown during --test, but it's not exactly
the same.

beginning
0g 0:00:00:05 13.67% 2/3 (ETA: 16:00:32) 0g/s 3922p/s 3922c/s 3922C/s
GPU:56°C util:99% leugim..nolfet

after 1 min
0g 0:00:03:25 3/3 0g/s 4067p/s 4067c/s 4067C/s GPU:77°C util:99% 213160..241144

after 5 min
0g 0:00:07:40 3/3 0g/s 4083p/s 4083c/s 4083C/s GPU:78°C util:45%
critas01..crachera

--test

Local worksize (LWS) 64, global worksize (GWS) 512
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4114 c/s real, 4077 c/s virtual
Only one salt: 4114 c/s real, 4114 c/s virtual

I don't have big differences with argon2i on my laptop

on super:

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 385 c/s 385 rounds/s 663.846ms per crypt_all()!
gws: 512 719 c/s 719 rounds/s 711.475ms per crypt_all()+
gws: 1024 1298 c/s 1298 rounds/s 788.748ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 390 c/s real, 102400 c/s virtual
Only one salt: 390 c/s real, 102400 c/s virtual

cracking run shows

Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 385.3p/s 385.3c/s
385.3C/s fireballs..bens
GPU 0 overheat (33816176°C, fan 0%), aborting job.
0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 384.0p/s 384.0c/s
384.0C/s fireballs..bens

so the speeds reported by the main --test run are correct
Agnieszka Bielec
2015-08-14 14:44:09 UTC
Post by Agnieszka Bielec
Post by Solar Designer
Post by magnum
Post by Solar Designer
magnum, do you have an explanation why the best benchmark result during
auto-tuning is usually substantially different from the final benchmark
in most of Agnieszka's formats? I'm fine with eventually dismissing it
as "hard to achieve" and "cosmetic anyway", but I'd like to understand
the cause first. Thanks!
Generally a mismatch could be caused by using different [cost] test
vectors in auto-tune than the ones benchmarked, or auto-tune using just
one repeated plaintext in a format where length matters for speed (eg.
RAR), or something along those lines.
Another reason would be incorrect setup of autotune for split kernels.
For example, if auto-tune thinks we're going to call a split kernel 500
times but the real run does it 1000 times, we'll see inflated figures
from autotune.
A third reason (seen in early WPA-PSK) is when crypt_all() does
significant post-processing on CPU where auto-tune doesn't.
At least the first reason you listed may well result in suboptimal
auto-tuning. Perhaps it wouldn't with simple iterated schemes like
PBKDF2, but with memory-hard schemes like Argon2 the cost settings do
affect optimal LWS and GWS substantially.
So we shouldn't dismiss this without understanding of what exactly is
going on in a given case.
Cracking mode on my laptop with argon2d showed that at the beginning the
speed is the same as the one shown while computing GWS; after some time I
get a speed closer to the one shown during --test, but it's not exactly
the same.
beginning
0g 0:00:00:05 13.67% 2/3 (ETA: 16:00:32) 0g/s 3922p/s 3922c/s 3922C/s
GPU:56°C util:99% leugim..nolfet
after 1 min
0g 0:00:03:25 3/3 0g/s 4067p/s 4067c/s 4067C/s GPU:77°C util:99% 213160..241144
after 5 min
0g 0:00:07:40 3/3 0g/s 4083p/s 4083c/s 4083C/s GPU:78°C util:45%
critas01..crachera
--test
Local worksize (LWS) 64, global worksize (GWS) 512
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4114 c/s real, 4077 c/s virtual
Only one salt: 4114 c/s real, 4114 c/s virtual
I don't have big differences with argon2i on my laptop
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 385 c/s 385 rounds/s 663.846ms per crypt_all()!
gws: 512 719 c/s 719 rounds/s 711.475ms per crypt_all()+
gws: 1024 1298 c/s 1298 rounds/s 788.748ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 390 c/s real, 102400 c/s virtual
Only one salt: 390 c/s real, 102400 c/s virtual
cracking run shows
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 385.3p/s 385.3c/s
385.3C/s fireballs..bens
GPU 0 overheat (33816176°C, fan 0%), aborting job.
0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 384.0p/s 384.0c/s
384.0C/s fireballs..bens
so the speeds reported by the main --test run are correct
wtf?

[***@super run]$ ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 423 c/s real, 102400 c/s virtual
Only one salt: 423 c/s real, 102400 c/s virtual

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 1296 c/s real, 204800 c/s virtual
Only one salt: 1304 c/s real, 204800 c/s virtual

This may have something in common with MEM_SIZE/4 (I have now removed the /4)
http://www.openwall.com/lists/john-dev/2015/08/06/22
sorry, couldn't find my original e-mail
Solar Designer
2015-08-14 15:20:47 UTC
Post by Agnieszka Bielec
wtf?
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 423 c/s real, 102400 c/s virtual
Only one salt: 423 c/s real, 102400 c/s virtual
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 1296 c/s real, 204800 c/s virtual
Only one salt: 1304 c/s real, 204800 c/s virtual
So the reported auto-tuned GWS works differently from a manually set one?
You certainly need to figure this out.

And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
Post by Agnieszka Bielec
This may have something in common with MEM_SIZE/4 (I have now removed the /4)
When did you remove the /4? Before or after running the test above?
Post by Agnieszka Bielec
http://www.openwall.com/lists/john-dev/2015/08/06/22
sorry, couldn't find my original e-mail
Alexander
Agnieszka Bielec
2015-08-14 15:31:07 UTC
Post by Solar Designer
Post by Agnieszka Bielec
wtf?
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 423 c/s real, 102400 c/s virtual
Only one salt: 423 c/s real, 102400 c/s virtual
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 1296 c/s real, 204800 c/s virtual
Only one salt: 1304 c/s real, 204800 c/s virtual
So the reported auto-tuned GWS works differently from a manually set one?
You certainly need to figure this out.
Post by Agnieszka Bielec
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual
it reports GWS=1024 in the main test, but it's actually 256

this also occurs in yescrypt; there I have MEM_SIZE/4, although it
causes problems on Tahiti
Post by Solar Designer
And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
When earlier? I remember only earlier tests where I used different costs.
Post by Solar Designer
Post by Agnieszka Bielec
This may have something in common with MEM_SIZE/4 (I have now removed the /4)
When did you remove the /4? Before or after running the test above?
Post by Agnieszka Bielec
http://www.openwall.com/lists/john-dev/2015/08/06/22
sorry, couldn't find my original e-mail
after
Solar Designer
2015-08-14 15:37:42 UTC
Post by Agnieszka Bielec
Post by Solar Designer
And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
When earlier? I remember only earlier tests where I used different costs.
Here you reported ~3k c/s at ~1.5 MB on super's Tahiti for Argon2d:

http://www.openwall.com/lists/john-dev/2015/08/12/11

Now you're reporting ~10x lower speeds for the same GPU.

Alexander
Agnieszka Bielec
2015-08-14 15:47:38 UTC
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
When earlier? I remember only earlier tests where I used different costs.
http://www.openwall.com/lists/john-dev/2015/08/12/11
Now you're reporting ~10x lower speeds for the same GPU.
Indeed, and this is even the same version I'm using now. At one point I
ran the tests, ran them again, and noticed the speed was worse. I typed
the 'w' command; super was not idle, Kai was logged in, so I e-mailed
him. He stopped his job, but after that the speed was still the same
(worse). I thought it was just a problem with my eyes or memory, but now
I see it wasn't. Maybe it has nothing to do with Kai; he said he didn't
touch the GPUs. Solar, can you restart super?
Agnieszka Bielec
2015-08-14 17:02:39 UTC
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
When earlier? I remember only earlier tests where I used different costs.
http://www.openwall.com/lists/john-dev/2015/08/12/11
Now you're reporting ~10x lower speeds for the same GPU.
Indeed, and this is even the same version I'm using now. At one point I
ran the tests, ran them again, and noticed the speed was worse. I typed
the 'w' command; super was not idle, Kai was logged in, so I e-mailed
him. He stopped his job, but after that the speed was still the same
(worse). I thought it was just a problem with my eyes or memory, but now
I see it wasn't. Maybe it has nothing to do with Kai; he said he didn't
touch the GPUs. Solar, can you restart super?
Ah, that link shows argon2d; it's faster than argon2i because t_cost
for argon2d is 1, versus 3 for argon2i.
Solar Designer
2015-08-14 17:06:28 UTC
Post by Agnieszka Bielec
Ah, that link shows argon2d; it's faster than argon2i because t_cost
for argon2d is 1, versus 3 for argon2i.
Sure, but IIRC on other benchmarks you posted there was only a small
difference in performance for 2i at t=3 and 2d at t=1. Also, this
doesn't explain the ~10x worse performance we're seeing for 2i now.

Alexander
Agnieszka Bielec
2015-08-14 18:01:31 UTC
Post by Solar Designer
Post by Agnieszka Bielec
Ah, that link shows argon2d; it's faster than argon2i because t_cost
for argon2d is 1, versus 3 for argon2i.
Sure, but IIRC on other benchmarks you posted there was only a small
difference in performance for 2i at t=3 and 2d at t=1. Also, this
doesn't explain the ~10x worse performance we're seeing for 2i now.
Where do you see ~10x better performance than now, with the same costs?
Solar Designer
2015-08-14 18:11:22 UTC
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Ah, that link shows argon2d; it's faster than argon2i because t_cost
for argon2d is 1, versus 3 for argon2i.
Sure, but IIRC on other benchmarks you posted there was only a small
difference in performance for 2i at t=3 and 2d at t=1. Also, this
doesn't explain the ~10x worse performance we're seeing for 2i now.
Where do you see ~10x better performance than now, with the same costs?
Not the same, but I meant this:

http://www.openwall.com/lists/john-dev/2015/08/14/42

[***@...er run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual

vs. this:

http://www.openwall.com/lists/john-dev/2015/08/12/11

[***@...er run]$ ./john --test --format=argon2d-opencl --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 964 c/s 964 rounds/s 265.514ms per crypt_all()!
gws: 512 1878 c/s 1878 rounds/s 272.497ms per crypt_all()+
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual

It's 2i at t=3 vs. 2d at t=1. I'd expect the former to be at most 3x
slower (because of higher t), i.e., no worse than roughly 2925 / 3 ≈ 975 c/s,
and in practice less slowdown than that due to 2i's predictable and
coalescing-friendly access pattern. Instead we're seeing 389 c/s.

Alexander
Agnieszka Bielec
2015-08-14 18:40:28 UTC
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Ah, that link shows argon2d; it's faster than argon2i because t_cost
for argon2d is 1, versus 3 for argon2i.
Sure, but IIRC on other benchmarks you posted there was only a small
difference in performance for 2i at t=3 and 2d at t=1. Also, this
doesn't explain the ~10x worse performance we're seeing for 2i now.
Where do you see ~10x better performance than now, with the same costs?
http://www.openwall.com/lists/john-dev/2015/08/14/42
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 387 c/s 387 rounds/s 659.830ms per crypt_all()!
gws: 512 720 c/s 720 rounds/s 710.817ms per crypt_all()+
gws: 1024 1305 c/s 1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 389 c/s real, 102400 c/s virtual
Only one salt: 386 c/s real, 51200 c/s virtual
http://www.openwall.com/lists/john-dev/2015/08/12/11
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 964 c/s 964 rounds/s 265.514ms per crypt_all()!
gws: 512 1878 c/s 1878 rounds/s 272.497ms per crypt_all()+
gws: 1024 3447 c/s 3447 rounds/s 297.022ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 2925 c/s real, 307200 c/s virtual
Only one salt: 2898 c/s real, 307200 c/s virtual
It's 2i at t=3 vs. 2d at t=1. I'd expect the former to be at most 3x
slower (because of higher t), i.e., no worse than roughly 2925 / 3 ≈ 975 c/s,
and in practice less slowdown than that due to 2i's predictable and
coalescing-friendly access pattern. Instead we're seeing 389 c/s.
I'm not sure if you fully understood this post:
http://www.openwall.com/lists/john-dev/2015/08/14/42
On super, john computes the GWS, 1024 is the best, and it prints:
Local worksize (LWS) 64, global worksize (GWS) 1024
but actually 256 is set, and those computations are for GWS=256; if you
specify GWS=1024, the speed is better and really is for GWS=1024.
argon2d has a similar problem, but it's not as bad: GWS isn't 256, but
somewhere between 512 and 1024.

This shows that there is a bug in auto-tune or in my configuration (but
if there is a bug in my configuration, there is also one in auto-tune:
even if I configured something wrong, john shouldn't report GWS=1024
when GWS=256). But I don't have this problem on my laptop (it's either
a different bug, or just that the first call of crypt_all() is slower).
Solar Designer
2015-08-14 18:44:10 UTC
Agnieszka, magnum -
Post by Agnieszka Bielec
I'm not sure if you fully understood this post:
http://www.openwall.com/lists/john-dev/2015/08/14/42
Local worksize (LWS) 64, global worksize (GWS) 1024
but actually 256 is set, and those computations are for GWS=256; if you
specify GWS=1024, the speed is better and really is for GWS=1024.
argon2d has a similar problem, but it's not as bad: GWS isn't 256, but
somewhere between 512 and 1024.
This shows that there is a bug in auto-tune or in my configuration (but
if there is a bug in my configuration, there is also one in auto-tune:
even if I configured something wrong, john shouldn't report GWS=1024
when GWS=256). But I don't have this problem on my laptop (it's either
a different bug, or just that the first call of crypt_all() is slower).
OK, you and magnum need to figure this out and fix whatever bug there is.

Alexander
magnum
2015-08-15 07:12:30 UTC
Post by Solar Designer
Post by Agnieszka Bielec
This shows that there is a bug in auto-tune or in my configuration (but
if there is a bug in my configuration, there is also one in auto-tune:
even if I configured something wrong, john shouldn't report GWS=1024
when GWS=256). But I don't have this problem on my laptop (it's either
a different bug, or just that the first call of crypt_all() is slower).
OK, you and magnum need to figure this out and fix whatever bug there is.
I doubt it's a bug in shared code. Agnieszka, you need to establish what
happens and why. It's just a matter of adding a bunch of debug prints.
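
For example, something as simple as this at the top of crypt_all() (adjust
the names to your format):

fprintf(stderr, "crypt_all: count=%d gws=%zu lws=%zu\n",
        count, global_work_size, local_work_size);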

magnum
Agnieszka Bielec
2015-08-15 11:38:33 UTC
Post by magnum
Post by Solar Designer
Post by Agnieszka Bielec
This shows that there is a bug in auto-tune or in my configuration (but
if there is a bug in my configuration, there is also one in auto-tune:
even if I configured something wrong, john shouldn't report GWS=1024
when GWS=256). But I don't have this problem on my laptop (it's either
a different bug, or just that the first call of crypt_all() is slower).
OK, you and magnum need to figure this out and fix whatever bug there is.
I doubt it's a bug in shared code. Agnieszka, you need to establish what
happens and why. It's just a matter of adding a bunch of debug prints.
The first issue is MEM_SIZE.

I made some tests on my laptop with NVIDIA and on the AMD card on super.

my laptop, MEM_SIZE

***@none ~/Desktop/r/run $ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 1358 c/s 1358 rounds/s 188.400ms per crypt_all()!
gws: 512 1475 c/s 1475 rounds/s 346.913ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 512
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
aaa Many salts: 1462 c/s real, 1476 c/s virtual
zzzzz Only one salt: 1462 c/s real, 1449 c/s virtual
___

my laptop, MEM_SIZE/4

***@none ~/Desktop/r/run $ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 1366 c/s 1366 rounds/s 187.301ms per crypt_all()!
gws: 512 1476 c/s 1476 rounds/s 346.862ms per crypt_all()+
gws: 1024 1900 c/s 1900 rounds/s 538.875ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
aaa Many salts: 1914 c/s real, 1914 c/s virtual
zzzzz Only one salt: 1896 c/s real, 1896 c/s virtual

__
super AMD, MEM_SIZE

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 388 c/s 388 rounds/s 659.713ms per crypt_all()!
gws: 512 719 c/s 719 rounds/s 711.542ms per crypt_all()+
gws: 1024 1309 c/s 1309 rounds/s 782.178ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
aaa Many salts: 530 c/s real, 102400 c/s virtual
zzzzz Only one salt: 525 c/s real, 102400 c/s virtual

___

super AMD, MEM_SIZE/4

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 389 c/s 389 rounds/s 657.904ms per crypt_all()!
gws: 512 719 c/s 719 rounds/s 711.460ms per crypt_all()+
gws: 1024 1302 c/s 1302 rounds/s 786.136ms per crypt_all()+
OpenCL error (CL_INVALID_BUFFER_SIZE) in file
(opencl_argon2i_fmt_plug.c) at line (118) - (Error creating device
buffer)

___
according to this link:
https://devtalk.nvidia.com/default/topic/496980/cl_device_max_mem_alloc_size-looking-for-an-device-with-1gb/
"As has been pointed out in, e.g., this thread,
CL_DEVICE_MAX_MEM_ALLOC_SIZE should be the maximum size of memory
objects. The OpenCL specs demand that it is at least a quarter of the
total memory (which I find a severe restriction). However, NVIDIAs
(and Apple's) OpenCL implementations always return exactly that
quarter, even if you can create larger memory objects in practice, so
this looks more like a misinterpretation of the specs or a kind of
bug. "

some info here:
https://devtalk.nvidia.com/default/topic/478783/cuda-programming-and-performance/cl_device_max_mem_alloc_size-incorrect-/

Tests show that AMD treats it differently than NVIDIA. I couldn't find
on the internet how exactly AMD treats CL_DEVICE_MAX_MEM_ALLOC_SIZE; I only
found https://community.amd.com/thread/152028 but maybe this link
is not important.

This bug will be hard to fix because we don't know how the device behaves,
and something can always change in the future.
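
Maybe the only robust approach is to probe at runtime instead of trusting
the reported value; a rough, untested sketch (context and device are the
ones we already have):

cl_ulong total;
size_t lo = 1 << 20, hi, mid, best = 0;
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(total), &total, NULL);
hi = (size_t)total;
while (lo <= hi) {
        cl_int err;
        cl_mem buf;
        mid = lo + (hi - lo) / 2;
        buf = clCreateBuffer(context, CL_MEM_READ_WRITE, mid, NULL, &err);
        if (err == CL_SUCCESS) {
                /* some drivers only fail on first use, so writing a few
                   bytes here would make the probe more reliable */
                clReleaseMemObject(buf);
                best = mid;
                lo = mid + 1;
        } else
                hi = mid - 1;
}
/* best now holds the largest single allocation the driver accepted */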

The second bug may be linked to the MEM_SIZE vs. MEM_SIZE/4 issue.
Agnieszka Bielec
2015-08-15 14:40:46 UTC
I wanted to manually measure the time of one crypt_all(), but somehow it
doesn't work when GWS=x is not set (I tested 2 methods of measuring
time).
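
One thing I realize could matter: if the time is taken on the host right
after the enqueue, without a clFinish(), the kernel runs asynchronously
and the timer shows ~0. Roughly (hypothetical names, inside crypt_all();
gws/lws are the size_t worksizes, needs <sys/time.h>):

struct timeval t0, t1;
gettimeofday(&t0, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, NULL);
clFinish(queue); /* without this we only time the enqueue call */
gettimeofday(&t1, NULL);
printf("Elapsed time: %f seconds\n",
       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);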

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
crypt all 256 256 64
Elapsed time: 0.030995 seconds
gws: 256 387 c/s 387 rounds/s 660.142ms per crypt_all()!
crypt all 512 512 64
Elapsed time: 0.000000 seconds
gws: 512 718 c/s 718 rounds/s 712.825ms per crypt_all()+
crypt all 1024 1024 64
Elapsed time: 0.000000 seconds
gws: 1024 1299 c/s 1299 rounds/s 788.154ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all 1 64 64
Elapsed time: 0.270959 seconds
crypt all 2 64 64
Elapsed time: 0.000000 seconds
crypt all 3 64 64
Elapsed time: 0.000000 seconds
crypt all 4 64 64
Elapsed time: 0.000000 seconds
crypt all 5 64 64
Elapsed time: 0.000000 seconds
crypt all 7 64 64
Elapsed time: 0.000000 seconds
crypt all 10 64 64
Elapsed time: 0.000000 seconds
crypt all 14 64 64
Elapsed time: 0.000000 seconds
crypt all 1024 1024 64
Elapsed time: 0.000000 seconds
using different password for benchmarking
crypt all 1024 1024 64
Elapsed time: 0.000000 seconds
qqqqqqqqqqqqqqqqqqqqqqqqq
crypt all 1024 1024 64
Elapsed time: 0.000000 seconds
qqqqqqqqqqqqqqqqqqqqqqqqq
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 1024
clock :f500000000000000
aaa Many salts: 417 c/s real, 102400 c/s virtual
zzzzz Only one salt: 419 c/s real, 102400 c/s virtual
___
[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all 1 64 64
Elapsed time: 0.139979 seconds
crypt all 2 64 64
Elapsed time: 0.002000 seconds
crypt all 3 64 64
Elapsed time: 0.002000 seconds
crypt all 4 64 64
Elapsed time: 0.002000 seconds
crypt all 5 64 64
Elapsed time: 0.001000 seconds
crypt all 7 64 64
Elapsed time: 0.000999 seconds
crypt all 10 64 64
Elapsed time: 0.001000 seconds
crypt all 14 64 64
Elapsed time: 0.000999 seconds
crypt all 1024 1024 64
Elapsed time: 0.002000 seconds
using different password for benchmarking
crypt all 1024 1024 64
Elapsed time: 0.002000 seconds
crypt all 1024 1024 64
Elapsed time: 0.003000 seconds
qqqqqqqqqqqqqqqqqqqqqqqqq
crypt all 1024 1024 64
Elapsed time: 0.002999 seconds
crypt all 1024 1024 64
Elapsed time: 0.003000 seconds
qqqqqqqqqqqqqqqqqqqqqqqqq
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 2048
clock :9d00000000000000
aaa Many salts: 1304 c/s real, 204800 c/s virtual
zzzzz Only one salt: 1304 c/s real, 204800 c/s virtual
______________________


I wanted to test bcrypt-opencl, but --v=4 doesn't work here. Why?
[***@super run]$ ./john --test --format=bcrypt-opencl --v=4
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish
OpenCL]... using different password for benchmarking
qqqqqqqqqqqqqqqqqqqqqqqqq
DONE
Speed for cost 1 (iteration count) of 32
ten int 6144
clock :9000000000000000
aaa Raw: 4266 c/s real, 307200 c/s virtual
Solar Designer
2015-08-15 16:04:34 UTC
Post by Agnieszka Bielec
I wanted to manually measure the time of one crypt_all(), but somehow it
doesn't work when GWS=x is not set (I tested 2 methods of measuring
time).
What are those 2 methods? Which one is used in the john invocations you
posted? And what makes you think it does not work?

Alexander
Agnieszka Bielec
2015-08-15 22:21:41 UTC
I added time measurement to crypt_all(), and here are the results:

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
crypt all start, count=256, gws=256, lws=64
crypt all end, time: 0.702250
gws: 256 385 c/s 385 rounds/s 664.384ms per crypt_all()!
crypt all start, count=512, gws=512, lws=64
crypt all end, time: 0.738910
gws: 512 719 c/s 719 rounds/s 711.666ms per crypt_all()+
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.819439
gws: 1024 1306 c/s 1306 rounds/s 783.545ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all start, count=1, gws=64, lws=64
crypt all end, time: 0.982416
crypt all start, count=2, gws=64, lws=64
crypt all end, time: 0.642484
crypt all start, count=3, gws=64, lws=64
crypt all end, time: 0.675356
crypt all start, count=4, gws=64, lws=64
crypt all end, time: 0.677136
crypt all start, count=5, gws=64, lws=64
crypt all end, time: 0.057678
crypt all start, count=7, gws=64, lws=64
crypt all end, time: 0.057936
crypt all start, count=10, gws=64, lws=64
crypt all end, time: 0.042161
crypt all start, count=14, gws=64, lws=64
crypt all end, time: 0.054247
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.615536
using different password for benchmarking
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.635043
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 263
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.645786
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 265
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 1024
clock : 263
aaa Many salts: 389 c/s real, 102400 c/s virtual
zzzzz Only one salt: 386 c/s real, 102400 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all start, count=1, gws=64, lws=64
crypt all end, time: 0.653867
crypt all start, count=2, gws=64, lws=64
crypt all end, time: 0.578068
crypt all start, count=3, gws=64, lws=64
crypt all end, time: 0.618967
crypt all start, count=4, gws=64, lws=64
crypt all end, time: 0.621076
crypt all start, count=5, gws=64, lws=64
crypt all end, time: 0.053851
crypt all start, count=7, gws=64, lws=64
crypt all end, time: 0.054477
crypt all start, count=10, gws=64, lws=64
crypt all end, time: 0.041921
crypt all start, count=14, gws=64, lws=64
crypt all end, time: 0.052137
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788093
using different password for benchmarking
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788118
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.789293
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 158
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788320
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.787732
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 158
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 2048
clock : 158
aaa Many salts: 1296 c/s real, 204800 c/s virtual
zzzzz Only one salt: 1296 c/s real, 102400 c/s virtual


I don't know how this is possible; this bug occurs only on super's AMD
(--dev=5 on super works after I cut the plaintext length).
The same problem shows up in a cracking run: it works faster when GWS=1024
is set, and slowly when GWS is not set.
Agnieszka Bielec
2015-08-16 12:01:38 UTC
Post by Agnieszka Bielec
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
-DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
-DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
Calculating best global worksize (GWS); max. 1s single kernel invocation.
crypt all start, count=256, gws=256, lws=64
crypt all end, time: 0.702250
gws: 256 385 c/s 385 rounds/s 664.384ms per crypt_all()!
crypt all start, count=512, gws=512, lws=64
crypt all end, time: 0.738910
gws: 512 719 c/s 719 rounds/s 711.666ms per crypt_all()+
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.819439
gws: 1024 1306 c/s 1306 rounds/s 783.545ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all start, count=1, gws=64, lws=64
crypt all end, time: 0.982416
crypt all start, count=2, gws=64, lws=64
crypt all end, time: 0.642484
crypt all start, count=3, gws=64, lws=64
crypt all end, time: 0.675356
crypt all start, count=4, gws=64, lws=64
crypt all end, time: 0.677136
crypt all start, count=5, gws=64, lws=64
crypt all end, time: 0.057678
crypt all start, count=7, gws=64, lws=64
crypt all end, time: 0.057936
crypt all start, count=10, gws=64, lws=64
crypt all end, time: 0.042161
crypt all start, count=14, gws=64, lws=64
crypt all end, time: 0.054247
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.615536
using different password for benchmarking
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.635043
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 263
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 2.645786
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 265
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 1024
clock : 263
aaa Many salts: 389 c/s real, 102400 c/s virtual
zzzzz Only one salt: 386 c/s real, 102400 c/s virtual
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
crypt all start, count=1, gws=64, lws=64
crypt all end, time: 0.653867
crypt all start, count=2, gws=64, lws=64
crypt all end, time: 0.578068
crypt all start, count=3, gws=64, lws=64
crypt all end, time: 0.618967
crypt all start, count=4, gws=64, lws=64
crypt all end, time: 0.621076
crypt all start, count=5, gws=64, lws=64
crypt all end, time: 0.053851
crypt all start, count=7, gws=64, lws=64
crypt all end, time: 0.054477
crypt all start, count=10, gws=64, lws=64
crypt all end, time: 0.041921
crypt all start, count=14, gws=64, lws=64
crypt all end, time: 0.052137
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788093
using different password for benchmarking
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788118
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.789293
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 158
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.788320
crypt all start, count=1024, gws=1024, lws=64
crypt all end, time: 0.787732
qqqqqqqqqqqqqqqqqqqqqqqqq
real_time 158
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
ten int 2048
clock : 158
aaa Many salts: 1296 c/s real, 204800 c/s virtual
zzzzz Only one salt: 1296 c/s real, 102400 c/s virtual
I don't know how this is possible; this bug occurs only on super's AMD
(--dev=5 on super works after I cut the plaintext length).
The same problem shows up in a cracking run: it works faster when GWS=1024
is set, and slowly when GWS is not set.
Now I was digging into argon2d (I discovered that this bug appears after
commit 9e96f452350c0f2cae32b38e4a4cd1f83d51a367);
before this commit the code was:

bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
        segment_length - 1) * BLOCK_SIZE;
for (i = 0; i < 64; i++)
{
        prev_block[i] = *(__global ulong2 *)(&memory[bi]);
        bi += 16;
}

The slowdown on AMD occurred when I changed this code to:

bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
        segment_length - 1) * BLOCK_SIZE / 16;
for (i = 0; i < 64; i++)
{
        prev_block[i] = ((__global ulong2 *)memory)[bi + i];
}

Does anyone see some logic here, or is this just a bug on AMD?
I didn't gain speed anywhere from similar changes, so I can just
revert these changes.
Solar Designer
2015-08-16 12:48:20 UTC
Post by Agnieszka Bielec
Now I was digging into argon2d (I discovered that this bug appears after
commit 9e96f452350c0f2cae32b38e4a4cd1f83d51a367)
bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE;
for (i = 0; i < 64; i++)
{
prev_block[i] = *(__global ulong2 *) (&memory[bi]);
bi += 16;
}
bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE / 16;
for (i = 0; i < 64; i++)
{
prev_block[i] = ((__global ulong2*)memory)[bi+i];
}
Does anyone see some logic here, or is this just a bug on AMD?
Why do you call this a bug? It isn't necessarily a bug when performance
of code changes when you change the source code.

Anyway, it looks like in the second code version you rely on address
scaling by 16, and this is probably not available in the architecture
(scaling by up to 8 is usually available), so it requires extra
instructions (explicit left shifts).
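
A plain C illustration of the difference (hypothetical 16-byte type, just
to show the addressing):

typedef struct { unsigned long lo, hi; } u128; /* stands in for ulong2 */

/* index form: address = base + 16 * (bi + i); a scale factor of 16
   typically can't be folded into the addressing mode */
u128 load_indexed(const u128 *memory, unsigned bi, unsigned i)
{
        return memory[bi + i];
}

/* byte-offset form: bi is already in bytes and the caller steps it by
   16, so the address is just base + bi */
u128 load_byteoff(const unsigned char *memory, unsigned bi)
{
        return *(const u128 *)(memory + bi);
}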

Alexander
Agnieszka Bielec
2015-08-16 13:03:56 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Now I was digging in argon2d (I discovered that this bug occurs after
commit 9e96f452350c0f2cae32b38e4a4cd1f83d51a367)
bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE;
for (i = 0; i < 64; i++)
{
prev_block[i] = *(__global ulong2 *) (&memory[bi]);
bi += 16;
}
bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE / 16;
for (i = 0; i < 64; i++)
{
prev_block[i] = ((__global ulong2*)memory)[bi+i];
}
Does anyone see some logic here, or is this just a bug on AMD?
Why do you call this a bug? It isn't necessarily a bug when performance
of code changes when you change the source code.
Anyway, it looks like in the second code version you rely on address
scaling by 16, and this is probably not available in the architecture
(usually available is scaling by up to 8), so requires extra
instructions (explicit left shifts).
Where do you see address scaling? bi is a uint, and even before the /16
there is BLOCK_SIZE, which is much bigger than 16 and divisible by 16, so
the preprocessor will change this to *[single value].
Solar Designer
2015-08-16 14:09:16 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
prev_block[i] = ((__global ulong2*)memory)[bi+i];
}
Does anyone see some logic here, or is this just a bug on AMD?
Why do you call this a bug? It isn't necessarily a bug when performance
of code changes when you change the source code.
Anyway, it looks like in the second code version you rely on address
scaling by 16, and this is probably not available in the architecture
(usually available is scaling by up to 8), so requires extra
instructions (explicit left shifts).
Where do you see address scaling?
bi+i is used to index an array of 16-byte elements, so it needs to be
multiplied by 16 each time (unless the compiler manages to optimize
this, perhaps much like you had done manually in the first version).
Post by Agnieszka Bielec
bi is a uint, and even before the /16
there is BLOCK_SIZE, which is much bigger than 16 and divisible by 16, so
How is this relevant?
Post by Agnieszka Bielec
the preprocessor will change this to *[single value].
I don't see any preprocessor macros here, on the line I quoted above.

What branch is this committed on, so that I can take a look in context?

Alexander
Agnieszka Bielec
2015-08-16 14:30:23 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
prev_block[i] = ((__global ulong2*)memory)[bi+i];
}
Does anyone see some logic here, or is this just a bug on AMD?
Why do you call this a bug? It isn't necessarily a bug when performance
of code changes when you change the source code.
Anyway, it looks like in the second code version you rely on address
scaling by 16, and this is probably not available in the architecture
(usually available is scaling by up to 8), so requires extra
instructions (explicit left shifts).
Where do you see address scaling?
bi+i is used to index an array of 16-byte elements, so it needs to be
multiplied by 16 each time (unless the compiler manages to optimize
this, perhaps much like you had done manually in the first version).
ok
Post by Solar Designer
Post by Agnieszka Bielec
bi is a uint, and even before the /16
there is BLOCK_SIZE, which is much bigger than 16 and divisible by 16, so
How is this relevant?
Post by Agnieszka Bielec
the preprocessor will change this to *[single value].
I don't see any preprocessor macros here, on the line I quoted above.
not relevant anymore
Post by Solar Designer
What branch is this committed on, so that I can take a look in context?
bleeding-jumbo
Agnieszka Bielec
2015-08-16 20:27:27 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
prev_block[i] = ((__global ulong2*)memory)[bi+i];
}
Does anyone see some logic here, or is this just a bug on AMD?
Why do you call this a bug? It isn't necessarily a bug when performance
of code changes when you change the source code.
Anyway, it looks like in the second code version you rely on address
scaling by 16, and this is probably not available in the architecture
(usually available is scaling by up to 8), so requires extra
instructions (explicit left shifts).
Where do you see address scaling?
bi+i is used to index an array of 16-byte elements, so it needs to be
multiplied by 16 each time (unless the compiler manages to optimize
this, perhaps much like you had done manually in the first version).
If something is not supported, why do I get the opposite of
this slowdown on AMD on my laptop? It's only slight, but when I modify the newest
code as in my previous e-mail I get the same speed in tests whether
GWS is set or not.

***@none ~/Desktop/r/run $ ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4114 c/s real, 4114 c/s virtual
Only one salt: 4114 c/s real, 4151 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4055 c/s real, 4096 c/s virtual
Only one salt: 4096 c/s real, 4055 c/s virtual

after this modification:

***@none ~/Desktop/r/run $ ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4055 c/s real, 4055 c/s virtual
Only one salt: 4055 c/s real, 4015 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4055 c/s real, 4096 c/s virtual
Only one salt: 4055 c/s real, 4055 c/s virtual
Solar Designer
2015-08-16 21:51:00 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
bi+i is used to index an array of 16-byte elements, so it needs to be
multiplied by 16 each time (unless the compiler manages to optimize
this, perhaps much like you had done manually in the first version).
If something is not supported, why do I get the opposite of
this slowdown on AMD on my laptop?
It is possible that index scaling by 16 is not supported on AMD GCN, but
is supported on NVIDIA Maxwell (although I doubt it) - you'd need to
check the corresponding ISA manuals and/or the generated GPU ISA code.
It is also possible that one compiler happens to handle this better than
the other, optimizing out the need to scale the index. Finally, it is
possible that extra instructions for the scaling by 16 are generated for
either GPU, but on one of them they end up actually helping e.g. through
avoiding a stall elsewhere. (It does sometimes happen that even a NOP
introduced into code speeds it up. In fact, some compilers generate
code with occasional NOPs in it in some cases - I've recently seen that
in code that icc generates for MIC. Usually this is done to have a next
instruction more likely issued onto a specific execution unit, which
may in turn benefit yet another sequence of instructions through which
execution units are busy vs. available at the time that sequence starts.)
Post by Agnieszka Bielec
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4114 c/s real, 4114 c/s virtual
Only one salt: 4114 c/s real, 4151 c/s virtual
BTW, these are impressively good speeds for your small GPU. We need to
get a Titan X, and it'll outperform a CPU significantly.

What speeds are you getting on well's CPU for Argon2d at these settings?
With memory (de)allocation out of the loop, like we had for the Lyra2
and yescrypt benchmarks.

Also, please set m=1536, so we'd have exactly 1.5 MiB.

Thanks,

Alexander
Agnieszka Bielec
2015-08-16 22:14:11 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
bi+i is used to index an array of 16-byte elements, so it needs to be
multiplied by 16 each time (unless the compiler manages to optimize
this, perhaps much like you had done manually in the first version).
If something is not supported, why do I get the opposite of
this slowdown on AMD on my laptop?
It is possible that index scaling by 16 is not supported on AMD GCN, but
Why is the speed fine when GWS=1024 is set before john?
Solar Designer
2015-08-17 09:56:44 UTC
Permalink
Post by Agnieszka Bielec
Why is the speed fine when GWS=1024 is set before john?
This appears to be unrelated to the OpenCL code fragments you posted,
and it is an important issue for you to continue investigating.

If you think those code fragments were somehow related to it, please
explain that.

Alexander
Agnieszka Bielec
2015-08-17 23:52:20 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Why is the speed fine when GWS=1024 is set before john?
This appears to be unrelated to the OpenCL code fragments you posted,
and it is an important issue for you to continue investigating.
If you think those code fragments were somehow related to it, please
explain that.
Maybe it is not, but BTW when I only turn on coalescing in argon2d, the
speed is the same with GWS=x and without GWS. Maybe this bug depends
on speed, although after the modifications I mentioned before the
speed is very, very slightly slower.

I would be grateful if someone could test './john --test
--format=argon2i-opencl --v=4'
Agnieszka Bielec
2015-08-17 23:35:47 UTC
Permalink
Post by Solar Designer
What speeds are you getting on well's CPU for Argon2d at these settings?
With memory (de)allocation out of the loop, like we had for the Lyra2
and yescrypt benchmarks.
Also, please set m=1536, so we'd have exactly 1.5 MiB.
I made some tests, although there is still a bug on super's AMD.
--dev=5 on super is working now. Also, I checked most of the better speeds
in a cracking run and they were OK. I was setting GWS every time due to this
bug on AMD and a bug with MEM_SIZE on NVIDIA.

argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 2805
memory: 1.5 MB

argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6083
memory: 1.5 MB

___________
well

***@well:~/f/run$ ./john --test --format=argon2d
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 7760 c/s real, 971 c/s virtual
Only one salt: 7808 c/s real, 976 c/s virtual

***@well:~/f/run$ ./john --test --format=argon2d --cost=1:1,1536:1536,5:5
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 7288 c/s real, 911 c/s virtual
Only one salt: 7224 c/s real, 904 c/s virtual

***@well:~/f/run$ ./john --test --format=argon2i
Will run 8 OpenMP threads
Benchmarking: argon2i [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2488 c/s real, 311 c/s virtual
Only one salt: 2480 c/s real, 310 c/s virtual

***@well:~/f/run$ ./john --test --format=argon2i --cost=3:3,1536:1536,5:5
Will run 8 OpenMP threads
Benchmarking: argon2i [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 2392 c/s real, 298 c/s virtual
Only one salt: 2384 c/s real, 298 c/s virtual

________________________
Titan X

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl --v=4 --dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 5206 c/s real, 5206 c/s virtual
Only one salt: 5206 c/s real, 5206 c/s virtual

[***@super run]$ LWS=32 GWS=1024 ./john --test --format=argon2d-opencl
--v=4 --dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 32, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6083 c/s real, 6083 c/s virtual
Only one salt: 6083 c/s real, 6083 c/s virtual

[***@super run]$ GWS=2048 ./john --test --format=argon2i-opencl --v=4 --dev=5
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 2048
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2805 c/s real, 2805 c/s virtual
Only one salt: 2805 c/s real, 2824 c/s virtual

[***@super run]$ LWS=32 GWS=1024 ./john --test --format=argon2d-opencl
--v=4 --dev=5 --cost=1:1,1536:1536,5:5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 32, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 5688 c/s real, 5688 c/s virtual
Only one salt: 5636 c/s real, 5585 c/s virtual

[***@super run]$ GWS=2048 ./john --test --format=argon2i-opencl --v=4
--dev=5 --cost=3:3,1536:1536,5:5
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 2048
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 2694 c/s real, 2730 c/s virtual
Only one salt: 2712 c/s real, 2712 c/s virtual

_____________________
Tahiti

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1288 c/s real, 204800 c/s virtual
Only one salt: 1288 c/s real, 204800 c/s virtual

[***@super run]$ LWS=32 GWS=1024 ./john --test --format=argon2d-opencl --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 32, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2648 c/s real, 307200 c/s virtual
Only one salt: 2742 c/s real, 307200 c/s virtual

[***@super run]$ LWS=32 GWS=1024 ./john --test --format=argon2d-opencl
--v=4 --cost=1:1,1536:1536,5:5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 32, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 2671 c/s real, 307200 c/s virtual
Only one salt: 2671 c/s real, 307200 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
--cost=3:3,1536:1536,5:5
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 1211 c/s real, 204800 c/s virtual
Only one salt: 1211 c/s real, 204800 c/s virtual

__________________________
960M

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1878 c/s real, 1861 c/s virtual
Only one salt: 1861 c/s real, 1861 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual

***@none ~/Desktop/r/run $ LWS=32 GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4266 c/s real, 4227 c/s virtual
Only one salt: 4227 c/s real, 4266 c/s virtual
Solar Designer
2015-08-19 00:46:20 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
What speeds are you getting on well's CPU for Argon2d at these settings?
With memory (de)allocation out of the loop, like we had for the Lyra2
and yescrypt benchmarks.
Also, please set m=1536, so we'd have exactly 1.5 MiB.
I made some tests, although there is still a bug on super's AMD.
You mean the auto-tuning weirdness?
Post by Agnieszka Bielec
--dev=5 on super is working now. Also, I checked most of the better speeds
in a cracking run and they were OK. I was setting GWS every time due to this
bug on AMD and a bug with MEM_SIZE on NVIDIA.
OK for now.
Post by Agnieszka Bielec
argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 2805
memory: 1.5 MB
argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6083
memory: 1.5 MB
OK. We really need to get a Titan X to run tests on it as well,
hopefully soon. Your 960M works impressively well here.
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 7760 c/s real, 971 c/s virtual
Only one salt: 7808 c/s real, 976 c/s virtual
I assume it's just a temporary glitch that "Many salts" appeared
slightly slower than "Only one salt" here. They should be almost the
same, or "Many salts" very slightly better. In fact, once you're done
debugging these formats, you'll need to set BENCHMARK_LENGTH to -1 to
suppress these separate benchmarks (there will be just one then: Raw).
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 7288 c/s real, 911 c/s virtual
Only one salt: 7224 c/s real, 904 c/s virtual
What's that "cost 3 (l)"?
Post by Agnieszka Bielec
Titan X
We don't have a Titan X yet, unfortunately. Our Titan is of the older
type, based on Kepler architecture. We need to buy and add a Titan X
into this machine, hopefully soon.

Alexander
Agnieszka Bielec
2015-08-19 02:10:18 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
What speeds are you getting on well's CPU for Argon2d at these settings?
With memory (de)allocation out of the loop, like we had for the Lyra2
and yescrypt benchmarks.
Also, please set m=1536, so we'd have exactly 1.5 MiB.
I made some tests although there is still a bug on super's AMD.
You mean the auto-tuning weirdness?
Yes, the auto-tuning weirdness.
I discovered that the slowdown happens after I de-allocate and then allocate
this big buffer again. I blocked allocate_clobj and deallocate_clobj once a
certain part of the code was reached, and I was playing around with
allocation/deallocation in crypt_all().
Post by Solar Designer
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 7760 c/s real, 971 c/s virtual
Only one salt: 7808 c/s real, 976 c/s virtual
I assume it's just a temporary glitch that "Many salts" appeared
slightly slower than "Only one salt" here. They should be almost the
same, or "Many salts" very slightly better. In fact, once you're done
debugging these formats, you'll need to set BENCHMARK_LENGTH to -1 to
suppress these separate benchmarks (there will be just one then: Raw).
I tested bcrypt and scrypt, and it's the same situation:

***@none ~/Desktop/r/run $ ./john --test --format=bcrypt
Will run 8 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]...
(8xOMP) using different password for benchmarking
DONE
Speed for cost 1 (iteration count) of 32
Many salts: 6096 c/s real, 768 c/s virtual
Only one salt: 6225 c/s real, 782 c/s virtual
Post by Solar Designer
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 5
Many salts: 7288 c/s real, 911 c/s virtual
Only one salt: 7224 c/s real, 904 c/s virtual
What's that "cost 3 (l)"?
lanes
Solar Designer
2015-08-19 02:18:47 UTC
Permalink
Agnieszka,

Am I correct that all of your work on Argon2 so far is on revision 1.0,
without BlaMka and without the indexing function enhancement? (I hope
you're keeping track of relevant discussions on the PHC list.)

In other words, where and when did you obtain the Argon2 code that you
integrated and ported to OpenCL? What revision numbers did it have on it?
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 7760 c/s real, 971 c/s virtual
Only one salt: 7808 c/s real, 976 c/s virtual
I assume it's just a temporary glitch that "Many salts" appeared
slightly slower than "Only one salt" here. They should be almost the
same, or "Many salts" very slightly better. In fact, once you're done
debugging these formats, you'll need to set BENCHMARK_LENGTH to -1 to
suppress these separate benchmarks (there will be just one then: Raw).
I tested bcrypt and scrypt, and it's the same situation:
Will run 8 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]...
(8xOMP) using different password for benchmarking
DONE
Speed for cost 1 (iteration count) of 32
Many salts: 6096 c/s real, 768 c/s virtual
Only one salt: 6225 c/s real, 782 c/s virtual
This might be CPU clock frequency scaling being slow to kick in (un-idle
your CPU), or other activity on the system being slow to calm down after
you've just started john. You may try running longer benchmarks, e.g.
use "--test=10".

Alexander
Agnieszka Bielec
2015-08-19 02:47:32 UTC
Permalink
Post by Solar Designer
Agnieszka,
Am I correct that all of your work on Argon2 so far is on revision 1.0,
without BlaMka and without the indexing function enhancement? (I hope
you're keeping track of relevant discussions on the PHC list.)
I have Argon2 from github https://github.com/khovratovich/Argon2, from
the master branch. Should I work on the 'enhcance' branch now?

in argon2i.h:
#define VERSION_NUMBER 0x11
in argon2d.h:
#define VERSION_NUMBER 0x10

I know that there is no BlaMka; no idea if there is the indexing function
enhancement or not.
I'm keeping track of the discussion.
Post by Solar Designer
In other words, where and when did you obtain the Argon2 code that you
integrated and ported to OpenCL? What revision numbers did it have on it?
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Will run 8 OpenMP threads
Benchmarking: argon2d [Blake2 AVX]... (8xOMP)
memory per hash : 1.50 MB
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 7760 c/s real, 971 c/s virtual
Only one salt: 7808 c/s real, 976 c/s virtual
I assume it's just a temporary glitch that "Many salts" appeared
slightly slower than "Only one salt" here. They should be almost the
same, or "Many salts" very slightly better. In fact, once you're done
debugging these formats, you'll need to set BENCHMARK_LENGTH to -1 to
suppress these separate benchmarks (there will be just one then: Raw).
I tested bcrypt and scrypt, and it's the same situation:
Will run 8 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]...
(8xOMP) using different password for benchmarking
DONE
Speed for cost 1 (iteration count) of 32
Many salts: 6096 c/s real, 768 c/s virtual
Only one salt: 6225 c/s real, 782 c/s virtual
This might be CPU clock frequency scaling being slow to kick in (un-idle
your CPU), or other activity on the system being slow to calm down after
you've just started john. You may try running longer benchmarks, e.g.
use "--test=10".
I'm testing bcrypt, argon2i and argon2d on well now.
In bcrypt, many salts is sometimes faster, sometimes slower.
Now argon2d many salts is always faster than only one salt (and argon2d is
faster now - many salts: 8008, only one salt: 7960).
argon2i many salts is slower, faster, or equal.
Solar Designer
2015-08-19 04:10:43 UTC
Permalink
Post by Agnieszka Bielec
I have Argon2 from github https://github.com/khovratovich/Argon2, from
the master branch. Should I work on the 'enhcance' branch now?
You'll need to switch at some point, but since you already did so much
work on the original Argon2, I think it makes sense for you to stick
with it for a while longer.

Regarding optimizations, especially for GPU:

I think the modulo division operations are causing a lot of latency:

[***@super opencl]$ fgrep % argon2*.cl
argon2d_kernel.cl: reference_block_offset = (phi % r);
argon2i_kernel.cl: uint reference_block_index = addresses[0] % r;
argon2i_kernel.cl: uint reference_block_index = addresses[i] % r;

On GPU, this means having local and private memory tied up to code that
doesn't actually touch that memory but is instead doing this division
operation for many cycles in a row. This is extremely wasteful. I think
this might explain the unexpectedly poor performance on AMD GCN. (Maybe
NVIDIA has relatively low latency integer division hardware.)

For Argon2i, you should be able to easily optimize this overhead out,
since all the indices are known in advance (they are the same each
time, by design, as required to avoid cache timing leaks). You should
also be able to optimize out the hashing that produces those indices
(before the modulo division), but that's relatively minor (yet by all
means make this optimization as well if you do precompute the indices).

This means you will need some memory to store those indices in (1536 of
them for our current benchmarks? meaning something like 3 KB?), but this
memory can be shared between different concurrent hash computations.
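
A sketch of the precomputation (hypothetical code: gen_address() is a
placeholder for Argon2i's real BLAKE2-based address generation, and the
real modulo operand depends on the position in more detail than shown):

uint gen_address(uint i)
{
    return i * 0x9e3779b9u; /* placeholder mixing, NOT Argon2i's generation */
}

/* fill a global table of reference block indices once; all concurrent
   hash computations then just read from it, with no hashing and no % */
__kernel void precompute_ref_indices(__global uint *ref_index, uint total)
{
    uint i = get_global_id(0);

    if (i > 0 && i < total)
        ref_index[i] = gen_address(i) % i; /* i = blocks written so far */
}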

For Argon2d, optimizing this is not easy, and the speedup potential is
lower. Yet there could be ways:

Fast Division Algorithm with a Small Lookup Table
http://arith.stanford.edu/~hung/papers/asilomar.pdf

"This paper presents a new division algorithm, which requires two
multiplication operations and a single lookup in a small table. The
division algorithm takes two steps. The table lookup and the first
multiplication are processed concurrently in the first step, and the
second multiplication is executed in the next step. This divider uses a
single multiplier and a lookup table with 2/sup m/(2m+1) bits to produce
2 m-bit results that are guaranteed correct to one ulp. By using a
multiplier and a 12.5 KB lookup table, the basic algorithm generates a
24-bit result in two cycles."

It might also be practical to adapt methods normally used for
floating-point to producing precise integer results (there might need to
be some trial and error to confirm or come up with precise results, and
you'll need to implement this in code too):

https://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods
http://stackoverflow.com/questions/12227126/division-as-multiply-and-lut-fast-float-division-reciprocal

Someone with a GPU working on a much simpler subset of the problem:

http://stackoverflow.com/questions/2616072/faster-integer-division-when-denominator-is-known

(just to illustrate the problem of slow integer division on GPUs).
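
To give a flavor of the reciprocal approach (a sketch, assuming 32-bit x
and r with r > 1; not taken from the papers above): precompute
floor(2^32 / r) once per r, then each reduction is two multiplies plus at
most one correction:

/* exact x % r for 32-bit x, r > 1; recip = ((ulong)1 << 32) / r is
   computed once per r.  The approximate quotient is at most one too
   small, so a single conditional subtract fixes the remainder. */
uint mod_by_recip(uint x, uint r, ulong recip)
{
    uint q = (uint)(((ulong)x * recip) >> 32);
    uint rem = x - q * r;
    if (rem >= r)
        rem -= r;
    return rem;
}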

Before you spend a lot of time on this, I suggest that you replace this
modulo operation with something simpler (and wrong), yet in some ways
similar, e.g.:

static inline uint32_t wrap(uint64_t x, uint32_t n)
{
    uint64_t a = (x + n) & (n - 1);
    uint64_t b = x & n;
    uint64_t c = (x << 1) & n;
    return ((a << 1) + b + c) >> 2;
}

(and its OpenCL equivalent, with proper data types). Of course, this
revision of Argon2 won't match Argon2's normal test vectors, but you
should be able to see roughly what performance you could get if you
later optimize the division.
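
The OpenCL transcription is mechanical (a direct sketch of the function
above with OpenCL's native types; as said, it deliberately computes
something other than the true modulo):

inline uint wrap(ulong x, uint n)
{
    ulong a = (x + n) & (n - 1);
    ulong b = x & n;
    ulong c = (x << 1) & n;
    return (uint)(((a << 1) + b + c) >> 2);
}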

Alexander
Solar Designer
2015-08-19 04:40:57 UTC
Permalink
Agnieszka,
Post by Solar Designer
argon2d_kernel.cl: reference_block_offset = (phi % r);
argon2i_kernel.cl: uint reference_block_index = addresses[0] % r;
argon2i_kernel.cl: uint reference_block_index = addresses[i] % r;
You might also achieve speedup by moving these operations up in code, to
be performed as soon as their input data is available. Maybe the
compiler already does it for you, or maybe not.

In 2d, you could compute an equivalent of "phi % r" inside ComputeBlock,
after having invoked 9 out of 16 BLAKE2 rounds. (The remaining 7 only
affect state[] elements other than the one used for phi.)

In 2i, you could compute "addresses[...] % r" on writes to addresses[]
(thus, store the block indices instead of the original addresses).

"r" might not yet have the correct value at that point, but you should
be able to correctly predict what it would have been at the "%" time.

However, I doubt this will fully do the latency hiding trick on GPUs.
Parallel processing capabilities to perform the modulo operations
concurrently are simply not available at that level in the OpenCL
programming model:

For 2d, there's a chance you'd save the latency of one modulo operation
(and that's all you need) if the GPU has an instruction like this in
hardware (and microcode?) and it doesn't block further processing (until
there's a data dependency on the result).

For 2i, there's no way those 256 modulo operations would be run
concurrently from one work-item. And besides, to run them concurrently
you'd need to provide storage for the results (you already have that
addresses[] array) and then it's no better than copying this data e.g.
from a larger array (holding all precomputed indices) in global memory.

Alexander
Solar Designer
2015-08-19 16:51:35 UTC
Permalink
Agnieszka,
Post by Solar Designer
Post by Solar Designer
argon2d_kernel.cl: reference_block_offset = (phi % r);
argon2i_kernel.cl: uint reference_block_index = addresses[0] % r;
argon2i_kernel.cl: uint reference_block_index = addresses[i] % r;
You might also achieve speedup by moving these operations up in code, to
be performed as soon as their input data is available. Maybe the
compiler already does it for you, or maybe not.
Moreover, you may also prefetch the data pointed to by the index from
global memory sooner. You have limited local or private memory to
prefetch to, but you probably do have it allocated for one block anyway,
and you can start fetching it sooner. Or you can prefetch() into global
memory cache:

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/prefetch.html

If you continue to use ulong2, then in 2d you may prefetch after 9 out
of 16 BLAKE2's. With ulong8, you may prefetch after 12 out of 16. With
ulong16, you can't... yet it might be optimal for other reasons (twice
higher parallelism over ulong8 yet with the same total concurrent
instance count).

In 2i, you may prefetch whenever you like (which should be when you
determine is the optimal time to prefetch, so that the prefetched data
is available in time for its use yet isn't thrown out of global memory
cache before it's used), regardless of how much parallelism in
ComputeBlock you exploit.
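
The global-memory variant could be as simple as this (a sketch with
assumed names; the 64 ulong2's stand for one 1 KiB block):

/* issue the prefetch as soon as the reference index is known, then do
   independent work while the cache lines arrive */
__kernel void prefetch_demo(__global ulong2 *memory, __global ulong2 *out,
    uint ref_offset)
{
    prefetch(&memory[ref_offset], 64);
    /* ... the rest of the block computation would go here ... */
    out[get_global_id(0)] = memory[ref_offset];
}
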
Post by Solar Designer
For 2i, there's no way those 256 modulo operations would be run
concurrently from one work-item. And besides, to run them concurrently
you'd need to provide storage for the results (you already have that
addresses[] array) and then it's no better than copying this data e.g.
from a larger array (holding all precomputed indices) in global memory.
You could probably pack more instances of 2i per GPU by reducing the
size of this addresses[] array, fetching smaller groups of indices from
global memory at a time (than are being computed at a time now).

Alexander
Solar Designer
2015-08-19 15:37:03 UTC
Permalink
Post by Solar Designer
For Argon2i, you should be able to easily optimize this overhead out,
since all the indices are known in advance (they are the same each
time, by design, as required to avoid cache timing leaks). You should
also be able to optimize out the hashing that produces those indices
(before the modulo division), but that's relatively minor (yet by all
means make this optimization as well if you do precompute the indices).
This means you will need some memory to store those indices in (1536 of
them for our current benchmarks? meaning something like 3 KB?), but this
memory can be shared between different concurrent hash computations.
Well, not 1536, but 3 times that, for t=3. Still easily affordable.

Alexander
Solar Designer
2015-08-19 16:39:09 UTC
Permalink
Agnieszka,

As it has just been mentioned on the PHC list, you need to try
exploiting the parallelism inside ComputeBlock. There are two groups of
8 BLAKE2 rounds. In each of the groups, the 8 rounds may be computed in
parallel. When your kernel is working on ulong2, I think it won't fully
exploit this parallelism, except that the parallelism may allow for
better pipelining within those ulong2 lanes (not stalling further
instructions since their input data is separate and thus is readily
available).

I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.

With this, you'd launch groups of 8 or 4 BLAKE2 rounds on those wider
vectors, and then between the two groups of 8 in ComputeBlock you'd need
to shuffle vector elements (moving them between two vectors of ulong8 if
you use that type) instead of shuffling state[] elements like you do now
(and like the original Argon2 code did).

The expectation is that a single kernel invocation will then make use of
more SIMD width (2x512- or 512-bit instead of the current 128-bit), yet
only the same amount of local and private memory as it does now. So
you'd pack as many of these kernels per GPU as you do now, but they will
run faster (up to 8x faster) since they'd process 8 or 4 BLAKE2 rounds
in parallel rather than sequentially.
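
For reference, the per-step shape could be as simple as this (a sketch,
not committed code): one macro runs 8 BLAKE2b G functions lane-wise on
ulong8 vectors. Argon2's permutation uses the BLAKE2b round without
message words, and since OpenCL's rotate() rotates left, BLAKE2b's right
rotations by 32, 24, 16, 63 become left rotations by 32, 40, 48, 1:

/* 8 independent BLAKE2b G's at once, one per 64-bit lane of the vectors */
#define G8(a, b, c, d) do { \
    a += b; d = rotate(d ^ a, (ulong8)32); \
    c += d; b = rotate(b ^ c, (ulong8)40); \
    a += b; d = rotate(d ^ a, (ulong8)48); \
    c += d; b = rotate(b ^ c, (ulong8)1); \
} while (0)

The shuffling between the two groups of 8 rounds then reduces to
regrouping which ulong2 lanes sit in which ulong8.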

Of course, once you've sped this up, other parts of code may become the
new bottlenecks. In particular, the modulo operation may become even
more important to optimize as well. You can, and should, quickly test
whether or not it is a bottleneck for a given kernel on a given GPU by
replacing it with that wrap() function I posted.

Alexander
Agnieszka Bielec
2015-08-20 18:04:20 UTC
Permalink
Post by Solar Designer
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
I created something using ulong8; the speed is almost unnoticeably better
on my laptop but worse on both of super's cards. No idea if this is
what you wanted (I think not); you can take a look at the branch
vector8.
Solar Designer
2015-08-20 20:34:03 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
I created something using ulong8; the speed is almost unnoticeably better
on my laptop but worse on both of super's cards. No idea if this is
what you wanted (I think not); you can take a look at the branch
vector8.
This is a step towards what I meant, but you're not quite there yet.

You need to convert more of the processing to ulong8 (or ulong16). For
example, you still have "ulong2 ref_block[64];" in ComputeBlock_pgg(),
but it should become an array of ulong8 too. And so on. Yes, this
means that you either have to convert the callers to using this wider
vector type as well, or you have to convert between the vectors
somewhere (which likely results in performance loss). You should also
use the wider vector type for the global memory references and in the
kernel parameter list.

The only shuffling of ulong2's inside/between ulong8's should be between
the two groups of 8 BLAKE2 rounds. Right now, you also have conversion
from ulong2 to ulong8 before the first group of 8 BLAKE2 rounds - it
should go away when you optimize this code further as I suggested above.

Also, the shuffling can probably be optimized. Right now, you keep the
full block in state[] and you also have 8 ulong8's storing half a block
at a time. You may instead have 16 ulong8's storing the entire block.
Yes, the shuffling might require some temporary storage, but you don't
necessarily have to write the entire block to a temporary array of
ulong2's - perhaps there's a more efficient way for the specific kind of
shuffling that is being done.

Also, we're optimizing this blindfolded, and that's wrong. We should be
reviewing the generated code. You may patch common-opencl.c:
opencl_build_kernel_opt() to invoke opencl_build() like this:

opencl_build(sequential_id, opts, 1, "kernel.out");

instead of the current:

opencl_build(sequential_id, opts, 0, NULL);

Then when targeting NVIDIA cards it dumps PTX assembly to the filename
specified there. It looks something like this, just much larger:

http://arrayfire.com/demystifying-ptx-code/

You could start by experimenting with a much simpler than Argon2 yet in
some ways similar kernel: implement some trivial operation like XOR on
different vector widths and see whether/how this changes the assembly.
Then make it slightly less trivial (just enough to prevent the compiler
from optimizing things out) and add uses of private or local memory,
and see if you can make it run faster by using wider vectors per the
same private or local memory usage.

Alexander
Agnieszka Bielec
2015-08-20 20:40:16 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
I created something using ulong8; the speed is almost unnoticeably better
on my laptop but worse on both of super's cards. No idea if this is
what you wanted (I think not); you can take a look at the branch
vector8.
You should also
use the wider vector type for the global memory references and in the
kernel parameter list.
it was even slower (on super, both cards)
Solar Designer
2015-08-20 21:03:13 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
I created something using ulong8; the speed is almost unnoticeably better
on my laptop but worse on both of super's cards. No idea if this is
what you wanted (I think not); you can take a look at the branch
vector8.
You should also
use the wider vector type for the global memory references and in the
kernel parameter list.
it was even slower (on super, both cards)
Where is the code? Slower now doesn't necessarily mean we're doing
anything wrong - it might also mean we're not doing enough of it yet.

And how much slower was it? Did you try re-tuning LWS and GWS?

Alexander
Agnieszka Bielec
2015-08-20 21:29:08 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
I created something using ulong8; the speed is almost unnoticeably better
on my laptop but worse on both of super's cards. No idea if this is
what you wanted (I think not); you can take a look at the branch
vector8.
You should also
use the wider vector type for the global memory references and in the
kernel parameter list.
it was even slower (on super, both cards)
Where is the code? Slower now doesn't necessarily mean we're doing
anything wrong - it might also mean we're not doing enough of it yet.
deleted, it wasn't much effort anyway
Post by Solar Designer
And how much slower was it? Did you try re-tuning LWS and GWS?
nope

vector8

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2061 c/s real, 307200 c/s virtual
Only one salt: 2104 c/s real, 307200 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4970 c/s real, 5019 c/s virtual
Only one salt: 5019 c/s real, 4970 c/s virtual

vector8 + ulong8 for copying and xoring

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1563 c/s real, 102400 c/s virtual
Only one salt: 1563 c/s real, 204800 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4970 c/s real, 4923 c/s virtual
Only one salt: 4923 c/s real, 4970 c/s virtual
Solar Designer
2015-08-21 02:34:32 UTC
Permalink
Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
You should also
use the wider vector type for the global memory references and in the
kernel parameter list.
it was even slower (on super, both cards)
Where is the code? Slower now doesn't necessarily mean we're doing
anything wrong - it might also mean we're not doing enough of it yet.
deleted, it wasn't much effort anyway
Does your posting of the benchmarks now imply that you re-created that
code? If so, may I take a look at it now?

I am surprised that "it wasn't much effort". This suggests that you
did it only partially, and I'd like to see what you did vs. did not do,
to know what those benchmarks actually correspond to.

Thanks,

Alexander
Agnieszka Bielec
2015-08-21 15:40:42 UTC
Permalink
Post by Solar Designer
Also, we're optimizing this blindfolded, and that's wrong. We should be
opencl_build(sequential_id, opts, 1, "kernel.out");
opencl_build(sequential_id, opts, 0, NULL);
Then when targeting NVIDIA cards it dumps PTX assembly to the filename
http://arrayfire.com/demystifying-ptx-code/
You could start by experimenting with a much simpler than Argon2 yet in
some ways similar kernel: implement some trivial operation like XOR on
different vector widths and see whether/how this changes the assembly.
Then make it slightly less trivial (just enough to prevent the compiler
from optimizing things out) and add uses of private or local memory,
and see if you can make it run faster by using wider vectors per the
same private or local memory usage.
I tested (only 960M):
- copying memory from __private to __private
- copying from __global to __private
- xoring __private tables with __private tables

using ulong, ulong2, ulong4, ulong8 (I was getting an empty kernel using ulong16)

In the generated PTX code ulong4 and ulong8 were changed to ulong2,

something like here (ulong4):

ld.global.v2.u64 {%rd73, %rd74}, [%rd926+8000];
ld.global.v2.u64 {%rd77, %rd78}, [%rd926+8016];
st.local.v2.u64 [%rd937+208], {%rd77, %rd78};
st.local.v2.u64 [%rd937+192], {%rd73, %rd74};

I was getting the best speed with ulong (except copying from global to private).


speeds:

xoring:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 1312 c/s real, 1312 c/s virtual
Only one salt: 1301 c/s real, 1312 c/s virtual
//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 590 c/s real, 590 c/s virtual
Only one salt: 595 c/s real, 595 c/s virtual

//4
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 914 c/s real, 914 c/s virtual
Only one salt: 906 c/s real, 898 c/s virtual
//8
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 738 c/s real, 731 c/s virtual
Only one salt: 738 c/s real, 738 c/s virtual

copying from global:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 853 c/s real, 860 c/s virtual
Only one salt: 860 c/s real, 860 c/s virtual

//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 1174 c/s real, 1185 c/s virtual
Only one salt: 1174 c/s real, 1163 c/s virtual

//4,8
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 1122 c/s real, 1113 c/s virtual
Only one salt: 1132 c/s real, 1132 c/s virtual


copying from private:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 2258 c/s real, 2236 c/s virtual
Only one salt: 2258 c/s real, 2258 c/s virtual

//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 685 c/s real, 679 c/s virtual
Only one salt: 685 c/s real, 691 c/s virtual

//4
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 1153 c/s real, 1142 c/s virtual
Only one salt: 1163 c/s real, 1163 c/s virtual

//8
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts: 1796 c/s real, 1796 c/s virtual
Only one salt: 1812 c/s real, 1812 c/s virtual
Solar Designer
2015-08-22 01:48:56 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
You could start by experimenting with a much simpler than Argon2 yet in
some ways similar kernel: implement some trivial operation like XOR on
different vector widths and see whether/how this changes the assembly.
Then make it slightly less trivial (just enough to prevent the compiler
from optimizing things out) and add uses of private or local memory,
and see if you can make it run faster by using wider vectors per the
same private or local memory usage.
I tested (only 960M):
- copying memory from __private to __private
- copying from __global to __private
- xoring __private tables with __private tables
using ulong, ulong2, ulong4, ulong8 (I was getting an empty kernel using ulong16)
Great. Where is the OpenCL code for these? I'd like to take a look at
what exactly you were testing.

Alexander
Agnieszka Bielec
2015-08-22 08:42:42 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
You could start by experimenting with a much simpler than Argon2 yet in
some ways similar kernel: implement some trivial operation like XOR on
different vector widths and see whether/how this changes the assembly.
Then make it slightly less trivial (just enough to prevent the compiler
from optimizing things out) and add uses of private or local memory,
and see if you can make it run faster by using wider vectors per the
same private or local memory usage.
I tested (only 960M):
- copying memory from __private to __private
- copying from __global to __private
- xoring __private tables with __private tables
using ulong, ulong2, ulong4, ulong8 (I was getting an empty kernel using ulong16)
Great. Where is the OpenCL code for these? I'd like to take a look at
what exactly you were testing.
xoring:


#define MOD 4
#define type ulong4

void func(type *table)
{
    /* empty; the call keeps the compiler from optimizing the loops away */
}

__kernel void pomelo_crypt_kernel(__global const uchar *in,
    __global const uint *index,
    __global char *out,
    __global struct pomelo_salt *salt, __global type *S)
{
    int i, j;
    uint gid = get_global_id(0);

    S += gid * 1024 / MOD;
    type copy1[1024 / MOD];
    type copy2[1024 / MOD];
    for (j = 0; j < 1024 / MOD; j++)
        copy1[j] = S[1024 / MOD - 1 - j]; /* -1 keeps the index in range */
    for (j = 0; j < 1024 / MOD; j++)
        copy2[j] = S[j];
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1024 / MOD; j++)
            copy1[j] ^= copy2[1024 / MOD - 1 - j]; /* -1: stay inside copy2 */
        func(copy1);
        for (j = 0; j < 1024 / MOD; j++)
            copy1[j] ^= copy2[j];
        func(copy1);
    }
    out[gid] = ((ulong *)copy1)[0]; /* keep the result live */
}

copying from global to private (same MOD/type defines and func() as above):

__kernel void pomelo_crypt_kernel(__global const uchar *in,
    __global const uint *index,
    __global char *out,
    __global struct pomelo_salt *salt, __global type *S)
{
    int i, j;
    uint gid = get_global_id(0);

    S += gid * 1024 / MOD;
    type copy1[1024 / MOD];
    type copy2[1024 / MOD];
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1024 / MOD; j++)
            copy1[j] = S[1024 / MOD - 1 - j]; /* -1 keeps the index in range */
        func(copy1);
        for (j = 0; j < 1024 / MOD; j++)
            copy1[j] = S[j];
        func(copy1);
    }
    out[gid] = ((ulong *)copy1)[0]; /* keep the result live */
}

copying from private to private (I didn't have the defines yet):

__kernel void pomelo_crypt_kernel(__global const uchar *in,
    __global const uint *index,
    __global char *out,
    __global struct pomelo_salt *salt, __global ulong2 *S)
{
    int i, j;
    uint gid = get_global_id(0);

    S += gid * 1024 / 2;
    ulong2 copy1[1024 / 2];
    ulong2 copy2[1024 / 2];
    for (i = 0; i < 1024 / 2; i++)
        copy1[i] = S[i];
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1024 / 2; j++)
            copy2[j] = copy1[1024 / 2 - 1 - j]; /* -1 keeps the index in range */
        func(copy2);
        for (j = 0; j < 1024 / 2; j++)
            copy1[j] = copy2[j];
        func(copy1);
    }
    out[gid] = ((ulong *)copy1)[0]; /* keep the result live */
}
Solar Designer
2015-08-23 05:37:22 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
You could start by experimenting with a much simpler than Argon2 yet in
some ways similar kernel: implement some trivial operation like XOR on
different vector widths and see whether/how this changes the assembly.
Then make it slightly less trivial (just enough to prevent the compiler
from optimizing things out) and add uses of private or local memory,
and see if you can make it run faster by using wider vectors per the
same private or local memory usage.
I tested (only 960M):
- copying memory from __private to __private
- copying from __global to __private
- xoring __private tables with __private tables
using ulong, ulong2, ulong4, ulong8 (I was getting an empty kernel using ulong16)
In the generated PTX code ulong4 and ulong8 were changed to ulong2
I've just read up on this. It turns out that vectors wider than 128-bit
are not supported in PTX:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#vectors

"Vectors cannot exceed 128-bits in length"

So it's no surprise that ulong4 and ulong8 resulted in PTX code that
used ulong2-like instructions.

Maybe we could bypass this limitation by programming at GPU ISA level,
such as with MaxAs, or maybe not. We'd need to become familiar with a
given GPU ISA (and maybe more) to answer this.

When we're programming in OpenCL (or even in PTX or IL), we appear to be
stuck with the SIMT rather than SIMD model. We want to access wider
SIMD from one instance of Argon2, but we're only given SIMT instead.
What we can possibly do is accept this model, and work within it to
achieve our goal. We can have a few threads (work-items) running in
lock-step anyway (because they're physically being run on wider SIMD
than we can access) communicate via shared (local) memory frequently,
during each block computation. If we're confident enough of them
running in lock-step, we could even forgo explicit synchronization, even
though this is considered risky (not future-proof):

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#warp-synchronous
https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/

In OpenCL, I think we'd use barrier(CLK_LOCAL_MEM_FENCE). We already
use it in some kernels in JtR.

If we do use explicit synchronization, then I wonder if it'd kill the
performance (since we'd do it very frequently) or if it possibly would
be just fine (since there's not much to synchronize anyway).

Explicit local memory barrier might be needed (even when the underlying
hardware is simply a wider SIMD) because the compiler might otherwise
reorder some store instructions to be after the loads that we intend to
load data already modified by those stores. In other words, even if a
memory barrier isn't needed at hardware level, it might still be needed
at compiler level.
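
The skeleton would look something like this (a sketch with assumed sizes:
4 work-items cooperating on one 64 x ulong2 block in local memory; run
with a work-group size of 4 for the sketch to be literal):

__kernel void coop_demo(__global ulong2 *memory, __local ulong2 *block)
{
    uint lane = get_local_id(0) & 3;  /* 4 lanes per hash instance */
    uint i;

    for (i = lane; i < 64; i += 4)    /* each lane loads a quarter */
        block[i] = memory[i];
    barrier(CLK_LOCAL_MEM_FENCE);     /* make all stores visible */

    for (i = lane; i < 64; i += 4)    /* consume rows other lanes wrote */
        memory[i] = block[(i + 1) & 63];
}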

Alexander
Solar Designer
2015-08-23 05:57:42 UTC
Permalink
Post by Solar Designer
When we're programming in OpenCL (or even in PTX or IL), we appear to be
stuck with the SIMT rather than SIMD model. We want to access wider
SIMD from one instance of Argon2, but we're only given SIMT instead.
What we can possibly do is accept this model, and work within it to
achieve our goal. We can have a few threads (work-items) running in
lock-step anyway (because they're physically being run on wider SIMD
than we can access) communicate via shared (local) memory frequently,
during each block computation.
Actually, there's a better alternative:

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#warp-shuffle

"Kepler introduces a new warp-level intrinsic called the shuffle
operation. This feature allows the threads of a warp to exchange data
with each other directly without going through shared (or global)
memory. The shuffle instruction also has lower latency than shared
memory access and does not consume shared memory space for data
exchange, so this can present an attractive way for applications to
rapidly interchange data among threads."

Sounds like precisely what we need for the original Argon2.

We need to find out if it's available in OpenCL.

Also need to find out whether there's an equivalent for AMD GCN.
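
If it's not exposed as an intrinsic, inline PTX asm may be a way in
(it does appear to be usable from OpenCL on NVIDIA, per the md5crypt
experiment mentioned later in this thread). A hypothetical full-warp
shuffle for Kepler/Maxwell:

/* Hypothetical: fetch a 32-bit value from lane src of the current
 * warp via the Kepler shfl instruction; 0x1f clamps the source lane
 * index to the full 32-lane warp. */
inline uint warp_shfl(uint x, uint src)
{
    uint y;
    asm("shfl.idx.b32 %0, %1, %2, 0x1f;"
        : "=r"(y) : "r"(x), "r"(src));
    return y;
}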

Alexander
Solar Designer
2015-08-23 05:53:05 UTC
Permalink
Agnieszka,

There might also be room for improvement of Argon2 performance on GPUs
through special handling of BLAKE2b's 64-bit operations. See:

http://hashcat.net/forum/archive/index.php?thread-3422.html

"All the 64-bit based algorithms like SHA512, Keccak etc dropped in
performance with each new driver a little bit. So it was hard to notice.
GPUs instructions operate still on 32-bit only, so the 64-bit mode is
emulated. But the way how it is emulated was somehow broken. I was
able to pinpoint the problem where the biggest drop came from and I
managed to workaround it. For NVidia it took me a little PTX hack, for
AMD luckily there was no binary hack required."

Unfortunately, atom doesn't go into further detail there (but we could
try asking him). I guess the approach amounts to explicitly building
64-bit addition out of 32-bit additions. Maybe having it split like
that right away (rather than only in the PTX or IL to ISA translation)
is somehow friendlier to current compilers.

I guess this is part of why oclHashcat is faster than JtR at SHA-512
based hashes (per further announcements, oclHashcat's performance at
those has been improved way further since that old forum posting above).

In a vectorized kernel, we'd switch from ulong2 to uint4.

Alexander
Solar Designer
2015-08-23 07:21:54 UTC
Permalink
Post by Solar Designer
There might also be room for improvement of Argon2 performance on GPUs
http://hashcat.net/forum/archive/index.php?thread-3422.html
"All the 64-bit based algorithms like SHA512, Keccak etc dropped in
performance with each new driver a little bit. So it was hard to notice.
GPUs instructions operate still on 32-bit only, so the 64-bit mode is
emulated. But the way how it is emulated was somehow broken. I was
able to pinpoint the problem where the biggest drop came from and I
managed to workaround it. For NVidia it took me a little PTX hack, for
AMD luckily there was no binary hack required."
Unfortunately, atom doesn't go into further detail there (but we could
try asking him). I guess the approach amounts to explicitly building
64-bit addition out of 32-bit additions. Maybe having it split like
that right away (rather than only in the PTX or IL to ISA translation)
is somehow friendlier to current compilers.
In PTX, we appear to be getting add.s64 now. I guess it'd be more
optimal to get add.cc.u32 followed by addc.u32:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#extended-precision-integer-arithmetic-instructions

In source code, it should be something like:

dst_lo = src1_lo + src2_lo;
dst_hi = src1_hi + src2_hi + (dst_lo < src1_lo);

(where in our case these would likely be 32-bit elements of uint4), but
it'll probably take effort to reach the desired PTX code.
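
One OpenCL wrinkle for the vectorized variant: relational operators on
vector types return -1 (all bits set) for true, not 1, so the carry
has to be subtracted (or bit-reinterpreted) rather than added. A
sketch, with a hypothetical type for the split halves:

/* Hypothetical: four 64-bit additions on split 32-bit halves.
 * (r.lo < a.lo) is 0 or -1 per element; as_uint4() reinterprets -1
 * as 0xffffffff, and subtracting that adds the carry of 1. */
typedef struct { uint4 lo, hi; } u64x4;

inline u64x4 add64x4(u64x4 a, u64x4 b)
{
    u64x4 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi - as_uint4(r.lo < a.lo);
    return r;
}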

For bit rotates, we appear to be getting things like this:

{
.reg .b64 %lhs;
.reg .b64 %rhs;
shl.b64 %lhs, %rd12449, 1;
shr.b64 %rhs, %rd12449, 63;
add.u64 %rd12450, %lhs, %rhs;
}

This probably translates to at least 6 native instructions. There ought
to be more efficient ways, such as using the bfe and/or bfi instructions:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe

For 32-bit rotates, there also was the mad trick:

http://www.openwall.com/lists/john-dev/2012/03/22/7

For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days. Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does. We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.

Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts. We should change this to use the corresponding positive
shift counts instead.

opencl_blake2.h uses the "__constant uchar blake2b_sigma" array to
simplify the source code. Unfortunately, this actually gets into the
compiled code:

ld.const.u8 %r13, [blake2b_sigma+1];
mul.wide.u32 %rd84, %r13, 8;
add.s64 %rd85, %rd2021, %rd84;
ld.local.u64 %rd86, [%rd85];

We should optimize this in the source using cpp macros, or alternatively
those non-performance-critical uses of BLAKE2 may be kept on the host.
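
A sketch of the cpp macro route (macro names hypothetical; assumes
rotr64() and the m[]/v[] arrays as in opencl_blake2.h): bake the sigma
permutation into each round as literal message indices, so no table
load is emitted. Shown for round 0, where sigma is the identity:

/* G with compile-time message word indices x and y. */
#define G(a, b, c, d, x, y) do { \
    a += b + m[x]; d = rotr64(d ^ a, 32); \
    c += d;        b = rotr64(b ^ c, 24); \
    a += b + m[y]; d = rotr64(d ^ a, 16); \
    c += d;        b = rotr64(b ^ c, 63); \
} while (0)

/* Round 0 of BLAKE2b: blake2b_sigma[0] = 0..15 baked in as literals. */
#define BLAKE2B_ROUND_0() do { \
    G(v[0], v[4], v[ 8], v[12],  0,  1); \
    G(v[1], v[5], v[ 9], v[13],  2,  3); \
    G(v[2], v[6], v[10], v[14],  4,  5); \
    G(v[3], v[7], v[11], v[15],  6,  7); \
    G(v[0], v[5], v[10], v[15],  8,  9); \
    G(v[1], v[6], v[11], v[12], 10, 11); \
    G(v[2], v[7], v[ 8], v[13], 12, 13); \
    G(v[3], v[4], v[ 9], v[14], 14, 15); \
} while (0)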

Alexander
Solar Designer
2015-08-23 07:39:00 UTC
Permalink
Post by Solar Designer
{
.reg .b64 %lhs;
.reg .b64 %rhs;
shl.b64 %lhs, %rd12449, 1;
shr.b64 %rhs, %rd12449, 63;
add.u64 %rd12450, %lhs, %rhs;
}
This probably translates to at least 6 native instructions. There ought
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe
Actually, what we need for newer NVIDIAs (ours are just new enough) is
funnel shift:

http://stackoverflow.com/questions/12767113/funnel-shift-what-is-it
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf

AMD had the same thing under the name of bitalign for ages, and we can
simply use rotate() there:

https://community.amd.com/thread/158497
Post by Solar Designer
For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days. Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does. We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.
Actually, now that I specifically grep the Argon2 PTX code for shf, I
see that rotate() gives it to us:

shf.r.wrap.b32 %r287, %r286, %r285, 24;
shf.r.wrap.b32 %r288, %r285, %r286, 24;

and so on for other BLAKE2 rotate counts.

So those shl/shr/add sequences must have come from opencl_blake2.h:
rotr64(). This is what I mean by "make reviewing the generated PTX code
easier" - if we used rotate() everywhere, I wouldn't be misled into
thinking that unoptimal code was generated.

Alexander
Solar Designer
2015-08-23 08:02:24 UTC
Permalink
Post by Solar Designer
For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days. Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does. We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.
The body of rotr64() should be:

return rotate(w, (ulong)(64 - c));

(or we can turn it into a macro, not to rely on the inlining).
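
In macro form, keeping the same rotate() body, that would be:

#define rotr64(w, c) rotate(w, (ulong)(64 - (c)))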

Unfortunately, when we're dealing with 64-bit types, the generated PTX
code includes extra mov's:

{
.reg .b32 %dummy;
mov.b64 {%r15,%dummy}, %rd82;
}
{
.reg .b32 %dummy;
mov.b64 {%dummy,%r16}, %rd82;
}
shf.r.wrap.b32 %r17, %r16, %r15, 24;
shf.r.wrap.b32 %r18, %r15, %r16, 24;

These are simply to extract the 32-bit halves as needed for the shf
instructions. The mov's should be gone and proper registers
substituted right into the shf instructions in the final ISA code -
however, I am not sure this is what is actually happening (depends on
how good the translator from PTX to native ISA is).

I think this also serves to illustrate why working with 32-bit values or
vector elements at OpenCL source level is a safer bet... although then
we'd need to find and use the right intrinsics for funnel shift in
OpenCL. AMD has it as amd_bitalign(), but I don't know if NVIDIA has an
equivalent now (maybe the same funnel shift intrinsics names as they use
in CUDA?)
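
A hypothetical sketch of such a rotate on split 32-bit halves via
inline PTX (which does work from OpenCL on NVIDIA, per the md5crypt
experiment mentioned later in this thread); on AMD, amd_bitalign()
would play the same role:

/* Hypothetical: rotate a 64-bit value held as uint2 (x.x = low half,
 * x.y = high half) right by n, for 0 < n < 32; other counts need the
 * operand order swapped. Matches the shf.r.wrap.b32 pairs seen in
 * the generated PTX above. */
inline uint2 rotr64_halves(uint2 x, uint n)
{
    uint2 r;
    asm("shf.r.wrap.b32 %0, %2, %3, %4;\n\t"
        "shf.r.wrap.b32 %1, %3, %2, %4;"
        : "=r"(r.x), "=r"(r.y)
        : "r"(x.x), "r"(x.y), "r"(n));
    return r;
}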

Alexander
Solar Designer
2015-08-23 08:40:35 UTC
Permalink
Post by Solar Designer
Unfortunately, when we're dealing with 64-bit types, the generated PTX
{
.reg .b32 %dummy;
mov.b64 {%r15,%dummy}, %rd82;
}
{
.reg .b32 %dummy;
mov.b64 {%dummy,%r16}, %rd82;
}
shf.r.wrap.b32 %r17, %r16, %r15, 24;
shf.r.wrap.b32 %r18, %r15, %r16, 24;
These are simply to extract the 32-bit halves as needed for the shf
instructions. The mov's should be gone and proper registers
substituted right into the shf instructions in the final ISA code -
however, I am not sure this is what is actually happening (depends on
how good the translator from PTX to native ISA is).
I passed the PTX code through "ptxas --gpu-name sm_35" and nvdisasm, and
it looks OK in this respect:

/*28d0*/ LOP.XOR R26, R248, R12;
/*28d8*/ LOP.XOR R36, R32, R13;
/*28e0*/ LOP.XOR R37, R25, R31;
/*28e8*/ IADD.X R239, R24, R27;
/*28f0*/ LDL.64 R24, [R141+0x58];
/*28f8*/ IADD R35.CC, R232, c[0x3][0x0];
/*2908*/ SHF.R.W R242, R26, 0x18, R36;
/*2910*/ LOP.XOR R7, R7, c[0x3][0x34];
/*2918*/ SHF.R.W R39, R36, 0x18, R26;

Here we can see R26 and R36 come directly from LOP.XOR, without MOV.

Also interesting is LDL.64. I guess it loads two adjacent registers
(R24 and R25 in this example), which under the SIMT model are 32-bit
elements in two different hardware SIMD registers.

Alexander
Solar Designer
2015-08-24 10:05:04 UTC
Permalink
Post by Solar Designer
I think this also serves to illustrate why working with 32-bit values or
vector elements at OpenCL source level is a safer bet... although then
we'd need to find and use the right intrinsics for funnel shift in
OpenCL. AMD has it as amd_bitalign(), but I don't know if NVIDIA has an
equivalent now (maybe the same funnel shift intrinsics names as they use
in CUDA?)
The CUDA intrinsics don't appear to exist in OpenCL, not even after
#include'ing the corresponding CUDA header file (it got parsed as OpenCL
fine, but didn't result in the intrinsics becoming available).

However, inline PTX asm is available in OpenCL, and this is how I made
use of the funnel shifter in the patch for md5crypt-opencl that I've
just posted. (The funnel shifter was already in use by rotate(), but in
that patch I also used it to implement md5crypt's unaligned writes.)

Normally, tiny pieces of inline asm hurt the compiler's instruction
scheduling and thus are rarely a good idea, but in NVIDIA's case there's
hopefully sufficient rescheduling in the PTX to native ISA translation.

Alexander
Agnieszka Bielec
2015-08-24 22:24:49 UTC
Permalink
Post by Solar Designer
Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts. We should change this to use the corresponding positive
shift counts instead.
why? I just changed -63 to 1 in one case but the rest have values
bigger than -32
Solar Designer
2015-08-24 22:37:40 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts. We should change this to use the corresponding positive
shift counts instead.
why? I just changed -63 to 1 in one case but the rest have values
bigger than -32
Is it stated anywhere that negative and/or over-large shift counts have
a defined behavior? In C, that's UB. Even if for OpenCL's rotate()
this is defined in a specific way(*), it's better not to get into that
bad habit (it will then bite you in C or the like).

(*) For amd_bitalign(), it is in fact said that the shift count is taken
"& 31" (which I found inconvenient since 32 would have been useful as a
way to fully shift in the high 32 bits when using a runtime computed
shift count). However, I didn't see anything like that for rotate().
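
(For example, rotate(x, (ulong)40) expresses the same right-rotate by
24 as rotate(x, (ulong)-24), without relying on negative-count
behavior.)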

Alexander
Solar Designer
2015-08-23 06:15:11 UTC
Permalink
Agnieszka,

Your current Argon2 kernels use global and private memory only. They
don't use local memory.

While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful. By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.

You should try moving some of the arrays from private to local memory.

Here's a related finding:

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#shared-memory-bandwidth

"[...] shared memory bandwidth in SMX is twice that of Fermi's SM. This
bandwidth increase is exposed to the application through a configurable
new 8-byte shared memory bank mode. When this mode is enabled, 64-bit
(8-byte) shared memory accesses (such as loading a double-precision
floating point number from shared memory) achieve twice the effective
bandwidth of 32-bit (4-byte) accesses. Applications that are sensitive
to shared memory bandwidth can benefit from enabling this mode as long
as their kernels' accesses to shared memory are for 8-byte entities
wherever possible."

Argon2's accesses are wider than and are a multiple of 8 bytes, so I
think we need to enable this mode. Please try to find out how to enable
it, and whether it possibly gets enabled automatically e.g. when the
kernel uses specific data types.
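
(For reference: CUDA exposes this as
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte). Whether
NVIDIA's OpenCL exposes an equivalent, or infers it from the access
widths, is exactly what needs finding out.)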

I think it could benefit many more of our kernels. So this is important
to figure out and learn to use regardless of Argon2.

Yet another relevant finding is that, per the tuning guides, Kepler and
Maxwell do not use L1 caches for global memory (they only use L2), but
there's a compiler option to change this behavior (enable use of both L1
and L2 caches for global memory). We could give this a try (if we find
how to do this for OpenCL) and see if it improves or hurts performance,
especially if we end up not using local memory anyway (for whatever
reason) and have no or few register spills (where L1 cache for local
memory could have been helpful). I don't expect this to be of much
help, though - most likely the default is actually optimal for us,
unless we don't use local memory at all (not even implicitly via spills).
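
(On the CUDA side, that compiler option is -Xptxas -dlcm=ca; I am not
aware of a documented OpenCL equivalent, so this may take
experimenting with the options we pass to NVIDIA's OpenCL compiler.)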

Alexander
Agnieszka Bielec
2015-08-23 23:52:35 UTC
Permalink
Post by Solar Designer
While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful. By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.
why would we pack more Argon2 instances per GPU by using both types of memory?
I'm using only very small portions of private memory.

BTW, in my vectorized kernels the shuffling between the two groups of
Argon2 rounds takes a very long time, so I grouped kernel instances by
4 and interleaved their data through local memory, which avoids the
shuffling. But on my laptop I only gain 3k c/s at LWS=8, so no speedup
(4k is in the bleeding-jumbo branch). I think this is not what you
meant here, though.
I uploaded this to branch interleaving4 (argon2d only)
I updated the vector8 branch and created vector16 some time ago
Solar Designer
2015-08-24 02:28:21 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful. By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.
why would we pack more Argon2 instances per GPU by using both types of memory?
I'm using only very small portions of private memory.
You're using several kilobytes per instance - that's not very small.

If not this, then what is limiting the number of concurrent instances
when we're not yet bumping into total global memory size? For some of
the currently optimal LWS/GWS settings, we're nearly bumping into the
global memory size, but for some (across the different GPUs, as well as
2i vs. 2d) we are not. And even when we are, maybe a higher LWS would
improve performance when we can afford it.
Post by Agnieszka Bielec
BTW, in my vectorized kernels the shuffling between the two groups of
Argon2 rounds takes a very long time, so I grouped kernel instances by
4 and interleaved their data through local memory, which avoids the
shuffling. But on my laptop I only gain 3k c/s at LWS=8, so no speedup
(4k is in the bleeding-jumbo branch). I think this is not what you
meant here, though.
I uploaded this to branch interleaving4 (argon2d only)
I updated the vector8 branch and created vector16 some time ago
I took a look at the interleaving4 branch, commit
11f932d9642604ca807e336fe286329651a87c49 (comment interleaving4).
No, that's not what I meant. It's a weird mix of ulong8 and splitting
work across different work-items, whereas the latter should be attempted
as an alternative to the former.

Alexander
Agnieszka Bielec
2015-08-24 13:42:30 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful. By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.
why would we pack more Argon2 instances per GPU by using both types of memory?
I'm using only very small portions of private memory.
You're using several kilobytes per instance - that's not very small.
If not this, then what is limiting the number of concurrent instances
when we're not yet bumping into total global memory size? For some of
the currently optimal LWS/GWS settings, we're nearly bumping into the
global memory size, but for some (across the different GPUs, as well as
2i vs. 2d) we are not. And even when we are, maybe a higher LWS would
improve performance when we can afford it.
the second option is that we have reached the point where, after
increasing the GWS, we can't get any more global memory bandwidth and
most work-items are waiting for memory.
argon2i is coalesced and it can run with a higher GWS than argon2d.
Solar Designer
2015-08-29 04:47:07 UTC
Permalink
Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
Post by Solar Designer
While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful. By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.
why would we pack more Argon2 instances per GPU by using both types of memory?
I'm using only very small portions of private memory.
You're using several kilobytes per instance - that's not very small.
If not this, then what is limiting the number of concurrent instances
when we're not yet bumping into total global memory size? For some of
the currently optimal LWS/GWS settings, we're nearly bumping into the
global memory size, but for some (across the different GPUs, as well as
2i vs. 2d) we are not. And even when we are, maybe a higher LWS would
improve performance when we can afford it.
the second option is that we have reached the point where, after
increasing the GWS, we can't get any more global memory bandwidth and
most work-items are waiting for memory.
argon2i is coalesced and it can run with a higher GWS than argon2d.
If we calculate our global memory bandwidth usage from the observed c/s
figures, it's easy to see it's still several times lower than what these
GPUs have available - and as you point out, with 2i's coalescing we
should actually be able to use the full bandwidth. (With 2d, we might
not be because we need accesses narrower than what's optimal for the
hardware.)

However, you're right - we might be bumping into the memory bus in the
current code anyway. Maybe the coalescing isn't good enough. Or maybe
we get too many spills to global memory - we could review the generated
code to see where they are and where they go to, or just get rid of them
without such analysis if we can.

I've just briefly tried running your 2i and 2d kernels from your main
branch (not the vector8 stuff) on Titan X - and the results are
disappointing. Performance is similar to what we saw on the old Titan,
whereas the expectation was it'd be a multiple of what we saw on your
960M. Can you please experiment with this too, and try to use LWS and
GWS settings directly scaled from those that you find performing well on
your 960M (perhaps it means same LWS, but ~4.8x larger GWS)? In your
most recent full set of benchmark results, you didn't include the
auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
using in the 960M benchmarks.

Like I said, my initial results are not good, and I did try a few LWS
and GWS combinations (up to using nearly the full 12 GB memory even).
So I don't expect you would succeed either, but I'd like us to have a
direct comparison of 960M vs. Titan X anyway, so that we can try to
figure out what the bottleneck in scaling Argon2 between these two GPUs
is. And the next task might be to deal with the register spilling.

If things just don't fit into private memory, then we might prefer to
explicitly move some into local or/and global than leave this up to the
compiler and keep guessing what's going on. For a start, we need to
achieve the same performance as we do now, but without spills and with
explicit use of other memory types. And after that point, we could
proceed to optimize our use of the different memory types.

Titan X is -dev=4 on super now, although I expect this device number
will change once we re-introduce the HD 7990 card.

Thanks,

Alexander
Solar Designer
2015-08-29 05:29:53 UTC
Permalink
Agnieszka,
Post by Solar Designer
However, you're right - we might be bumping into the memory bus in the
current code anyway. Maybe the coalescing isn't good enough. Or maybe
we get too many spills to global memory - we could review the generated
code to see where they are and where they go to, or just get rid of them
without such analysis if we can.
Another related problem, and something to correct as well, is that this
kernel is huge and we're getting too many L1i cache misses. The kernel
as a whole is ~100k PTX instructions, which (depending on target ISA)
may be ~800 KB. This obviously exceeds the caches (even L2), by far,
however it is unclear how large our most performance critical loop is.
You could identify that loop's code size (including any functions it
calls, if not inlined), and/or try to reduce it (e.g., cut down on the
unrolling and inlining overall, or do it selectively).

In fact, even if the most performance critical loop fits in cache, or if
we make it fit eventually, the size of the full kernel also matters.

For comparison, the size of our md5crypt kernel is under 8k PTX
instructions total, and even at that size inlining of md5_digest() or
partially unrolling the main 1000 iterations loop isn't always optimal.
In my recent experiments, I ended up not inlining md5_digest(), but
unrolling the loop 2x on AMD and 4x on NVIDIA. Greater unrolling slowed
things down on our HD 7990's GPUs, so large kernel size might be a
reason why your Argon2 kernels perform worse on the AMD GPUs.

Alexander
Solar Designer
2015-08-29 06:48:49 UTC
Permalink
Agnieszka,
Post by Solar Designer
You could identify that loop's code size (including any functions it
calls, if not inlined), and/or try to reduce it (e.g., cut down on the
unrolling and inlining overall, or do it selectively).
In fact, even if the most performance critical loop fits in cache, or if
we make it fit eventually, the size of the full kernel also matters.
For comparison, the size of our md5crypt kernel is under 8k PTX
instructions total, and even at that size inlining of md5_digest() or
partially unrolling the main 1000 iterations loop isn't always optimal.
In my recent experiments, I ended up not inlining md5_digest(), but
unrolling the loop 2x on AMD and 4x on NVIDIA. Greater unrolling slowed
things down on our HD 7990's GPUs, so large kernel size might be a
reason why your Argon2 kernels perform worse on the AMD GPUs.
Per this recent discussion, preventing inlining of functions isn't
supported in AMD OpenCL currently:

https://community.amd.com/thread/170309

So I am puzzled why I appeared to have any performance difference from
including or omitting the "inline" keyword on md5_digest(). I'll need
to re-test this, preferably reviewing the generated code. When
targeting NVIDIA, I am indeed getting the exact same PTX code regardless
of whether I include the inline keyword or not.

"realhet", who commented in that thread, wrote a GCN ISA assembler, so
he would know. It's one of the tools we have listed at:

http://openwall.info/wiki/john/development/GPU-low-level

And it seems I was wrong about the 8k PTX instructions - that might have
been for another kernel or something. Our md5crypt kernel is at around
4k PTX instructions currently.

However, function calls in OpenCL do seem to be supported on NVIDIA, as
seen from reviewing the PTX code for your Argon2 kernels. You don't
have your functions explicitly marked "inline", but most are inlined
anyway - yet a few are not:

$ fgrep .func kernel.out
.func Initialize
.func blake2b_update(
.func blake2b_final(
.func blake2b(
.func Initialize(

$ fgrep -A1 call.uni kernel.out | head -8
call.uni
blake2b_update,
--
call.uni
blake2b_update,
--
call.uni
blake2b_update,

You may want to look into ways to make more of the infrequent function
calls to actually be calls rather than inlining. Ideally, there would
be a keyword to prevent inlining, but I am not aware of one. Maybe
there's a compiler switch, and then explicit "inline" would start to
matter. Please look into this.

As to loop unrolling, there's "#pragma unroll N", and when you specify
N=1 so "#pragma unroll 1" I think it prevents unrolling. As an
experiment, I tried adding "#pragma unroll 1" before all loops in
argon2d_kernel.cl, and the PTX instruction count reduced - but not a
lot. With uses of BLAKE2_ROUND_NO_MSG_V macros also put into loops:

#pragma unroll 1
for (i = 0; i < 64; i += 8) {
    BLAKE2_ROUND_NO_MSG_V(state[i], state[i+1],
                          state[i+2], state[i+3],
                          state[i+4], state[i+5],
                          state[i+6], state[i+7]);
}

#pragma unroll 1
for (i = 0; i < 8; i++) {
    BLAKE2_ROUND_NO_MSG_V(state[i], state[i+8],
                          state[i+16], state[i+24],
                          state[i+32], state[i+40],
                          state[i+48], state[i+56]);
}

I got the PTX instruction count down from ~100k to ~80k. No speedup,
though. (But not much slowdown either.)

We need to figure out why it doesn't get lower. ~80k is still a lot.
Are there many inlined functions and unrolled loops in the .h files?

Maybe some pre- and/or post-processing should be kept on host to make
the kernel simpler and smaller. This is bad in terms of Amdahl's law,
but it might help us figure things out initially.

BTW, it would be helpful to have some Perl scripts or such to analyze
the PTX code. Even counting the instructions is a bit tricky since many
of the lines are not instructions. "sort -u ... | wc -l" gives an
estimate (and this is what I have been using) due to a new virtual
register number being allocated each time (so even if the same
instruction is used multiple times, it appears as different - and that's
as we want it for counting).

Alexander
Agnieszka Bielec
2015-08-29 23:44:32 UTC
Permalink
Post by Solar Designer
As to loop unrolling, there's "#pragma unroll N", and when you specify
N=1 so "#pragma unroll 1" I think it prevents unrolling. As an
experiment, I tried adding "#pragma unroll 1" before all loops in
argon2d_kernel.cl, and the PTX instruction count reduced - but not a
lot.
Can I get this code?
Post by Solar Designer
We need to figure out why it doesn't get lower. ~80k is still a lot.
Are there many inlined functions and unrolled loops in the .h files?
there are also blake2 files
Post by Solar Designer
Maybe some pre- and/or post-processing should be kept on host to make
the kernel simpler and smaller. This is bad in terms of Amdahl's law,
but it might help us figure things out initially.
I will think about it and split the kernels; even small POMELO was slightly faster that way
Agnieszka Bielec
2015-08-29 23:31:42 UTC
Permalink
Post by Solar Designer
I've just briefly tried running your 2i and 2d kernels from your main
branch (not the vector8 stuff) on Titan X - and the results are
disappointing. Performance is similar to what we saw on the old Titan,
whereas the expectation was it'd be a multiple of what we saw on your
960M.
I ran some tests on the Titan X, and updated the TITAN speeds from
your previous mails; the argon2i speed is better but argon2d is
slightly worse

argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 4292
GeForce GTX TITAN X - 6113
memory: 1.5 MB

argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6215
GeForce GTX TITAN X - 6525
memory: 1.5 MB


[***@super run]$ LWS=128 GWS=4096 ./john --test --format=argon2i-opencl
--dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 128, global worksize (GWS) 4096
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6068 c/s real, 6068 c/s virtual
Only one salt: 6113 c/s real, 6068 c/s virtual


[***@super run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6525 c/s real, 6525 c/s virtual
Only one salt: 6525 c/s real, 6525 c/s virtual

but there is again a difference between the results when GWS is set
and when it is not set (I tested several times):

[***@super run]$ ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 2559 c/s 2559 rounds/s 100.027ms per crypt_all()!
gws: 512 3131 c/s 3131 rounds/s 163.523ms per crypt_all()+
gws: 1024 4574 c/s 4574 rounds/s 223.859ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4740 c/s real, 4740 c/s virtual
Only one salt: 4740 c/s real, 4740 c/s virtual

[***@super run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4697 c/s real, 4697 c/s virtual
Only one salt: 4697 c/s real, 4654 c/s virtual
Post by Solar Designer
Can you please experiment with this too, and try to use LWS and
GWS settings directly scaled from those that you find performing well on
your 960M (perhaps it means same LWS, but ~4.8x larger GWS)? In your
most recent full set of benchmark results, you didn't include the
auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
using in the 960M benchmarks.
I was putting LWS and GWS before ./john
_______
TITAN X

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --dev=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2625 c/s real, 2603 c/s virtual
Only one salt: 2648 c/s real, 2648 c/s virtual

[***@super run]$ GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3479 c/s real, 3479 c/s virtual
Only one salt: 3513 c/s real, 3513 c/s virtual

[***@super run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6525 c/s real, 6525 c/s virtual
Only one salt: 6525 c/s real, 6525 c/s virtual

__________________________
960M

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1878 c/s real, 1861 c/s virtual
Only one salt: 1861 c/s real, 1861 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual

***@none ~/Desktop/r/run $ LWS=32 GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4266 c/s real, 4227 c/s virtual
Only one salt: 4227 c/s real, 4266 c/s virtual
Post by Solar Designer
Like I said, my initial results are not good, and I did try a few LWS
and GWS combinations (up to using nearly the full 12 GB memory even).
So I don't expect you would succeed either, but I'd like us to have a
direct comparison of 960M vs. Titan X anyway, so that we can try to
figure out what the bottleneck in scaling Argon2 between these two GPUs
is. And the next task might be to deal with the register spilling.
there is a tool, NVIDIA Visual Profiler, but unfortunately it doesn't
work on my laptop or on super. nvprof is the same tool but on the
command line

[***@super run]$ nvprof ./john --test --format=argon2i-opencl --dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==25405== NVPROF is profiling process 25405, command: ./john --test
--format=argon2i-opencl --dev=4 --v=4
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 1053 c/s 1053 rounds/s 242.977ms per crypt_all()!
gws: 512 2292 c/s 2292 rounds/s 223.306ms per crypt_all()!
gws: 1024 2643 c/s 2643 rounds/s 387.368ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2648 c/s real, 2648 c/s virtual
Only one salt: 2625 c/s real, 2625 c/s virtual

==25405== Profiling application: ./john --test --format=argon2i-opencl
--dev=4 --v=4
==25405== Profiling result:
No kernels were profiled.

==25405== API calls:
No API activities were profiled.

***@none ~/Desktop/morecopy/run $ nvprof ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==6809== NVPROF is profiling process 6809, command: ./john --test
--format=argon2d-opencl
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual

==6809== Profiling application: ./john --test --format=argon2d-opencl
==6809== Warning: make sure cudaDeviceReset() is called before
application exit to flush profile data.
======== Error: CUDA profiling error.
Post by Solar Designer
If things just don't fit into private memory, then we might prefer to
explicitly move some into local or/and global than leave this up to the
compiler and keep guessing what's going on. For a start, we need to
achieve the same performance as we do now, but without spills and with
explicit use of other memory types. And after that point, we could
proceed to optimize our use of the different memory types.
Agnieszka Bielec
2015-08-29 23:33:08 UTC
Permalink
Post by Agnieszka Bielec
I ran some tests on the Titan X, and updated the TITAN speeds from
your previous mails; the argon2i speed is better but argon2d is
slightly worse
sorry, it's slightly better
Solar Designer
2015-08-23 08:48:25 UTC
Permalink
Post by Solar Designer
Agnieszka,
As it has just been mentioned on the PHC list, you need to try
exploiting the parallelism inside ComputeBlock. There are two groups of
8 BLAKE2 rounds. In each of the groups, the 8 rounds may be computed in
parallel. When your kernel is working on ulong2, I think it won't fully
exploit this parallelism, except that the parallelism may allow for
better pipelining within those ulong2 lanes (not stalling further
instructions since their input data is separate and thus is readily
available).
I think you may try working on ulong16 or ulong8 instead. I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item. So please try both and see which works best.
With this, you'd launch groups of 8 or 4 BLAKE2 rounds on those wider
vectors, and then between the two groups of 8 in ComputeBlock you'd need
to shuffle vector elements (moving them between two vectors of ulong8 if
you use that type) instead of shuffling state[] elements like you do now
(and like the original Argon2 code did).
The expectation is that a single kernel invocation will then make use of
more SIMD width (2x512- or 512-bit instead of the current 128-bit), yet
only the same amount of local and private memory as it does now. So
you'd pack as many of these kernels per GPU as you do now, but they will
run faster (up to 8x faster) since they'd process 8 or 4 BLAKE2 rounds
in parallel rather than sequentially.
I was totally wrong and naive in hoping that use of ulong2 (or wider)
would somehow give us a corresponding portion of the GPU hardware SIMD
vectors. There are simply no such instructions. We're instead given
32-bit elements in different registers.

I think use of vectorized kernels like that works like I had expected
when targeting CPUs with SIMD, but not when targeting GPUs.

So our only hope to exploit Argon2's ComputeBlock parallelism on GPUs is
through playing by the SIMT rules.

Alexander
Agnieszka Bielec
2015-08-19 17:12:55 UTC
Permalink
Post by Solar Designer
(just to illustrate the problem of slow integer division on GPUs).
Before you spend a lot of time on this, I suggest that you replace this
modulo operation with something simpler (and wrong), yet in some ways
static inline uint32_t wrap(uint64_t x, uint32_t n)
{
    uint64_t a = (x + n) & (n - 1);
    uint64_t b = x & n;
    uint64_t c = (x << 1) & n;
    return ((a << 1) + b + c) >> 2;
}
(and its OpenCL equivalent, with proper data types). Of course, this
revision of Argon2 won't match Argon2's normal test vectors, but you
should be able to see roughly what performance you could get if you
later optimize the division.
it's slower with wrap() instead of %.
I just changed x % y to number 5 and I gained speed only on my 960M,
from 1861 to 1878 (argon2i). I will check % again after further
optimizations
Solar Designer
2015-08-19 17:27:03 UTC
Permalink
Post by Agnieszka Bielec
it's slower with wrap() instead of %.
I just changed x % y to number 5 and I gained speed only on my 960M,
from 1861 to 1878 (argon2i). I will check % again after further
optimizations
I expected this to make more of a difference, but that's OK. As usual,
reviewing the generated GPU ISA code would be helpful to figure out
what's going on.

BTW, with "changed x % y to number 5" you probably had it process
uninitialized data until 5 blocks were written.

Alexander
Agnieszka Bielec
2015-08-19 17:39:24 UTC
Permalink
Post by Agnieszka Bielec
Post by Solar Designer
(just to illustrate the problem of slow integer division on GPUs).
Before you spend a lot of time on this, I suggest that you replace this
modulo operation with something simpler (and wrong), yet in some ways
static inline uint32_t wrap(uint64_t x, uint32_t n)
{
    uint64_t a = (x + n) & (n - 1);
    uint64_t b = x & n;
    uint64_t c = (x << 1) & n;
    return ((a << 1) + b + c) >> 2;
}
(and its OpenCL equivalent, with proper data types). Of course, this
revision of Argon2 won't match Argon2's normal test vectors, but you
should be able to see roughly what performance you could get if you
later optimize the division.
it's slower with wrap() instead of %.
I just changed x % y to number 5 and I gained speed only on my 960M,
from 1861 to 1878 (argon2i). I will check % again after further
optimizations
I also checked argon2d just in case, and I get more of a speedup there

normal code
***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual


with 5
***@none ~/Desktop/r/run $ GWS=512 ./john --test
--format=argon2d-opencl --skip-self-test
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4055 c/s real, 4055 c/s virtual
Only one salt: 4114 c/s real, 4151 c/s virtual


with wrap()
***@none ~/Desktop/r/run $ GWS=512 ./john --test
--format=argon2d-opencl --skip-self-test
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3976 c/s virtual
Only one salt: 4015 c/s real, 4015 c/s virtual

maybe it's just a coincidence, as usual
Agnieszka Bielec
2015-08-19 17:41:02 UTC
Permalink
BTW, I have many spills (this is the 960M):

ptxas info : 0 bytes gmem, 264 bytes cmem[3]
ptxas info : Compiling entry function 'argon2i_crypt_kernel' for 'sm_50'
ptxas info : Function properties for argon2i_crypt_kernel
ptxas . 22592 bytes stack frame, 272 bytes spill stores,
500 bytes spill loads
ptxas info : Used 255 registers, 360 bytes cmem[0], 20 bytes cmem[2]
ptxas info : Function properties for FillSegment
ptxas . 0 bytes stack frame, 17400 bytes spill stores,
19352 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 0 bytes stack frame, 7780 bytes spill stores,
11648 bytes spill loads
ptxas info : Function properties for Initialize
ptxas . 0 bytes stack frame, 208 bytes spill stores, 232
bytes spill loads
ptxas info : Function properties for blake2b
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes
spill loads
ptxas info : Function properties for blake2b_final
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes
spill loads
ptxas info : Function properties for blake2b_update
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes
spill loads
using different password for benchmarking
Solar Designer
2015-08-20 01:53:55 UTC
Permalink
Post by Agnieszka Bielec
ptxas info : Function properties for FillSegment
ptxas . 0 bytes stack frame, 17400 bytes spill stores,
19352 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 0 bytes stack frame, 7780 bytes spill stores,
11648 bytes spill loads
The spills in FillSegment and GenerateAddresses are pretty bad. Where
do they come from, and why so much? In FillSegment you use 1 KB per
work-item for addresses[], in GenerateAddresses you use 2 KB for two
blocks. GenerateAddresses is called from FillSegment, so adds its
private memory needs on top of FillSegment's.

Alexander
Solar Designer
2015-08-20 03:30:10 UTC
Permalink
Post by Solar Designer
Post by Agnieszka Bielec
ptxas info : Function properties for FillSegment
ptxas . 0 bytes stack frame, 17400 bytes spill stores,
19352 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 0 bytes stack frame, 7780 bytes spill stores,
11648 bytes spill loads
The spills in FillSegment and GenerateAddresses are pretty bad. Where
do they come from, and why so much? In FillSegment you use 1 KB per
work-item for addresses[], in GenerateAddresses you use 2 KB for two
blocks. GenerateAddresses is called from FillSegment, so adds its
private memory needs on top of FillSegment's.
There's also 1 KB ref_block[] in ComputeBlock and in ComputeBlock_pgg.

On super's -dev=5, I was getting:

ptxas info : Function properties for FillSegment
ptxas . 8216 bytes stack frame, 9708 bytes spill stores, 7776 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 6104 bytes stack frame, 4056 bytes spill stores, 4124 bytes spill loads

I've optimized this to:

ptxas info : Function properties for FillSegment
ptxas . 4408 bytes stack frame, 5984 bytes spill stores, 4020 bytes
spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 1304 bytes stack frame, 388 bytes spill stores, 400 bytes spill loads

with the attached patch. As it is, it provides no speedup for me (in
fact, there's very slight slowdown), but it should illustrate to you
what to optimize. I expect that once you convert those uint operations
to work on ulong2 all the time, you'll see slight speedup. (The changes
in performance seen from these code changes are relatively minor because
GenerateAddresses corresponds to a relatively small part of the total
running time. There is a significant reduction in global memory usage,
though, as seen via nvidia-smi.)

In fact, those typecasts between ulong2 and uint pointers are probably
disallowed, as they violate strict aliasing rules. Also, your code
heavily depends on the architecture being little-endian (just like
Argon2's original code did, which is a known bug). You should try to
avoid that as you proceed to optimize your OpenCL kernels. You'll find
that avoiding endianness dependencies goes along with avoiding strict
aliasing violations and achieving better speed as well (since the kernel
would use its full allocated SIMD width all the time, rather than only
part of the time).
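
As a sketch of the cast-free direction (helper names hypothetical),
the 32-bit halves can be derived arithmetically instead of through
pointer casts, which is both aliasing-safe and endian-neutral:

/* Hypothetical helpers: no uint* view of a ulong is ever created, so
 * there is no strict-aliasing violation and no byte-order assumption. */
inline uint lo32(ulong w) { return (uint)w; }
inline uint hi32(ulong w) { return (uint)(w >> 32); }
inline ulong cat64(uint lo, uint hi) { return (ulong)lo | ((ulong)hi << 32); }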

BTW, out_tmp[] in Initialize() appears to be twice larger than it needs
to be:

ulong2 out_tmp[BLOCK_SIZE/8];

ulong2 is 16 bytes, but you divide by 8. Or is this on purpose? Why?

Alexander
Agnieszka Bielec
2015-08-20 08:29:06 UTC
Permalink
Post by Solar Designer
BTW, out_tmp[] in Initialize() appears to be twice larger than it needs
ulong2 out_tmp[BLOCK_SIZE/8];
ulong2 is 16 bytes, but you divide by 8. Or is this on purpose? Why?
should be ulong2 out_tmp[BLOCK_SIZE/16]; (I have /16 in argon2d)
Agnieszka Bielec
2015-08-20 09:19:56 UTC
Permalink
Post by Agnieszka Bielec
ptxas info : Function properties for FillSegment
ptxas . 8216 bytes stack frame, 9708 bytes spill stores, 7776 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 6104 bytes stack frame, 4056 bytes spill stores, 4124 bytes spill loads
ptxas info : Function properties for FillSegment
ptxas . 4408 bytes stack frame, 5984 bytes spill stores, 4020 bytes
spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 1304 bytes stack frame, 388 bytes spill stores, 400 bytes spill loads
with the attached patch. As it is, it provides no speedup for me (in
fact, there's very slight slowdown)
thx, it's faster on my laptop
Solar Designer
2015-08-20 20:36:18 UTC
Permalink
Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
with the attached patch. As it is, it provides no speedup for me (in
fact, there's very slight slowdown)
thx, it's faster on my laptop
Great. How much faster?

Alexander
Agnieszka Bielec
2015-08-20 20:47:15 UTC
Permalink
Post by Solar Designer
Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
with the attached patch. As it is, it provides no speedup for me (in
fact, there's very slight slowdown)
thx, it's faster on my laptop
Great. How much faster?
***@none ~/Desktop/p/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1861 c/s real, 1861 c/s virtual
Only one salt: 1861 c/s real, 1878 c/s virtual

***@none ~/Desktop/p/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1950 c/s real, 1950 c/s virtual
Only one salt: 1932 c/s real, 1932 c/s virtual
Solar Designer
2015-08-21 03:44:49 UTC
Permalink
Agnieszka,

Per slide 37 of:

http://humus.name/Articles/Persson_LowlevelShaderOptimization.pdf

integer division isn't that slow on AMD GCN. This says 40 cycles.
Although these slides do call it "horrible", it isn't as bad as I
thought it could be.
Post by Agnieszka Bielec
it's slower with wrap() instead of %.
I just changed x % y to number 5 and I gained speed only on my 960M,
from 1861 to 1878 (argon2i). I will check % again after further
optimizations
It is moderately surprising that changing this to a constant like 5
doesn't speed things up through the improved locality of reference to
global memory resulting from accessing just that one block each time.
Is this because the global memory caches are being constantly thrashed
anyway, even when loading the same blocks over and over, due to the
sheer number of Argon2 instances being computed concurrently? Perhaps.

Alexander
Solar Designer
2015-08-20 01:57:13 UTC
Permalink
Agnieszka,
Post by Agnieszka Bielec
argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6083
--v=4 --dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 32, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6083 c/s real, 6083 c/s virtual
Only one salt: 6083 c/s real, 6083 c/s virtual
You can do slightly better:

[***@super run]$ LWS=11 GWS=1024 ./john -test=10 -form=argon2d-opencl -v=4 -dev=5
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Local worksize (LWS) 11, global worksize (GWS) 1023
using different password for benchmarking
DONE, GPU util:99%
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6227 c/s real, 6227 c/s virtual
Only one salt: 6215 c/s real, 6215 c/s virtual

Alexander
Solar Designer
2015-08-20 02:14:51 UTC
Permalink
Post by Agnieszka Bielec
argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 2805
[...]
Post by Agnieszka Bielec
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 2048
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2805 c/s real, 2805 c/s virtual
Only one salt: 2805 c/s real, 2824 c/s virtual
You can do a lot better here (no code changes, just higher GWS):

[***@super run]$ GWS=3584 ./john --test --format=argon2i-opencl --v=4 --dev=5
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 3584
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4318 c/s real, 4292 c/s virtual
Only one salt: 4292 c/s real, 4292 c/s virtual

[***@super run]$ GWS=3584 ./john --test=10 --format=argon2i-opencl --v=4 --dev=5
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 5: GeForce GTX TITAN
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=65554 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 3584
using different password for benchmarking
DONE, GPU util:99%
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4287 c/s real, 6023 c/s virtual
Only one salt: 4283 c/s real, 7891 c/s virtual

BTW, this uses the card's 6 GB almost fully.

I haven't tried tuning for Tahiti yet, and this may be premature given
that you have lots of optimizations yet to make. But I suspect it, too,
can benefit from adjusted LWS and GWS.

Alexander
Solar Designer
2015-08-14 17:56:25 UTC
Permalink
Post by Agnieszka Bielec
indeed, this is even the same version that I'm using now. I had a
situation that I runned tests, runned again and I noticed that speed
is worse, typed command 'w', super was not idle, there was Kai on
super so I e-mailed to him, he turned off his job but after that speed
was the same (worse) I though that this is only problem with my eyes
or memory but now I see that not, maybe this have nothing common with
Kai, he said that he didn't touched GPU's. Solar, can you restart
super?
Do you still want it rebooted now? I can, but I don't expect this to
change anything.

Please note that I upgraded Catalyst from 15.5 to 15.7 on August 11.

I think your Argon2d benchmark showing ~3k at ~1.5 MB was after that
upgrade, though.

Alexander
Agnieszka Bielec
2015-08-14 15:40:29 UTC
Permalink
Post by Solar Designer
And just as important: figure out why the speeds are so poor, compared
to what you're getting on your laptop, as well as to what you had
reported earlier (IIRC).
FYI, on my laptop the GPU speed was better than the CPU speed before
the optimization that improved the speed on both devices on super
(although on --dev=5 I get an error, but --v=4 shows speeds); after
this optimization I had worse speed on my laptop's GPU (see my latest
weekly report)
Agnieszka Bielec
2015-08-14 15:22:05 UTC
Permalink
Post by Agnieszka Bielec
this may have something to do with MEM_SIZE/4 (I have now removed the /4)
http://www.openwall.com/lists/john-dev/2015/08/06/22
sorry, couldn't find my original e-mail
I removed the /4 because I had problems with it on super on Tahiti

(this is my previous version, from commit
ada77dbb6a8a967271590d7879c39214df92e434, and with other cost settings)

[***@super run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 580 c/s 580 rounds/s 441.045ms per crypt_all()!
gws: 512 1077 c/s 1077 rounds/s 475.273ms per crypt_all()+
gws: 1024 1953 c/s 1953 rounds/s 524.191ms per crypt_all()+
OpenCL error (CL_MEM_OBJECT_ALLOCATION_FAILURE) in file
(opencl_argon2i_fmt_plug.c) at line (524) - (failed in
clEnqueueNDRangeKernel)

[***@super run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 1969 c/s real, 204800 c/s virtual
Only one salt: 1950 c/s real, 204800 c/s virtual

[***@super run]$ GWS=2048 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1000.00 kB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 2048
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1000, cost 3 (l) of 1
Many salts: 3471 c/s real, 409600 c/s virtual
Only one salt: 3471 c/s real, 409600 c/s virtual

I checked whether I had forgotten to deallocate some memory, but I don't think so
Solar Designer
2015-08-14 15:17:35 UTC
Permalink
Post by Agnieszka Bielec
cracking mode on my laptop on argon2d showed that at the beginning
speed is the same to this showed during computing gws, after some time
I am getting speed closest to showed during --test but it's not
exactly the same.
Please use either only wordlist mode, or incremental mode locked to a
fixed length for such tests. When you run with no cracking mode
specified, this eventually proceeds to incremental mode (pass 3/3) with
length switching, and that length switching is rather slow initially (so
may significantly reduce the speeds during the first few minutes).
Anyway, I think your runs are long enough to mostly compensate for this
effect.
Post by Agnieszka Bielec
Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 4114 c/s real, 4077 c/s virtual
Only one salt: 4114 c/s real, 4114 c/s virtual
I don't have big differences with argon2i on my laptop
[...]
Post by Agnieszka Bielec
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts: 390 c/s real, 102400 c/s virtual
Only one salt: 390 c/s real, 102400 c/s virtual
Is super like 10x slower than your laptop at this now? Why is that?
Post by Agnieszka Bielec
so speeds reported by main --test are good
Yes, it seems so, but your benchmarks above also show a GPU on super
performing much slower than in your laptop, and this makes no sense.

Alexander
Alain Espinosa
2015-08-23 14:50:43 UTC
Permalink
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:08/23/2015 1:57 AM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] PHC: Argon2 on GPU

..."Kepler introduces a new warp-level intrinsic called the shuffle
operation. This feature allows the threads of a warp to exchange data
with each other directly without going through shared (or global)
memory..."

...We need to find out if it's available in OpenCL.

They are available in OpenCL 2.0. I think this is only supported in the newest AMD cards.

Regards,
Alain
Alain Espinosa
2015-08-23 15:11:52 UTC
Permalink
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:08/23/2015 3:21 AM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] PHC: Argon2 on GPU

...In source code, it should be something like:
dst_lo = src1_lo + src2_lo;
dst_hi = src1_hi + src2_hi + (dst_lo < src1_lo);

In SHA512 OpenCL code I use:

uint2 x1, x2; // each uint2 holds one 64-bit (ulong) value as two 32-bit halves

uint2 result = as_uint2(as_ulong(x1) + as_ulong(x2)); // to sum x1 and x2

This generates the appropriate 32-bit sums with carry on NVIDIA, AMD and Intel GPUs.

The 64-bit rotation is done manually, not using OpenCL rotate(). amd_bitalign provides a very small speedup, but note that when used with shift counts that are multiples of 8 it generated errors, at least when I tested it, so we need to use amd_bytealign then.

Regards,
Alain