Post by dxforthLocals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:
\ derived from original locals version
: USQRT1 {: u | x0 x1 -- uroot ; positive integer square root :}
u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
REPEAT
x0 ;
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;
\ An attempt to write code for VFX's current limitations: Try to do
\ everything on the data stack with only one stack item at basic block
\ bounaries. Keep u on the return stack, so we only do a read from
\ there. Unfortunately, the data stack depth changes here.
: usqrt9 ( u -- root )
dup >r 2/ begin ( x0 R:u )
r@ over / over + 2/ ( x0 x1 R:u )
2dup > while
nip repeat
r> 2drop ;
\ Maybe we can do better by not changing the stack depth; however, we
\ use 2 stack items in the basic blocks in the loop.
: usqrta ( u -- root )
dup >r 2/ r@ begin ( x0 u R:u )
over / over + 2/ ( x0 x1 R:u )
2dup > while
nip r@ repeat
r> 2drop ;
Let's first look at the inner loops as produced by VFX64:
usqrt1 usqrt4 usqrt9 usqrta
MOV RAX, [RDI+10] MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, RBX
CQO MOV RAX, RBX MOV RAX, RDX CQO
IDIV QWord [RDI+-08] CQO CQO IDIV QWord [RBP]
ADD RAX, [RDI+-08] IDIV RCX IDIV RBX ADD RAX, [RBP]
SAR RAX, # 1 MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1
MOV [RDI+-10], RAX ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP]
MOV RDX, [RDI+-10] SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX
CMP RDX, [RDI+-08] MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E410D
JNL 004E3F00 CMP RDX, RCX MOV [RBP], RBX MOV RDX, [RSP]
MOV RDX, [RDI+-10] LEA RBP, [RBP+-08] MOV RBX, RAX MOV [RBP], RBX
MOV [RDI+-08], RDX MOV [RBP], RBX JLE/N004E4090 MOV RBX, RDX
JMP 004E3ED3 MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E40E3
JNL 004E4028 JMP 004E4064
LEA RSP, [RSP+08]
PUSH RBX
MOV RBX, [RBP]
LEA RBP, [RBP+08]
JMP 004E3FEA
Here the locals code looks quite good, but it has a lot of memory
accesses. 9 and a look indeed better than 4. Let's see how they
perform:
for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
187.7 257.1 191.3 179.7 [h0:~/pub/forth/programs:86099] for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
1 4 9 a
187.7 257.1 191.3 179.7 instructions per invocation (average from u=2..1e7)
165.9 141.2 136.3 161.9 cycles per invocation on Rocket Lake
169.5 105.0 156.8 163.0 cycles per invocation on Zen3
The Zen3 result is quite surprising: while usqrt4 executes the most
instructions (as expected), it takes far fewer cycles on Zen3 than all
the other versions.
The Rocket Lake result is not as great for usqrt4, and usqrt9 beats it
by a little.
I currently have no explanation why the cycles results are like this.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net