Discussion:
Integer roots game
minf...@arcor.de
2022-12-06 14:40:34 UTC
Permalink
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)

\ code:
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;

CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt

\ results:
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
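For reference, the same Heron/Newton iteration can be sketched in Python (function name and tuple return are mine, not part of the challenge; it assumes u >= 2, as in the tests above). It reproduces the root/iteration pairs printed above:

```python
# Heron/Newton integer square root, a sketch of the USQRT word above:
# start at u/2 and iterate x1 = (u/x0 + x0)/2 until it stops shrinking.
def usqrt(u):
    x0 = u // 2                   # initial guess: u 2/
    ct = 0                        # iteration counter
    while True:
        x1 = (u // x0 + x0) // 2  # one Heron step
        if x1 >= x0:
            return x0, ct         # converged: x0 is the floor sqrt
        x0, ct = x1, ct + 1

print(usqrt(20))          # (4, 2)   - matches "root:4 iterations: 2"
print(usqrt(200000000))   # (14142, 16)
```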
Hans Bezemer
2022-12-06 15:52:06 UTC
Permalink
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Done:
4 3
14 4
44 6
141 8
447 9
1414 11
4472 13
14142 14

: sqrt-rem ( n -- sqrt rem)
begin \ find a power of 4 greater than TORS
dup 1 > \ compute while greater than unity
while
2/ 2/ swap over over + negate r@ + \ integer divide by 4
dup 0< if drop 2/ else rdrop >r 2/ over + then swap
repeat drop r> ( sqrt rem)
;

BTW - fewer iterations don't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64-bit architecture.
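For comparison, the classical digit-by-digit binary square root with remainder - the division-free technique SQRT-REM is built on - can be sketched in Python. This is the textbook form, not a line-by-line translation of the 4tH code (whose return-stack setup isn't fully shown above):

```python
# Digit-by-digit binary square root with remainder: walk a power of 4
# down from above n, accepting each binary digit of the root in turn.
def sqrt_rem(n):
    # find the largest power of 4 not exceeding n
    bit = 1
    while bit * 4 <= n:
        bit *= 4
    root = 0
    while bit >= 1:
        if n >= root + bit:        # this digit of the root is 1
            n -= root + bit
            root = root // 2 + bit
        else:                      # this digit of the root is 0
            root //= 2
        bit //= 4
    return root, n                 # (sqrt, remainder)

print(sqrt_rem(200000000))   # (14142, 3836)
```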

Hans Bezemer
minf...@arcor.de
2022-12-06 16:28:20 UTC
Permalink
Post by Hans Bezemer
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
4 3
14 4
44 6
141 8
447 9
1414 11
4472 13
14142 14
: sqrt-rem ( n -- sqrt rem)
begin \ find a power of 4 greater than TORS
dup 1 > \ compute while greater than unity
while
dup 0< if drop 2/ else rdrop >r 2/ over + then swap
repeat drop r> ( sqrt rem)
;
BTW - less iterations doesn't always mean "faster". I got another pretty fast one, but it always requires 31 iterations on a 64 bit architecture.
Nice, although it has more iterations and more operations in the main loop -
but that doesn't spoil the cake.
BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE) ?
Hans Bezemer
2022-12-06 16:37:05 UTC
Permalink
Post by ***@arcor.de
BTW what is that non-standard word: EXCEPT (guessing: 0= WHILE) ?
You guessed correctly, Sir. I always hated "0= WHILE" and "0= IF", so I made
myself a few alternatives.

Hans Bezemer
dxforth
2022-12-06 16:53:27 UTC
Permalink
Post by ***@arcor.de
...
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.

: SQRT ( +n -- +root )
0 over 2/ >r
BEGIN
over r@ / r@ + 2/
dup r@ <
WHILE
rdrop >r 1+
REPEAT drop
." root:" r> . ." iterations: " . drop ;
Hans Bezemer
2022-12-06 17:24:39 UTC
Permalink
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)

Hans Bezemer
dxforth
2022-12-07 00:09:21 UTC
Permalink
Post by Hans Bezemer
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
minf...@arcor.de
2022-12-07 00:49:47 UTC
Permalink
Post by dxforth
Post by Hans Bezemer
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
Lol!! Tell that to Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore ... by about 2000 years ... :o)
dxforth
2022-12-07 01:22:10 UTC
Permalink
Post by ***@arcor.de
Post by dxforth
Post by Hans Bezemer
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
Maybe but it was your implementation. So did the non-locals version
prove to be "shorter AND faster"? That was your challenge - no?
minf...@arcor.de
2022-12-07 09:20:44 UTC
Permalink
Post by ***@arcor.de
Post by dxforth
Post by Hans Bezemer
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
Maybe but it was your implementation. So did the non-locals version
prove to be "shorter AND faster"? That was your challenge - no?
Admitted. But I won't rehash that old "shun locals" debate again. If the proposal
just duplicates Heron's logic, it is simply equivalent IMO. And if the speed depends
on the compiler used, it is not really better - lxf even seems capable of generating
the same code for both versions.

OTOH Marcel, Albert and Anton accelerated the simple algorithm by better start values.

Yesterday I found a different algorithm on the net that avoids slow divisions and should
therefore be even faster. The ARM code at the end of the document is a gem:
http://www.azillionmonkeys.com/qed/ulerysqroot.pdf

I didn't do timings, but the other code that I presented in the thread was a linear
translation of the C version to Forth. I checked it with gforth.

P.S. pray forgive that while doing this I did not kick out those nasty locals. ;o)
dxforth
2022-12-07 10:47:28 UTC
Permalink
Post by ***@arcor.de
Post by ***@arcor.de
Post by dxforth
Post by Hans Bezemer
Post by dxforth
Post by ***@arcor.de
Nice, although it has more loops and more operations in the main loop,
This one duplicates your logic.
That one shaves another 8/100 of a second from my version. ;-)
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
Lol!! Tell that Heron of Alexandria, it is his root-finding method.
Maybe but it was your implementation. So did the non-locals version
prove to be "shorter AND faster"? That was your challenge - no?
Admitted that, I won't yawn at that old "shun locals" talk again. If the proposal
just duplicates Heron's logic, it is just equivalent IMO. If speed depends on the
used compiler, it is not better really -
If the improvement is consistent across compilers, how is it not better?
You can say you don't care - but that's a different argument.
Post by ***@arcor.de
lxf seems even capable to generate the same
code for both versions.
According to Jeff Fox, Moore considered optimizing locals a backwards solution
to the problem of the user failing to optimize the stack.

https://groups.google.com/g/comp.lang.forth/c/liQl9h9OzgY/m/PfO98kkYEXYJ
Hans Bezemer
2022-12-07 12:33:00 UTC
Permalink
Post by ***@arcor.de
Yesterday I found a different algorithm in the net that avoids slow divisions and should
http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
Quick hack:

cell-bits 2 / 1- constant (bshift)

: sqrt
>r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
begin
tuck 1 lshift r@ + over lshift ( nh bs t R: v b)
r> over r@ > ( nh bs t b f R: v)
if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
rot 1- rot 1 rshift >r swap r@ 0= ( bs nh R: v b)
until rdrop rdrop nip
;

Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.

Hans Bezemer
minf...@arcor.de
2022-12-07 13:01:04 UTC
Permalink
Post by Hans Bezemer
Post by ***@arcor.de
Yesterday I found a different algorithm in the net that avoids slow divisions and should
http://www.azillionmonkeys.com/qed/ulerysqroot.pdf
cell-bits 2 / 1- constant (bshift)
: sqrt
>r (bshift) 1 over lshift >r 0 ( bs nh R: v b )
begin
if nip rot else swap negate r> + >r rot over + then ( bs b nh R: v)
until rdrop rdrop nip
;
Works on both 32- and 64-bit. Isn't surprisingly fast. As a matter of fact: I've seen implementations of binary algorithms that were faster.
Yes, it is a mixed thing in Forth. Lots of time is spent bringing arguments to the top.
The ultra-compact ARM code used registers.

Readability: algorithm and stack movements are like mashed carrots and potatoes, difficult to separate.
In my Forth

: USQRT32 {: u | T H B:$8000 S:15 == H :}
BEGIN
H 2* B + S lshift := T
T u < IF B += H T -= u THEN
1 -= S B 2/ := B
B ~UNTIL ;

CR 200_000_000 USQRT32 .( 200mil-root: ) .

Yes, this is slower and thus a game loser. ;-)
Hans Bezemer
2022-12-07 13:24:49 UTC
Permalink
: USQRT32 {: u | T H B:$8000 S:15 == H :}
Oops - 32-bit only! It's actually two loops - one to figure out the first candidate for a binary square
and the second one to figure out the square itself. That's where S and B come from - if you offer
MAX-N, the first possible candidate (in 32 bits) is SQRT(MAX-N) ~ 2^15. So S and B are related.

Therefore, you can derive B from S at the start of the program - and maybe one could eliminate
a lot of these vars from the algorithm - with some speed penalty of course. But I haven't delved
that deep (yet).

S seems to be equal to half the number of bits of MAX-N, rounded down. On this basis you can
calculate S for a specific architecture.

BTW, this might be an interesting one for this discussion.
https://sourceforge.net/p/forth-4th/wiki/Inside%20Zen%20FSQRT/

It explains how (on 32 bit) an FSQRT is made by:
1. Cranking up the mantissa to the max;
2. Halving the exponent;
3. Calculating a good approximation now the (integer) mantissa is close to MAX-N.

In 32-bit it required about 3-4 integer iterations and 3-4 floating point iterations - if I remember
correctly.
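A hedged Python sketch of that scheme (the crude initial guess and the fixed iteration count are my simplification, not taken from the Zen Forth source; x is assumed non-negative):

```python
import math

# Zen-FSQRT idea: split the float into mantissa and exponent, halve the
# (even) exponent exactly, and polish sqrt(mantissa) with a few Newton
# iterations on a value conveniently close to 1.
def fsqrt(x):
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    if e % 2:                     # make e even so 2**(e/2) is exact
        m *= 2.0                  # now 0.5 <= m < 2
        e -= 1
    r = (1.0 + m) / 2.0           # crude first guess for sqrt(m)
    for _ in range(4):            # Newton: r <- (r + m/r) / 2
        r = (r + m / r) / 2.0
    return math.ldexp(r, e // 2)  # sqrt(x) = sqrt(m) * 2**(e/2)

print(abs(fsqrt(2.0) - math.sqrt(2.0)) < 1e-12)   # True
```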

HB
Anton Ertl
2022-12-07 11:08:37 UTC
Permalink
Post by ***@arcor.de
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore .. about 2000 years ... :o)
But the question is how Heron described it. Modern descriptions of
course use modern notation and terminology, but the original notation
and terminology are often quite different.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
minf...@arcor.de
2022-12-07 12:07:16 UTC
Permalink
Post by ***@arcor.de
Lol!! Tell that Heron of Alexandria, it is his root-finding method. His biggest drawback
is that he was born before Charles Moore .. about 2000 years ... :o)
But the question is how Heron described it. Modern descriptions of
course use modern notation and terminology, but the original notation
and terminology are often quite different.
To my amateurish knowledge, mathematics at that time (Pythagoreans put aside)
was mostly influenced by physics (i.e. mechanics, astronomy, etc.) and geometry
(i.e. in architecture).

Therefore I guess that Heron's recursion (probably invented much earlier in Egypt)
had been described as an algorithmic recipe using geometric elements and short text
annotations in Greek.

Certainly, among our modern notations, a form using variables comes closer to the original
than stack machine code.
Anton Ertl
2022-12-07 22:06:01 UTC
Permalink
Post by dxforth
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into

http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th

I then created some versions from it that have the stack effect ( +n1
-- n2 ) and that do not count the iterations:

\ derived from original locals version
: USQRT1 {: u | x0 x1 -- uroot ; positive integer square root :}
u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
REPEAT
x0 ;

\ derived from dxforth's locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
dup r@ / r@ + 2/ ( +n x1 R:x0 )
dup r@ <
WHILE
r> drop >r
REPEAT
2drop r> ;

\ An attempt to write code for VFX's current limitations: Try to do
\ everything on the data stack with only one stack item at basic block
\ boundaries. Keep u on the return stack, so we only do a read from
\ there. Unfortunately, the data stack depth changes here.
: usqrt9 ( u -- root )
dup >r 2/ begin ( x0 R:u )
r@ over / over + 2/ ( x0 x1 R:u )
2dup > while
nip repeat
r> 2drop ;

\ Maybe we can do better by not changing the stack depth; however, we
\ use 2 stack items in the basic blocks in the loop.
: usqrta ( u -- root )
dup >r 2/ r@ begin ( x0 u R:u )
over / over + 2/ ( x0 x1 R:u )
2dup > while
nip r@ repeat
r> 2drop ;

Let's first look at the inner loops as produced by VFX64:

usqrt1 usqrt4 usqrt9 usqrta
MOV RAX, [RDI+10] MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, RBX
CQO MOV RAX, RBX MOV RAX, RDX CQO
IDIV QWord [RDI+-08] CQO CQO IDIV QWord [RBP]
ADD RAX, [RDI+-08] IDIV RCX IDIV RBX ADD RAX, [RBP]
SAR RAX, # 1 MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1
MOV [RDI+-10], RAX ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP]
MOV RDX, [RDI+-10] SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX
CMP RDX, [RDI+-08] MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E410D
JNL 004E3F00 CMP RDX, RCX MOV [RBP], RBX MOV RDX, [RSP]
MOV RDX, [RDI+-10] LEA RBP, [RBP+-08] MOV RBX, RAX MOV [RBP], RBX
MOV [RDI+-08], RDX MOV [RBP], RBX JLE/N004E4090 MOV RBX, RDX
JMP 004E3ED3 MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E40E3
JNL 004E4028 JMP 004E4064
LEA RSP, [RSP+08]
PUSH RBX
MOV RBX, [RBP]
LEA RBP, [RBP+08]
JMP 004E3FEA

Here the locals code looks quite good, but it has a lot of memory
accesses. 9 and a indeed look better than 4. Let's see how they
perform:

for i in 1 4 9 a; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
187.7 257.1 191.3 179.7

1 4 9 a
187.7 257.1 191.3 179.7 instructions per invocation (average from u=2..1e7)
165.9 141.2 136.3 161.9 cycles per invocation on Rocket Lake
169.5 105.0 156.8 163.0 cycles per invocation on Zen3

The Zen3 result is quite surprising: while usqrt4 executes the most
instructions (as expected), it takes far fewer cycles on Zen3 than all
the other versions.

The Rocket Lake result is not as great for usqrt4, and usqrt9 beats it
by a little.

I currently have no explanation why the cycles results are like this.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
dxforth
2022-12-08 01:24:48 UTC
Permalink
Post by Anton Ertl
Post by dxforth
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
WHILE
r> drop >r
REPEAT
2drop r> ;
With no count to maintain, it can be simplified to:

: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
SpainHackForth
2022-12-08 10:09:39 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
Post by dxforth
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
I then created some versions from it that have the stack effect ( +n1
...
\ derived from dxforths locals-less version
: USQRT4 ( +n -- +root )
dup 2/ >r
BEGIN ( +n r:x0 )
WHILE
r> drop >r
REPEAT
2drop r> ;
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
I wrote an assembler word for Mecrisp based on this article… I think it was number 4… I tried number 9 and it was not very precise… https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi
Anton Ertl
2022-12-08 18:41:36 UTC
Permalink
Post by Anton Ertl
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
The version above is called USQRTB (or B) below. It inspired me to:

: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).

Here are the numbers, I'll show the code in a later posting:

for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'

1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3

The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.

Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
P Falth
2022-12-08 20:04:25 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;
The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).
for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3
The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.
Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.
But its name is wrong. It is not unsigned - it requires a signed positive number,
as seen also in the original stack comment.

try -1 usqrtc

BR
Peter
Anton Ertl
2022-12-08 22:26:50 UTC
Permalink
Post by P Falth
But its name is wrong. It is not unsigned. It requires a signed positive number
as seen also by the original stack comment .
Yes, that can be fixed with

: usqrtd ( u -- root )
dup 1 rshift dup
begin ( u x0 x1 )
nip ( u x1 )
2dup u/ over + 1 rshift ( u x1 x2 )
2dup u<= until
drop nip ;

but not every Forth system has U/ and U<= (and VFX64 has U/, but does
not inline it).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix
2022-12-08 21:22:28 UTC
Permalink
On Thursday, December 8, 2022 at 7:52:02 PM UTC+1, Anton Ertl wrote:
[..]
FORTH> in ( #100,000,000 iterations )
303700049900000000 7.753 seconds elapsed. minforth
303700049900000000 7.039 seconds elapsed. dxforth #1
303700049900000000 4.424 seconds elapsed. Albert
303700049900000000 0.681 seconds elapsed. Anton #1
303700049900000000 6.229 seconds elapsed. Gerry
303700049900000000 6.540 seconds elapsed. Anton #2
303700049900000000 6.208 seconds elapsed. Anton #3
303700049900000000 6.554 seconds elapsed. dxforth #2
303700049900000000 6.497 seconds elapsed. Anton #3 ok
303700049900000000 0.036 seconds elapsed. iForth ok
FORTH> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz

Locals cost something, but a smart algorithm obviously wins.

The last result is because some inlined words evaluate their arguments.

-marcel
minf...@arcor.de
2022-12-08 22:21:37 UTC
Permalink
Post by Marcel Hendrix
[..]
FORTH> in ( #100,000,000 iterations )
303700049900000000 7.753 seconds elapsed. minforth
303700049900000000 7.039 seconds elapsed. dxforth #1
303700049900000000 4.424 seconds elapsed. Albert
303700049900000000 0.681 seconds elapsed. Anton #1
303700049900000000 6.229 seconds elapsed. Gerry
303700049900000000 6.540 seconds elapsed. Anton #2
303700049900000000 6.208 seconds elapsed. Anton #3
303700049900000000 6.554 seconds elapsed. dxforth #2
303700049900000000 6.497 seconds elapsed. Anton #3 ok
303700049900000000 0.036 seconds elapsed. iForth ok
FORTH> .TICKER-INFO
AMD Ryzen 7 5800X 8-Core Processor, TICKS-GET uses iTSC at 4192MHz
Locals cost something, but a smart algorithm obviously wins.
I think many Forths do some form of register caching for top stack element(s).
But keeping locals in registers seems still unpopular in Forth. This could explain
those many memory accesses by VFX Forth for the locals version, and the
incurred speed penalty.
Anton Ertl
2022-12-08 22:33:15 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;
The idea here is to never move a value on the critical path to memory
in VFX (i.e., the critical path value should be on the TOS at the
basic block boundary).
for i in 1 4 9 a b c; do perf stat -x" " -e instructions:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null
for i in 1 4 9 a b c; do perf stat -x" " -e cycles:u vfx64 "include usqrt.4th ' usqrt$i bench bye" 2>&1 >/dev/null; done|awk '{printf("%5.1f ",$1/10000000);}'
1 4 9 a b c
187.7 257.1 191.3 179.7 177.7 141.0 instructions per invocation
166.3 141.2 139.0 161.4 160.6 136.3 cycles per invocation on Rocket Lake
169.6 105.1 157.5 162.8 168.5 95.7 cycles per invocation on Zen3
The cycles for USQRT4 vary a lot on the Rocket Lake; they are often
174 cycles, but here I show a good result.
Anyway, the idea behind USQRTC seems to work well: It wins in the
number of instructions, and the number of cycles on both CPUs.
Here is the promised code for the inner loop on VFX64:

usqrt4 usqrt9 usqrtb usqrtc
MOV RCX, [RSP] MOV RDX, [RSP] MOV RAX, [RBP+08] MOV RAX, [RBP+08]
MOV RAX, RBX MOV RAX, RDX CQO CQO
CQO CQO IDIV QWord [RBP] IDIV RBX
IDIV RCX IDIV RBX ADD RAX, [RBP] ADD RAX, RBX
MOV RDX, [RSP] ADD RAX, RBX SAR RAX, # 1 SAR RAX, # 1
ADD RDX, RAX SAR RAX, # 1 CMP RAX, [RBP] CMP RBX, RAX
SAR RDX, # 1 CMP RBX, RAX MOV RBX, RAX MOV [RBP], RBX
MOV RCX, [RSP] LEA RBP, [RBP+-08] JNL 004E418D MOV RBX, RAX
CMP RDX, RCX MOV [RBP], RBX MOV RDX, RBX JNLE 004E41E2
LEA RBP, [RBP+-08] MOV RBX, RAX MOV RBX, [RBP]
MOV [RBP], RBX JLE/N004E4090 MOV [RBP], RDX
MOV RBX, RDX LEA RBP, [RBP+08] JMP 004E4162
JNL 004E4028 JMP 004E4064
LEA RSP, [RSP+08]
PUSH RBX
MOV RBX, [RBP]
LEA RBP, [RBP+08]
JMP 004E3FEA

For space reasons I left USQRT1 and USQRTA (relatively slow versions)
away here; you can look at them in an earlier posting.

The memory references in USQRTC are to different memory locations, so
one is only read (the value U), and another is only written (to a
different location, which is then not used in the next iteration; it's
only relevant when leaving the loop).

USQRT9 also seems to have this property, yet is significantly slower;
the reasons for that are unclear to me.

USQRT4 PUSHes a value (x1) to the return stack towards the end of an
iteration, and loads it (as x0) at the start of the next iteration.
This does not seem to hurt much, though, and I don't know why.

USQRTB moves a value to data stack RAM in the penultimate instruction
of the iteration and loads it again in the IDIV instruction.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
dxforth
2022-12-09 00:09:43 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
: USQRT4 ( +n -- +root )
dup 2/ dup
BEGIN drop
2dup / over + 2/
2dup >
WHILE
swap
REPEAT
rot 2drop ;
: usqrtc ( u -- root )
dup 2/ dup
begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;
One less branch. Neat.
Anton Ertl
2022-12-09 09:58:27 UTC
Permalink
Post by Anton Ertl
Post by dxforth
Locals are supposed to reduce 'stack juggling' and simplify code but in this
instance it was the juggling between variables that proved extraneous.
I have collected some of the versions posted here into
http://www.complang.tuwien.ac.at/forth/programs/usqrt.4th
After establishing that USQRTD has the fastest loop, I now add the
better initial value. I benchmark ***@cherry.(none)'s cheap, but
less accurate initial value as well as my more precise, but more
expensive initial value. The values tested are for 2..10^7 (like in
the other tests).

First, the results:

c d 5 e
141.1 104.5 77.0 63.8 instructions/invocation
136.3 46.7 46.0 45.9 cycles/invocation Rocket Lake
95.8 40.0 32.5 32.5 cycles/invocation Zen3

c uses the original initial value: u 2/
d uses my log2-based initial value (see below)
5 and e use ***@cherry.(none)'s initial value;
5 is ***@cherry.(none)'s code
e uses his initial value with the rest from c.

Concerning usqrtd, I could not use exactly what I outlined earlier, but
had to adapt it slightly to guarantee an initial value that's larger
than the root (required by the terminating condition of the loop):

: log2 ( n -- log2 )
dup 0= swap ( log2' n )
dup $ffffffff u> 32 and rot over + -rot rshift
dup $0000ffff u> 16 and rot over + -rot rshift
dup $000000ff u> 8 and rot over + -rot rshift
dup $0000000f u> 4 and rot over + -rot rshift
dup $00000003 u> 2 and rot over + -rot rshift
$00000001 u> 1 and + ;
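The compare-and-shift LOG2 above translates to Python directly; a sketch (Python's unbounded ints stand in for 64-bit cells, and bit_length serves as a cross-check):

```python
# Floor(log2) via six compare-and-shift steps, as in the Forth LOG2
# above; log2(0) yields -1, as in the Forth version ("dup 0= swap"
# folds the zero case into the starting accumulator).
def log2(n):
    r = -1 if n == 0 else 0
    for shift, bound in ((32, 0xffffffff), (16, 0xffff), (8, 0xff),
                         (4, 0xf), (2, 0x3), (1, 0x1)):
        if n > bound:        # upper half non-empty:
            r += shift       # the result gains `shift` bits
            n >>= shift      # and we continue with that half
    return r

print(log2(0), log2(1), log2(7), log2(1 << 63))   # -1 0 2 63
```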

: usqrtd ( u -- root )
2 over log2 2/ lshift
dup begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

: uSQRT5 DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP R> DROP ;

: usqrte ( u -- root )
dup 10 rshift 1024 max
dup begin ( u x0 x1 )
nip ( u x1 )
2dup / over + 2/ ( u x1 x2 )
2dup <= until
drop nip ;

It's obvious that, for the range of numbers benchmarked, the increased
precision of USQRTD does not compensate for its increased cost. I
also benchmarked with u=1000000000002..1000010000000, with the
following results:

c d 5 e
226.9 102.7 201.4 149.9 instructions/invocation
270.7 45.3 142.2 142.6 cycles/invocation Rocket Lake
170.5 39.5 102.1 101.8 cycles/invocation Zen3

So for larger u, the higher precision of the log2 approach pays off.
OTOH, if you knew that u is in that range, you could use the following
cheap initial value:

DUP 20 RSHIFT $100000 MAX

and win over USQRTD again.
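How much the start value matters can be illustrated with a quick Python sketch (iteration counts of the plain Heron loop, not VFX cycles; names are mine). Note that each start value must be at least the true root, as required by the loop's exit test:

```python
# Count Heron iterations for u = 200000000 with the three start values
# from the thread: the original u/2, the cheap "u 10 rshift 1024 max",
# and the log2-based guess "2 over log2 2/ lshift".
def heron(u, x0):
    ct = 0
    while True:
        x1 = (u // x0 + x0) // 2   # one Heron/Newton step
        if x1 >= x0:
            return x0, ct
        x0, ct = x1, ct + 1

u = 200000000
log2_guess = 2 << (u.bit_length() - 1) // 2   # bit_length-1 == floor(log2)
for name, x0 in (("u/2", u // 2),
                 ("u>>10 max 1024", max(u >> 10, 1024)),
                 ("log2-based", log2_guess)):
    print(name, heron(u, x0))
```

The better the guess, the fewer division-heavy iterations remain, which is the whole point of the initial-value tuning above.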

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix
2022-12-09 18:09:48 UTC
Permalink
Post by Anton Ertl
Concerning usqrtd, I could not use exactly, what I outlined earlier, but
had to adapt it slightly to guarantee an initial value that's larger
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications). For your version of
the benchmark implementation, I didn't want to mention that
I had to use "log2 1+".

Therefore the sqrt with log2 initial value wins by several
streetlengths. (Although the SQRT based on FSQRT might
still be faster).

-marcel
Anton Ertl
2022-12-09 18:36:07 UTC
Permalink
Post by Marcel Hendrix
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2. In (development) Gforth it's

55CD95A2B62A: test r8,r8
55CD95A2B62D: jz $55CD95A2B645
55CD95A2B62F: bsr r8,r8
55CD95A2B633: xor r8,$3F
55CD95A2B637: movsx rax,r8
55CD95A2B63A: mov r8d,$3F
55CD95A2B640: sub r8,rax
55CD95A2B643: jmp $55CD95A2B64C
55CD95A2B645: mov r8,$FFFFFFFF
55CD95A2B64C: add r13,$08
55CD95A2B650: mov rcx,-$08[r13]
55CD95A2B654: jmp ecx

but I can see several ways to improve on this gcc output (bsr itself
is almost there).

Given that Gforth has LOG2 and iforth has LOG2, there is a slight
chance of standardization.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix
2022-12-09 22:12:43 UTC
Permalink
Post by Anton Ertl
Post by Marcel Hendrix
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...
Post by Anton Ertl
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
FORTH> : test 7 log2 . ; ok
FORTH> see test
Flags: ANSI
$0133DC40 : test
$0133DC4A mov rbx, 7 d#
$0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC5B bsr rax, rbx
$0133DC5F mov rbx, rax
$0133DC62 push rbx
$0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
$0133DC68 ;

-marcel
Anton Ertl
2022-12-09 22:24:29 UTC
Permalink
Post by Marcel Hendrix
Post by Anton Ertl
Post by Marcel Hendrix
iForth has the log2 method implemented as a table (it is very
frequently used in my type of applications).
Sorry, I mixed log2 up with 2^x ( u -- 2^u ) ...
Post by Anton Ertl
Note that Intel and AMD have bsr (since the 386) and lzcnt (if the
BMI1 or ABM extensions are present) that allow a fast implementation
of log2.
FORTH> : test 7 log2 . ; ok
FORTH> see test
Flags: ANSI
$0133DC40 : test
$0133DC4A mov rbx, 7 d#
$0133DC51 mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC5B bsr rax, rbx
$0133DC5F mov rbx, rax
$0133DC62 push rbx
$0133DC63 jmp .+10 ( $0124A102 ) offset NEAR
$0133DC68 ;
bsr's result is not defined if the source=0, so I would write
something like

bsr dest, source
bne nonzero
mov dest, -1
nonzero:
# do something with dest

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Marcel Hendrix
2022-12-09 23:32:46 UTC
Permalink
On Friday, December 9, 2022 at 11:27:13 PM UTC+1, Anton Ertl wrote:
[..]
Post by Anton Ertl
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...

Or do you mean that bsr with 0 does not work the same on all
architectures/chip generations?

FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok

-marcel
dxforth
2022-12-10 03:05:05 UTC
Permalink
Post by Marcel Hendrix
[..]
Post by Anton Ertl
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...
The implementer you mean? Several options spring to mind:

a) THROW - clumsy but gets user's attention; always works
b) 0 - can be useful
c) -1 - throwing a spanner in the works in the hope user notices
Anton Ertl
2022-12-10 15:37:23 UTC
Permalink
Post by Marcel Hendrix
[..]
Post by Anton Ertl
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...
What happens in practice is that some user writes a
program that just uses the result as-is, and it happens to work as
intended even for input 0. And if you want to provide a high-quality
experience, your programming system should better be consistent in
what it returns in that case.

Gforth returns -1 for 0 LOG2.

One benefit of returning -1 is that you can get a count of leading
zeros with

( x ) bits/cell 1- swap log2 -
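In C terms the same relation can be sketched as follows (illustrative names, 64-bit cell assumed); with the -1 convention the leading-zero count falls out directly:

```c
#include <assert.h>

/* Sketch of a LOG2 that returns -1 for input 0, as Gforth's does
   (loop version for clarity; the names here are illustrative). */
static int log2_cell(unsigned long long x) {
    int r = -1;
    while (x) { r++; x >>= 1; }
    return r;
}

/* "bits/cell 1- swap log2 -" on a 64-bit cell is 63 - log2(x).
   With log2(0) = -1 this yields 64, the full leading-zero count. */
static int clz_cell(unsigned long long x) {
    return 63 - log2_cell(x);
}
```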
Post by Marcel Hendrix
Or do you mean that bsr with 0 does not work the same on all
architectures/chip generations?
Intel certainly reserves the right to do it that way by saying that
the result is undefined in that case. However, for the same reason as
discussed above, they (and AMD) better stick with the behaviour of
their original implementation, or their shiny new processor might get
a reputation for incompatibility (yes, Intel may blame the
application, but the customer does not care who is to blame).

So they might just as well document the actual behaviour.
Post by Marcel Hendrix
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
It seems that on your hardware (Zen3 IIRC) bsr leaves the destination
register unchanged when the source is 0, and iForth is one application
that relies on this behaviour in order to return -1 (otherwise the
instruction at $0133DC4D would be superfluous); I just tried it on a
Skylake, and unsurprisingly it behaves the same way. If you don't
want to rely on Intel and AMD sticking to the same behaviour, you can
change the central instructions as shown above or as

mov rax, -1
bsr rbx,rbx
cmove rbx, rax

Interestingly, AMD has added an instruction lzcnt with the ABM
extension since the K10 (2007), and Intel has picked it up in the BMI1
extension since the Haswell (2013). lzcnt delivers a complementary
result (e.g., all-bits-set produces 0, and 0 produces 64), but in
particular it is also defined for 0. It is faster than bsr on the AMD
CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
bsr and 1 bsr every 4 cycles throughput). On the Intel CPUs listed
they both have 3 cycles latency, and 1 instruction/cycle throughput.
I wonder what makes bsr so hard to implement on the AMD CPUs that they
added lzcnt.

Of course for log2 you would need something like:

lzcnt rax, rbx
mov rbx, 63
sub rbx, rax

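A C sketch of the same idea: GCC's __builtin_clzll typically compiles to lzcnt (with BMI1/ABM) or a bsr-based sequence. Like bsr it is undefined for 0, so the guard below is an explicit addition here, matching the -1 convention discussed above:

```c
#include <assert.h>

/* log2 via a leading-zero count: 63 - clz(x) on a 64-bit value.
   __builtin_clzll is undefined for 0, hence the explicit guard. */
static int log2_via_clz(unsigned long long x) {
    if (x == 0) return -1;           /* convention: 0 LOG2 -> -1 */
    return 63 - __builtin_clzll(x);  /* the "mov 63 / sub" analogue */
}
```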
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
P Falth
2022-12-10 22:37:33 UTC
Permalink
Post by Anton Ertl
Post by Marcel Hendrix
[..]
Post by Anton Ertl
bsr's result is not defined if the source=0, so I would write
something like
bsr dest, source
bne nonzero
mov dest, -1
# do something with dest
The log2 of 0 is undefined anyway, so I prefer to let the user
of log2 decide how it should be handled...
What happens in programming is that some user happens to write a
program that just uses the result as-is, and happens to work as
intended even for input 0. And if you want to provide a high-quality
experience, your programming system should better be consistent in
what it returns in that case.
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
Post by Marcel Hendrix
Or do you mean that bsr with 0 does not work the same on all
architectures/chip generations?
Intel certainly reserves the right to do it that way by saying that
the result is undefined in that case. However, for the same reason as
discussed above, they (and AMD) better stick with the behaviour of
their original implementation, or their shiny new processor might get
a reputation for incompatibility (yes, Intel may blame the
application, but the customer does not care who is to blame).
So they might just as well document the actual behaviour.
Post by Marcel Hendrix
FORTH> : test 0 log2 . ; ok
FORTH> see test ( on AMD )
Flags: ANSI
$0133DC40 : test
$0133DC4A xor rbx, rbx
$0133DC4D mov rax, $FFFFFFFF:FFFFFFFF q#
$0133DC57 bsr rax, rbx
$0133DC5B mov rbx, rax
$0133DC5E push rbx
$0133DC5F jmp .+10 ( $0124A102 ) offset NEAR
FORTH> test -1 ok
It seems that on your hardware (Zen3 IIRC) bsr leaves the destination
register unchanged when the source is 0, and iForth is one application
that relies on this behaviour in order to return -1 (otherwise the
instruction at $0133DC4D would be superfluous); I just tried in on a
Skylake, and unsurprisingly it behaves the same way. If you don't
want to rely on Intel and AMD sticking to the same behaviour, you can
change the central instructions as shown above or as
mov rax, -1
bsr rbx,rbx
cmove rbx, rax
Interstingly, AMD has added an instruction lzcnt with the ABM
extension since the K10 (2007), and Intel has picked it up in the BMI1
extension since the Haswell (2013). lzcnt delivers a complementary
result (e.g., all-bits-set produces 0, and 0 produces 64), but in
particular it is also defined for 0. It is faster than bsr on the AMD
CPUs listed on https://gmplib.org/~tege/x86-timing.pdf (e.g., 1 cycle
latency on Zen2 and 4 lzcnt/cycle throughput, vs. 4 cycle latency for
bsr and 1 bsr every 4 cycles throughput. On the Intel CPUs listed
they both have 3 cycles latency, and 1 instruction/cycle throughput.
I wonder what makes bsr so hard to implement on the AMD CPUs that they
added lzcnt.
lzcnt rax, rbx
mov rbx, 63
sub rbx, rax
My E5 v2 cpu is the generation before Haswell and does not have the lzcnt opcode.
But it does not flag an error if executed; instead it executes bsr and gives a wrong
result. This is also stated in the Intel manual.

BR
Peter
Post by Anton Ertl
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Anton Ertl
2022-12-11 09:28:07 UTC
Permalink
Post by P Falth
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcnt is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Kerr-Mudd, John
2022-12-11 11:48:30 UTC
Permalink
On Sun, 11 Dec 2022 09:28:07 GMT
***@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


[log2 to give a first estimate for a sqrt program]
Post by Anton Ertl
Post by P Falth
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!

xpost to ala, as it's more about asm.
--
Bah, and indeed Humbug.
Kerr-Mudd, John
2022-12-11 11:55:27 UTC
Permalink
On Sun, 11 Dec 2022 11:48:30 +0000
Post by Kerr-Mudd, John
On Sun, 11 Dec 2022 09:28:07 GMT
[log2 to give a first estimate for a sqrt program]
Post by Anton Ertl
Post by P Falth
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
/alt/.lang.asm
doh!
--
Bah, and indeed Humbug.
Steve
2022-12-12 12:43:34 UTC
Permalink
Post by Kerr-Mudd, John
On Sun, 11 Dec 2022 11:48:30 +0000
Post by Kerr-Mudd, John
On Sun, 11 Dec 2022 09:28:07 GMT
[log2 to give a first estimate for a sqrt program]
Post by Anton Ertl
Post by P Falth
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
/alt/.lang.asm
doh!
--
Bah, and indeed Humbug.
Hi,

It seems strange to use such an encoding?

Cheers,

Steve N,
Kerr-Mudd, John
2022-12-12 12:57:36 UTC
Permalink
On Mon, 12 Dec 2022 12:43:34 -0000 (UTC)
Post by Steve
Post by Kerr-Mudd, John
On Sun, 11 Dec 2022 11:48:30 +0000
Post by Kerr-Mudd, John
On Sun, 11 Dec 2022 09:28:07 GMT
[log2 to give a first estimate for a sqrt program]
Post by Anton Ertl
Post by P Falth
My E5 v2 cpu is the generation before Haswell and do not have the lzcnt opcode.
But it does not flag an error if executed, instead it executes bsr! and give a wrong
result. This is also stated in the Intel manual.
Interesting. Looking at the encoding, lzcount is encoded as repne
bsr, and the repne prefix normally is ignored except for the string
instructions. Apparently nobody encoded bsr with a repne prefix,
otherwise they could not have encoded lzcnt in this way.
I just tried a leading zero count in asm, trying "repnz shr ax,1"; the
assembler (nasm) was OK encoding it but on execution it dropped out
after one shift, even though the flags register shows 'nz'. This is on a
20 year old Pentium.
Intel's Instruction Reference (ASDM) says rep[] only applies to string ops,
so I can't really complain.
I rarely use rep lods[x] though!
xpost to ala, as it's more about asm.
Hi,
It seems strange to use such an encoding?
Maybe so, but it would've saved a loop.

This was my poor attempt:

Log2b: ; ax<-ln2(ax) ; uses cx [14]
mov cx,16 ; bits per word
push cx
repnz shr ax,1
inc cx ; 1 too many
pop ax
sub ax,cx ;
; ln(0)=-1 fix
jnz .ln0nofix
dec ax
.ln0nofix:
ret


Ok, we can use 'bsr' to find the leading bit these days.

Log2c: ; ax<-ln2(ax) ; uses bx [7]

mov bx,-1
xchg ax,bx
bsr ax,bx ; cpu 386




for rep lods, perhaps size coders use
rep lodsb
rather than
mov bx,cx
mov al,[si+bx]
--
Bah, and indeed Humbug.
dxforth
2022-12-10 23:56:48 UTC
Permalink
Post by Anton Ertl
Gforth returns -1 for 0 LOG2.
One benefit of returning -1 is that you can get a count of leading
zeros with
( x ) bits/cell 1- swap log2 -
OTOH simplest implementations return 0:

: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;

code LOG2 ( u1 -- u2 )
bx pop ax ax xor 1 $: bx shr 2 $ jz ax inc 1 $ ju 2 $: 1push
end-code

To return something else requires it be special-cased - which can be
left to the application.
Heinrich Hohl
2022-12-11 08:34:13 UTC
Permalink
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
You can rewrite this as:

: LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;

Same number of operations, but u1 = 0 returns u2 = -1.

For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
As Anton stated, -1 has certain advantages. It also has the advantage that we can use
0< to check whether the output is correct or illegal.

u2 = 0 should allow us to conclude that u1 = 1.

Henry
lehs
2022-12-11 09:37:54 UTC
Permalink
I can't prove it, but it seems that beginning with a value greater than the square root and terminating when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.

: sqrtf \ u -- n
locals| u |
u sqrtc~
begin dup u over / + 2/
tuck - 2 <
until ;
lehs
2022-12-12 16:03:19 UTC
Permalink
Post by lehs
I can't prove it, but it seems that beginning with a value greater than the square root and terminate when Xn - Xn+1 < 2 always gives the correct floor function SQRTF.
: sqrtf \ u -- n
locals| u |
u sqrtc~
begin dup u over / + 2/
tuck - 2 <
until ;
No, it doesn't work, but

: lgf \ u -- n
0 swap
begin 1 rshift dup
while 1 under+
repeat drop ;

: sqrtc~ \ u -- n 2^(1+½lg2 u)
lgf 2/ 1+ 1 swap lshift ;

: sqrtf \ u -- n almost as fast as float
1- locals| u |
u sqrtc~
begin dup u over u/ + u2/ tuck u> 0=
until ;

false [if]
: sqrtf \ u -- floor a few procent faster
0 d>f fsqrt f>d drop ;
[then]
dxforth
2022-12-12 02:35:14 UTC
Permalink
Post by Heinrich Hohl
: LOG2 ( u1 -- u2 ) \ Rick C.
0 begin swap u2/ dup while swap 1+ repeat drop ;
To return something else requires it be special-cased - which can be
left to the application.
: LOG2 ( u1 -- u2 ) -1 SWAP BEGIN DUP WHILE SWAP 1+ SWAP U2/ REPEAT DROP ;
Same number of operations, but u1 = 0 returns u2 = -1.
On VFX it's even smaller (12 instr vs. 17) - though this may be a compiler
quirk. I doubt I could re-code the 8086 version to give a -1 result in the
same number of instructions/bytes let alone make it smaller.
Post by Heinrich Hohl
For u1 = 0, the output u2 = -1 is the best solution. We need a way to find out that the output
is illegal after the calculation has been done (i.e. the argument u1 is no longer available).
As Anton stated, -1 has certain advantages. It also has the advantage that we can use
0< to check whether the output is correct or illegal.
u2 = 0 should allow us to conclude that u1 = 1.
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Heinrich Hohl
2022-12-12 11:04:57 UTC
Permalink
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should
either be undefined, or it should be u2=-1. Not any other value.
dxforth
2022-12-13 03:52:43 UTC
Permalink
Post by Heinrich Hohl
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should
either be undefined, or it should be u2=-1. Not any other value.
It should be u2=-1 only until such time as someone comes along and says
u2=0 is useful. I don't have an integer example but I do have an f/p
example:

\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
minf...@arcor.de
2022-12-13 07:16:52 UTC
Permalink
Post by dxforth
Post by Heinrich Hohl
I would be inclined to test the input rather than the output. Testing
the output is a good indicator as any that it wasn't 'the best solution'.
Absolutely correct. What I want to say is that for u1=0 the output should
either be undefined, or it should be u2=-1. Not any other value.
It should be u2=-1 only until such time as someone comes along and says
u2=0 is useful. I don't have an integer example but I do have an f/p
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
0e flog has to be a NAN.
If any, -INF as approximation, but never zero.
dxforth
2022-12-14 01:22:33 UTC
Permalink
Post by ***@arcor.de
0e flog has to be a NAN.
If any, -INF as approximation, but never zero.
Only in the cult of IEEE. Mathematics simply calls it 'undefined'.
Heinrich Hohl
2022-12-13 09:14:22 UTC
Permalink
On Tuesday, December 13, 2022 at 4:52:47 AM UTC+1, dxforth wrote:
Looks correct at first sight, but on closer inspection we find:

10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01

The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.

However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
Post by dxforth
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
minf...@arcor.de
2022-12-13 11:40:19 UTC
Permalink
Post by Heinrich Hohl
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
Post by dxforth
\ return the decimal exponent of a number
100e flog . 2 ok
10e flog . 1 ok
1e flog . 0 ok
0e flog . 0 ok
0.1e flog . -1 ok
0.01e flog . -2 ok
Before starting on negative 0.0e, shouldn't it suffice to say that any
similarity between the integer and fp logarithms stops for values below 1?
dxforth
2022-12-14 01:14:48 UTC
Permalink
Post by Heinrich Hohl
10^2 = 100
10^1 = 10
10^0 = 1
10^-1 = 0.1
10^-2 = 0.01
The result 0 is not part of this series. 0 does not have an exponent.
For x --> 0, the exponent grows towards - infinity.
However, in a simple FP package it may be convenient to represent
f=0.0 as significand = 0 and exponent = 0.
Indeed in every f/p package, since:

0e fs. 0.E0 ok

To print the exponent it has to be extracted and FLOG is handy for that.
While FLOG isn't guaranteed to work on 0e, having it return 0 is better
than some other value.

In the case of LOG2 I don't see sufficient argument for it returning -1.
If least effort and least consequences from misuse be one's aim, then
returning 0 looks good to me.
Hans Bezemer
2022-12-06 16:25:09 UTC
Permalink
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
I'll give you that: it is FAST! However, this one is even faster (on my system):

: usqrt
begin
r@ over / over + 2/ swap over over < ( ct x1 x0 f)
while
drop swap 1+ swap
repeat rdrop nip nip
;
time pp4th -x usqrt.4th

Yours:
real 0m0,762s
user 0m0,761s
sys 0m0,000s

Mine:
real 0m0,682s
user 0m0,682s
sys 0m0,000s

Hans Bezemer
minf...@arcor.de
2022-12-06 16:46:23 UTC
Permalink
Post by Hans Bezemer
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
: usqrt
begin
while
drop swap 1+ swap
repeat rdrop nip nip
;
time pp4th -x usqrt.4th
real 0m0,762s
user 0m0,761s
sys 0m0,000s
real 0m0,682s
user 0m0,682s
sys 0m0,000s
Even nicer, also faster here (I have implemented only TOS-caching, but not register locals).
BTW over over is 2dup. For perfectionists: u2/ u/ and u< could double the range.
Anton Ertl
2022-12-06 16:59:59 UTC
Permalink
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>

For the initial value I would use

1 bits/cell u clz - 2/ lshift

where clz ( x -- u ) is the count-leading-zeros function that many
architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
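A C sketch of the resulting Newton-Raphson loop (note: the shift count is rounded up here so the first guess never falls below the root, which keeps the usual stop condition safe; that rounding detail is my addition, not part of the Forth one-liner above):

```c
#include <assert.h>

/* Newton-Raphson integer square root seeded from the bit length:
   seed = 2^ceil(bits/2) >= sqrt(u), so the iteration converges
   from above and the x1 >= x0 stop condition returns the floor. */
static unsigned long long usqrt_newton(unsigned long long u) {
    if (u < 2) return u;
    int bits = 64 - __builtin_clzll(u);        /* clz-based seed */
    unsigned long long x0 = 1ULL << ((bits + 1) / 2);
    for (;;) {
        unsigned long long x1 = (u / x0 + x0) / 2;
        if (x1 >= x0) return x0;               /* floor(sqrt(u)) */
        x0 = x1;
    }
}
```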

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
none) (albert
2022-12-06 17:38:41 UTC
Permalink
Post by Anton Ertl
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many
architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
Where clz is available you should use it. In this context I consider it
cheating.
If clz is not available in hardware, it takes as many iterations
(or more) as you save on the Newton iteration.
Post by Anton Ertl
- anton
Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Anton Ertl
2022-12-06 22:01:25 UTC
Permalink
Post by none) (albert
Post by Anton Ertl
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many
architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
Where `clz is available you should use it. In this context I consider it
as cheating.
If clz is not available in hardware, it takes up as much iterations
(or more) than you save on the Newton iteration.
Here's a LOG2 definition (originally by Wil Baden
<***@netcom.com>, with his peculiarities removed, extended to 64 bits, and slightly more efficient):

: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;

With this function the initial value becomes:

1 u log2 2/ lshift

It will be interesting to see at what values the crossover lies between the iterations saved in Newton's method and the additional cost of log2.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
Heinrich Hohl
2022-12-08 10:12:34 UTC
Permalink
Post by Anton Ertl
Here's a LOG2 definition (originally by Wil Baden
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
Hello Anton,

did Wil Baden explain in detail how the algorithm works?

The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.

Henry
Hans Bezemer
2022-12-08 17:12:19 UTC
Permalink
Post by Heinrich Hohl
did Wil Baden explain in detail how the algorithm works?
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
Maybe this helps:
decimal $ffffffff00000000 2 base ! u. 1111111111111111111111111111111100000000000000000000000000000000 ok
decimal $00000000ffff0000 2 base ! u. 11111111111111110000000000000000 ok
decimal $000000000000ff00 2 base ! u. 1111111100000000 ok
decimal $00000000000000f0 2 base ! u. 11110000 ok
decimal $000000000000000c 2 base ! u. 1100 ok
decimal $0000000000000002 2 base ! u. 10

If the first one checks out (AND returns a non-zero value) we know the topmost set bit is in the top 32 bits.
So we update our count (it's bits 0 - 63, so we start off with -1) by adding 32 to it, shift
the whole shebang 32 bits to the right and test the first half of the last 32 bits.

If AND had returned zero, it's not in the first 32 bits - so we leave it be and continue with the
first half of the 32 bits. Same thing here - if zero, we leave it be, if not we add 16 to the count
and shift the whole darn thing 16 bits to the right - and repeat the entire procedure for the
first half of the 16 bits (which is 8 bits).

The last bit left is either 0 or 1. So we add it to our count. Let's assume we did "*0001": then
that bit is added to -1, and the result says bit zero is set. If we did "*0000", then zero is added
to -1, signalling that NO bit is set at all.

It's quite simple and elegant when you think of it.
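The binary search just described, rendered as a C sketch (same masks as the 64-bit Forth definition earlier in the thread):

```c
#include <assert.h>

/* Binary search for the topmost set bit: start the count at -1,
   and whenever the upper half of the remaining bits is non-empty,
   add the half-width to the count and shift the value down by it. */
static int log2_binsearch(unsigned long long n) {
    int r = -1;
    if (n & 0xffffffff00000000ULL) { r += 32; n >>= 32; }
    if (n & 0x00000000ffff0000ULL) { r += 16; n >>= 16; }
    if (n & 0x000000000000ff00ULL) { r += 8;  n >>= 8;  }
    if (n & 0x00000000000000f0ULL) { r += 4;  n >>= 4;  }
    if (n & 0x000000000000000cULL) { r += 2;  n >>= 2;  }
    if (n & 0x0000000000000002ULL) { r += 1;  n >>= 1;  }
    return r + (int)n;   /* what is left is 0 or 1 */
}
```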

Hans Bezemer
Heinrich Hohl
2022-12-08 17:44:34 UTC
Permalink
Post by Hans Bezemer
It's quite simple and elegant when you think of it.
Hans Bezemer
Hello Hans,

thank you for the explanations. Yes, the algorithm is tricky but also elegant.
This is basically a binary search for the topmost nonzero bit.

I had a look at "Bit Twiddling Hacks" and found similar algorithms.
https://graphics.stanford.edu/~seander/bithacks.html

"Count the consecutive zero bits (trailing) on the right by binary search"
is almost the same algorithm, except that it searches for the lowermost nonzero bit.

Henry
Anton Ertl
2022-12-08 17:32:24 UTC
Permalink
Post by Heinrich Hohl
Post by Anton Ertl
Here's a LOG2 definition (originally by Wil Baden
: log2 ( n -- log2 )
-1 swap ( log2 n)
dup $ffffffff00000000 and if swap 32 + swap 32 rshift then
dup $00000000ffff0000 and if swap 16 + swap 16 rshift then
dup $000000000000ff00 and if swap 8 + swap 8 rshift then
dup $00000000000000f0 and if swap 4 + swap 4 rshift then
dup $000000000000000c and if swap 2 + swap 2 rshift then
dup $0000000000000002 and if swap 1 + swap 1 rshift then
+ ;
Hello Anton,
did Wil Baden explain in detail how the algorithm works?
I have only found it in a reply of mine, and what I kept from his
posting does not include an explanation. You could take the
Message-Id <***@netcom.com> and look it up.
al.howardknight.net does not have it (probably too old); maybe Google
Groups can find it (but the last few times I tried to find Message-Ids
with Google Groups, I was unsuccessful).
Post by Heinrich Hohl
The routine works properly and the code is clear, but the algorithm
behind it is not easy to understand. I would be grateful if you have
any reference with more information on this Log2 algorithm.
Hans Bezemer explains it.

A similar way to do it, but without ifs (derived from the C code in
Gforth):

: log2a ( n -- log2 )
dup 0= swap ( log2' n )
dup $ffffffff u> 32 and rot over + -rot rshift
dup $0000ffff u> 16 and rot over + -rot rshift
dup $000000ff u> 8 and rot over + -rot rshift
dup $0000000f u> 4 and rot over + -rot rshift
dup $00000003 u> 2 and rot over + -rot rshift
$00000001 u> 1 and + ;

\ checked against Gforth's builtin log2
: check
100000000 0 do i log2 i log2a <> if cr i hex. i log2 . i log2a . then loop ;

VFX64 translates this into 47 instructions (163 bytes) without a
single memory reference.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
minf...@arcor.de
2022-12-06 18:05:12 UTC
Permalink
Post by Anton Ertl
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
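For reference, the USQRT loop above is the classic integer Newton iteration x1 = (x0 + u/x0)/2, exiting when the iterates stop decreasing. A Python sketch (hypothetical name `usqrt_newton`):

```python
def usqrt_newton(u):
    """Integer square root via Newton-Raphson, mirroring USQRT above."""
    if u < 2:
        return u
    x0 = u // 2                      # initial guess, like "u 2/"
    while True:
        x1 = (u // x0 + x0) // 2     # Newton step for x^2 - u
        if x1 >= x0:                 # iterates stopped decreasing: done
            return x0
        x0 = x1
```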
Correct. Another algorithm would include tabulation and interpolation. Probably faster
but eats more space.

If anyone is interested, there is a practical story behind the game: flow measurement
through orifices, where the differential pressure dp varies roughly quadratically with the flow. For instance, the basic
principles are explained here:
https://www.efunda.com/formulae/fluids/calc_orifice_flowmeter.cfm
lehs
2022-12-09 22:11:49 UTC
Permalink
Post by Anton Ertl
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
This is the specialization of the Newton-Raphson method for the square
root, see
<https://hackernoon.com/calculating-the-square-root-of-a-number-using-the-newton-raphson-method-a-how-to-guide-yr4e32zo>
For the initial value I would use
1 bits/cell u clz - 2/ lshift
where clz ( x -- u ) is the count-leading-zeros function that many
architectures have built-in. Newton-Raphson profits a lot from a good
initial value.
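Python has no clz, but `bit_length()` equals bits/cell minus clz for values that fit a cell, so the suggested seed can be sketched as below. One deviation, flagged here as my own choice: the half-width is rounded up so the seed is an upper bound on the root, which the "iterates stop decreasing" exit test relies on.

```python
def isqrt_fast(u):
    """Newton's method seeded near sqrt(u) via the bit length (~ clz)."""
    if u < 2:
        return u
    # Round the half-width up so the seed is >= sqrt(u); the exit test
    # below assumes the iterates only decrease.
    x0 = 1 << ((u.bit_length() + 1) // 2)
    while True:
        x1 = (u // x0 + x0) // 2
        if x1 >= x0:
            return x0
        x0 = x1
```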
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net
In Android gforth lg2 is slightly faster than log2, especially for smaller u. The idea of lg2 is to assume that u is exponentially distributed rather than uniformly (rectangularly) distributed.

: lg2 \ u -- n
locals| u |
0 -1
begin 2* dup u and
while 1 under+
repeat drop ;
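lg2 grows a low-end mask (-2, -4, -8, ...) and counts how many doublings it takes before the mask clears every set bit of u, so small inputs exit early. A Python sketch (Python's unbounded negative ints behave like an infinite two's-complement mask):

```python
def lg2(u):
    """Floor of log2(u) for u >= 1, scanning from the low bits up."""
    n, mask = 0, -1
    while True:
        mask *= 2            # -2, -4, -8, ...: zeroes one more low bit
        if not mask & u:     # mask cleared every set bit of u: done
            return n
        n += 1
```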
none) (albert
2022-12-06 17:35:20 UTC
Permalink
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
My take on this.

Newton is only crazy fast if the start value is good.
The following takes 10 iterations away:

\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE
NIP REPEAT DROP RDROP ;

S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK

(That is MAX-INT, if you had not noticed.)
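The seed max(u >> 10, 1024) is an upper bound on the root for any 64-bit u: for u <= 2^20 the constant 1024 dominates sqrt(u), and for larger u, u/1024 >= sqrt(u). A Python sketch of the same loop (hypothetical name `sqrt_albert`):

```python
def sqrt_albert(n):
    """Newton isqrt seeded with max(n >> 10, 1024); assumes n >= 1."""
    x = max(n >> 10, 1024)        # always an upper bound on sqrt(n)
    while True:
        y = (n // x + x) // 2
        if y >= x:
            return x
        x = y
```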

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Marcel Hendrix
2022-12-06 18:01:50 UTC
Permalink
On Tuesday, December 6, 2022 at 6:35:24 PM UTC+1, none albert wrote:
[..]
Post by none) (albert
Newton is only crazy fast if the start value is good.
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT , if you had not noticed.
[..]

FORTH> : SQRT DUP >R 10 RSHIFT 1024 MAX BEGIN R@ OVER / OVER + 1 RSHIFT 2DUP > WHILE NIP REPEAT DROP -R ;
Redefining `SQRT` ok
FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok

-marcel
minf...@arcor.de
2022-12-06 21:05:10 UTC
Permalink
[..]
Post by none) (albert
Newton is only crazy fast if the start value is good.
\ For N return FLOOR of the square root of n.
: SQRT DUP >R 10 RSHIFT 1024 MAX \ Minimize iterations.
NIP REPEAT DROP RDROP ;
S[ ] OK -1 1 RSHIFT
S[ 9223372036854775807 ] OK SQRT
S[ 3037000499 ] OK
(that is MAX-INT , if you had not noticed.
[..]
Redefining `SQRT` ok
FORTH> : test timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test 3037000499000000000 53.520 seconds elapsed. ok
FORTH> forget sqrt : test2 timer-reset 0 #1000000000 0 do -1 1 RSHIFT sqrt + loop . .elapsed ; ok
FORTH> test2 3037000499000000000 0.435 seconds elapsed. ok
Cute. But all those algorithms use slow division in the loop.

Here is a version (in pseudo-Forth ;-) using only shifts and additions:
: USQRT {: u | T H B S -- uroot32 :}
0 to H $8000 to B 15 to S
BEGIN
H 2* B + S lshift to T
u T < 0= IF \ i.e. T <= u; a strict < misses exact squares
H B + to H
u T - to u
THEN
S 1- to S
B 2/ to B
B 0=
UNTIL
." root:" H . ;

BTW (not shown above) for syntactic sugar I regularly use
:= += -/ *= /=
with locals. Makes code more readable.
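This is the classic digit-by-digit (shift-and-add) square root: each pass tries to set the next result bit and subtracts its cost from the remainder, with no division at all. A Python transcription for checking (the comparison here is non-strict so exact squares come out right, and the `bits` parameter is my generalization of the 32-bit constants $8000 and 15):

```python
def usqrt_bits(u, bits=32):
    """Digit-by-digit square root: no division, one trial bit per pass."""
    h = 0                        # root accumulated so far
    b = 1 << (bits // 2 - 1)     # trial bit, $8000 for 32-bit input
    s = bits // 2 - 1            # scale, 15 for 32-bit input
    while b:
        t = (2 * h + b) << s     # cost of setting this bit: (h+b)**2 - h**2
        if t <= u:               # non-strict, so exact squares work
            u -= t
            h += b
        s -= 1
        b >>= 1
    return h
```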
Gerry Jackson
2022-12-07 11:36:21 UTC
Permalink
Post by ***@arcor.de
Inspired by another thread I came up with the iterative code shown below.
The challenge: find a shorter AND faster version of USQRT!
Have fun!
( please don't short-circuit through FSQRT or such ;-)
: USQRT {: u | x0 x1 ct -- uroot ; positive integer square root :}
0 to ct u 2/ to x0
BEGIN u x0 / x0 + 2/ to x1
x1 x0 <
WHILE x1 to x0
ct 1+ to ct
REPEAT
." root:" x0 . ." iterations: " ct . ;
CR 20 usqrt
CR 200 usqrt
CR 2000 usqrt
CR 20000 usqrt
CR 200000 usqrt
CR 2000000 usqrt
CR 20000000 usqrt
CR 200000000 usqrt
include usqrt.fth
root:4 iterations: 2
root:14 iterations: 4
root:44 iterations: 6
root:141 iterations: 8
root:447 iterations: 10
root:1414 iterations: 12
root:4472 iterations: 14
root:14142 iterations: 16 ok
In ANS/2012 Forth, solutions that use 2/ on an unsigned number with the
ms bit set won't work for unsigned square root as 2/ leaves the ms bit
unchanged.

Here's a recursive solution I wrote a few months ago based on:
https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations

I've put stack comments in so it can be related to the original in
Wikipedia.

: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
r> u> if 1- then ( -- u1 )
;

Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
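The recursion halves the problem: isqrt(u) is either 2*isqrt(u/4) or one more, and a single multiply decides which. A Python transcription (hypothetical name `usqrt_rec`):

```python
def usqrt_rec(u):
    """Recursive bitwise isqrt: derive isqrt(u) from isqrt(u >> 2)."""
    if u < 2:
        return u
    small = usqrt_rec(u >> 2) * 2    # 2 * isqrt(u / 4): a lower bound
    large = small + 1                # the only other candidate
    return small if large * large > u else large
```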
--
Gerry
minf...@arcor.de
2022-12-07 12:13:44 UTC
Permalink
Post by Gerry Jackson
https://en.wikipedia.org/wiki/Integer_square_root#Using_bitwise_operations
I've put stack comments in so it can be related to the original in
Wikipedia.
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
r> u> if 1- then ( -- u1 )
;
Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
Cool ! I really like this one ! :-)
Hans Bezemer
2022-12-07 13:44:58 UTC
Permalink
Post by Gerry Jackson
Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s.
Yours is 0.2s faster on my combo - even with all the function call overhead.

HB
Gerry Jackson
2022-12-07 15:30:16 UTC
Permalink
Post by Hans Bezemer
Post by Gerry Jackson
Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s.
Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
I've just realised it can be made slightly quicker using the well formed
flag trick in the last line to eliminate the IF ... THEN

: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;

I believe 4th uses 1 for TRUE so the last line for 4th would be:
r> u> -

I'm a bit dubious about whether a marginal saving like that is worthwhile.
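For what it's worth, the flag trick carries over directly: the final adjustment becomes an unconditional subtraction of a boolean. A Python sketch (True coerces to 1, playing the role of 4tH's well-formed flag):

```python
def usqrt_flag(u):
    """Recursive isqrt with a branch-free final adjustment."""
    if u < 2:
        return u
    large = usqrt_flag(u >> 2) * 2 + 1
    return large - (large * large > u)   # bool is 0 or 1, like 4tH's flag
```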
--
Gerry
minf...@arcor.de
2022-12-07 18:12:09 UTC
Permalink
Post by Gerry Jackson
Post by Hans Bezemer
Post by Gerry Jackson
Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s.
Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
I've just realised it can be made slightly quicker using the well formed
flag trick in the last line to eliminate the IF ... THEN
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;
r> u> -
I'm a bit dubious about whether a marginal saving like that is worthwhile.
\ with microscopic change & tested
: USQRT
dup 2 u< IF EXIT THEN
dup >r 2 rshift recurse
2* dup 1+ dup * r> u< - ;
Gerry Jackson
2022-12-07 21:07:28 UTC
Permalink
Post by ***@arcor.de
Post by Gerry Jackson
Post by Hans Bezemer
Post by Gerry Jackson
Worked OK for me, I wasn't concerned about speed so don't know how it
compares in that respect. Incidentally this provides an example of using
the stack vs locals.
The algorithm that started this discussion did 1M iterations in about 0.76s.
Yours is 0.2s faster on my combo - even with all the function call overhead.
HB
I've just realised it can be made slightly quicker using the well formed
flag trick in the last line to eliminate the IF ... THEN
: usqrt ( u -- u1 )
dup 2 u< if exit then
dup >r 2 rshift recurse
2* ( -- u sm )
1+ dup ( -- u la la )
dup * ( -- u la la^2 )
\ r> u> if 1- then ( -- u1 )
r> u> + ( -- u1 )
;
r> u> -
I'm a bit dubious about whether a marginal saving like that is worthwhile.
\ with microscopic change & tested
: USQRT
dup 2 u< IF EXIT THEN
dup >r 2 rshift recurse
2* dup 1+ dup * r> u< - ;
Sorry but that doesn't seem to work for all cases e.g. using your USQRT

25 usqrt . \ displays 4

Using this test program on GForth
: x ( n -- ) \ get square roots from 0 to n-1
-1 swap 0
do
i usqrt 2dup <
if cr i 3 .r ." : " nip dup then \ Display next square number
. \ Followed by square roots of integers up to next square
loop drop
;

\ With my USQRT
101 x
0: 0
1: 1 1 1
4: 2 2 2 2 2
9: 3 3 3 3 3 3 3
16: 4 4 4 4 4 4 4 4 4
25: 5 5 5 5 5 5 5 5 5 5 5
36: 6 6 6 6 6 6 6 6 6 6 6 6 6
49: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
81: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
100: 10 ok

Whereas using your version we get

101 x
0: 0
1: 1 1 1
4: 2 2 2 2 2 2
10: 3 3 3 3 3 3
16: 4 4 4 4 4 4 4 4 4 4
26: 5 5 5 5 5 5 5 5 5 5 5 5 5 5
40: 6 6 6 6 6 6 6 6 6 6
50: 7 7 7 7 7 7 7 7 7 7 7 7 7 7
64: 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
82: 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ok

I'm tired and it's too late to start thinking about it tonight.
--
Gerry
dxforth
2022-12-08 00:09:55 UTC
Permalink
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed'
flag. I don't think it was ever a given that 'calculate' was better than 'test
and branch'. Each situation according to its merits.
P Falth
2022-12-08 17:54:14 UTC
Permalink
Post by dxforth
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed'
flag. I don't think it was ever a given that 'calculate' was better than 'test
and branch'. Each situation according to its merits.
This is exactly what happens with my code generator. The "optimized" version is
5% larger (but just 3 bytes) and, more surprisingly, has a 30% longer runtime!

Gerry's original recursive version is anyway 5 times faster than your version!

Could be the division is slow on my processor, an E5-4657L v2

BR
Peter
minf...@arcor.de
2022-12-08 20:35:37 UTC
Permalink
Post by P Falth
Post by dxforth
...
I've just realised it can be made slightly quicker using the well formed flag trick in the last line to eliminate the IF ... THEN
...
I'm a bit dubious about whether a marginal saving like that is worthwhile.
Especially when an optimizing compiler is compelled to generate the 'well-formed'
flag. I don't think it was ever a given that 'calculate' was better than 'test
and branch'. Each situation according to its merits.
This is exactly what happens with my code generator. The "optimized" version is
5% larger (but just 3 bytes) and, more surprisingly, has a 30% longer runtime!
Gerry's original recursive version is anyway 5 times faster than your version!
Could be the division is slow on my processor, an E5-4657L v2
IMO 5 times speed difference is too much to be explained by a single division.
Memory and CPU cache effects seem more plausible. Recursion re-reads
items from stack memory that is probably still cached, because those items
were stored there only recently in a preceding cycle.
minf...@arcor.de
2022-12-11 10:38:59 UTC
Permalink
Since steam went out of square roots, here's what I have for cube roots.
Seldom used, mostly for volumetric calculations. Optimization ideas?

: UCBRT64 {: x | Y B -- cbrt ; integer cube root :}
0 to Y
63 BEGIN dup 0 >=
WHILE
Y 2* to Y
Y 1+ Y * 3 * 1+ to B
x over rshift B >= IF
B over lshift negate x + to x
Y 1+ to Y
THEN 3 -
REPEAT drop
Y ;

cr 0 ucbrt64 . .( 0? )
cr 1 ucbrt64 . .( 1? )
cr 2 ucbrt64 . .( 1? )
cr 7 ucbrt64 . .( 1? )
cr 8 ucbrt64 . .( 2? )
cr 9 ucbrt64 . .( 2? )
cr 999 ucbrt64 . .( 9? )
cr 1000 ucbrt64 . .( 10? )
cr 1001 ucbrt64 . .( 10? )
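The cube-root loop is the same digit-by-digit scheme as the bitwise square root: since (y+1)^3 - y^3 = 3y(y+1) + 1, B is the cost of setting the next result bit at scale 2^s. A Python transcription for checking (Python ints are unbounded, so the 63 merely bounds the scan to 64-bit inputs):

```python
def ucbrt(x):
    """Digit-by-digit integer cube root, mirroring UCBRT64."""
    y = 0
    for s in range(63, -1, -3):      # scan result bits from the top
        y *= 2
        b = 3 * y * (y + 1) + 1      # (y+1)**3 - y**3: cost of next bit
        if (x >> s) >= b:
            x -= b << s
            y += 1
    return y
```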