Discussion:
if (%eax = 1 ) and int $0x80, `echo $?` == %ebx??
(too old to reply)
do yeon
2010-07-21 02:43:02 UTC
Permalink
hello. I am learning assembly. but I have a problem in linux

i wrote code.
--------------------------
.section .data
.section .text
.globl _start
_start:
pushl $6
call factorial

movl %eax, %ebx
movl $1, %eax
int $0x80

.type factorial, @function
factorial:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ebx
movl 8(%ebp), %ecx
start_loop:
cmpl $1, %ebx
je end_loop
decl %ebx
imull %ebx, %ecx
jmp start_loop

end_loop:
movl %ecx, %eax
movl %ebp, %esp
popl %ebp
ret
-----------------------
and I do 'echo $?' in shell.
i expected the out number is 720. but 208! so, i use debug program.

I traced the program and at the end of the program the result is this.

(gdb) info registers
eax 0x1 1
ecx 0x2d0 720
edx 0x0 0
ebx 0x2d0 720
esp 0xbffff54c 0xbffff54c
ebp 0x0 0x0
esi 0x0 0
edi 0x0 0
eip 0x8048062 0x8048062 <_start+14>
eflags 0x246 [ PF ZF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x0 0
(gdb) stepi

Program exited with code 0320.
(gdb) info registers
The program has no registers now.
(gdb) quit
-----------------------------------
as you see
==============
ebx 0x2d0 720
==============
ebx is 720

so, if {(%eax = 1 ) and int $0x80, `echo $?` == %ebx}, why the output
is 208??
do yeon
2010-07-21 05:23:07 UTC
Permalink
256+256+208 = 720 ???
James Harris
2010-07-22 00:08:03 UTC
Permalink
Post by do yeon
256+256+208 = 720 ???
Others have explained that you only get the lower 8 bits of eax as
return code but you could print as big a number as you want from
within the code. Linux service 4 is for writing a string of a certain
length. Some hints:

http://codewiki.wikispaces.com/linux_stdout.nasm
http://codewiki.wikispaces.com/wr_dec.nasm

The first shows how to write to stdout. The second converts a number
to a string of decimal digits. They are both in Nasm format but you
should be able to convert them to your assembler.

James
wolfgang kern
2010-07-22 06:58:21 UTC
Permalink
Post by James Harris
Post by do yeon
256+256+208 = 720 ???
Yes, if my fingers work still correct :)
Post by James Harris
Others have explained that you only get the lower 8 bits of eax as
return code but you could print as big a number as you want from
within the code. Linux service 4 is for writing a string of a certain
http://codewiki.wikispaces.com/linux_stdout.nasm
http://codewiki.wikispaces.com/wr_dec.nasm
The first shows how to write to stdout. The second converts a number
to a string of decimal digits. They are both in Nasm format but you
should be able to convert them to your assembler.
I'm not sure for the latter will do it at all:

_________________
;The code is from
; http://codewiki.wikispaces.com/wr_dec.nasm
;

wr_dec:
test eax, 0x80000000 ;Negative?
jz .positive ;Skip if no
*** why not: TEST eax,eax
JNS .positive

mov [edi], byte '-' ;Leading minus sign
inc edi
neg eax
call .positive
neg eax ;Restore incoming value
ret

.positive:
push eax
push ebx
push edx

mov ebx, 10
xor edx, edx
div ebx

;We need to print what is in edx /after/ any digits of the value in eax

*** shouldn't here be anything ?

or eax, eax ;Any prior digits?
jz .donethis ;Skip if no
call .positive ;Print these first

*** now this call pushes 'some' bytes onto the stack :)
I'd had instead:
JNZ .positive

.donethis:
add dl, '0' ;Make a digit
mov [edi], dl ;Write it
inc edi

pop edx
pop ebx
pop eax
ret
___________________

Sorry but I can't give this an 'A' yet ;)
__
wolfgang
Frank Kotler
2010-07-22 06:04:25 UTC
Permalink
Post by wolfgang kern
Post by James Harris
Post by do yeon
256+256+208 = 720 ???
Yes, if my fingers work still correct :)
Post by James Harris
Others have explained that you only get the lower 8 bits of eax as
return code but you could print as big a number as you want from
within the code. Linux service 4 is for writing a string of a certain
http://codewiki.wikispaces.com/linux_stdout.nasm
http://codewiki.wikispaces.com/wr_dec.nasm
The first shows how to write to stdout. The second converts a number
to a string of decimal digits. They are both in Nasm format but you
should be able to convert them to your assembler.
_________________
;The code is from
; http://codewiki.wikispaces.com/wr_dec.nasm
;
test eax, 0x80000000 ;Negative?
jz .positive ;Skip if no
*** why not: TEST eax,eax
JNS .positive
mov [edi], byte '-' ;Leading minus sign
inc edi
neg eax
call .positive
neg eax ;Restore incoming value
ret
push eax
push ebx
push edx
mov ebx, 10
xor edx, edx
div ebx
;We need to print what is in edx /after/ any digits of the value in eax
*** shouldn't here be anything ?
or eax, eax ;Any prior digits?
jz .donethis ;Skip if no
call .positive ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
JNZ .positive
add dl, '0' ;Make a digit
mov [edi], dl ;Write it
inc edi
pop edx
pop ebx
pop eax
ret
___________________
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)

Best,
Frank
Frank Kotler
2010-07-22 08:27:36 UTC
Permalink
Frank Kotler wrote:

...
Post by Frank Kotler
Post by wolfgang kern
call .positive ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
JNZ .positive
add dl, '0' ;Make a digit
mov [edi], dl ;Write it
inc edi
pop edx
pop ebx
pop eax
ret
___________________
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
Now how'd I answer that before it was posted? Oh, I know. Nathan went
out of town and left me in charge. Apparently too much power for me,
'cause first thing, the electricity went out for a couple hours. When it
came back on, my clock was pooched. I think my battery's dead (in the
computer... too). Sorry 'bout that.

Best,
Frank
wolfgang kern
2010-07-22 09:45:25 UTC
Permalink
Frank Kotler replied:
...
Post by Frank Kotler
Post by wolfgang kern
_________________
;The code is from
; http://codewiki.wikispaces.com/wr_dec.nasm
;
test eax, 0x80000000 ;Negative?
jz .positive ;Skip if no
*** why not: TEST eax,eax
JNS .positive
mov [edi], byte '-' ;Leading minus sign
inc edi
neg eax
call .positive
neg eax ;Restore incoming value
ret
push eax
push ebx
push edx
mov ebx, 10
xor edx, edx
div ebx
;We need to print what is in edx /after/ any digits of the value in eax
*** shouldn't here be anything ?
or eax, eax ;Any prior digits?
jz .donethis ;Skip if no
call .positive ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
JNZ .positive
add dl, '0' ;Make a digit
mov [edi], dl ;Write it
inc edi
pop edx
pop ebx
pop eax
ret
___________________
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
No, but because of stack abusing alot.

__
wolfgang
Frank Kotler
2010-07-22 20:15:58 UTC
Permalink
...
Post by wolfgang kern
Post by Frank Kotler
Post by wolfgang kern
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
No, but because of stack abusing alot.
Yeah, I knew that's why you didn't like it. Some OSen have a dynamic
stack, so it isn't really a problem... except of course if you run off
the end of the existing stack and have to run the paging mechanism to
create another page for it - which would be *really* slow, but doesn't
happen often, even if we "abuse" the stack like this.

Simply replacing the "call .positive" with "jnz .positive" doesn't do
the trick - we push a bunch of cruft that doesn't get popped before the
"ret". I assume you had further modifications in mind. I had to try
this, to make sure it didn't work - even though I knew it wouldn't...

I wasted a *whole* lot of time on account of this! I figured I'd
"translate" this to Gas, in case do yeon (original poster) had trouble
with it. I ran "intel2gas" on it (and it needed a little hand
editing)... foolishly, on the "modified" code! Segfault, of course. Okay
"unmodify" the Gas version... Segfault! Segfault! Segfault! WTF??? Heh!

Doing it in Nasm, I did:

nasm -f elf32 wr_dec.asm
ld -o wr_dec wr_dec.o

So for Gas, I did:

as wr_dec.s
ld -o wr_dec wr_dec.o

Gas doesn't create wr_dec.o unless you tell it "-o wr_dec.o", it creates
"a.out"! I *know* this... on a "good" day. But I was repeatedly linking
the nasm-created, "modified" wr_dec.o. Duh! Sometimes I wonder if I'm
smart enough to do this at all!

Anyway... Besides "abusing" the stack, "pop" is a slow instruction. But
if we're going to worry about speed, "div" is slower than a gut-shot
wolf bitch with nine suckling pups dragging a number 9 trap uphill in a
snowstorm!

"Terje's method", multiplying by a reciprocal instead of the "div", is
*much* faster... but not so suitable for instructing beginners. I tried
to do that myself, and screwed it up badly - "mostly" correct digits,
but now an then a digit that was completely off!

I'll get back to this subject - got several alternatives we can look at
and tear apart. But... got company, and messages are backing up - my
"whitelist" has gotten corrupted, I think...

Later,
Frank
wolfgang kern
2010-07-24 11:10:51 UTC
Permalink
Frank Kotler wrote:
...
Post by Frank Kotler
Post by wolfgang kern
Post by Frank Kotler
Post by wolfgang kern
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
No, but because of stack abusing alot.
Yeah, I knew that's why you didn't like it. Some OSen have a dynamic
stack, so it isn't really a problem... except of course if you run off the
end of the existing stack and have to run the paging mechanism to create
another page for it - which would be *really* slow, but doesn't happen
often, even if we "abuse" the stack like this.
As in my last reply to James: stack is memory, 'dynamic' may just mean
that it isn't already in L1 cache :)
Post by Frank Kotler
Simply replacing the "call .positive" with "jnz .positive" doesn't do the
trick - we push a bunch of cruft that doesn't get popped before the "ret".
I assume you had further modifications in mind. I had to try this, to make
sure it didn't work - even though I knew it wouldn't...
Wasn't it 'that' obvious ? :)
Post by Frank Kotler
I wasted a *whole* lot of time on account of this! I figured I'd
"translate" this to Gas, in case do yeon (original poster) had trouble
with it. I ran "intel2gas" on it (and it needed a little hand editing)...
foolishly, on the "modified" code! Segfault, of course. Okay "unmodify"
the Gas version... Segfault! Segfault! Segfault! WTF??? Heh!
Oh, I'm awful sorry for I made you try this ;)
hope you were just kidding yet.

...
Post by Frank Kotler
Anyway... Besides "abusing" the stack, "pop" is a slow instruction. But if
we're going to worry about speed, "div" is slower than a gut-shot wolf
bitch with nine suckling pups dragging a number 9 trap uphill in a
snowstorm!
:) yeah, the DIV is a huge time-eater, but three push/pop and call/ret
pairs per result digit may double the overall timing of this function.
Post by Frank Kotler
"Terje's method", multiplying by a reciprocal instead of the "div", is
*much* faster... but not so suitable for instructing beginners. I tried to
do that myself, and screwed it up badly - "mostly" correct digits, but now
an then a digit that was completely off!
The problems with reciprocal MUL are correct rounding and remainder-
building. MUL by 1/10 is a worst case, because 0.1 is a periodic
figure in the binary world.

ie: 2**32/10 =429496729.6 =0x19999999.9999...
we should roundup early to 0x1999999A, not after the MUL

modifying my example (in answer to James) with reciprocal MUL:
add push/pop esi on start resp. tail.

L1:
mov ebx,1999999Ah ;now within the loop
mul ebx
;;
we have the quotient in edx yet (hidden shift-right 32)
and eax holds the binary fraction of the quotient,
but what we need for our bin2dec is MOD (the remainder)...
MUL the fraction in eax by 10 will produce MOD in edx,
but previous edx is needed as well, I save it in esi yet.
;;
mov ebx,10
mov esi,edx ;to avoid push/pop within the loop
mul ebx ;DL hols MOD 10 yet
or dl,30h
mov [edi+ecx],dl ;store it in buffer
mov eax,esi ;restore quotient
dec ecx
test eax,eax
jnz L1

when I remember Terje's examples, they looked more elegant than this ...
Post by Frank Kotler
I'll get back to this subject - got several alternatives we can look at
and tear apart. But... got company, and messages are backing up - my
"whitelist" has gotten corrupted, I think...
good luck! seaya then,
__
wolfgang
James Harris
2010-07-24 19:18:30 UTC
Permalink
...
Post by Frank Kotler
I wasted a *whole* lot of time on account of this!
Me too! No complaints, though. This is heading for a better solution.
Post by Frank Kotler
Anyway... Besides "abusing" the stack, "pop" is a slow instruction.
I was puzzled by this and I've run some tests on a Pentium M. Against
a number of pushes taking one cycle each time the first pop took 9
cycles and all the rest took 1 cycle each.
Post by Frank Kotler
But
if we're going to worry about speed, "div" is slower than a gut-shot
wolf bitch with nine suckling pups dragging a number 9 trap uphill in a
snowstorm!
True. Consecutive divs measured at about 40 cycles each on a Pentium M
and the same on an AMD Athlon 64 X2. By contrast here are some figures
for multiply. These are all on the Pentium M.

mul eax

took an average of 3.5 cycles. Faster on this processor is to multiply
by ten using

lea eax, [eax + 4 * eax]
shl eax, 1

This averaged 0.9 cycles.

It should be said that div in particular can be overlapped with other
instructions. One can normally set the div running and get on with
other work while waiting for it to finish. Only when the result is
needed does one have to stop and wait.

James
James Harris
2010-07-24 22:00:34 UTC
Permalink
...
Post by wolfgang kern
Post by Frank Kotler
Post by wolfgang kern
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
No, but because of stack abusing alot.
Here's another offering. The stack feels less 'abused' with this one.
It places the digits on the stack as they are generated but is
iterative rather than recursive. It doesn't call itself but uses two
loops, one to generate the digits and one to write them to the output
buffer. I think it's quite easy to understand. It is currently the
second link at

http://codewiki.wikispaces.com/print+decimal

The two loops, one after the other, are as follows.

;Split the number into its constituent digits

jmp .split_test

.split_loop:
xor edx, edx
div dword [.number_base] ;Split off the digit to the low byte of
EDX
push edx ;Save this digit

.split_test:
cmp eax, [.number_base]
jae .split_loop

;Write the digits. The first is in AL, the rest are on the stack

.write_loop:
add al, "0"
mov [edi], al
inc edi
pop eax
cmp eax, [.number_base]
jne .write_loop

If SD is the number of significant digits this routine hasn't removed
stack use completely but has taken it down from 12SD to 4SD.

I'm undecided whether this one or the original algorithm is easiest to
understand so will leave both there. And there looks to be another
good way based on Wolfgang's code hence the above page to group
together solutions.

James
wolfgang kern
2010-07-25 11:32:15 UTC
Permalink
Post by James Harris
Post by wolfgang kern
Post by Frank Kotler
Post by wolfgang kern
Sorry but I can't give this an 'A' yet ;)
Because it doesn't segfault? :)
No, but because of stack abusing alot.
Here's another offering. The stack feels less 'abused' with this one.
It places the digits on the stack as they are generated but is
iterative rather than recursive. It doesn't call itself but uses two
loops, one to generate the digits and one to write them to the output
buffer. I think it's quite easy to understand. It is currently the
second link at
http://codewiki.wikispaces.com/print+decimal
This look much better yet !
Post by James Harris
The two loops, one after the other, are as follows.
;Split the number into its constituent digits
jmp .split_test
xor edx, edx
div dword [.number_base] ;Split off the digit to the low byte of
EDX
push edx ;Save this digit
cmp eax, [.number_base]
jae .split_loop
;Write the digits. The first is in AL, the rest are on the stack
add al, "0"
mov [edi], al
inc edi
pop eax
cmp eax, [.number_base]
jne .write_loop
If SD is the number of significant digits this routine hasn't removed
stack use completely but has taken it down from 12SD to 4SD.
I'm undecided whether this one or the original algorithm is easiest to
understand so will leave both there. And there looks to be another
good way based on Wolfgang's code hence the above page to group
together solutions.
Recursive examples may perhaps lead to misunderstanding, see me ? :)
Before you put my examples onto wiki you better test it, my worn
eyes and the glumpsy fingers are prone for typos of any kind.

__
wolfgang
Richard Russell
2010-07-22 08:17:36 UTC
Permalink
        test    eax, 0x80000000         ;Negative?
        jz      .positive               ;Skip if no
*** why not: TEST eax,eax
             JNS  .positive
True.
*** shouldn't here be anything ?
No!
call .positive ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
I don't think you spotted that .positive is called recursively.
Sorry but I can't give this an 'A' yet ;)
The original code works fine. The use of a recursive call to print
the digits in the correct order is quite neat. Why 'fix' it when it
isn't broken?

Richard.
http://www.rtrussell.co.uk/
wolfgang kern
2010-07-22 09:52:47 UTC
Permalink
Richard Russell said:
<q>
test eax, 0x80000000 ;Negative?
jz .positive ;Skip if no
*** why not: TEST eax,eax
JNS .positive
True.
*** shouldn't here be anything ?
No!
call .positive ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
I don't think you spotted that .positive is called recursively.
Sorry but I can't give this an 'A' yet ;)
The original code works fine. The use of a recursive call to print
the digits in the correct order is quite neat. Why 'fix' it when it
isn't broken?

Richard.
</q>

For your "No!": I expected something like 'print'.

I saw the 'neat' recursive idea, but I have no clue where the
gain with stackup four dwords per byte should be.
To make it in the other order I'd just start at the end.

__
wolfgang
James Harris
2010-07-22 16:27:52 UTC
Permalink
Post by Richard Russell
<q>
Re. http://codewiki.wikispaces.com/wr_dec.nasm

Thanks, Wolfgang and others, for challenging the code. We should have
more code reviews!
Post by Richard Russell
test eax, 0x80000000 ;Negative?
jz .positive ;Skip if no
*** why not: TEST eax,eax
JNS .positive
True.
This is a better test, thanks. I have changed it and have also changed
the test further down from

or eax, eax

to

test eax, eax

which is slightly better. (It removes the fake write-dependency.)
Post by Richard Russell
*** shouldn't here be anything ?
No!
      call    .positive               ;Print these first
*** now this call pushes 'some' bytes onto the stack :)
I don't think you spotted that .positive is called recursively.
Sorry but I can't give this an 'A' yet ;)
The original code works fine.  The use of a recursive call to print
the digits in the correct order is quite neat.  Why 'fix' it when it
isn't broken?
Richard.
</q>
For your "No!": I expected something like 'print'.
Richard's right but I know why you felt that. I have too when looking
at that point in the code. I've changed some comments and a label to
try to make it clearer that the print happens later. What else can I
do?
Post by Richard Russell
I saw the 'neat' recursive idea, but I have no clue where the
gain with stackup four dwords per byte should be.
I've changed the code to avoid pushing EBX repeatedly by adding the
number base to the data section. It saves 4 bytes on the stack per
digit but is a little less universal as it includes the assembler-
specific section change. Not sure if this is an improvement.

Is stack use a problem? The routine's stack budget is now three dwords
per digit. Is that too many? A 32-bit number has a strict maximum size
of ten digits so the stack cannot grow unbounded.
Post by Richard Russell
To make it in the other order I'd just start at the end.
I tried that but without success. The routine is meant partly to be
instructive and any alternative routine I came up with, such as one
starting from the left, was longer and more complex. Mind you, this
one's not as clear as it should be. Can you come up with an improved
(i.e. clearer, maybe faster, maybe simpler, maybe shorter) version?
Bear in mind the routine *left-justifies* the number. It's easier to
write a number right-justified if starting from the left as you know
up front how wide it needs to be. If you want to write it from the
left and also write the number left justified you need to skip initial
zeroes and still make sure you print a zero if the number is zero.

BTW, Wolfgang, your posts sometimes don't have recognised reply
markers (such as ">" or similar on the left). In reading some of your
posts it's hard to work out who said what. Anything you can do about
that?

James
wolfgang kern
2010-07-22 21:09:09 UTC
Permalink
James Harris wrote:

<I again need to quote manually,
damned OE6 dunno quoted printable ?>
Post by James Harris
Re. http://codewiki.wikispaces.com/wr_dec.nasm
...
Post by James Harris
Thanks, Wolfgang and others, for challenging the code. We should have
more code reviews!
...
Post by James Harris
Richard's right but I know why you felt that. I have too when looking
at that point in the code. I've changed some comments and a label to
try to make it clearer that the print happens later. What else can I
do?
I just misunderstood this comment.
Post by James Harris
Post by wolfgang kern
I saw the 'neat' recursive idea, but I have no clue where the
gain with stackup four dwords per byte should be.
I've changed the code to avoid pushing EBX repeatedly by adding the
number base to the data section. It saves 4 bytes on the stack per
digit but is a little less universal as it includes the assembler-
specific section change. Not sure if this is an improvement.
Is stack use a problem?
stack 'is' memory, even the last accessed 64 bytes may be in L1-cache.
Post by James Harris
The routine's stack budget is now three dwords
per digit. Is that too many? A 32-bit number has a strict maximum size
of ten digits so the stack cannot grow unbounded.
Post by wolfgang kern
To make it in the other order I'd just start at the end.
I tried that but without success. The routine is meant partly to be
instructive and any alternative routine I came up with, such as one
starting from the left, was longer and more complex. Mind you, this
one's not as clear as it should be. Can you come up with an improved
(i.e. clearer, maybe faster, maybe simpler, maybe shorter) version?
Bear in mind the routine *left-justifies* the number. It's easier to
write a number right-justified if starting from the left as you know
up front how wide it needs to be. If you want to write it from the
left and also write the number left justified you need to skip initial
zeroes and still make sure you print a zero if the number is zero.
Let me try it quick and dirty. Not tested, just out of my head and
not sure if this is shorter, but it should be 'a bit' faster because
it doesn't push/pop nor call/ret in the loop.

________
bin2dec:
;edi = bufferptr
;eax = signed entry value (unchanged)
;returns: ecx = strlen
edi = pointer to first (leftmost) character
;out-commented: zero-termination

push ebx
push edx ;if required
push eax
test eax,eax
jns L0
neg eax
L0:
mov ecx,10 ;max. ten digits
mov ebx,ecx ;divider is also a ten
;mov byte [edi+ecx+1],0 ;zero-termination wanted ? ...*
L1:
xor edx,edx ;reciprocal MUL with
div ebx ;remainder may work faster.
or dl,30h ;add "0"
mov [edi+ecx],dl ;rightmost is first, also if all zero
dec ecx ;0..9 are ten, but we may have a sign
test eax,eax
jnz L1 ;until nothing left to divide
pop eax
inc ecx ;just to have strlen correct
test eax,eax ;check again for the sign
jns L2
dec ecx
mov byte[edi+ecx],2Dh ;"-"
L2:
add edi,ecx ;adjust edi to leftmost character
neg ecx ;* ...or like strlen in ecx better ?
add ecx,10 ;* ...or both if desired ;)
pop edx
pop ebx
ret
Post by James Harris
BTW, Wolfgang, your posts sometimes don't have recognised reply
markers (such as ">" or similar on the left). In reading some of your
posts it's hard to work out who said what. Anything you can do about
that?
It's annoying, but my OE6 interpretes the google-news format
somehow wrong and I often have to add quote-marks manually.

__
wolfgang
James Harris
2010-07-24 17:48:38 UTC
Permalink
On 22 July, 22:09, "wolfgang kern" <***@never.at> wrote:

...
Post by wolfgang kern
Post by James Harris
I've changed the code to avoid pushing EBX repeatedly by adding the
number base to the data section. It saves 4 bytes on the stack per
digit but is a little less universal as it includes the assembler-
specific section change. Not sure if this is an improvement.
Is stack use a problem?
stack 'is' memory, even the last accessed 64 bytes may be in L1-cache.
Stack use is fine, AFAICS. 64 bytes might be just one line of the data
cache. The L1 data cache may be 32,768 bytes or more.
Post by wolfgang kern
Post by James Harris
The routine's stack budget is now three dwords
per digit. Is that too many? A 32-bit number has a strict maximum size
of ten digits so the stack cannot grow unbounded.
Post by wolfgang kern
To make it in the other order I'd just start at the end.
I tried that but without success. The routine is meant partly to be
instructive and any alternative routine I came up with, such as one
starting from the left, was longer and more complex. Mind you, this
one's not as clear as it should be. Can you come up with an improved
(i.e. clearer, maybe faster, maybe simpler, maybe shorter) version?
Bear in mind the routine *left-justifies* the number. It's easier to
write a number right-justified if starting from the left as you know
up front how wide it needs to be. If you want to write it from the
left and also write the number left justified you need to skip initial
zeroes and still make sure you print a zero if the number is zero.
Let me try it quick and dirty. Not tested, just out of my head and
not sure if this is shorter, but it should be 'a bit' faster because
it doesn't push/pop nor call/ret in the loop.
Push/pop and call/ret should be blindingly fast. (The div in our
routines is the killer.)
Post by wolfgang kern
;edi = bufferptr
;eax = signed entry value (unchanged)
;returns: ecx = strlen
          edi = pointer to first (leftmost) character
;out-commented: zero-termination
  push ebx
  push edx            ;if required
 push eax
  test eax,eax
  jns L0
  neg eax
  mov ecx,10          ;max. ten digits
  mov ebx,ecx         ;divider is also a ten
 ;mov byte [edi+ecx+1],0   ;zero-termination wanted ? ...*
  xor edx,edx         ;reciprocal MUL with
  div ebx             ;remainder may work faster.
  or dl,30h           ;add "0"
  mov [edi+ecx],dl    ;rightmost is first, also if all zero
  dec ecx             ;0..9 are ten, but we may have a sign
  test eax,eax
  jnz L1              ;until nothing left to divide
 pop eax
  inc ecx             ;just to have strlen correct
  test eax,eax        ;check again for the sign
  jns L2
  dec ecx
  mov byte[edi+ecx],2Dh ;"-"
  add edi,ecx         ;adjust edi to leftmost character
  neg ecx             ;* ...or like strlen in ecx better ?
  add ecx,10          ;* ...or both if desired ;)
  pop edx
  pop ebx
ret
It's good but does it write the number left justified? (One of the
requirements.)

James
wolfgang kern
2010-07-25 11:16:17 UTC
Permalink
James Harris asked:
...
<q-------->
Post by wolfgang kern
;edi = bufferptr
;eax = signed entry value (unchanged)
;returns: ecx = strlen
edi = pointer to first (leftmost) character
;out-commented: zero-termination
push ebx
push edx ;if required
push eax
test eax,eax
jns L0
neg eax
mov ecx,10 ;max. ten digits
mov ebx,ecx ;divider is also a ten
;mov byte [edi+ecx+1],0 ;zero-termination wanted ? ...*
xor edx,edx ;reciprocal MUL with
div ebx ;remainder may work faster.
or dl,30h ;add "0"
mov [edi+ecx],dl ;rightmost is first, also if all zero
dec ecx ;0..9 are ten, but we may have a sign
test eax,eax
jnz L1 ;until nothing left to divide
pop eax
inc ecx ;just to have strlen correct
test eax,eax ;check again for the sign
jns L2
dec ecx
mov byte[edi+ecx],2Dh ;"-"
add edi,ecx ;adjust edi to leftmost character
neg ecx ;* ...or like strlen in ecx better ?
add ecx,10 ;* ...or both if desired ;)
pop edx
pop ebx
ret
It's good but does it write the number left justified? (One of the
requirements.)
</quote------>


Now this is a matter of view :)
Adjusting edi make it work left justified if edi points to a temporary
buffer, but if edi is used as incremental pointer in a text-string
then it would need size determination before the loop:
...
L0: ;eax is positive here, but it works unsigned as well
bsr ebx,eax ;get MSBit# = 2**n
;we need decimal digits = 1+n*0.30102999.. [log(2)]
imul ecx,ebx,0x4d10 ;=n * 2**16 * log(2) this's close enough
shr ecx,16 ; / 2**16
inc ecx ;1+
...
and except for the sign this ecx is already strlen even perhaps only
required for linewrap in an INSTRING format.
this also require a different trail or preserve of ecx for add edi,...

BTW. I really hate unformatted numeric displays where text and numbers
jump back and forward whenever a value changes. I use formatted num_out
only, so figures keep their position also INSTRING (in
opposition to D.P.-TAB set used in forms and tables) by space out
leading zeros and also the sign position remain unaltered on redraw.

__
wolfgang
wolfgang kern
2010-07-25 15:48:07 UTC
Permalink
I now combined all my recently spread codesnips.
many dependencies are still in there, but feel free to
check and/or benchmark it against stackup solutions,
... if this will work correct or at all ;)
__
wolfgang
________
bin2dec: ;instring variant
;edi = text_strptr (updated)
;eax = signed entry value
;returns: ecx = strlen-1 -sign (useless)
; edi = points to next after this
push esi
push ebx
push edx ;if required
push eax ;if required
test eax,eax
jns L0
mov byte[edi],"-"
neg eax
inc edi
L0:
bsr ebx,eax ;get MSBit# = 2**n
;we need decimal digits = 1+n*log(2) [0.30102999..]
imul ecx,ebx,0x4d10 ; 2**16 * n*log(2), is close enough here
shr ecx,16 ;/2**16
;inc ecx ;no 1+, as we include 0 and
add edi,ecx ;start at the end, ie: +9..+0
L1:
mov ebx,1999999Ah ;2**32 /10 rounded up
mul ebx
; we have the quotient in edx yet (hidden shift-right 32)
; and eax holds the binary fraction of the quotient,
; but we need MOD 10 (the remainder)...
; MUL the fraction in eax by 10 will produce MOD in edx,
; but previous edx is needed as well, I save it in esi yet.
mov ebx,10
mov esi,edx ;save quotient
mul ebx ;DL hold MOD 10 yet
or dl,30h
mov [edi],dl ;store it in buffer
mov eax,esi ;restore quotient
dec edi
test eax,eax
jnz L1 ;until nothing left to divide
lea edi,[edi+ecx+1] ;adjust edi to point after this
pop eax
pop edx
pop ebx
pop esi
ret
_____
Terje Mathisen
2010-07-25 18:44:28 UTC
Permalink
Post by wolfgang kern
I now combined all my recently spread codesnips.
many dependencies are still in there, but feel free to
check and/or benchmark it against stackup solutions,
.... if this will work correct or at all ;)
__
wolfgang
________
bin2dec: ;instring variant
;edi = text_strptr (updated)
;eax = signed entry value
;returns: ecx = strlen-1 -sign (useless)
; edi = points to next after this
push esi
push ebx
push edx ;if required
push eax ;if required
test eax,eax
jns L0
mov byte[edi],"-"
neg eax
inc edi
bsr ebx,eax ;get MSBit# = 2**n
;we need decimal digits = 1+n*log(2) [0.30102999..]
imul ecx,ebx,0x4d10 ; 2**16 * n*log(2), is close enough here
shr ecx,16 ;/2**16
;inc ecx ;no 1+, as we include 0 and
add edi,ecx ;start at the end, ie: +9..+0
mov ebx,1999999Ah ;2**32 /10 rounded up
mul ebx
That's a neat trick, it replaces lots of divisions with simple
multiplications.
Post by wolfgang kern
; we have the quotient in edx yet (hidden shift-right 32)
; and eax holds the binary fraction of the quotient,
; but we need MOD 10 (the remainder)...
; MUL the fraction in eax by 10 will produce MOD in edx,
; but previous edx is needed as well, I save it in esi yet.
mov ebx,10
mov esi,edx ;save quotient
mul ebx ;DL hold MOD 10 yet
Oops, it is possible to multiply by 5 using LEA, and save quite a few
cycles, see below.
Post by wolfgang kern
or dl,30h
mov [edi],dl ;store it in buffer
mov eax,esi ;restore quotient
dec edi
test eax,eax
jnz L1 ;until nothing left to divide
lea edi,[edi+ecx+1] ;adjust edi to point after this
pop eax
pop edx
pop ebx
pop esi
ret
Here's the code that AMD "borrowed" for their optimization manual for a
number of years: (3 MUL and 1 IMUL, all the rest is single-cycle ops)

char *dtoa(char *buf, unsigned long n)
{
__asm {
mov ebx,[n]
mov eax,2814749767

mul ebx

shr ebx,1
xor ecx,ecx

mov edi,[buf]
add eax,ebx

adc ecx,edx
mov eax,100000

shr ecx,16 // ECX = high part
mov ebx,[n]

imul eax,ecx // High part * 100k

sub ebx,eax // Low part
mov eax,429497

mul ecx

mov ecx,eax
add dl,'0'

mov eax,429497
mov [edi],dl

mul ebx

mov ebx,eax
add ecx,7

shr ecx,3
add dl,'0'

mov [edi+5],dl
add ebx,7

shr ebx,3
lea ecx,[ecx+ecx*4]

mov edx,ecx
and ecx,0fffffffh

shr edx,28
lea ebx,[ebx+ebx*4]

add dl,'0'
mov eax,ebx

shr eax,28
mov [edi+1],dl

and ebx,0fffffffh
add al,'0'

mov [edi+6],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,07ffffffh

shr edx,27
and ebx,07ffffffh

shr eax,27
add dl,'0'

add al,'0'
mov [edi+2],dl

mov [edi+7],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,03ffffffh

shr edx,26
and ebx,03ffffffh

shr eax,26
add dl,'0'

add al,'0'
mov [edi+3],dl

mov [edi+8],al
lea ecx,[ecx+ecx*4]

shr ecx,25
lea ebx,[ebx+ebx*4]

shr ebx,25
add cl,'0'

add bl,'0'
mov [edi+10],ah

mov [edi+4],cl
mov [edi+9],bl
}

return buf;
}

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
wolfgang kern
2010-07-26 12:25:51 UTC
Permalink
Post by Terje Mathisen
Post by wolfgang kern
I now combined all my recently spread codesnips.
many dependencies are still in there, but feel free to
check and/or benchmark it against stackup solutions,
.... if this will work correct or at all ;)
________
bin2dec: ;instring variant
;edi = text_strptr (updated)
;eax = signed entry value
;returns: ecx = strlen-1 -sign (useless)
; edi = points to next after this
push esi
push ebx
push edx ;if required
push eax ;if required
test eax,eax
jns L0
mov byte[edi],"-"
neg eax
inc edi
bsr ebx,eax ;get MSBit# = 2**n
** missing lines bug in here
Post by Terje Mathisen
Post by wolfgang kern
;we need decimal digits = 1+n*log(2) [0.30102999..]
imul ecx,ebx,0x4d10 ; 2**16 * n*log(2), is close enough here
shr ecx,16 ;/2**16
;inc ecx ;no 1+, as we include 0 and
add edi,ecx ;start at the end, ie: +9..+0
mov ebx,1999999Ah ;2**32 /10 rounded up
mul ebx
That's a neat trick, it replaces lots of divisions with simple
multiplications.
Post by wolfgang kern
; we have the quotient in edx yet (hidden shift-right 32)
; and eax holds the binary fraction of the quotient,
; but we need MOD 10 (the remainder)...
; MUL the fraction in eax by 10 will produce MOD in edx,
; but previous edx is needed as well, I save it in esi yet.
mov ebx,10
mov esi,edx ;save quotient
mul ebx ;DL hold MOD 10 yet
Oops, it is possible to multiply by 5 using LEA, and save quite a few
cycles, see below.
Yes mmh.., but we have the fraction in eax and if I'd use LEA for MUL,5
followed by SHL,1 or ADD itself, where would the wanted overflow be?
It would need to split it into two regs for not loosing precision.
Post by Terje Mathisen
Post by wolfgang kern
or dl,30h
mov [edi],dl ;store it in buffer
mov eax,esi ;restore quotient
dec edi
test eax,eax
jnz L1 ;until nothing left to divide
lea edi,[edi+ecx+1] ;adjust edi to point after this
pop eax
pop edx
pop ebx
pop esi
ret
Here's the code that AMD "borrowed" for their optimization manual for a
number of years: (3 MUL and 1 IMUL, all the rest is single-cycle ops)
I haven't seen it before, which xxxxx.pdf do you mean here ?

With interleaved functions you have much lesser dependencies in this.
But I tried (minor success) to figure out where 'your figures' come
from, a few comments would help in understanding this example ...

__
wolfgang
Post by Terje Mathisen
char *dtoa(char *buf, unsigned long n)
{
__asm {
mov ebx,[n]
mov eax,2814749767
mul ebx
shr ebx,1
xor ecx,ecx
mov edi,[buf]
add eax,ebx
adc ecx,edx
mov eax,100000
shr ecx,16 // ECX = high part
mov ebx,[n]
imul eax,ecx // High part * 100k
sub ebx,eax // Low part
mov eax,429497
mul ecx
mov ecx,eax
add dl,'0'
mov eax,429497
mov [edi],dl
mul ebx
mov ebx,eax
add ecx,7
shr ecx,3
add dl,'0'
mov [edi+5],dl
add ebx,7
shr ebx,3
lea ecx,[ecx+ecx*4]
mov edx,ecx
and ecx,0fffffffh
shr edx,28
lea ebx,[ebx+ebx*4]
add dl,'0'
mov eax,ebx
shr eax,28
mov [edi+1],dl
and ebx,0fffffffh
add al,'0'
mov [edi+6],al
lea ecx,[ecx+ecx*4]
lea ebx,[ebx+ebx*4]
mov edx,ecx
mov eax,ebx
and ecx,07ffffffh
shr edx,27
and ebx,07ffffffh
shr eax,27
add dl,'0'
add al,'0'
mov [edi+2],dl
mov [edi+7],al
lea ecx,[ecx+ecx*4]
lea ebx,[ebx+ebx*4]
mov edx,ecx
mov eax,ebx
and ecx,03ffffffh
shr edx,26
and ebx,03ffffffh
shr eax,26
add dl,'0'
add al,'0'
mov [edi+3],dl
mov [edi+8],al
lea ecx,[ecx+ecx*4]
shr ecx,25
lea ebx,[ebx+ebx*4]
shr ebx,25
add cl,'0'
add bl,'0'
mov [edi+10],ah
mov [edi+4],cl
mov [edi+9],bl
}
return buf;
}
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Noob
2010-07-26 13:46:55 UTC
Permalink
Post by wolfgang kern
I haven't seen it before, which xxxxx.pdf do you mean here ?
I think he meant Athlon x86 Code Optimization Guide
http://support.amd.com/us/Processor_TechDocs/22007.pdf
Chapter 8: Integer Optimizations
Efficient Binary-to-ASCII Decimal Conversion

Regards.
wolfgang kern
2010-07-26 18:25:53 UTC
Permalink
Post by Noob
Post by wolfgang kern
I haven't seen it before, which xxxxx.pdf do you mean here ?
I think he meant Athlon x86 Code Optimization Guide
http://support.amd.com/us/Processor_TechDocs/22007.pdf
Chapter 8: Integer Optimizations
Efficient Binary-to-ASCII Decimal Conversion
Thanks, I have this on my HD, but the example there
look a bit different (yet).

__
wolfgang
Terje Mathisen
2010-07-26 21:19:33 UTC
Permalink
Post by wolfgang kern
Post by Noob
Post by wolfgang kern
I haven't seen it before, which xxxxx.pdf do you mean here ?
I think he meant Athlon x86 Code Optimization Guide
http://support.amd.com/us/Processor_TechDocs/22007.pdf
Chapter 8: Integer Optimizations
Efficient Binary-to-ASCII Decimal Conversion
Thanks, I have this on my HD, but the example there
look a bit different (yet).
That version is a bit updated from the first time I saw it in an AMD
manual afair, but the algorithm is still exactly as when I first
published it on usenet. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
2010-07-26 17:42:54 UTC
Permalink
Post by wolfgang kern
Post by Terje Mathisen
Here's the code that AMD "borrowed" for their optimization manual for a
number of years: (3 MUL and 1 IMUL, all the rest is single-cycle ops)
I haven't seen it before, which xxxxx.pdf do you mean here ?
With interleaved functions you have much lesser dependencies in this.
But I tried (minor success) to figure out where 'your figures' come
from, a few comments would help in understanding this example ...
OK:

The task is convert an usigned 32-bit value into 10 decimal digits:

First split the input into two halves, modulo 100 000. I do this with a
constant division by 100K, i.e. reciprocal multiplication by
ceil(65536*2^32/100K) = 2814749767.

This gives an exact answer, so I back-multiply by 100K and subtract to
get the remainder, right?

The next step is to scale each of the sub-100K numbers by the same
constant: 429497, i.e. ceil(2^32/10K), which will leave the top digit as
the top 4 bits in a 32-bit value.

From this point on I extract digits by shifting right 28 bits and
adding '0', then masking away those top bits I've used.

The next step is to take the remainder and multiply by 5

LEA [reg+reg*4]

At this point there is a factor of two still missing (to get from 5 to
10), but I can fix that by reducing the shift count by 1, to 27.

Repeating this for each of the 3 remaining digits in each 5-digit half
gives all 10 digits in ~30-40 cycles. :-)

OK?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Redelmeier
2010-07-26 18:11:22 UTC
Permalink
Post by Terje Mathisen
First split the input into two halves, modulo 100 000. I do this
with a constant division by 100K, i.e. reciprocal multiplication
by ceil(65536*2^32/100K) = 2814749767.
This gives an exact answer, so I back-multiply by 100K and subtract
to get the remainder, right?
The next step is to scale each of the sub-100K numbers by the same
constant: 429497, i.e. ceil(2^32/10K), which will leave the top
digit as the top 4 bits in a 32-bit value.
From this point on I extract digits by shifting right 28 bits and
adding '0', then masking away those top bits I've used.
The next step is to take the remainder and multiply by 5
LEA [reg+reg*4]
At this point there is a factor of two still missing (to get from
5 to 10), but I can fix that by reducing the shift count by 1, to 27.
Repeating this for each of the 3 remaining digits in each 5-digit
half gives all 10 digits in ~30-40 cycles. :-)
OK?
Very clever and possibly faster than the obvious code:

FILD [buff]
FBSTP [buf2]

followed by ASCII deswizzling if needed.

-- Robert
Terje Mathisen
2010-07-26 21:24:00 UTC
Permalink
Post by Robert Redelmeier
Post by Terje Mathisen
At this point there is a factor of two still missing (to get from
5 to 10), but I can fix that by reducing the shift count by 1, to 27.
Repeating this for each of the 3 remaining digits in each 5-digit
half gives all 10 digits in ~30-40 cycles. :-)
OK?
FILD [buff]
Can't be used directly, this is a 32-bit unsigned, not a signed value.
Post by Robert Redelmeier
FBSTP [buf2]
followed by ASCII deswizzling if needed.
Possibly???
:-)

Have you ever timed FBSTP?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Redelmeier
2010-07-27 01:59:28 UTC
Permalink
Post by Terje Mathisen
Post by Robert Redelmeier
FILD [buff]
Can't be used directly, this is a 32-bit unsigned, not a signed value.
Sure, presuming the data is unsigned just clear the dword above [buff] and iload 64
Post by Terje Mathisen
Post by Robert Redelmeier
FBSTP [buf2] >> >> followed by ASCII deswizzling if needed.
Possibly??? :-)
Have you ever timed FBSTP?
Just did. ~120 clocks on my Centrino. Paul Allen's DEBUG
remains Microsoft's best (least bad?) software ever produced.
Not that it has much competition.

-- Robert
Terje Mathisen
2010-07-27 07:52:01 UTC
Permalink
Post by Robert Redelmeier
Post by Terje Mathisen
Post by Robert Redelmeier
FILD [buff]
Can't be used directly, this is a 32-bit unsigned, not a signed value.
Sure, presuming the data is unsigned just clear the dword above [buff] and iload 64
BTDT, the problem is that storing two 32-bit values and loading 64-bit
cause a stall of the forwarding network, so you have to wait until the
second store has been retired before the pipeline can restart.

It can be faster to load the value blindly, then adjust (add 2^32) if
needed, using the top bit as the index into a two-word float table.
Post by Robert Redelmeier
Post by Terje Mathisen
Post by Robert Redelmeier
FBSTP [buf2]>> >> followed by ASCII deswizzling if needed.
Possibly??? :-)
Have you ever timed FBSTP?
Just did. ~120 clocks on my Centrino. Paul Allen's DEBUG
120 isn't that bad actually, way faster than the naive DIV loop!
Post by Robert Redelmeier
remains Microsoft's best (least bad?) software ever produced.
Not that it has much competition.
What about SYMDEB? I still use that every month or two, even if I have
to use a 32-bit VM in order to run it.:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Redelmeier
2010-07-27 14:15:54 UTC
Permalink
Post by Terje Mathisen
BTDT, the problem is that storing two 32-bit values and loading 64-bit
cause a stall of the forwarding network, so you have to wait until the
second store has been retired before the pipeline can restart.
Sure, but to L1 cache this stall should not be long.
Post by Terje Mathisen
It can be faster to load the value blindly, then adjust (add 2^32) if
needed, using the top bit as the index into a two-word float table.
Good point, and a table is _way_ better than a poorly predicted test.
Post by Terje Mathisen
What about SYMDEB? I still use that every month or two,
even if I have to use a 32-bit VM in order to run it.:-)
Never much liked it. DEBUG is still part of the base MS-WIndows install.

-- Robert
Terje Mathisen
2010-07-27 20:07:19 UTC
Permalink
Post by Robert Redelmeier
Post by Terje Mathisen
BTDT, the problem is that storing two 32-bit values and loading 64-bit
cause a stall of the forwarding network, so you have to wait until the
second store has been retired before the pipeline can restart.
Sure, but to L1 cache this stall should not be long.
You might be surprised, on a Pentium4 it could be 20+ cycles.

Even on a modern Core it is a very significant hit, unless the write
forwarding circuitry has lots of extra tweaks in order to detect and
handle this situation, and then you would probably have to write both
words close together, i.e. you can't leave a top/second word of zero
laying around in a buffer.

OTOH, using a fixed buffer is very bad for reentrancy, so two
stack-based words is the better idea.
Post by Robert Redelmeier
Post by Terje Mathisen
It can be faster to load the value blindly, then adjust (add 2^32) if
needed, using the top bit as the index into a two-word float table.
Good point, and a table is _way_ better than a poorly predicted test.
Conditional fp moves (x87) or masking operations (SSE) also works.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Bob Masta
2010-07-28 11:10:21 UTC
Permalink
On Tue, 27 Jul 2010 14:15:54 +0000 (UTC), Robert Redelmeier
<snip>
Post by Robert Redelmeier
Post by Terje Mathisen
What about SYMDEB? I still use that every month or two,
even if I have to use a 32-bit VM in order to run it.:-)
Never much liked it. DEBUG is still part of the base MS-WIndows install.
Just checked Win7 and found no DEBUG.

My main gripe about DEBUG is that it is apparently unchanged
from its original incarnation, and only supports 16-bit
8086/8088 opcodes. Last I checked, it can't even handle
immediate shifts/rotates like SHL AX,2, which came in with
the 286.

Olly Debug would be my first choice for modern code.

Best regards,


Bob Masta

DAQARTA v5.10
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI
DaqMusic - FREE MUSIC, Forever!
(Some assembly required)
Science (and fun!) with your sound card!
Richard Russell
2010-07-27 08:25:15 UTC
Permalink
On 26 July, 22:24, Terje Mathisen <"terje.mathisen at
Post by Terje Mathisen
Post by Terje Mathisen
Repeating this for each of the 3 remaining digits in each 5-digit
half gives all 10 digits in ~30-40 cycles. :-)
Have you ever timed FBSTP?
I would like to step back from this obsession with speed. The purpose
of converting a number to an ASCII decimal representation is to make
it understandable to a *human*. The resulting decimal number will
most likely be presented on the screen, or sent to another output
device (e.g. a printer), or stored in a file. That being the case,
the time taken to output/store the data is likely to be very much
greater than the time taken for the conversion.

Devising super-fast binary-to-decimal conversion routines is an
interesting academic exercise, but has little practical benefit. The
AMD Optimization Guide states it "can be important to the performance
of software working with text oriented protocols like HTML" but I'm
far from convinced.

Richard.
http://www.rtrussell.co.uk/
Robert Redelmeier
2010-07-27 11:58:37 UTC
Permalink
Post by Richard Russell
I would like to step back from this obsession with speed.
The purpose of converting a number to an ASCII decimal
representation is to make it understandable to a *human*.
The resulting decimal number will most likely be presented on
the screen, or sent to another output device (e.g. a printer),
or stored in a file. That being the case, the time taken to
output/store the data is likely to be very much greater than
the time taken for the conversion.
Devising super-fast binary-to-decimal conversion routines is an
interesting academic exercise, but has little practical benefit.
The AMD Optimization Guide states it "can be important to the
performance of software working with text oriented protocols
like HTML" but I'm far from convinced.
I was about to make the same comment in my reply, but then
remembered that COBOL does all math and stores variables in
packed BCD. So conversions matter. Or did at one time.


-- Robert
Terje Mathisen
2010-07-27 12:40:23 UTC
Permalink
Post by Richard Russell
Devising super-fast binary-to-decimal conversion routines is an
interesting academic exercise, but has little practical benefit. The
AMD Optimization Guide states it "can be important to the performance
of software working with text oriented protocols like HTML" but I'm
far from convinced.
Please read some of the IBM papers on the background for the recently
accepted decimal fp standard: The time to convert to/from decimal is
being used as way to show how decimal FP is required. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
H. Peter Anvin
2010-07-27 18:22:43 UTC
Permalink
Post by Terje Mathisen
Post by Richard Russell
Devising super-fast binary-to-decimal conversion routines is an
interesting academic exercise, but has little practical benefit. The
AMD Optimization Guide states it "can be important to the performance
of software working with text oriented protocols like HTML" but I'm
far from convinced.
Please read some of the IBM papers on the background for the recently
accepted decimal fp standard: The time to convert to/from decimal is
being used as way to show how decimal FP is required. :-(
Terje
In particular, converting from decimal to binary with correct rounding
is a very difficult exercise.

-hpa
Terje Mathisen
2010-07-27 20:55:53 UTC
Permalink
Post by H. Peter Anvin
Post by Terje Mathisen
Post by Richard Russell
Devising super-fast binary-to-decimal conversion routines is an
interesting academic exercise, but has little practical benefit. The
AMD Optimization Guide states it "can be important to the performance
of software working with text oriented protocols like HTML" but I'm
far from convinced.
Please read some of the IBM papers on the background for the recently
accepted decimal fp standard: The time to convert to/from decimal is
being used as way to show how decimal FP is required. :-(
Terje
In particular, converting from decimal to binary with correct rounding
is a very difficult exercise.
Yes, so what you do is to work in base 1E9, i.e. storing your numbers
with 9 FP digits in each word, plus a leading word for sign and exponent.

With 128-bit storage this allows you to handle 27-digit values exactly,
with an exponent range of +- 10^2G.

Alternatively you can reduce the exponent range to something more
reasonable and store extra digits in the remaining bits:

32-bit: 10^+-64 + 7 digits
64-bit: 10^+-4K + 15 digits or 10^+-256 + 16 digits
128-bit: 10^+-4K + 39 digits (Exp up to +-64K if we use the 4 free
leading bits in the bottom 64 bits.)

Doing it this way, and allowing un-normalized numbers, i.e. assume the
decimal point after the last digit, makes it easy to emulate the same
kind of "human decimal" math which the decimal FP standard is designed for.

The only expensive operation that remains is the normalization required
when we run out of digits, or when we need/want some rounding to happen:

This is most easily done by converting all the digits to BCD and back,
but it is faster to do it directly using tables of reciprocal muls which
separates out the desired number of digits before & after the splitting
point:


;; Split a 9-digit number into x+y digits
uint64_t split(unsigned n, unsigned digits)
{
__asm {
mov ebx,[digits]
mov ecx,[n]
mov eax,ecx
mul recip_table[ebx*4]
shr edx,shift_table[ebx*4] ;; Top part of result
mov eax,mult_table[ebx*4]
imul eax,edx
sub ecx,eax ;; Bottom part/remainder
mov eax,edx
mov edx,ecx
}
}

I.e. this function uses about 10 cycles to split a 9-digit value into
two sets of digits. Rounding would take place based on the remainder term.

Combining the various parts together requires just one IMUL and one ADD
for each word, so even for the 128-bit format with 33 digits we need a
maximum of four of these operations, so it should take less than 60 cycles.

Multiplication requires taking two 30-bit inputs and producing a 64-bit
product, then splitting the result into two 30-bit words. This is easy
with a 64x64->128 reciprocal multiplication.

The only really tough operation (as usual!) is division, this is
probably fastest to do by using the FP hardware: Load the divisor as a
pair of 30-bit integers, calculate the reciprocal and convert back to
integer.

Multiplying with an approximate reciprocal gives an approximate result,
so we have to back-multiply and subtract, and then possibly repeat once
or twice to get all the bits we need.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
wolfgang kern
2010-07-26 18:21:50 UTC
Permalink
Post by Terje Mathisen
Post by wolfgang kern
Post by Terje Mathisen
Here's the code that AMD "borrowed" for their optimization manual for a
number of years: (3 MUL and 1 IMUL, all the rest is single-cycle ops)
I haven't seen it before, which xxxxx.pdf do you mean here ?
With interleaved functions you have much lesser dependencies in this.
But I tried (minor success) to figure out where 'your figures' come
from, a few comments would help in understanding this example ...
First split the input into two halves, modulo 100 000. I do this with a
constant division by 100K, i.e. reciprocal multiplication by
ceil(65536*2^32/100K) = 2814749767.
I see 2^48/1e5 ... I didn't check that far
Post by Terje Mathisen
This gives an exact answer, so I back-multiply by 100K and subtract to get
the remainder, right?
yeah, and this doesn't have rounding issues like my dirty hack.
Post by Terje Mathisen
The next step is to scale each of the sub-100K numbers by the same
constant: 429497, i.e. ceil(2^32/10K), which will leave the top digit as
the top 4 bits in a 32-bit value.
I see now that 2^48/1e5 : 2^32/1e4 === 2^16/10 , but I need to read
this several times and check on paper to see it in full context :)
Post by Terje Mathisen
From this point on I extract digits by shifting right 28 bits and adding
'0', then masking away those top bits I've used.
The next step is to take the remainder and multiply by 5
LEA [reg+reg*4]
At this point there is a factor of two still missing (to get from 5 to
10), but I can fix that by reducing the shift count by 1, to 27.
Repeating this for each of the 3 remaining digits in each 5-digit half
gives all 10 digits in ~30-40 cycles. :-)
OK?
Oh yes, thanks Terje.
It would have taken me 'a while' to see how this trick work.

__
wolfgang
wolfgang kern
2010-07-26 11:20:02 UTC
Permalink
I posted:
and James found a bug in here:

it now put a space after sign (or leftmost) to fill
the otherwise uncovered leading zero position.
It's a dirty hack, because some values have a space
between sign and number then.

a theoretical check of -999 vs -1000:

3e7 and 3e8 will both report 11 bits and become 4 digits
so the output will look as "- 999" and "-1000" yet.
Post by wolfgang kern
I now combined all my recently spread codesnips.
many dependencies are still in there, but feel free to
check and/or benchmark it against stackup solutions,
... if this will work correct or at all ;)
any more bugs ? I just see one yet in recipr. DIV,
it roundup the result for input values >0x4000_0004
__
wolfgang

________
bin2dec: ;instring variant
;edi = text_strptr (updated)
;eax = signed entry value
;returns: ecx = strlen-1 -sign (useless)
; edi = points to next after this
push esi
push ebx
push edx ;if required
push eax ;if required
test eax,eax
jns L0
mov byte[edi],"-"
neg eax
inc edi
L0:
bsr ebx,eax ;get MSBit# = 2**n
* mov byte[edi],20h ;just in case
* INC ebx ;we need bit count rather then its number
;max. decimal digits = 1+n*log(2) [0.30102999..]
imul ecx,ebx,0x4d10 ; 2**16 * n*log(2), is close enough here
shr ecx,16 ;/2**16
;inc ecx ;no 1+, as we include 0 and
add edi,ecx ;start at the end, ie: +9..+0
L1:
mov ebx,1999999Ah ;2**32 /10 rounded up
mul ebx
; we have the quotient in edx yet (hidden shift-right 32)
; and eax holds the binary fraction of the quotient,
; but we need MOD 10 (the remainder)...
; MUL the fraction in eax by 10 will produce MOD in edx,
; but previous edx is needed as well, I save it in esi yet.
mov ebx,10
mov esi,edx ;save quotient
mul ebx ;DL hold MOD 10 yet
or dl,30h
mov [edi],dl ;store it in buffer
mov eax,esi ;restore quotient
dec edi
test eax,eax
jnz L1 ;until nothing left to divide
lea edi,[edi+ecx+1] ;adjust edi to point after this
pop eax
pop edx
pop ebx
pop esi
ret
_____
James Harris
2010-07-25 19:18:16 UTC
Permalink
On 25 July, 12:16, "wolfgang kern" <***@never.at> wrote:

...
Post by wolfgang kern
Adjusting edi make it work left justified if edi points to a temporary
buffer, but if edi is used as incremental pointer in a text-string
...
L0: ;eax is positive here, but it works unsigned as well
bsr ebx,eax ;get MSBit# = 2**n
;we need decimal digits = 1+n*0.30102999.. [log(2)]
imul ecx,ebx,0x4d10 ;=n * 2**16 * log(2) this's close enough
shr ecx,16 ; / 2**16
inc ecx ;1+
I don't understand this. Take as an example the numbers 9 and 10. The
first has a length in decimal of 1, the second a length of 2. Yet
because the numbers are, in binary, 1001 and 1010 the bsr instruction
for them both would give the same result.
Post by wolfgang kern
and except for the sign this ecx is already strlen even perhaps only
required for linewrap in an INSTRING format.
this also require a different trail or preserve of ecx for add edi,...
An "INSTRING format"? A web search hasn't helped explain that.
Post by wolfgang kern
BTW. I really hate unformatted numeric displays where text and numbers
jump back and forward whenever a value changes. I use formatted num_out
only, so figures keep their position also INSTRING (in
opposition to D.P.-TAB set used in forms and tables) by space out
leading zeros and also the sign position remain unaltered on redraw.
That's fine in columnar output or for continually updating values.
(Though if updating values on screen one needs to be sure to overwrite
leading digit positions with spaces.)

In other contexts, though, - glass teletype, log, numbers within
pieces of text - minimum-width numbers are often more appropriate. The
original routine was to print arbitrary widths. It could have been
supplied a length but would have lost some simplicity.

BTW, you mentioned once that I'd posted in a format that you could
easily reply to in OE6. I'm not aware of doing anything different so
sorry if this is not easy to reply to!

James
wolfgang kern
2010-07-25 22:09:18 UTC
Permalink
Post by James Harris
Post by wolfgang kern
Adjusting edi make it work left justified if edi points to a temporary
buffer, but if edi is used as incremental pointer in a text-string
...
L0: ;eax is positive here, but it works unsigned as well
bsr ebx,eax ;get MSBit# = 2**n
;we need decimal digits = 1+n*0.30102999.. [log(2)]
imul ecx,ebx,0x4d10 ;=n * 2**16 * log(2) this's close enough
shr ecx,16 ; / 2**16
inc ecx ;1+
I don't understand this. Take as an example the numbers 9 and 10. The
first has a length in decimal of 1, the second a length of 2. Yet
because the numbers are, in binary, 1001 and 1010 the bsr instruction
for them both would give the same result.
Oh yes, sorry, seems I need to resort the directory in my brain,
I missed not only the 'max' info in this corner of my memory ...

max. decimal digits = 1+n*log(2) where n is bit-count, not bit-number.
So my code is buggy because even if I'd insert an INC EBX after bsr
it will produce one leading zero on some values.
Post by James Harris
Post by wolfgang kern
and except for the sign this ecx is already strlen even perhaps only
required for linewrap in an INSTRING format.
this also require a different trail or preserve of ecx for add edi,...
An "INSTRING format"? A web search hasn't helped explain that.
'in string' == numeric output within text flow
I borrowed this term from PowerBasic even with modified means.
My formats are RSET,TABSET,DP-TABSET and INSTRING mean LSET but
text flow oriented with or without padding spaces.
Post by James Harris
Post by wolfgang kern
BTW. I really hate unformatted numeric displays where text and numbers
jump back and forward whenever a value changes. I use formatted num_out
only, so figures keep their position also INSTRING (in
opposition to D.P.-TAB set used in forms and tables) by space out
leading zeros and also the sign position remain unaltered on redraw.
That's fine in columnar output or for continually updating values.
(Though if updating values on screen one needs to be sure to overwrite
leading digit positions with spaces.)
In other contexts, though, - glass teletype, log, numbers within
pieces of text - minimum-width numbers are often more appropriate. The
original routine was to print arbitrary widths. It could have been
supplied a length but would have lost some simplicity.
BTW, you mentioned once that I'd posted in a format that you could
easily reply to in OE6. I'm not aware of doing anything different so
sorry if this is not easy to reply to!
this time I had no problems, looks like google.news switches sometimes
between 'quoted printable' and 'plain text'.

__
wolfgang
James Harris
2010-07-26 14:16:08 UTC
Permalink
On 22 July, 22:09, "wolfgang kern" <***@never.at> wrote:

...
Post by wolfgang kern
Post by wolfgang kern
To make it in the other order I'd just start at the end.
If the position of the rightmost digit can be determined in advance
this seems the best idea. Then the digits can be written to the buffer
without using the stack. Of course, it's dependent on working out
where the rightmost digit should go. I've put online my attempt at
working out offset of that rightmost digit. The link is below but the
routine's basic form is

;Multiply EBX by ten repeatedly until it exceeds the input value.
.loop:
inc ecx
lea ebx, [ebx + ebx * 4] ;Multiply EBX
shl ebx, 1 ; by ten
cmp ebx, eax
jna .loop

On exit ECX holds the offset from the leftmost to the rightmost digit,
from which the position of that digit can be determined.

...
Post by wolfgang kern
Let me try it quick and dirty. Not tested, just out of my head and
not sure if this is shorter, but it should be 'a bit' faster because
it doesn't push/pop nor call/ret in the loop.
... (code snipped)

Based on that here's some code to write digits. Because divide is slow
it does as few divides as necessary by stopping when the value is less
than ten, rather than stopping when the value has been reduced to
zero. This means that for single-digit values no division is needed.

jmp short .digits_loop_test

.digits_loop:
xor edx, edx
div dword [.number_base]
or dl, "0"
mov [edi + ecx], dl
dec ecx

.digits_loop_test:
cmp eax, [.number_base]
jnb .digits_loop

;The value in EAX is the leftmost digit to be written

or al, "0"
mov [edi], al

I've added a third routine for handling the sign. All three pieces of
code are online in full as the "precalculated position" option at

http://codewiki.wikispaces.com/print+decimal

It seems more complicated than the other two stack-based options but
that's at least partly due to it being split into reusable sections.
Maybe I should split the other options too....

James
wolfgang kern
2010-07-27 14:24:22 UTC
Permalink
Post by James Harris
...
Post by wolfgang kern
Post by wolfgang kern
To make it in the other order I'd just start at the end.
If the position of the rightmost digit can be determined in advance
this seems the best idea. Then the digits can be written to the buffer
without using the stack. Of course, it's dependent on working out
where the rightmost digit should go. I've put online my attempt at
working out offset of that rightmost digit. The link is below but the
routine's basic form is
;Multiply EBX by ten repeatedly until it exceeds the input value.
inc ecx
lea ebx, [ebx + ebx * 4] ;Multiply EBX
shl ebx, 1 ; by ten
cmp ebx, eax
jna .loop
On exit ECX holds the offset from the leftmost to the rightmost digit,
from which the position of that digit can be determined.
...
Post by wolfgang kern
Let me try it quick and dirty. Not tested, just out of my head and
not sure if this is shorter, but it should be 'a bit' faster because
it doesn't push/pop nor call/ret in the loop.
... (code snipped)
Based on that here's some code to write digits. Because divide is slow
it does as few divides as necessary by stopping when the value is less
than ten, rather than stopping when the value has been reduced to
zero. This means that for single-digit values no division is needed.
jmp short .digits_loop_test
xor edx, edx
div dword [.number_base]
or dl, "0"
mov [edi + ecx], dl
dec ecx
cmp eax, [.number_base]
jnb .digits_loop
;The value in EAX is the leftmost digit to be written
or al, "0"
mov [edi], al
I've added a third routine for handling the sign. All three pieces of
code are online in full as the "precalculated position" option at
http://codewiki.wikispaces.com/print+decimal
It seems more complicated than the other two stack-based options but
that's at least partly due to it being split into reusable sections.
Maybe I should split the other options too....
Yes, every improvement makes it better :)

But in the aspect of a general usable conversion routine, I still
think that buffered versions like my early attempt here and also
Terje's more exact reciprocal DIV show much more opportunities to
decide for any numeric output-format at a later point. And this
would save us from determining output size ahead of the story too.
__
wolfgang
James Harris
2010-07-27 15:44:53 UTC
Permalink
...
Post by wolfgang kern
Post by James Harris
I've added a third routine for handling the sign. All three pieces of
code are online in full as the "precalculated position" option at
 http://codewiki.wikispaces.com/print+decimal
It seems more complicated than the other two stack-based options but
that's at least partly due to it being split into reusable sections.
Maybe I should split the other options too....
Yes, every improvement makes it better :)
Too true - but changes are not necessarily improvements!
Post by wolfgang kern
But in the aspect of a general usable conversion routine, I still
think that buffered versions like my early attempt here and also
Terje's more exact reciprocal DIV show much more opportunities to
decide for any numeric output-format at a later point. And this
would save us from determining output size ahead of the story too.
What do you mean by "buffered"?

As for division by multiplication I've no problem documenting that
except that it is already documented elsewhere.

For numeric formats, you mean digit grouping, floating point numbers?

Determining the size ahead of writing it seems to me a good thing. It
gives the programmer a chance to test for buffer or line overflow
before writing the number. And partly due to having neither divides
nor multiplies, and partly due to favouring smaller numbers the
routine is fast.

I'm confused as to what you would prefer or think to be better. Care
to elucidate?

James
wolfgang kern
2010-07-27 23:27:16 UTC
Permalink
James Harris asked:

...
Post by James Harris
Post by wolfgang kern
But in the aspect of a general usable conversion routine, I still
think that buffered versions like my early attempt here and also
Terje's more exact reciprocal DIV show much more opportunities to
decide for any numeric output-format at a later point. And this
would save us from determining output size ahead of the story too.
What do you mean by "buffered"?
I meant the variant where I adjusted edi to point to the
leftmost character in a temporary buffer in oppostion to
have edi as incremental pointer for a text-string.
Post by James Harris
As for division by multiplication I've no problem documenting that
except that it is already documented elsewhere.
Sure, we find this in optimisation guides of all kind.
Post by James Harris
For numeric formats, you mean digit grouping, floating point numbers?
Yes, but not only.
My Os use several output-formats for a lot of types: (from signed
nibble up to unsigned 256 bit mantissa with 32 bit 10^n exponents,
some standards are ie: 24+8 bit, 48+16 bit beside fixpoint and
wanted display-precision with either expand,roundup or truncate)

1. integer. signed or unsigned, no exponent but can have a DP
if fixpoint and can use thousand-marks ie: 123'456'789.01
2. scientific. regardless of the storage format the figure
is shown with wanted precision ie round 5 as: 1.2346 E+8
3. engineering. similar to 2) but got exponent modulo 3,
so max. 3 integer digits. ie truncate to 6: 123.456 E+6
4. zero padded. tiny exponents are replaced by added zeros,
usable for limited range ie: signed nibble exponents.

All these can have additional attributes like Rset, Tabset,
DP-TABset, Lset, instring, and show plus-sign or space or none.
1)..3) can also use supress/space/show/expand leading and same
for trailing zeros.

I would have needed many different bin2dec routines if I hadn't
my conversion work in a decimal-NUM-dedicated-buffer apart
from any final text-string.
Post by James Harris
Determining the size ahead of writing it seems to me a good thing. It
gives the programmer a chance to test for buffer or line overflow
before writing the number. And partly due to having neither divides
nor multiplies, and partly due to favouring smaller numbers the
routine is fast.
Yes, and with conversion into a large enough temporary buffer, any
later used print-routine or string-contentation can see the actual
size of it. Like I tried with adjusted edi and strlen in ecx.
Post by James Harris
I'm confused as to what you would prefer or think to be better.
Care to elucidate?
No ;) wiki examples shouldn't be too complex nor should a newbie
find complete ready solutions in there :)

more serious again:
your example show only one (LSET in string) variant, perhaps
this is the one you needed most.
My experience with client demands show higher need for
DP-tabset and input-field aligned RSET number-display.

But anyway your example is just fine for beginners to start with.
__
wolfgang
James Harris
2010-07-29 17:42:23 UTC
Permalink
On 28 July, 00:27, "wolfgang kern" <***@never.at> wrote:

...
Post by wolfgang kern
No ;) wiki examples shouldn't be too complex nor should a newbie
find complete ready solutions in there :)
Although you're kidding the point is valid, IMHO. It's best when,
rather than providing complete solutions, the wiki carries code which
is short, easy to learn from, and helps the programmer develop a
routine.
Post by wolfgang kern
your example show only one (LSET in string) variant, perhaps
this is the one you needed most.
Not so! The third option can be used for any alignment and it could be
used to write digits to a buffer first if the programmer prefers.

On the point of options for formatting numbers, the IBM 360 had two
assembler instructions, ed and edmk, for writing packed decimal
numbers to a field. The field would be preloaded with a set of
characters which could include literal chars and digit placeholders.
The machine would overwrite the digit placeholders with digits
including some support for handling leading zeroes, signs and currency
symbols.

And there was another instruction for converting binary to packed
decimal so one could write formatted data to a field in three
instructions something like the following. (Exact forms may not be
accurate - memory is hazy - but I'm pretty sure the principles are
right.)

mvc buffer, field_template Set up the format
cvd decimal_copy, r2 Convert R2 to decimal
ed buffer, decimal_copy Write the decimal into the format

Not bad for a 1960s design, eh!
Post by wolfgang kern
My experience with client demands show higher need for
DP-tabset and input-field aligned RSET number-display.
But anyway your example is just fine for beginners to start with.
OK.

James
Frank Kotler
2010-07-21 05:23:04 UTC
Permalink
Post by do yeon
hello. I am learning assembly. but I have a problem in linux
i wrote code.
[snip]
...
Post by do yeon
-----------------------------------
as you see
==============
ebx 0x2d0 720
==============
ebx is 720
so, if {(%eax = 1 ) and int $0x80, `echo $?` == %ebx}, why the output
is 208??
0xd0 = 208 - "echo $?" is really only %bl.

Best,
Frank
Bjarni Juliusson
2010-07-21 21:31:58 UTC
Permalink
Post by Frank Kotler
Post by do yeon
so, if {(%eax = 1 ) and int $0x80, `echo $?` == %ebx}, why the output
is 208??
0xd0 = 208 - "echo $?" is really only %bl.
Or to quote the man page of exit():

SYNOPSIS
#include <stdlib.h>

void exit(int status);

DESCRIPTION
The exit() function causes normal process termination
and the value of status & 0377 is returned to the parent
(see wait(2)).

And in the man page of wait():

...

WEXITSTATUS(status)
returns the exit status of the child. This consists of the
least significant 8 bits of the status argument that the child
specified in a call to exit(3) or _exit(2) or as the argument
for a return statement in main().

The other bits of the int status indicate whether the waited-for process
died or was stopped, if it exited or was killed by a signal, which
signal, etc.


Bjarni
--
INFORMATION WANTS TO BE FREE
io_x
2010-07-23 09:27:21 UTC
Permalink
Post by do yeon
hello. I am learning assembly. but I have a problem in linux
i wrote code.
--------------------------
.section .data
.section .text
.globl _start
pushl $6
call factorial
here you not clear the stack
because in factorial() there is not "ret 4" but only "ret"
Post by do yeon
movl %eax, %ebx
movl $1, %eax
ebx=eax
eax=1
Post by do yeon
int $0x80
ebp=[0ebp, 4ra, 8p]

ebx=ebp+8
ok
Post by do yeon
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ebx
movl 8(%ebp), %ecx
cmpl $1, %ebx
je end_loop
decl %ebx
imull %ebx, %ecx
jmp start_loop
i do not understand imull %ebx, %ecx
in the above loop ecx has to be the dest
and ebx the source
c=6 b=6
1: loop c=6*(6-1) b=5
2: c=6*5*4 b=4
etc ok
Post by do yeon
movl %ecx, %eax
movl %ebp, %esp
popl %ebp
ret
so result is in eax
Post by do yeon
-----------------------
and I do 'echo $?' in shell.
i expected the out number is 720. but 208! so, i use debug program.
I traced the program and at the end of the program the result is this.
(gdb) info registers
eax 0x1 1
ecx 0x2d0 720
edx 0x0 0
ebx 0x2d0 720
esp 0xbffff54c 0xbffff54c
ebp 0x0 0x0
esi 0x0 0
edi 0x0 0
eip 0x8048062 0x8048062 <_start+14>
eflags 0x246 [ PF ZF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x0 0
(gdb) stepi
Program exited with code 0320.
(gdb) info registers
The program has no registers now.
(gdb) quit
-----------------------------------
as you see
==============
ebx 0x2d0 720
==============
ebx is 720
so, if {(%eax = 1 ) and int $0x80, `echo $?` == %ebx}, why the output
is 208??
this is the RosAsm version

[buf: B$ 0 #2048
IRisultato: B$ "Risultato", 0 , 0
IErrore: B$ "Errore", 0 , 0
IFactorialIIuIIIu: B$ "Factorial(%u)=%u", 0 , 0
]

; u32 Factorial(u32 val)
; return in eax==val!
; CF==0 all ok
; CF==1 error
; 0i, 4ra, 8P_in
align 4
Factorial:
push esi
mov esi, D$esp+ 8|mov eax, 1|cmp esi, 0|je @f |jmp @1
@e: stc |jmp @z
@1: mul esi |jc @e |dec esi |jnz @1
@f: clc
@z:
pop esi
ret 4

align 4
Main:
;iint3
jmp @1
@e: pushad
push &MB_SYSTEMMODAL__&MB_OK |push IRisultato|push IErrore|push 0
call 'user32.MessageBoxA'
popad
jmp @z
@1: push 9 |call Factorial |jc @e
push eax|push 9|push IFactorialIIuIIIu|push buf
call 'user32.wsprintfA'|add esp, 16
pushad
push &MB_SYSTEMMODAL__&MB_OK|push IRisultato|push buf|push 0
call 'user32.MessageBoxA'
popad
@z: mov eax, 0
ret
-----------------------------
PREPARSE MACRO2D

%define MsgBox 'user32.MessageBoxA'
%define wsprintfA 'user32.wsprintfA'

section .data

buf times 2048 db 0
"Risultato" db "Risultato", 0, 0
"Errore" db "Errore" , 0, 0
"Factorial(%u)=%u" db "Factorial(%u)=%u", 0, 0

section .text

; u32 Factorial(u32 val)
; return in eax==val!
; CF==0 all ok
; CF==1 error
; 0i, 4ra, 8P_in
align 4
Factorial:
<i
i=^8 |a=1|i==0#.f|#.1
.e: stc |#.z
.1: mul i|jc .e|--i#.1
.f: clc
Post by do yeon
i
ret 4

align 4
Main:
;iint3
#.1
.e: pushad|MsgBox(0, "Errore" , "Risultato", &MB_SYSTEMMODAL__&MB_OK)|popad
#.z
.1: Factorial(9)|jc .e
wsprintfA<(buf, "Factorial(%u)=%u", 9, a)
pushad|MsgBox(0, buf, "Risultato", &MB_SYSTEMMODAL__&MB_OK)|popad
.z: a=0
ret
Bernhard Schornak
2010-07-23 11:25:52 UTC
Permalink
io_x wrote:

<snip>
Post by io_x
Post by do yeon
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ebx
movl 8(%ebp), %ecx
cmpl $1, %ebx
je end_loop
decl %ebx
imull %ebx, %ecx
jmp start_loop
i do not understand imull %ebx, %ecx
in the above loop ecx has to be the dest
and ebx the source
AS reads operands left to right. "imull
%ebx,%ecx" multiplies ECX by EBX, where
the result is stored in ECX.

AFAIK, factorial means 1*2*3*4...*n, so
the loop should emit a proper result as
long as ECX does not overflow.

Might be improved slightly:

.type factorial, @function
factorial:
movl 4(%esp), %eax
movl 4(%esp), %ecx

0:decl %ecx
jbe 1f
imull %ecx,%eax
jmp 0b

1:ret

A stack frame isn't neccessary if there
are no registers to preserve. ECX isn't
saved per default in Linux environments
(calling conventions).


Greetings from Augsburg

Bernhard Schornak
r***@gmail.com
2015-05-03 15:57:44 UTC
Permalink
Quite long time ago ( around 5 yrs ) I posted this.

At that time I was learning assembly. I was completely new to assembly programming.
This was my first post about assembly.
So I had lack of assembly skills and concepts even I wasn't able to understand your replies when they posted.
But I tried to understand the concepts you mentioned at that time. Your comments was quite helpful.

Wolfgang kern and also other people pointed out the problem in the code.

And I learned a lot from your replies. I appreciated that.

If you want to find out more about the code used in this post
you can find the code in this book:
"Programming from the Ground Up" by Jonathan Bartlett.

When I was doing some examples in the first few chapters in the book,
I just wonder what will happen if I change %ebx more?
What is the maximum number that `echo $?` can produce?

And during experiment it seemed to produce a strange result.
It was different with what I expected.
That's why I introduced this post.

Anyway, Thanks for all your replies. Bye.
wolfgang kern
2015-05-04 09:16:49 UTC
Permalink
Post by r***@gmail.com
Quite long time ago ( around 5 yrs ) I posted this.
At that time I was learning assembly. I was completely new to assembly programming.
This was my first post about assembly.
So I had lack of assembly skills and concepts even I wasn't able to
understand your replies when they posted.
But I tried to understand the concepts you mentioned at that time. Your
comments was quite helpful.
Wolfgang kern and also other people pointed out the problem in the code.
And I learned a lot from your replies. I appreciated that.
If you want to find out more about the code used in this post
"Programming from the Ground Up" by Jonathan Bartlett.
When I was doing some examples in the first few chapters in the book,
I just wonder what will happen if I change %ebx more?
What is the maximum number that `echo $?` can produce?
And during experiment it seemed to produce a strange result.
It was different with what I expected.
That's why I introduced this post.
Anyway, Thanks for all your replies. Bye.
You're welcome, even I don't remember what it was :)

__
wolfgang

Loading...