Discussion:
4g/4g for 2.6.6
Phy Prabab
2004-05-23 19:43:02 UTC
Permalink
Hello,

Please cc me as I am not on this mailing list.

I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?

I am currently trying to get 4g to work with 2.6.6-mm5
but of course running into issues, so any help on this
would be great!

Thank you for your time.
Phy




Linus Torvalds
2004-05-23 20:32:19 UTC
Permalink
Post by Phy Prabab
I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?
Quite frankly, a number of us are hoping that we can make them
unnecessary. The cost of the 4g/4g split is absolutely _huge_ on some
things, including basic stuff like kernel compiles.

The only valid reason for the 4g split is that the VM doesn't always
behave well with huge amounts of highmem. The anonvma stuff in 2.6.7-pre1
is hoped to make that much less of an issue.

Personally, if we never need to merge 4g for real, I'll be really really
happy. I see it as a huge ugly hack.

Linus
Jeff Garzik
2004-05-23 20:51:13 UTC
Permalink
Post by Linus Torvalds
Post by Phy Prabab
I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?
Quite frankly, a number of us are hoping that we can make them
unnecessary. The cost of the 4g/4g split is absolutely _huge_ on some
things, including basic stuff like kernel compiles.
Sorta like I'm hoping that cheap and prevalent 64-bit CPUs make PAE36
and PAE40 on ia32 largely unnecessary. Addressing more memory than 32
bits of memory on a 32-bit CPU always seemed like a hack to me, and a
source of bugs and lost performance...

Jeff
Linus Torvalds
2004-05-24 01:55:45 UTC
Permalink
Post by Jeff Garzik
Sorta like I'm hoping that cheap and prevalent 64-bit CPUs make PAE36
and PAE40 on ia32 largely unnecessary. Addressing more memory than 32
bits of memory on a 32-bit CPU always seemed like a hack to me, and a
source of bugs and lost performance...
I agree. I held out on PAE for a longish while, in the unrealistic hope
that people would switch to alpha's.

Oh, well. I don't expect _everybody_ to switch to x86-64 immediately, but
I hope we can hold out long enough without 4g that it does work out this
time.

Linus
Jeff Garzik
2004-05-24 02:19:33 UTC
Permalink
Post by Linus Torvalds
Oh, well. I don't expect _everybody_ to switch to x86-64 immediately, but
I hope we can hold out long enough without 4g that it does work out this
time.
I think the switchover will happen fairly rapidly, since it is
positioned "the upgrade you get next time you buy a new computer",
similar to the current PATA->SATA switchover.

Since these CPUs run in 32-bit mode just fine, I bet you wind up with
people running Intel or AMD 64-bit CPUs long before they abandon their
32-bit OS. I recall people often purchasing gigabit ethernet cards long
before they had a gigabit switch, simply because it was the volume
technology being sold at the time. I think the same thing is going to
happen here, with AMD64 and EM64T.

AMD got a lot of things right with this one...

Jeff
Wim Coekaerts
2004-05-24 02:33:05 UTC
Permalink
Post by Jeff Garzik
I think the switchover will happen fairly rapidly, since it is
positioned "the upgrade you get next time you buy a new computer",
similar to the current PATA->SATA switchover.
you're thinking of what people do at home; the large business side doesn't
throw stuff out.. let me know when rhat replaces every desktop with
amd64, and I'll buy you a drink if that's within the next few years ;)
Post by Jeff Garzik
Since these CPUs run in 32-bit mode just fine, I bet you wind up with
people running Intel or AMD 64-bit CPUs long before they abandon their
32-bit OS. I recall people often purchasing gigabit ethernet cards long
before they had a gigabit switch, simply because it was the volume
technology being sold at the time. I think the same thing is going to
happen here, with AMD64 and EM64T.
yeah well what we can do (and do) is advise that hey, 64bit in some form
is really a much better environment to run servers with lots of GB on.
it just takes time for products to mature on the market; even today you
can't get a lot of those boxes, not in bulk, and I am not talking about
a 1 or 2 way or a desktop with 2GB. don't confuse the systems where
4/4GB matters with your home box.

on the plus side, all the changes that went in recently seem to work
mighty well on 32GB, and I have yet to see the first 64GB ia32 boxes;
I'm sure they are out there but I doubt that really matters any more.

anyhow, the current VM seems to be capable of handling 32GB systems on ia32
happily without needing the ugly 4/4, which is awesome. it's also
interesting timing. anyway, 2.6 kicks butt quite decently :)

Wim
Pavel Machek
2004-05-31 09:51:20 UTC
Permalink
Hi!
Post by Wim Coekaerts
Post by Jeff Garzik
I think the switchover will happen fairly rapidly, since it is
positioned "the upgrade you get next time you buy a new computer",
similar to the current PATA->SATA switchover.
you're thinking what people do at home, the large business side doesn't
throw stuff out.. let me know when rhat replaces every desktop with
amd64, and ll buy you a drink if that's within the next few years ;)
Desktops do not tend to have 4GB+ memory.
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
Martin J. Bligh
2004-05-24 03:30:52 UTC
Permalink
Post by Jeff Garzik
Since these CPUs run in 32-bit mode just fine, I bet you wind up with
people running Intel or AMD 64-bit CPUs long before they abandon their 32-bit OS.
I recall people often purchasing gigabit ethernet cards long before they
had a gigabit switch, simply because it was the volume technology being
sold at the time. I think the same thing is going to happen here,
with AMD64 and EM64T.
Except some bright spark at Intel decided to fix the low end stuff first,
and not the high end stuff for ages. Marvellous - all those 64GB 2 CPU boxes
are fixed. Triples all round.
Post by Jeff Garzik
AMD got a lot of things right with this one...
If the rest of the world would catch up, I'd be a happy man ...

M.
Eric W. Biederman
2004-06-01 05:52:10 UTC
Permalink
Post by Linus Torvalds
Post by Jeff Garzik
Sorta like I'm hoping that cheap and prevalent 64-bit CPUs make PAE36
and PAE40 on ia32 largely unnecessary. Addressing more memory than 32
bits of memory on a 32-bit CPU always seemed like a hack to me, and a
source of bugs and lost performance...
I agree. I held out on PAE for a longish while, in the unrealistic hope
that people would switch to alpha's.
Oh, well. I don't expect _everybody_ to switch to x86-64 immediately, but
I hope we can hold out long enough without 4g that it does work out this
time.
Sounds sane.

One of the real problems on machines with more than 4GB in 32bit mode
is where to put all of the PCI resources, especially when you start
getting into machines with large memory-mapped resources of 128MB or more.

On Intel chipsets that is usually not a problem because you can relocate the
memory from under those resources and still get at it with PAE36. Other
chipsets don't really have that kind of capability.
From what I can tell all of the large PCI memory mapped resources are
64bit. Which gives another solution of simply moving all of those
large memory mapped I/O resources above the memory entirely. Besides
solving my immediate problem of losing 1/2GB of memory in some
cases, that looks like the only sane long term solution.
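
For reference, the PCI spec encodes this capability in the BAR itself: bits
2:1 of a memory BAR say whether it is 64 bits wide and may therefore be
programmed above 4GB. A minimal check along those lines (the sample value
below is made up, not from any real device) could look like:

#include <stdint.h>
#include <stdio.h>

/* Per the PCI spec, a memory BAR whose type bits (bits 2:1) equal 10b
 * is 64 bits wide and may legally be placed above 4GB. */
static int bar_is_64bit(uint32_t bar)
{
	if (bar & 0x1)			/* I/O space BAR, never 64-bit */
		return 0;
	return ((bar >> 1) & 0x3) == 0x2;
}

int main(void)
{
	uint32_t bar = 0xe000000c;	/* hypothetical raw BAR value */
	printf("64-bit capable: %s\n", bar_is_64bit(bar) ? "yes" : "no");
	return 0;
}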

So to encourage x86-64 usage, I am going to implement that in
LinuxBIOS and encourage any other BIOS vendor I run into to
follow suit. That way the only painful customer questions I will
get will be why doesn't this high performance card work with a
32bit kernel :) Which is much easier to explain :)

Can anyone think of a reason that would not be a good solution?


Eric
Phy Prabab
2004-05-23 21:55:19 UTC
Permalink
So do I understand this correctly, in 2.6.7(+) it will
no longer be necessary to have the 4g patches? I will
be able to get 4g/process with kernels going forward?

Thank you for your time.
Phy
Post by Linus Torvalds
Post by Phy Prabab
I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?
Quite frankly, a number of us are hoping that we can make them
unnecessary. The cost of the 4g/4g split is absolutely _huge_ on some
things, including basic stuff like kernel compiles.
The only valid reason for the 4g split is that the VM doesn't always
behave well with huge amounts of highmem. The anonvma stuff in 2.6.7-pre1
is hoped to make that much less of an issue.
Personally, if we never need to merge 4g for real, I'll be really really
happy. I see it as a huge ugly hack.
Linus
Arjan van de Ven
2004-05-24 07:05:14 UTC
Permalink
Post by Phy Prabab
So do I understand this correctly, in 2.6.7(+) it will
no longer be necessary to have the 4g patches? I will
be able to get 4g/process with the going forward
kernels?
The kernel RPMs I do (http://people.redhat.com/arjanv/2.6/) will pretty
much always have it for 2.6. Not just for large memory configs, but
because several userspace applications (databases, java etc) really like
that extra Gb of virtual space too.
As for the cost: the 4:4 split seems to be hardly expensive at all, only
in some microbenchmarks.
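
The microbenchmarks in question are the syscall-latency kind, where an
extra TLB flush per kernel entry dominates. A minimal sketch of such a
test (getppid() is chosen here only because it is a cheap, uncached
syscall; this is not a benchmark taken from the thread) might be:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	struct timeval start, end;
	long i, iters = 10000000;
	double usecs;

	gettimeofday(&start, NULL);
	for (i = 0; i < iters; i++)
		getppid();	/* cheap syscall: cost is dominated by entry/exit */
	gettimeofday(&end, NULL);

	usecs = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
	printf("%.1f ns per getppid()\n", usecs * 1000.0 / iters);
	return 0;
}

Comparing the per-call time of a 3:1 kernel against a 4:4 kernel with
something like this is where the split's cost shows up most clearly.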
Phy Prabab
2004-05-24 07:11:26 UTC
Permalink
Is this source, or precompiled targeting just RH
setups?

Thank you for your help.
Phy
Post by Arjan van de Ven
Post by Phy Prabab
So do I understand this correctly, in 2.6.7(+) it will
no longer be necessary to have the 4g patches? I will
be able to get 4g/process with kernels going forward?
The kernel RPMs I do (http://people.redhat.com/arjanv/2.6/) will pretty
much always have it for 2.6. Not just for large memory configs, but
because several userspace applications (databases, java etc) really like
that extra Gb of virtual space too.
As for the cost: the 4:4 split seems to be hardly expensive at all, only
in some microbenchmarks.
Arjan van de Ven
2004-05-24 07:24:49 UTC
Permalink
Post by Phy Prabab
Is this source, or precompiled targeting just RH
setups?
it's both actually.
Phy Prabab
2004-05-24 07:27:05 UTC
Permalink
That's great! Do you include a change file to see what
other patches have been applied?

Thank you very much for your help.
Phy
On Mon, May 24, 2004 at 12:11:26AM -0700, Phy Prabab
Post by Phy Prabab
Is this source, or precompiled targeting just RH
setups?
it's both actually.
Dave Jones
2004-05-24 12:01:54 UTC
Permalink
Post by Phy Prabab
Thats great! Do you include a change file to see what
other patches have been applied?
It's in the kernel-2.6.spec in those RPMs.

Dave
William Lee Irwin III
2004-05-24 02:39:52 UTC
Permalink
Post by Linus Torvalds
Quite frankly, a number of us are hoping that we can make them
unnecessary. The cost of the 4g/4g split is absolutely _huge_ on some
things, including basic stuff like kernel compiles.
The only valid reason for the 4g split is that the VM doesn't always
behave well with huge amounts of highmem. The anonvma stuff in 2.6.7-pre1
is hoped to make that much less of an issue.
Personally, if we never need to merge 4g for real, I'll be really really
happy. I see it as a huge ugly hack.
The performance can be improved by using a u area to store and map
things like vmas, kernel stacks, pagetables, file handles and
descriptor tables, and the like with supervisor privileges in the same
set of pagetables as the user context so that system calls may be
serviced without referencing the larger global kernel data area, which
would require the %cr3 reload. This does, however, seem at odds with
Linux' design in a number of respects, e.g. vmas etc. are on lists
containing elements belonging to different contexts. I suspect kernels
doing this would have to architect their page replacement algorithms
and truncate() semantics so as to avoid these out-of-context accesses
or otherwise suffer these operations being inefficient.

-- wli
Ingo Molnar
2004-05-24 08:25:22 UTC
Permalink
Post by Linus Torvalds
Quite frankly, a number of us are hoping that we can make them
unnecessary. The cost of the 4g/4g split is absolutely _huge_ on some
things, including basic stuff like kernel compiles.
The only valid reason for the 4g split is that the VM doesn't always
behave well with huge amounts of highmem. The anonvma stuff in
2.6.7-pre1 is hoped to make that much less of an issue.
Personally, if we never need to merge 4g for real, I'll be really
really happy. I see it as a huge ugly hack.
i agree with the hack part - but the performance aspect has been blown
out of proportion. 4:4 has the same cost on kernel compiles as highpte.
There are also real workloads where it actually helps performance.

also, the 4:4 overhead is really a hardware problem - and there are
x86-compatible CPUs (amd64) where the TLB flush problem has already been
solved: on amd64 the 4:4 feature has no noticeable overhead. So as long
as people opt for a 32-bit OS (even on a 64-bit CPU, for whatever weird
compatibility reason), 4:4 can be useful. For the other workloads i as
much hope as everyone else that people switch to a 64-bit OS on x86-64
ASAP!

also, while a quick transition to x86-64 will most likely happen, the
large installed base of big x86 boxes is a matter of fact too - and they
won't vanish into thin air. Also, there will always be specific user
workloads where lowmem grows to large values. Not to mention the fact
that 4:4 is a nice debugging/security tool as you cannot dereference
user pointers ;) - it has caught countless bugs already. Plus there are
specific workloads that want 4GB of userspace (and no, 3.5 GB won't do).

So 4:4 will have its niches where it will live on. We could argue on and
on how quickly 'x86 with more than 4GB of RAM' and 'x86 with 4GB of
userspace' will become a niche; hopefully it happens fast but we've got
to keep our options open.

Ingo
Andrea Arcangeli
2004-05-24 12:48:34 UTC
Permalink
Post by Ingo Molnar
on how quickly 'x86 with more than 4GB of RAM' and
s/4GB/32GB/

my usual x86 test box has 48G of ram (though to keep a huge margin of
safety we assume 32G is the very safe limit).
Rik van Riel
2004-05-25 19:15:14 UTC
Permalink
Post by Andrea Arcangeli
Post by Ingo Molnar
on how quickly 'x86 with more than 4GB of RAM' and
s/4GB/32GB/
my usual x86 test box has 48G of ram (though to keep an huge margin of
safety we assume 32G is the very safe limit).
Just how many 3GB-sized processes can you run on that
system, if each of them has a hundred or so VMAs?
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Andrea Arcangeli
2004-05-25 19:41:15 UTC
Permalink
Post by Rik van Riel
Post by Andrea Arcangeli
Post by Ingo Molnar
on how quickly 'x86 with more than 4GB of RAM' and
s/4GB/32GB/
my usual x86 test box has 48G of ram (though to keep an huge margin of
safety we assume 32G is the very safe limit).
Just how many 3GB sized processes can you run on that
system, if each of them have a hundred or so VMAs ?
with 400m of normal zone we can allocate 5000000 vmas, if each task uses
100 of them that's around 50000 processes. The 3G size doesn't matter.

I think you meant each task using some _dozens_of_thousands_ of vmas (not
hundreds); that's what actually happens with 32k large vmas spread over
2G of shared memory on some databases (assuming no merging; with merging
it goes down to a few thousand), but remap_file_pages has been
designed exactly to avoid allocating several thousand VMAs per task,
no? So it's just 1 vma per task (plus a few more vmas for the binary,
shared libs and anonymous memory).
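
As a rough illustration of the remap_file_pages() approach Andrea refers
to (the file name and sizes below are hypothetical), a database-style
process can keep one big shared mapping and rewire windows of it without
creating any additional VMAs:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t win = 4096, nwin = 1024;		/* hypothetical window size/count */
	int fd = open("/tmp/datafile", O_RDWR);	/* hypothetical, assumed large enough */
	char *base;

	if (fd < 0)
		return 1;
	base = mmap(NULL, win * nwin, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	/* point the first window at file page 123 without creating a
	 * second VMA (pgoff is in pages, prot must be 0) */
	if (remap_file_pages(base, win, 0, 123, 0) != 0)
		perror("remap_file_pages");

	munmap(base, win * nwin);
	close(fd);
	return 0;
}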

Clearly by opening enough files or enough network sockets or enough vmas
or similar, you can still run out of normal zone, even on a 2G system,
but this is not the point or you would be shipping 4:4 on the 2G systems
too, no? We're not trying to make it impossible to run out of zone
normal, even 4:4 can't make that impossible to happen on >4G boxes.
Rik van Riel
2004-05-25 19:50:23 UTC
Permalink
Post by Andrea Arcangeli
Clearly by opening enough files or enough network sockets or enough vmas
or similar, you can still run out of normal zone, even on a 2G system,
but this is not the point or you would be shipping 4:4 on the 2G systems
too, no?
The point is, people like to run bigger workloads on
bigger systems. Otherwise they wouldn't bother buying
those bigger systems.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel
2004-05-25 20:10:29 UTC
Permalink
Post by Rik van Riel
The point is, people like to run bigger workloads on
bigger systems. Otherwise they wouldn't bother buying
those bigger systems.
Btw, you're right about the VMAs. Looking through customer
stuff a bit more, the more common issues are low memory being
eaten by the dentry / inode cache - which you can't always reclaim
due to files being open, and don't always _want_ to reclaim
because that could well be a bigger performance hit than the
4:4 split.

The primary impact of the dentry / inode cache using memory
isn't lowmem exhaustion, btw. It's lowmem fragmentation.

Fragmentation causes fork trouble (gone with the 4k stacks)
and trouble for the network layer and kiobuf allocation,
which still do need higher order allocations.

Sure, the 4/4 kernel could also have problems with lowmem
fragmentation, but it just seems to be nowhere near as bad.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Andrew Morton
2004-05-25 21:16:22 UTC
Permalink
Post by Rik van Riel
Post by Rik van Riel
The point is, people like to run bigger workloads on
bigger systems. Otherwise they wouldn't bother buying
those bigger systems.
Btw, you're right about the VMAs. Looking through customer
stuff a bit more the more common issues are low memory being
eaten by dentry / inode cache - which you can't always reclaim
due to files being open, and don't always _want_ to reclaim
because that could well be a bigger performance hit than the
4:4 split.
I did some testing a year or two back with the normal zone wound down to a
few hundred megs - filesystem benchmarks were *severely* impacted by the
increased turnover rate of fs metadata pagecache and VFS caches. I forget
the details, but it was "wow".
Post by Rik van Riel
The primary impact of the dentry / inode cache using memory
isn't lowmem exhaustion, btw. It's lowmem fragmentation.
Fragmentation causes fork trouble (gone with the 4k stacks)
and trouble for the network layer and kiobuf allocation,
which still do need higher order allocations.
I'm suspecting we'll end up needing mempools (or something) of 1- and
2-order pages to support large-frame networking. I'm surprised there isn't
more pressure to do something about this. Maybe people are increasing
min_free_kbytes.
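
A minimal sketch of what such a reserve could look like with the mempool
API of the time (the pool size, chunk order and names are made up for
illustration; the callback prototypes took a plain int gfp_mask in
2.6.6-era kernels, gfp_t later):

#include <linux/mempool.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/errno.h>

#define FRAME_ORDER 2	/* order-2 = 16KB chunks on x86; purely illustrative */

static void *frame_pool_alloc(int gfp_mask, void *pool_data)
{
	return (void *)__get_free_pages(gfp_mask, FRAME_ORDER);
}

static void frame_pool_free(void *element, void *pool_data)
{
	free_pages((unsigned long)element, FRAME_ORDER);
}

static mempool_t *frame_pool;

static int __init frame_pool_init(void)
{
	/* keep at least 32 order-2 chunks in reserve for atomic allocations */
	frame_pool = mempool_create(32, frame_pool_alloc, frame_pool_free, NULL);
	return frame_pool ? 0 : -ENOMEM;
}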
Ingo Molnar
2004-05-25 21:48:17 UTC
Permalink
Post by Andrew Morton
Post by Rik van Riel
Btw, you're right about the VMAs. Looking through customer
stuff a bit more the more common issues are low memory being
eaten by dentry / inode cache - which you can't always reclaim
due to files being open, and don't always _want_ to reclaim
because that could well be a bigger performance hit than the
4:4 split.
I did some testing a year or two back with the normal zone wound down
to a few hundred megs - filesytem benchmarks were *severely* impacted
by the increased turnover rate of fs metadata pagecache and VFS
caches. I forget the details, but it was "wow".
and it's not only the normal workloads we know about - people really do
lots of weird stuff with Linux (and we are happy that they use Linux and
that Linux keeps chugging along), and people seem to prefer a 10%
slowdown to a box that locks up or -ENOMEM's. I'm not trying to insert
any unjustified fear, 3:1 can be ok with lots of RAM, but it's _clearly_
wishful thinking that 32 GB x86 will be OK with just 600 MB of lowmem.
600 MB of lowmem means a 1:80 lowmem to RAM ratio, which is insane. Yes,
it will be OK with select workloads and applications. With 4:4 it's 3.4
GB lowmem and the ratio is down to a much saner 1:9, and boxes pushed
against the wall keep up better.

(but i wont attempt to convince Andrea - and i'm not at all unhappy that
he is trying to fix 3:1 to be more usable on big boxes because those
efforts also decrease the RAM footprint of the UP kernel, which is
another sensitive area. These efforts also help sane 64-bit
architectures, so it's a win-win situation even considering our
disagreement wrt. 4:4.)
Post by Andrew Morton
I'm suspecting we'll end up needing mempools (or something) of 1- and
2-order pages to support large-frame networking. I'm surprised there
isn't more pressure to do something about this. Maybe people are
increasing min_free_kbytes.
hm, 1.5K pretty much seems to be the standard. Plus large frames can be
scatter-gathered via fragmented skbs. Seldom is there a need for a large
skb to be linear.

Ingo
David S. Miller
2004-05-25 22:09:16 UTC
Permalink
On Tue, 25 May 2004 23:48:17 +0200
Post by Ingo Molnar
hm, 1.5K pretty much seems to be the standard. Plus large frames can be
scatter-gathered via fragmented skbs. Seldom is there a need for a large
skb to be linear.
Unfortunately TSO with non-sendfile apps makes huge 64K SKBs get
built.
Ingo Molnar
2004-05-25 22:20:32 UTC
Permalink
Post by David S. Miller
Post by Ingo Molnar
hm, 1.5K pretty much seems to be the standard. Plus large frames can be
scatter-gathered via fragmented skbs. Seldom is there a need for a large
skb to be linear.
Unfortunately TSO with non-sendfile apps makes huge 64K SKBs get
built.
hm, shouldn't we disable TSO in this case - or is it a win even in this
case?

Ingo
David S. Miller
2004-05-25 23:10:15 UTC
Permalink
On Wed, 26 May 2004 00:20:32 +0200
Post by Ingo Molnar
Post by David S. Miller
Unfortunately TSO with non-sendfile apps makes huge 64K SKBs get
built.
hm, shouldnt we disable TSO in this case - or is it a win even in this
case?
It is a win even in this case.
Andrea Arcangeli
2004-05-25 21:15:22 UTC
Permalink
Post by Rik van Riel
Fragmentation causes fork trouble (gone with the 4k stacks)
btw, the 4k stacks sound not safe to me; most people have only tested with
8k stacks so far, and I wouldn't make that change in a production tree
without an unstable cycle of testing in between. I'd rather risk an
allocation failure than a stack memory corruption.

x86-64 has per-irq stacks that allowed the stack size to be reduced to 8k
(which is very similar to 4k for an x86, but without per-irq stacks it's
too risky).

as for the dcache size, I can certainly imagine you can measure slowdown
bigger than 4:4 in some very vfs intensive workload, but the stuff that
needs 32G of ram normally uses only rawio or a few huge files as storage
anyways. as for the kiobufs they're long gone in 2.6.

Possibly some webserving could use 32G as pure pagecache for the
webserver with tons of tiny files, for such an app the 2:2 or 1:3 model
should be better than 4:4.
Ingo Molnar
2004-05-26 10:33:03 UTC
Permalink
Post by Andrea Arcangeli
Post by Rik van Riel
Fragmentation causes fork trouble (gone with the 4k stacks)
btw, the 4k stacks sound not safe to me; most people have only tested with
8k stacks so far, and I wouldn't make that change in a production tree
without an unstable cycle of testing in between. I'd rather risk an
allocation failure than a stack memory corruption.
4k stacks are a cool and useful feature, and tons of effort went into
making them as safe as possible. Sure, we couldn't fix up bin-only
modules, but all the kernel drivers are audited for stack footprint, and
many months of beta testing have gone into this as well. Anyway, if you
prefer you can turn on 8k stacks - especially if your tree has lots of
not-yet-upstream driver patches.
Post by Andrea Arcangeli
x86-64 has per-irq stacks that allowed the stack size to be reduced to 8k
(which is very similar to 4k for an x86, but without per-irq stacks
it's too risky).
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
a clear and sane limit for each type of context, while the 2.4 kernel
has 6.5K for all 3 contexts combined. (Also, in 2.4 irq contexts pretty
much assumed that there's 2K of stack for them - leaving a de-facto 4K
stack for the process and softirq contexts.) So in fact there is more
space in 2.6 for all, and i dont really understand your fears.

Ingo
Jörn Engel
2004-05-26 12:50:14 UTC
Permalink
Post by Ingo Molnar
Post by Andrea Arcangeli
Post by Rik van Riel
Fragmentation causes fork trouble (gone with the 4k stacks)
btw, the 4k stacks sound not safe to me; most people have only tested with
8k stacks so far, and I wouldn't make that change in a production tree
without an unstable cycle of testing in between. I'd rather risk an
allocation failure than a stack memory corruption.
4k stacks are a cool and useful feature, and tons of effort went into
making them as safe as possible. Sure, we couldn't fix up bin-only
modules, but all the kernel drivers are audited for stack footprint, and
many months of beta testing have gone into this as well. Anyway, if you
prefer you can turn on 8k stacks - especially if your tree has lots of
not-yet-upstream driver patches.
Post by Andrea Arcangeli
x86-64 has per-irq stacks that allowed the stack size to be reduced to 8k
(which is very similar to 4k for an x86, but without per-irq stacks
it's too risky).
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
a clear and sane limit for each type of context, while the 2.4 kernel
has 6.5K for all 3 contexts combined. (Also, in 2.4 irq contexts pretty
much assumed that there's 2K of stack for them - leaving a de-facto 4K
stack for the process and softirq contexts.) So in fact there is more
space in 2.6 for all, and i dont really understand your fears.
Experience indicates that for whatever reason, big stack consumers for
all three contexts never hit at the same time. Big stack consumers
for one context happen too often, though. "Too often" may be quite
rare, but considering the result of a stack overflow, even "quite
rare" is too much. "Never" is the only acceptable target.

Change gcc to catch stack overflows before the fact and disallow
module load unless modules have those checks as well. If that is
done, a stack overflow will merely cause a kernel panic. Until then,
I am just as conservative as Andrea.

Jörn

--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
Arjan van de Ven
2004-05-26 12:53:00 UTC
Permalink
Post by Jörn Engel
Experience indicates that for whatever reason, big stack consumers for
all three contexts never hit at the same time. Big stack consumers
for one context happen too often, though. "Too often" may be quite
rare, but considering the result of a stack overflow, even "quite
rare" is too much. "Never" is the only acceptable target.
Actually it's not never in 2.4. It does get hit here and there by our
customers once in a while. Esp. with several NICs hitting an irq on the
same CPU (eg the irq context goes over its 2KB limit)
Post by Jörn Engel
done, a stack overflow will merely cause a kernel panic. Until then,
I am just as conservative as Andreas.
actually the 4k stacks approach gives MORE breathing room for the problem
cases that are getting hit by our customers...
Jörn Engel
2004-05-26 13:00:47 UTC
Permalink
Post by Arjan van de Ven
Post by Jörn Engel
Experience indicates that for whatever reason, big stack consumers for
all three contexts never hit at the same time. Big stack consumers
for one context happen too often, though. "Too often" may be quite
rare, but considering the result of a stack overflow, even "quite
rare" is too much. "Never" is the only acceptable target.
Actually it's not never in 2.4. It does get hit here and there by our
customers once in a while. Esp. with several NICs hitting an irq on the
same CPU (eg the irq context goes over its 2KB limit)
Post by Jörn Engel
done, a stack overflow will merely cause a kernel panic. Until then,
I am just as conservative as Andrea.
actually the 4k stacks approach gives MORE breathing room for the problem
cases that are getting hit by our customers...
For the cases you described, yes. For some others like nvidia, no.
Not sure if we want to make things worse for some users in order to
improve things for others (better paying ones?). I want the separate
interrupt stacks, sure. I'm just not comfy with 4k per process yet.

But I'll shut up now and see if I can generate better data over the
weekend. -test11 still had fun stuff like 3k stack consumption over
some code paths in a pretty minimal kernel. Wonder what 2.6.6 will do
with allyesconfig. ;)

Jörn

--
He who knows that enough is enough will always have enough.
-- Lao Tsu
Arjan van de Ven
2004-05-26 13:05:00 UTC
Permalink
Post by Arjan van de Ven
Post by Jörn Engel
Experience indicates that for whatever reason, big stack consumers for
all three contexts never hit at the same time. Big stack consumers
for one context happen too often, though. "Too often" may be quite
rare, but considering the result of a stack overflow, even "quite
rare" is too much. "Never" is the only acceptable target.
actually the 4k stacks approach gives MORE breathing room for the problem
cases that are getting hit by our customers...
Post by Jörn Engel
For the cases you described, yes. For some others like nvidia, no.
Not sure if we want to make things worse for some users in order to
improve things for others (better paying ones?). I want the separate
interrupt stacks, sure. I'm just not comfy with 4k per process yet.
You used the word "Never" and now you go away from it.... It wasn't Never,
and it will never be never if you want to include random binary only
modules. However in 2.4 for all intents and purposes there was 4KB already,
and now there still is, for user context. Because those interrupts DO
happen. NVidia was a walking timebomb, and with one function using 4KB
that's an obvious Needs-Fix case. The kernel had a few of those in rare
drivers, most of which have been fixed by now. It'll never be never, but it
never was never either.
Jörn Engel
2004-05-26 16:41:29 UTC
Permalink
Post by Arjan van de Ven
You used the word "Never" and now you go away from it.... It wasn't Never,
and it will never be never if you want to include random binary only
modules. However in 2.4 for all intents and purposes there was 4KB already,
and now there still is, for user context. Because those interrupts DO
happen. NVidia was a walking timebomb, and with one function using 4KB
that's an obvious Needs-Fix case. The kernel had a few of those in rare
drivers, most of which have been fixed by now. It'll never be never, but it
never was never either.
In a way, you are right. nVidia was and is a walking timebomb and
making bugs more likely to happen is a good thing in general. Except
that this bug can eat filesystems, so making it more likely will cause
more filesystems to be eaten.

Anyway, whether we go for 4k in 2.6 or not, we should do our best to
fix bad code and I will go looking for some more so others can go and
fix some more. There's still enough horror in mainline for more than
one amusement park, we just haven't found it yet.

Jörn

--
All art is but imitation of nature.
-- Lucius Annaeus Seneca
Ingo Molnar
2004-05-27 12:45:51 UTC
Permalink
Anyway, whether we go for 4k in 2.6 or not, [...]
4K stacks have been added to the 2.6 kernel more than a month ago, are
in the official 2.6.6 kernel and are used by FC2 happily, so objections
are a bit belated. I only reacted to Andrea's mail to clear up apparent
misunderstandings about the impact and implementation of this feature.

Ingo
Andrea Arcangeli
2004-05-27 13:59:30 UTC
Permalink
Post by Ingo Molnar
are a bit belated. I only reacted to Andrea's mail to clear up apparent
misunderstandings about the impact and implementation of this feature.
note that there is something relevant to improve in the implementation:
the per-cpu irq stack size should be bigger than 4k; we use 16k
on x86-64, and on x86 it should be 8k. Currently you're decreasing _both_
the normal kernel context stack and even the irq stack in some conditions.
There's no good reason to decrease the irq stack too; that's cheap, it's
per-cpu.
Arjan van de Ven
2004-05-27 14:03:22 UTC
Permalink
Post by Andrea Arcangeli
Post by Ingo Molnar
are a bit belated. I only reacted to Andrea's mail to clear up apparent
misunderstandings about the impact and implementation of this feature.
note that there is something relevant to improve in the implementation,
that is the per-cpu irq stack size should be bigger than 4k, we use 16k
on x86-64, on x86 it should be 8k. Currently you're decreasing _both_
the normal kernel context and even the irq stack in some condition.
There's no good reason to decrease the irq stack too, that's cheap, it's
per-cpu.
In theory you are absolutely right; the problem is the current macro... it's
SO much easier to have one stack size everywhere (and cheaper too) for
this. And it hasn't been a problem so far, esp. since the softirqs have
their own stack, and irq handlers all seem to be really light on the stack
already since they punt all the heavy lifting to tasklets etc.
Tasklets don't recurse stack-wise, and have their own stack, so that ought
to be fine.

On x86_64 you have the PDA for current so that's not a problem, and you can
do the bigger stacks easily but for x86 you don't...
Andrea Arcangeli
2004-05-27 14:42:37 UTC
Permalink
Post by Arjan van de Ven
In theory you are absolutely right, problem is the current macro..... it's
SO much easier to have one stacksize everywhere (and cheaper too) for
this... (and it hasn't been a problem so far, esp since the softirq's have
I see the problem, but then why don't we wait to implement it right, to
allow 8k irq-stacks before merging into mainline?

grep for "~s 4k" (i.e. the word "4[kK]" in the subject) on l-k and
you'll see there's more than just nvidia. one user reported not being
able to boot at all with 4k stacks since 2.6.6 doesn't have a stack
overflow in the oops, so I hope he tested w/ and w/o 4KSTACKS option
enabled to be able to claim what broke his machine is the 4KSTACKS
option. (his oops doesn't reveal a stack overflow, the thread_info is at
0xf000 and the esp is at 0xffxx)

Making it a config option is a sort of proof that you agree it can
break something, or you wouldn't make it a config option in the first
place. What's the point of making it a configuration option if it cannot
break anything and if it's not risky? Making it a config option is not
good, because then some developer may develop leaving 4KSTACKS disabled,
and then his kernel code might break on the users with 4KSTACKS enabled
(it's not much different from PREEMPT). Admittedly, code that overflows
4k is likely to be not legitimate, but not all code is good (the most
common error is to allocate big structures on the stack as local
variables), and if developers are able to notice the overflow in their
own testing it's better.

Clearly it's more relaxing to merge something knowing that with a config
option you can choose whether to use 4k or 8k stacks, but I'm not sure if
it's the right thing to do for the long term. If we go 4k stacks, then
I'd prefer that you drop the 4KSTACKS option and force people to reduce
the stack usage in their code, and secondly that we fixup the irqstack
to be 8k.

Plus the allocation errors you had could be just 2.6 VM issues with
order > 0 allocations, we never had issues with 8k stacks in 2.4, so
using the 4k stacks may just hide the real problem. archs like x86-64
have to use order > 0 allocations for kernel stack, no way around it, so
order > 0 must work reliably regardless of whatever code we change in
x86.
Post by Arjan van de Ven
On x86_64 you have the PDA for current so that's not a problem, and
you can do the bigger stacks easily but for x86 you don't...
yep.
Bill Davidsen
2004-06-02 19:40:37 UTC
Permalink
Post by Andrea Arcangeli
Post by Arjan van de Ven
In theory you are absolutely right, problem is the current macro..... it's
SO much easier to have one stacksize everywhere (and cheaper too) for
this... (and it hasn't been a problem so far, esp since the softirq's have
I see the problem, but then why don't we wait to implement it right, to
allow 8k irq-stacks before merging into mainline?
grep for "~s 4k" (i.e. the word "4[kK]" in the subject) on l-k and
you'll see there's more than just nvidia. one user reported not being
able to boot at all with 4k stacks since 2.6.6 doesn't have a stack
overflow in the oops, so I hope he tested w/ and w/o 4KSTACKS option
enabled to be able to claim what broke his machine is the 4KSTACKS
option. (his oops doesn't reveal a stack overflow, the thread_info is at
0xf000 and the esp is at 0xffxx)
Making it a config option, is a sort of proof that you agree it can
break something, or you wouldn't make it a config option in the first
place. What's the point of making it a configuration option if it cannot
break anything and if it's not risky? Making it a config option is not
good, because then some developer may develop leaving 4KSTACKS disabled,
and then his kernel code might break on the users with 4KSTACKS enabled
(it's not much different from PREEMPT). Amittedly code that overflows
4k is likely to be not legitimate but not all code is good (the most
common error is to allocate big strutures locally on the stack with
local vars), and if developers are able to notice the overflow on their
own testing it's better.
We have lots of options which may cause problems but are useful for
special situations; why is this one any different? The only actual
benefit I've seen quoted for 4k stacks is that it improves fork
performance if memory is so fragmented that there is no 8k block left.
And my first thought on hearing that was that if that's common, the VM
should be investigated. This is a stable kernel, and breaking even such an
abomination as a binary-only driver for the sake of whoever has this
vastly fragmented memory seems to be the antithesis of stable.
--
-bill davidsen (***@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
Brian Gerst
2004-05-27 14:18:40 UTC
Permalink
Post by Andrea Arcangeli
Post by Ingo Molnar
are a bit belated. I only reacted to Andrea's mail to clear up apparent
misunderstandings about the impact and implementation of this feature.
note that there is something relevant to improve in the implementation,
that is the per-cpu irq stack size should be bigger than 4k, we use 16k
on x86-64, on x86 it should be 8k. Currently you're decreasing _both_
the normal kernel context and even the irq stack in some condition.
There's no good reason to decrease the irq stack too, that's cheap, it's
per-cpu.
The problem on i386 (unlike x86-64) is that the thread_info struct sits
at the bottom of the stack and is referenced by masking bits off %esp.
So the stack size must be constant whether in process context or IRQ
context.
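
The masking trick Brian describes looks roughly like this in the 2.6-era
i386 headers (a from-memory sketch, not a verbatim copy):

/* the stack is a THREAD_SIZE-aligned block with the thread_info at its
 * bottom, so masking the low bits off %esp yields its address */
#define THREAD_SIZE 8192	/* 4096 with CONFIG_4KSTACKS */

static inline struct thread_info *current_thread_info(void)
{
	struct thread_info *ti;
	__asm__("andl %%esp,%0" : "=r" (ti) : "0" (~(THREAD_SIZE - 1)));
	return ti;
}

With CONFIG_4KSTACKS the same mask is simply computed with THREAD_SIZE =
4096, which is why the stack size has to be uniform across contexts.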

--
Brian Gerst
Andrea Arcangeli
2004-05-27 14:50:33 UTC
Permalink
Post by Brian Gerst
The problem on i386 (unlike x86-64) is that the thread_info struct sits
at the bottom of the stack and is referenced by masking bits off %esp.
So the stack size must be constant whether in process context or IRQ
context.
so what, that's a minor implementation detail, pda is a software thing.
Linus Torvalds
2004-05-27 14:55:36 UTC
Permalink
Post by Andrea Arcangeli
Post by Brian Gerst
The problem on i386 (unlike x86-64) is that the thread_info struct sits
at the bottom of the stack and is referenced by masking bits off %esp.
So the stack size must be constant whether in process context or IRQ
context.
so what, that's a minor implementation detail, pda is a software thing.
"minor implementation detail"?

You need to get to the thread info _some_ way, and you need to get to it
_fast_. There are really no sane alternatives. I certainly do not want to
play games with segments.

Linus
Andrea Arcangeli
2004-05-27 15:39:08 UTC
Permalink
Post by Linus Torvalds
Post by Andrea Arcangeli
Post by Brian Gerst
The problem on i386 (unlike x86-64) is that the thread_info struct sits
at the bottom of the stack and is referenced by masking bits off %esp.
So the stack size must be constant whether in process context or IRQ
context.
so what, that's a minor implementation detail, pda is a software thing.
"minor implementation detail"?
You need to get to the thread info _some_ way, and you need to get to it
_fast_. There are really no sane alternatives. I certainly do not want to
play games with segments.
If the page is "even" the thread_info is at the top of the stack. If the
page is "odd" the thread_info is at the bottom of the stack (or the
other way around depending what you mean with "odd" and "even").

the per-cpu irq stack will have the thread_info at both the top and the
bottom of the 8k naturally aligned order1 compound page. The regular
kernel stack will have it at the top or the bottom depending if it's odd
or even.

this should allow 8k irqstack and bh stack fine at in-cpu-core speed w/o
segments or similar.

The only downside is that itadds a branch to current_thread_info that
will have to check the 12th bitflag in the esp before doing andl, the
second downside is having to update two thread_info during irq, instead
of just one.

It would be probably better if the thread_info was just a pointer to a
"pda" instead of being the PDA itself so there are just two writes into
the kernel stack for every irq. In x86-64 this is much more natural
since the pda-pointer is in the cpu 64bit %gs register and that saves a
branch and defereference on the stack for every "current" invocation,
and two writes for every first-irq or first-bh.
Guy Sotomayor
2004-05-27 18:31:50 UTC
Permalink
Post by Linus Torvalds
"minor implementation detail"?
You need to get to the thread info _some_ way, and you need to get to it
_fast_. There are really no sane alternatives. I certainly do not want to
play games with segments.
While segments on x86 are in general to be avoided (aka the 286
segmented memory models) they can be useful for some things in the
kernel.

Here's a couple of examples:
 * dereference gs:0 to get the thread info. The first element in
   the structure is its linear address (ie usable for being deref'd
   off of DS); see the sketch after this list.
 * use SS to enforce the stack limit. This way you'd absolutely
   get an exception when there was a stack overflow (underflow).
   SS gets reloaded on entry into the kernel and on interrupts
   anyway so there really shouldn't be a performance impact. I
   haven't looked at all the (potential) gcc implications here so
   this one may not be completely doable.
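
A sketch of what the first suggestion (the gs:0 dereference) could look
like, assuming the kernel set up such a segment per thread - illustrative
only, not existing kernel code:

/* if %gs pointed at a per-thread segment whose first word holds the
 * thread_info's linear address, the lookup becomes a single load */
static inline struct thread_info *current_thread_info(void)
{
	struct thread_info *ti;
	__asm__("movl %%gs:0,%0" : "=r" (ti));
	return ti;
}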
--
TTFN - Guy
Brian Gerst
2004-05-27 19:26:35 UTC
Permalink
Post by Guy Sotomayor
Post by Linus Torvalds
"minor implementation detail"?
You need to get to the thread info _some_ way, and you need to get to it
_fast_. There are really no sane alternatives. I certainly do not want to
play games with segments.
While segments on x86 are in general to be avoided (aka the 286
segmented memory models) they can be useful for some things in the
kernel.
* dereference gs:0 to get the thread info. The first element in
the structure is its linear address (ie usable for being deref'd
off of DS).
The only problem with using %gs as a base register is that reloading it
on every entry and exit is rather expensive (GDT access and privilege
checks) compared to masking bits off %esp. x86-64 can get away with it
because it has the swapgs instruction which makes it efficient to use.
Post by Guy Sotomayor
* use SS to enforce the stack limit. This way you'd absolutely
get an exception when there was a stack overflow (underflow).
SS gets reloaded on entry into the kernel and on interrupts
anyway so there really shouldn't be a performance impact. I
haven't looked at all the (potential) gcc implications here so
this one may not be completely doable.
Not possible. GCC completely assumes that we are working with a single
flat address space. It has no concept of segmentation at all.

--
Brian Gerst
Jörn Engel
2004-06-01 05:56:16 UTC
Permalink
Post by Jörn Engel
Anyway, whether we go for 4k in 2.6 or not, we should do our best to
fix bad code and I will go looking for some more so others can go and
fix some more. There's still enough horror in mainline for more than
one amusement park, we just haven't found it yet.
Here's some. My tool is still buggy, so if any results don't make
sense, please tell me. Full results are a bit verbose (2M), but
compress quite well, so I have them attached. For the lazy, here are
a few interesting things. First the recursions that shouldn't iterate
too often:

WARNING: recursion detected:
0 default_wake_function
36 try_to_wake_up
0 task_rq_unlock
0 preempt_schedule
68 schedule
52 load_balance
0 find_busiest_queue
0 double_lock_balance
0 __preempt_spin_lock
0 _raw_spin_lock
0 printk
0 printk
16 release_console_sem
16 __wake_up
0 __wait_queue->func
WARNING: recursion detected:
12 kfree
12 cache_flusharray
20 free_block
12 slab_destroy
0 kernel_map_pages
20 change_page_attr
12 __change_page_attr
16 split_large_page
0 alloc_pages_node
24 __alloc_pages
284 try_to_free_pages
0 backing_dev_info->congested_fn
0 dm_any_congested
0 dm_table_put
0 table_destroy
0 vfree
0 __vunmap
WARNING: recursion detected:
0 kernel_map_pages
20 change_page_attr
12 __change_page_attr
16 split_large_page
0 alloc_pages_node
24 __alloc_pages
284 try_to_free_pages
0 backing_dev_info->congested_fn
0 dm_any_congested
0 dm_table_put
0 table_destroy
0 vfree
0 __vunmap
0 __free_pages
0 free_hot_page
0 free_hot_cold_page
WARNING: recursion detected:
0 dm_table_any_congested
0 backing_dev_info->congested_fn
0 dm_any_congested
WARNING: recursion detected:
12 kmem_cache_free
12 cache_flusharray
20 free_block
12 slab_destroy
WARNING: recursion detected:
68 schedule
0 finish_task_switch
0 __put_task_struct
0 free_task
12 kfree
12 cache_flusharray
20 free_block
12 slab_destroy
0 kernel_map_pages
20 change_page_attr
12 __change_page_attr
16 split_large_page
0 alloc_pages_node
24 __alloc_pages
284 try_to_free_pages
12 shrink_caches
12 shrink_zone
124 shrink_cache
176 shrink_list
0 handle_write_error
0 lock_page
72 __lock_page
0 io_schedule
WARNING: recursion detected:
0 kmem_cache_alloc
16 cache_alloc_refill
36 cache_grow
0 alloc_slabmgmt
WARNING: recursion detected:
24 __alloc_pages
284 try_to_free_pages
12 shrink_caches
12 shrink_zone
124 shrink_cache
176 shrink_list
0 add_to_swap
0 __add_to_swap_cache
16 radix_tree_insert
0 radix_tree_node_alloc
0 kmem_cache_alloc
0 kmem_cache_alloc
16 cache_alloc_refill
36 cache_grow
0 kmem_getpages
0 __get_free_pages
0 alloc_pages_node
WARNING: recursion detected:
28 qla2x00_handle_port_rscn
28 qla2x00_send_login_iocb
0 qla2x00_issue_marker
28 qla2x00_marker
0 __qla2x00_marker
24 qla2x00_req_pkt
0 qla2x00_poll
28 qla2x00_intr_handler
100 qla2x00_async_event
WARNING: recursion detected:
0 qla2x00_process_iodesc
28 qla2x00_handle_port_rscn
28 qla2x00_send_login_iocb
0 qla2x00_issue_marker
28 qla2x00_marker
0 __qla2x00_marker
24 qla2x00_req_pkt
0 qla2x00_poll
28 qla2x00_intr_handler
100 qla2x00_async_event
0 qla2x00_process_response_queue
WARNING: recursion detected:
28 qla2x00_marker
0 __qla2x00_marker
24 qla2x00_req_pkt
0 qla2x00_poll
28 qla2x00_intr_handler
88 qla2x00_next
32 qla2x00_start_scsi
WARNING: recursion detected:
92 qla2x00_mailbox_command
40 qla2x00_abort_isp
24 qla2x00_restart_isp
24 qla2x00_setup_chip
96 qla2x00_verify_checksum
WARNING: recursion detected:
96 qla2x00_issue_iocb
92 qla2x00_mailbox_command
40 qla2x00_abort_isp
24 qla2x00_restart_isp
0 qla2x00_configure_loop
80 qla2x00_configure_fabric
0 qla2x00_rff_id
WARNING: recursion detected:
72 qla2x00_rsnn_nn
96 qla2x00_issue_iocb
92 qla2x00_mailbox_command
40 qla2x00_abort_isp
24 qla2x00_restart_isp
0 qla2x00_configure_loop
80 qla2x00_configure_fabric
WARNING: recursion detected:
16 acpi_ut_remove_reference
24 acpi_ut_update_object_reference
16 acpi_ut_update_ref_count
16 acpi_ut_delete_internal_obj
WARNING: recursion detected:
32 acpi_ex_field_datum_io
76 acpi_ex_insert_into_field
52 acpi_ex_write_with_update_rule
WARNING: recursion detected:
72 acpi_ex_extract_from_field
32 acpi_ex_field_datum_io
WARNING: recursion detected:
32 acpi_ex_read_data_from_field
72 acpi_ex_extract_from_field
32 acpi_ex_field_datum_io
32 acpi_ex_access_region
20 acpi_ex_setup_region
16 acpi_ds_get_region_arguments
28 acpi_ds_execute_arguments
24 acpi_ps_parse_aml
36 acpi_ps_parse_loop
0 acpi_walk_state->ascending_callback
24 acpi_ds_exec_end_op
40 acpi_ex_resolve_operands
20 acpi_ex_resolve_to_value
28 acpi_ex_resolve_object_to_value
WARNING: recursion detected:
28 acpi_ds_execute_arguments
24 acpi_ps_parse_aml
36 acpi_ps_parse_loop
0 acpi_walk_state->ascending_callback
24 acpi_ds_exec_end_op
40 acpi_ex_resolve_operands
20 acpi_ex_resolve_to_value
28 acpi_ex_resolve_object_to_value
16 acpi_ds_get_package_arguments
WARNING: recursion detected:
32 acpi_ex_resolve_node_to_value
32 acpi_ex_read_data_from_field
72 acpi_ex_extract_from_field
32 acpi_ex_field_datum_io
32 acpi_ex_access_region
20 acpi_ex_setup_region
16 acpi_ds_get_region_arguments
28 acpi_ds_execute_arguments
24 acpi_ps_parse_aml
36 acpi_ps_parse_loop
0 acpi_walk_state->ascending_callback
24 acpi_ds_exec_end_op
40 acpi_ex_resolve_operands
20 acpi_ex_resolve_to_value
WARNING: recursion detected:
28 acpi_ns_evaluate_by_handle
20 acpi_ns_get_object_value
32 acpi_ex_resolve_node_to_value
32 acpi_ex_read_data_from_field
72 acpi_ex_extract_from_field
32 acpi_ex_field_datum_io
32 acpi_ex_access_region
20 acpi_ex_setup_region
16 acpi_ds_get_region_arguments
28 acpi_ds_execute_arguments
24 acpi_ps_parse_aml
36 acpi_ps_parse_loop
0 acpi_walk_state->ascending_callback
24 acpi_ds_exec_end_op
32 acpi_ds_load2_end_op
20 acpi_ex_create_table_region
28 acpi_ev_initialize_region
36 acpi_ev_execute_reg_method
WARNING: recursion detected:
24 acpi_ps_parse_aml
36 acpi_ds_call_control_method

There are more, but this shows a few ugly spots. It also shows bugs
in my tool, I'll have to look into those. Next month.

Now some of the top stack killers:

stackframes for call path too long (3136):
size function
0 radeonfb_pci_resume
2576 radeonfb_set_par
0 preempt_schedule
68 schedule
0 __put_task_struct
0 audit_free
0 audit_log_end
12 audit_log_end_fast
12 netlink_unicast
76 netlink_attachskb
0 __kfree_skb
0 ip_conntrack_put
0 ip_conntrack_put
12 kfree
0 kernel_map_pages
20 change_page_attr
24 __alloc_pages
284 try_to_free_pages
0 out_of_memory
0 mmput
0 exit_aio
52 aio_cancel_all
0 list_kiocb
stackframes for call path too long (3056):
size function
720 ncp_ioctl
616 ncp_conn_logged_in
24 ncp_lookup_volume
0 ncp_request2
164 sock_sendmsg
0 wait_on_sync_kiocb
68 schedule
0 __put_task_struct
0 audit_free
0 audit_log_end
12 audit_log_end_fast
12 netlink_unicast
76 netlink_attachskb
0 __kfree_skb
0 ip_conntrack_put
0 ip_conntrack_put
12 kfree
0 kernel_map_pages
20 change_page_attr
24 __alloc_pages
284 try_to_free_pages
0 out_of_memory
0 mmput
0 exit_aio
0 __put_ioctx
16 do_munmap
0 fput
0 __fput
0 locks_remove_flock
0 panic
0 sys_sync
0 sync_inodes
308 sync_inodes_sb
0 do_writepages
128 mpage_writepages
4 write_boundary_block
0 ll_rw_block
28 submit_bh
0 bio_alloc
88 mempool_alloc
256 wakeup_bdflush
20 pdflush_operation
0 printk
16 release_console_sem
16 __wake_up
0 printk
0 vscnprintf
32 vsnprintf
112 number
stackframes for call path too long (3024):
size function
0 acpi_device_ops->bind
292 acpi_pci_bind
292 acpi_pci_irq_add_prt
20 acpi_get_irq_routing_table
20 acpi_rs_get_prt_method_data
24 acpi_ut_evaluate_object
32 acpi_ns_evaluate_relative
28 acpi_ns_evaluate_by_handle
20 acpi_ns_get_object_value
32 acpi_ex_resolve_node_to_value
32 acpi_ex_read_data_from_field
72 acpi_ex_extract_from_field
32 acpi_ex_field_datum_io
32 acpi_ex_access_region
20 acpi_ex_setup_region
16 acpi_ds_get_region_arguments
28 acpi_ds_execute_arguments
24 acpi_ps_parse_aml
36 acpi_ds_call_control_method
24 acpi_ps_parse_aml
36 acpi_ps_parse_loop
24 acpi_ds_exec_end_op
32 acpi_ds_load2_end_op
20 acpi_ex_create_table_region
24 acpi_tb_find_table
224 acpi_get_firmware_table
68 acpi_tb_get_table
24 acpi_tb_get_table_body
36 acpi_tb_table_override
36 acpi_tb_get_this_table
0 acpi_os_map_memory
0 __ioremap
0 __pmd_alloc
0 preempt_schedule
68 schedule
0 __put_task_struct
0 audit_free
0 audit_log_end
12 audit_log_end_fast
12 netlink_unicast
76 netlink_attachskb
0 __kfree_skb
0 ip_conntrack_put
0 ip_conntrack_put
12 kfree
0 kernel_map_pages
20 change_page_attr
24 __alloc_pages
284 try_to_free_pages
0 out_of_memory
0 mmput
0 exit_aio
0 __put_ioctx
16 do_munmap
0 fput
0 __fput
0 locks_remove_flock
0 panic
0 sys_sync
0 sync_inodes
308 sync_inodes_sb
0 do_writepages
128 mpage_writepages
4 write_boundary_block
0 ll_rw_block
28 submit_bh
0 bio_alloc
88 mempool_alloc
256 wakeup_bdflush
20 pdflush_operation
0 printk
0 preempt_schedule
68 schedule
stackframes for call path too long (3104):
size function
0 client_reg_t->event_handler
1168 ide_config
12 ide_register_hw
1596 ide_unregister
0 device_unregister
0 device_del
0 kobject_del
0 kobject_hotplug
132 call_usermodehelper
80 wait_for_completion
68 schedule
0 __put_task_struct
0 audit_free
16 audit_filter_syscall
32 audit_filter_rules
stackframes for call path too long (3144):
size function
148 generic_ide_ioctl
12 ide_register_hw
1596 ide_unregister
0 device_unregister
0 device_del
0 kobject_del
0 kobject_hotplug
132 call_usermodehelper
80 wait_for_completion
68 schedule
0 __put_task_struct
0 audit_free
0 audit_log_end
12 audit_log_end_fast
12 netlink_unicast
76 netlink_attachskb
0 __kfree_skb
0 ip_conntrack_put
0 ip_conntrack_put
12 kfree
0 kernel_map_pages
20 change_page_attr
24 __alloc_pages
284 try_to_free_pages
0 out_of_memory
0 mmput
0 exit_aio
0 __put_ioctx
16 do_munmap
0 fput
0 __fput
0 locks_remove_flock
0 panic
0 sys_sync
0 sync_inodes
308 sync_inodes_sb
0 do_writepages
128 mpage_writepages
204 mpage_writepage
12 mpage_alloc
0 bio_alloc
stackframes for call path too long (3004):
size function
0 ____FAKE.Name.Chip.stat.Regi.LILP.Opti.high.lowe->ProcessIMQEntry
2076 CpqTsProcessIMQEntry
12 cpqfcTSCompleteExchange
12 kfree
0 kernel_map_pages
20 change_page_attr
24 __alloc_pages
284 try_to_free_pages
0 out_of_memory
0 mmput
0 exit_aio
0 __put_ioctx
16 do_munmap
0 fput
0 __fput
0 locks_remove_flock
0 panic
0 sys_sync
0 sync_inodes
308 sync_inodes_sb
0 do_writepages
128 mpage_writepages
4 write_boundary_block
0 ll_rw_block
28 submit_bh
36 submit_bio
56 generic_make_request
0 bdev_get_queue

Not too many above 3k and none above 4k. Actually, intermezzo had
quite a few paths that went above 4k, but that one's gone.

So effectively, it comes down to the recursive paths. Unless someone
comes up with a semantical parser that can figure out the maximum
number of iterations, we have to look at them manually.

Jörn
--
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
Jörn Engel
2004-06-01 06:02:05 UTC
Permalink
Post by Jörn Engel
=20
So effectively, it comes down to the recursive paths. Unless someone
comes up with a semantical parser that can figure out the maximum
number of iterations, we have to look at them manually.
Linus, Andrew, would you accept patches like the one below? With such
information and assuming that the comments will get maintained, it's
relatively simple to unroll recursions and measure stack consumption
more accurately.

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c

Add recursion markers to teach automated test tools how bad documented
recursions really are. Currently, there is only a single such tool that
can use the information and there is always the danger of documentation
and reality getting out of sync. But until there's a better tool...

Currently, this patch also has a few cleanup bits included. The cleanups
were helpful to figure out the depth of some recursions and could be
useful on their own. If not, they are easily removed.
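(A hypothetical illustration, not part of the patch: once a cycle is annotated this way, a checker that already knows per-function frame sizes could bound an annotated recursion instead of giving up on the path. The helper name and the accounting below are invented for the example.)

/*
 * Sketch only: for an annotated cycle such as do_insert_tree calling
 * itself with "RECURSION: 4", charge the frames that form the cycle
 * four times, then add the deepest non-recursive tail of the path.
 */
static unsigned long unrolled_stack_cost(unsigned long cycle_frame_bytes,
					 unsigned int recursion_count,
					 unsigned long tail_path_bytes)
{
	return cycle_frame_bytes * recursion_count + tail_path_bytes;
}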

arch/i386/kernel/apm.c | 4 ++
drivers/char/random.c | 6 ++++
drivers/ide/ide-tape.c | 33 +++++++++++++-----------
drivers/ide/ide-timing.h | 60 ++++++++++++++++++--------------------------
drivers/isdn/i4l/isdn_tty.c | 5 +++
drivers/isdn/icn/icn.c | 5 +++
fs/block_dev.c | 5 +++
fs/quota_v2.c | 43 +++++++++++++++++++------------
kernel/signal.c | 7 +++++
kernel/sysctl.c | 10 +++++++
10 files changed, 113 insertions(+), 65 deletions(-)


--- linux-2.6.6stack/arch/i386/kernel/apm.c~recursion 2004-05-10 18:10:06.000000000 +0200
+++ linux-2.6.6stack/arch/i386/kernel/apm.c 2004-05-30 18:24:54.000000000 +0200
@@ -1058,6 +1058,10 @@
* monitor powerdown for us.
*/

+/**
+ * RECURSION: 2
+ * STEP: apm_console_blank
+ */
static int apm_console_blank(int blank)
{
int error;
--- linux-2.6.6stack/kernel/sysctl.c~recursion 2004-05-30 17:51:03.000000000 +0200
+++ linux-2.6.6stack/kernel/sysctl.c 2004-05-30 17:52:25.000000000 +0200
@@ -1188,6 +1188,11 @@
#ifdef CONFIG_PROC_FS

/* Scan the sysctl entries in table and add them all into /proc */
+
+/**
+ * RECURSION: 100
+ * STEP: register_proc_table
+ */
static void register_proc_table(ctl_table * table, struct proc_dir_entry *root)
{
struct proc_dir_entry *de;
@@ -1237,6 +1242,11 @@
/*
* Unregister a /proc sysctl table and any subdirectories.
*/
+
+/**
+ * RECURSION: 100
+ * STEP: unregister_proc_table
+ */
static void unregister_proc_table(ctl_table * table, struct proc_dir_entry *root)
{
struct proc_dir_entry *de;
--- linux-2.6.6stack/kernel/signal.c~recursion 2004-05-10 18:10:38.000000000 +0200
+++ linux-2.6.6stack/kernel/signal.c 2004-05-30 18:24:38.000000000 +0200
@@ -626,6 +626,13 @@
* actual continuing for SIGCONT, but not the actual stopping for stop
* signals. The process stop is done as a signal action for SIG_DFL.
*/
+
+/**
+ * RECURSION: 2
+ * STEP: handle_stop_signal
+ * STEP: do_notify_parent_cldstop
+ * STEP: __group_send_sig_info
+ */
static void handle_stop_signal(int sig, struct task_struct *p)
{
struct task_struct *t;
--- linux-2.6.6stack/fs/block_dev.c~recursion 2004-05-10 18:10:30.000000000 +0200
+++ linux-2.6.6stack/fs/block_dev.c 2004-05-31 17:20:53.000000000 +0200
@@ -547,6 +547,11 @@
}
EXPORT_SYMBOL(bd_set_size);

+/**
+ * RECURSION: 2
+ * STEP: do_open
+ * STEP: blkdev_get
+ */
static int do_open(struct block_device *bdev, struct file *file)
{
struct module *owner = NULL;
--- linux-2.6.6stack/fs/quota_v2.c~recursion 2004-05-10 18:10:32.000000000 +0200
+++ linux-2.6.6stack/fs/quota_v2.c 2004-05-30 18:36:23.000000000 +0200
@@ -352,6 +352,12 @@
}

/* Insert reference to structure into the trie */
+
+/**
+ * Recursion count equals V2_DQTREEDEPTH, keep both in sync
+ * RECURSION: 4
+ * STEP: do_insert_tree
+ */
static int do_insert_tree(struct dquot *dquot, uint *treeblk, int depth)
{
struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
@@ -369,12 +375,9 @@
*treeblk = ret;
memset(buf, 0, V2_DQBLKSIZE);
newact = 1;
- }
- else {
- if ((ret = read_blk(filp, *treeblk, buf)) < 0) {
- printk(KERN_ERR "VFS: Can't read tree quota block %u.\n", *treeblk);
- goto out_buf;
- }
+ } else if ((ret = read_blk(filp, *treeblk, buf)) < 0) {
+ printk(KERN_ERR "VFS: Can't read tree quota block %u.\n", *treeblk);
+ goto out_buf;
}
ref = (u32 *)buf;
newblk = le32_to_cpu(ref[GETIDINDEX(dquot->dq_id, depth)]);
@@ -389,14 +392,12 @@
}
#endif
newblk = find_free_dqentry(dquot, &ret);
- }
- else
+ } else
ret = do_insert_tree(dquot, &newblk, depth+1);
if (newson && ret >= 0) {
ref[GETIDINDEX(dquot->dq_id, depth)] = cpu_to_le32(newblk);
ret = write_blk(filp, *treeblk, buf);
- }
- else if (newact && ret < 0)
+ } else if (newact && ret < 0)
put_free_dqblk(filp, dquot->dq_type, buf, *treeblk);
out_buf:
freedqbuf(buf);
@@ -498,6 +499,12 @@
}

/* Remove reference to dquot from tree */
+
+/**
+ * Recursion count equals V2_DQTREEDEPTH, keep both in sync
+ * RECURSION: 4
+ * STEP: remove_tree
+ */
static int remove_tree(struct dquot *dquot, uint *blk, int depth)
{
struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
@@ -516,19 +523,17 @@
if (depth == V2_DQTREEDEPTH-1) {
ret = free_dqentry(dquot, newblk);
newblk = 0;
- }
- else
+ } else
ret = remove_tree(dquot, &newblk, depth+1);
if (ret >= 0 && !newblk) {
int i;
ref[GETIDINDEX(dquot->dq_id, depth)] = cpu_to_le32(0);
- for (i = 0; i < V2_DQBLKSIZE && !buf[i]; i++); /* Block got empty? */
+ for (i = 0; i < V2_DQBLKSIZE && !buf[i]; i++)
+ ; /* Block got empty? */
if (i == V2_DQBLKSIZE) {
put_free_dqblk(filp, dquot->dq_type, buf, *blk);
*blk = 0;
- }
- else
- if ((ret = write_blk(filp, *blk, buf)) < 0)
+ } else if ((ret = write_blk(filp, *blk, buf)) < 0)
printk(KERN_ERR "VFS: Can't write quota tree block %u.\n", *blk);
}
out_buf:
@@ -584,6 +589,12 @@
}

/* Find entry for given id in the tree */
+
+/**
+ * Recursion count equals V2_DQTREEDEPTH, keep both in sync
+ * RECURSION: 4
+ * STEP: find_tree_dqentry
+ */
static loff_t find_tree_dqentry(struct dquot *dquot, uint blk, int depth)
{
struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
--- linux-2.6.6stack/drivers/char/random.c~recursion 2004-05-10 18:10:23.000000000 +0200
+++ linux-2.6.6stack/drivers/char/random.c 2004-05-30 18:48:55.000000000 +0200
@@ -1311,6 +1311,12 @@
* from the primary pool to the secondary extraction pool. We make
* sure we pull enough for a 'catastrophic reseed'.
*/
+
+/**
+ * RECURSION: 2
+ * STEP: xfer_secondary_pool
+ * STEP: extract_entropy
+ */
static inline void xfer_secondary_pool(struct entropy_store *r,
size_t nbytes, __u32 *tmp)
{
--- linux-2.6.6stack/drivers/ide/ide-timing.h~recursion 2004-01-09 07:59:26.000000000 +0100
+++ linux-2.6.6stack/drivers/ide/ide-timing.h 2004-05-31 16:56:23.000000000 +0200
@@ -208,63 +208,53 @@
return t;
}

+/**
+ * RECURSION: 2
+ * STEP: ide_timing_compute
+ */
static int ide_timing_compute(ide_drive_t *drive, short speed, struct ide_timing *t, int T, int UT)
{
struct hd_driveid *id = drive->id;
struct ide_timing *s, p;

-/*
- * Find the mode.
- */
-
- if (!(s = ide_timing_find_mode(speed)))
+ s = ide_timing_find_mode(speed);
+ if (!s)
return -EINVAL;

-/*
- * If the drive is an EIDE drive, it can tell us it needs extended
- * PIO/MWDMA cycle timing.
- */
-
- if (id && id->field_valid & 2) { /* EIDE drive */
-
+ /* If the drive is an EIDE drive, it can tell us it needs extended
+ * PIO/MWDMA cycle timing.
+ */
+ if (id && (id->field_valid & 2)) { /* EIDE drive */
memset(&p, 0, sizeof(p));

switch (speed & XFER_MODE) {
+ case XFER_PIO:
+ if (speed <= XFER_PIO_2)
+ p.cycle = p.cyc8b = id->eide_pio;
+ else
+ p.cycle = p.cyc8b = id->eide_pio_iordy;
+ break;

- case XFER_PIO:
- if (speed <= XFER_PIO_2) p.cycle = p.cyc8b = id->eide_pio;
- else p.cycle = p.cyc8b = id->eide_pio_iordy;
- break;
-
- case XFER_MWDMA:
- p.cycle = id->eide_dma_min;
- break;
+ case XFER_MWDMA:
+ p.cycle = id->eide_dma_min;
+ break;
}
-
ide_timing_merge(&p, t, t, IDE_TIMING_CYCLE | IDE_TIMING_CYC8B);
}

-/*
- * Convert the timing to bus clock counts.
- */
-
+ /* Convert the timing to bus clock counts. */
ide_timing_quantize(s, t, T, UT);

-/*
- * Even in DMA/UDMA modes we still use PIO access for IDENTIFY, S.M.A.R.T
- * and some other commands. We have to ensure that the DMA cycle timing is
- * slower/equal than the fastest PIO timing.
- */
-
+ /* Even in DMA/UDMA modes we still use PIO access for IDENTIFY,
+ * S.M.A.R.T and some other commands. We have to ensure that the
+ * DMA cycle timing is slower/equal than the fastest PIO timing.
+ */
if ((speed & XFER_MODE) != XFER_PIO) {
ide_timing_compute(drive, ide_find_best_mode(drive, XFER_PIO | XFER_EPIO), &p, T, UT);
ide_timing_merge(&p, t, t, IDE_TIMING_ALL);
}

-/*
- * Lenghten active & recovery time so that cycle time is correct.
- */
-
+ /* Lenghten active & recovery time so that cycle time is correct. */
if (t->act8b + t->rec8b < t->cyc8b) {
t->act8b += (t->cyc8b - (t->act8b + t->rec8b)) / 2;
t->rec8b = t->cyc8b - t->act8b;
--- linux-2.6.6stack/drivers/ide/ide-tape.c~recursion 2004-05-10 18:10:24.000000000 +0200
+++ linux-2.6.6stack/drivers/ide/ide-tape.c 2004-05-31 16:58:30.000000000 +0200
@@ -3653,6 +3653,11 @@
* the filemark is in our internal pipeline even if the tape doesn't
* support spacing over filemarks in the reverse direction.
*/
+
+/**
+ * RECURSION: 2
+ * STEP: idetape_space_over_filemarks
+ */
static int idetape_space_over_filemarks (ide_drive_t *drive,short mt_op,int mt_count)
{
idetape_tape_t *tape = drive->driver_data;
@@ -3711,21 +3716,21 @@
* Now we can issue the space command.
*/
switch (mt_op) {
- case MTFSF:
- case MTBSF:
- idetape_create_space_cmd(&pc,mt_count-count,IDETAPE_SPACE_OVER_FILEMARK);
- return (idetape_queue_pc_tail(drive, &pc));
- case MTFSFM:
- case MTBSFM:
- if (!tape->capabilities.sprev)
- return (-EIO);
- retval = idetape_space_over_filemarks(drive, MTFSF, mt_count-count);
- if (retval) return (retval);
- count = (MTBSFM == mt_op ? 1 : -1);
- return (idetape_space_over_filemarks(drive, MTFSF, count));
- default:
- printk(KERN_ERR "ide-tape: MTIO operation %d not supported\n",mt_op);
+ case MTFSF:
+ case MTBSF:
+ idetape_create_space_cmd(&pc,mt_count-count,IDETAPE_SPACE_OVER_FILEMARK);
+ return (idetape_queue_pc_tail(drive, &pc));
+ case MTFSFM:
+ case MTBSFM:
+ if (!tape->capabilities.sprev)
return (-EIO);
+ retval = idetape_space_over_filemarks(drive, MTFSF, mt_count-count);
+ if (retval) return (retval);
+ count = (MTBSFM == mt_op ? 1 : -1);
+ return (idetape_space_over_filemarks(drive, MTFSF, count));
+ default:
+ printk(KERN_ERR "ide-tape: MTIO operation %d not supported\n",mt_op);
+ return (-EIO);
}
}
--- linux-2.6.6stack/drivers/isdn/i4l/isdn_tty.c~recursion 2004-03-28 21:51:38.000000000 +0200
+++ linux-2.6.6stack/drivers/isdn/i4l/isdn_tty.c 2004-05-31 17:02:39.000000000 +0200
@@ -2519,6 +2519,11 @@
* For RING-message handle auto-ATA if register 0 != 0
*/

+/**
+ * RECURSION: 2
+ * STEP: isdn_tty_modem_result
+ * STEP: isdn_tty_cmd_ATA
+ */
static void
isdn_tty_modem_result(int code, modem_info * info)
{
--- linux-2.6.6stack/drivers/isdn/icn/icn.c~recursion 2004-03-28 21:51:38.000000000 +0200
+++ linux-2.6.6stack/drivers/isdn/icn/icn.c 2004-05-31 17:03:55.000000000 +0200
@@ -1097,6 +1097,11 @@
/*
* Delete card's pending timers, send STOP to linklevel
*/
+
+/**
+ * RECURSION: 2
+ * STEP: icn_stopcard
+ */
static void
icn_stopcard(icn_card * card)
{
Pavel Machek
2004-06-01 12:20:13 UTC
Permalink
Hi!
Post by Jörn Engel
Post by Jörn Engel
So effectively, it comes down to the recursive paths. Unless someone
comes up with a semantic parser that can figure out the maximum
number of iterations, we have to look at them manually.
Linus, Andrew, would you accept patches like the one below? With such
information and assuming that the comments will get maintained, it's
relatively simple to unroll recursions and measure stack consumption
more accurately.
Perhaps some other format of comment should be introduced? Will not
this interfere with linuxdoc?
Pavel
--
Jörn Engel
2004-06-01 13:27:12 UTC
Permalink
Post by Pavel Machek
Perhaps some other format of comment should be introduced? Will not
this interfere with linuxdoc?
I'm open for suggestions. ;)

Jörn

--
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike
Pavel Machek
2004-06-01 13:32:29 UTC
Permalink
Hi!
Post by Jörn Engel
Post by Pavel Machek
Perhaps some other format of comment should be introduced? Will not
this interfere with linuxdoc?
I'm open for suggestions. ;)
/*! Recursion-count: 2 Whatever-else: 5 */

?
Pavel
--
Jörn Engel
2004-06-01 13:37:53 UTC
Permalink
Post by Pavel Machek
Post by Jörn Engel
I'm open for suggestions. ;)
/*! Recursion-count: 2 Whatever-else: 5 */
What I need is:
1. Recursion count
2. All functions involved in the recursion in the correct order (a
calls b calls c calls d calls a, something like that).

How do you pass 2?

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c
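(Editorial note, purely illustrative: one way Pavel's single-comment style could carry both pieces of information Jörn asks for, the bound and the cycle in call order, might look like the following, using function names from the patch above.)

/*!
 * Recursion-count: 2
 * Recursion-path: isdn_tty_modem_result -> isdn_tty_cmd_ATA -> isdn_tty_modem_result
 */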
v***@parcelfarce.linux.theplanet.co.uk
2004-06-01 12:39:22 UTC
Permalink
Add recursion markers to teach automated test tools how bad documented
recursions really are. Currently, there is only a single such tool that
can use the information and there is always the danger of documentation
and reality getting out of sync. But until there's a better tool...
+/**
+ * RECURSION: 100
+ * STEP: register_proc_table
+ */
This is too ugly for words ;-/ Who will maintain that data, anyway?
Jörn Engel
2004-06-01 13:26:29 UTC
Permalink
Post by v***@parcelfarce.linux.theplanet.co.uk
Post by Jörn Engel
+/**
+ * RECURSION: 100
+ * STEP: register_proc_table
+ */
This is too ugly for words ;-/ Who will maintain that data, anyway?
What format do you propose? I don't care too much.

Maintenance would get easier with less recursions, obviously. ;)

I could hack up something that will generate digests from the function
source code (through smatch or so) and put those digests into the
comments. As long as they match, the comments remain valid. And that
should get past lawyers, as I work on a different basis now.
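(A minimal sketch of that digest idea, assuming a plain FNV-1a hash over the function's source text is enough to notice when code and comment drift apart; the program and the "DIGEST:" field name are invented for illustration.)

#include <stdint.h>
#include <stdio.h>

/*
 * FNV-1a over a function body.  A checker could store the result next to
 * the RECURSION/STEP markers as "DIGEST: <hex>" and warn when the source
 * no longer matches the recorded value.
 */
static uint32_t source_digest(const char *body)
{
	uint32_t hash = 2166136261u;

	while (*body) {
		hash ^= (unsigned char)*body++;
		hash *= 16777619u;
	}
	return hash;
}

int main(void)
{
	printf("DIGEST: %08x\n",
	       (unsigned)source_digest("static void icn_stopcard(icn_card * card)"));
	return 0;
}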

Jörn

--
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike
Timothy Miller
2004-06-07 18:14:51 UTC
Permalink
Post by Jörn Engel
But I'll shut up now and see if I can generate better data over the
weekend. -test11 still had fun stuff like 3k stack consumption over
some code paths in a pretty minimal kernel. Wonder what 2.6.6 will do
with allyesconfig. ;)
That gave me an idea. Sometimes in chip design, we 'overconstrain' the
logic synthesizer, because static timing analyzers often produce
inaccurate results. Anyhow, what if we were to go to 4K stacks but in
static code analysis, flag anything which uses more than 2K or even 1K?
Arjan van de Ven
2004-06-08 06:26:25 UTC
Permalink
Post by Timothy Miller
That gave me an idea. Sometimes in chip design, we 'overconstrain' the
logic synthesizer, because static timing analyzers often produce
inaccurate results. Anyhow, what if we were to go to 4K stacks but in
static code analysis, flag anything which uses more than 2K or even 1K?
the patch I sent to akpm went to 400 bytes actually, but yeah, even that
already is debatable.
Jörn Engel
2004-06-08 08:45:06 UTC
Permalink
That gave me an idea. Sometimes in chip design, we 'overconstrain' the
logic synthesizer, because static timing analyzers often produce
inaccurate results. Anyhow, what if we were to go to 4K stacks but in
static code analysis, flag anything which uses more than 2K or even 1K?

With 2.6.6, there are currently just a few non-recursive paths over
3k. 2k will give you a *lot* of output, but if you insist... ;)

http://wh.fh-wedel.de/~joern/data.nointermezzo.cs2.2k.bz2
470k compressed, 65M uncompressed

Feel free to send patches.
the patch I sent to akpm went to 400 bytes actually, but yeah, even that
already is debatable.
400 bytes? That is for a single function, I assume.

Jörn

--
Those who come seeking peace without a treaty are plotting.
-- Sun Tzu

David S. Miller
2004-05-26 18:12:22 UTC
Permalink
On Wed, 26 May 2004 14:50:14 +0200
Post by Jörn Engel
Change gcc to catch stack overflows before the fact and disallow
module load unless modules have those checks as well.
That's easy, just enable profiling then implement a suitable
_mcount that checks for stack overflow. I bet someone has done
this already.

For full coverage, some trap entry handler checks in entry.S
would be necessary too of course.
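(The stackcheck patch linked later in the thread is not reproduced here; the following is only a rough sketch of the idea for i386, assuming THREAD_SIZE-aligned stacks, with the check factored into a C helper so the mcount stub and its register-preservation rules can stay in assembly. The file containing it would have to be compiled without -pg so the helper does not profile itself.)

/*
 * Called from an mcount stub on every function entry when the kernel is
 * built with -pg.  Panics while some stack is still left, instead of
 * letting an overflow silently corrupt the thread_info at the bottom
 * of the stack.
 */
void check_stack(void)
{
	unsigned long sp, used_offset;

	asm("movl %%esp, %0" : "=r" (sp));
	/* offset of %esp within this THREAD_SIZE-aligned stack area */
	used_offset = sp & (THREAD_SIZE - 1);
	if (used_offset < sizeof(struct thread_info) + 128)
		panic("kernel stack overflow (esp offset %lu)\n", used_offset);
}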
Matt Mackall
2004-05-26 19:02:22 UTC
Permalink
Post by David S. Miller
On Wed, 26 May 2004 14:50:14 +0200
Post by Jörn Engel
Change gcc to catch stack overflows before the fact and disallow
module load unless modules have those checks as well.
That's easy, just enable profiling then implement a suitable
_mcount that checks for stack overflow. I bet someone has done
this already.
There was a patch floating around for this in the 2.2 era that I
ported to 2.4 on one occasion. It won't tell you worst case though,
just worst observed case.

Sparse is probably not a bad place to put a real call chain stack analysis.
--
Mathematics is the supreme nostalgia of our time.
Dave Jones
2004-05-26 19:25:29 UTC
Permalink
Post by Matt Mackall
There was a patch floating around for this in the 2.2 era that I
ported to 2.4 on one occassion. It won't tell you worst case though,
just worst observed case.
Sparse is probably not a bad place to put a real call chain stack analysis.
That won't measure any dynamic stack allocations that we're doing
at runtime, nor will it test all n combinations of drivers, which
is where most of the stack horrors have been found in recent times.

Dave
Bill Davidsen
2004-05-25 21:04:03 UTC
Permalink
Post by Rik van Riel
Post by Andrea Arcangeli
Clearly by opening enough files or enough network sockets or enough vmas
or similar, you can still run out of normal zone, even on a 2G system,
but this is not the point or you would be shipping 4:4 on the 2G systems
too, no?
The point is, people like to run bigger workloads on
bigger systems. Otherwise they wouldn't bother buying
those bigger systems.
Sure, and "bigger workloads" can mean a lot of small client processes
talking to a threaded large process, like database or news, or it can be
huge datasets, like image or multimedia processing. Unfortunately it
seems that these all (ab)use the VM in various ways.

x86 with lots of memory is likely to remain cost effective for years,
not only because it allows (and in most cases needs) more memory, but
because vendors will hang on to their profit margins on
64 bit CPUs for that long. And for uses with many small client
processes, the advantage of 64 bits is pretty small when you have 2MB
processes. Given a bus which allows i/o to more memory, the benefits in
performance are really hard to see.
--
-bill davidsen (***@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
Thomas Glanzmann
2004-05-23 23:42:31 UTC
Permalink

Hi,
So do I understand this correctly, in 2.6.7(+) it will no longer be
necessary to have the 4g patches? I will be able to get 4g/process
with the going forward kernels?
a coworker made the old 4g patch ready to apply cleanly to 2.6.6. He's
also working on a nice model; I think he's reading lkml, so maybe he'll
answer.
Martin J. Bligh
2004-05-24 01:33:04 UTC
Permalink
Post by Phy Prabab
I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?
I am currently trying to get 4g to work with 2.6.6-mm5
but of course running into issues,so any help on this
would be great!
It's in -mjb tree - the update from 2.6.6-rc3 to 2.6.6 should be trivial.

M.
Phy Prabab
2004-05-24 01:38:40 UTC
Permalink
Okay, I will work with that patch and see if I can get it
to apply.

Thanks!
Phy
Post by Martin J. Bligh
Post by Phy Prabab
I have been researching the 4g patches for kernels.
Seems there was a rift between people over this. Is
there any plan to resume publishing 4g patches for
developing kernels?
I am currently trying to get 4g to work with 2.6.6-mm5
but of course running into issues, so any help on this
would be great!
It's in -mjb tree - the update from 2.6.6-rc3 to 2.6.6 should be trivial.
M.
Thomas Glanzmann
2004-05-24 10:27:09 UTC
Permalink
Hi,
Post by Thomas Glanzmann
a coworker made the old 4g patch ready to apply clean into 2.6.6 . He's
also working on a nice model, I think he's reading lkml so maybe he'll
answer.
forgot the URL:

http://wwwcip.informatik.uni-erlangen.de/~sithglan/2.6.6-4g

Thomas
Manfred Spraul
2004-05-25 19:49:49 UTC
Permalink
Post by Ingo Molnar
also, the 4:4 overhead is really a hardware problem - and there are
x86-compatible CPUs (amd64) where the TLB flush problem has already been
solved: on amd64 the 4:4 feature has no noticeable overhead.
Do you have an idea why amd64 is better for 4g4g? Which benchmark did
you use for testing?

--
Manfred
Ingo Molnar
2004-05-25 19:54:37 UTC
Permalink
Post by Manfred Spraul
Post by Ingo Molnar
also, the 4:4 overhead is really a hardware problem - and there are
x86-compatible CPUs (amd64) where the TLB flush problem has already been
solved: on amd64 the 4:4 feature has no noticeable overhead.
Do you have an idea why amd64 is better for 4g4g? Which benchmark did
you use for testing?
I used an athlon64 CPU. amd64 is better because it has a hardware
feature that 'watches' for memory updates to cached TLBs, and it tags
the TLBs by cr3. So it can avoid having to flush those TLBs that didn't
actually change. So an amd64 CPU has in excess of 1000 TLBs ...

Ingo
Andi Kleen
2004-05-26 13:57:05 UTC
Permalink
Post by Ingo Molnar
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
A nice combination would be 8K process stacks with separate irq stacks on
i386.

Any chance the CONFIGs for those two could be split?

-Andi
h***@infradead.org
2004-05-26 18:17:34 UTC
Permalink
Post by Andi Kleen
Post by Ingo Molnar
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
A nice combination would be 8K process stacks with separate irq stacks on
i386.
Any chance the CONFIGs for those two could be split?
Any reason not to enable interrupt stacks unconditionally and leave
the stack size choice to the user?
Andi Kleen
2004-05-26 18:24:50 UTC
Permalink
Post by h***@infradead.org
Post by Andi Kleen
Post by Ingo Molnar
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
A nice combination would be 8K process stacks with separate irq stacks on
i386.
Any chance the CONFIGs for those two could be split?
Any reason not to enable interrupt stacks unconditionally and leave
the stack size choice to the user?
It will probably still break some other patches, like debuggers.

Given that the kernel is supposed to be stable I would not change
it unconditionally in 2.6. Maybe in 2.7.

-Andi
Zwane Mwaikambo
2004-05-26 20:39:22 UTC
Permalink
Post by Andi Kleen
Post by Ingo Molnar
do you realize that the 4K stacks feature also adds a separate softirq
and a separate hardirq stack? So the maximum footprint is 4K+4K+4K, with
A nice combination would be 8K process stacks with separate irq stacks on
i386.
Any chance the CONFIGs for those two could be split?
Couldn't this just be done with a THREAD_SIZE config option?
Albert Cahalan
2004-05-26 15:17:50 UTC
Permalink
Post by Ingo Molnar
do you realize that the 4K stacks feature also adds
a separate softirq and a separate hardirq stack?
So the maximum footprint is 4K+4K+4K, with a clear
and sane limit for each type of context, while the
2.4 kernel has 6.5K for all 3 contexts combined.
(Also, in 2.4 irq contexts pretty much assumed that
there's 2K of stack for them - leaving a de-facto 4K
stack for the process and softirq contexts.) So in fact
there is more space in 2.6 for all, and i dont really
understand your fears.
Is that 4K per IRQ (total 64K to 1024K) or 4K total?
If it's total, then it's cheap to go with 32K.

The same goes for softirqs: 4K total, or per softirq?
Andi Kleen
2004-05-26 19:32:32 UTC
Permalink
Post by David S. Miller
Post by Jörn Engel
Change gcc to catch stack overflows before the fact and disallow
module load unless modules have those checks as well.
It's impossible to do anything but panic, so it's not too helpful
in practice. You can only do better for interrupts
(not handle an interrupt when the stack is too low).
Post by David S. Miller
That's easy, just enable profiling then implement a suitable
_mcount that checks for stack overflow. I bet someone has done
this already.
I did it for x86-64 a long time ago. Should be easy to port to i386
too.

ftp://ftp.x86-64.org/pub/linux/debug/stackcheck-1

-Andi
Jörn Engel
2004-05-27 11:27:05 UTC
Permalink
Post by Andi Kleen
Post by David S. Miller
Post by Jörn Engel
Change gcc to catch stack overflows before the fact and disallow
module load unless modules have those checks as well.
It's impossible to do anything but panic, so it's not too helpful
in practice.
Oh, panic is *very* helpful. Panic won't do random funny things, it
will just stop the machine. If we got an immediate panic on any stack
overflow, I would want 4k stacks right now.
Post by Andi Kleen
Post by David S. Miller
That's easy, just enable profiling then implement a suitable
_mcount that checks for stack overflow. I bet someone has done
this already.
I did it for x86-64 a long time ago. Should be easy to port to i386
too.
ftp://ftp.x86-64.org/pub/linux/debug/stackcheck-1
Cool! If that is included, I don't have any objections against 4k
stacks anymore.

Jörn

--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories
Andrea Arcangeli
2004-05-27 13:49:50 UTC
Permalink
Post by Jörn Engel
Cool! If that is included, I don't have any objections against 4k
stacks anymore.
note that it will introduce a huge slowdown; there's no way to enable
that in production. But for testing it's fine.
Jörn Engel
2004-05-27 14:15:47 UTC
Permalink
Post by Jörn Engel
Cool! If that is included, I don't have any objections against 4k
stacks anymore.

note that it will introduce a huge slowdown; there's no way to enable
that in production. But for testing it's fine.
Would it be possible to add something short to the function preamble
on x86 then? Similar to this code, maybe:

if (!(stack_pointer & 0xe00)) /* less than 512 bytes left */
*NULL = 1;

Not sure how this can be translated into short and fast x86 assembler,
but if it is possible, I would really like to have it. Then all we
have left to do is make sure no function ever uses more than 512
bytes. Famous last words, I know.
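(Untested illustration only, mirroring the check quoted above rather than anything that was merged: on i386 the test fits in three instructions, here wrapped in inline assembly so it could be pasted into a function preamble by hand or by a gcc hack. The macro name is invented.)

/*
 * Equivalent of "if (!(stack_pointer & 0xe00)) *NULL = 1;":
 * if bits 9-11 of %esp are all clear we are within the lowest 512 bytes
 * of the stack area, so force an immediate oops by writing to address 0.
 */
#define STACK_PREAMBLE_CHECK()					\
	asm volatile("testl $0x0e00, %%esp\n\t"			\
		     "jne 1f\n\t"				\
		     "movl $1, 0\n"				\
		     "1:"					\
		     : : : "cc", "memory")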

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
Andrea Arcangeli
2004-05-27 14:49:16 UTC
Permalink
Post by Jörn Engel
Cool! If that is included, I don't have any objections against 4k
stacks anymore.

note that it will introduce a huge slowdown; there's no way to enable
that in production. But for testing it's fine.

Would it be possible to add something short to the function preamble

if (!(stack_pointer & 0xe00)) /* less than 512 bytes left */
*NULL = 1;

Not sure how this can be translated into short and fast x86 assembler,
but if it is possible, I would really like to have it. Then all we
have left to do is make sure no function ever uses more than 512
bytes. Famous last words, I know.
If it would be _inlined_ it would be *much* faster, but it would likely
be measurable anyways. Less measurable though. There's no way with gcc
to inline the above in the preamble, one could hack gcc for it though
(there's exactly an asm preamble thing in gcc that is the one that is
currently implemented as call mcount plus the register saving, changing
it to the above may be feasible, though it would need a new option in
gcc)

another nice thing to have (this one zerocost at runtime) would be a
way to set a limit on the size of the local variables for each function.
gcc knows that value very well, it's the sub it does on the stack
pointer the first few asm instructions after the call. That would
reduce the common mistakes. An equivalent script is the one from Keith
Owens checking the vmlinux binary after compilation but I'm afraid
people run that one only after the fact.
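(In the spirit of the scripts being traded here, and only as a sketch, not Keith Owens' checker or Jörn's script: a small C filter over "objdump -d vmlinux" output that reports large frame allocations per function and skips the negative "sub $-N,%esp" encodings discussed further down the thread. Run it as: objdump -d vmlinux | ./stackcheck)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LIMIT 1024	/* flag any single stack frame above 1K */

int main(void)
{
	char line[512], func[256] = "?";
	char *p;

	while (fgets(line, sizeof(line), stdin)) {
		/* objdump prints "c01234ab <function_name>:" at each symbol */
		if ((p = strchr(line, '<')) && strstr(p, ">:")) {
			sscanf(p + 1, "%255[^>]", func);
			continue;
		}
		/* frame setup looks like "sub $0x128,%esp" */
		if (strstr(line, "sub") && strstr(line, "%esp") &&
		    (p = strstr(line, "$0x"))) {
			unsigned long size = strtoul(p + 1, NULL, 16);

			/* huge values are really negative adjustments
			 * ("sub $-N,%esp"), so ignore them */
			if (size >= LIMIT && size < 0x80000000UL)
				printf("%8lu  %s\n", size, func);
		}
	}
	return 0;
}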
Jörn Engel
2004-05-27 14:59:35 UTC
Permalink
Would it be possible to add something short to the function preamble

if (!(stack_pointer & 0xe00)) /* less than 512 bytes left */
*NULL = 1;

Not sure how this can be translated into short and fast x86 assembler,
but if it is possible, I would really like to have it. Then all we
have left to do is make sure no function ever uses more than 512
bytes. Famous last words, I know.

If it would be _inlined_ it would be *much* faster, but it would likely
be measurable anyways. Less measurable though. There's no way with gcc
to inline the above in the preamble, one could hack gcc for it though
(there's exactly an asm preamble thing in gcc that is the one that is
currently implemented as call mcount plus the register saving, changing
it to the above may be feasible, though it would need a new option in
gcc)
It is on my list, although I care more about ppc32. Can anyone
translate the above into assembler?
another nice thing to have (this one zerocost at runtime) would be a
way to set a limit on the size of the local variables for each function.
gcc knows that value very well, it's the sub it does on the stack
pointer the first few asm instructions after the call. That would
reduce the common mistakes. An equivalent script is the one from Keith
Owens checking the vmlinux binary after compilation but I'm afraid
people run that one only after the fact.
Plus the script is wrong sometimes. I have had trouble with sizes
around 4G or 2G, and never found the time to really figure out what's
going on. Might be an alloca thing that got misparsed somehow.

Having the check in gcc should cause fewer surprises.

Jörn

--
It's not whether you win or lose, it's how you place the blame.
-- unknown
Keith Owens
2004-05-27 15:08:02 UTC
Permalink
On Thu, 27 May 2004 16:59:35 +0200,
Post by Jörn Engel
Post by Andrea Arcangeli
An equivalent script is the one from Keith
Owens checking the vmlinux binary after compilation but I'm afraid
people runs that one only after the fact.
Plus the script is wrong sometimes. I have had trouble with sizes
around 4G or 2G, and never found the time to really figure out what's
going on. Might be an alloca thing that got misparsed somehow.
Some code results in negative adjustments to the stack size on exit,
which look like 4G sizes. My script checks for those and ignores them.
/^[89a-f].......$/d;
Jörn Engel
2004-05-27 15:21:56 UTC
Permalink
Post by Keith Owens
On Thu, 27 May 2004 16:59:35 +0200,
Post by Jörn Engel
Plus the script is wrong sometimes. I have had trouble with sizes
around 4G or 2G, and never found the time to really figure out what's
going on. Might be an alloca thing that got misparsed somehow.

Some code results in negative adjustments to the stack size on exit,
which look like 4G sizes. My script checks for those and ignores them.
/^[89a-f].......$/d;
Ok, looks as if only my script is wrong. Do you know what exactly
causes such a negative adjustment?

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall
Arjan van de Ven
2004-05-27 15:34:26 UTC
Permalink
Post by Jörn Engel
Post by Keith Owens
On Thu, 27 May 2004 16:59:35 +0200,
Post by Jörn Engel
Plus the script is wrong sometimes. I have had trouble with sizes
around 4G or 2G, and never found the time to really figure out what's
going on. Might be an alloca thing that got misparsed somehow.
Some code results in negative adjustments to the stack size on exit,
which look like 4G sizes. My script checks for those and ignores them.
/^[89a-f].......$/d;
Ok, looks as if only my script is wrong. Do you know what exactly
causes such a negative adjustment?
you can write "add 100,%esp" as "sub -100, %esp" :)
compilers seem to do that at times, probably some cpu model inside the
compiler decides the latter is better code in some cases :)
Jörn Engel
2004-05-27 15:46:30 UTC
Permalink
Post by Arjan van de Ven
you can write "add 100,%esp" as "sub -100, %esp" :)
compilers seem to do that at times, probably some cpu model inside the
compiler decides the latter is better code in some cases :)
Makes sense (in a way). For x86 and ppc*, my script should be safe as
a nice side effect:
qr/^.*sub \$(0x$x{3,5}),\%esp$/o

Anything above 5 digits is ignored. That also misses allocations
above 1MB, but as long as human stupidity is finite... ;)

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c
Jörn Engel
2004-06-01 05:25:11 UTC
Permalink
Post by Arjan van de Ven
you can write "add 100,%esp" as "sub -100, %esp" :)
compilers seem to do that at times, probably some cpu model inside the
compiler decides the latter is better code in some cases :)
That and even worse things. sys_sendfile has a "sub $0x10,%esp"
followed by an "add $0x20,%esp". Can you explain that one as well?
0x20 is the size of all automatic variables on i386.

I have no idea what kind of trick gcc is playing there, but it appears
to work which makes me only more curious.

Jörn

--
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra