Discussion:
PCI memory allocation bug with CONFIG_HIGHMEM
David Hinds
2004-01-05 20:07:07 UTC
Permalink
In arch/i386/kernel/setup.c we have:

/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
	pci_mem_start = low_mem_size;

which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?

(I have a report of this failing on a laptop with a highmem kernel,
causing a PCI memory resource to be allocated on top of a RAM area)

-- Dave
Russell King
2004-01-05 23:00:16 UTC
Permalink
Post by David Hinds
/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?
(I have a report of this failing on a laptop with a highmem kernel,
causing a PCI memory resource to be allocated on top of a RAM area)
Beware - people sometimes use mem= to tell the kernel how much RAM is
available for its use. Unfortunately, this overrides the E820 map,
and causes the kernel to believe that all memory above the end of RAM
is available for use.

This is not the case, especially on ACPI systems.

I have come to the conclusion that the use of mem= is a _very_ bad idea
unless someone has an extremely good reason to override the E820 map.
And even then, it must be used with extreme care, and also in combination
with the reserve= parameter to ensure that reserved memory areas remain
marked as such. (Reserved regions as in the ACPI data tables.)

Failure to follow this will result in non-functional PCMCIA/Cardbus
because of memory resource collisions between system RAM and PCI
memory space.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
David Hinds
2004-01-05 23:45:54 UTC
Permalink
Post by Russell King
Post by David Hinds
/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?
(I have a report of this failing on a laptop with a highmem kernel,
causing a PCI memory resource to be allocated on top of a RAM area)
Beware - people sometimes use mem= to tell the kernel how much RAM is
available for its use. Unfortunately, this overrides the E820 map,
and causes the kernel to believe that all memory above the end of RAM
is available for use.
This is not the case, especially on ACPI systems.
Yes, and that was the original reason for this snippet of code. It is
just a quick fix, and shouldn't be needed if the E820 map is correct or
if the user has specified a correct mem= parameter.

-- Dave
Linus Torvalds
2004-01-06 00:36:15 UTC
Permalink
Post by David Hinds
/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?
Yes and no. That doesn't really work either, for any machine with more
than 4GB of RAM.

We want to find the memory hole (in the low 4GB region), and usually the
e820 memory map should make that all happen properly. What does that
report on this laptop?

This is why we put the memory resources in /proc/iomem, and mark them
busy: so that the PCI subsystem won't try to allocate PCI memory in the
RAM (or ACPI reserved) area. The "pci_mem_start" thing is just a point to
_start_ the allocation, the PCI subsystem still should honor the fact that
we have memory above it. That's the whole point of doing proper resource
allocation, after all.

Does this not work, or have you disabled e820 for some reason?

Linus
David Hinds
2004-01-06 00:44:23 UTC
Permalink
Post by Linus Torvalds
Post by David Hinds
/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?
Yes and no. That doesn't really work either, for any machine with more
than 4GB of RAM.
Ugh.
Post by Linus Torvalds
We want to find the memory hole (in the low 4GB region), and usually the
e820 memory map should make that all happen properly. What does that
report on this laptop?
This is why we put the memory resources in /proc/iomem, and mark them
busy: so that the PCI subsystem won't try to allocate PCI memory in the
RAM (or ACPI reserved) area. The "pci_mem_start" thing is just a point to
_start_ the allocation, the PCI subsystem still should honor the fact that
we have memory above it. That's the whole point of doing proper resource
allocation, after all.
Does this not work, or have you disabled e820 for some reason?
The original problem was actually that grub was passing a bogus mem=
parameter to the kernel that was 4K too small, I guess because it was
intending to indicate the amount of "available" memory (the top 4K is
reserved for ACPI). If highmem had not been enabled, the above code
would have corrected the problem; but with highmem, the computed
low_mem_size was incorrect. I would say that grub is just broken and
is misusing the mem= parameter, but this has been a problem for years
and they don't seem interested in fixing it.

-- Dave
Linus Torvalds
2004-01-06 01:11:56 UTC
Permalink
Post by David Hinds
The original problem was actually that grub was passing a bogus mem=
parameter to the kernel that was 4K too small, I guess because it was
intending to indicate the amount of "available" memory (the top 4K is
reserved for ACPI). If highmem had not been enabled, the above code
would have corrected the problem; but with highmem, the computed
low_mem_size was incorrect. I would say that grub is just broken and
is misusing the mem= parameter, but this has been a problem for years
and they don't seem interested in fixing it.
Hmm.. I suspect that it might be ok to check "max_pfn" for being less than
4GB, and use that if so. Add something like

if (max_pfn < 0x100000)
	if (pci_mem_start < (max_pfn << PAGE_SHIFT))
		pci_mem_start = max_pfn << PAGE_SHIFT;

to that sequence too.. I dunno. Ugly as hell. The basic issue is that if
the kernel doesn't know the RAM layout, there's no way it will get things
right all the time, so e820 or some other "good" memory layout source
should really always be used.

"mem=xxx" really doesn't work too well on modern machines. The issue is
just too complex, with RAM that is reserved etc..

Linus
Linus Torvalds
2004-01-06 01:41:54 UTC
Permalink
Post by Linus Torvalds
Hmm.. I suspect that it might be ok to check "max_pfn" for being less than
4GB, and use that if so. Add something like
if (max_pfn < 0x100000)
if (pci_mem_start < (max_pfn << PAGE_SHIFT))
pci_mem_start = max_pfn << PAGE_SHIFT;
Actually, that would suck.

I think the proper fix would be to make the "mem=" stuff do the right
thing to the iomem_resource handling, and add the "round up" code there
too (and mark it as being reserved).

Basically, it shouldn't be impossible to get a "reasonably good" map from
"mem=xxxx" that would work more of the time. It wouldn't necessarily be
perfect, but it would be better than what we have now.

You can always use much more complicated "exactmap" stuff to really
generate a full e820 map, but I suspect nobody has ever done that in real
life. Something like

mem=exactmap mem=640K@0 mem=254M@0x100000 mem=0x100000$0xff00000

can be used to give you 255MB of RAM with the last 1MB marked as being
"reserved".

Or it _should_ work that way. I've never used it myself ;)

Anyway, we could change what the "simple" form of "mem=xxx" means to
something that is more likely to have success. Anybody willing to look at
this?

Linus
Andi Kleen
2004-01-06 03:32:41 UTC
Permalink
Post by David Hinds
/* Tell the PCI layer not to allocate too close to the RAM area.. */
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?
max_pfn would get memory >4GB on highmem systems, which generally
doesn't work because many PCI devices only support 32-bit addresses.

IMHO the only reliable way to get physical bus space for mappings
is to allocate some memory and map the mapping over that.
On x86-64 the allocation must be GFP_DMA, on i386 it can be GFP_KERNEL.

The problem is that BIOSes commonly use physical address space without
marking it in the e820 map. For example, the AGP aperture is normally
not marked in any way in the e820 map, but you definitely cannot reuse
its bus space. The old code assumed that there is a memory hole below
the highest memory address <4GB, but that may not be true on a system
with >3GB.

We unfortunately must assume on such systems that all holes in e820
space are already used by something. On a system with <3GB you
are usually lucky because there is some space left, but even that
can break, e.g. by conflicting with reserved ACPI mappings. In theory
you could have a heuristic like "if E820_RAM tops out below 2GB, just
allocate after the highest E820_RAM entry, not conflicting with other
e820 mappings", but this would be quite hackish and may break on
weird systems.

BTW drivers/char/mem.c makes the same broken assumption. It really
wants to default to uncached access for any holes, but default to
cached for real memory. Doing that also requires reliable hole detection,
which we don't have.

One approach I haven't checked is whether the ACPI memory map has fixed
this problem (e820 has no defined way to describe a hole).

As long as you only have e820 I think there is no real alternative to
the "put io map over memory" technique.

-Andi
Linus Torvalds
2004-01-06 03:40:11 UTC
Permalink
Post by Andi Kleen
IMHO the only reliable way to get physical bus space for mappings
is to allocate some memory and map the mapping over that.
You literally can't do that: the RAM addresses are decoded by the
northbridge before they ever hit the PCI bus, so it's impossible to "map
over" RAM in general.

Normally, the way this works is that there are magic northbridge mapping
registers that remap part of the memory, so that the memory that is
physically in the upper 4GB of RAM shows up somewhere else (or just
possibly disappears entirely - once you have more than 4GB of RAM, you
might not care too much about a few tens of megs missing).

Linus
Andi Kleen
2004-01-06 04:05:46 UTC
Permalink
Post by Linus Torvalds
Post by Andi Kleen
IMHO the only reliable way to get physical bus space for mappings
is to allocate some memory and map the mapping over that.
You literally can't do that: the RAM addresses are decoded by the
northbridge before they ever hit the PCI bus, so it's impossible to "map
over" RAM in general.
Are you sure? I have a doc from AMD somewhere on the memory ordering
on K8 and it gives this order: (highest to lowest)

AGP aperture, TSEG, ASEG, IORR, Fixed MTRR, TOP_MEM

Note that TOP_MEM comes last, IORR comes earlier. It would require
setting an IORR though, which would be admittedly a bit nasty
(there are not that many of them). As long as it is only a single
area it should be possible though, we already have some code to change
IORRs in the AGP driver. That would be admittedly AMD specific,
but I suspect Intel has a similar mechanism.

I have successfully mapped the AGP aperture
over RAM and also seen it shadowing PCI mappings. I admit I haven't tried
it with PCI mappings.

But can you suggest a reliable way to find a memory hole in e820?
I haven't figured one out, and AFAIK there isn't even any guarantee
by the BIOS that there is one. E.g. Opteron BIOSes tend to use up all
the precious space <4GB for existing mappings, and I would expect
other i386 BIOSes to behave the same.

-Andi
Linus Torvalds
2004-01-06 05:04:31 UTC
Permalink
Post by Andi Kleen
Post by Linus Torvalds
You literally can't do that: the RAM addresses are decoded by the
northbridge before they ever hit the PCI bus, so it's impossible to "map
over" RAM in general.
Are you sure? I have a doc from AMD somewhere on the memory ordering
on K8 and it gives this order: (highest to lowest)
AGP aperture, TSEG, ASEG, IORR, Fixed MTRR, TOP_MEM
Those are all in the CPU or northbridge (well, on the opteron, the
northbridge is integrated so it all boils down to the CPU).

So yes, I'm sure. You have to have northbridge-specific code to punch a
"hole" in the RAM decoder, and some of them are "bios-locked", i.e. they
have registers that become read-only after the first time they are written
(or after a special lock-bit has been written).

So in some cases you can't do it at all.
Post by Andi Kleen
I have successfully mapped the AGP aperture
over RAM and also seen it shadowing PCI mappings. I admit I haven't tried
it with PCI mappings.
The AGP aperture is generally done in the northbridge, so it all depends
on what the decode priority is for the northbridge chip. That's
implementation-dependent.
Post by Andi Kleen
But can you suggest a reliable way to find a memory hole in e820?
I haven't figured one out, and AFAIK there isn't even any guarantee
by the BIOS that there is one. E.g. Opteron BIOSes tend to use up all
the precious space <4GB for existing mappings, and I would expect
other i386 BIOSes to behave the same.
If you have a proper e820 map, then it should work correctly, with
anything that is RAM being marked as such (or being marked as "reserved").

The problems happen when you do _not_ have a proper e820 map, either due
to bootloader bugs or BIOS problems, or because the user overrode the
values with a "mem=xxxx" thing.

Linus
Andi Kleen
2004-01-06 08:12:03 UTC
Permalink
Post by Linus Torvalds
If you have a proper e820 map, then it should work correctly, with
anything that is RAM being marked as such (or being marked as "reserved").
Every e820 map I've seen did not have the AGP aperture marked reserved.
It is just an undescribed hole. In fact when you mark the aperture in the
e820 map the Linux AGP driver stops working, it relies on it being
in an undescribed hole.

This means you cannot just reuse holes. And there is no other way to get
mapping space.

-Andi
Mika Penttilä
2004-01-06 09:11:21 UTC
Permalink
Post by Andi Kleen
Post by Linus Torvalds
If you have a proper e820 map, then it should work correctly, with
anything that is RAM being marked as such (or being marked as "reserved").
Every e820 map I've seen did not have the AGP aperture marked reserved.
Why should it? It's not RAM, and the aperture is marked as reserved
while doing PCI resource assignment/reservation.
Post by Andi Kleen
It is just an undescribed hole. In fact when you mark the aperture in the
e820 map the Linux AGP driver stops working, it relies on it being
in an undescribed hole.
Andi Kleen
2004-01-06 09:44:42 UTC
Permalink
Post by Mika Penttilä
Post by Andi Kleen
Post by Linus Torvalds
If you have a proper e820 map, then it should work correctly, with
anything that is RAM being marked as such (or being marked as "reserved").
Every e820 map I've seen did not have the AGP aperture marked reserved.
Why should it? It's not ram, and the aperture is marked as reserved
while doing PCI resource assignment/reservation.
It implies that you cannot just put your IO mappings
into any hole, because something else, like the aperture, may
already be there.

In my opinion it would have been cleaner if the aperture always had
a reserved entry in the e820 map. Or better, all usable holes could get
a special entry. Then you could actually reliably allocate IO space
on your own. Currently it's just impossible.

-Andi
Mika Penttilä
2004-01-06 10:16:14 UTC
Permalink
Post by Andi Kleen
Post by Mika Penttilä
Post by Andi Kleen
Post by Linus Torvalds
If you have a proper e820 map, then it should work correctly, with
anything that is RAM being marked as such (or being marked as "reserved").
Every e820 map I've seen did not have the AGP aperture marked reserved.
Why should it? It's not RAM, and the aperture is marked as reserved
while doing PCI resource assignment/reservation.
It implies that you cannot just put your IO mappings
into any hole, because something else, like the aperture, may
already be there.
But the AGP aperture is controlled with the standard APBASE PCI base
address register, so you always know where it is, and can relocate it and
reserve address space for it. Of course there may exist other
uncontrollable hardware, which may cause problems.
Post by Andi Kleen
In my opinion it would have been cleaner if the aperture always had
a reserved entry in the e820 map. Or better, all usable holes could get
a special entry. Then you could actually reliably allocate IO space
on your own. Currently it's just impossible.
-Andi
--Mika
Andi Kleen
2004-01-06 10:49:04 UTC
Permalink
Post by Mika Penttilä
But AGP aperture is controlled with the standard APBASE pci base
register, so you always know where it is, can relocate it and reserve
address space for it. Of course there may exist other uncontrollable hw,
which may cause problems.
Actually not. There are quite a lot of chipsets that require special
programming for the AGP aperture (why do you think drivers/char/agp/*.c
is so big?). And not everything is even AGPv2-compliant.

And as Linus points out you would likely need to do some Northbridge
specific magic to make that area usable for PCI then.

Also you would need to put it over RAM because again there is no
reliable way to find a hole.

-Andi
Linus Torvalds
2004-01-06 15:27:33 UTC
Permalink
Post by Andi Kleen
In my opinion it would have been cleaner if the aperture always had
a reserved entry in the e820 map.
That does sound like a bug in the AGP drivers. It shouldn't be hard at all
to make them reserve their aperture.

Hint hint.

Linus
Andi Kleen
2004-01-06 15:37:06 UTC
Permalink
Post by Linus Torvalds
Post by Andi Kleen
In my opinion it would have been cleaner if the aperture always had
a reserved entry in the e820 map.
That does sound like a bug in the AGP drivers. It shouldn't be hard at all
to make them reserve their aperture.
Hint hint.
No, it's a bug in the BIOS that they're not marked. But I've actually
seen a BIOS that marked it, and it led to the Linux AGP driver failing
(due to some interaction with how setup.c sets up resources). So the Linux
driver currently even relies on the broken state.

Anyways, I already implemented reservation for the aperture for the K8
driver some time ago. And it's in your tree. But it doesn't help for
finding IO holes because there could be other unmarked hardware lurking
there ... Or worse, there is just no free space below 4GB.

-Andi
Linus Torvalds
2004-01-06 15:48:37 UTC
Permalink
Post by Andi Kleen
Anyways, I already implemented reservation for the aperture for the K8
driver some time ago. And it's in your tree. But it doesn't help for
finding IO holes because there could be other unmarked hardware lurking
there ... Or worse there is just no free space below 4GB.
The "unmarked hardware" is why we have PCI quirks. Look at
drivers/pci/quirks.c, and notice how many of the quirks are all about
quirk_io_region(). Exactly because there isn't any way for the BIOS to
tell us about these things on the IO side.

(Actually, there is: PnP-BIOS calls are supposed to give us that
information. However, not only are the BIOSes buggy and give an
incomplete list _anyway_, but anybody who uses the PnP-BIOS is much more
likely to just get a kernel oops when the BIOS is buggy and assumes that
only Windows will call it. So I strongly suggest you not _ever_ use pnp
unless you absolutely have to).

The same quirks could be done on the MMIO side for northbridges.

Linus
Adam Belay
2004-01-06 22:29:59 UTC
Permalink
Post by Linus Torvalds
Post by Andi Kleen
Anyways, I already implemented reservation for the aperture for the K8
driver some time ago. And it's in your tree. But it doesn't help for
finding IO holes because there could be other unmarked hardware lurking
there ... Or worse there is just no free space below 4GB.
The "unmarked hardware" is why we have PCI quirks. Look at
drivers/pci/quirks.c, and notice how many of the quirks are all about
quirk_io_region(). Exactly because there isn't any way for the BIOS to
tell us about these things on the IO side.
(Actually, there is: PnP-BIOS calls are supposed to give us that
information. However, not only are the BIOSes buggy and don't give a
complete list _anyway_, anybody who uses the PnP-BIOS is much more likely
to just get a kernel oops when the BIOS is buggy and assumes that only
Windows will call it. So I strongly suggest you not _ever_ use pnp unless
you absolutely have to).
For those with legacy systems, the isapnp protocol, a component of pnp, is
unaffected by this problem. Most systems that support ISA add-in cards
have correctly implemented PnPBIOSes.
Post by Linus Torvalds
The same quirks could be done on the MMIO side for northbridges.
Linus
For the past few weeks I've been doing research on the PnPBIOS general
protection faults, and I've come up with a few observations and a proposed
solution. Any comments would be appreciated.

1.) There probably isn't anything wrong with the way we're calling the PnPBIOS.

After searching through various mailing lists I discovered that several other
open source operating systems, although having many variations on the PnPBIOS
code, are having identical problems (including that the same type of calls
trigger the faults). A while back I added a change that was similar to some
of apm's buggy bios handling code. It appears to fix the problems with getting
dynamic resource information on many buggy systems. I later decided (see
pnp-fix-1 in -mm) to get static resource information (the resources set at boot
time) because the specifications suggest using that call when enumerating
devices for the first time. To my surprise, many have reported problems with
the PnPBIOS driver found in -mm. In addition, there are some, but
significantly fewer, BIOSes that are completely broken and don't work with
either call type.

The recent escd fix I have made corrects a thinko in the PnPBIOS code and it
turns out that faults from calling /proc/pnp/bus/escd were probably not caused
by BIOS bugs. I've attached this fix to the end of the email. This leaves
only the get node calls.

2.) Windows works with buggy BIOSes because of the way it calls them.

I looked into how Windows handles the PnPBIOS and may have discovered why it
works on buggy BIOS. It turns out that exclusively realmode calls are used.
See www.missl.cs.umd.edu/Projects/sebos/winint/index2.html#pnpbios. My
knowledge of this area of the x86 architecture is limited, but my
impression is that it would not be possible, or perhaps not worth it, to
implement realmode calls for the Linux PnPBIOS driver because of when it
is initialized.

3.) BIOS bugs appear to affect mostly laptops.

The Oops seems to generally occur when getting information about the mouse
controller. Because of touchpads and external mice, the BIOS code may be
a little different from that of desktop systems. Nonetheless my laptop, as
well as all my other test systems, has no PnPBIOS problems.

4.) PnPBIOS support may not be fully implemented on a few rare systems with
ACPI.

The PnPBIOS standard has been obsoleted by ACPI. Only in systems made before
ACPI, or systems with blacklisted ACPI support (there are many), is the PnPBIOS
necessary. Unfortunately, resource management in the Linux ACPI driver isn't
fully supported relative to resource management in the Linux PnPBIOS driver.
It is conceivable that some PnPBIOSes only implement a minimal set of calls
properly.


A proposed solution...

For 2.6...

1.) only get dynamic resource information
2.) blacklist any BIOSes that fail on dynamic resource calls. We might get
lucky and there will be few enough that it is possible to create a blacklist.
Also look into them to see if they work with static instead.
3.) attach a warning, by printk and/or kconfig help, to the /proc/bus/pnp
interface as it is able to make any PnPBIOS call. (done in -mm)
4.) As a last resort, disable PnPBIOS support if ACPI is successful. Although
the two can currently coexist, this would prevent the buggy BIOSes found in
more modern x86 systems from being used. Of course this would be useless if
the user decides not to include the ACPI driver.
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.

Any others?

For the next development kernel...

I am working on a new resource management infrastructure, tied more closely to
the driver model and sysfs, and some ACPI patches. They should make it easier
for us to take advantage of ACPI resource management. Although one of my
biggest focuses is ACPI, I'd like to maintain compatibility with older
protocols such as PnPBIOS. It is also a major goal to make it usable by all
architectures (like the existing resource management code), and perhaps even
by Open Firmware when it is further implemented.
From there we can phase out PnPBIOS support where ACPI provides an alternative.
It's worth noting that PnPBIOS support is useful on the majority of systems that
support it. In later kernels it can serve as an alternative when ACPI is buggy
or unsupported.

Thanks,
Adam



--- a/drivers/pnp/pnpbios/bioscalls.c 2003-11-26 20:44:47.000000000 +0000
+++ b/drivers/pnp/pnpbios/bioscalls.c 2003-12-02 21:17:42.000000000 +0000
@@ -493,7 +493,7 @@
if (!pnp_bios_present())
return ESCD_FUNCTION_NOT_SUPPORTED;
status = call_pnp_bios(PNP_READ_ESCD, 0, PNP_TS1, PNP_TS2, PNP_DS, 0, 0, 0,
- data, 65536, (void *)nvram_base, 65536);
+ data, 65536, __va((void *)nvram_base), 65536);
return status;
}

@@ -516,7 +516,7 @@
if (!pnp_bios_present())
return ESCD_FUNCTION_NOT_SUPPORTED;
status = call_pnp_bios(PNP_WRITE_ESCD, 0, PNP_TS1, PNP_TS2, PNP_DS, 0, 0, 0,
- data, 65536, nvram_base, 65536);
+ data, 65536, __va((void *)nvram_base), 65536);
return status;
}
#endif
Linus Torvalds
2004-01-07 04:06:42 UTC
Permalink
Post by Adam Belay
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.
Any others?
We could use the exception mechanism, and try to fix up any BIOS errors.
That would require:

- make the BIOS calls save all important registers just before entry (esp
in particular, and the "after-call EIP") and set a flag saying "fix me
up". Do this per-CPU. Clear the flags after exit.

- add magic knowledge to "fixup_exception()" path that looks at the
per-cpu fix-me-up flag, and if it is set, restore all the segments
(which the BIOS may have crapped on), %esp and %eip to the magic fixup
values.

- test it with a bogus trap (on purpose) which has reset all the x86
registers, including an offset %esp.

This could make us recover from some (most?) BIOS bugs and at least
dynamically notice when the BIOS does bad bad things.

Linus
Andi Kleen
2004-01-07 05:02:56 UTC
Permalink
Post by Linus Torvalds
Post by Adam Belay
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.
Any others?
We could use the exception mechanism, and try to fix up any BIOS errors.
[...] It would not work for x86-64, unfortunately, where you cannot do
any BIOS calls after the system is running (it would only be possible
early in boot).

My hope was actually that there is some ACPI mechanism to do all this,
but I haven't done much research in this area yet.

-Andi
Dave Jones
2004-01-07 05:55:57 UTC
Permalink
Post by Andi Kleen
Post by Linus Torvalds
Post by Adam Belay
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.
Any others?
We could use the exception mechanism, and try to fix up any BIOS errors.
[...] It would not work for x86-64 unfortunately where you cannot do
any BIOS calls after the system is running (it would only be possible
early in boot)
Why on earth would you want to call PNPBIOS on AMD64 anyway?

Dave
Linus Torvalds
2004-01-07 06:06:24 UTC
Permalink
Post by Dave Jones
Why on earth would you want to call PNPBIOS on AMD64 anyway ?
For the same reason normal PCs still like to: no technical reason, except
for the fact that system vendors like to hide bugs and quirks by having
magic stuff in ACPI or PnPBIOS to tell the OS "hands off" or "this is how
to route this strange irq".

It's like ACPI: it would be a whole lot better if the hardware was just
standard and documented and didn't need any magic configuration tables and
strange code snippets to do magic acts of perversion. But sadly, it ain't
so, and PnP and ACPI are there as imperfect ways of doing what needs to be
done.

Of course, as with most system vendor crud, some BIOSes are more imperfect
than others.

Linus
Dave Jones
2004-01-07 06:08:43 UTC
Permalink
Post by Linus Torvalds
Post by Dave Jones
Why on earth would you want to call PNPBIOS on AMD64 anyway ?
For the same reason normal PC's still like to: no technical reason, except
for the fact that system vendors like to hide bugs and quirks by having
magic stuff in ACPI or PnPBIOS to tell the OS "hands off" or "this is how
to route this strange irq".
But PNPBIOS is an ISA relic, isn't it?
No AMD64 system I know of even has an ISA bus.

Dave
Linus Torvalds
2004-01-07 06:45:38 UTC
Permalink
Post by Dave Jones
But PNPBIOS is an ISA relic, isn't it?
It still shows up. BIOSes use it exactly to tell the system about reserved
magic IO regions (like the IO registers that are reserved for ACPI).

ISA may be gone, but the crap it left behind lingers on. The BIOS writers
know that they can affect Windows IO region allocation with it, so they
still do - to make sure Windows boots even when the hardware has some
strange IO resource allocations.

And yes, that is likely to be an issue on x86-64 too.. As far as Windows
is concerned, it's just another 32-bit CPU.

Linus
Andi Kleen
2004-01-07 06:51:56 UTC
Permalink
Post by Dave Jones
Post by Andi Kleen
Post by Linus Torvalds
Post by Adam Belay
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.
Any others?
We could use the exception mechanism, and try to fix up any BIOS errors.
[...] It would not work for x86-64 unfortunately where you cannot do
any BIOS calls after the system is running (it would only be possible
early in boot)
Why on earth would you want to call PNPBIOS on AMD64 anyway ?
See the preceding thread. We're currently missing a reliable way to find
free IO space for PCI resources, which is needed for some cases. The
PNPBIOS was discussed as one of the possible solutions.

For AMD64 clearly something ACPI based is needed though.

-Andi
Adam Belay
2004-01-07 02:43:09 UTC
Permalink
Post by Andi Kleen
Post by Dave Jones
Post by Andi Kleen
Post by Linus Torvalds
Post by Adam Belay
5.) Look into other ways of finding out if the PnPBIOS might be buggy,
currently we only have DMI.
Any others?
We could use the exception mechanism, and try to fix up any BIOS errors.
[...] It would not work for x86-64 unfortunately where you cannot do
any BIOS calls after the system is running (it would only be possible
early in boot)
Why on earth would you want to call PNPBIOS on AMD64 anyway?
See the preceding thread. We're currently missing a reliable way to find
free IO space for PCI resources, which is needed for some cases. The
PNPBIOS was discussed as one of the possible solutions.
For AMD64 clearly something ACPI based is needed though.
-Andi
Just as an example...

Here is how the PnPBIOS reserves io space for which it can't find an actual
device: (notice it isn't necessarily related to ISA)

09 PNP0c02 system peripheral: other
flags: [no disable] [no config] [static]
allocated resources:
io 0x04d0-0x04d1 [16-bit decode]
io 0x0cf8-0x0cff [16-bit decode]
io 0x0010-0x001f [16-bit decode]
io 0x0022-0x002d [16-bit decode]
io 0x0030-0x003f [16-bit decode]
io 0x0050-0x0052 [16-bit decode]
io 0x0072-0x0077 [16-bit decode]
io 0x0091-0x0093 [16-bit decode]
io 0x00a2-0x00be [16-bit decode]
io 0x0400-0x047f [16-bit decode]
io 0x0540-0x054f [16-bit decode]
io 0x0500-0x053f [16-bit decode]
io disabled [16-bit decode]
io disabled [16-bit decode]
io disabled [16-bit decode]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
mem 0xffb00000-0xffbfffff [32 bit] [r/o]


And here is the output for ACPI on the same system:

00000970: Device SYSR (\_SB_.PCI0.SBRG.SYSR)
00000978: Name _HID (\_SB_.PCI0.SBRG.SYSR._HID)
0000097d: PNP0c02 (0x020cd041)

-->snip

00000995: Name _CRS (\_SB_.PCI0.SBRG.SYSR._CRS)

-->snip

0000099f: Interpreted as PnP Resource Descriptor:
0000099f: Fixed I/O Ports: 0x10 @ 0x10
0000099f: Fixed I/O Ports: 0x1e @ 0x22
0000099f: Fixed I/O Ports: 0x1c @ 0x44
0000099f: Fixed I/O Ports: 0x2 @ 0x62
0000099f: Fixed I/O Ports: 0xb @ 0x65
0000099f: Fixed I/O Ports: 0xe @ 0x72
0000099f: Fixed I/O Ports: 0x1 @ 0x80
0000099f: Fixed I/O Ports: 0x3 @ 0x84
0000099f: Fixed I/O Ports: 0x1 @ 0x88
0000099f: Fixed I/O Ports: 0x3 @ 0x8c
0000099f: Fixed I/O Ports: 0x10 @ 0x90
0000099f: Fixed I/O Ports: 0x1e @ 0xa2
0000099f: Fixed I/O Ports: 0x10 @ 0xe0
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x4d0, maxbase 0x4d0, align 0x0, count 0x2
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x400, maxbase 0x400, align 0x0, count 0x70
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x470, maxbase 0x470, align 0x0, count 0x10
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x500, maxbase 0x500, align 0x0, count 0x40
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x800, maxbase 0x800, align 0x0, count 0x80
0000099f: 32-bit rw Fixed memory range:
0000099f: base 0xfff00000, count 0x100000
0000099f: 32-bit rw Fixed memory range:
0000099f: base 0xffb00000, count 0x100000
0000099f: Bad checksum 0x6, should be 0 // hmm, interesting ;-)


So they seem to provide a potential solution for this sort of problem.

Thanks,
Adam
Helge Hafting
2004-01-07 08:32:05 UTC
Permalink
Post by Adam Belay
2.) Windows works with buggy BIOSes because of the way it calls them.
I looked into how Windows handles the PnPBIOS and may have discovered why it
works on buggy BIOSes. It turns out that exclusively realmode calls are used.
See www.missl.cs.umd.edu/Projects/sebos/winint/index2.html#pnpbios. My
knowledge is limited in this area of the x86 architecture but it is my
impression that it would not be possible, or perhaps worth it, to implement
realmode calls for the Linux PnPBIOS driver because of the time it is
initialized.
Are these PnPBIOS calls needed at boot only? If so, consider
querying the bios early in the boot code - before
switching to protected mode. Just store the results,
and let the driver read them later instead of doing
calls that crash.

Helge Hafting
Eric W. Biederman
2004-01-06 22:45:16 UTC
Permalink
Post by Andi Kleen
Post by Linus Torvalds
Post by Andi Kleen
In my opinion it would have been cleaner if the aperture had always
a reserved entry in the e820 map.
That does sound like a bug in the AGP drivers. It shouldn't be hard at all
to make them reserve their aperture.
Hint hint.
No, it's a bug in the BIOS that they're not marked. But I've actually
seen a BIOS that marked it and it led to the Linux AGP driver failing
(due to some interaction with how setup.c sets up resources). So the Linux
driver currently even relies on the broken state.
And mtd map drivers for rom chips run into the same problem, except in
that case the region is almost always reserved by the BIOS.

Which means it's just silly for the drivers to fail when request_mem_region
fails. They are looking at the hardware and know where the regions are, and
there is not a parent device we can request a subregion from when it is the
BIOS that reserves the region.

Eric
Linus Torvalds
2004-01-07 00:06:42 UTC
Permalink
Post by Eric W. Biederman
And mtd map drivers for rom chips run into the same problem, except in
that case the region is almost always reserved by the BIOS.
Which means it's just silly for the drivers to fail when request_mem_region
fails.
Note: you're not supposed to need to do "request_mem_region()" for modern
drivers. You should only need to claim ownership of the resources, and the
PCI driver interfaces should do that automatically.

What you should do for resources you know about is to just _create_ them.
Not necessarily request them (although that is one way of creating them),
but you can literally just tell the kernel that they are there. That will
already mean that anybody else that tries to allocate a resource will
avoid that area.

So if you know the hardware is there, and it _tells_ you it's there
(unlike, say, an ISA device), you can just call "request_mem_region()"
without ever even checking the error return (although you had better make
sure that the name allocation is stable if you are a module - don't want
to start oopsing in /proc if the module gets unloaded).

The PCI layer already does all of that for the "standard" resources. It's
just that the generic code can't do it for nonstandard regions, so drivers
for chips that don't have just the regular BAR things should create their
own resource entries..

Linus
Eric W. Biederman
2004-01-07 04:58:23 UTC
Permalink
Post by Linus Torvalds
Post by Eric W. Biederman
And mtd map drivers for rom chips run into the same problem, except in
that case the region is almost always reserved by the BIOS.
Which means it's just silly for the drivers to fail when request_mem_region
fails.
Note: you're not supposed to need to do "request_mem_region()" for modern
drivers. You should only need to claim ownership of the resources, and the
PCI driver interfaces should do that automatically.
What you should do for resources you know about is to just _create_ them.
Which I can do. But what if the BIOS has marked them as reserved?
The BIOS always does this for ROM chips. And it sounds like this occasionally
happens for AGP apertures.
Post by Linus Torvalds
Not necessarily request them (although that is one way of creating them),
but you can literally just tell the kernel that they are there. That will
already mean that anybody else that tries to allocate a resource will
avoid that area.
So if you know the hardware is there, and it _tells_ you it's there
(unlike, say, an ISA device), you can just call "request_mem_region()"
without ever even checking the error return (although you had better make
sure that the name allocation is stable if you are a module - don't want
to start oopsing in /proc if the module gets unloaded).
Or to oops when the module is unloaded, when you try to free the resource.
But completely freeing the resource is actually bad manners, because
the resources are used no matter what and you don't want to allocate anything
else in there.
Post by Linus Torvalds
The PCI layer already does all of that for the "standard" resources. It's
just that the generic code can't do it for nonstandard regions, so drivers
for chips that don't have just the regular BAR things should create their
own resource entries..
So thinking out loud about the twist that is in my experience. Southbridges
have a special decode region for BIOS ROM chips. It is at least 64K, but can
be as big as 8M or so at the end of the address space.

On my machine at home the e820 map looks like:

00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c8fff : Extension ROM
000f0000-000fffff : System ROM
00100000-1fffbfff : System RAM
00100000-002bdc69 : Kernel code
002bdc6a-00347183 : Kernel data
1fffc000-1fffefff : ACPI Tables
1ffff000-1fffffff : ACPI Non-volatile Storage
cb800000-cb8fffff : Intel Corp. 82557 [Ethernet Pro 100]
cc000000-cc000fff : Intel Corp. 82557 [Ethernet Pro 100]
cc000000-cc000fff : eepro100
cc800000-cddfffff : PCI Bus #01
cc800000-ccffffff : Matrox Graphics, Inc. MGA G400 AGP
cd000000-cd003fff : Matrox Graphics, Inc. MGA G400 AGP
cdf00000-cfffffff : PCI Bus #01
ce000000-cfffffff : Matrox Graphics, Inc. MGA G400 AGP
d0000000-dfffffff : VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x]
ffff0000-ffffffff : reserved

That last reserved region is 64K. Which, looking at the PCI registers,
is technically correct at the moment: only 64K happens to be decoded.

If I wanted to flash my ROM what I need to do is:
- Load a driver for the region where a ROM chip can possibly be at
  the top of memory. This is the region 0xFFF00000 - 0xFFFFFFFF on
the via686.

- The driver comes in and looks for the via686, and finds it so it
knows it can do something.

- The driver can attempt to get the region 0xFFF00000 - 0xFFFFFFFF,
but that is impossible.

- The driver enables the decodes on all of 0xFFF00000 - 0xFFFFFFFF
in the via686

- In general the driver would flip a bit in the via686 or someplace
  to enable the WE (write-enable) line to the ROM chip.

- The driver would then call in the mtd subsystem at likely offsets
  into the region 0xFFF00000 - 0xFFFFFFFF and have it do a mostly
  standard JEDEC probe to see if a BIOS chip starts at that offset.

This basic algorithm works; see drivers/mtd/maps/amd76xrom.c and
drivers/mtd/maps/ich2rom.c. But it does not really play nice with the
existing kernel infrastructure.

So to do this cleanly it looks like I need to write a pci quirk for
the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
And that quirk should always be present, so that nothing even thinks
of using that region for something else.

With the quirk doing the heavy lifting the map driver would just need to do
something like grab a child resource for the ROM chip to show that I am
actively using it.

The very practical question: after the BIOS has allocated
0xFFFF0000 - 0xFFFFFFFF, how do I allocate
0xFFF00000 - 0xFFFFFFFF in the pci quirk?

The area is already allocated and it chops off the area I need to
allocate. Which is a general mess, and that happens to be a very
typical scenario for BIOS ROMS.

Because the conflicting resource is allocated in what
is now legacy_init_iomem_resources() from bootmem and just
dropped on the floor, I can't free the conflicting resource.
I don't know of anything I can do cleanly without modifying
the code.

Basically the question becomes what to do about an incorrect
e820 map that you don't find out about until you start initializing
drivers.

Eric
Linus Torvalds
2004-01-07 05:32:06 UTC
Permalink
Post by Eric W. Biederman
Post by Linus Torvalds
What you should do for resources you know about is to just _create_ them.
Which I can do. But what if the BIOS has marked them as reserved?
The BIOS always does this for ROM chips. And it sounds like this occasionally
happens for AGP apertures.
So? The resource functions will refuse to insert an overlapping resource
and return an error, so if the BIOS already did it through a proper e820
map, then it's a no-op.

But that's fine - it's _supposed_ to be a no-op in that case.
Post by Eric W. Biederman
Or to oops when the module is unloaded, when you try and free the resource.
Or, more appropriately, if it's a fixed resource (which it will be, if
this is some special chipset feature), you don't ever try to free it. Just
leave it be. Just make sure that the resource name etc points to stable
data (and "pci_name(dev)" is a good such data).

See the quirk entries in drivers/pci/quirks.c.

Alternatively, you actually keep track of whether it was your resource or
not, the error code will have told you. Don't try to release something
that wasn't yours.
Post by Eric W. Biederman
[ horrorcase deleted ]
So to do this cleanly it looks like I need to write a pci quirk for
the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
And that quirk should always be present, so that nothing even thinks
of using that region for something else.
Sounds correct. However, the BIOS map will still clash with this quirk, so
there may be some double resource allocations in the resource maps. The
quirks get run _after_ the memory setup has run, which is why you end up asking:
Post by Eric W. Biederman
0xFFFF0000 - 0xFFFFFFFF how do I allocate
0xFFF00000 - 0xFFFFFFFF in the pci quirk?
And _this_ is the only really nasty case. It's nasty exactly because the
BIOS is involved, and we have _no_ idea why the heck the BIOS marked
certain areas reserved.

The resource allocation code in kernel/resource.c _will_ help you if you
want to do this right. The internal "__request_resource()" function will
pinpoint any conflicting entry, and in fact we already have a
"insert_resource()" that uses exactly this to try to "fix up" these
issues.

The "insert_resource()" function is able to put "new" resources below old
ones, but it does assume that the resources are fully overlapping in
_some_ way. It will correctly insert your PCI quirk (because the BIOS
allocation is wholly inside of the quirk you want to add), but it would
_not_ be able to handle two different regions conflicting.

And such a conflict could happen if the BIOS uses a single "reserved"
region for two different PCI resources. Then your quirk might cover one of
the PCI resources fully, but wouldn't cover the whole BIOS "reserved"
area. "insert_resource()" would still be happy if your quirk is wholly
inside, but it would _not_ be happy if your quirk is bigger than the BIOS
allocation in one direction but not the other.

See?

Right now, the ia64 port actually does _exactly_ this to mark all the
strange PCI window stuff into the resource tree. For a different reason,
but with a number of similar issues.

Linus
Eric W. Biederman
2004-01-07 15:53:50 UTC
Permalink
Post by Linus Torvalds
Post by Eric W. Biederman
[ horrorcase deleted ]
So to do this cleanly it looks like I need to write a pci quirk for
the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
And that quirk should always be present, so that nothing even thinks
of using that region for something else.
Sounds correct. However, the BIOS map will still clash with this quirk, so
there may be some double resource allocations in the resource maps. The
quirks get run _after_ the memory setup has run, which is why you end up asking:
Post by Eric W. Biederman
0xFFFF0000 - 0xFFFFFFFF how do I allocate
0xFFF00000 - 0xFFFFFFFF in the pci quirk?
And _this_ is the only really nasty case. It's nasty exactly because the
BIOS is involved, and we have _no_ idea why the heck the BIOS marked
certain areas reserved.
The resource allocation code in kernel/resource.c _will_ help you if you
want to do this right. The internal "__request_resource()" function will
pinpoint any conflicting entry, and in fact we already have a
"insert_resource()" that uses exactly this to try to "fix up" these
issues.
Last time I was looking I got as far as __request_resource, but
it was and still is private to resource.c so it needs a wrapper around
it that does something useful.
Post by Linus Torvalds
The "insert_resource()" function is able to put "new" resources below old
ones, but it does assume that the the resources are fully overlapping in
_some_ way. It will correctly insert your PCI quirk (because the BIOS
allocation is wholly inside of the quirk you want to add), but it would
_not_ be able to handle two different regions conflicting.
And this looks like a useful wrapper that comes very close to what
I need. insert_resource is new since the last time I looked, so I missed
it.
Post by Linus Torvalds
And such a conflict could happen if the BIOS uses a single "reserved"
region for two different PCI resources. Then your quirk might cover one of
the PCI resources fully, but wouldn't cover the whole BIOS "reserved"
area. "insert_resource()" would still be happy if your quirk is wholly
inside, but it would _not_ be happy if your quirk is bigger than the BIOS
allocation in one direction but not the other.
See?
Yes. insert_resource does have its limitations, but it doesn't look
like I am likely to run into them.
Post by Linus Torvalds
Right now, the ia64 port actually does _exactly_ this to mark all the
strange PCI window stuff into the resource tree. For a different reason,
but with a number of similar issues.
It comes very close to what this weird case needs. And things
are at least close enough now that I can hack something up for 2.6.

However insert_resource does not quite match what I think needs to
happen. After a pci quirk applies insert_resource I will get
something like:

fff00000-ffffffff : BIOS ROM Window
  ffff0000-ffffffff : reserved

With the reserved region still present and marked as BUSY. Ideally
the map driver would carve up the window into sub regions for each
ROM chip. Usually that is just one sub region but it is possible to
have multiple ROMs in it. So I would expect to wind up with something like:

fff00000-ffffffff : BIOS ROM Window
  fffc0000-ffffffff : mtd0
  ffff0000-ffffffff : reserved

But that again runs afoul of the reserved region so it really won't
work. I could again use insert_resource and wind up with:

fff00000-ffffffff : BIOS ROM Window
  fffc0000-ffffffff : mtd0
    ffff0000-ffffffff : reserved

But that is increasingly a hack instead of a clean solution. Would it
be reasonable to write a variant of request_resource that just drops
BIOS resources? I can live with the restrictions of the current
insert_resource, but especially if I do this in a quirk I just want
the BIOS resources to go away.

Eric
Linus Torvalds
2004-01-07 16:32:37 UTC
Permalink
Post by Eric W. Biederman
However insert_resource does not quite match what I think needs to
happen. After a pci quirk applies insert_resource I will get
fff00000-ffffffff : BIOS ROM Window
  ffff0000-ffffffff : reserved
With the reserved region still present and marked as BUSY.
I would suggest ignoring it. Not only because being overly complicated is
bad, but simply because nobody should care.

At some point adding extra regions is _purely_ for "documentation"
reasons, and while that may be nice, it's not worth worrying about. The
only thing you really want from a _correctness_ standpoint is to make sure
that nobody else will try to allocate their stuff in that area, and your
"BIOS ROM Window" resource should do that already.
Post by Eric W. Biederman
Would it be reasonable to write a variant of request_resource that just
drops BIOS resources?
It would not be impossible to just have a "force_resource()" that would
simply override _any_ existing resource, but quite frankly, I'd be more
nervous about that.

We could also mark the e820 non-RAM resources with some special
IORESOURCE_TENTATIVE flag, and allow just overriding those.

But even the simple "insert_resource()" has some potential problems: if
the BIOS has allocated the minimal window for itself (64kB at 0xffff0000),
and has allocated some _other_ chip at 0xfffe0000 that the kernel doesn't
know about yet, your insert_resource() would do the wrong thing and claim
the whole area for the BIOS writing.

Maybe that doesn't happen, but it's something to think about.

At some point, the _correct_ answer may be: don't do complex things, and
write a bootable floppy (without any OS at all, or a really minimal one)
to do BIOS rom updates.

Linus
Eric W. Biederman
2004-01-07 17:32:04 UTC
Permalink
Post by Linus Torvalds
Post by Eric W. Biederman
However insert_resource does not quite match what I think needs to
happen. After a pci quirk applies insert_resource I will get
fff00000-ffffffff : BIOS ROM Window
  ffff0000-ffffffff : reserved
With the reserved region still present and marked as BUSY.
I would suggest ignoring it. Not only because being overly complicated is
bad, but simply because nobody should care.
At some point adding extra regions is _purely_ for "documentation"
reasons, and while that may be nice, it's not worth worrying about. The
only thing you really want from a _correctness_ standpoint is to make sure
that nobody else will try to allocate their stuff in that area, and your
"BIOS ROM Window" resource should do that already.
Right, it is a documentation thing. The case that causes me to pull
my hair is Itanium boards. Typically they have 6 or 7 1MB ROM chips
for their firmware. My goal in going down this road last time was so
user space could figure out which ROM chip is at which address and
how those correspond to mtd devices.

Using the existing interfaces to export this information looked like
the cleanest way to make certain that information was available until
I ran into snags like the above. And once I replace the BIOS I can
fix these things at the source, but...
Post by Linus Torvalds
Post by Eric W. Biederman
Would it be reasonable to write a variant of request_resource that just
drops BIOS resources.
It would not be impossible to just have a "force_resource()" that would
simply override _any_ existing resource, but quite frankly, I'd be more
nervous about that.
Same here.
Post by Linus Torvalds
We could also mark the e820 non-RAM resources with some special
IORESOURCE_TENTATIVE flag, and allow just overriding those.
But even the simple "insert_resource()" has some potential problems: if
the BIOS has allocated the minimal window for itself (64kB at 0xffff0000),
and has allocated some _other_ chip at 0xfffe0000 that the kernel doesn't
know about yet, your insert_resource() would do the wrong thing and claim
the whole area for the BIOS writing.
Maybe that doesn't happen, but it's something to think about.
Agreed. In practice it does not happen, but it is worth thinking
about.

The important thing to maintain is that nothing else grabs the
area the BIOS reserves with a dynamic resource. So as long as
there is a resource over that area the kernel is safe; even
if I do grab it with insert_resource it does not really matter
to the rest of the kernel, because someone has it.

The ROM chips actually have IDs so I can always positively identify
those, I just don't always know their count. The worst case would be
the ROM chip probe causing problems. And that can be avoided by
simply not loading the driver, so I think we are fairly safe.

In the case where I open up the decoder beyond the size it is
currently set for, I can test for conflicts from other devices.
The only reason I would not see another device at that point
is if either (a) there are ordering problems in the kernel or
(b) an SMM BIOS is doing truly stupid things. The case where
there is a device there and we aren't using it is not a problem
because I am just reserving a region of the address space.

Now that I have thought about it some more, I think the right
way to do IORESOURCE_TENTATIVE is, instead of removing tentative
resources, to just push them aside. So in my terrible case I would
get:

fff00000-ffffffff : BIOS ROM Window
  ffffffff-ffffffff : reserved

And that cleans up all of the structure freeing problems. I guess I
can do that right now with __request_resource after I find the
conflict and confirm it has the name "reserved". I still like the
tentative idea because then anyone else who needs the same
functionality would not need to reimplement it.
Post by Linus Torvalds
At some point, the _correct_ answer may be: don't do complex things, and
write a bootable floppy (without any OS at all, or a really minimal one)
to do BIOS rom updates.
That works to some extent. But it is actually a lot more dangerous because
you have to be there in person to verify everything is working fine, and to
insert the floppy. Doing it from Linux I can update an entire
cluster in a minute, and verify everything automatically. And
it happens faster because I can load it all over the network.

Eric
Eric W. Biederman
2004-01-08 19:34:48 UTC
Permalink
Post by Linus Torvalds
At some point, the _correct_ answer may be: don't do complex things, and
write a bootable floppy (without any OS at all, or a really minimal one)
to do BIOS rom updates.
ROM chips fall into the Linux mtd layer quite cleanly, and they are
just quirky enough that they need someplace where lots of eyes look at the
code, and lots of people use the code. And the Linux mtd layer
appears to be that place.

I have had enough success in actually using the Linux kernel for
flashing ROMs that it is becoming worthwhile to fix up the
last couple of annoying cases.

Plus I'm close to the point of finding some value in jffs2 and the
other flash filesystems, at which point I will need to use the mtd
layer anyway.

Eric

Russell King
2004-01-07 09:31:43 UTC
Permalink
Post by Eric W. Biederman
ffff0000-ffffffff : reserved
That last reserved region is 64K. Which, looking at the PCI registers,
is technically correct at the moment: only 64K happens to be decoded.
We already have this distinction between in use (or busy) resources and
allocated resources. Surely the BIOS ROM region should be an allocation
resource not a busy resource, so that the MTD driver can obtain a busy
resource against it?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Eric W. Biederman
2004-01-07 15:06:04 UTC
Permalink
Post by Russell King
Post by Eric W. Biederman
ffff0000-ffffffff : reserved
That last reserved region is 64K. Which, looking at the PCI registers,
is technically correct at the moment: only 64K happens to be decoded.
We already have this distinction between in use (or busy) resources and
allocated resources. Surely the BIOS ROM region should be an allocation
resource not a busy resource, so that the MTD driver can obtain a busy
resource against it?
Nope, the BIOS region is allocated as BUSY, at least as it comes
out of the E820 map.
From arch/i386/kernel/setup.c:legacy_init_iomem_resources
....
res->start = e820.map[i].addr;
res->end = res->start + e820.map[i].size - 1;
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
request_resource(&iomem_resource, res);

Eric
Russell King
2004-01-07 20:29:51 UTC
Permalink
Post by Eric W. Biederman
Post by Russell King
Post by Eric W. Biederman
ffff0000-ffffffff : reserved
That last reserved region is 64K. Which, looking at the PCI registers,
is technically correct at the moment: only 64K happens to be decoded.
We already have this distinction between in use (or busy) resources and
allocated resources. Surely the BIOS ROM region should be an allocation
resource not a busy resource, so that the MTD driver can obtain a busy
resource against it?
Nope, the BIOS region is allocated as BUSY, at least as it comes
out of the E820 map.
From arch/i386/kernel/setup.c:legacy_init_iomem_resources
....
res->start = e820.map[i].addr;
res->end = res->start + e820.map[i].size - 1;
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
request_resource(&iomem_resource, res);
I was hoping someone was going to take my comments as a suggestion for
a possible solution to the problem.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Eric W. Biederman
2004-01-06 22:56:34 UTC
Permalink
Post by Linus Torvalds
Post by Andi Kleen
IMHO the only reliable way to get physical bus space for mappings
is to allocate some memory and map the mapping over that.
You literally can't do that: the RAM addresses are decoded by the
northbridge before they ever hit the PCI bus, so it's impossible to "map
over" RAM in general.
On AMD CPUs, starting at least with the K7, it is a CPU function. They
have both memory access and IO access FSB cycles. The CPU decodes the
address by looking at the IORRs and TOP_MEM (IO range registers are
similar to MTRRs but for specifying IO regions).

Of course there are some northbridges that don't ignore the mem/io
bits..
Post by Linus Torvalds
Normally, the way this works is that there are magic northbridge mapping
registers that remap part of the memory,
So far I have only seen this on the Intel E7500 and its descendants.
Post by Linus Torvalds
so that the memory that is
physically in the upper 4GB of RAM shows up somewhere else (or just
possibly disappears entirely
Having the memory disappear entirely is much more common.
Post by Linus Torvalds
- once you have more than 4GB of RAM, you
might not care too much about a few tens of megs missing).
At least not until you plug in a card with a 256M PCI memory region
and lose half a gig of RAM.

There is also the trick of just not mapping the RAM into the address
space in a contiguous fashion. I have been very tempted lately to
just set up boxes with one DIMM below 4G and have all of the rest above
to make this easier. But 32-bit OSes, and the performance hit
they take when accessing memory above 4G, keep this from being a good idea yet.

Eric