Discussion:
The disappearing sys_call_table export.
Terje Eggestad
2003-05-05 08:19:45 UTC
Now that it seems everyone agrees that the sys_call_table
symbol shall not be exported to modules, is there any work in progress
to allow modules to get an event/notification whenever a specific
syscall is called?

We have a specific need to trace mmap() and sbrk() calls.
--
_________________________________________________________________________

Terje Eggestad mailto:***@scali.no
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
Christoph Hellwig
2003-05-05 08:23:24 UTC
Post by Terje Eggestad
Now that it seem that all are in agreement that the sys_call_table
symbol shall not be exported to modules, are there any work in progress
to allow modules to get an event/notification whenever a specific
syscall is being called?
No.
Post by Terje Eggestad
We have a specific need to trace mmap() and sbrk() calls.
Well, you get mmap events for your driver and I can't imagine a sane
reason for intercepting sbrk(). Do you have a pointer to the driver
source doing such strange things?
Terje Eggestad
2003-05-05 09:33:36 UTC
Unfortunately we live in an insane world.

First of all, in the Changelog where the export was removed for 2.5.41

http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41

Arjan lists 4 reasons for having the export in the first place, and I'm
on point 3. Here Arjan pretty much acknowledges that there is a
legitimate need for an event/hook system to be informed of a syscall.
The exact quote is: "Eg the use of the export in this just a bandaid due
to lack of a proper mechanism".

My argument for *why* there should be a mechanism stops here.


Since you're bright and inquisitive: the exact problem I'm facing is
pretty complex:


1. Performance is everything.
2. We're making an MPI library, and as such we don't have any control
over the application.
3a. The various hardware for cluster interconnect all work with DMA.
3b. The performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to do 3, the user pages must be pinned down.
5. The way MPI is written, it's not using a special malloc() to allocate
the send/receive buffers. It can't, since that would break the language
bindings to Fortran. Thus ANY writeable user page may be used.
6. Re point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
7. The only way to cache buffers (to see if they've been used before and
are hence pinned) is by user space virtual address. A syscall, and thus
an ioctl to a device file, is prohibitively expensive under point 1.
8a. If the app (glibc in practice, but you never know) uses sbrk() with a
negative argument and then a positive argument, I can get a different
set of user pages at the same address.
8b. Ditto with a munmap()/mmap() pair.
9. Since the number of times any 'realloc' may happen is << the number
of times any buffer may be used, it's necessary under point 1 to trace
changes in the mapping from virtual addresses to physical pages, rather
than test every time an address is used.
10. Kernel patches are impractical; I must be able to do this with
standard stock, Red Hat, AND SuSE kernels.
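
Points 6, 7 and 9 together describe a registration cache: pin a buffer once, remember its virtual address range, and skip the expensive call the next time the same range is used. A minimal user-space sketch of that idea follows. The names `ensure_pinned()`, `pin_region()` and the fixed-size table are assumptions made for illustration; the thread does not show the real library code.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64   /* arbitrary size chosen for this sketch */

struct pin_entry {
    uintptr_t start;     /* first byte of the pinned range */
    size_t    len;       /* length of the pinned range in bytes */
    int       valid;
};

static struct pin_entry cache[CACHE_SLOTS];
static unsigned long pin_calls;   /* counts trips down the slow path */

/* Hypothetical stand-in for the expensive driver call (ioctl + pinning). */
static void pin_region(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    pin_calls++;
}

/* Return 1 on a cache hit, 0 if the buffer had to be pinned first. */
int ensure_pinned(const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;
    int free_slot = -1;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (start >= cache[i].start &&
            start + len <= cache[i].start + cache[i].len)
            return 1;                 /* fast path: no syscall at all */
    }

    pin_region(buf, len);             /* slow path */
    if (free_slot < 0)
        free_slot = 0;                /* naive eviction, enough for a sketch */
    cache[free_slot] = (struct pin_entry){ start, len, 1 };
    return 0;
}

unsigned long pin_call_count(void)
{
    return pin_calls;
}
```

The lookup is keyed only on the virtual address, which is exactly why points 8a and 8b matter: nothing in this table notices when the pages behind an address change.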
Arjan van de Ven
2003-05-05 09:38:10 UTC
Post by Terje Eggestad
1. Performance is everything.
2. We're making an MPI library, and as such we don't have any control
over the application.
3a. The various hardware for cluster interconnect all work with DMA.
3b. The performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to do 3, the user pages must be pinned down.
see how AIO does this, and O_DIRECT, and rawio.

They all have the same requirement and manage to cope.
Terje Eggestad
2003-05-05 10:12:23 UTC
Post by Arjan van de Ven
Post by Terje Eggestad
1. Performance is everything.
2. We're making an MPI library, and as such we don't have any control
over the application.
3a. The various hardware for cluster interconnect all work with DMA.
3b. The performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to do 3, the user pages must be pinned down.
see how AIO does this, and O_DIRECT, and rawio.
They all have the same requirement and manage to cope.
OK, I haven't actually checked the code, but no, they don't have the
same requirement. They pin and unpin the user space memory at the
beginning and end of each operation.

Take this AIO pseudocode:

aio_write()
{
        pinmem();                       /* pin the user buffer for this one request */
        if (file)
                add_write_to_disk_queue();
        ...
}

kernel_aio_completion_handler()
{
        unpinmem();                     /* unpin as soon as the I/O has completed */
        send_completion_event_to_task();
}
Christoph Hellwig
2003-05-05 10:25:31 UTC
Post by Terje Eggestad
1. performance is everything.
then Linux is the wrong OS for you :)
Post by Terje Eggestad
2. We're making a MPI library, and as such we don't have any control
with the application.
I can't remember the MPI spec saying anything about intercepting
syscalls..
Post by Terje Eggestad
3b. the performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
so use get_user_pages.
Post by Terje Eggestad
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attacks due to the huge amount of pinned memory.
Post by Terje Eggestad
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..

Again, where's your driver source so we can help you to find a better
approach out of that mess?
Terje Eggestad
2003-05-05 11:23:19 UTC
Post by Christoph Hellwig
Post by Terje Eggestad
1. performance is everything.
then Linux is the wrong OS for you :)
Strangely enough not. You just have to try and stay out of the kernel as
much as possible ;-)

Of course some idiot sold the total-cost-of-ownership thingy of Linux to
the customers. What they really need is an OS/360...
Post by Christoph Hellwig
Post by Terje Eggestad
2. We're making a MPI library, and as such we don't have any control
with the application.
I can't remember that the MPI spec tells anything about intercepting
syscalls..
It says quite a bit about what memory can be used for comm buffers.
Post by Christoph Hellwig
Post by Terje Eggestad
3b. the performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
so use get_user_pages.
Let me clarify: pinning pages is not, repeat not, a problem.

The problem occurs when you
1. pin a buffer
2. sbrk(-n) or munmap() (usually through free()) the area the buffer is in
3. do a new malloc() that results in an sbrk(+n) or mmap()
4. then the new buffer has exactly the same virtual address as the previous one.

(Believe it or not, this happens, and relatively frequently.)
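
The four steps above can be reproduced in a few lines of user-space C. This is only a sketch: it uses `MAP_FIXED` to force the address reuse so that the outcome is deterministic, whereas in the situation described here the allocator simply happens to hand the same address back. The function name `same_address_new_page()` is made up for the example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a page, dirty it, unmap it, then map a page at the same address.
 * Returns 1 if the address is reused but the old contents are gone
 * (the new mapping is a fresh zero-filled page), 0 if not, -1 on error.
 */
int same_address_new_page(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;

    old[0] = 0xAB;                    /* step 1: the buffer is in use */

    if (munmap(old, len) != 0)        /* step 2: what free() may end up doing */
        return -1;

    /* step 3: a new mapping; MAP_FIXED only makes the demo deterministic */
    unsigned char *fresh = mmap(old, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                                -1, 0);
    if (fresh == MAP_FAILED)
        return -1;

    /* step 4: same virtual address, different page behind it */
    int reused = (fresh == old) && (fresh[0] == 0);

    munmap(fresh, len);
    return reused;
}
```

A cache keyed on the virtual address would treat `fresh` as already pinned, while the hardware would still be pointed at the page that used to back `old`.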
Post by Christoph Hellwig
Post by Terje Eggestad
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attacks due to the huge amount of pinned memory.
These are HPC clusters; DoS is a non-issue. These are not normal
multi-user systems. In fact you run one active process per CPU.
Post by Christoph Hellwig
Post by Terje Eggestad
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..
It's *FAST*.
Post by Christoph Hellwig
Again, where's your driver source so we can help you to find a better
approach out of that mess?
The trace module we made to trace munmap() and sbrk() could be opened,
but you'll be disappointed since all the pinning (get_user_pages() and
friends), send(), recv() etc. are in the drivers for the various hardware,
most of which are not our property.

The module works as follows. It catches sbrk(-arg) and munmap() and lays
out the trace info in a memory area that is mmap()'able through the
device file. So when processes need the trace info they have it in
memory and avoid doing an ioctl().

That's all we need in order to know whether a given virtual address
needs to be (re)pinned.
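
One way to picture how user space could consume such a shared trace area is a generation counter: the module bumps a counter on every sbrk(-n) or munmap(), and the library compares it with the value it saw last time. The sketch below assumes that layout. `struct trace_area`, its single `generation` field and the `reg_cache_*` names are hypothetical, since the message does not describe the real format of the mapped area.

```c
#include <stddef.h>

/* Hypothetical layout of the page exported by the trace module via mmap(). */
struct trace_area {
    volatile unsigned long generation;   /* bumped on sbrk(-n) / munmap() */
};

struct reg_cache {
    const struct trace_area *trace;      /* shared page, only read here */
    unsigned long seen;                  /* generation when cache was last valid */
    unsigned long flushes;               /* how often cached pinnings were dropped */
};

void reg_cache_init(struct reg_cache *c, const struct trace_area *t)
{
    c->trace = t;
    c->seen = t->generation;
    c->flushes = 0;
}

/*
 * Called on every send/recv. In the common case this is one memory load
 * and one compare, with no syscall.
 * Returns 1 if cached pinnings can still be trusted, 0 if an unmap
 * happened since the last check and buffers must be re-pinned.
 */
int reg_cache_still_valid(struct reg_cache *c)
{
    unsigned long now = c->trace->generation;

    if (now == c->seen)
        return 1;

    c->seen = now;
    c->flushes++;
    return 0;
}
```

This version throws away everything on any unmap. A finer-grained design would record the affected address ranges in the shared area and invalidate only the overlapping entries.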

Let's deal: I'll GPL the trace module if you get me an
EXPORT_SYMBOL_GPL(sys_call_table);



TJ
Arjan van de Ven
2003-05-05 11:27:27 UTC
Post by Terje Eggestad
Lets deal, I'll GPL the trace module if you get me a
EXPORT_SYMBOL_GPL(sys_call_table);
The sys_call_table is not unexported for license-political reasons.
It's unexported because there is no correct use for it, and it can't
be used correctly either. Tell me which lock your module uses to protect
modifications to it. Tell me how you handle other modules trying to
overload the same syscall, and those modules loading before your module
but then unloading while yours is still loaded.

It's the wrong mechanism to do ANYTHING. Really.
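
The load/unload ordering problem raised here can be shown with a plain user-space simulation. Nothing below is kernel code: `table_slot` stands in for one sys_call_table entry, and the two "modules" are ordinary structs that save and restore the slot the way a naive hooking module would. All names are invented for the illustration.

```c
#include <stddef.h>

typedef long (*syscall_fn)(void);

static long original_call(void) { return 0; }
static long hook_a(void)        { return 1; }   /* first module's wrapper */
static long hook_b(void)        { return 2; }   /* second module's wrapper */

/* Stand-in for a single sys_call_table slot. */
static syscall_fn table_slot = original_call;

struct fake_module {
    syscall_fn hook;
    syscall_fn saved;    /* whatever was in the slot at load time */
};

static void load(struct fake_module *m, syscall_fn hook)
{
    m->hook = hook;
    m->saved = table_slot;    /* no lock orders this against other modules */
    table_slot = hook;
}

static void unload(struct fake_module *m)
{
    table_slot = m->saved;    /* blindly restore the remembered pointer */
}

/*
 * Sequence: A loads, B loads, A unloads, B unloads.
 * Bit 0 of the result is set if B's hook was silently removed while B
 * was still loaded; bit 1 is set if the slot ends up pointing at A's
 * hook after A has been unloaded.
 */
int run_unload_out_of_order(void)
{
    struct fake_module a, b;
    int result = 0;

    table_slot = original_call;
    load(&a, hook_a);
    load(&b, hook_b);

    unload(&a);
    if (table_slot != hook_b)
        result |= 1;

    unload(&b);
    if (table_slot == hook_a)
        result |= 2;

    return result;
}
```

Both bits end up set. In a real kernel the second case would leave the table pointing into memory that belonged to a module that is no longer there.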
Terje Eggestad
2003-05-05 11:31:19 UTC
Post by Terje Eggestad
Post by Christoph Hellwig
Again, where's your driver source so we can help you to find a better
approach out of that mess?
The trace module we made to trace munmap() and sbrk() could be opened,
but you'll be disappointed since all the pinning ( get_user_pages() and
friends), send() recv() etc are in the drivers for the various hardware,
most of which are not our property.
The module works as follows. It catches sbrk(-arg) and munmap() and lays
out the trace info in a memory area mmap()'able thru the device file.
So when processes need the trace info they have the info in memory to
avoid doing a ioctl().
Thats all we need to know if a given virtual address needs to be
(re)pinned.
In all fairness this should be done in glibc, but the task of getting it
done there was several orders of magnitude larger than just adding the
syscall intercepts. Serves you right for writing clean code :-)

The thing is of course this *worked* until someone decided to remove the
export of sys_call_table.

Which is a decision that is most probably right, I just need another way
of getting a hook or notification of the sys calls.
Arjan van de Ven
2003-05-05 11:33:30 UTC
Post by Terje Eggestad
In all fairness this should be done in glibc,
... or a LD_PRELOAD library......
Arjan van de Ven
2003-05-05 14:59:20 UTC
Post by Arjan van de Ven
Post by Terje Eggestad
In all fairness this should be done in glibc,
... or a LD_PRELOAD library......
which doesn't work with statically linked binaries, does it?
good thing the LGPL on glibc requires a relinkable version to be offered
as well ;)
Tigran Aivazian
2003-05-05 15:53:12 UTC
Post by Arjan van de Ven
Post by Terje Eggestad
In all fairness this should be done in glibc,
... or a LD_PRELOAD library......
which doesn't work with statically linked binaries, does it?

Regards
Tigran
Christoph Hellwig
2003-05-05 14:57:45 UTC
Post by Arjan van de Ven
... or a LD_PRELOAD library......
which doesn't work with statically linked binaries, does it?
No. But given the source to the application you can
easily override glibc's weak malloc symbol at link-time.
Christoph Hellwig
2003-05-05 12:52:11 UTC
Post by Terje Eggestad
The problem occurs when you
1. pin a buffer
2. sbrk(-n) or munmap() (usually through free()) the area the buffer is in
3. do a new malloc() that results in an sbrk(+n) or mmap()
4. then the new buffer has exactly the same virtual address as the previous one.
(Believe it or not, this happens, and relatively frequently.)
That only shows that you really don't want to use glibc's malloc and
sbrk implementations, but ones that are implemented as mmap in your
driver so you can keep track of it properly. LD_PRELOAD is your friend.
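
As a rough illustration of the LD_PRELOAD route, the sketch below defines a `munmap()` with the same signature as libc's, records the call, and then performs the real system call through `syscall()`. The counter and the `note_unmap()` hook are hypothetical placeholders for whatever bookkeeping a real library would do. Many interposers forward with `dlsym(RTLD_NEXT, ...)` instead; the raw syscall is used here only to keep the example free of extra link dependencies.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static unsigned long unmap_events;   /* bumped on every intercepted munmap() */

/* Hypothetical hook: a real library would drop cached pinnings here. */
static void note_unmap(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    unmap_events++;
}

/*
 * Same name and signature as libc's munmap(). When this object is
 * preloaded, callers that go through the dynamic symbol resolve here.
 */
int munmap(void *addr, size_t len)
{
    note_unmap(addr, len);
    return (int)syscall(SYS_munmap, addr, len);
}

unsigned long unmap_event_count(void)
{
    return unmap_events;
}
```

Built as a shared object (for example `gcc -shared -fPIC -o libtrace.so trace.c`, file names being placeholders) it would be activated with `LD_PRELOAD=./libtrace.so ./app`. One caveat worth checking on the target system: calls that glibc makes internally, such as the unmap done from inside free(), may not go through the interposable symbol, which is one reason replacing malloc() itself also comes up in this thread.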
Post by Terje Eggestad
Lets deal, I'll GPL the trace module if you get me a
EXPORT_SYMBOL_GPL(sys_call_table);
Who cares about your trace module? That's the wrong approach to start
with. And the removal of the sys_call_table export is not a political
issue but a technical one. The interesting thing would be your memory
manager, but given the above hints you really should be able to fix it yourself
now..
Christoph Hellwig
2003-05-05 13:43:54 UTC
temper, temper
pls read my reply to alan carefully .
Doing own malloc(), free(), m[un]map(), is a possibility we've
considered. Since we've got our own lib linked with the app, we probably
wouldn't even need LD_PRELOAD. our main issue is that not everything is
gcc/g77.
Well, if the compiler doesn't play nicely with that, that's your / the
compiler vendor's problem. Especially if it's not available in source
code..
Terje Eggestad
2003-05-05 13:50:26 UTC
OK

My warped, insane problem aside, let me ask this:


"Do you acknowledge a legitimate need to have syscall hooks?"
Arjan van de Ven
2003-05-05 13:54:44 UTC
Post by Terje Eggestad
OK
"Do you acknowledge a legitimate need to have syscall hooks?"
Not as a general thing.

there are specific cases when some notification is needed, see for example
oprofile in 2.5....
Christoph Hellwig
2003-05-05 13:55:19 UTC
Post by Terje Eggestad
"Do you acknowledge a legitimate need to have syscall hooks?"
No.
Carl-Daniel Hailfinger
2003-05-05 14:28:08 UTC
Post by Terje Eggestad
"Do you acknowledge a legitimate need to have syscall hooks?"
No.
LSM?


Regards,
Carl-Daniel
Christoph Hellwig
2003-05-05 14:34:14 UTC
LSM?
LSM is explicitly not syscall hooks. And educated readers of lkml should
know my opinion on LSM...
Carl-Daniel Hailfinger
2003-05-05 15:25:39 UTC
Post by Christoph Hellwig
LSM?
LSM is explicitly not syscall hooks. And educated readers of lkml should
Yes, sorry, I mixed that up with an old Usenix paper.
Post by Christoph Hellwig
know my opinion on LSM...
Um yeah.
/me puts on asbestos suit
I remember your patch to remove the nested syscall (sys_security) for
LSM quite well.

Carl-Daniel
Terje Eggestad
2003-05-05 13:41:23 UTC
Temper, temper.

Please read my reply to Alan carefully.

Doing our own malloc(), free(), m[un]map() is a possibility we've
considered. Since we've got our own lib linked with the app, we probably
wouldn't even need LD_PRELOAD. Our main issue is that not everything is
gcc/g77.

Of all the approaches, the syscall traps were the least intrusive and
most portable, believe it or not.

BTW: these are all technical issues.
Eric W. Biederman
2003-05-06 07:30:35 UTC
Post by Christoph Hellwig
Post by Terje Eggestad
1. performance is everything.
then Linux is the wrong OS for you :)
Post by Terje Eggestad
2. We're making a MPI library, and as such we don't have any control
with the application.
I can't remember that the MPI spec tells anything about intercepting
syscalls..
Post by Terje Eggestad
3b. the performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
Looking at the MPI spec, there are two forms of point-to-point communication:
1) mpi_send/mpi_recv, which do have that limitation.
2) mpi_put/mpi_get, which are restricted to be used with a specifically
allocated window, and the window can be restricted to areas allocated
with mpi_alloc_mem.

So the mpi_put/mpi_get should be easy to optimize.

Handling mpi_send/mpi_recv is more difficult. MPI specifies
that the data can be copied, it just does not require it, so in
sufficiently weird situations a copy slow path can be taken.

So there are really two questions here.
1) What is a clean way to provide a high performance message
passing layer, assuming you have a network card for which
it is safe to mmap a subset of control registers?

2) What is a good way to map MPI onto that clean layer?

I believe the answer on how to do a clean, safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it, with the driver mmap operation
informing the network card of the mapping.

A good implementation of MPI on top of that is an interesting
question. Replacing malloc and free and having everything run on
top of the mmapped buffer sounds like a possibility. But it is
additionally desirable for the memory used by an MPI job to come
from hugetlbfs, or the equivalent. And I don't know if a driver
can provide huge pages.
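
To make the "malloc on top of one mmapped buffer" idea concrete, here is a small bump-allocator sketch. Anonymous memory stands in for the mapping a driver would hand out through its device file, so nothing here talks to real hardware; `struct arena` and the `arena_*` functions are names invented for the example, and freeing individual buffers is deliberately left out.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

struct arena {
    unsigned char *base;   /* start of the registered region */
    size_t size;           /* total bytes in the region */
    size_t used;           /* bump pointer */
};

/*
 * Map the region once. With a real driver this would be mmap() on the
 * device file, and the driver's mmap operation would pin the pages and
 * tell the card about them. Returns 0 on success, -1 on failure.
 */
int arena_init(struct arena *a, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* Carve out a 64-byte-aligned buffer; NULL when the region is exhausted. */
void *arena_alloc(struct arena *a, size_t n)
{
    size_t start = (a->used + 63) & ~(size_t)63;

    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* 1 if [p, p+n) lies inside the region and so needs no further pinning. */
int arena_contains(const struct arena *a, const void *p, size_t n)
{
    uintptr_t lo = (uintptr_t)a->base;
    uintptr_t x = (uintptr_t)p;

    return x >= lo && n <= a->size && x - lo <= a->size - n;
}
```

Because every buffer comes from a region that was registered once, the per-message check shrinks to the range test in `arena_contains()`, and the mapping never changes underneath the card for the lifetime of the job.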

At this point I am strongly tempted to see what it would take to come
up with an MPI-2.1 to fix this issue.
Post by Christoph Hellwig
so use get_user_pages.
Post by Terje Eggestad
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attacks due to the huge amount of pinned memory.
I wonder if there is an easy way to optimize this if you don't have
swap configured. In general it is a bug if an MPI job swaps.

In general there is one MPI process per CPU running on a machine. So
I have trouble seeing this as a denial of service.
Post by Christoph Hellwig
Post by Terje Eggestad
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..
Again, where's your driver source so we can help you to find a better
approach out of that mess?
With some digging I can find the source for both the Quadrics and
Myrinet drivers, and they have the same issues. This is a general problem
for running MPI jobs, so it is probably worth finding a solution that
works for those people whose source we can obtain.

Eric
Terje Eggestad
2003-05-06 08:14:37 UTC
Post by Eric W. Biederman
Looking at the mpi spec there are two forms of point to point communications.
1) mpi_send/mpi_recv which do have that limitation.
2) mpi_put/mpi_get which are restricted to be used with a specifically
allocated window, and the window can be restricted to areas allocated
with mpi_alloc_mem.
So the mpi_put/mpi_get should be easy to optimize.
Handling mpi_send/mpi_recv is more difficult. MPI specifies
that the data can be copied it just does not require it so in
sufficiently weird situations a copy slow path can be taken.
So there are really two questions here.
1) What is a clean way to provide a high performance message
passing layer. Assuming you have a network card for which
it is safe to mmap a subset of control registers.
2) What is a good way to map MPI onto that clean layer.
Pretty much all applications use send/recv.
Post by Eric W. Biederman
I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it. With the driver mmap operation
informing the network card of the mapping.
You can't mmap() a buffer every time you're going to do a send/recv, it's
way too costly.
Post by Eric W. Biederman
A good implementation of mpi on top of that is an interesting
question. Replacing malloc and free and having everything run on
top of the mmapped buffer sounds like a possibility. But it is
additionally desirable for the memory used by an MPI job to come
from hugetlbfs, or the equivalent. And I don't know if a driver
can provide huge pages.
At this point I am strongly tempted to see what it would take to come
up with an MPI-2.1 to fix this issue.
All current MPI apps use MPI-1.
Post by Eric W. Biederman
Post by Christoph Hellwig
so use get_user_pages.
Post by Terje Eggestad
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attachs due to the huge amount of memory.
I wonder if there is an easy way to optimize this if you don't have
swap configured. In general it is a bug if an MPI job swaps.
Hmm, it's not a problem as long as you only page out data pages used only
during initialization, or pages that are used very infrequently. That is
actually a good thing, since you could fit a bit more live data in
memory.
Post by Eric W. Biederman
In general there is one mpi process per cpu running on a machine. So
I have trouble seeing this as a denial of service.
Post by Christoph Hellwig
Post by Terje Eggestad
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..
Again, where's your driver source so we can help you to find a better
approach out of that mess?
With some digging I can find the source for both quadrics and myrinet
drivers, and they have the same issues. This is a general problem
for running MPI jobs so it is probably worth finding a solution that
works for those people whose source we can obtain.
Hmm, no, the drivers don't have the issue, the MPI implementations do.
The two approaches in use are 1) replace malloc() and friends, which
breaks with Fortran 90 compilers, and 2) tell glibc never to release
allocated memory through sbrk(-n) or munmap(), which also breaks with
f90 compilers and runs the risk of bloating memory usage.
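
For the second approach, glibc exposes tuning knobs through `mallopt()`. The sketch below shows one commonly used combination; the wrapper name `keep_heap_mapped()` is made up, and whether these exact settings are what the implementations mentioned here used is an assumption.

```c
#include <malloc.h>

/*
 * Ask glibc's malloc to hold on to memory instead of handing it back
 * to the kernel. Returns 1 if both settings were accepted, 0 otherwise.
 */
int keep_heap_mapped(void)
{
    /* Never trim the top of the heap, i.e. no negative sbrk(). */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);

    /* Never serve large requests with mmap(), so free() has nothing
     * to munmap(). */
    int ok_mmap = mallopt(M_MMAP_MAX, 0);

    return ok_trim && ok_mmap;
}
```

It would typically be called once, early in library initialization. It only influences glibc's own allocator, so a runtime that ships its own allocator or calls munmap() directly is not covered, and memory that is never returned is the bloat risk mentioned above.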
Eric W. Biederman
2003-05-06 09:21:28 UTC
Post by Terje Eggestad
Post by Eric W. Biederman
Handling mpi_send/mpi_recv is more difficult. MPI specifies
that the data can be copied it just does not require it so in
sufficiently weird situations a copy slow path can be taken.
So there are really two questions here.
1) What is a clean way to provide a high performance message
passing layer. Assuming you have a network card for which
it is safe to mmap a subset of control registers.
2) What is a good way to map MPI onto that clean layer.
All applications pretty much uses send/recv.
Post by Eric W. Biederman
I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it. With the driver mmap operation
informing the network card of the mapping.
You can't mmap() a buffer every time your going to do a send/recv, it's
way to costly.
Definitely not. But if the memory malloc returns is originally
from a mmaped buffer area (mmaped from your driver) it can be useful.
I assume somewhere your card has the smarts to transform virtual to
physical addresses and this is what the mmap sets up.

That can be handled in user space by querying the mmapped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least to limit the set of physical pages a
user space process can do DMA to/from, that is a fundamental security
issue and means all of the optimizations are not safe. And then you must
enter/exit the kernel to send a DMA transaction.
Post by Terje Eggestad
Post by Eric W. Biederman
A good implementation of mpi on top of that is an interesting
question. Replacing malloc and free and having everything run on
top of the mmapped buffer sounds like a possibility. But it is
additionally desirable for the memory used by an MPI job to come
from hugetlbfs, or the equivalent. And I don't know if a driver
can provide huge pages.
At this point I am strongly tempted to see what it would take to come
up with an MPI-2.1 to fix this issue.
all current MPI apps uses MPI-1
Given that mpich does not even implement mpi_put/mpi_get, I can
easily believe it for this case. All of the MPI file I/O, which
does get used at least to some extent, is also part of MPI-2.
Post by Terje Eggestad
Post by Eric W. Biederman
Post by Christoph Hellwig
so use get_user_pages.
Post by Terje Eggestad
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attachs due to the huge amount of memory.
I wonder if there is an easy way to optimize this if you don't have
swap configured. In general it is a bug if an MPI job swaps.
hmm, it's not a problem as long as you only page out data page used only
under initialization, or pages that are used very infrequent. That is
actually a good thing, since you could fit a bit more live data in
memory.
Right. Defining it as a bug was to emphasize the point that paging is
a non-issue and for the most part an MPI job is already pinned in
memory. I totally agree that having swapping enabled and being able
to page out every unused page is useful.
Post by Terje Eggestad
Post by Eric W. Biederman
In general there is one mpi process per cpu running on a machine. So
I have trouble seeing this as a denial of service.
Post by Christoph Hellwig
Post by Terje Eggestad
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..
Again, where's your driver source so we can help you to find a better
approach out of that mess?
With some digging I can find the source for both quadrics and myrinet
drivers, and they have the same issues. This is a general problem
for running MPI jobs so it is probably worth finding a solution that
works for those people whose source we can obtain.
Hmm, no the drivers, don't have the issue, the MPI implementations do.
The drivers have the issue of how to provide an interface for
the MPI implementation that sits on top of them. I totally agree this
looks like a bug in MPI.
Post by Terje Eggestad
The two used approaches are 1) replace malloc() and friends, which break
with fortran 90 compilers 2) tell glibc never to release alloced memory
thru sbrk(-n) or munmap() which also break with f90 compilers, and run
the risk of bloating memory usage.
Actually there is a third. Hack the VM layer and require a highly
patched kernel. That is the approach Quadrics was using last time I
looked, although they promised something different in their next major
rev.

Is it PGI's or Intel's f90 compiler that breaks, and how does it break?
Replacing malloc and friends should be well defined if you simply
replace or wrap the symbols glibc provides.

Quite possibly the answer is to call those compilers ABI
non-conformant and get them fixed. Especially given that they are not
compatible with g77 in Fortran mode, there is a good case for this. By
default the native compiler is correct.

So far the only Fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?


Eric
Terje Eggestad
2003-05-06 11:21:39 UTC
Post by Eric W. Biederman
Post by Terje Eggestad
Post by Eric W. Biederman
I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it. With the driver mmap operation
informing the network card of the mapping.
You can't mmap() a buffer every time your going to do a send/recv, it's
way to costly.
Definitely not. But if the memory malloc returns is originally
from a mmaped buffer area (mmaped from your driver) it can be useful.
I assume somewhere your card has the smarts to transform virtual to
physical addresses and this is what the mmap sets up.
The problem I've got happens when an app registers the memory with the
driver, then releases the memory back to the kernel thru sbrk(-n) or
munmap()s it, and then gets new memory thru sbrk(+n) or mmap() which
ends up at the same vaddr.

The mapping from vaddr to phys addr happens at the registration point.

Querying the kernel for a vaddr's phys addr every time it's used is too
costly. There is a better explanation in an earlier post.
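To make the hazard concrete, here is a minimal user-space sketch of a registration cache keyed by virtual address range. All names (reg_cache, reg_cache_invalidate, the integer handle) are hypothetical and are not taken from any real MPI library or driver. The only point is that a cached entry stays "valid" until something tells the library that the range was unmapped, which is the notification a syscall hook would have to supply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CACHE_SLOTS 16

/* One cached registration: a pinned virtual range and the (hypothetical)
 * handle the driver returned when the range was registered. */
struct reg_entry {
    uintptr_t vaddr;
    size_t len;
    unsigned long handle;
    int valid;
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
};

void reg_cache_init(struct reg_cache *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++)
        c->slot[i].valid = 0;
}

/* Remember a registration; returns 0 on success, -1 if the cache is full. */
int reg_cache_insert(struct reg_cache *c, uintptr_t vaddr, size_t len,
                     unsigned long handle)
{
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        if (!c->slot[i].valid) {
            c->slot[i] = (struct reg_entry){ vaddr, len, handle, 1 };
            return 0;
        }
    }
    return -1;
}

/* Fast path on every send/recv: a pure user-space lookup, no syscall.
 * Returns 1 and stores the handle if [vaddr, vaddr+len) lies inside a
 * cached registration, 0 otherwise. */
int reg_cache_lookup(const struct reg_cache *c, uintptr_t vaddr, size_t len,
                     unsigned long *handle)
{
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        const struct reg_entry *e = &c->slot[i];
        if (e->valid && vaddr >= e->vaddr &&
            vaddr + len <= e->vaddr + e->len) {
            *handle = e->handle;
            return 1;
        }
    }
    return 0;
}

/* Slow path, to be called when [start, start+len) is unmapped: drop every
 * overlapping entry. Returns the number of entries dropped. */
size_t reg_cache_invalidate(struct reg_cache *c, uintptr_t start, size_t len)
{
    size_t dropped = 0;
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && start < e->vaddr + e->len && e->vaddr < start + len) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

If reg_cache_invalidate() is never called after a munmap() or sbrk(-n), a later reg_cache_lookup() on a reused vaddr still reports a hit, and the handle refers to physical pages the process no longer owns.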
Post by Eric W. Biederman
That can be handled in user space by querying the mmaped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least limit the set of physical pages a
user space can do DMA to/from, that is a fundamental security issue and
means all of the optimizations are not safe. And you must enter/exit
the kernel to send a DMA transaction.
send/recv don't need kernel interaction on high perf interconnects.
Post by Eric W. Biederman
Post by Terje Eggestad
The two used approaches are 1) replace malloc() and friends, which break
with fortran 90 compilers 2) tell glibc never to release alloced memory
thru sbrk(-n) or munmap() which also break with f90 compilers, and run
the risk of bloating memory usage.
Actually there is a third. Hack the vm layer and require a highly
patched kernel. That is the approach quadrics was using last time I
looked although they promised something different in their next major
rev.
Is it PGI's or Intel's f90 compilers that break, and how do they break?
Replacing malloc and friends should be well defined if you simply
replace or wrap the symbols glibc provides.
Quite possibly the answer is to call those compilers ABI
non-conformant and get them fixed. Especially given that they are not
compatible with g77 in fortran mode there is a good case for this. By
default the native compiler is correct.
So far the only fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?
Some don't use (g)libc, but do syscalls directly.
Post by Eric W. Biederman
Eric
Eric W. Biederman
2003-05-06 11:37:36 UTC
Permalink
Post by Terje Eggestad
Post by Eric W. Biederman
Post by Terje Eggestad
Post by Eric W. Biederman
I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it. With the driver mmap operation
informing the network card of the mapping.
You can't mmap() a buffer every time you're going to do a send/recv; it's
way too costly.
Definitely not. But if the memory malloc returns is originally
from a mmaped buffer area (mmaped from your driver) it can be useful.
I assume somewhere your card has the smarts to transform virtual to
physical addresses and this is what the mmap sets up.
The problem I've got happens when an app registers the memory with the
driver, then releases the memory back to the kernel thru sbrk(-n) or
munmap()s it, and then gets new memory thru sbrk(+n) or mmap() which
ends up at the same vaddr.
The mapping from vaddr to phys addr happens at the registration point.
I was talking about a method that does not require a registration
point. So it sounds like we are talking past each other on this one.
Post by Terje Eggestad
Querying the kernel for a vaddr's phys addr every time it's used is too
costly. There is a better explanation in an earlier post.
There are 2 possible interfaces to get a vaddr to phys addr mapping.
1) Register the memory area and pin it down.
2) MMap from memory allocated by the driver.
In this case the driver is told about every mmap and unmap.

So I believe that, barring the strange issues with hooking malloc
to call a mmap function on your driver, 2 is the correct solution.
Post by Terje Eggestad
Post by Eric W. Biederman
That can be handled in user space by querying the mmaped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least limit the set of physical pages a
user space can do DMA to/from, that is a fundamental security issue and
means all of the optimizations are not safe. And you must enter/exit
the kernel to send a DMA transaction.
send/recv don't need kernel interaction on high perf interconnects.
Agreed. I was simply mentioning the requirements for that to be safe.
Post by Terje Eggestad
Post by Eric W. Biederman
So far the only fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?
Some don't use (g)libc, but do syscalls directly.
That is clearly broken code. A user space application compiled statically is
one thing. Directly putting syscalls in libraries other than libc is
quite bad. And I currently cannot think of an excuse for it.
Especially as that will ensure you use the slow syscall path on recent
kernels.

Eric
Terje Eggestad
2003-05-06 12:08:30 UTC
Permalink
Post by Eric W. Biederman
Post by Terje Eggestad
The problem I've got happens when an app registers the memory with the
driver, then releases the memory back to the kernel thru sbrk(-n) or
munmap()s it, and then gets new memory thru sbrk(+n) or mmap() which
ends up at the same vaddr.
The mapping from vaddr to phys addr happens at the registration point.
I was talking about a method that does not require a registration
point. So it sounds like we are talking past each other on this one.
Post by Terje Eggestad
Querying the kernel for a vaddr's phys addr every time it's used is too
costly. There is a better explanation in an earlier post.
There are 2 possible interfaces to get a vaddr to phys addr mapping.
1) Register the memory area and pin it down.
2) MMap from memory allocated by the driver.
In this case the driver is told about every mmap and unmap.
So I believe that, barring the strange issues with hooking malloc
to call a mmap function on your driver, 2 is the correct solution.
Well, since the memory is already alloc'ed as normal user memory, it's
gotta be 1), which requires a registration point.
Post by Eric W. Biederman
Post by Terje Eggestad
Post by Eric W. Biederman
That can be handled in user space by querying the mmaped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least limit the set of physical pages a
user space can do DMA to/from, that is a fundamental security issue and
means all of the optimizations are not safe. And you must enter/exit
the kernel to send a DMA transaction.
send/recv don't need kernel interaction on high perf interconnects.
Agreed. I was simply mentioning the requirements for that to be safe.
Post by Terje Eggestad
Post by Eric W. Biederman
So far the only fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?
Some don't use (g)libc, but do syscalls directly.
That is clearly broken code. A user space application compiled statically is
one thing. Directly putting syscalls in libraries other than libc is
quite bad. And I currently cannot think of an excuse for it.
Especially as that will ensure you use the slow syscall path on recent
kernels.
Agree, come to think about it, if you write code in fortran it's broken
by default ;-)

The thing is of course that pesky customers have fortran code they need
to run, and as long as there is no g90, and g77 performance sucks, there are
only commercial fortran compilers in play....
Post by Eric W. Biederman
Eric
TJ
Alan Cox
2003-05-05 11:16:43 UTC
Permalink
Post by Terje Eggestad
1. performance is everything.
Then you can live with building custom patched kernels
Post by Terje Eggestad
2. We're making a MPI library, and as such we don't have any control
with the application.
LD_PRELOAD
Post by Terje Eggestad
3c. It's therefore necessary for HW to access user pages.
Like TV cards do. That isn't hard
Post by Terje Eggestad
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
Well not all the pages are guaranteed DMAable, so I guess you already
lost.
Post by Terje Eggestad
10. kernel patches are impractical, I must be able to do this with std
stock, redhat, AND suse kernels.
So you want every vendor to screw up their kernels and the base kernel
for an obscure (but fun) corner case. That's not a rational choice is it.
You want "performance is everything" you pay the price, don't make
everyone suffer.
Terje Eggestad
2003-05-05 13:23:39 UTC
Permalink
Post by Alan Cox
Post by Terje Eggestad
1. performance is everything.
Then you can live with building custom patched kernels
If there were numerous issues, sure. But every time we get to the point
where it seems that is necessary, we find a workaround.
Right now, this is the ONLY issue we've got.
Post by Alan Cox
Post by Terje Eggestad
2. We're making a MPI library, and as such we don't have any control
with the application.
LD_PRELOAD
In general, LD_PRELOAD is fun for testing and academic programs, but not
for production code.

Specifically, you run into a problem with how fortran 90 compilers do
dynamic arrays. It's very compiler dependent.
Post by Alan Cox
Post by Terje Eggestad
3c. It's therefore necessary for HW to access user pages.
Like TV cards do. That isn't hard
nobody said it is.
Post by Alan Cox
Post by Terje Eggestad
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
Well not all the pages are guaranteed DMAable, so I guess you already
lost.
Nope. The drivers test to see if the page is DMAable, and do a copy if
necessary. Most high-performance interconnect NICs do 64-bit
PCI.
Post by Alan Cox
Post by Terje Eggestad
10. kernel patches are impractical, I must be able to do this with std
stock, redhat, AND suse kernels.
So you want every vendor to screw up their kernels and the base kernel
for an obscure (but fun) corner case. That's not a rational choice is it.
You want "performance is everything" you pay the price, don't make
everyone suffer.
No! I don't disagree with removing the export of sys_call_table!

I just want the "proper mechanism" indicated by Arjan in the changelog.
Pls read this thread. There are legitimate uses for syscall
hooks/notifications, whether you think mine is one or not.
Terje Malmedal
2003-05-08 12:25:51 UTC
Permalink
[Alan Cox]
Post by Alan Cox
Post by Terje Eggestad
10. kernel patches are impractical, I must be able to do this with std
stock, redhat, AND suse kernels.
So you want every vendor to screw up their kernels and the base kernel
for an obscure (but fun) corner case. That's not a rational choice is it.
You want "performance is everything" you pay the price, don't make
everyone suffer.
Hmm. sys_call_table is gone? That's sad.

How about a

EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);

and displaying a nasty warning message on the console whenever a
module used it?

It is rare that I need to use it, but when I do I need it bad, for instance:

fsync on large files used to have severe performance problems, I was
able to just change sys_fsync to be a call to sys_sync without
rebooting or even restarting the database(Solid) before the problem
got out of hand.

A server for an online internet game had several months of uptime and
I needed to rotate the log-files so I made a module which trapped
sys_write and closed and reopened the file with a new name before
continuing[1].

Even if it is discouraged for normal use it is a very nice thing to
have to fix up various surprises.

I know I can still use the Phrack technique, but somehow I am not
convinced that I can rely on it being available.
--
- Terje
***@usit.uio.no

[1] When I do this kind of thing now I do:
(gdb) attach 9597
(gdb) call close(7)
(gdb) call open("out.txt",0100 | 01, 0666 )
(gdb) cont

This did not work back then however.
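The gdb session above relies on open() returning the lowest free descriptor, so the new file lands on fd 7 again as long as no lower descriptor is free. A minimal sketch of the same rotation done from inside the program, for example from a signal-triggered code path: reopen_fd() is a hypothetical helper name, and it uses dup2() so the result does not depend on which descriptor numbers happen to be free.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Point the already-open descriptor `fd` at `path` without changing its
 * number, so code that keeps writing to `fd` ends up in the new file.
 * Returns 0 on success, -1 on failure (in which case `fd` is untouched). */
int reopen_fd(int fd, const char *path)
{
    assert(fd >= 0 && path != NULL);

    int newfd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (newfd < 0)
        return -1;

    if (newfd != fd) {
        /* dup2() closes the old target of fd and redirects it atomically. */
        if (dup2(newfd, fd) < 0) {
            close(newfd);
            return -1;
        }
        close(newfd);
    }
    return 0;
}
```

A caller would typically rename the old log first and then call reopen_fd(7, "out.txt"), which mirrors the close/open pair typed into gdb.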
Christoph Hellwig
2003-05-08 12:29:31 UTC
Permalink
Post by Terje Malmedal
EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
and displaying a nasty warning message on the console whenever a
module used it?
What about just adding the EXPORT_SYMBOL() yourself to your kernels
if you think you need it so badly because you can't screw yourself
enough without it?
Terje Malmedal
2003-05-08 13:18:29 UTC
Permalink
[Christoph Hellwig]
Post by Christoph Hellwig
Post by Terje Malmedal
EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
and displaying a nasty warning message on the console whenever a
module used it?
What about just adding the EXPORT_SYMBOL() yourself to your kernels
if you think you need it so badly because you can't screw yourself
enough without it?
And if I wish to help somebody running a kernel I didn't compile?

Do you have anything constructive to say about the situation I referred
to:

A database is starting to run slower and slower, turns out that this
is because fsync() is inefficient on large files. Rebooting the server
or restarting the database is undesirable even at night.

?

I was able to fix this without rebooting or restarting the database.
How do you propose to fix something similar today without having
sys_call_table exported?

Also what exactly is the badness people are complaining about, if I do:

int init_module(void)
{
        orig_fsync = sys_call_table[SYS_fsync];
        sys_call_table[SYS_fsync] = hacked_fsync;
        return 0;
}

void cleanup_module(void)
{
        sys_call_table[SYS_fsync] = orig_fsync;
}

The only problem I can see is that different modules overloading the
same function needs to be unloaded in the correct order. Is this the
only reason for removing it, or am I missing something?
--
- Terje
***@usit.uio.no
Christoph Hellwig
2003-05-08 14:25:43 UTC
Permalink
Post by Terje Malmedal
And if I wish to help somebody running a kernel I didn't compile?
recompile it. binary patch it. I don't care. Linux is free software
so you're allowed to change whatever you want. Just don't annoy us
about fixing problems in mainline.
Post by Terje Malmedal
Do you have anything constructive to say about the situation I referred
A database is starting to run slower and slower, turns out that this
is because fsync() is inefficient on large files. Rebooting the server
or restarting the database is undesirable even at night.
fix the database. hey, if you think it's so important fork the kernel.
if there's enough people that agree with you your fork will be mainline
some day. It's really _that_ easy.
Post by Terje Malmedal
The only problem I can see is that different modules overloading the
same function needs to be unloaded in the correct order. Is this the
only reason for removing it, or am I missing something?
it's racy - and it doesn't work on half of the arches added over the
last years.
Terje Malmedal
2003-05-08 15:29:55 UTC
Permalink
[Christoph Hellwig]
Post by Christoph Hellwig
Post by Terje Malmedal
The only problem I can see is that different modules overloading the
same function needs to be unloaded in the correct order. Is this the
only reason for removing it, or am I missing something?
it's racy - and it doesn't work on half of the arches added over the
last years.
Would you be so kind as to explain exactly what is racy? Just
asserting that it is does not help me understand anything.
--
- Terje
***@usit.uio.no
Jesse Pollard
2003-05-08 18:13:49 UTC
Permalink
Post by Terje Malmedal
[Christoph Hellwig]
Post by Christoph Hellwig
Post by Terje Malmedal
The only problem I can see is that different modules overloading the
same function needs to be unloaded in the correct order. Is this the
only reason for removing it, or am I missing something?
it's racy - and it doesn't work on half of the arches added over the
last years.
Would you be so kind as to explain exactly what is racy? Just
asserting that it is does not help me understand anything.
Look at this:

[1]int init_module(void)
[2]{
[3] orig_fsync=sys_call_table[SYS_fsync];
[4] sys_call_table[SYS_fsync]=hacked_fsync;
[5] return 0;
[6]}

Unless there is a LOCK on sys_call_table[SYS_fsync] another CPU could
replace the pointer between lines 3 and 4. At that point line 4 would
destroy the existing entry.. or destroy it when the original is restored,
and would NOT be restoring the one inserted.
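The window between lines 3 and 4 can be illustrated outside the kernel. The following is a user-space sketch using C11 atomics, not kernel code, and every name in it (handler_t, hook_install, hook_remove, the sample handlers) is made up for the illustration. Installing with a single atomic exchange removes the gap between "read the old pointer" and "write the new one", and removing with compare-and-swap refuses to restore when somebody else has hooked the slot in the meantime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef long (*handler_t)(long);

/* Sample handlers standing in for the original entry and two hooks. */
long orig_handler(long x) { return x; }
long hook_a(long x) { return x + 1; }
long hook_b(long x) { return x + 2; }

/* Install `replacement` and return whatever was in the slot, as one
 * atomic step: no other writer can slip in between the read and write. */
handler_t hook_install(_Atomic(handler_t) *slot, handler_t replacement)
{
    assert(slot != NULL && replacement != NULL);
    return atomic_exchange(slot, replacement);
}

/* Restore `saved` only if the slot still holds `mine`. Returns false,
 * leaving the slot untouched, when another hook sits on top of ours. */
bool hook_remove(_Atomic(handler_t) *slot, handler_t mine, handler_t saved)
{
    handler_t expected = mine;
    return atomic_compare_exchange_strong(slot, &expected, saved);
}
```

This only addresses the replacement step. It does nothing for callers that are already executing inside a hook when it is removed, and a module whose hook_remove() fails still cannot unload safely, which is the unload-ordering problem raised earlier.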
Christoph Hellwig
2003-05-08 19:17:29 UTC
Permalink
Post by Jesse Pollard
Unless there is a LOCK on sys_call_table[SYS_fsync] another CPU could
replace the pointer between lines 3 and 4. At that point line 4 would
destroy the existing entry.. or destroy it when the original is restored,
and would NOT be restoring the one inserted.
That's the race in the replacement. The second race is in actually
using these hooks. As soon as you examine a user pointer/address
in there you're fundamentally racy vs. another thread manipulating
the user address space.
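A deterministic user-space sketch of that second race, sometimes called a double fetch: the hook checks a user buffer, and the real call reads the same buffer again afterwards. The callback stands in for another thread running in the window between the two reads. All names here (hook_allows, hooked_call, rewrite_path, the "/tmp/" policy) are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for "another thread runs here"; it may rewrite the buffer. */
typedef void (*interleave_fn)(char *user_buf, size_t len);

/* Hypothetical policy a hook might apply to a path argument. */
static int hook_allows(const char *user_buf)
{
    return strncmp(user_buf, "/tmp/", 5) == 0;
}

/* Example interleaving: swap the path after it has been checked. */
void rewrite_path(char *user_buf, size_t len)
{
    strncpy(user_buf, "/etc/passwd", len - 1);
    user_buf[len - 1] = '\0';
}

/* Returns -1 if the hook rejects the path, 0 otherwise. On success,
 * `seen` holds the path the "real" call actually read. */
int hooked_call(char *user_buf, size_t len, interleave_fn other_thread,
                char *seen, size_t seen_len)
{
    assert(len > 0 && seen_len > 0);

    if (!hook_allows(user_buf))            /* first fetch: the check */
        return -1;

    if (other_thread)
        other_thread(user_buf, len);       /* the race window */

    strncpy(seen, user_buf, seen_len - 1); /* second fetch: the use */
    seen[seen_len - 1] = '\0';
    return 0;
}
```

With rewrite_path as the interleaving, hooked_call() returns 0 although the path that was finally used is one the hook would have rejected.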
Alan Cox
2003-05-08 14:58:48 UTC
Permalink
Post by Terje Malmedal
How about a
EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
It's in read-only space nowadays anyway
Post by Terje Malmedal
A server for an online internet game had several months of uptime and
I needed to rotate the log-files so I made a module which trapped
sys_write and closed and reopened the file with a new name before
continuing[1].
man ptrace
Ben Lau
2003-05-07 02:14:09 UTC
Permalink
Hi,

I am interested in pt 2: how did NFS handle their syscall?
Post by Terje Eggestad
Unfortunately we live in an insane world.
First of all, in the Changelog where the export was removed for 2.5.41
http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41
Arjan lists 4 reasons for having the export in the first place, and I'm
on point 3. Here Arjan pretty much acknowledges that there is a
legitimate need to have a event/hook system to be informed of a syscall.
The exact quote is: "Eg the use of the export in this just a bandaid due
to lack of a proper mechanism".
My argument for *why* there should be a mechanism stops here.
Since you're bright inquisitive: The exact problem I'm facing is pretty
1. performance is everything.
2. We're making a MPI library, and as such we don't have any control
with the application.
3a. The various hardware for cluster interconnect all work with DMA.
3b. the performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to to 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send receive buffers. It can't since it would break language binding
to fortran. Thus ANY writeable user page may be used.
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
7. The only way to cache buffers (to see if they're used before and
hence pinned) is the user space virtual address. A syscall, thus ioctl
to a device file is prohibitive expensive under point 1.
8a. if the app (glibc in practice, but you never know) use sbrk() with a
negative arg, and then a positive argument, I can get a a different set
of user pages with the same address.
8b ditto with a set of munmap()/mmap().
9. since the number of times. any 'realloc' may happen is << than the
numbers of times any buffer may be used, it's necessary under point 1 to
to trace changes to virtual addresses to phys pages, rather than test
every time an address is being used.
10. kernel patches are impractical, I must be able to do this with std
stock, redhat, AND suse kernels.
Post by Christoph Hellwig
Post by Terje Eggestad
Now that it seem that all are in agreement that the sys_call_table
symbol shall not be exported to modules, are there any work in progress
to allow modules to get an event/notification whenever a specific
syscall is being called?
No.
Post by Terje Eggestad
We have a specific need to trace mmap() and sbrk() calls.
Well, you get mmap events for your driver and I can't imagine a sane
reason for intercepting sbrk(). Do you have a pointer to the driver
source doing such strange things?
Arjan van de Ven
2003-05-05 08:27:44 UTC
Permalink
Post by Terje Eggestad
Now that it seem that all are in agreement that the sys_call_table
symbol shall not be exported to modules, are there any work in progress
to allow modules to get an event/notification whenever a specific
syscall is being called?
We have a specific need to trace mmap() and sbrk() calls.
such trace hooks surely can be put in the mmap and sbrk calls themselves
by means of a patch for your systems ?
Dmitry A. Fedorov
2003-05-05 09:01:25 UTC
Permalink
Post by Terje Eggestad
Now that it seem that all are in agreement that the sys_call_table
symbol shall not be exported to modules, are there any work in progress
No, I disagree.
Post by Terje Eggestad
to allow modules to get an event/notification whenever a specific
syscall is being called?
I need this table to _call_ any of system calls that available to the
process, nothing else. Sys_call_table can be placed in .rodata section
(there was patch a few days ago) to prevent modification from modules.
But why module should not have ability to call any function which is
available from user space?

Almost all of my third-party drivers are broken by this.
What is worse, redhat "backported" this "feature" to their 2.4
patched kernels and now I should distinguish 2.4 and "redhat 2.4"
in my compatibility headers.
Christoph Hellwig
2003-05-05 09:19:07 UTC
Permalink
Post by Dmitry A. Fedorov
Almost all of my third-party drivers are broken by this.
What is worse, redhat "backported" this "feature" to their 2.4
patched kernels and now I should distinguish 2.4 and "redhat 2.4"
in my compatibility headers.
What about just fixing your drivers instead of moaning? If you submit
a pointer to your driver source and explain what you want to do someone
might even help you..
Arjan van de Ven
2003-05-05 09:32:11 UTC
Permalink
Post by Dmitry A. Fedorov
But why module should not have ability to call any function which is
available from user space?
that's why you can just call sys_read() and co directly.
Christoph Hellwig
2003-05-05 13:42:29 UTC
Permalink
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
And there is no risk of intersection with statically assigned major
numbers, as it is for many others third-party sources.
You don't want to tell me you do that for real, do you?
That alone is a very good idea to unexport the syscall table without
exporting those symbols..
Dmitry A. Fedorov
2003-05-05 14:46:44 UTC
Permalink
Post by Christoph Hellwig
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
And there is no risk of intersection with statically assigned major
numbers, as it is for many others third-party sources.
You don't want to tell me you do that for real, do you?
I do that for real.
Please, think about it as small portable private devfs library.
Post by Christoph Hellwig
That alone is a very good idea to unexport the syscall table without
exporting those symbols..
It does not help; I would find another way, maybe vfs_* calls
or proc_mknod. Unexport those too.
v***@parcelfarce.linux.theplanet.co.uk
2003-05-05 13:45:16 UTC
Permalink
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
And there is no risk of intersection with statically assigned major
numbers, as it is for many others third-party sources.
*yuck*

Do that from modprobe. "No external utility" is not a virtue, especially
when said utility is a trivial shell script.
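One user-space way to do this, assuming procfs is mounted, is to read the dynamically assigned major from /proc/devices once the module has loaded and then create the node. Below is a sketch of the lookup step in C; find_char_major() and the device names used with it are placeholders, and the parser takes any FILE so it can be tried on a sample file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan text in /proc/devices format and return the major number listed
 * for `name` in the "Character devices:" section, or -1 if not found. */
int find_char_major(FILE *devices, const char *name)
{
    char line[128];
    char dev[64];
    int major;
    int in_char = 0;

    assert(devices != NULL && name != NULL);

    while (fgets(line, sizeof line, devices)) {
        if (strncmp(line, "Character devices:", 18) == 0) {
            in_char = 1;
            continue;
        }
        if (strncmp(line, "Block devices:", 14) == 0) {
            in_char = 0;
            continue;
        }
        /* Entries look like "254 mydrv"; blank lines simply fail to parse. */
        if (in_char && sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;
    }
    return -1;
}
```

With the major in hand, a call such as mknod(path, S_IFCHR | 0600, makedev(major, 0)) creates the node; a shell script run after modprobe can do the same with grep and mknod(1).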
Dmitry A. Fedorov
2003-05-05 14:29:51 UTC
Permalink
Post by v***@parcelfarce.linux.theplanet.co.uk
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
And there is no risk of intersection with statically assigned major
numbers, as it is for many others third-party sources.
*yuck*
Do that from modprobe. "No external utility" is not a virtue, especially
when said utility is a trivial shell script.
What about modprobe? Dynamic major number can be acquired only by the
module itself. Only after that the appropriate /dev entry can be
created. External utility must get major number from the module
but without the /dev entry there is no communication end point with the
module.

Only two possibilities exist:

1. /dev entries created with statically assigned major/minor numbers.
It is inconvenient for third-party modules.

2. devfs or procfs (/dev entry is just a symlink to some /proc/ entry
which will be created with the device attributes later).

You should look at my approach as a tiny portable private devfs library.
It would work with and without devfs, without procfs (stripped in
embedded environments), with old and new kernels.
There are no "illegal" kernel mechanisms used.

Only one thing is required: availability of system calls.
Dmitry A. Fedorov
2003-05-05 13:30:38 UTC
Permalink
Post by Arjan van de Ven
Post by Dmitry A. Fedorov
But why module should not have ability to call any function which is
available from user space?
Almost all of my third-party drivers are broken by this.
What is worse, redhat "backported" this "feature" to their 2.4
patched kernels and now I should distinguish 2.4 and "redhat 2.4"
in my compatibility headers.
that's why you can just call sys_read() and co directly.
Yes, for redhat kernels - almost all of the sys_* functions are exported.
And then there is kernel.org's one with only a few sys_* exported.
And how will I distinguish redhat's kernel from other ones? - there is
nothing like #define REDHAT_PATCHED in the headers.
I don't want to have a separate driver source version
for each incompatible kernel variant; I prefer to have a single
driver source which is adapted to the user's environment at compilation
time.
Post by Arjan van de Ven
What about just fixing your drivers instead of moaning? If you
submit a pointer to your driver source and explain what you want to
do someone might even help you..
Of course, I will fix my drivers (permanent kernel changes
provide us a maintenance job forever :).

For example:

http://www.rtdusa.com/software/RTDFinland/ECAN_Linux.zip
http://www.rtdusa.com/software/RTDFinland/UPS25_Linux.ZIP

I use the following calls:

sys_mknod
sys_chown
sys_umask
sys_unlink

for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
And there is no risk of intersection with statically assigned major
numbers, as it is for many others third-party sources.

It has worked for a long time with all kernels from 2.0 to 2.4 (except the
latest redhat 2.4) and it should work with 2.6, I hope.


I use sys_write to output loading/device detection/diagnostic
messages to process's stderr when appropriate. Yes, it may look as
"wrong thing" but it uses only legal kernel mechanisms and it saves
lots of time with e-mail support:
/sbin/insmod driver verbose=1 2>&1 | mail -s 'it does not works' me@


It would be nice if either sys_call_table were left exported and placed in
a read-only data section to prevent modification (do you want just that?)
or _all_ of the sys_* functions were exported in the original kernel.
Pete Zaitcev
2003-05-05 20:50:45 UTC
Permalink
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
Well, duh. "Without devfs and external utility" is a non-goal.
You set it, you screw trying to achieve it. It's like a well-known
Russian joke: "[...] We remove the adenoid tissue... through
the anal opening with a blowtorch".
I use sys_write to output loading/device detection/diagnostic
messages to process's stderr when appropriate. Yes, it may look as
"wrong thing" but it uses only legal kernel mechanisms and it saves
And pray tell how is syslog different?

-- Pete
Dmitry A. Fedorov
2003-05-06 02:17:22 UTC
Permalink
Post by Pete Zaitcev
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire dynamic major
number without devfs and external utility of any kind.
Well, duh. "Without devfs and external utility" is a non-goal.
You set it, you screw trying to achieve it. It's like a well-known
Russian joke: "[...] We remove the adenoid tissue... through
the anal opening with a blowtorch".
:)
I disagree. It is small and nice solution. It is my own devfs for
pre-devfs kernels.
Post by Pete Zaitcev
I use sys_write to output loading/device detection/diagnostic
messages to process's stderr when appropriate. Yes, it may look as
"wrong thing" but it uses only legal kernel mechanisms and it saves
And pray tell how is syslog different?
syslog has the same text first.
Chuck Ebbert
2003-05-05 21:29:20 UTC
Permalink
Post by Terje Eggestad
Lets deal, I'll GPL the trace module if you get me a
EXPORT_SYMBOL_GPL(sys_call_table);
You could always use the rootkit techniques from Phrack 58 to find
the table... seems kind of silly to do that in kernel mode, but it
should work.
Terje Eggestad
2003-05-05 22:49:34 UTC
Permalink
Good point, it should actually be very simple.
From /proc/ksyms we've got the addresses of the sys_*, then from
asm/unistd.h we've got the order.
Then search thru /dev/kmem until you find the right string of addresses,
and you've got sys_call_table.

Dirty but it should be portable.
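The search itself is plain pattern matching. The sketch below runs on an ordinary array of words standing in for a memory image; reading /dev/kmem is not shown, and find_table() and struct known_entry are names made up for the illustration. Each known entry pairs a table index (the role the numbers from asm/unistd.h play) with an expected address (the role of the addresses from /proc/ksyms).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One thing we know about the table: slot `index` must hold `addr`. */
struct known_entry {
    size_t index;
    uintptr_t addr;
};

/* Return the word offset of the first position in `mem` at which a table
 * of `table_len` words satisfies every known entry, or -1 if none does. */
long find_table(const uintptr_t *mem, size_t nwords,
                const struct known_entry *known, size_t nknown,
                size_t table_len)
{
    assert(mem != NULL && known != NULL);

    for (size_t base = 0; base + table_len <= nwords; base++) {
        size_t k;
        for (k = 0; k < nknown; k++) {
            if (known[k].index >= table_len ||
                mem[base + known[k].index] != known[k].addr)
                break;
        }
        if (k == nknown)
            return (long)base;
    }
    return -1;
}
```

The more known entries are supplied, the less likely a chance match becomes; with only one or two entries, unrelated data can satisfy the pattern.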
Post by Terje Eggestad
Lets deal, I'll GPL the trace module if you get me a
EXPORT_SYMBOL_GPL(sys_call_table);
You could always use the rootkit techniques from Phrack 58 to find
the table... seems kind of silly to do that in kernel mode, but it
should work.
Dmitry A. Fedorov
2003-05-06 02:23:46 UTC
Permalink
Post by Terje Eggestad
Good point, it should actually be very simple.
From /proc/ksyms we've got the addresses of the sys_*, then from
asm/unistd.h we've got the order.
/proc/ksyms shows only exported symbols, doesn't it?
Post by Terje Eggestad
Then search through /dev/kmem until you find the right string of addresses,
and you got sys_call_table.
Dirty but it should be portable.
Terje Eggestad
2003-05-06 07:27:34 UTC
Permalink
Yes, but it should be enough
Post by Dmitry A. Fedorov
Post by Terje Eggestad
Good point, it should actually be very simple.
from /proc/ksyms we've got the addresses of the sys_*, then from
asm/unistd.h we got the order.
/proc/ksyms shows only exported symbols, doesn't it?
Post by Terje Eggestad
Then search through /dev/kmem until you find the right string of addresses,
and you got sys_call_table.
Dirty but it should be portable.
Dmitry A. Fedorov
2003-05-06 08:21:02 UTC
Permalink
Post by Terje Eggestad
Post by Dmitry A. Fedorov
Post by Terje Eggestad
Good point, it should actually be very simple.
from /proc/ksyms we've got the addresses of the sys_*, then from
asm/unistd.h we got the order.
/proc/ksyms shows only exported symbols, doesn't it?
Yes, but it should be enough
But how? If a global is not exported, it will not be listed
in /proc/ksyms.
Yoav Weiss
2003-05-06 08:45:41 UTC
Permalink
Post by Dmitry A. Fedorov
But how? When some global will not be exported, it would not be listed
in /proc/ksyms.
So what ?
You just find the right address (in this case by getting the addresses of
exported syscalls and finding a list in memory, containing them in the
right order), and cast it to be the syscall table. If you want it to work
with a binary-only driver, you can even insmod a small module that does
that and adds the result to the symbol table for other modules to use.

We've been doing that for years on closed-source systems like AIX. The
above is just one way to locate a struct in memory. A faster way is to
find some exported structs which are known to point to the unexported
symbol from some offset, extract the symbol's address, and "re-export" it.

In fact, in Linux, which is open source, you can probably write a script
that extracts any unexported symbol from the source code, finds a path to
it from some exported symbol, and automagically creates a module that
re-exports this symbol for your legacy driver to use.

If you write the script, don't forget to GPL it :)

Yoav Weiss
Dmitry A. Fedorov
2003-05-06 10:06:38 UTC
Permalink
Post by Yoav Weiss
Post by Dmitry A. Fedorov
But how? When some global will not be exported, it would not be listed
in /proc/ksyms.
So what ?
You just find the right address (in this case by getting the addresses of
exported syscalls and finding a list in memory, containing them in the
right order), and cast it to be the syscall table.
Thanks, now I understand it. And I would not do that.
Post by Yoav Weiss
it from some exported symbol, and automagically create a module that
re-exports this symbol for your legacy driver to use.
None of my drivers are legacy or binary-only.
By "third-party driver" in my other posts I just meant software outside
the kernel source tree that has no reason to be included in
the kernel sources.

I just need legal kernel mechanisms to do some "strange" things,
nothing else.
Post by Yoav Weiss
If you write the script, don't forget to GPL it :)
I will not write such a script.
David S. Miller
2003-05-06 09:15:05 UTC
Permalink
Post by Yoav Weiss
In fact, in linux which is opensource, you can probably write a script
that extracts any unexported symbol from the source code, find a path to
it from some exported symbol, and automagically create a module that
re-exports this symbol for your legacy driver to use.
You might have a derivative work after obtaining access to a
non-exported interface. If this is correct, binary-only modules
can't do this and therefore they must stick to exported interfaces.
--
David S. Miller <***@redhat.com>
David Schwartz
2003-05-06 19:45:40 UTC
Permalink
Post by David S. Miller
Post by Yoav Weiss
In fact, in linux which is opensource, you can probably write a script
that extracts any unexported symbol from the source code, find a path to
it from some exported symbol, and automagically create a module that
re-exports this symbol for your legacy driver to use.
You might have a derivative work after obtaining access to a
non-exported interface. If this is correct, binary-only modules
can't do this and therefore they must stick to exported interfaces.
Obviously you don't have much experience getting around licenses. ;)

You GPL the part that does the dirty work. Then your closed-source module
only uses exported interfaces and the boundary between GPL and closed-source
code is a clear license boundary.

DS
Jerry Cooperstein
2003-05-06 17:01:23 UTC
Permalink
It's much simpler than that: Do either

nm vmlinux | grep sys_call_table

or

grep sys_call_table System.map

extract the address, use the header file to get the syscall number and
the offset.

Of course this all breaks the GPL, but you can get any non-exported
symbol address that way.
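
For what it's worth, the System.map lookup can be sketched in C as below. This assumes the usual "<hex address> <type letter> <name>" line layout; the function name `lookup_symbol` is hypothetical, and the text is passed in as a string rather than read from a file to keep the sketch self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up `symbol` in System.map-style text. Returns 0 and stores the
 * address on success, -1 if the symbol is not listed.
 */
int lookup_symbol(const char *map_text, const char *symbol,
                  unsigned long *addr_out)
{
    const char *line = map_text;

    assert(map_text != NULL && symbol != NULL && addr_out != NULL);
    while (*line != '\0') {
        char name[128];
        char type;
        unsigned long addr;

        /* Exact name comparison, unlike a plain grep for a substring. */
        if (sscanf(line, "%lx %c %127s", &addr, &type, name) == 3 &&
            strcmp(name, symbol) == 0) {
            *addr_out = addr;
            return 0;
        }
        line = strchr(line, '\n');  /* advance to the next line */
        if (line == NULL)
            break;
        line++;
    }
    return -1;
}
```

The address found this way is only valid for the exact kernel image the map was generated from.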

======================================================================
Jerry Cooperstein, Senior Consultant, <***@axian.com>
Axian, Inc., Software Consulting and Training
4800 SW Griffith Dr., Ste. 202, Beaverton, OR 97005 USA
http://www.axian.com/
======================================================================
Post by Yoav Weiss
Post by Dmitry A. Fedorov
But how? When some global will not be exported, it would not be listed
in /proc/ksyms.
So what ?
You just find the right address (in this case by getting the addresses of
exported syscalls and finding a list in memory, containing them in the
right order), and cast it to be the syscall table. If you want it to work
with a binary-only driver, you can even insmod a small module that does
that and adds the result to the symbol table for other modules to use.
We've been doing that for years on closed-source systems like AIX. The
above is just one way to locate a struct in memory. A faster way is to
find some exported structs which are known to point to the unexported
symbol from some offset, extract the symbol's address, and "re-export" it.
In fact, in linux which is opensource, you can probably write a script
that extracts any unexported symbol from the source code, find a path to
it from some exported symbol, and automagically create a module that
re-exports this symbol for your legacy driver to use.
If you write the script, don't forget to GPL it :)
Yoav Weiss
Yoav Weiss
2003-05-06 17:45:39 UTC
Permalink
Post by Jerry Cooperstein
It's much simpler than that: Do either
nm vmlinux | grep sys_call_table
or
grep sys_call_table System.map
extract the address, use the header file to get the syscall number and
the offset.
You're right, but only if System.map or vmlinux is available. In
some distros you only have the bzImage/vmlinuz, and still want to load
some module without replacing the kernel.

My proposed script would derive this info from exported symbols in the
running kernel, so it's more portable. Another advantage is that it gains
access to non-globals. As long as they're referred to by some exported
struct, even indirectly, they can be re-exported as globals. (Not that
I'd do it or recommend it to anyone :)
Post by Jerry Cooperstein
Of course this all breaks the GPL, but you can get any non-exported
symbol address that way.
It violates the GPL only if you distribute the resulting module. As long
as you run the script locally, generate the module locally, and only use
it locally, I don't see how it violates anything. GPL is a license for
distributors, not users.

Yoav Weiss
Yoav Weiss
2003-05-06 15:51:21 UTC
Permalink
Post by David S. Miller
You might have a derivative work after obtaining access to a
non-exported interface. If this is correct, binary-only modules
can't do this and therefore they must stick to exported interfaces.
That's an interesting question. Who violates the license here? It can't
be the author of the binary driver (unless it was in breach before the
symbol was unexported), because the driver didn't change. The user,
wishing to keep using his driver although the kernel changed and broke it,
generates and insmods a module that re-exports a symbol that the driver
relies upon. However, the user didn't release any code, so he can't be in
breach either.

It's just a method of backward compatibility for kernel modules. Of course,
IANAL, so I may be wrong here.

One could argue that the binary module was in breach in the first place,
because of various reasons. My point is that the re-exporting module
didn't change anything in terms of derived work.

Yoav Weiss
Chuck Ebbert
2003-05-06 20:48:07 UTC
Permalink
Post by David S. Miller
You might have a derivative work after obtaining access to a
non-exported interface. If this is correct, binary-only modules
can't do this and therefore they must stick to exported interfaces.
And what about modules that just hook syscalls directly by hooking int
0x80 or messing with sysenter?
petter wahlman
2003-05-07 15:34:33 UTC
Permalink
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.


-p.
Arjan van de Ven
2003-05-07 15:48:40 UTC
Permalink
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
those obviously need to be implemented via the security subsystem (eg
LSM). Hooks are obviously the wrong level to do things and I could even
tell you that you cannot implement this right from a module actually.
Richard B. Johnson
2003-05-07 16:00:55 UTC
Permalink
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.

Also, there are existing system calls that are not in use.
You can modify your copy of a kernel for whatever you want.
Example system calls that simply return -ENOSYS are
break, stty, gtty, prof, acct, lock, and mpx. That should
be enough entry-points to muck with.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
petter wahlman
2003-05-07 16:08:31 UTC
Permalink
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?

-p.
Richard B. Johnson
2003-05-07 16:45:04 UTC
Permalink
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
I wouldn't.


Cheers,
Dick Johnson
Richard B. Johnson
2003-05-07 16:59:17 UTC
Permalink
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.

Cheers,
Dick Johnson
petter wahlman
2003-05-07 18:07:25 UTC
Permalink
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Can you please elaborate?
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?

Regards,

-p.
Richard B. Johnson
2003-05-07 18:33:56 UTC
Permalink
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Can you please elaborate.
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?
Regards,
The program loader for shared-library programs is ld.so or
ld-linux.so. It's the thing that mmaps the shared libraries
and eventually calls _start at the beginning of the program:

execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
brk(0) = 0x804c748
open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
close(3) = 0



Cheers,
Dick Johnson
petter wahlman
2003-05-08 08:58:58 UTC
Permalink
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Can you please elaborate.
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?
Regards,
The program loader for shared-library programs is ld.so or
ld-linux.so. It's the thing that mmaps the shared libraries
execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
brk(0) = 0x804c748
open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
close(3) = 0
That would work on dynamically linked executables, but how do you control
access to file shares or static executables? Denying access to the latter
would even prevent ldconfig from running.


Regards,


-p.
Richard B. Johnson
2003-05-08 15:11:53 UTC
Permalink
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
-p.
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Can you please elaborate.
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?
Regards,
The program loader for shared-library programs is ld.so or
ld-linux.so. It's the thing that mmaps the shared libraries
execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
brk(0) = 0x804c748
open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
close(3) = 0
That would work on dynamically linked executables, but how do you control
access to file shares or static executables? Denying access to the latter
would even prevent ldconfig from running.
You can execute existing static-linked files by having ld.so execute
them. Ld.so "knows" how to execute static-linked files. You just
need to change kernel code to include the static executable magic
number with the dynamic linked magic number as requiring the
preprocessing of the dynamic linker.

The only problem is that 'init' won't start if that loader isn't
available. This is not a problem for working systems. It's just a
problem for broken ones. You use an unpatched kernel for maintenance.


Cheers,
Dick Johnson
Jesse Pollard
2003-05-07 21:27:16 UTC
Permalink
[snip]
Post by petter wahlman
Post by Richard B. Johnson
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Can you please elaborate.
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?
Access control is available via the LSM, with well defined interfaces.
If that is what you want to control, then use the LSM, and not the syscall
table.
Jesse Pollard
2003-05-07 17:21:11 UTC
Permalink
Post by petter wahlman
Post by Richard B. Johnson
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
Post by petter wahlman
obviously not the way to go.
Obviously wrong.
And how would you force the virus to preload this library?
You don't have to... The preload is performed by the program image loader,
before the virus, or even the application, can be started.

You don't really want to do it anyway... Consider a file open (like tar)...
you gonna try to scan the entire archive for a virus????
Steffen Persvold
2003-05-07 16:18:56 UTC
Permalink
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
Well, for a system wide system call hook, a kernel mechanism is necessary
(and useful too IMHO). However for our usage (MPI) it is enough to know
when the current process calls either sbrk(-n) or munmap glibc functions,
thus it is sufficient to implement some kind of callback functionality for
certain glibc functions, sort of like the malloc/free hooks but on a more
general basis since some applications don't use malloc/free but
implement their own alloc/free algorithms using the syscalls (one example
is f90 apps).

Ideas anyone ?

Regards,
--
Steffen Persvold | Scali AS
mailto:***@scali.com | http://www.scali.com
Tel: (+47) 2262 8950 | Olaf Helsets vei 6
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY
Eric W. Biederman
2003-05-08 12:23:25 UTC
Permalink
Post by Steffen Persvold
Post by petter wahlman
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g anti virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
Well, for a system wide system call hook, a kernel mechanism is necessary
(and useful too IMHO). However for our usage (MPI) it is enough to know
when the current process calls either sbrk(-n) or munmap glibc functions,
thus it is sufficient to implement some kind of callback functionality for
certain glibc functions, sort of like the malloc/free hooks but on a more
general basis since some applications don't use malloc/free but
implement their own alloc/free algorithms using the syscalls (one example
is f90 apps).
Ideas anyone ?
I think the complete list of functions to be hooked needs to be at least:
mmap(MAP_FIXED), munmap, sbrk(-n), shmat, shmdt. The mapping cases
are needed because a mmap(MAP_FIXED) can implicitly unmap whatever is
already mapped in that range before the new mapping is used.

This is not a kernel issue as this is purely a user space problem,
the kernel provides all of the necessary functionality.

I suspect what is needed is something like:
int on_unmap(void (*func)(void *start, size_t length, void *), void *arg);

With the function called before the unmap actually occurs, that way
the multi thread case is safe. It needs to be built so that multiple libraries
can cooperate cleanly.

Ulrich, what do you think? Is the above function reasonable?
Something like it is needed to manage caches of pinned memory for high
performance kernel bypass libraries.
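
A minimal userspace sketch of how such a registry could look, keeping the on_unmap() signature proposed above. Everything else here is an assumption for illustration: the fixed-size table, the `run_unmap_hooks` entry point that a munmap/shmdt/sbrk wrapper would call, and the example callback. It has no locking, so it is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNMAP_HOOKS 8   /* arbitrary limit for the sketch */

typedef void (*unmap_fn)(void *start, size_t length, void *arg);

static struct { unmap_fn func; void *arg; } unmap_hooks[MAX_UNMAP_HOOKS];
static size_t unmap_hook_count;

/* Register a callback; returns 0 on success, -1 if the table is full. */
int on_unmap(unmap_fn func, void *arg)
{
    assert(func != NULL);
    if (unmap_hook_count == MAX_UNMAP_HOOKS)
        return -1;
    unmap_hooks[unmap_hook_count].func = func;
    unmap_hooks[unmap_hook_count].arg = arg;
    unmap_hook_count++;
    return 0;
}

/* Meant to be called by a wrapper BEFORE the real unmap happens. */
void run_unmap_hooks(void *start, size_t length)
{
    for (size_t i = 0; i < unmap_hook_count; i++)
        unmap_hooks[i].func(start, length, unmap_hooks[i].arg);
}

/* Example callback: add the unmapped length to a counter passed via arg. */
void count_unmapped_bytes(void *start, size_t length, void *arg)
{
    (void)start;
    *(size_t *)arg += length;
}
```

Several libraries can each register their own callback, and a pinned-memory cache would drop any entries overlapping [start, start + length) in its callback.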

Eric
Chuck Ebbert
2003-05-07 19:04:57 UTC
Permalink
Post by Arjan van de Ven
Post by petter wahlman
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
those obviously need to be implemented via the security subsystem (eg
LSM). Hooks are obviously the wrong level to do things and I could even
tell you that you cannot implement this right from a module actually.
What is really needed is some kind of proper generic hooking setup
that could be used both by LSM and other things. People doing this
may need to intercept syscalls both on their way to the kernel and
on the way back to userland (so they can see return codes.) They may
also need to say whether they want to be first or last if there are
multiple users of this facility.

But the real question is why the export of sys_call_table was so
gratuitously removed without any kind of replacement being offered.
And the attitude of the developers about it is truly awful. ("Oh, so
we broke the drivers you depend on for your livelihood? You can just
go get a new job -- pounding sand down a rathole.")
Arjan van de Ven
2003-05-08 09:59:43 UTC
Permalink
typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
void * arg4, void * arg5, void * arg6);
#define HOOK_IN_FLAG 0x1
#define HOOK_OUT_FLAG 0x2
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
int flags); /* returns an opaque handle */
int unregister(int opaquehandle);
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
I don't think it has.
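
For readers who want to see the shape of the proposed interface, here is a userspace simulation in C. It is only a sketch under stated assumptions, not a kernel patch: the fixed-size table, `unregister_syscall_hook` (a renamed `unregister`), `run_syscall_hooks` and `counting_hook` are hypothetical, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*syscall_hook_t)(void *arg1, void *arg2, void *arg3,
                              void *arg4, void *arg5, void *arg6);

#define HOOK_IN_FLAG  0x1   /* run on entry to the call */
#define HOOK_OUT_FLAG 0x2   /* run on return from the call */
#define MAX_HOOKS 16        /* arbitrary limit for the sketch */

static struct {
    int in_use;
    int syscall_nr;
    int flags;
    syscall_hook_t fn;
} hooks[MAX_HOOKS];

/* Returns an opaque handle (>= 0), or -1 if no slot is free. */
int register_syscall_hook(int syscall_nr, syscall_hook_t fn, int flags)
{
    assert(fn != NULL);
    for (int h = 0; h < MAX_HOOKS; h++) {
        if (!hooks[h].in_use) {
            hooks[h].in_use = 1;
            hooks[h].syscall_nr = syscall_nr;
            hooks[h].flags = flags;
            hooks[h].fn = fn;
            return h;
        }
    }
    return -1;
}

/* Returns 0 on success, -1 for an unknown handle. */
int unregister_syscall_hook(int handle)
{
    if (handle < 0 || handle >= MAX_HOOKS || !hooks[handle].in_use)
        return -1;
    hooks[handle].in_use = 0;
    return 0;
}

/* Run every hook registered for syscall_nr in this phase; returns the count. */
int run_syscall_hooks(int syscall_nr, int phase, void *args[6])
{
    int ran = 0;
    for (int h = 0; h < MAX_HOOKS; h++) {
        if (hooks[h].in_use && hooks[h].syscall_nr == syscall_nr &&
            (hooks[h].flags & phase)) {
            hooks[h].fn(args[0], args[1], args[2],
                        args[3], args[4], args[5]);
            ran++;
        }
    }
    return ran;
}

/* Example hook: counts its own invocations. */
static int hook_calls;
int counting_hook(void *a1, void *a2, void *a3,
                  void *a4, void *a5, void *a6)
{
    (void)a1; (void)a2; (void)a3; (void)a4; (void)a5; (void)a6;
    hook_calls++;
    return 0;
}
```

Because each registration owns its own slot, hooks can be removed in any order without touching anyone else's entry.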
v***@parcelfarce.linux.theplanet.co.uk
2003-05-08 10:20:47 UTC
Permalink
Post by Arjan van de Ven
typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
void * arg4, void * arg5, void * arg6);
#define HOOK_IN_FLAG 0x1
#define HOOK_OUT_FLAG 0x2
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
int flags); /* returns an opaque handle */
int unregister(int opaquehandle);
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
I don't think it has.
I think it could, actually - who maintains fortunes these days? It's
a bit too long, though...
Terje Eggestad
2003-05-08 12:54:35 UTC
Permalink
What really gets to me is that *you* wrote in
(http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41):

3. Intercept system calls
OProfile (and intel's vtune which is similar in function) used to do this;
however what they really need is a notification on certain
events (exec() mostly). The way modules do this is store the original
function pointer, install a new one that calls the old one after storing
whatever info they need. This mechanism breaks badly in the light of
multiple such modules doing this versus modules
unloading/uninstalling their handlers (by restoring their saved pointer
that may or may not point to a valid handler anymore).
Eg the use of the export in this just a bandaid due to lack of a
proper mechanism, and also incorrect and crash prone.
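
To make the failure mode in the quoted changelog concrete, here is a minimal userspace sketch in C. It is not kernel code: `slot`, `install`, `uninstall`, `hook_a` and `hook_b` are hypothetical names standing in for one table entry and two modules, and the handlers do not chain to the saved pointer as real hooks would.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_t)(void);

int original_handler(void) { return 0; }
int hook_a(void) { return 1; }   /* stands in for module A's handler */
int hook_b(void) { return 2; }   /* stands in for module B's handler */

/* One slot of a hypothetical call table. */
handler_t slot = original_handler;

/* "Load": remember whatever is in the slot, then install our handler. */
handler_t install(handler_t mine)
{
    handler_t saved = slot;

    assert(mine != NULL);
    slot = mine;
    return saved;
}

/* "Unload": blindly put back the pointer saved at load time. */
void uninstall(handler_t saved)
{
    slot = saved;
}
```

If A installs, then B installs, then A unloads, the slot goes back to the original handler and B's hook silently disappears. When B later unloads it restores its saved pointer, which is A's handler, even though A is gone.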


So what you're saying here is not that you object to having people doing
syscall hooks, just that operating on the sys_call_table symbol directly
is error prone (to which I wholeheartedly agree).

Then you reject a "proper mechanism".....

TJ
Post by Arjan van de Ven
typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
void * arg4, void * arg5, void * arg6);
#define HOOK_IN_FLAG 0x1
#define HOOK_OUT_FLAG 0x2
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
int flags); /* returns an opaque handle */
int unregister(int opaquehandle);
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
I don't think it has.
Christoph Hellwig
2003-05-08 12:58:39 UTC
Permalink
Post by Terje Eggestad
So what you're saying here is not that you object to having people doing
syscall hooks, just that operating on the sys_call_table symbol directly
is error prone (to which I wholeheartedly agree).
Then you reject a "proper mechanism".....
Maybe you have a different notion of proper mechanism than everyone
else. BTW, you could easily have fixed your driver in the time you
spent trolling on lkml..
Shachar Shemesh
2003-05-08 19:10:21 UTC
Permalink
Post by Christoph Hellwig
Maybe you have a different notion of proper mechanism than everyone
else.
Out of personal interest - would a mechanism that promised the following
be considered a "proper mechanism"?
1. Work on all platforms.
2. Allow load and unload in arbitrary order and timings (which also
means "be race free").
3. Have low/zero overhead if not used

Would you also consider:
4. Have reasonable overhead when used
a "must have" demand? Or would, on the other hand:
4b. Have zero overhead when used for functions not hooked
be an acceptable alternative demand?

I'm currently trying to work with some other subscribers of this list on
a design. Getting 1, 2 and 3 is a complicated enough task, of course. I
would like to hear estimates about inclusion chances should we manage to
come up with an implementation that lives up to all the above.

Thanks,
Shachar
--
Shachar Shemesh
Open Source integration consultant
Home page & resume - http://www.shemesh.biz/
Christoph Hellwig
2003-05-08 19:15:09 UTC
Permalink
Post by Shachar Shemesh
Post by Christoph Hellwig
Maybe you have a different notion of proper mechanism than everyone
else.
Out of personal interest - would a mechanism that promised the following
be considered a "proper mechanism"?
1. Work on all platforms.
2. Allow load and unload in arbitrary order and timings (which also
means "be race free").
3. Have low/zero overhead if not used
No, the most important point is that a proper mechanism wouldn't
replace syscall slots but rather operate on kernel objects (file, inode,
vma, task_struct, etc.). Linus has expressed a few times that
he has no interest in loadable syscalls and any core developer I've
talked to agrees with that.
J.A. Magallon
2003-05-08 21:48:11 UTC
Permalink
Post by Christoph Hellwig
Post by Shachar Shemesh
Post by Christoph Hellwig
Maybe you have a different notion of proper mechanism than everyone
else.
Out of personal interest - would a mechanism that promised the following
be considered a "proper mechanism"?
1. Work on all platforms.
2. Allow load and unload in arbitrary order and timings (which also
means "be race free").
3. Have low/zero overhead if not used
No, the most important point is that a proper mechanism wouldn't
replace syscall slots but rather operate on kernel objects (file, inode,
vma, task_struct, etc.). Linus has expressed a few times that
he has no interest in loadable syscalls and any core developer I've
talked to agrees with that.
I haven't followed the whole thread, so I don't know if somebody has already
said this, but this whole thing about hooks looks perfect for projects like
bproc or mosix. Have you talked to them?
(perhaps Erik Hendriks <***@hendriks.cx> -bproc- is following the thread...;) )
--
J.A. Magallon <***@able.es> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc1-jam2 (gcc 3.2.2 (Mandrake Linux 9.2 3.2.2-5mdk))
Muli Ben-Yehuda
2003-05-09 07:42:08 UTC
Permalink
Post by Christoph Hellwig
No, the most important point is that a proper mechanism wouldn't
replace syscall slots but rather operate on kernel objects (file, inode,
vma, task_struct, etc.). Linus has expressed a few times that
he has no interest in loadable syscalls and any core developer I've
talked to agrees with that.
For some usages, hijacking syscalls, and not kernel objects, is the
desired outcome. For example, ptrace is great for telling you what a
given process (or its children) did, but it's entirely inadequate for
telling you *which* process did something. Something, in this case,
which doesn't have an associated kernel object.

For example, a rogue process is calling settimeofday() on your router
once a month(!). How are you going to find it? There's no LSM hook for
settimeofday() or any other way to say "don't do that", if it's
running as root. Using syscalltrack, or anything else which hijacks
system calls, not just kernel objects, finding the culprit is trivial.

I've been staying out of this discussion, even though I have an
interest in its outcome. Talking about it is completely pointless
until someone writes a proper, *technically correct*, system call
hijacking interface. Then we can argue about whether or not it should
go in.
--
Muli Ben-Yehuda
http://www.mulix.org
Terje Eggestad
2003-05-08 09:58:33 UTC
Permalink
I guess something like this:

typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
void * arg4, void * arg5, void * arg6);

#define HOOK_IN_FLAG 0x1
#define HOOK_OUT_FLAG 0x2

/* Returns an opaque handle that is later passed to unregister(). */
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
int flags);
int unregister(int opaquehandle);

I'd make a stab at it if I knew that it stood a chance of getting
accepted.
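
As a rough illustration of the interface proposed above, here is a minimal user-space sketch of a hook table, not a kernel patch. It reuses the `syscall_hook_t` type and the two flags from the proposal. The fixed-size table, `MAX_HOOKS`, `run_syscall_hooks()` and the name `unregister_syscall_hook()` are assumptions made for the example, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*syscall_hook_t)(void *arg1, void *arg2, void *arg3,
                              void *arg4, void *arg5, void *arg6);

#define HOOK_IN_FLAG  0x1   /* run the hook before the system call */
#define HOOK_OUT_FLAG 0x2   /* run the hook after the system call returns */

#define MAX_HOOKS 16        /* assumption: fixed-size table for the sketch */

struct hook_entry {
    int in_use;
    int syscall_nr;
    int flags;
    syscall_hook_t fn;
};

static struct hook_entry hook_table[MAX_HOOKS];

/* Returns an opaque handle (>= 0) on success, -1 on failure. */
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function, int flags)
{
    if (hook_function == NULL || (flags & (HOOK_IN_FLAG | HOOK_OUT_FLAG)) == 0)
        return -1;
    for (int i = 0; i < MAX_HOOKS; i++) {
        if (!hook_table[i].in_use) {
            hook_table[i] = (struct hook_entry){ 1, syscall_nr, flags, hook_function };
            return i;   /* the table index doubles as the handle */
        }
    }
    return -1;          /* table full */
}

/* Returns 0 on success, -1 if the handle is not currently registered. */
int unregister_syscall_hook(int opaquehandle)
{
    if (opaquehandle < 0 || opaquehandle >= MAX_HOOKS ||
        !hook_table[opaquehandle].in_use)
        return -1;
    hook_table[opaquehandle].in_use = 0;
    return 0;
}

/* Runs every hook registered for syscall_nr in the given phase
 * (HOOK_IN_FLAG or HOOK_OUT_FLAG) and returns how many hooks ran. */
int run_syscall_hooks(int syscall_nr, int phase, void *args[6])
{
    int ran = 0;

    assert(args != NULL);
    for (int i = 0; i < MAX_HOOKS; i++) {
        const struct hook_entry *e = &hook_table[i];
        if (e->in_use && e->syscall_nr == syscall_nr && (e->flags & phase)) {
            e->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
            ran++;
        }
    }
    return ran;
}

/* Example hook that only counts its invocations. */
static int example_calls;
static int example_hook(void *a1, void *a2, void *a3,
                        void *a4, void *a5, void *a6)
{
    (void)a1; (void)a2; (void)a3; (void)a4; (void)a5; (void)a6;
    example_calls++;
    return 0;
}
```

A caller would register `example_hook` for one call number with `HOOK_IN_FLAG`, have the dispatcher call `run_syscall_hooks()` once before and once after the real call, and pass the returned handle to `unregister_syscall_hook()` on unload. Ordering between multiple users, which Chuck Ebbert asks for in the quoted text, is not handled here: hooks simply run in table order.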

TJ
Post by Chuck Ebbert
Post by Arjan van de Ven
Post by petter wahlman
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
those obviously need to be implemented via the security subsystem (eg
LSM). Hooks are obviously the wrong level to do things and I could even
tell you that you cannot implement this right from a module actually.
What is really needed is some kind of proper generic hooking setup
that could be used both by LSM and other things. People doing this
may need to intercept syscalls both on their way to the kernel and
on the way back to userland (so they can see return codes.) They may
also need to say whether they want to be first or last if there are
multiple users of this facility.
But the real question is why the export of sys_call_table was so
gratuitously removed without any kind of replacement being offered.
And the attitude of the developers about it is truly awful. ("Oh, so
we broke the drivers you depend on for your livelihood? You can just
go get a new job -- pounding sand down a rathole.")
--
_________________________________________________________________________

Terje Eggestad mailto:***@scali.no
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
Chuck Ebbert
2003-05-08 14:08:37 UTC
Permalink
Post by v***@parcelfarce.linux.theplanet.co.uk
Post by Arjan van de Ven
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
I don't think it has.
I think it could, actually - who maintains fortunes these days? It's
a bit too long, though...
Wow, Advanced Sarcasm. Must be part of the Graduate program...

Meanwhile on Win2k I can intercept any IO request by
writing a filter driver, and that driver can get control on the way
back to userspace by registering a completion routine. Such filters
can be arbitrarily chained together and can be placed either above or
below an FSD, making such things as virus detection, HSM and disk
mirroring much easier to write...

How would I do this on Linux? How would virus detection and HSM
coexist? (HSM would have to be 'above' the virus detector, since it
makes no sense to try and scan a file that's been migrated until it
gets recalled back to disk.)
Christoph Hellwig
2003-05-08 14:36:22 UTC
Permalink
Post by Chuck Ebbert
Meanwhile on Win2k I can intercept any IO request by
writing a filter driver,
You can write a stackable filesystem on Linux, too, and intercept any
I/O request. You just have to do it through a sane interface, mount,
and not by patching the syscall table - which you can't do under
Windows either (at least not as part of the public API).
Jesse Pollard
2003-05-08 14:56:49 UTC
Permalink
Post by Chuck Ebbert
Post by v***@parcelfarce.linux.theplanet.co.uk
Post by Arjan van de Ven
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
I don't think it has.
I think it could, actually - who maintains fortunes these days? It's
a bit too long, though...
Wow, Advanced Sarcasm. Must be part of the Graduate program...
Meanwhile on Win2k I can intercept any IO request by
writing a filter driver, and that driver can get control on the way
back to userspace by registering a completion routine. Such filters
can be arbitrarily chained together and can be placed either above or
below an FSD, making such things as virus detection, HSM and disk
mirroring much easier to write...
Note the key word in the phrase "filter DRIVER". Linux modules can intercept
any I/O directed toward them, and the filesystem layer can intercept any
filesystem call. And there are filesystem modules.

M$ seems to treat everything as a disk file (even "pipes" are implemented
as temporary files).

Have you tried catching the display IO ???

HSM has existed on UNIX based machines for a long time.
Post by Chuck Ebbert
How would I do this on Linux? How would virus detection and HSM
coexist? (HSM would have to be 'above' the virus detector, since it
makes no sense to try and scan a file that's been migrated until it
gets recalled back to disk.)
I would expect the same way the NFS module intercepts file system calls.

There is NO reason a custom filesystem cannot be layered over other
filesystems. It might not be done today (though the references to "userfs"
keep showing up in such discussions).

I do question the validity of virus detection though. Once examined, fix the
vulnerability. No more virus.

Virus detection can never be completely done. And it imposes a constantly
increasing overhead since you must be able to identify all pre-existing
viruses. This list of "pre-existing" will be constantly growing.

Fix the vulnerability. Then there won't be a virus.
Alan Cox
2003-05-08 15:22:22 UTC
Permalink
Post by Jesse Pollard
M$ seems to treat everything as a disk file (even "pipes" are implemented
as temporary files).
So did original Unix, it was a disk file that was anonymous and just
used the direct pointers to blocks for a ring buffer. Storing pipe data
in RAM in the old days was a hideous waste of resources.
Post by Jesse Pollard
There is NO reason a custom filesystem cannot be layered over other
filesystems. It might not be done today (though the references to "userfs"
keep showing up in such discussions).
Erez Zadoz (not sure of the spelling) did some stacking fs modules on
Linux
Post by Jesse Pollard
Fix the vulnerability. Then there won't be a virus.
But you don't know if it's fixed and if there are any more holes without
being able to detect attackers be they electronic or human.
William Stearns
2003-05-08 17:02:21 UTC
Permalink
Good day, all,
Post by Alan Cox
Post by Jesse Pollard
There is NO reason a custom filesystem cannot be layered over other
filesystems. It might not be done today (though the references to "userfs"
keep showing up in such discussions).
Erez Zadoz (not sure of the spelling) did some stacking fs modules on
Linux
Erez Zadok maintains the FiST (File System Translator) project at
http://www1.cs.columbia.edu/~ezk/research/fist/ . For those not familiar
with the project, one writes an upper level filesystem that can modify VFS
requests or results, providing a VFS proxy.
Cheers,
- Bill

---------------------------------------------------------------------------
"All programs evolve until they can send email."
-- Richard Letts
"Except Microsoft Exchange."
-- Art
(found on the Snort web site)
--------------------------------------------------------------------------
William Stearns (***@pobox.com). Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
Linux articles at: http://www.opensourcedigest.com
--------------------------------------------------------------------------
Jesse Pollard
2003-05-08 18:28:05 UTC
Permalink
On Thursday 08 May 2003 10:22, Alan Cox wrote:
[snip]
Post by Alan Cox
Post by Jesse Pollard
Fix the vulnerability. Then there won't be a virus.
But you don't know if it's fixed and if there are any more holes without
being able to detect attackers be they electronic or human.
Detecting attackers is a different situation. An attack that is already fixed
is not a serious problem other than bandwidth. Virus scanners can't do that
anyway - they can only detect what has already been detected... and which
should have been fixed by the time the signature could have been put out,
anyway. Detection should be part of an intrusion facility (isn't LIDS supposed
to do that?)

Second, I want to set up SELinux to sandbox various facilities anyway (delayed
due to job change). That should isolate any unknown attack to just one
service, and protect the overall system.
Alan Cox
2003-05-08 14:42:51 UTC
Permalink
Post by Chuck Ebbert
How would I do this on Linux? How would virus detection and HSM
coexist? (HSM would have to be 'above' the virus detector, since it
makes no sense to try and scan a file that's been migrated until it
gets recalled back to disk.)
Userspace
--- ptrace
VFS
Loadable file system module (which can be made to stack stuff)
Block Layer
Loadable disk driver (Which can be made to stack)
Disk
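
The layering in the list above can be pictured with a small user-space sketch: a request is handed down a chain of layers, and each layer sees it again on the way back up, much like the completion routines mentioned earlier in the thread. This is only an illustration of the ordering, not the Linux VFS or block-layer API, and `struct layer`, `submit()` and `append()` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One layer in a stack; "lower" points at the next layer down. */
struct layer {
    const char *name;
    const struct layer *lower;   /* NULL at the bottom of the stack */
};

/* Append s to buf without writing past cap bytes (truncates if full). */
static void append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf);

    if (used < cap)
        snprintf(buf + used, cap - used, "%s", s);
}

/* Pass a request down the stack, recording "name>" on the way down and
 * "<name" on the way back up. Returns the status from the layers below. */
static int submit(const struct layer *l, char *trace, size_t cap)
{
    int status;

    assert(trace != NULL);
    if (l == NULL)
        return 0;                 /* below the bottom layer: request done */

    append(trace, cap, l->name);  /* work done before passing the request on */
    append(trace, cap, ">");
    status = submit(l->lower, trace, cap);
    append(trace, cap, "<");      /* completion work on the way back up */
    append(trace, cap, l->name);
    return status;
}
```

With an "hsm" layer stacked over a "scan" layer over an "fs" layer, the upper layer sees the request first and its completion last, which is the ordering Chuck Ebbert asks for when he wants HSM "above" the virus detector.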
Chuck Ebbert
2003-05-08 19:43:37 UTC
Permalink
Post by Christoph Hellwig
you can write a stackable filesystem on linux, too and intercept any
I/O request. You just have to do it through a sane interface, mount
and not by patching the syscall table - which you can't do under
windows either. (at least not as part of the public API).
So when I register my filesystem, can I indicate that I want to be
layered over top of the ext3 driver and get control anytime someone
mounts an ext3 filesystem, so I can decide whether the volume being
mounted is one that I want to intercept open/read/write requests for?
Christoph Hellwig
2003-05-08 19:48:26 UTC
Permalink
Post by Chuck Ebbert
Post by Christoph Hellwig
you can write a stackable filesystem on linux, too and intercept any
I/O request. You just have to do it through a sane interface, mount
and not by patching the syscall table - which you can't do under
windows either. (at least not as part of the public API).
So when I register my filesystem, can I indicate that I want to be
layered over top of the ext3 driver
Yes.
Post by Chuck Ebbert
and get control anytime someone
mounts an ext3 filesystem,
No.
Alan Cox
2003-05-08 21:44:18 UTC
Permalink
Post by Chuck Ebbert
Post by Christoph Hellwig
you can write a stackable filesystem on linux, too and intercept any
I/O request. You just have to do it through a sane interface, mount
and not by patching the syscall table - which you can't do under
windows either. (at least not as part of the public API).
So when I register my filesystem, can I indicate that I want to be
layered over top of the ext3 driver and get control anytime someone
mounts an ext3 filesystem, so I can decide whether the volume being
mounted is one that I want to intercept open/read/write requests for?
That would assume you had a right to dictate that the administrator
couldn't mount other file systems without your stacking.
Chuck Ebbert
2003-05-08 19:43:38 UTC
Permalink
Post by Jesse Pollard
Have you tried catching the display IO ???
Not in a million years -- display drivers work by pure magic AFAIC.
Post by Jesse Pollard
HSM has existed on UNIX based machines for a long time.
Show me three HSM implementations for Linux and I'll show you three
different mechanisms. :)
Christoph Hellwig
2003-05-08 19:58:37 UTC
Permalink
Post by Chuck Ebbert
Post by Jesse Pollard
HSM has existed on UNIX based machines for a long time.
Show me three HSM implementations for Linux and I'll show you three
different mechanisms. :)
http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.4-xfs/linux/fs/xfs/dmapi/

for the XFS dmapi implementation. Both SGI and IBM will sell you
full-fledged HSM implementations built on top of that.
Chuck Ebbert
2003-05-08 19:43:39 UTC
Permalink
Post by Alan Cox
Userspace
--- ptrace
Ptrace appears to be effectively broken on 2.4.21-rc -- I can't strace
child processes that fork even as root, anyway.
Post by Alan Cox
Block Layer
Loadable disk driver (Which can be made to stack)
I'm sorry but I've been looking at the md code for about six months
and the 'big picture' of how it's doing what it does escapes me. The
code in md.c:lock_rdev(), for example -- looks like an incredibly deep
understanding of how all the block code works is needed to write a
driver like this.