Discussion:
Kernel Development & Objective-C
Ben Crowhurst
2007-11-29 12:14:16 UTC
Permalink
Has Objective-C ever been considered for kernel development?

regards,
BPC
Xavier Bestel
2007-11-30 10:02:40 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
KOSAKI Motohiro
2007-11-30 10:09:45 UTC
Permalink
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
Xavier Bestel
2007-11-30 10:20:13 UTC
Permalink
Post by KOSAKI Motohiro
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
I heard of a bash compiler. That would enable development time
rationalization and maximize the collaborative convergence of a
community-oriented synergy.
Jan Engelhardt
2007-11-30 10:54:25 UTC
Permalink
Post by Xavier Bestel
Post by KOSAKI Motohiro
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
I heard of a bash compiler. That would enable development time
rationalization and maximize the collaborative convergence of a
community-oriented synergy.
Fortran90 it has to be.
David Newall
2007-11-30 14:21:58 UTC
Permalink
Post by Jan Engelhardt
Post by Xavier Bestel
Post by KOSAKI Motohiro
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
I heard of a bash compiler. That would enable development time
rationalization and maximize the collaborative convergence of a
community-oriented synergy.
Fortran90 it has to be.
It used to be written in BCPL; or was that Multics?
Bill Davidsen
2007-11-30 23:31:34 UTC
Permalink
Post by David Newall
Post by Jan Engelhardt
Post by Xavier Bestel
Post by KOSAKI Motohiro
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
I heard of a bash compiler. That would enable development time
rationalization and maximize the collaborative convergence of a
community-oriented synergy.
Fortran90 it has to be.
It used to be written in BCPL; or was that Multics?
BCPL was typeless, as was the successor B (between Bell Labs and GE we
wrote thousands of lines of B, ported to 8080, GE600, etc). C introduced
types, and the rest is history. Multics was written in PL/1, and I wrote
a lot of PL/1 subset G back when as well. You don't know slow compiles
until you've used a seven-pass compiler with each pass on a floppy.
--
Bill Davidsen <***@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Alan Cox
2007-11-30 23:40:13 UTC
Permalink
Post by Bill Davidsen
BCPL was typeless, as was the successor B (between Bell Labs and GE we
B isn't quite typeless. It has minimal inbuilt support for concepts like
strings (although you can of course multiply a string by an array
pointer ;))

It also had some elegances that C lost, notably

case 1..5:

the ability to do non-zero-based arrays

x[40];
x-=10;

and the ability to reassign function names.

printk = wombat;

as well as stuff like free(function);

Alan (who learned B before C, and is still waiting for P)
Arnaldo Carvalho de Melo
2007-12-01 00:05:45 UTC
Permalink
Post by Alan Cox
Post by Bill Davidsen
BCPL was typeless, as was the successor B (between Bell Labs and GE we
B isn't quite typeless. It has minimal inbuilt support for concepts like
strings (although you can of course multiply a string by an array
pointer ;))
It also had some elegances that C lost, notably
Hey, the language we use, gcc's C, has this too 8-)

[***@doppio net-2.6.25]$ find . -name "*.c" | xargs grep 'case.\+\.\.' | wc -l
400
[***@doppio net-2.6.25]$ find . -name "*.c" | xargs grep 'case.\+\.\.' | head
./kernel/signal.c: default: /* this is just in case for now ... */
./kernel/audit.c: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
./kernel/timer.c: * well, in that case 2.2.x was broken anyways...
./arch/frv/kernel/traps.c: case TBR_TT_TRAP2 ... TBR_TT_TRAP126:
./arch/frv/kernel/ptrace.c: case 0 ... PT__END - 1:
./arch/frv/kernel/ptrace.c: case 0 ... PT__END-1:
./arch/frv/kernel/gdb-stub.c: case GDB_REG_GR(1) ... GDB_REG_GR(63):
[***@doppio net-2.6.25]$
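For anyone who has not seen the extension in action, a minimal sketch of gcc's case-range syntax (the message types and ranges here are invented for illustration):

static const char *classify(int msg_type)
{
        switch (msg_type) {
        case 0 ... 99:                  /* GNU C extension: inclusive range */
                return "reserved";
        case 100 ... 199:
                return "user message";
        default:
                return "unknown";
        }
}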

- Arnaldo
Bill Davidsen
2007-12-01 18:27:06 UTC
Permalink
Post by Alan Cox
Post by Bill Davidsen
BCPL was typeless, as was the successor B (between Bell Labs and GE we
B isn't quite typeless. It has minimal inbuilt support for concepts like
strings (although you can of course multiply a string by an array
pointer ;))
It also had some elegances that C lost, notably
the ability to do non-zero-based arrays
x[40];
x-=10;
Well, original C allowed you to do what you wanted with pointers (I used
to teach that back when K&R was "the" C manual). Now people whine about
having pointers outside the array, which is a crock in practice, as long
as you don't actually /use/ an out-of-range value.
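For illustration, the B-style biased array looks something like this in C (a sketch only; ISO C makes even forming the out-of-range pointer undefined, which is exactly the point being argued):

void bias_demo(void)
{
        int storage[40];        /* valid indices 0..39 */
        int *x = storage - 10;  /* biased base: x[10]..x[49] alias storage[0]..storage[39].
                                 * ISO C declares forming storage - 10 undefined behaviour,
                                 * even if x[0]..x[9] are never touched. */

        x[10] = 1;              /* in practice lands in storage[0]  */
        x[49] = 2;              /* in practice lands in storage[39] */
}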
Post by Alan Cox
and the ability to reassign function names.
printk = wombat;
I had forgotten that; the function name was actually a variable with the
entry point, it says so in section 3.11. And as I recall the code, arrays
were the same thing: a length-ten vector was actually the vector and a
variable with the address of the start. I was more familiar with the B
stuff; I wrote both the interpreter and the code generator+library for
the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
Post by Alan Cox
as well as stuff like free(function);
Alan (who learned B before C, and is still waiting for P)
I had the BCPL book still on the reference shelf in the office, along
with goodies like the four candidates to be Ada, and a TRAC manual. I
too expected the next language to be "P".
--
Bill Davidsen <***@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Alan Cox
2007-12-01 18:18:59 UTC
Permalink
Post by Bill Davidsen
Well, original C allowed you to do what you wanted with pointers (I used
to teach that back when K&R was "the" C manual). Now people whine about
having pointers outside the array, which is a crock in practice, as long
as you don't actually /use/ an out-of-range value.
Actually the standards had good reasons to bar this use, because many
runtime environments used segmentation and unsigned segment offsets. On a
286 you could get into quite a mess with out of array reference tricks.
Post by Bill Davidsen
variable with the address of the start. I was more familiar with the B
stuff, I wrote both the interpreter and the code generator+library for
the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
B on Honeywell L66, so that may well have been a relative of your code
generator ?

Bill Davidsen
2007-12-03 01:23:12 UTC
Permalink
Post by Alan Cox
Post by Bill Davidsen
Well, original C allowed you to do what you wanted with pointers (I used
to teach that back when K&R was "the" C manual). Now people whine about
having pointers outside the array, which is a crock in practice, as long
as you don't actually /use/ an out-of-range value.
Actually the standards had good reasons to bar this use, because many
runtime environments used segmentation and unsigned segment offsets. On a
286 you could get into quite a mess with out of array reference tricks.
Post by Bill Davidsen
variable with the address of the start. I was more familiar with the B
stuff, I wrote both the interpreter and the code generator+library for
the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
B on Honeywell L66, so that may well have been a relative of your code
generator ?
Probably the Bell Labs one. I did an optimizer on the Pcode which caught
jumps to jumps, then had separate 8080 and L66 code generators into GMAP
on the GE and the CP/M assembler or the Intel (ISIS) assembler for 8080.
There was also an 8085 code generator using the "ten undocumented
instructions" from the Dr Dobbs article. GE actually had a contract with
Intel to provide CPUs with those instructions, and we used them in the
Terminet(r) printers.

Those were the days ;-)
--
Bill Davidsen <***@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck


J.A. Magallón
2007-11-30 22:52:32 UTC
Permalink
Post by KOSAKI Motohiro
Post by Xavier Bestel
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
Flash

http://www.lagmonster.info/humor/windowsrg.html

--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Loïc Grenié
2007-11-30 10:29:55 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
regards,
BPC
No, it has not. Any language that looks remotely like an OO language
has never been considered for (Linux) kernel development, nor for
most, if not all, other operating system kernels.

Various problems occur in an object oriented language. One of them
is garbage collection: it provokes asynchronous delays and, during
an interrupt or a system call for a real time task, the kernel cannot
wait. Another is memory overhead: all the magic that OO languages
provide takes space in memory, and the Linux kernel is used in embedded
systems with very tight memory requirements.

Lots of people will think of better reasons why ObjC is not used...

Loïc Grenié
Ben Crowhurst
2007-11-30 11:16:14 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
regards,
BPC
No, it has not. Any language that looks remotely like an OO language
has never been considered for (Linux) kernel development, nor for
most, if not all, other operating system kernels.
Various problems occur in an object oriented language. One of them
is garbage collection: it provokes asynchronous delays and, during
an interrupt or a system call for a real time task, the kernel cannot
wait.
Objective-C 1.0 neither forces nor has garbage collection.
Another is memory overhead: all the magic that OO languages
provide takes space in memory, and the Linux kernel is used in embedded
systems with very tight memory requirements.
But are embedded systems not rapidly moving on? Turning to stare at the
ADSL X6 modem with MBs of RAM.
Lots of people will think of better reasons why ObjC is not used...
Loïc Grenié
Which I'm looking forward to hearing :)

Thank you for your appropriate response.

--

Regards
BPC
Karol Swietlicki
2007-11-30 11:36:49 UTC
Permalink
Post by Ben Crowhurst
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
<snip>
Post by Ben Crowhurst
Lots of people will think of better reasons why ObjC is not used...
Post by Ben Crowhurst
Loïc Grenié
Which I'm looking forward to hearing :)
Thank you for your appropriate response.
Here are a few reasons off the top of my head:
1. Adding extra unneeded complexity. Debugging would be harder.
2. Not many people can code ObjC when compared to the number of C coders.
3. If it ain't broken... why fix it? The kernel works, right? Good.

You can find a great explanation somewhere out there; I'm not sure who
wrote it, but it explains why C++ is not a great choice
for the Linux kernel. Some things going against C++ will also go
against ObjC. I cannot find it, but it is out there somewhere.

I'm a newbie and I might be wrong, but the above is what I believe to be true.

Karol Swietlicki
Lennart Sorensen
2007-11-30 14:37:13 UTC
Permalink
Post by Ben Crowhurst
But are embedded systems not rapidly moving on? Turning to stare at the
ADSL X6 modem with MBs of RAM.
Some embedded systems run on batteries, so the less RAM they have to
power the better, and the fewer CPU cycles they have to spend executing
code the less power they consume. An ADSL modem on your desk doesn't
have any of those worries; it just has to work, and if doubling the RAM
cuts the development problems by a lot, then that might have been a
worthwhile trade-off.

--
Len Sorensen
Rogelio M. Serrano Jr.
2007-12-08 08:54:11 UTC
Permalink
Post by Ben Crowhurst
Post by Loïc Grenié
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
regards,
BPC
I have tried it in a toy kernel, OSKit style. The code reuse is very
high, especially with string ops and driver interfaces. It's also very easy
to do unit testing with. My main problem was the quality of the compiler
optimization: it's just not good enough. I think if the compiler can do
the right kind of optimizations correctly, then a low-overhead OO
language like Objective-C can be used in a kernel.

On the other hand, it's the automated testing part that really matters for
me. Imagine adding features to Linux week after week without ever
getting a serious panic or two, and then getting a big performance boost
whenever the compiler does more and more optimizations correctly.
Post by Ben Crowhurst
Post by Loïc Grenié
No, it has not. Any language that looks remotely like an OO language
has never been considered for (Linux) kernel development, nor for
most, if not all, other operating system kernels.
Various problems occur in an object oriented language. One of them
is garbage collection: it provokes asynchronous delays and, during
an interrupt or a system call for a real time task, the kernel cannot
wait.
Objective-C 1.0 neither forces nor has garbage collection.
True.
Post by Ben Crowhurst
Post by Loïc Grenié
Another is memory overhead: all the magic that OO languages
provide takes space in memory, and the Linux kernel is used in embedded
systems with very tight memory requirements.
But are embedded systems not rapidly moving on? Turning to stare at
the ADSL X6 modem with MBs of RAM.
It's all about optimizations.
--
Democracy is about two wolves and a sheep deciding what to eat for dinner.
J.A. Magallón
2007-11-30 23:19:50 UTC
Permalink
On Fri, 30 Nov 2007 11:29:55 +0100, "Loïc Grenié" <loic.greni...> wrote:
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
regards,
BPC
Well, I really would like to learn some things here, could we
keep this off-topic thread alive just a bit, please?
(I know, I'm going to gain a troll's fame because I can't avoid these
discussions, it's one of my secret vices...)
No, it has not. Any language that looks remotely like an OO language
has never been considered for (Linux) kernel development, nor for
most, if not all, other operating system kernels.
I think BeOS was C++ and OSX is C+ObjectiveC (and runs on an iPhone).
Original MacOS (from 6 to 9) was Pascal (and a Mac SE was very near
to embedded hardware :) ).

I do not advocate rewriting Linux in C++, but don't say a kernel written
in C++ cannot be efficient.
Various problems occur in an object oriented language. One of them
is garbage collection: it provokes asynchronous delays and, during
an interrupt or a system call for a real time task, the kernel cannot
wait.
C++ (and, from what I read in another answer, neither has Objective-C) has no garbage
collection. It does not do anything you did not tell it to do. It just allows
you to change this

struct buffer *x;
x = kmalloc(...)
x->sz = 128
x->buff = kmalloc(...)
...
kfree(x->buff)
kfree(x)

to

struct buffer *x;
x = new buffer(128);   (which itself allocates x->buff,
                        because _you_ programmed it,
                        so you, poor programmer, don't forget)
...
delete x;              (which was also programmed to deallocate
                        x->buff itself, so you have one less
                        memory leak to worry about)
Another is memory overhead: all the magic that OO languages
provide takes space in memory, and the Linux kernel is used in embedded
systems with very tight memory requirements.
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ generates for

thing->do_something(with,this)

like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time

is much more efficient than what gcc can mangle to do with

thing->do_something(with,this,thing)

push with
push this
push thing
get thing+offsetof(do_something) // not constant at compile time
dereference it
call it

(that is, get a generic field of a structure and use it as a jump address)

In short, the kernel is object oriented, implements OO programming by
hand, but the compiler lacks the knowledge that it is object-oriented
programming, so it could do some optimizations.
Lots of people will think of better reasons why ObjC is not used...

People usually complain about RTTI or exceptions, but the benefits versus
memory space should be seriously considered (surely there is something
in current drivers to ask 'are you a SATA or an IDE disk?').


--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Nicholas Miell
2007-11-30 23:53:38 UTC
Permalink
Post by J.A. Magallón
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ generates for

thing->do_something(with,this)

like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time

is much more efficient than what gcc can mangle to do with

thing->do_something(with,this,thing)

push with
push this
push thing
get thing+offsetof(do_something) // not constant at compile time
dereference it
call it

(that is, get a generic field of a structure and use it as a jump address)
Post by J.A. Magallón

In short, the kernel is object oriented, implements OO programming by
hand, but the compiler lacks the knowledge that it is object-oriented
programming, so it could do some optimizations.
struct test;
struct testVtbl
{
        int (*fn1)(struct test *t, int x, int y);
        int (*fn2)(struct test *t, int x, int y);
};
struct test
{
        struct testVtbl *vtbl;
        int x, y;
};
void testCall(struct test *t, int x, int y)
{
        t->vtbl->fn1(t, x, y);
        t->vtbl->fn2(t, x, y);
}

and

struct test
{
        virtual int fn1(int x, int y);
        virtual int fn2(int x, int y);

        int x, y;
};

void testCall(struct test *t, int x, int y)
{
        t->fn1(x, y);
        t->fn2(x, y);
}

generate instruction-for-instruction identical code.

--
Nicholas Miell <***@comcast.net>
Al Viro
2007-12-01 00:31:19 UTC
Permalink
Post by J.A. Magallón
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ generates for
thing->do_something(with,this)
like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time
This is not what vtables are. Think for a minute - all codepaths arriving
to that point in your code will pick the address to call from the same
location. Either the contents of that location is constant (in which case
you could bloody well call it directly in the first place) *or* it has to
somehow be reassigned back and forth, according to the value of this. The
former is dumb, the latter - outright insane.

The contents of vtables is constant. The whole point of that thing is
to deal with the situations where we _can't_ tell which derived class
this ->do_something() is from; if we could tell which vtable it is at
compile time, we wouldn't need to bother at all.

It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
fetch method from vtable) for not having to store a slew of method pointers
in each instance of base class. But the extra memory access is very much
there. It can be further optimized away if you have several method calls
for the same object next to each other (then vtable can be picked once),
but it's still done at runtime.
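Concretely, in kernel-style C the same tradeoff looks something like this (illustrative names only, not taken from any real driver):

struct foo;

/* Alternative 1: a slew of method pointers stored in each instance.
 * No second dereference per call, but every object grows by one
 * pointer per method. */
struct foo_fat {
        int (*open)(struct foo_fat *f);
        int (*close)(struct foo_fat *f);
        int state;
};

/* Alternative 2: one shared, constant ops table per "class".
 * Each instance stores a single pointer; each call pays the extra
 * fetch of ->ops before the method pointer itself. */
struct foo_ops {
        int (*open)(struct foo *f);
        int (*close)(struct foo *f);
};

struct foo {
        const struct foo_ops *ops;
        int state;
};

static inline int foo_open(struct foo *f)
{
        return f->ops->open(f);         /* vtable-style double fetch at runtime */
}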
Al Viro
2007-12-01 00:34:46 UTC
Permalink
Post by Al Viro
somehow be reassigned back and forth, according to the value of this. The
s/this/thing/, of course
J.A. Magallón
2007-12-01 01:09:34 UTC
Permalink
Post by Al Viro
Post by J.A. Magallón
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ generates for
thing->do_something(with,this)
like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time
This is not what vtables are. Think for a minute - all codepaths arriving
to that point in your code will pick the address to call from the same
location. Either the contents of that location is constant (in which case
you could bloody well call it directly in the first place) *or* it has to
somehow be reassigned back and forth, according to the value of this. The
former is dumb, the latter - outright insane.
The contents of vtables is constant. The whole point of that thing is
to deal with the situations where we _can't_ tell which derived class
this ->do_something() is from; if we could tell which vtable it is at
compile time, we wouldn't need to bother at all.
Yup, my mistake (that's why I said I would learn something). I was thinking
of non-virtual methods. For virtual ones you have to fetch the vtable
start address and index from it.
Post by Al Viro
It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
fetch method from vtable) for not having to store a slew of method pointers
in each instance of base class. But the extra memory access is very much
there. It can be further optimized away if you have several method calls
for the same object next to each other (then vtable can be picked once),
but it's still done at runtime.
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Avi Kivity
2007-12-01 19:55:59 UTC
Permalink
Post by Al Viro
Post by J.A. Magallón
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ generates for
thing->do_something(with,this)
like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time
This is not what vtables are. Think for a minute - all codepaths arriving
to that point in your code will pick the address to call from the same
location. Either the contents of that location is constant (in which case
you could bloody well call it directly in the first place) *or* it has to
somehow be reassigned back and forth, according to the value of this. The
former is dumb, the latter - outright insane.
The contents of vtables is constant. The whole point of that thing is
to deal with the situations where we _can't_ tell which derived class
this ->do_something() is from; if we could tell which vtable it is at
compile time, we wouldn't need to bother at all.
It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
fetch method from vtable) for not having to store a slew of method pointers
in each instance of base class. But the extra memory access is very much
there. It can be further optimized away if you have several method calls
for the same object next to each other (then vtable can be picked once),
but it's still done at runtime.
True. C++ vtables have no performance advantage over C ->ops->function()
calls. But they have no disadvantage either, and they do offer many
syntactic advantages (such as automatically casting the object type to
the *correct* derived class).

Avi Kivity
2007-12-04 21:10:55 UTC
Permalink
But kmalloc is implemented by the kernel. Who implements 'new'?
The kernel.

J.A. Magallón
2007-12-04 21:24:38 UTC
Permalink
Post by J.A. Magallón
I think BeOS was C++ and OSX is C+ObjectiveC (and runs on an iPhone).
Original MacOS (from 6 to 9) was Pascal (and a Mac SE was very near
to embedded hardware :) ).
I do not advocate rewriting Linux in C++, but don't say a kernel written
in C++ cannot be efficient.
Well I am pretty sure the micro kernel of OS X is in C, and certainly
the BSD layer is as well. So the only ObjC part would be the nextstep
framework and other parts of the Mac GUI and other Mac APIs they
provide, which all at some point probably end up calling down into the C
stuff below.
Yup, thanks.
Post by J.A. Magallón
C++ (and, from what I read in another answer, neither has Objective-C) has no garbage
collection. It does not do anything you did not tell it to do. It just allows
you to change this
struct buffer *x;
x = kmalloc(...)
x->sz = 128
x->buff = kmalloc(...)
...
kfree(x->buff)
kfree(x)
to
struct buffer *x;
x = new buffer(128);   (which itself allocates x->buff,
                        because _you_ programmed it,
                        so you, poor programmer, don't forget)
...
delete x;              (which was also programmed to deallocate
                        x->buff itself, so you have one less
                        memory leak to worry about)
But kmalloc is implemented by the kernel. Who implements 'new'?
Help yourself... as kmalloc() is a replacement for userspace glibc's
malloc, you can write your own replacements for the functions/operators in
libstdc++ (operators are just cosmetic, as are many other features in C++).
In fact, for someone who dared to write a kernel C++ framework, the
very first function he had to write could be something like:

void* operator new(size_t sz)
{
        return kmalloc(sz, GFP_KERNEL);
}

And he could write alternatives like

operator new(size_t sz, int flags)    ->  x = new(GFP_ATOMIC) X;

operator new(size_t sz, MemPool& pl)  ->  x = new(pool) X;

If you are curious, this page http://www.osdev.org/wiki/C_PlusPlus
has some clues about what you should implement to get rid of
libstdc++.

--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
Matti Aarnio
2007-11-30 11:37:50 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
regards,
BPC
To my recollection: never.

Some limited subset of C++ was tried, but was soon abandoned.

Overall, the kernel data structures are done in an objectish manner,
although there are no strong type mechanisms being used.

Could the kernel be written in a limited subset[*] of ObjC ? Very likely.
Would it be worth the effort ? A radical decrease in the number of available
programmers...

*) Subset as in enforcing the rule of not even indirectly using dynamic
memory allocation when operating in interrupt state.

/Matti Aarnio
Lennart Sorensen
2007-11-30 14:34:45 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Doesn't Objective-C essentially require a runtime to provide a lot of
the features of the language? If it does (as I suspect) then it is
totally unsuitable for kernel development.

That and object-oriented languages in general are badly designed and a
bad idea. Having not used Objective-C I have no idea if it qualifies as
badly designed or not. Certainly C++ and Java are both very badly
designed.

Besides, the kernel does a wonderful job of doing object-oriented design
where appropriate using C, without any of the stupidities added by the
common OO languages.

--
Len Sorensen
Kyle Moffett
2007-11-30 15:26:03 UTC
Permalink
Post by Lennart Sorensen
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Doesn't Objective-C essentially require a runtime to provide a lot
of the features of the language? If it does (as I suspect) then it
is totally unsuitable for kernel development.
That and object-oriented languages in general are badly designed
and a bad idea. Having not used Objective-C I have no idea if it
qualifies as badly designed or not. Certainly C++ and Java are
both very badly designed.
Objective-C is actually a pretty minimal wrapper around C; it was
originally implemented as a C preprocessor. It generally does not
have any kind of memory management, garbage collection, or anything
else (although typically a "runtime" will provide those features).
There are no first-class exceptions, so there would be nothing to
worry about there (the exceptions used in GUI programs are built
around the setjmp/longjmp primitives). Objective-C is also almost
completely backwards-compatible with C, much more so than C++ ever
was. As far as the runtime goes the kernel would be expected to
write its own, the same way that it implements "kmalloc()" as part of
a "C runtime". Since the runtime itself never does any implicit
memory allocation, I think it would conceivably even be relatively
safe for kernel usage.

With that said, there is a significant performance penalty as all
Objective-C method calls are looked up symbolically at runtime for
every single call. For GUI programs where large chunks of the code
are event-loops and not performance-sensitive that provides a huge
amount of extra flexibility. In the kernel though, there are many
codepaths where *every* *single* instruction counts; that could be a
serious performance hit.

Cheers,
Kyle Moffett
H. Peter Anvin
2007-11-30 18:40:07 UTC
Permalink
Post by Kyle Moffett
With that said, there is a significant performance penalty as all
Objective-C method calls are looked up symbolically at runtime for every
single call.
GACK!

At least C++ has vtables.

-hpa
Kyle Moffett
2007-11-30 19:35:34 UTC
Permalink
Post by H. Peter Anvin
Post by Kyle Moffett
With that said, there is a significant performance penalty as all
Objective-C method calls are looked up symbolically at runtime for
every single call.
GACK!
At least C++ has vtables.
In a tight loop there is a way to do a single symbolic lookup and
just call directly through a function pointer, but typically it isn't
necessary for GUI programs and the like. The flexibility of being
able to dynamically add new methods to an existing class (at least
for desktop user interfaces) significantly outweighs the performance
cost. Any performance-sensitive code is typically written in
straight C anyways.
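The plain-C analogue of that single-lookup trick is simply hoisting the looked-up pointer out of the loop; a small sketch with invented names:

struct pixel_ops {
        unsigned char (*blend)(unsigned char dst, unsigned char src);
};

void blend_row(const struct pixel_ops *ops,
               unsigned char *dst, const unsigned char *src, int n)
{
        /* one indirect lookup before the loop instead of one per pixel */
        unsigned char (*blend)(unsigned char, unsigned char) = ops->blend;
        int i;

        for (i = 0; i < n; i++)
                dst[i] = blend(dst[i], src[i]);
}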

Cheers,
Kyle Moffett
Avi Kivity
2007-12-01 20:03:25 UTC
Permalink
In the kernel though, there are many codepaths where *every* *single*
instruction counts; that could be a serious performance hit.
Write *those* *codepaths* in *C* or *assembly*. But only after you
manage to measure a difference compared to the object-oriented systems
language.

[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Andi Kleen
2007-12-02 19:01:12 UTC
Permalink
Post by Avi Kivity
[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Networking, block IO, page fault, ... But only the fast paths in these
cases. A lot of the kernel is slow path code and could probably
be written even in an interpreted language without much trouble.

-Andi
Avi Kivity
2007-12-03 05:12:16 UTC
Permalink
Post by Andi Kleen
Post by Avi Kivity
[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Networking, block IO, page fault, ... But only the fast paths in these
cases. A lot of the kernel is slow path code and could probably
be written even in an interpreted language without much trouble.
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
million packets per second per core (does such a setup actually exist?)
you have a few thousand cycles per packet. For block I/O you'd need around
5,000 disks per core to reach such rates.

The real benefits aren't in keeping close to the metal, but in high
level optimizations. Ironically, these are easier when the code is a
little more abstracted. You can add quite a lot of instructions if it
allows you not to do some of the I/O at all.


Andi Kleen
2007-12-03 09:50:22 UTC
Permalink
Post by Avi Kivity
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
With 10Gbit/s ethernet working you start to care about every cycle.
Similar with highend routing or in some latency sensitive network
applications (e.g. in HPC). Another simple noticeable case is Unix
sockets and your X server communication.

And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)
Post by Avi Kivity
The real benefits aren't in keeping close to the metal, but in high
level optimizations. Ironically, these are easier when the code is a
little more abstracted. You can add quite a lot of instructions if it
allows you not to do some of the I/O at all.
While that's partly true -- cache misses are good for a lot of cycles --
it is not the whole truth and at some point raw code efficiency matters
too.

For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.

-Andi

Avi Kivity
2007-12-03 11:46:45 UTC
Permalink
Post by Andi Kleen
Post by Avi Kivity
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
With 10Gbit/s ethernet working you start to care about every cycle.
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Post by Andi Kleen
Similar with highend routing or in some latency sensitive network
applications (e.g. in HPC).
True. And here, the hardware can cut hundreds of cycles by avoiding the
kernel completely for the fast path.
Post by Andi Kleen
Another simple noticeable case is Unix
sockets and your X server communication.
Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.

Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.
Post by Andi Kleen
And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)
And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.
Post by Andi Kleen
Post by Avi Kivity
The real benefits aren't in keeping close to the metal, but in high
level optimizations. Ironically, these are easier when the code is a
little more abstracted. You can add quite a lot of instructions if it
allows you not to do some of the I/O at all.
While that's partly true -- cache misses are good for a lot of cycles --
it is not the whole truth and at some point raw code efficiency matters
too.
For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.
That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.

If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.

Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
--
error compiling committee.c: too many arguments to function

Andi Kleen
2007-12-03 11:50:10 UTC
Permalink
Post by Avi Kivity
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Both high and low level optimizations are needed for good performance.
Post by Avi Kivity
Post by Andi Kleen
Similar with highend routing or in some latency sensitive network
applications (e.g. in HPC).
True. And here, the hardware can cut hundreds of cycles by avoiding the
kernel completely for the fast path.
A lot of applications don't and the user space networking schemes
tend to have their own drawbacks anyways.
Post by Avi Kivity
Post by Andi Kleen
Another simple noticeable case is Unix
sockets and your X server communication.
Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.
That's not about mouse/keyboard input, but about all X protocol communication
between X clients and the X server. The key is not large copies here
anyways (large data is put into shm) but latency.
Post by Avi Kivity
And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.
That's not the whole story, no. Batching etc. are needed, but the
faster instruction sequences are needed too.
Post by Avi Kivity
Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
It depends. Often high-level (and then caching) optimizations are a better
bang for the buck, but completely disregarding the fast path work is a bad
thing too. As an example see Christoph's recent work on the slub fastpath,
which makes a quite measurable difference on benchmarks.


-Andi

Willy Tarreau
2007-12-03 21:13:53 UTC
Permalink
Post by Avi Kivity
Post by Andi Kleen
Post by Avi Kivity
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
With 10Gbit/s ethernet working you start to care about every cycle.
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to lookup a
forwarding table and perform a few MMIO on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.
Post by Avi Kivity
Post by Andi Kleen
Similar with highend routing or in some latency sensitive network
applications (e.g. in HPC).
True. And here, the hardware can cut hundreds of cycles by avoiding the
kernel completely for the fast path.
Post by Andi Kleen
Another simple noticeable case is Unix
sockets and your X server communication.
Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.
It just depends how many times a second it happens. For instance, consider
this trivial loop (fct is an array of two functions which just return 1 or 2):

i = 0;
for (j = 0; j < (1 << 28); j++) {
        k = (j >> 8) & 1;
        i += fct[k]();
}

It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
changing the function once every 256 calls, you change it on every call:

i = 0;
for (j = 0; j < (1 << 28); j++) {
        k = (j >> 0) & 1;
        i += fct[k]();
}

Then it takes 4.3 seconds, which is about 3 times slower. The number
of calls per function remains the same (128M calls each), it's just the
branch prediction which is wrong every time. The very few nanoseconds added
at each call are enough to slow down a program from 1.6 to 4.3 seconds while
it executes the exact same code (it may even save one shift). If you have
such stupid code, say, to compute the color or alpha of each pixel in an
image, you will certainly notice the difference.

And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.
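For anyone who wants to reproduce the measurement, a self-contained version of the same benchmark (a sketch; absolute timings of course depend on the CPU and compiler flags). Run it once with no argument and once with any argument to compare the well-predicted and mispredicted cases:

#include <stdio.h>

static int one(void) { return 1; }
static int two(void) { return 2; }

static int (*fct[2])(void) = { one, two };

int main(int argc, char **argv)
{
        /* no argument: target changes every 256 calls (well predicted)    */
        /* any argument: target changes on every call (mispredicted a lot) */
        int shift = (argc > 1) ? 0 : 8;
        unsigned long sum = 0, j;

        for (j = 0; j < (1UL << 28); j++)
                sum += fct[(j >> shift) & 1]();

        printf("sum = %lu\n", sum);     /* keep the loop from being optimized away */
        return 0;
}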
Post by Avi Kivity
Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
Post by Avi Kivity
Post by Andi Kleen
And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)
And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.
Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed. The only true solution is to create better
algorithms, but you will find even fewer people capable of creating efficient
algorithms than you will find capable of coding correctly.
Post by Avi Kivity
Post by Andi Kleen
For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.
That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.
If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.
It depends which type of I/O. If the I/O is non-blocking, you end up doing
something else instead of actively burning cycles.
Post by Avi Kivity
Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
I do not agree. It's not uncommon to find 2- or 3-fold performance factors
between equivalent components when one is carefully optimized and the other
one is not. Granted it takes an awful lot of time doing all those nano-opts
at the beginning, but the more you learn about how the hardware reacts to
your code, the more efficiently you write future code, with the least bloat.
End users notice bloat a lot (especially when CPU and RAM are excessively
wasted).

Best regards,
Willy

J.A. Magallón
2007-12-03 21:39:02 UTC
Permalink
On Mon, 3 Dec 2007 22:13:53 +0100, Willy Tarreau <***@1wt.eu> wrote:

...
Post by Willy Tarreau
It just depends how many times a second it happens. For instance, consider
i = 0;
for (j = 0; j < (1 << 28); j++) {
k = (j >> 8) & 1;
i += fct[k]();
}
It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
i = 0;
for (j = 0; j < (1 << 28); j++) {
k = (j >> 0) & 1;
i += fct[k]();
}
Then it takes 4.3 seconds, which is about 3 times slower. The number
of calls per function remains the same (128M calls each), it's just the
branch prediction which is wrong every time. The very few nanoseconds added
at each call are enough to slow down a program from 1.6 to 4.3 seconds while
it executes the exact same code (it may even save one shift). If you have
such stupid code, say, to compute the color or alpha of each pixel in an
image, you will certainly notice the difference.
And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.
...
Post by Willy Tarreau
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
But don't forget that OOP is just another way to organize your code,
and to let the language/compiler do some things you shouldn't be doing
by hand, like filling in a vtable pointer, that are error prone.

And of course everything depends on what language you choose and how
you use it.
You could write an equally efficient kernel in languages like C++,
using C++ abstractions as a high-level organization, where
the fast paths could be coded the right way; we are not talking about
C# or Java, where even a sum is a call to an overloaded method.
It's the difference between doing schoolbook pushes and pops on lists,
and suddenly inventing the splice operator...

--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
Alan Cox
2007-12-03 21:57:27 UTC
Permalink
Post by J.A. Magallón
You could write an equally efficient kernel in languages like C++,
using C++ abstractions as a high-level organization, where
It's very very hard to generate good C++ code because of the numerous ways
objects get temporarily created, and the weak aliasing rules (as with C).

There are reasons that Fortran lives on (and no I'm not suggesting one
should rewrite the kernel in Fortran ;)) and the fact it's not really got
pointer aliasing or "address of" operators, and all the resulting
optimisation problems, is one of the big ones.

Alan
J.A. Magallón
2007-12-04 21:47:45 UTC
Permalink
Post by Alan Cox
Post by J.A. Magallón
You could write an equally efficient kernel in languages like C++,
using C++ abstractions as a high-level organization, where
It's very very hard to generate good C++ code because of the numerous ways
objects get temporarily created, and the weak aliasing rules (as with C).
That is what I like about C++: with good placement of high-level features
like consts and & (references) one can gain fine control over what
gets copied or not.
Try to write a Vector class that does ops with SSE without storing
temporaries on the stack. It's a good example of how one can get
low-level control, and gcc is pretty good at simplifying things like
u=v+2*w and not putting anything on the stack, all in xmm registers.

The advantage is you only have to be careful once, when you write
the class.
Post by Alan Cox
There are reasons that Fortran lives on (and no I'm not suggesting one
should rewrite the kernel in Fortran ;)) and the fact its not really got
pointer aliasing or "address of" operators and all the resulting
optimsation problems is one of the big ones.
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
Diego Calleja
2007-12-04 22:20:45 UTC
Permalink
That is what I like about C++: with good placement of high-level features
like consts and & (references) one can gain fine control over what
gets copied or not.
But... if there's some way Linux can get "language improvements", it is with
new C standards/gcc extensions/etc. It'd be nice if people tried to add
(useful) C extensions to gcc, instead of proposing some random language :)
Giacomo A. Catenazzi
2007-12-05 10:59:13 UTC
Permalink
That is what I like about C++: with good placement of high-level features
like consts and & (references) one can gain fine control over what
gets copied or not.

But... if there's some way Linux can get "language improvements", it is with
new C standards/gcc extensions/etc. It'd be nice if people tried to add
(useful) C extensions to gcc, instead of proposing some random language :)

But nobody knows such extensions.
I think that the core kernel will remain in C, because
there are no problems and no improvement possible
(with another language).

But the driver side has more problems. There is a lot
of copy-paste, quality is often not high, not all developers
know the Linux kernel well, and drivers are not well maintained
against new or better internal APIs. So if we found a good template
or a good language to help *some* drivers without
causing a lot of problems for the rest of the community, it would
be nice.

I don't think it is written in stone that kernel
drivers should be written only in C, but currently there is
no good alternative.

But I think it is a huge task to find a language, prototype
an API and convert some test drivers.
And there is no guarantee of a good result.

ciao
cate
Avi Kivity
2007-12-04 21:07:05 UTC
Permalink
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
With 10Gbit/s ethernet working you start to care about every cycle.
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to lookup a
forwarding table and perform a few MMIO on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.
I really doubt Linux spends 400 cycles routing a packet. Look at what an
skbuff looks like.

A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
16,000 cycles. Sure it involves userspace, but you're about two orders
of magnitude off. And the localhost interface is nicely cached in L1
without mmio at all, unlike real devices.
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
Another simple noticeable case is Unix
sockets and your X server communication.
Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.
It just depends how many times a second it happens. For instance, consider
i = 0;
for (j = 0; j < (1 << 28); j++) {
k = (j >> 8) & 1;
i += fct[k]();
}
It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
i = 0;
for (j = 0; j < (1 << 28); j++) {
k = (j >> 0) & 1;
i += fct[k]();
}
Then it takes 4.3 seconds, which is about 3 times slower. The number
of calls per function remains the same (128M calls each), it's just the
branch prediction which is wrong every time. The very few nanoseconds added
at each call are enough to slow down a program from 1.6 to 4.3 seconds while
it executes the exact same code (it may even save one shift). If you have
such stupid code, say, to compute the color or alpha of each pixel in an
image, you will certainly notice the difference.
This happens very often in HPC, and when it does, it is often worthwhile
to invest in manual optimizations or even assembly coding.
Unfortunately it is very rare in the kernel (memcmp, raid xor, what
else?). Loops with high iteration counts are very rare, so any
attention you give to the loop body is not amortized over a large number
of executions.
Post by Willy Tarreau
And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.
Using an indirect call where a direct call is sufficient will also
reduce the compiler's optimization opportunities. However, I don't see
anyone recommending it in the context of systems programming.

It is not true that the number of indirect calls necessarily increases
if you use a language other than C.

(Actually, with templates you can reduce the number of indirect calls)
Post by Willy Tarreau
Post by Avi Kivity
Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
I don't understand. Can you give an example?

There are two cases where abstraction hurts performance: the first is
where the mechanisms used to achieve the abstraction (functions instead
of direct access to variables, function pointers instead of duplicating
the caller) introduce performance overhead. I don't think C has any
advantage here -- actually a disadvantage as it lacks templates and is
forced to use function pointers for nontrivial cases. Usually the
abstraction penalty is nil with modern compilers.

The second case is where too much abstraction clouds the programmer's
mind. But this is independent of the programming language.
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)
And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.
Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed.
A 100 byte program will print "hello world" on a UART and stop. A
modern program will load a vector description of a font, scale it to the
desired size, render it using anti aliasing and sub-pixel positioning,
lay it out according to the language rules of wherever you live, and
place it on a multi-megabyte frame buffer. Yes it needs hundreds of
megabytes and lots of nasty algorithms to do that.
Post by Willy Tarreau
The only true solution is to create better
algorithms, but you will find even less people capable of creating efficient
algorithms than you will find capable of coding correctly.
That is true, that is why we see a lot more microoptimizations than
algorithmic progress.

But if you want a fast streaming filesystem you choose XFS over ext3,
even though the latter is much smaller and easier to optimize. If you
write a network server you choose epoll() instead of trying to optimize
select() somehow. True algorithmic improvements are rare but they are
the ones that are actually measurable.
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.
That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.
If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.
It depends which type of I/O. If the I/O is non-blocking, you end up doing
something else instead of actively burning cycles.
Unless you are I/O bound, which is usually the case when you have 2GHz
cpus driving 200Hz disks.
Post by Willy Tarreau
Post by Avi Kivity
Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
I do not agree. It's not uncommon to find 2- or 3-fold performance factors
between equivalent components when one is carefully optimized and the other
one is not. Granted it takes an awful lot of time doing all those nano-opts
at the beginning, but the more you learn about how the hardware reacts to
your code, the more efficiently you write future code, with the least bloat.
End users notice bloat a lot (especially when CPU and RAM are excessively
wasted).
Can you give an example of a 2- or 3- fold factor on an end-user
workload achieved by microopts?

I agree about bloat.

Willy Tarreau
2007-12-04 22:43:07 UTC
Permalink
Hi Avi,
Post by Avi Kivity
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
With 10Gbit/s ethernet working you start to care about every cycle.
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to lookup a
forwarding table and perform a few MMIO on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.
I really doubt Linux spends 400 cycles routing a packet. Look what an
skbuff looks like.
That's not what I wrote. I just wrote about doing forwarding table lookup
and MMIO so that dedicated hardware NICs can process the recv/send to the
correct ends. If you just need to scan a list of DMAed packets, look at
their destination IP address, lookup that IP in a table to find the output
NIC and destination MAC address, link them into an output list and wake
the output NIC up, there's nothing which requires more than 400 cycles
here. I never said that it was a requirement to pass through the existing
network stack.
Post by Avi Kivity
A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
16,000 cycles. Sure it involves userspace, but you're about two orders
of magnitude off.
I don't see where you see a userspace (or I don't understand your test).
On traffic generation I often do from user space, I can send 630 k raw
ethernet packets per second from userspace on a 1.8 GHz opteron and PCI-e
NICs. That's 2857 cycles per packet, including the (small amount of)
userspace work. That's quite cheap.
Post by Avi Kivity
And the localhost interface is nicely cached in L1 without mmio at all,
unlike real devices.
(...)
Post by Avi Kivity
This happens very often in HPC, and when it does, it is often worthwhile
to invest in manual optimizations or even assembly coding.
Unfortunately it is very rare in the kernel (memcmp, raid xor, what
else?). Loops with high iteration counts are very rare, so any
attention you give to the loop body is not amortized over a large number
of executions.
Well, in my example above, everything in the path of the send() syscall down
to the bare metal NIC is under high pressure in a fast loop. 30 cycles
already represent 1% of the performance! In fact, to modulate speed, I
use a busy loop with a volatile int and small values.
Post by Avi Kivity
Post by Willy Tarreau
And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.
Using an indirect call where a direct call is sufficient will also
reduce the compiler's optimization opportunities.
That's true.
Post by Avi Kivity
However, I don't see
anyone recommending it in the context of systems programming.
It is not true that the number of indirect calls necessarily increases
if you use a language other than C.
(Actually, with templates you can reduce the number of indirect calls)
Post by Willy Tarreau
Post by Avi Kivity
Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
I don't understand. Can you give an example?
Yes, the most common examples found today involve applications reading
data from databases. For instance, let's say that one function in your
program must count the number of unique people with the name starting
with an "A". It is very common to see "low-level" primitives to abstract
the database for portability purposes. One such primitive will
generally consist in retrieving a list of people with their names,
age and sex in one well-formatted 3-column array. Many lazy people will
not see any problem in calling this one from the function described
above. Basically, what they would do is :

count_people_with_name_starting_with_a()
  -> array[name,age,sex] = get_list_of_people()
  -> while read_one_people_entry() {
         alloc(one_line_of_3_columns)
         read then parse the 3 fields
         format_them_appropriately
     }
  -> create a new array "name2" by duplicating the "name" column
  -> name3 = sort_unique(name2)
  -> name4 = name3.grep("^A")
  -> return name4.count

Don't laugh, I've recently read such a horrible thing. It was done
that way just because it was easier. Without the abstraction layer,
the coder would have been forced to access the base anyway and would
have seen an added value into just counting from the inner while
loop, saving lots of copies, greps, sort, etc... :

count_people_with_name_starting_with_a() {
    count = 0;
    while read_one_people_entry() {
        read the 3 fields into a statically-allocated buffer
        if (name[0] == 'A') count++;
    }
    return count;
}

I'm not saying that the above was not possible, just that it's
1000% easier to do the former without even having to think that
the final code uses such horrible things. And yes, I can confirm
that when you see this, you want to shoot the author !
Post by Avi Kivity
There are two cases where abstraction hurts performance: the first is
where the mechanisms used to achieve the abstraction (functions instead
of direct access to variables, function pointers instead of duplicating
the caller) introduce performance overhead. I don't think C has any
advantage here -- actually a disadvantage as it lacks templates and is
forced to use function pointers for nontrivial cases. Usually the
abstraction penalty is nil with modern compilers.
The second case is where too much abstraction clouds the programmer's
mind. But this is independent of the programming language.
Agreed. But most often, the abstraction prevents the user from accessing
some information directly and that becomes nasty. I remember when I was
a teen, I wrote a program designed to inventory what you had in your PC,
and run a few performance tests. It ran in one of those semi-graphical DOS modes
where you use graphics characters to draw boxes. I initially wrote all
the windowing code myself and it ran perfectly. I once decided to rewrite
it using TurboVision, the windowing framework from Borland (it was written
in TurboPascal). I made intensive use of the equivalent of a putchar()
function to write text in a window. You cannot imagine my pain when I
ran it on my old 8088, it wrote at the speed of a 1200 bps terminal. I
then tried to find how to write faster, even by accessing the window
buffer directly. I couldn't. I had to reverse-engineer the internal
structures by debugging memory contents in order to find the pointers
to the window buffer to write to them directly. After this disastrous
experience with abstraction, I thought "never that crap again".
Post by Avi Kivity
Post by Willy Tarreau
Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed.
A 100 byte program will print "hello world" on a UART and stop. A
modern program will load a vector description of a font, scale it to the
desired size, render it using anti aliasing and sub-pixel positioning,
lay it out according to the language rules of wherever you live, and
place it on a multi-megabyte frame buffer. Yes it needs hundreds of
megabytes and lots of nasty algorithms to do that.
What I'm complaining about is that when you don't want those fancy things,
you still have them to justify the hundreds of megs. And even if you manage
to print to stdout, you still have a huge runtime just in case you'd like
to use the fancy features.
Post by Avi Kivity
Post by Willy Tarreau
The only true solution is to create better
algorithms, but you will find even less people capable of creating efficient
algorithms than you will find capable of coding correctly.
That is true, that is why we see a lot more microoptimizations than
algorithmic progress.
Also, algorithmic research is very little rewarding. You can work for
months or years thinking you found the nice algo for the job, then
finally discover a limitation you did not expect and throw that amount
of work to the bin in a few minutes.
Post by Avi Kivity
But if you want a fast streaming filesystem you choose XFS over ext3,
even though the latter is much smaller and easier to optimize. If you
write a network server you choose epoll() instead of trying to optimize
select() somehow.
That's interesting that you cite epoll() vs select(). I measured the
break-even point around 1000 FDs. Below, select() is faster. Above,
epoll() is faster. On small number of entries (less than 100), a select
based proxy can be 20-30% faster than the same one running on epoll()
because select() while dumber is cheaper to set up.
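To make that setup cost concrete, a minimal sketch of the two styles (watching
stdin instead of real proxy sockets, purely for illustration) could look like
this:

/* select() rebuilds a tiny fd_set on every iteration with no registration
 * syscalls, while epoll pays epoll_create()/epoll_ctl() up front and only
 * wins once the fd count grows. */
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/epoll.h>
#include <unistd.h>

static void wait_with_select(int fd)
{
    fd_set rfds;

    FD_ZERO(&rfds);          /* redone on every loop iteration */
    FD_SET(fd, &rfds);
    select(fd + 1, &rfds, NULL, NULL, NULL);
}

static int make_epoll(int fd)
{
    int ep = epoll_create(1); /* one-time setup cost */
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    return ep;
}

int main(void)
{
    int ep = make_epoll(STDIN_FILENO);
    struct epoll_event ev;

    /* Each call blocks until stdin becomes readable. */
    wait_with_select(STDIN_FILENO);  /* cheap per-call setup */
    epoll_wait(ep, &ev, 1, -1);      /* cheap per call, dearer setup */
    close(ep);
    return 0;
}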
Post by Avi Kivity
True algorithmic improvements are rare but they are the ones that are
actually measurable.
I generally agree with this.
Post by Avi Kivity
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.
That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.
If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.
It depends which type of I/O. If the I/O is non-blocking, you end up doing
something else instead of actively burning cycles.
Unless you are I/O bound, which is usually the case when you have 2GHz
cpus driving 200Hz disks.
That's true when you seek a lot. When you manage to mostly perform sequential
reads (such as what you do when processing large files such as logs), you can
easily achieve 80 MB/s, which is 20000 pages/s, or 100 times faster.
Post by Avi Kivity
Post by Willy Tarreau
Post by Avi Kivity
Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
I do not agree. It's not uncommon to find 2- or 3-fold performance factors
between equivalent components when one is carefully optimized and the other
one is not. Granted it takes an awful lot of time doing all those nano-opts
at the beginning, but the more you learn about how the hardware reacts to
your code, the more efficiently you write future code, with the least bloat.
End users notice bloat a lot (especially when CPU and RAM are excessively
wasted).
Can you give an example of a 2- or 3- fold factor on an end-user
workload achieved by microopts?
Oh there are many primitives which are generally optimized in assembly for
this reason. What randomly comes to my mind :
- graphics libraries. Saving 1 cycle per pixel in a rectangle drawing
primitive can have an important impact in animated graphics for
instance.

- video/audio and generally multimedia code. I remember a specially
written version of mpg123 about 10 years ago, which was optimized
for i486 and which was the only one able to run on a 486 without
skipping.

- crypto code. It's common to find CPU-specific DES or AES functions.
Take a look at John The Ripper. I don't know if it still exists,
but there was an Alpha-optimized DES function which was something
like 5 times faster than the generic C one. It changes a lot of
things when you have 1 day to check your users passwords.

I also wrote a netfilter log analyzer which parses 300000 lines per
second on my 1.7 GHz notebook. That's 5600 cycles to read a full
line, lookup the field names, extract the values, parse them (atoi,
aton) save them in a structure, apply a filter, insert the result
in a tree containing up to 12 million of them, and dump a report
of the counts by any criteria. That saved me a lot of time working
on log analysis. But to achieve such a speed, I had to optimize at
every level, including rewriting a faster atoi() equivalent, a
faster aton() equivalent (with no multiplies), and playing with
likely/unlikely a lot. The code slowly improved from about 75k
lines/s to 300k lines/s with no algorithmic change. Just by the
way of careful code placement and ordering.
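The exact code is not shown here, but the multiply-free flavour of parsing
being described usually looks something like the following sketch (not the
actual implementation): v * 10 becomes (v << 3) + (v << 1), and the dotted
quad is assembled with byte shifts.

#include <stdio.h>

static unsigned int parse_uint(const char **s)
{
    unsigned int v = 0;

    while (**s >= '0' && **s <= '9') {
        v = (v << 3) + (v << 1) + (unsigned int)(**s - '0');
        (*s)++;
    }
    return v;
}

/* Parse "a.b.c.d" into a host-order 32-bit address, again without
 * multiplies: each octet is shifted into place. */
static unsigned int parse_ipv4(const char *s)
{
    unsigned int addr = 0;
    int i;

    for (i = 0; i < 4; i++) {
        addr = (addr << 8) | (parse_uint(&s) & 0xff);
        if (*s == '.')
            s++;
    }
    return addr;
}

int main(void)
{
    printf("%08x\n", parse_ipv4("192.168.0.1")); /* prints c0a80001 */
    return 0;
}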

In fact, you could say that micro-optimizations are not important
if you are doing them in a crappy environment where the fast path
is already wasted by a big dirty function. But when you have the
ability to master all the environment, every single cycle counts
because there's almost no waste.

I find it essential not to be the first one bringing crap somewhere
and serving as an excuse for others not to care about their code.
If everyone cares, you can still produce very good software, and
that's what I care about.

Cheers,
Willy

Avi Kivity
2007-12-05 17:05:01 UTC
Permalink
Post by Willy Tarreau
Hi Avi,
Post by Avi Kivity
Post by Willy Tarreau
Post by Avi Kivity
Post by Andi Kleen
With 10Gbit/s ethernet working you start to care about every cycle.
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to lookup a
forwarding table and perform a few MMIO on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.
I really doubt Linux spends 400 cycles routing a packet. Look what an
skbuff looks like.
That's not what I wrote. I just wrote about doing forwarding table lookup
and MMIO so that dedicated hardware NICs can process the recv/send to the
correct ends. If you just need to scan a list of DMAed packets, look at
their destination IP address, lookup that IP in a table to find the output
NIC and destination MAC address, link them into an output list and wake
the output NIC up, there's nothing which requires more than 400 cycles
here. I never said that it was a requirement to pass through the existing
network stack.
If you're writing a single-purpose program then there is justification
to micro-optimize it to the death. Write it in VHDL, even. But that
description doesn't fit the kernel.
Post by Willy Tarreau
Post by Avi Kivity
A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
16,000 cycles. Sure it involves userspace, but you're about two orders
of magnitude off.
I don't see where you see a userspace (or I don't understand your test).
ping -f -q localhost; the ping client is in userspace.
Post by Willy Tarreau
On traffic generation I often do from user space, I can send 630 k raw
ethernet packets per second from userspace on a 1.8 GHz opteron and PCI-e
NICs. That's 2857 cycles per packet, including the (small amount of)
userspace work. That's quite cheap.
Yes, it is.
Post by Willy Tarreau
Post by Avi Kivity
This happens very often in HPC, and when it does, it is often worthwhile
to invest in manual optimizations or even assembly coding.
Unfortunately it is very rare in the kernel (memcmp, raid xor, what
else?). Loops with high iteration counts are very rare, so any
attention you give to the loop body is not amortized over a large number
of executions.
Well, in my example above, everything in the path of the send() syscall down
to the bare metal NIC is under high pressure in a fast loop. 30 cycles
already represent 1% of the performance! In fact, to modulate speed, I
use a busy loop with a volatile int and small values.
Having an interface to send multiple packets in one syscall would cut
way more than 30 cycles.
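Linux did eventually grow such an interface (sendmmsg(2), added well after
this thread); a rough sketch of how batching looks with it, just to
illustrate the idea:

/* One kernel entry for a whole batch of UDP datagrams instead of one
 * syscall per packet. sendmmsg(2) requires a later kernel/glibc than
 * existed when this was written. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define BATCH 32

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    char payload[BATCH][64];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    int i;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9); /* discard port, illustration only */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < BATCH; i++) {
        memset(payload[i], 'x', sizeof(payload[i]));
        iov[i].iov_base = payload[i];
        iov[i].iov_len = sizeof(payload[i]);
        msgs[i].msg_hdr.msg_name = &dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    sendmmsg(fd, msgs, BATCH, 0); /* the whole batch in one syscall */
    close(fd);
    return 0;
}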
Post by Willy Tarreau
Post by Avi Kivity
Post by Willy Tarreau
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
I don't understand. Can you give an example?
Yes, the most common examples found today involve applications reading
data from databases. For instance, let's say that one function in your
program must count the number of unique people with the name starting
with an "A". It is very common to see "low-level" primitives to abstract
the database for portability purposes. One such primitive will
generally consist in retrieving a list of people with their names,
age and sex in one well-formatted 3-column array. Many lazy people will
not see any problem in calling this one from the function described
count_people_with_name_starting_with_a()
  -> array[name,age,sex] = get_list_of_people()
  -> while read_one_people_entry() {
         alloc(one_line_of_3_columns)
         read then parse the 3 fields
         format_them_appropriately
     }
  -> create a new array "name2" by duplicating the "name" column
  -> name3 = sort_unique(name2)
  -> name4 = name3.grep("^A")
  -> return name4.count
Don't laugh, I've recently read such a horrible thing. It was done
that way just because it was easier. Without the abstraction layer,
the coder would have been forced to access the base anyway and would
have seen an added value into just counting from the inner while
count_people_with_name_starting_with_a() {
    count = 0;
    while read_one_people_entry() {
        read the 3 fields into a statically-allocated buffer
        if (name[0] == 'A') count++;
    }
    return count;
}
I'm not saying that the above was not possible, just that it's
1000% easier to do the former without even having to think that
the final code uses such horrible things.
Your optimized version is wrong. It counts duplicated names, while you
stated you needed unique names. Otherwise the sort_unique step is
completely redundant.

Databases are good examples of where the abstraction helps. If you had
hundreds of millions of records in your example, you'd connect to a
database, present it with an ASCII string describing what you want, upon
which it would parse it, compile it into an internal language against
the schema, optimize that and then execute it. Despite all that
abstraction it would win against your example because it would implement
the inner loop as

open index (by name)
seek to 'A'
while (current starts with 'A')
    ++count (taking care of the uniqueness requirement if needed)
close index

Thus it would never see people whose name begins with 'W'. If the
database had a materialized view feature, and this particular query was
deemed important enough, it would optimize it to

open materialized view
read count
close materialized view

The database does all this while allowing concurrent reads and writes
and keeping your data in case someone trips on the power cord. You
can't do that without a zillion layers of abstraction.
Post by Willy Tarreau
Post by Avi Kivity
There are two cases where abstraction hurts performance: the first is
where the mechanisms used to achieve the abstraction (functions instead
of direct access to variables, function pointers instead of duplicating
the caller) introduce performance overhead. I don't think C has any
advantage here -- actually a disadvantage as it lacks templates and is
forced to use function pointers for nontrivial cases. Usually the
abstraction penalty is nil with modern compilers.
The second case is where too much abstraction clouds the programmer's
mind. But this is independent of the programming language.
Agreed. But most often, the abstraction prevents the user from accessing
some information directly and that becomes nasty. I remember when I was
a teen, I wrote a program designed to inventory what you had in your PC,
and run a few performance tests. It ran in one of those semi-graphical DOS modes
where you use graphics characters to draw boxes. I initially wrote all
the windowing code myself and it ran perfectly. I once decided to rewrite
it using TurboVision, the windowing framework from Borland (it was written
in TurboPascal). I made intensive use of the equivalent of a putchar()
function to write text in a window. You cannot imagine my pain when I
ran it on my old 8088, it wrote at the speed of a 1200 bps terminal. I
then tried to find how to write faster, even by accessing the window
buffer directly. I couldn't. I had to reverse-engineer the internal
structures by debugging memory contents in order to find the pointers
to the window buffer to write to them directly. After this disastrous
experience with abstraction, I thought "never that crap again".
If the abstraction is badly written, and further you cannot change it,
then of course it hurts. But if the abstraction is well written, or if
it can be fixed, then all is well. The problem here is not that
abstractions exist, but that you persist in using a broken API instead
of fixing it.
Post by Willy Tarreau
Post by Avi Kivity
Post by Willy Tarreau
Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed.
A 100 byte program will print "hello world" on a UART and stop. A
modern program will load a vector description of a font, scale it to the
desired size, render it using anti aliasing and sub-pixel positioning,
lay it out according to the language rules of wherever you live, and
place it on a multi-megabyte frame buffer. Yes it needs hundreds of
megabytes and lots of nasty algorithms to do that.
What I'm complaining about is that when you don't want those fancy things,
you still have them to justify the hundreds of megs. And even if you manage
to print to stdout, you still have a huge runtime just in case you'd like
to use the fancy features.
That's life. The fact is that users demand features, and programmers
cater to them. If you can find a way to provide all those features
without the bloat, more power to you. The abstractions here are not the
cause of the bloat, they are the tool used to provide the features while
keeping a reasonable level of maintainability and reliability.
Post by Willy Tarreau
Post by Avi Kivity
Post by Willy Tarreau
The only true solution is to create better
algorithms, but you will find even less people capable of creating efficient
algorithms than you will find capable of coding correctly.
That is true, that is why we see a lot more microoptimizations than
algorithmic progress.
Also, algorithmic research is very little rewarding. You can work for
months or years thinking you found the nice algo for the job, then
finally discover a limitation you did not expect and throw that amount
of work to the bin in a few minutes.
You don't need to prove that P == NP to improve things. Most
improvements are in adding new APIs and data structures to keep the
inner loops working on more data. And of course scalability work to
keep data local to a processing core.

It won't win you a Nobel prize, but you'll be able to measure a few
percent improvement on a real-life workload instead of 10 cycles on a
microbenchmark.
Post by Willy Tarreau
Post by Avi Kivity
But if you want a fast streaming filesystem you choose XFS over ext3,
even though the latter is much smaller and easier to optimize. If you
write a network server you choose epoll() instead of trying to optimize
select() somehow.
That's interesting that you cite epoll() vs select(). I measured the
break-even point around 1000 FDs. Below, select() is faster. Above,
epoll() is faster. On small number of entries (less than 100), a select
based proxy can be 20-30% faster than the same one running on epoll()
because select() while dumber is cheaper to set up.
[IIRC epoll() setup is done outside the loop, just once]

The small proxy probably doesn't have a performance problem, while 10K
connection servers do.
Post by Willy Tarreau
Post by Avi Kivity
Unless you are I/O bound, which is usually the case when you have 2GHz
cpus driving 200Hz disks.
That's true when you seek a lot. When you manage to mostly perform sequential
reads (such as what you do when processing large files such as logs), you can
easily achieve 80 MB/s, which is 20000 pages/s, or 100 times faster.
Right, and this was achieved by having very good batching in the bio layer.
Post by Willy Tarreau
Post by Avi Kivity
Can you give an example of a 2- or 3- fold factor on an end-user
workload achieved by microopts?
Oh there are many primitives which are generally optimized in assembly for
- graphics libraries. Saving 1 cycle per pixel in a rectangle drawing
primitive can have an important impact in animated graphics for
instance.
- video/audio and generally multimedia code. I remember a specially
written version of mpg123 about 10 years ago, which was optimized
for i486 and which was the only one able to run on a 486 without
skipping.
- crypto code. It's common to find CPU-specific DES or AES functions.
Take a look at John The Ripper. I don't know if it still exists,
but there was an Alpha-optimized DES function which was something
like 5 times faster than the generic C one. It changes a lot of
things when you have 1 day to check your users passwords.
These are indeed cases where the inner loop is executed millions of times
per second. Of course it is perfectly reasonable to assembly code these.

I'm talking about regular C code. Most C code is decision taking and
pointer chasing, which is why traditional microopts don't help much.
Post by Willy Tarreau
I also wrote a netfilter log analyzer which parses 300000 lines per
second on my 1.7 GHz notebook. That's 5600 cycles to read a full
line, lookup the field names, extract the values, parse them (atoi,
aton) save them in a structure, apply a filter, insert the result
in a tree containing up to 12 million of them, and dump a report
of the counts by any criteria. That saved me a lot of time working
on log analysis. But to achieve such a speed, I had to optimize at
every level, including rewriting a faster atoi() equivalent, a
faster aton() equivalent (with no multiplies), and playing with
likely/unlikely a lot. The code slowly improved from about 75k
lines/s to 300k lines/s with no algorithmic change. Just by the
way of careful code placement and ordering.
Curious: wasn't the time dominated by the tree code? 12M nodes is 24
levels, and probably unpredictable to the processor unless the data is
very regular.
Post by Willy Tarreau
In fact, you could say that micro-optimizations are not important
if you are doing them in a crappy environment where the fast path
is already wasted by a big dirty function. But when you have the
ability to master all the environment, every single cycle counts
because there's almost no waste.
That only works if the environment is very small. A large scale project
needs abstractions, otherwise you spend all your time re-learning all
the details.
Post by Willy Tarreau
I find it essential not to be the first one bringing crap somewhere
and serving as an excuse for others not to care about their code.
If everyone cares, you can still produce very good software, and
that's what I care about.
We just disagree about the methods.
--
error compiling committee.c: too many arguments to function

Gilboa Davara
2007-12-03 12:35:31 UTC
Permalink
Post by Avi Kivity
Post by Andi Kleen
Post by Avi Kivity
[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Networking, block IO, page fault, ... But only the fast paths in these
cases. A lot of the kernel is slow path code and could probably
be written even in an interpreted language without much trouble.
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
million packets per second per core (does such a setup actually exist?)
You have a few thousand cycles per packet. For block you'd need around
5,000 disks per core to reach such rate
Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
second. (theoretical peak at 1514bytes/frame)
Granted, installing such a device on a single CPU/single core machine is
absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
Barcelona) it can still generate ~1M packets/s per core.

Now, assuming you're doing low-level (passive) filtering of some sort
(frame/packet routing, traffic interception and/or packet analysis),
hardware assistance (TSO, complete TCP offloading, etc.) is off the
table and each and every cycle within netif_receive_skb (and friends)
-counts-.

I don't suggest that the kernel should be (re)designed for such (niche)
applications but on other hand, if it works...

- Gilboa

Gilboa Davara
2007-12-03 12:44:06 UTC
Permalink
Post by Gilboa Davara
Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
second. (theoretical peak at 1514bytes/frame)
Granted, installing such a device on a single CPU/single core machine is
absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
Barcelona) it can still generate ~1M packets/s per core.
Sigh... Sorry. Please ignore the broken math on my part.
Make that 1.8M frames/second per card and ~100K packets/second per core.

- Gilboa


Casey Schaufler
2007-12-03 16:28:50 UTC
Permalink
Post by Gilboa Davara
Post by Avi Kivity
Post by Andi Kleen
Post by Avi Kivity
[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Networking, block IO, page fault, ... But only the fast paths in these
cases. A lot of the kernel is slow path code and could probably
be written even in an interpreted language without much trouble.
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
million packets per second per core (does such a setup actually exist?)
You have a few thousand cycles per packet. For block you'd need around
5,000 disks per core to reach such rate
Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
second. (theoretical peak at 1514bytes/frame)
Granted, installing such a device on a single CPU/single core machine is
absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
Barcelona) it can still generate ~1M packets/s per core.
Now, assuming you're doing low-level (passive) filtering of some sort
(frame/packet routing, traffic interception and/or packet analysis),
hardware assistance (TSO, complete TCP offloading, etc.) is off the
table and each and every cycle within netif_receive_skb (and friends)
-counts-.
I don't suggest that the kernel should be (re)designed for such (niche)
applications but on other hand, if it works...
I was involved in a 10GbE project like you're describing not too
long ago. Only the driver, and only a tight, lean, special-purpose
driver at that, was able to deal with line-rate volumes. This was
in a real appliance, where faster CPUs were not an option. In fact,
no hardware changes were possible due to the issues with squeezing
in the 10GbE NICs. This project would have been impossible without
the speed and deterministic behavior of the kernel C environment.


Casey Schaufler
***@schaufler-ca.com
Gilboa Davara
2007-12-05 10:31:34 UTC
Permalink
Post by Gilboa Davara
Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
second. (theoretical peak at 1514bytes/frame)
Granted, installing such a device on a single CPU/single core machine is
absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
Barcelona) it can still generate ~1M packets/s per core.
10GbE can't do 14M packets per second if the packets are 1514 bytes. At
10M packets per second you have less than 1000 bits per packet, which is
far from 1514bytes.
10Gbps gives you at most 1.25GBps, which at 1514 bytes per packet works
out to 825627 packets per second. You could reach ~14M packets per
second with only the smallest packet size, which is rather unusual for
high throughput traffic, since you waste almost all the bytes on
overhead in that case. But you do want to be able to handle at least a
million or two packets per second to do 10GbE.
... I corrected my math in the second email. [1]

Nevertheless, a VoIP network (e.g. G.729 and friends) can generate the
maximum number of frames allowed on 10GbE Ethernet which is, AFAIR, just
below 15M -per- port. (~29M on a dual port card)

While I doubt that any non-NPU based NIC can handle such a load, on
mixed networks we're already seeing well-above 1M frames per port.
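For reference, the per-port limits quoted above fall out of simple
arithmetic (a quick sketch; the 20 bytes of preamble, SFD and inter-frame
gap per frame are standard Ethernet overhead, and the 1514-byte figure
excludes the 4-byte FCS):

#include <stdio.h>

int main(void)
{
    const double line = 10e9;            /* bits per second */
    const double overhead = 7 + 1 + 12;  /* preamble + SFD + IFG, bytes */
    const double min_frame = 64 + overhead;        /* 84 bytes on the wire */
    const double max_frame = 1514 + 4 + overhead;  /* 1538 bytes on the wire */

    printf("min-size frames:  %.2f Mpps\n", line / (min_frame * 8) / 1e6); /* ~14.88 */
    printf("1514-byte frames: %.0f pps\n", line / (max_frame * 8));        /* ~812743 */
    return 0;
}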

- Gilboa
[1] http://lkml.org/lkml/2007/12/3/69


Avi Kivity
2007-12-01 19:59:31 UTC
Permalink
Post by Lennart Sorensen
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
Doesn't Objective-C essentially require a runtime to provide a lot of
the features of the language? If it does (as I suspect) then it is
totally unsuitable for kernel development.
C also requires a (very minimal) runtime. And I don't see how having a
runtime disqualifies a language from being usable in a kernel; the
runtime is just one more library, either supplied by the compiler or by
the kernel.
Post by Lennart Sorensen
Besides, the kernel does a wonderful job doing object-oriented design
where appropriate using C, without any of the stupidities added by the
common OO languages.
Object orientation in C leaves much to be desired; see the huge number
of void pointers and container_of()s in the kernel.
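The pattern being referred to is roughly the following (a standalone
illustration of embedding plus container_of(), not actual kernel code):

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct device {
    const char *name;
    void (*shutdown)(struct device *dev); /* "virtual method" */
};

struct net_card {
    int irq;
    struct device dev; /* embedded base object */
};

static void net_card_shutdown(struct device *dev)
{
    /* Recover the containing object from the embedded member. */
    struct net_card *card = container_of(dev, struct net_card, dev);

    printf("shutting down %s (irq %d)\n", dev->name, card->irq);
}

int main(void)
{
    struct net_card card = {
        .irq = 11,
        .dev = { .name = "eth0", .shutdown = net_card_shutdown },
    };

    card.dev.shutdown(&card.dev); /* call through the base object */
    return 0;
}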
Jörn Engel
2007-12-02 19:44:30 UTC
Permalink
Post by Avi Kivity
Object orientation in C leaves much to be desired; see the huge number
of void pointers and container_of()s in the kernel.
While true, this isn't such a bad problem. A language really sucks when
it tries to disallow something useful. Back in university I was forced
to write system software in Pascal. Simple pointer arithmetic became a
5-line piece of code.

Imo the main advantage of C is simply that it doesn't get in the way.

Jörn

--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens
Lennart Sorensen
2007-12-03 16:53:47 UTC
Permalink
Post by Avi Kivity
C also requires a (very minimal) runtime. And I don't see how having a
runtime disqualifies a language from being usable in a kernel; the
runtime is just one more library, either supplied by the compiler or by
the kernel.
Well the majority of C syntax requires no runtime library. There are
some system call like things that you often want that need a library
(like malloc and such), but those aren't really part of C itself. Of
course without malloc and printf and file i/o calls the program would
probably be a bit boring. I have written some small C programs without
a runtime, where the few things I needed were implemented in assembly
and poked the hardware directly and called from the C program.
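A minimal example of that kind of runtime-free C program, assuming x86-64
Linux rather than whatever hardware that was, might look like:

/* No libc, no startup files: build with
 *   gcc -nostdlib -ffreestanding -o hello hello.c
 * The two syscalls are issued with inline assembly. */
static long my_syscall3(long nr, long a, long b, long c)
{
    long ret;

    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" (nr), "D" (a), "S" (b), "d" (c)
                      : "rcx", "r11", "memory");
    return ret;
}

void _start(void)
{
    static const char msg[] = "hello, runtime-free world\n";

    my_syscall3(1, 1, (long)msg, sizeof(msg) - 1); /* write(1, msg, len) */
    my_syscall3(60, 0, 0, 0);                      /* exit(0) */
    __builtin_unreachable();
}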
Post by Avi Kivity
Object orientation in C leaves much to be desired; see the huge number
of void pointers and container_of()s in the kernel.
As a programming language, C leaves much to be desired.

--
Len Sorensen
Chris Snook
2007-11-30 15:00:37 UTC
Permalink
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
No. Kernel programming requires what is essentially assembly language with a
lot of syntactic sugar, which C provides. Higher-level languages abstract away
too much detail to be suitable for the sort of bit-perfect control you need when
you're directly controlling bare metal. You can still use object-oriented
programming techniques in C, and we do this all the time in the kernel, but we
do so with more fine-grained explicit control than a language like Objective-C
would give us. More to the point, if we tried to use Objective-C, we'd find
ourselves needing to fall back to C-style explicitness so often that it wouldn't
be worth the trouble.

In other news, I hear Hurd boots again!

-- Chris
David Newall
2007-12-01 09:50:00 UTC
Permalink
Post by Chris Snook
Post by Ben Crowhurst
Has Objective-C ever been considered for kernel development?
No. Kernel programming requires what is essentially assembly language
with a lot of syntactic sugar, which C provides.
I somewhat disagree. Kernel programming requires and deserves the same
care, rigor and eye to details as all other serious systems. Whilst
performance is always a consideration, high-level languages give a
reward in ease of expression and improved reliability, such that a
notional performance cost is easily justified. Occasionally, precise
bit-diddling or tight timing requirements might necessitate use of
assembly; even so, a lot of bit-diddling can be expressed in high-level
languages.

Kernel programming might require a scintilla of assembly language, but
the very vast majority of it should be written in a high-level language.

There's an old joke that claims, "real programmers can write FORTRAN in
any language." It's true. Object orientation is a style of
programming, not a language, and while certain languages have intrinsic
support for this style, objects, methods, properties and inheritance can
probably be written in any language. It's an issue of putting in
care and an eye for detail.

Linux could be written in Objective-C, it could be written in Pascal,
but it is written in plain C, with a smattering of assembler. Does it
need to be more complicated than that?