David Jeske
2009-09-22 17:41:57 UTC
Thanks for the replies. A couple themes came up across them.
1. *MMU/TLB vs software paging* - What ideas do you have for implementing
paging in software that is better than paging in hardware? If code or
data-pages are removed, there are no MMU faults to catch the access, so
previously compiled code would simply fail when it touched invalid
pointers. The process which has pages
swapped out would need to run in a special mode which verified each code or
data pointer (to see if that data was swapped out) before it was accessed.
This could be a VM interpreter, or it could be an entirely different set of
compiled code -- in either case it creates more memory pressure because of
different versions of code that need to exist. Is this more or less
expensive than an MMU/TLB? That's hard to say, I'd guess it to be more
expensive. If processes (or smaller granularity 'units') are simply frozen
and not allowed to run while swapped it would eliminate the need for
dual-codepaths, but it would also reduce the amount of reclamation swapping
could provide if units are still handling small frequent tasks.
Software paging does have the nice property that it "makes the common
case fast": a system which isn't swapping has no overhead. Only when we
incur the huge overhead of secondary-storage swapping would we incur the
penalty of virtual-memory checks.
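To make the cost concrete, here is a toy sketch (my own illustration, not Cosmos code) of what a software pager looks like: every access runs a residency check, and only an evicted object pays the swap-in cost. The `SoftPager` class and its method names are invented for this example.

```python
# Toy sketch (not Cosmos code): every access goes through a software
# check that swaps the object back in if it was evicted.
class SoftPager:
    def __init__(self):
        self.resident = {}   # handle -> live object
        self.swapped = {}    # handle -> object (stand-in for disk storage)
        self.faults = 0

    def put(self, handle, obj):
        self.resident[handle] = obj

    def evict(self, handle):
        # "Swap out": move the object to secondary storage.
        self.swapped[handle] = self.resident.pop(handle)

    def access(self, handle):
        # The software equivalent of an MMU check, paid on EVERY access.
        if handle not in self.resident:
            self.faults += 1
            self.resident[handle] = self.swapped.pop(handle)  # swap in
        return self.resident[handle]

pager = SoftPager()
pager.put("buf", [1, 2, 3])
pager.evict("buf")
assert pager.access("buf") == [1, 2, 3]   # triggers one software "fault"
assert pager.faults == 1
assert pager.access("buf") == [1, 2, 3]   # resident now; check still runs
assert pager.faults == 1
```

The point of the sketch is the `if handle not in self.resident` branch: it runs on the common (resident) path too, which is exactly the overhead an MMU/TLB hides in hardware.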
2. *Application API security.* My thoughts in this space have been about
eliminating most (or all) requests for the user to approve heightened
privileges. Users simply click "yes" on most of these dialogs, so I think
a security system will be more effective if they never or seldom appear
(the opposite of Vista UAC). Further, I think the primary compromise to
prevent is malicious code uploading unauthorized user data to a
third-party server.
One idea I've had is to create a set of interlocking isolation rules. For
example, consider two assemblies: (A) can access the network, and only its
own private local storage; (B) can access all local data but not the
network or other processes. Assemblies of these two types can be
intermixed on the system without fear that unauthorized user data can be
uploaded to the Internet. By adding a secure open/save panel (like
Silverlight or a browser) to authorize an application to access a
specific piece of data, "A" becomes similar to website security, since
different websites can't access each other's backend servers. A
surprising number of applications fit into these two categories.
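A minimal sketch of the interlocking rule (the capability names and the `verify` function are hypothetical, invented for this example): an assembly may hold network access or access to foreign local data, but a verifier rejects any assembly holding both, so no single assembly can exfiltrate user data.

```python
# Hypothetical capabilities for the two-assembly scheme above.
NETWORK, FOREIGN_DATA, PRIVATE_DATA = "network", "foreign_data", "private_data"

def verify(assemblies):
    """Reject any assembly that could upload unauthorized user data:
    holding both network access and foreign-data access is forbidden."""
    for name, caps in assemblies.items():
        if NETWORK in caps and FOREIGN_DATA in caps:
            raise PermissionError(f"{name} could upload user data")
    return True

# Type A: network plus only its own private storage.
# Type B: all local data, but no network.
assert verify({
    "A": {NETWORK, PRIVATE_DATA},
    "B": {FOREIGN_DATA, PRIVATE_DATA},
})
```

The interesting property is that the check is purely static: the two types compose safely without ever asking the user anything.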
Chad mentioned "see rings in our docs". I couldn't find this reference on
the Cosmos website. What is this a reference to?
3. *Potential performance improvements.* Ben mentioned "if a loaded
webserver spends 50% of time in kernel". This sounds to me like a static
webserver. It's no surprise that a solution which removes OS
abstraction-layer cost can optimize a dumb "data in, data out" workload
like this. Many tiny systems (including the previously mentioned MIT
Exokernel) have done this. We aren't using them in the industry because
this optimization isn't important. Dynamic applications represent a much
larger percentage of system
development costs, and they spend most of their time in userland. These
applications are not dominated by syscall or context-switch performance. I
argue that human resource (programmer) costs are the dominating factor in
most software applications. Reducing these costs is more important than a
fractional improvement to performance. This is why the trend is towards
programming systems which are inefficient for the machines and efficient for
the programmer (like Ruby and Python at the extreme, or Java and the CLR
with garbage collection at the less-extreme end).
Overall, I'm skeptical of claims that this new architecture will have
similar capabilities to old architectures but with better performance. There
are tradeoffs, and while we can point at the inefficiencies of existing
systems, we can't yet point out all the tradeoffs and inefficiencies of a
full real-world system built in a Singularity-like model. Pervasive garbage
collection is a significant one of these costs. Similar to the quoted
figures about how TLB misses are becoming more expensive as CPU speeds
increase, garbage collection is becoming more expensive as CPU speeds and
memory sizes increase. In other words, it's very possible Singularity would
have a similar overhead to produce the same functionality, simply in
different places.
Perhaps the winning argument for an architecture like Singularity/Cosmos is
similar to the Aspect-Oriented-Programming argument. Currently we're
spending programmer time optimizing these boundaries, whereas if we make
the boundaries very cheap we don't have to. This will make our code
smaller,
easier to write, easier to maintain, and as a nice side effect, faster in
the common cases. The code required to handle the uncommon cases might even
still exist, but if it can only exist in centralized cross-cutting places,
then the overall codebase can be simpler.
4. *Passing data between domains.* From the Singularity paper, it seems they
conceived a way to make a code constraint enforcing that a given pointer
handoff is the only way to access a data subtree. This allows them to
safely hand the pointer to another domain without worry that it can be
modified by the original process. They still have to track lifetime for this
subtree with some kind of foreign GC handle. I don't understand how they
would implement a zero-copy buffer cache using this system, as once they
handoff the pointer, they can't retain a reference to the buffer anymore.
I like Ben's suggestion to encourage 'immutable data'. One challenge lies
in handling both heap-allocated buffers and structures which are designed
to be marked immutable (though they start out mutable). The buffer would
allow writes (or DMA) until it was marked immutable. The structures, even
while mutable, would only allow pointers to immutable data. This
constraint assures the structure can be safely made immutable at any time
and the entire subtree is immutable. This pattern is an "incremental
construction of
subtrees of immutable data", which can be handed to a cross-domain call
accepting only an immutable pointer. A buffer manager could keep around
immutable blocks and hand them out to different callers as necessary.
Cross-GC references would manage lifetimes for the data. Structures could
cheaply aggregate the total memory size of the subtree for accounting the
size of a cross-heap reference.
It seems one challenge of this system is cheaply tracking the cross-GC
domain references. Perhaps all immutable pointers would need to be boxed
into cross-GC handles.
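The freeze pattern above can be sketched in a few lines (a toy model of my proposal, not the Singularity mechanism; the `Node` class and its methods are invented here): nodes start mutable, may only point at already-frozen children, and can be frozen at any time, which makes the whole reachable subtree immutable.

```python
# Incremental construction of immutable subtrees: a node may only
# reference data that is already frozen, so freezing the node freezes
# (transitively) everything reachable from it.
class Node:
    def __init__(self):
        self.children = []
        self.frozen = False

    def add_child(self, child):
        if self.frozen:
            raise RuntimeError("cannot mutate a frozen node")
        if not child.frozen:
            raise RuntimeError("may only reference immutable data")
        self.children.append(child)

    def freeze(self):
        # Safe at any time: every reachable child is already immutable.
        self.frozen = True
        return self

leaf = Node().freeze()
root = Node()
root.add_child(leaf)      # ok: leaf is already frozen
root.freeze()             # the whole subtree is now immutable
try:
    root.add_child(Node().freeze())
except RuntimeError:
    pass                  # mutation after freeze is rejected
```

A `freeze()`d root is exactly what a cross-domain call accepting only an immutable pointer could take, and a buffer manager could hand the same frozen subtree to many callers.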
5. *Real-time garbage collection.* Ben made reference to "real time or
counting collectors for device drivers". Counting collectors introduce the
possibility of memory leaks through cycles, unless they have a sweeping
check which eliminates their real-time nature. Real-time sweeping/copying
collectors are largely an unsolved problem. Typical "success stories"
I've seen involve hard bounds on heap size to limit pause times. When
heap sizes are not bounded, either they are not real-time or hardware
support is required. Do
you have a reference to something you feel is promising?
IMO, the most promising work in real-time collectors is the triad from MS
Research I mentioned in my earlier post (Chicken, Stopless, and Clover).
These are soft real-time collectors that produce acceptable delays in
most real-world situations. However, they either require the overhead of
introducing a write barrier in all code, or require two versions of the
code, one with a write barrier and one without.
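To illustrate where that overhead lives (a toy illustration only, not any of the actual Chicken/Stopless/Clover barriers), a write barrier is a small hook the compiler inserts on every reference store, so a concurrent collector can observe mutations:

```python
# Toy write barrier: every reference store runs a hook so a concurrent
# collector could see the mutation. The hook itself is the per-store cost.
barrier_log = []

def write_ref(obj, field, value):
    barrier_log.append((id(obj), field))   # record the mutated slot
    obj[field] = value                     # the actual store

heap_obj = {}
write_ref(heap_obj, "next", {"payload": 42})
assert heap_obj["next"]["payload"] == 42
assert len(barrier_log) == 1               # one barrier hit per store
```

Multiply that hook by every reference store in the program and you get the kind of pervasive cost I compare against MMU/TLB overhead below.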
Again, this speaks to the performance overhead argument I made above. It may
be that we're happier spending ~35% of our CPU on write barriers and garbage
collection scans, rather than on MMU/TLB costs, because our code can be
simpler. However, it's hard to argue that a radical redesign like this will
have superior performance before the issues are worked out.
---------
Again, thanks for all the detailed responses. Cosmos is an exciting
project!