Discussion:
various design topics (performance, GC, security)
David Jeske
2009-09-22 17:41:57 UTC
Permalink
Thanks for the replies. A couple themes came up across them.
1. *MMU/TLB vs software paging* - What ideas do you have for implementing
paging in software that are better than paging in hardware? If code or data
pages are removed, previously compiled code would fail when it tried to
access the now-invalid pointers, because there are no MMU faults to catch
it. A process which has pages swapped out would need to run in a special
mode which verified each code or data pointer (to see if its target was
swapped out) before it was accessed. This could be a VM interpreter, or it
could be an entirely different set of compiled code -- in either case it
creates more memory pressure, because different versions of the code need
to exist. Is this more or less expensive than an MMU/TLB? That's hard to
say; I'd guess it to be more expensive. If processes (or smaller-granularity
'units') are simply frozen and not allowed to run while swapped out, that
would eliminate the need for dual codepaths, but it would also reduce the
amount of reclamation swapping could provide if units are still handling
small, frequent tasks.

Software paging does have the nice property that it "makes the common case
fast": a system which isn't swapping has no overhead, and only when we
incur the HUGE cost of secondary-storage swapping do we incur the penalty
of virtual memory checks.
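To make that concrete, here is a minimal sketch (C#, all names invented;
nothing like this exists in Cosmos) of the per-dereference residency check
such a scheme would pay. The dictionaries stand in for RAM and the backing
store:

using System;
using System.Collections.Generic;

static class SoftPager
{
    static readonly Dictionary<int, byte[]> ram = new Dictionary<int, byte[]>();
    static readonly Dictionary<int, byte[]> disk = new Dictionary<int, byte[]>();

    // Every handle dereference on the "checked" code path pays this
    // compare-and-branch; a process with nothing swapped out could run
    // a second compiled version of the same code with no check at all.
    public static byte[] Deref(int handle)
    {
        if (ram.TryGetValue(handle, out var obj))
            return obj;            // fast path: one lookup and a branch
        return PageIn(handle);     // slow path: a software "page fault"
    }

    static byte[] PageIn(int handle)
    {
        byte[] obj = disk[handle]; // reload from secondary storage
        disk.Remove(handle);
        ram[handle] = obj;
        return obj;
    }
}

Even the fast path is pure overhead whenever nothing is swapped out, which
is exactly the dual-codepath tradeoff above.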

2. *Application API security.* My thoughts in this space have been about
eliminating most (or all) requests to the user to approve heightened
security. Users simply click "yes" on most of these dialogs, so I think a
security system will be more effective if they never or seldom happen (the
opposite of Vista UAC). Further, I think the primary compromise to prevent is
malicious code uploading unauthorized user data to a third party server.

One idea I've had is to create a set of interlocking isolation rules. For
example, consider two assemblies: (A) can access the network, and only its
own private local storage; (B) can access all local data but not the network
or other processes. Assemblies of these two types can be intermixed on the
system without fear that unauthorized user data can be uploaded to the
Internet. By adding a secure open/save panel (like Silverlight or a browser)
to authorize an application to access a specific piece of data, "A" becomes
similar to website security, since different websites can't access each
other's backend servers. A surprising number of applications fit into
these two categories.
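As a sketch, the interlocking rule itself is tiny (hypothetical C# types;
this is the idea, not a real .NET permission API):

using System;

[Flags]
enum Grant
{
    None           = 0,
    Network        = 1,  // type (A): may talk to the network
    PrivateStorage = 2,  // type (A): only its own isolated store
    AllLocalData   = 4,  // type (B): may read any local user data
}

static class Policy
{
    // The invariant: no assembly may combine network access with access
    // to user data outside its own private store.
    public static bool IsAllowed(Grant g) =>
        (g & Grant.Network) == 0 || (g & Grant.AllLocalData) == 0;
}

Here Policy.IsAllowed(Grant.Network | Grant.PrivateStorage) passes, while
Grant.Network | Grant.AllLocalData is exactly the combination that would let
user data leak to a third-party server.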

Chad mentioned "see rings in our docs". I couldn't find this reference on
the cosmos website. What is this a reference to?

3. *Potential performance improvements.* Ben mentioned "if a loaded
webserver spends 50% of time in kernel". This sounds to me like a static
webserver. It's no surprise that a solution which removes OS abstraction-layer
cost can optimize a dumb "data in, data out" workload like this. Many tiny
systems (including the previously mentioned MIT Exokernel) have done this.
We aren't using them in industry because this optimization isn't
important. Dynamic applications represent a much larger percentage of system
development costs, and they spend most of their time in userland. These
applications are not dominated by syscall or context-switch performance. I
argue that human resource (programmer) costs are the dominating factor in
most software applications. Reducing these costs is more important than a
fractional improvement to performance. This is why the trend is towards
programming systems which are inefficient for the machine and efficient for
the programmer (Ruby and Python at the extreme; Java or the CLR with garbage
collection at the less extreme).

Overall, I'm skeptical of claims that this new architecture will have
similar capabilities to old architectures but with better performance. There
are tradeoffs, and while we can point at the inefficiencies of existing
systems, we can't yet point out all the tradeoffs and inefficiencies of a
full real-world system built in a Singularity-like model. Pervasive garbage
collection is a significant one of these costs. Similar to the quoted
figures about how TLB misses are becoming more expensive as CPU speeds
increase, garbage collection is becoming more expensive as CPU speeds and
memory sizes increase. In other words, it's very possible Singularity would
have a similar overhead to produce the same functionality, simply in
different places.

Perhaps the winning argument for an architecture like Singularity/Cosmos is
similar to the Aspect-Oriented-Programming argument. Currently we're
spending programmer time optimizing these boundaries, whereas if we make the
boundaries very cheap we don't have to. This will make our code smaller,
easier to write, easier to maintain, and, as a nice side effect, faster in
the common cases. The code required to handle the uncommon cases might
still exist, but if it can exist only in centralized cross-cutting places,
then the overall codebase can be simpler.

4. *Passing data between domains.* From the Singularity paper, it seems they
conceived a way to make a code constraint which enforces that a given pointer
handoff is the only way to access a data subtree. This allows them to
safely hand the pointer to another domain without worry that it can be
modified by the original process. They still have to track the lifetime of
this subtree with some kind of foreign GC handle. I don't understand how they
would implement a zero-copy buffer cache using this system, as once they
hand off the pointer, they can't retain a reference to the buffer anymore.
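The paper describes this as a static constraint; as a rough runtime analogy
(a hypothetical wrapper, not Singularity's actual mechanism), it behaves
like a handle that invalidates itself on transfer, which also shows why the
sender can't keep a cached reference:

using System;

sealed class Transferable<T> where T : class
{
    T value;

    public Transferable(T value) => this.value = value;

    public T Value =>
        value ?? throw new InvalidOperationException("no longer the owner");

    // Hand off the one and only reference; the sender keeps nothing,
    // which is exactly what makes a zero-copy buffer cache awkward.
    public Transferable<T> Send()
    {
        T v = Value;
        value = null;
        return new Transferable<T>(v);
    }
}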

I like Ben's suggestion to encourage 'immutable data'. One challenge lies
in constructing such data incrementally. Consider both heap-allocated
buffers and structures which are designed to be marked immutable (though
they start out mutable). The buffer would allow writes (or DMA) until it
was marked immutable. The structures, even while mutable, would only allow
pointers to immutable data. This constraint assures the structure can be
safely made immutable at any time, and that the entire subtree is
immutable. This pattern is an "incremental construction of subtrees of
immutable data", which can be handed to a cross-domain call accepting only
an immutable pointer. A buffer manager could keep immutable blocks around
and hand them out to different callers as necessary. Cross-GC references
would manage lifetimes for the data. Structures could cheaply aggregate the
total memory size of the subtree, for accounting the size of a cross-heap
reference.
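A minimal sketch of that pattern (hypothetical C# types): a buffer that is
writable until sealed, and a node type that only links already-sealed
children, so sealing the root proves the entire subtree immutable without a
scan:

using System;
using System.Collections.Generic;

interface ISealable { bool IsSealed { get; } }

sealed class SealableBuffer : ISealable
{
    readonly byte[] data;
    public bool IsSealed { get; private set; }

    public SealableBuffer(int size) => data = new byte[size];

    public void Write(int offset, byte value)
    {
        if (IsSealed) throw new InvalidOperationException("sealed");
        data[offset] = value;   // writes (or DMA) allowed until sealed
    }

    public void Seal() => IsSealed = true;
}

sealed class ImmutableNode : ISealable
{
    readonly List<ISealable> children = new List<ISealable>();
    public bool IsSealed { get; private set; }

    // The inductive constraint: only already-sealed data may be linked
    // in, so this node can itself be sealed at any time.
    public void Add(ISealable child)
    {
        if (IsSealed) throw new InvalidOperationException("sealed");
        if (!child.IsSealed) throw new ArgumentException("child not sealed");
        children.Add(child);
    }

    public void Seal() => IsSealed = true;
}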

It seems one challenge of this system is cheaply tracking the cross-GC-domain
references. Perhaps all immutable pointers would need to be boxed into
cross-GC handles.

5. *Real-time garbage collection.* Ben made reference to "real time or
counting collectors for device drivers". Counting collectors introduce the
possibility of memory leaks through cycles, unless they have a sweeping
check, which eliminates their real-time nature. Real-time sweeping/copying
collectors are a largely unsolved problem. The typical "success stories"
I've seen involve hard bounds on heap size to limit pause times. When heap
sizes are not bounded, either the collector is not real-time or hardware
support is required. Do
you have a reference to something you feel is promising?

IMO, the most promising work in real-time collectors is the triad from MS
Research I mentioned in my earlier post (Chicken, Stopless, and Clover).
These are soft real-time collectors that produce acceptable delays in most
real-world situations. However, they either impose the overhead of a write
barrier in all code, or require two versions of the code, one with a write
barrier.
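For illustration, the per-store tax looks roughly like this (invented
names; the barriers in those collectors are more involved than this):

sealed class Cell
{
    object field;

    public object Get() => field;

    public void Set(object value)
    {
        WriteBarrier(this);   // the extra work every reference store pays
        field = value;
    }

    // Record the mutated object so a concurrent or incremental collector
    // can rescan it; a card table is one common implementation.
    static void WriteBarrier(object owner) => CardTable.MarkDirty(owner);
}

static class CardTable
{
    public static void MarkDirty(object o) { /* mark the owner's card */ }
}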

Again, this speaks to the performance overhead argument I made above. It may
be that we're happier spending ~35% of our CPU on write barriers and garbage
collection scans, rather than on MMU/TLB costs, because our code can be
simpler. However, it's hard to argue that a radical redesign like this will
have superior performance before the issues are worked out.

---------

Again, thanks for all the detailed responses. Cosmos is an exciting
project!
Chad Z. Hower
2009-09-22 23:19:05 UTC
Permalink
If code or data-pages are removed, previously compiled code would fail
No, because code will only have handles to memory, not pointers. So the
kernel can move the memory without issue. This can be done internally by
pointer indirection as well, not allowing user code to change it, i.e. it
still thinks it's a pointer.
which has pages swapped out would need to run in a special mode which
verified each code or data pointer (to see if that data was swapped out)
Nope... you are still thinking too much like native code compilation.
Processes can also be suspended and stacks scanned, or pointers tracked.
There are many, many options.

We can also do "segmentation" and simply move pages, but that allows
more fragmentation.

I prefer the handle approach, and a combination for different types of
memory allocations. But the "cook off" will tell.
exist. Is this more or less expensive than an MMU/TLB? That's hard to
By far.
2. *Application API security.* My thoughts in this space have been about
eliminating most (or all) requests to the user to approve heightened
security. Users simply click "yes" on most of these dialogs, so I think
This is way down the road, but I'd prefer a permission-based system. An
app requests the permissions it needs during installation, and the user can
approve the policy or not.

Also focus on isolated storage, and granular network access.
Chad mentioned "see rings in our docs". I couldn't find this reference
on the cosmos website. What is this a reference to?
Look in the milestones.
3. *Potential performance improvements.* Ben mentioned "if a loaded
webserver spends 50% of time in kernel". This sounds to me like a static
webserver. It's no surprise that a solution which removes OS abstraction
Not at all. With an extensive background in networking, I can say many
webservers spend much of their time handling the network and streaming
the data, whatever the source, static or dynamic. Also, much of user
code ends up relying on kernel facilities too.
Overall, I'm skeptical of claims that this new architecture will have
similar capabilities to old architectures but with better performance.
There are tradeoffs, and while we can point at the inefficiencies of
We've done some tests, and much is already known about the problems with
current OSes. Heck, I did some basic tests using a non-standard calling
convention (which we can do because we don't have to interface with
native code libs), and we can already beat MS.NET and hard C++ code in
many cases in my POCs. And that's just calling conventions!
Similar to the quoted figures about how TLB misses are becoming more
expensive as CPU speeds increase, garbage collection is becoming more
expensive as CPU speeds and memory sizes increase. In other words, it's
very possible Singularity would have a similar overhead to produce the
same functionality, simply in different places.
A GC is a lot more flexible than a native memory manager, and allows us
to continually test, refine, and transparently replace its implementation.
4. *Passing data between domains.* From the Singularity paper, it seems
they conceived a way to make a code constraint which enforces that a given
pointer handoff is the only way to access a data-subtree. This allows
In Cosmos, from the code perspective there are no pointers. Only the GC
and compiler can ever mess with a pointer.
them to safely hand the pointer to another domain without worry that it
There are no pointers.
lifetime for this subtree with some kind of foreign GC handle. I don't
understand how they would implement a zero-copy buffer cache using this
system, as once they handoff the pointer, they can't retain a reference
to the buffer anymore.
Even if there are multiple GC threads, GC handles can be global or easily
transferred. Each domain could have handles that point at shared memory,
and there could even be local and shared handles. A GC allows so much
flexibility.
5. *Real-time garbage collection.* Ben made reference to "real time or
counting collectors for device drivers". Counting collectors introduce
the possibility of memory leaks through cycles, unless they have a
We can easily implement multiple types of memory management, and drivers
can request or be given memory that is managed differently.
David Jeske
2009-09-23 01:14:22 UTC
Permalink
You are misunderstanding me. When I'm talking about pointers, I'm talking
about Cosmos generated instructions using pointers. These pointers very much
do exist.
Post by Chad Z. Hower
If code or data-pages are removed, previously compiled code would fail
No, because code will only have handles to memory, not pointers. So the
kernel can move the memory without issue. This can be done by pointer
indirection as well internally and not allowing user code to change it,
ie it still thinks its a pointer.
I'm talking about the actual instructions running on the hardware, not
source-code. Are you saying that all pointer access instructions are
generated with in-place handle-to-pointer checks? That would be a huge
amount of overhead.

If handle-to-pointer references are not being checked every time (which I
assume they aren't), then it won't be safe to swap out a chunk of memory and
still let that generated instruction stream run. Some stand-in or
write-barrier code would need to be substituted. That could be done by
re-generating instructions, or by falling back to an interpreter, but
someone has to be checking those handles (and function handles!) to see if
they are swapped out. Managing this is not going to have zero overhead, and
I suspect it would have much more overhead than hardware virtual-memory.
Again, with the exception that as long as nothing is swapped out these
checks don't need to exist.

What kind of generated checks would you use, and when, to handle
swapping?
Chad Z. Hower
2009-09-23 01:22:42 UTC
Permalink
Post by David Jeske
You are misunderstanding me. When I'm talking about pointers, I'm
talking about Cosmos generated instructions using pointers. These
pointers very much do exist.
No, I don't misunderstand you. Cosmos doesn't need to generate pointers.
.NET doesn't generate pointers into code now, and we don't need to either.
We can use handles, and even mix them. For example, we can use indirect
pointers and easily relocate; we can allow pointers only onto the stack, as
the stack is not likely to relocate; and we can even track the stack and
relocate it, or index off something...

In short, the code could use handles or indirect pointers throughout,
completely solving the issue, but I favor a multi-solution hybrid for
many reasons.

We are still a ways off from even building POCs, but the point is that
there are many, many solutions, many of which are very feasible.
David Jeske
2009-09-23 01:32:23 UTC
Permalink
Post by Chad Z. Hower
No, I don't misunderstand you. Cosmos doesn't need to generate pointers.
.NET doesn't generate pointers into code now, and we don't need to either.
We can use handles, and even mix them. For example, we can use indirect
pointers and easily relocate; we can allow pointers only onto the stack, as
the stack is not likely to relocate; and we can even track the stack and
relocate it, or index off something...
If all object references are handles, doesn't that make all dereferences
effectively double-dereferences? Doesn't that make a virtual method call a
quadruple dereference? What's the overhead of that?
Chad Z. Hower
2009-09-23 02:04:17 UTC
Permalink
Post by David Jeske
If all object references are handles, doesn't that make all dereferences
effectively double-dereferences? Doesn't that make a virtual method call
Well, kind of. First, in .NET they are already at least double indirections
now, and it runs OK. So we would be less....

But second, I suggest a hybrid model, since we control "all". That is,
references depend on what kind they are: stack, heap, etc...
Post by David Jeske
a quadruple dereference? What's the overhead of that?
Not necessarily. Code pointers are very different from data pointers. We've
been discussing data so far.

For code, again, there are several ways, and I suggest a hybrid approach.
For example, Motorola and other architectures supported paging but
didn't support virtual memory. Processes contain a map table, and upon
loading each process the OS remaps all references as part of the load
process. Windows does something similar with DLLs to load them into
virtual space. That's one way....
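As a sketch of that load-time remapping (hypothetical image format), the
loader patches each fixup site once, so running code pays no per-access
indirection afterwards:

static class Loader
{
    // fixupSites lists the offsets in the image that hold absolute
    // addresses; each is patched once at load, and the table is then
    // discarded, as described below.
    public static void Relocate(long[] image, int[] fixupSites, long loadBase)
    {
        foreach (int site in fixupSites)
            image[site] += loadBase;
    }
}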

Relative jumps are another way....

Again, there are many ways, and I suggest a hybrid, as well as something
suited to the target architecture. For x86, mapping during loading will
likely prove best. It's very fast and only requires some extra data to be
stored as part of the executable on disk and discarded after loading.
Relative jumps can also be used, especially for branching etc. I'd have to
check again, though, especially when used outside a VM, but I think on
Intel relative branches have performance issues, at least in some
situations.
David Jeske
2009-09-23 02:52:13 UTC
Permalink
I'm beginning to wonder if we're talking about the same thing. We're
discussing "swapping", that is, removing a chunk of a program's memory from
system ram and putting it on secondary storage... reloading it only when
necessary. I heard assertions about how much overhead the MMU/TLB machinery
is creating, so I wondered what this efficient path for detecting a
page-fault in software is....
Post by Chad Z. Hower
Well, kind of. First, in .NET they are already at least double indirections
now, and it runs OK. So we would be less....
If we were paging in software, there would be more code than a dereference.
It would need to check the handle dereference against some sentinel value to
determine whether the target was in RAM. If not, it would need to jump into
some OS machinery that could load the target off secondary storage before
returning and allowing the program to continue. This would occur at every
handle dereference, if chunks can be selectively swapped out to secondary
storage.

This is going to be more instructions and memory accesses than code without
these checks, and more expensive than "current .NET code" which already does
the handle indirection. Until it's tested, it's debatable whether this is
going to be obviously less overhead than the MMU/TLB -- with the admitted
advantage that a process which never has anything paged out doesn't need to
do these checks.

I can see how jumps might be less expensive than the above, as the target
could be replaced with the 'missing page fault handler'.
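Sketching that (hypothetical structure): if all calls go through a function
table, swapping code out just repoints the slot at a reload thunk, and no
check is needed at the call sites:

using System;

static class CodeTable
{
    static readonly Action[] slots = new Action[16];

    // All calls are indirect through the table.
    public static void Call(int fn) => slots[fn]();

    public static void Install(int fn, Action body) => slots[fn] = body;

    // On swap-out, the slot becomes a thunk; 'reload' is expected to
    // Install() the real body again before the retry call.
    public static void SwapOut(int fn, Action reload) =>
        slots[fn] = () => { reload(); slots[fn](); };
}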
Chad Z. Hower
2009-09-23 03:06:27 UTC
Permalink
Post by David Jeske
I'm beginning to wonder if we're talking about the same thing. We're
discussing "swapping", that is, removing a chunk of a program's memory
from system ram and putting it on secondary storage... reloading it only
when necessary. I heard assertions about how much overhead the MMU/TLB
machinery is creating, so I wondered what this efficient path for
detecting a page-fault in software is....
We can control what we swap. For example, we might decide only to unload
or swap code segments, resources, and parts of the heap. You can unload,
but a "swap" would be faster. We could decide not to swap stacks, and since
we will likely use a hybrid model, maybe pointers are used on stacks but
handles on other items. So we can easily intercept references to those
types of data and reload if necessary.

It's also possible we may use some CPU features, depending on how we can
mix and match them; unfortunately Intel is really quite a bad architecture.
Ring changes, memory protection, etc. are quite heavy. Also, swapping
separate VM tables in and out for each process is bad. However, Intel can't
run without at least one set of tables, so in fact we will probably run
with one "global table", doing our own protection and making the most
gains, but still using some features such as swap/hit/miss detection.

It's all subject to the "cook off"; remember, it's all swappable and
replaceable. Right now we really need to focus on finishing the compiler
redo we are working on (very close to done, a few bugs left in the first
phase), and then on integrating the debugger.
Ben Kloosterman
2009-09-23 07:06:43 UTC
Permalink
However Intel can't run without at least one set of tables?



This isn't 100% correct; remember the 32/64-bit real mode trick I linked.



With swapping you're touching on my pet hate. :-P



Every Windows system I use, I turn off swapping, and I have never had an
issue. I don't think page swapping is necessary in a modern OS, nor do I
think it's an area that's worth a lot of time. Since voice-activated systems
with full browsers can fit, OS included, in 64 MB, I think our current
2-4 GB desktop systems are completely bloated. By the time Cosmos is
released 4 GB will be standard on PCs, and we aim to run Cosmos on 64 MB
devices. Looking at my test machine, Firefox is using 624 MB of real memory
and 800 MB committed. Why? Of all those windows, I have used 2 in the last
hour; with process swapping it would probably use 120 MB. Nor would I be
offended if it gave a nice out-of-memory error when starting an app; I would
simply close some windows. If it happened often, I would spend $100 for some
more memory. At the moment many people upgrade their machine because it's
too slow, and very often it's paging heavily, but because it's behind the
scenes they don't know what makes it slow. Servers are specced/monitored to
have enough memory anyway. I bet software developers would use MUCH less
memory, nor would it take much code, with some process-swap functions for
idle windows; this in itself would make apps and the system run a lot
faster. I have also had a number of cases where apps have crashed in an
allocation loop, or an application has required so much memory that the
machine has hung with the red light on: reset time. Swapping reduces
reliability IMHO, which is why few embedded systems touch it.



For the following reasons I think a 2 GB Cosmos machine will be equal to a
4-5 GB+ Windows machine:

- A large lib framework that is shared and global, so apps are
smaller and don't need to reinvent the wheel.

- One type of lib is supported (you don't have 3 versions each for
C++, Delphi, Java, Python, etc.).

- No swapping, plus a swap-window API, so developers will be more
careful.

- We have few apps; many apps will need to be ported and hence won't
be as huge as mature products.

- No legacy support.



The exceptions are obviously web servers and database servers, but these are
normally tightly managed in terms of memory.





1) I think by default Cosmos will support no VM and only process swapping.
This will ensure tighter apps.

2) If you have an under-specced machine and want to run a 4 GB app, you
can activate a paging MM with swapping, but it will cost you the 6% TLB/mem
cost + the actual swapping cost + the page-miss cost. So you will probably
take a 10-100% hit. Though if you complain to me, I will say buy some
memory, it's cheap :)





The only thing I will miss is memory-mapped file IO, but that shouldn't be
an issue with decent shared memory.



Regards,



Ben
Royce Mitchell III
2009-09-23 11:33:45 UTC
Permalink
I'm not sure if this is a solution to what you all are talking about,
but with handles you could implement copy-on-write for objects that
are converted to immutable. In other words, if I pass an object to IPC
it's flagged as shared, and if I make a subsequent change to the object
then the runtime knows it needs to copy the structure. The cost for
this would be a single cheap branch in the code for the common case.
Once the object is sent to IPC it belongs to the destination process
with a new handle (the old handle is marked immutable). Of course, when
the GC frees the block it should be returned to its originating GC to
prevent fragmentation. Shared memory is by definition mutable, so
applications would have to implement their own protection a la locks etc.
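As a sketch of that (hypothetical types, not a real runtime API), the whole
cost on the write path is one flag test:

sealed class CowHandle
{
    byte[] data;
    bool shared;                          // set when handed through IPC

    public CowHandle(int size) => data = new byte[size];

    public void MarkShared() => shared = true;

    public void Write(int offset, byte value)
    {
        if (shared)                       // the single cheap branch
        {
            data = (byte[])data.Clone();  // first write after sharing copies
            shared = false;
        }
        data[offset] = value;
    }

    public byte Read(int offset) => data[offset];
}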

Sent from my iPhone
Chad Z. Hower
2009-09-23 11:36:26 UTC
Permalink
Because we control memory and will essentially have one flat space, we
don't need to copy, because everything can use the same physical memory.

Even with a current OS, Windows can and does, for shared memory, take a
single physical page and map it into multiple virtual spaces, so the memory
is only used once physically.
David Jeske
2009-09-23 15:58:36 UTC
Permalink
Chad - It seems like you didn't understand Roy's suggestion. He was talking
about copy-on-write semantics, not copying between processes. If IPC data is
immutable, then a translation layer in the middle might need to go through
some hoops recreating structures in order to change any of the data.
However, if a low-level block handed back a subtree of data marked as
copy-on-write, it could be shared as if it were immutable, and anyone who
changed it would end up with their own private copy of the structure. This
mechanism is present in copy-on-write pages in a VM, and happens during a
UNIX fork, for example.

That said, it might take some fancy mechanics for copy-on-write to work the
way you expected for subtrees. If you had a subtree (A -> B -> C) handed to
you, and you modified "B", you would probably expect your subtree to become
(A -> B' -> C), but the naive case would leave (A -> B -> C) intact, with
your new B' still pointing to C (B' -> C). I can't think of a way you could
cheaply turn it into (A -> B' -> C) without VM/MMU support and without
having process-local handles for the entire subtree.
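A tiny runnable illustration of that pitfall, using C# records as stand-ins
for copy-on-write objects:

using System;

record Node(string Name, Node Child);

static class CowPitfall
{
    static void Main()
    {
        var c = new Node("C", null);
        var b = new Node("B", c);
        var a = new Node("A", b);

        var bPrime = b with { Name = "B'" };   // the writer's private copy

        // The writer sees (B' -> C), but A was never rewritten, so the
        // subtree reachable from A is still the old (A -> B -> C).
        Console.WriteLine(a.Child.Name);       // prints "B", not "B'"
        Console.WriteLine(bPrime.Child.Name);  // prints "C"
    }
}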

Roy - IMO the primary challenge I was considering is making sure a
potentially large subtree is immutable without walking all the objects
inside it. The copy-on-write idea is interesting, but we're still left with
the challenge of assuring the entire subtree is set to copy-on-write when
it's handed to IPC. That was the purpose behind my "immutable by induction"
idea, where the code that constructed the return result would have to follow
induction rules. In my proposal, an IPC-capable aggregate datatype with
handles to other data could only accept pointers to data flagged as
immutable (or copy-on-write). This would "prove by induction" that when an
IPC endpoint received a 20MB subtree of structures, if the top was immutable
they would all be immutable, without any scans.

Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy-on-write) would be enough.
Chad Z. Hower
2009-09-23 16:21:44 UTC
Permalink
Post by David Jeske
Chad - It seems like you didn't understand Roy's suggestion. He was
talking about copy-on-write semantics, not copying between processes. If
Aah, yes, I missed that then. Regarding immutability and copy-on-write: it
generally works pretty well, and fortunately most objects in .NET are
immutable. Delphi uses copy-on-write for strings, for example, and while
there are a few pitfalls, generally it works pretty well.
Ben Kloosterman
2009-09-24 01:43:01 UTC
Permalink
Chad wrote :
fortunately most objects in .NET are
immutable.

??? You mean strings and value types, or am I missing something?

Regards,

Ben
Chad Z. Hower
2009-09-24 01:47:59 UTC
Permalink
Post by Chad Z. Hower
fortunately most objects in .NET are
immutable.
??? You mean strings and value types or am I missing something ?
Yes, strings etc.
Ben Kloosterman
2009-09-24 07:23:14 UTC
Permalink
This is worth noting. GCs completely stuff up page swapping, as they visit
every page. And it has a major impact on pause times.



Regards,



Ben
Chad Z. Hower
2009-09-24 11:43:04 UTC
Permalink
Again, we could be smart by not collecting swapped pages unless we really
were idle; and some pages, if completely "dead", could just be removed
from disk rather than reloaded, or even marked and modified on disk.
Ben Kloosterman
2009-09-24 13:41:23 UTC
Permalink
I mainly highlighted this because it affects our day-to-day systems. It also
means GC systems steal memory from non-GC systems.

Note the main issue: VM based on paging works because the cost to measure
access is low; GCs, however, destroy that calculation. And the problem is
that calculating whether a page is dead may be very expensive. The idea of
user-controlled swapping, or swapping idle threads, seems to have some
merit. Note a non-managed OS may be able to turn off page counting while
doing a GC mark, but even then the sweep and compaction will play havoc.
Add to this that caches need to interact with the page-swap system, or else
the LRU cache entry, which is the most likely one to be used (by
replacement), gets swapped to disk.

Regards,

Ben
David Jeske
2009-09-24 18:31:28 UTC
Permalink
Post by Ben Kloosterman
I mainly highlighted this because it affects our day to day systems..It also
means GC systems steal memory from non GC systems.
That "stealing" point is really interesting.

I wonder how the graph changes under page-replacement schemes which are more
complicated than LRU (LRU2, for example). The greater context of the paper
was a design for a GC system that interacted pleasantly with a paging
system. It's interesting information for a system design based on GC.

While some of the arguments against paging/swapping here are interesting,
my primary focus on desktops means I think partial swapping is critical. For
embedded and small systems, swapping may not be that important. However, on
the desktop, the "DOS days" (when loading a given dataset might be "not
possible" on your machine unless you added more memory) are the past;
returning to that would be going backwards. Not allowing a program to run
because it needs 1% more memory, or requiring the programmer to manage the
swapping himself, are both bad solutions.

I wonder how high the overhead is for simply accumulating the
page-replacement information in a typical MMU-based virtual memory system.

Further, I wonder if a software virtual machine could do a better job of
exploiting cache locality and avoiding redundant checks. For example, a
program could have two instruction streams, one "fast", one "slow". In the
"slow" instruction stream, all object accesses could be recorded to estimate
an LFU number for the objects, and each object access would be double-checked
to see if the object had been paged out (and needed to be restored before
the code continues). In the "fast" instruction stream, there would be NO
code updating object usage information. A program would only run the "slow"
code when it falls into a situation where partial paging might be required
for that process. This would eliminate ALL of the replacement-algorithm
overhead when it is not needed. When paging is needed, the slower speed of
code execution is likely to be irrelevant compared to the pauses created by
paging in from secondary storage. Further, the ability to record enough
information to do a much higher quality LFU (instead of LRU) might
actually make the program run much faster overall than it would if it were
running in a typical LRU or CLOCK based MMU/VM system. This might be
integrated with a compacting/copying GC system to effectively have pools of
"not recently touched objects" which could be paged out in reasonably sized
blocks (instead of tiny objects), and which could maintain enough
information for GC to complete without actually bringing those objects back
into memory (i.e. cards to track pointers from those objects back into more
recently used objects).
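Roughly, as a sketch (invented names; the real mechanism would be two
generated instruction streams rather than a runtime flag, but the flag
stands in for the mode switch):

using System;

static class DualPath
{
    static bool pagingMode;                  // flipped by the OS under memory pressure
    static readonly long[] lfuCounters = new long[1024];

    // The same access compiled twice; the "fast" stream does no bookkeeping.
    public static byte Load(byte[][] heap, int handle, int offset) =>
        pagingMode ? SlowLoad(heap, handle, offset)
                   : heap[handle][offset];

    static byte SlowLoad(byte[][] heap, int handle, int offset)
    {
        lfuCounters[handle]++;               // usage data for an LFU policy
        if (heap[handle] == null)
            heap[handle] = PageIn(handle);   // restore before continuing
        return heap[handle][offset];
    }

    static byte[] PageIn(int handle) => new byte[4096];  // stand-in loader
}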

Did that make sense? I'll have to think on this some more.
Ben Kloosterman
2009-09-25 04:17:04 UTC
Permalink
On Thu, Sep 24, 2009 at 6:41 AM, Ben Kloosterman <bklooste-***@public.gmane.org> wrote:

I mainly highlighted this because it affects our day to day systems..It also
means GC systems steal memory from non GC systems.



That "stealing" point is really interesting.



I wonder how the graph changes in page replacement schemes which are more
complicated than LRU. (LRU2 for example) The greater context of the paper
was a design for a GC system that interacted pleasantly with a paging
system. It's interesting information for a system design based on GC.



Complex algorithms were considered a long time ago but were seen as
counterproductive. The thing is, every time you do a collection the page
gets walked, and this is hard to distinguish from real use.





This is nice

"

Page replacement algorithms were a hot topic of research and debate in the
1960s and 1970s. That mostly ended with the development of sophisticated LRU
approximations and working set algorithms. Since then, some basic
assumptions made by the traditional page replacement algorithms were
invalidated, resulting in a revival of research. In particular, the
following trends in the behavior of underlying hardware and user-level
software has affected the performance of page replacement algorithms:

* Size of primary storage has increased by multiple orders of
magnitude. With several gigabytes of primary memory, algorithms that require
a periodic check of each and every memory frame are becoming less and less
practical.

* Memory hierarchies have grown taller. The cost of a CPU cache miss
is far more expensive. This exacerbates the previous problem.

* Locality <http://en.wikipedia.org/wiki/Memory_locality> of
reference of user software has weakened. This is mostly attributed to the
spread of object-oriented programming
<http://en.wikipedia.org/wiki/Object-oriented_programming> techniques that
favor large numbers of small functions, use of sophisticated data structures
like trees <http://en.wikipedia.org/wiki/Tree_data_structure> and hash
<http://en.wikipedia.org/wiki/Hash_table> tables that tend to result in
chaotic memory reference patterns, and the advent of garbage collection
<http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29>
that drastically changed memory access behavior of applications.

Requirements for page replacement algorithms have changed due to differences
in operating system kernel
<http://en.wikipedia.org/wiki/Kernel_%28computer_science%29> architectures.
In particular, most modern OS kernels have unified virtual memory and file
system caches, requiring the page replacement algorithm to select a page
from among the pages of both user program virtual address spaces and cached
files. The latter pages have specific properties. For example, they can be
locked, or can have write ordering requirements imposed by journaling
<http://en.wikipedia.org/wiki/Journaling_file_system> . Moreover, as the
goal of page replacement is to minimize total time waiting for memory, it
has to take into account memory requirements imposed by other kernel
sub-systems that allocate memory. As a result, page replacement in modern
kernels (Linux <http://en.wikipedia.org/wiki/Linux> , FreeBSD
<http://en.wikipedia.org/wiki/FreeBSD> , and Solaris
<http://en.wikipedia.org/wiki/Solaris_%28operating_system%29> ) tends to
work at the level of a general purpose kernel memory allocator, rather than
at the higher level of a virtual memory subsystem.

"
From Wikipedia. Better algorithms than LRU have existed for a long time,
and aging ones with access counters offer much better performance, but even
LRU is seen as "expensive".

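For reference, the classic aging approximation costs only a shift and an OR
per page per sampling tick (a standard textbook algorithm, sketched here in
C#):

static class Aging
{
    // Shift each counter right and OR the referenced bit into the top,
    // so recent use keeps counts high; the eviction victim is the page
    // with the smallest counter.
    public static void Tick(byte[] counters, bool[] referenced)
    {
        for (int i = 0; i < counters.Length; i++)
        {
            counters[i] = (byte)((counters[i] >> 1) | (referenced[i] ? 0x80 : 0));
            referenced[i] = false;   // clear the per-interval access bit
        }
    }
}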




While some of the arguments against paging/swapping here are interesting, my
primary focus on desktops means I think partial swapping is critical. For
embedded and small systems swapping may not be that important. However, on
the desktop, the "DOS days" when loading a given dataset might be "not
possible" on your machine unless you added more memory is the past, that's
going backwards. Not allowing a program to run because it needs 1% more
memory, or requiring the programmer to manage the swapping himself are both
bad solutions.



It can be perceived that way by some technical people, but I'm not convinced:

- The fact that on my 3 GB machine, running about 2 GB of applications,
switching to an app I haven't used for 5 minutes takes longer than
loading it fresh is, to me, going backwards. Persisting the whole app and
doing a sequential read will get it back.

- Programmers create bloat; on 64 MB devices, which I used a lot (I
used to develop for Windows Mobile), I rarely saw out-of-memory errors
except when I did something stupid.

- In nearly all cases the machine should have enough memory to run the
working set if you close your other windows; if not, you may be able to
"run" it on a paging system, but it will run so slowly as to be unusable.
E.g. we had this exact issue with our compiler when it was hitting the 2 GB
address space (when my machine was 2 GB): it ran so slowly as to be
unusable, and in most cases I had to reset the machine to get control back.

- Users don't mind a "close a window" message; it teaches good habits.
Note this is more of an issue with newer browsers having multiple address
spaces/sandboxes.

- The message can be mitigated by swapping background/unused apps,
freeing most of the address space. (Note: here users complain to the
developers, the developers swap the background window, and the system is
less bloated.)



Add to this the technical issues:

- A sequential-read hibernate is about the same speed as bringing a
medium-to-large app active.

- The large number of caches in typical systems fail, as the LRU cache
entry, which is the most likely one to be used (by replacement), is also
the most likely to be paged out, defeating the whole purpose of the cache
(this goes for many caches, not just disk).

- Free memory that is used to quickly start new apps can't be paged.

- The GC issue outlined.

- Thrashing.

- A memory-allocation bug, or malicious apps/scripts, can kill the
system unless you impose quotas, which few OSes use in practice.



I wonder how high the overhead is for simply accumulating the
page-replacement information in a typical MMU based Virtual memory.



I haven't read up on this recently, but a LOT of research showed it only
worked well when kept simple; some early systems had more complicated,
software-based policies, like counts rather than last access time. Any form
of global write barrier just kills the system. The same has also been tried
in GCs: it would be very nice for a GC to have information on how often an
object is accessed and for how long, as it could then be promoted earlier,
and well-used objects could be compacted together, providing major gains;
the cost just isn't worth it. The hardware (x86) protection bits are really
designed for how systems worked in the 80s; this works quite well when the
OS is a 70s design, but it is beginning to leak.



Further, I wonder if a software virtual machine could do a better job
exploiting cache locality and avoiding redundant checks. For example, a
program could have two instruction streams, one "fast", one "slow". In the
"slow" instruction stream, all object access could be recorded to estimate
an LFU number for the objects, and object access to be double-checked to see
if the object had been paged out (and needs to be restored before the code
continues). In the "fast" instruction stream, there would be NO code
updating object usage information.



Yes, exactly. Note that if a thread is marked "fast", a GC could also
repack these pieces close together (maybe following a reference graph),
improving cache hits; this can easily get you a 10-20% gain (for no cost,
if you know these bits will be fast).



A program would only run the "slow" code when it falls into a situation
where partial paging might be required for that process. This would
eliminate ALL of the replacement algorithm overhead when it is not needed.
When paging is needed, the slower speed of code-execution is likely to be
irrelevant compared to the pauses created by paging in from secondary
storage. Further, the ability to record enough information to do a much more
high quality LFU (instead of LRU) might actually make the program run much
faster overall than it would if it were running in a typical LRU or CLOCK
based MMU/VM system. This might be integrated with a compacting/copying GC
system to effectively have pools of "not recently touched objects" which
could be paged out in reasonably sized blocks (instead of tiny objects), and
which could maintain enough information for GC to complete without actually
bringing those objects back into memory (i.e. cards to track pointers from
those objects back into more recently used objects).



Yes, this sort of software paging is certainly possible and has the
advantage of the GC working with it, i.e. it's smart. Note this is also
similar to a ref-counting GC, in that you need to change the code stream to
add the reference counts.

The MM could hand out virtual "slow" pages, with addresses above real
memory, to the GCs. We could even use the HW to trap this, and use a pool
to swap pages in and out.



Also consider this: using a ref-counting GC with frequent checks (= slow)
has half the memory use of a generational mark-sweep with infrequent
checks, and you can also reclaim unallocated memory in GCs, reduce
caches/nurseries, etc. With these methods in place an OS could reduce the
memory for an app to half of what is optimal. So your app that normally
requires 2 GB ("recommended") will run in 1 GB. OK, you may not be able to
run a 1 GB working set on a 100 MB machine, but would you in a page-swapping
system anyway? Consider all the free memory in an OS these days: disk
caches, web caches, network buffers, network caches, extra memory allocated
to processes/GCs in case they need it, free memory, GC nursery size, etc.
Bigger gains than paging can be made by optimizing memory allocation and
reclamation. I like the comment on Wikipedia that the Linux paging
algorithm acts more like a memory allocator.



Did that make sense? I'll have to think on this some more.



Yes, there are a LOT of options. Singularity was created to play with this,
and the guys did the right thing releasing it for research (Singularity has
both a paging and a flat MMU mode); however, it has the Microsoft stink,
and a lot of researchers are obsessed with Linux.



Considering modern HW will soon have 8 GB, 30-100 HW threads (+GPU), and
memory access times of 100 cycles with 3 levels of caching, we have to
rethink the way an OS works. Real memory use (i.e. locks), parallelism, and
cache efficiency will be dominant for performance. Though I will repeat:
the biggest gains are in reliability and security.



The problem is there is no recent research on different systems, and the
research from 30 years ago which does cover this doesn't cover modern
type-safe software on the systems described above (except for
Algol/Burroughs etc) :-( At least there IS a lot of research on GCs, which
is VERY relevant as they are a software MMU. (Note the write barrier in the
CLR GC just marks a page as used so it can scan it; it's inefficient to mark
it every time.)



Going bravely forth with little research is kind of scary and we don't have
the resources to try many things.



Regards,



Ben
Ben Kloosterman
2009-09-24 01:05:37 UTC
Permalink
Chad - It seems like you didn't understand Roy's suggestion. He was talking
about copy-on-write semantics, not copying between processes. If IPC data is
immutable, then a translation layer in the middle might need to go through
some hoops recreating structures in order to change any of the data.
However, if a low-level block handed back a subtree of data marked as
copy-on-write, it could be shared as if it was immutable, and anyone who
changed it would end up with their own private copy of the structure. This
mechanism is present in copy-on-write pages in a VM, and happens during a
UNIX fork for example.



That said, it might take some fancy mechanics for copy-on-write to work the
way you expected for subtrees. If you had a subtree (A -> B -> C) handed to
you, and you modified "B", you would probably expect your subtree to be (A
-> B' -> C), but the naive case would leave (A -> B -> C) intact, with your
new B' still pointing to C, (B' -> C). I can't think of a way you would
cheaply be able to turn it into (A -> B' -> C) without VM/MMU support and
without having process-local handles for the entire subtree.



I'm moving away a little bit from copy-on-write for shared memory, as it
seems to lead to complications.



Roy - IMO the primary challenge I was considering is making sure a
potentially large subtree is immutable without walking all the objects
inside it. The copy-on-write idea is interesting, but we're still left with
the challenge of assuring the entire subtree is set to copy-on-write when
it's handed to IPC. That was the purpose behind my "immutable by induction"
idea, where the code that constructed the return result would have to follow
induction rules. In my proposal an IPC-capable aggregate datatype with
handles to other data could only accept pointers to data flagged as
immutable (or copy-on-write). This would "prove by induction" that when an
IPC endpoint received a 20MB subtree of structures, if the top was immutable
they would all be immutable without any scans.



Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.





In theory such scans could be made at compile time and hence the cost would
be immaterial. We can't do a reference scan anyway, since a user can change a
null reference, so we need to do a type scan instead. In some ways the scan
is easy: any field which is not a value type or string is inherently mutable,
so only properties need to be checked. If there are no public set properties
it's immutable; if there are public set properties it's mutable; if the set
properties are internal it's mutable, but only the creator can change it.
Note this check is a casual check and is only a help to a developer.



Actually I don't think we can make such a check, as this is not in the IL. In
IL field = property and there is only one access modifier. (Can someone
confirm?) This means that a reference sub-object will always be mutable
unless the sub-types are all value types or strings.



This is probably worthwhile: shared mem = structs + strings. (I have thought
of this before but didn't pursue it as it needs more compiler changes.) We
need some way to mark the structs with an attribute so they are not boxed
and are created on the shared heap instead of the stack. Is this possible?
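For illustration only (the attribute and its allocation semantics are the
proposal here, not an existing compiler feature), the usage might look like
this:

using System;

[AttributeUsage(AttributeTargets.Struct)]
public class SharedHeapAttribute : Attribute { }

// The compiler change being asked for: instances of a struct marked this
// way would be created on the shared heap, never boxed onto a local heap.
[SharedHeap]
public struct FileStatInfo
{
    public int create_t;
    public int mod_t;
    public string fname;   // strings are immutable, so safe to share
}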



However, to pass a buffer all the way through without copying, we could do
all buffers or a single buffer (via the entry reference). To do a grouping of
buffers we need some mechanism to get a pointer to the locations of the
various buffers. This should be possible with some unsafe buffer manager
code (i.e. part of a trusted lib), but we need to ensure that these buffers
do not get flushed/reused until the receiver is finished with them. The
buffer manager can also assist with reading the buffer. This would work
great for large buffers.



Is this better than having a global lock via a single indirect ref on the
shared memory region like Sing? It will be more programmer friendly (and not
allow object spaces); it will probably also have better concurrency. Does
sound kind of elegant; the only question is whether we can create the
structs in shared mem rather than on the stack.



What do we gain compared to the Sing system or my access block with lock ?




Regards,



Ben
davidj
2009-09-24 15:39:59 UTC
Permalink
Post by David Jeske
Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.
In theory such scans could be made at compile time and hence the cost would
be immaterial.
I don't understand this suggestion, what do you mean? Scans through
dynamically allocated datastructures can't be done at compile time.

If program fragments are restricted to pure-functional (no side effects, no
access to non-local variables), then it could be proven at compile time that
nobody holds other references to the datastructures.
Post by David Jeske
We can't do a reference scan anyway, since a user can change a null
reference, so we need to do a type scan instead. In some ways the scan is
easy: any field which is not a value type or string is inherently mutable,
so only properties need to be checked. If there are no public set properties
it's immutable; if there are public set properties it's mutable; if the set
properties are internal it's mutable, but only the creator can change it.
Note this check is a casual check and is only a help to a developer.
Can you better explain what you are talking about here?

Let me put it in the context of an example. If a filesystem driver is
building a return result for a directory scan with stat-type information, it
might build a return result like LinkedList<FileStatInfo>, where
FileStatInfo has members like { int create_t, mod_t; string fname }.

Now, we can't make the FileStatInfo struct or the LinkedList immutable at
compile time, because the driver needs to build the return result first.
However, at some point the return result will be built, and we'd like to
"freeze" the tree of objects to make it immutable. I imagine the process
something like:

LinkedList<FileStatInfo> result = new LinkedList<FileStatInfo>();
foreach (var inf in filesystem.walkthrough_structures()) {
    result.Add(new FileStatInfo(inf.a, inf.b, inf.c, inf.d));
}

FREEZE_TO_IMMUTABLE(result); // this has to scan the list
return result;


The question is, how is the result frozen to immutable so that it can no
longer be changed?

(A) One method would be to walk the object graph at runtime and mark all the
structures as immutable.

(B) Another method would be to build some custom support for
compiler-enforced immutable arrays and aggregate datatypes. In the example
above, we would build into a mutable list and then "freeze" it into an
immutable array by letting some underlying trusted code copy the values
(think of this like StringBuilder -> String conversion).

(C) We could have datastructures designed to have a modeswitch from mutable
to immutable. At runtime we would scan them to "flip the switch" and they
would become immutable. This could be coded directly in C#, but the compiler
would enforce the ability to turn immutable (i.e. no public fields, all
access through generated properties which check for mutability first).

(D) We could do something like (C), but instead of scanning a subtree to
flip the switch, we could enforce immutability by induction during
construction. That is, one of these FreezableStructs would only allow
property sets to values which are already set to immutable. Likewise with
lists, which would only allow elements which are set immutable. The
pseudocode for (D) would be:

LinkedList<FileStatInfo> result = new LinkedList<FileStatInfo>();
foreach (var inf in filesystem.walkthrough_structures()) {
    FileStatInfo fsi = new FileStatInfo();
    fsi.a = inf.a; fsi.b = inf.b; fsi.c = inf.c; // rvalues required to be immutable
    fsi.freeze();
    result.Add(fsi); // this is only allowed because fsi is frozen
}
result.freeze(); // this is O(1) because we know we only reference frozen objects
return result;

This example above is what I was referring to several emails ago in my
"immutable proof by induction" explanation. The same process could extend to
large nested structures of objects, where each "freeze()" operation is O(1)
because we require the rvalue (RHS) to be immutable at every assignment.
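For concreteness, a minimal sketch of option (D) in real C# (the interface
and list here are hypothetical illustrations, not a worked-out design):

using System;
using System.Collections.Generic;

public interface IFreezable
{
    bool IsFrozen { get; }
    void Freeze();
}

public class FreezableList<T> : IFreezable where T : IFreezable
{
    private readonly List<T> items = new List<T>();
    private bool frozen;

    public bool IsFrozen { get { return frozen; } }

    public void Add(T item)
    {
        if (frozen) throw new InvalidOperationException("list is frozen");
        if (!item.IsFrozen)
            throw new InvalidOperationException("element must be frozen first");
        items.Add(item);  // induction step: only frozen elements ever enter
    }

    public void Freeze() { frozen = true; }  // O(1): contents proven frozen
}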
Ben Kloosterman
2009-09-24 23:51:51 UTC
Permalink
On Wed, Sep 23, 2009 at 6:05 PM, Ben Kloosterman <bklooste-***@public.gmane.org> wrote:

Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.

In theory such scans could be made at compile time and hence the cost would
be immaterial.

I don't understand this suggestion, what do you mean? Scans through
dynamically allocated datastructures can't be done at compile time.



For strongly typed systems they can. Note we are only checking whether the
classes are all immutable or potentially immutable. (Note the later
comments on this not being possible for reference types (except strings)
where depth is > 1.)



If program fragments are restricted to pure-functional (no side effects, no
access to non-local variables), then it could be proven at compile time that
nobody holds other references to the datastructures.



Local variables are on the local stack, pointing into the stack or the GC
heap; these are always bad.





More later



Regards,



Ben
David Jeske
2009-09-25 00:52:25 UTC
Permalink
Post by David Jeske
Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.
In theory such scans could be made at compile time and hence the cost would
be immaterial.
I don't understand this suggestion, what do you mean? Scans through
dynamically allocated datastructures can't be done at compile time.
For strongly typed systems they can. Note we are only checking whether the
classes are all immutable or potentially immutable. (Note the later
comments on this not being possible for reference types (except strings)
where depth is > 1.)
This isn't a "scan at compile time", but yes, this is exactly the technique
I explained. The compiler and type rules can verify that the code follows a
proof-by-induction immutability.
Ben Kloosterman
2009-09-25 05:15:24 UTC
Permalink
On Thu, Sep 24, 2009 at 4:51 PM, Ben Kloosterman <bklooste-***@public.gmane.org> wrote:

Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.

In theory such scans could be made at compile time and hence the cost would
be immaterial.

I don't understand this suggestion, what do you mean? Scans through
dynamically allocated datastructures can't be done at compile time.

For strongly typed systems they can. Note we are only checking whether the
classes are all immutable or potentially immutable. (Note the later
comments on this not being possible for reference types (except strings)
where depth is > 1.)

This isn't a "scan at compile time", but yes, this is exactly the technique
I explained. The compiler and type rules can verify that the code follows a
proof-by-induction immutability.



I should clarify: it can be checked by the install-time IL -> x86 compile, as
the type information should give you what you need. The check is actually
much easier: basically any value types and strings are safe, any reference is
not (since you can set it to null). Kind of a bummer; Spec# has non-nullable
references exactly for this reason (and this alone makes it very attractive
to me). Haven't thought of a way around this yet (except for the all-struct
structures and the Buffer Manager, for the bits that need to be fast).



We could maybe use an ObjectHolder type class which the compiler knows about
(which can't be null), or even better a ReadOnlyObjectHolder where the
constructor is the only way to set it. It won't be pretty but will work I
think. The code itself can check if (value == null) throw new
NullRefNotSupportedException, but this is hard to prove without generating
the code and running it. Again the check is trivial. Interesting; needs more
thought.
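As a sketch of the ReadOnlyObjectHolder idea (the runtime null check is a
stand-in for the non-nullable reference Spec# would give us at compile
time):

using System;

public sealed class ReadOnlyObjectHolder<T> where T : class
{
    private readonly T value;

    public ReadOnlyObjectHolder(T value)
    {
        if (value == null)  // can't be proven statically, so check at creation
            throw new ArgumentNullException("value");
        this.value = value;
    }

    // The constructor is the only way to set it; there is no setter.
    public T Value { get { return value; } }
}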



Regards,



Ben
Matthijs ter Woord
2009-09-25 07:38:49 UTC
Permalink
This is one of the two things where CCIAST can help us out I think: code
modifications (managed memory stuff) and code analysis (proofing stuff).
Post by David Jeske
Perhaps the scans would be cheap, and just scanning the subtree to set them
all immutable (or copy on write) would be enough.
In theory such scans could be made at compile time and hence the cost would
be immaterial.
I don't understand this suggestion, what do you mean? Scans through
dynamically allocated datastructures can't be done at compile time.
For strongly typed systems they can. Note we are only checking whether the
classes are all immutable or potentially immutable. (Note the later
comments on this not being possible for reference types (except strings)
where depth is > 1.)
This isn't a "scan at compile time", but yes, this is exactly the technique
I explained. The compiler and type rules can verify that the code follows a
proof-by-induction immutability.
I should clarify: it can be checked by the install-time IL -> x86 compile, as
the type information should give you what you need. The check is actually
much easier: basically any value types and strings are safe, any reference is
not (since you can set it to null). Kind of a bummer; Spec# has non-nullable
references exactly for this reason (and this alone makes it very attractive
to me). Haven't thought of a way around this yet (except for the all-struct
structures and the Buffer Manager, for the bits that need to be fast).
We could maybe use an ObjectHolder type class which the compiler knows about
(which can't be null), or even better a ReadOnlyObjectHolder where the
constructor is the only way to set it. It won't be pretty but will work I
think. The code itself can check if (value == null) throw new
NullRefNotSupportedException, but this is hard to prove without generating
the code and running it. Again the check is trivial. Interesting; needs more
thought.
Regards,
Ben
Ben Kloosterman
2009-09-25 11:08:18 UTC
Permalink
Can you better explain what you are talking about here?



Let me put it in the context of an example. If a filesystem driver is
building a return result for a directory scan with stat-type information, it
might build a return result like LinkedList<FileStatInfo>, where
FileStatInfo has members like { int create_t, mod_t; string fname }.



Now, we can't make the FileStatInfo struct or the LinkedList immutable at
compile time, because the driver needs to build the return result first.



However, at some point the return result will be built, and we'd like to
"freeze" the tree of objects to make it immutable. I imagine the process
something like:



LinkedList<FileStatInfo> result = new LinkedList<FileStatInfo>();
foreach (var inf in filesystem.walkthrough_structures()) {
    result.Add(new FileStatInfo(inf.a, inf.b, inf.c, inf.d));
}

FREEZE_TO_IMMUTABLE(result); // this has to scan the list
return result;



The question is, how is the result frozen to immutable so that it can no
longer be changed?

(A) One method would be to walk the object graph at runtime and mark all the
structures as immutable.

(B) Another method would be to build some custom support for
compiler-enforced immutable arrays and aggregate datatypes. In the example
above, we would build into a mutable list and then "freeze" it into an
immutable array by letting some underlying trusted code copy the values
(think of this like StringBuilder -> String conversion).

(C) We could have datastructures designed to have a modeswitch from mutable
to immutable. At runtime we would scan them to "flip the switch" and they
would become immutable. This could be coded directly in C#, but the compiler
would enforce the ability to turn immutable (i.e. no public fields, all
access through generated properties which check for mutability first).

(D) We could do something like (C), but instead of scanning a subtree to
flip the switch, we could enforce immutability by induction during
construction. That is, one of these FreezableStructs would only allow
property sets to values which are already set to immutable. Likewise with
lists, which would only allow elements which are set immutable. The
pseudocode for (D) would be:



LinkedList<FileStatInfo> result = new LinkedList<FileStatInfo>();
foreach (var inf in filesystem.walkthrough_structures()) {
    FileStatInfo fsi = new FileStatInfo();
    fsi.a = inf.a; fsi.b = inf.b; fsi.c = inf.c; // rvalues required to be immutable
    fsi.freeze();
    result.Add(fsi); // this is only allowed because fsi is frozen
}
result.freeze(); // this is O(1) because we know we only reference frozen objects
return result;



This example above is what I was referring to several emails ago in my
"immutable proof by induction" explanation. The same process could extend to
large nested structures of objects, where each "freeze()" operation is O(1)
because we require the rvalue (RHS) to be immutable at every assignment.





Yes, the collections are the big issue. Arrays are immutable in length, but
the elements can always be set to null. Grr, no non-null type like Spec#.



For structs it's easy:

Point p1;

Send(p1);      // always immutable, since structs are copied on send

Send(ref p1);  // this is mutable









In the vein of option B:

What about the following: objects have a few bits in memory to do things
like a sync lock; we can put such a bit at the high end of the lock field to
say this object (not type!) is immutable. What happens then is this: when a
write occurs, the GC (which normally marks the page as used for scanning
purposes) can fault if the immutable bit is set. Note this does not have to
be all objects, just objects marked with the Immutable attribute or, better
yet, interface.
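A sketch of the idea (ObjectHeader and the barrier routine are hypothetical
stand-ins for runtime internals, not actual Cosmos code):

using System;

class ObjectHeader
{
    public int SyncBlock;    // lock word with spare high bits
    public bool CardMarked;  // page/card marked dirty for GC scanning
}

static class WriteBarrier
{
    const int ImmutableBit = 1 << 30;  // spare bit at the high end of the lock field

    // The barrier the GC already runs on reference writes can also fault
    // when the object (not the type!) has its immutable bit set.
    public static void OnReferenceWrite(ObjectHeader target)
    {
        if ((target.SyncBlock & ImmutableBit) != 0)
            throw new InvalidOperationException("write to frozen object");
        target.CardMarked = true;  // the barrier's normal bookkeeping
    }
}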



We then need to allow all forms of pass-by-copy, but how do we handle
strings, probably the most useful type?



However, aren't we still relying on the dev to call freeze? Isn't that the
same effort for him as writing immutable classes?

This looks strange/sub-optimal for immutable classes:



public class TestImmutable : IImmutable
{
    private string data;

    public TestImmutable(string data) { this.data = data; }

    public string Data { get { return data; } }

    public void Freeze() { } // do nothing??
}





This immutable stuff will be pretty impressive if it works out, especially
for concurrency. You don't need shared memory for immutable objects. You
still need to create traditional shared memory for communicating back and
forth, however, since the receiver can't alter the data.





In the vein of (D), why can't we scan types when we compile, so that any
type with no public fields and no public set properties is always immutable?
(Internal set properties can be used by only the creator.) A sketch of such
a check follows the example below.

Eg:





public class TestImmutable : IImmutable, IDataProvider
{
    private string data;

    internal TestImmutable(string data) { this.data = data; }

    public string Data
    {
        get { return data; }
        internal set { data = value; }  // only the creator's assembly can change it
    }
}



The above can only be altered by the creator, so an interface can be used
for the receiver:



interface IDataProvider
{
    string Data { get; }
}
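Here is the promised sketch of that compile/install-time scan, written with
reflection for readability (a real check would read the IL metadata
directly); as noted above, this is a casual developer aid, not a proof:

using System;
using System.Reflection;

static class ImmutabilityCheck
{
    public static bool LooksImmutable(Type t)
    {
        if (t.IsValueType || t == typeof(string))
            return true;   // value types and strings are safe by construction

        if (t.GetFields(BindingFlags.Public | BindingFlags.Instance).Length > 0)
            return false;  // public fields are freely writable

        foreach (PropertyInfo p in t.GetProperties(BindingFlags.Public | BindingFlags.Instance))
            if (p.GetSetMethod() != null)   // returns null for internal/private setters
                return false;               // a public setter means anyone can mutate

        return true;  // any remaining setters are internal, usable by the creator only
    }
}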





Regards,



Ben
Ben Kloosterman
2009-09-26 08:15:10 UTC
Permalink
OK, I want to get into some detail on the shared memory discussions.



Firstly, the main issues for all concerned:

· We can pass any reference, but is it valid for a reference to be in the destination? (Possible runtime checks.)

· How do we ensure receivers have the minimum visibility needed? E.g. for a Web page coming up through the stack, the application should not see the TCP header or IP header. The same goes for disk buffers: if they were not meant for an app they should not be visible.

· How do we manage the memory of the shared memory region?

· How do we handle or prevent references from shared memory to a process's memory and vice versa?

· How do we ensure multi-threaded data integrity for data in the shared memory region?

· How difficult will it be for a programmer to use?

· If the 2 processes trust each other, e.g. 2 kernel processes, do we treat them the same as if there was no trust between them?

· How do we manage types in shared memory?

· Can we manage access to a shared region, e.g. read only?

· GC collect times if there are too many dependencies/locks.



For purposes of this exercise please assume IPC uses a different mechanism.



Process structure: a Cosmos B process contains a single thread and a sub object space of the application. Main() is the root process, and other sub processes are created to form a process tree. The processes in the process tree normally share an allocator pool, but they could have their own allocator. A traditional large object space app is possible by having multiple processes all share the same large object space and the same allocator pool. Static data is valid only in the allocator pool (i.e. not system wide). An appdomain is all processes that share an allocator pool (it may be a sub application or multiple applications).



Global issues: all receivers must know the type, hence shared objects must reside in a shared lib.



I expect to support all of these except maybe 4. Scheme 2 can be used to create a very effective zero-copy, no-locking, message-based IPC in the Minix style. 2 is simple (probably low bugs) and efficient, but may not be great for rapidly changing data. 3 is inefficient but can handle a large amount of mutable data.



Direct send pointer schemes





1. Intra-AppDomain pointer passing

A process via RPC passes a pointer directly to another process residing in the same AppDomain (or process tree).

Only works within an AppDomain (i.e. all processes sharing the same collector). In practice it's the same as passing a pointer to a different class on a different thread in a normal app.



2. Inter-AppDomain pointer passing

A process via IPC passes a pointer directly to another process. Objects are checked at compile time for immutability. The sender retains a dummy reference for each reference he sends, which he will remove when the receiver states it has finished or if the receiver is terminated. Objects must be fixed.

Option: use a shared heap with ref counting to store these objects. While it relieves the sender's GC, I don't think this is needed, because large amounts of immutable data are not really common (by large I mean object count, not bytes). Scheme 3 can handle this scenario more easily.

Option: the sender must explicitly allow the reference to be used by the receiver. This way we can ensure no non-privileged process has a reference outside of its domain unless specifically allowed. (Very useful in debug builds.)



Shared Mem Schemes

3. A shared memory region is created by a process. Shared memory types need not be immutable. The user needs to create these and all child objects on the shared mem region; objects may be cloned. The shared heap region is prevented from having any pointers to and from any SIP (as any ref objects must be ObjectHolder). A special kernel indirect object is used to obtain a reference to shared memory; this indirect object switches on a global lock on the shared memory region root object, and when the user is not the owner the pointer is set to null. (Note the system still maintains the reference for the ref counting GC; if the caller is killed, so is the system object, and the reference will be removed.) When the owner (i.e. someone who holds the lock) is killed, the region is killed, and all the non-owners' control blocks' internal pointers will be permanently set to null, since they can't get the lock. Note the sending GC does not need to wait for the lock, nor does the shared memory GC. Note multiple control blocks are allowed in a process.

A special using block is used:





Process.CreateSharedRegion<string>("Region10"); // sets the root type to string

using (var shared = Process.SharedMemLocks["Region10"].GetLock()) // access to shared here
{
    var data = shared.CreateObject(typeof(string), "New string"); // is there a more elegant way to create shared objects and still have interactions with the local GC working?
    shared.root = data;
}



// shared memory is no longer accessible



Issue 1: The object-creation method is very cumbersome. We could define shared types with interfaces, but how do we handle strings? We would need a SharedString and create our own SharedCollections. The good thing is the creation could check that all references are ObjectHolder<YourType>.

Issue 2: Rather than use ObjectHolders, can the IL-to-x86 compiler handle this, i.e. if we will assign the reference to shared memory it must be a shared memory object?







4. As 3, but used for trusted partners. It's open: there is no checking of references; there is a lock, but its use is not enforced (as the references are not checked). It is up to the developer to manage this and his own dummy references to ensure things don't get collected.



Regards,



Ben
Royce Mitchell III
2009-09-26 13:32:28 UTC
Permalink
Post by Ben Kloosterman
· How do we ensure receivers have the minimum visibility needed? E.g. for
a Web page coming up through the stack, the application should not see the
TCP header or IP header. The same goes for disk buffers: if they were not
meant for an app they should not be visible.
Create a special API drivers can use to create a weak reference to only the
sub-portion of data you want to share. I assume the JIT is going to enforce
boundaries across the board.
Post by Ben Kloosterman
· If the 2 processes trust each other, e.g. 2 kernel processes, do
we treat them the same as if there was no trust between them?
I think trust is a really bad idea. I can't imagine there being a
significant performance gain, but there will be a significant loss in
isolation ( if one crashes, how do you know which is responsible )
Ben Kloosterman
2009-09-27 06:35:34 UTC
Permalink
. How do we ensure receivers have the minimum visibility needed? E.g. for
a Web page coming up through the stack, the application should not see the
TCP header or IP header. The same goes for disk buffers: if they were not
meant for an app they should not be visible.

Create a special API drivers can use to create a weak reference to only the
sub-portion of data you want to share. I assume the JIT is going to enforce
boundaries across the board.



Yes, the JIT/dynamic loader will enforce boundaries, but largely they are
there as part of the language.



Eg

Any type marked private in CIL is not visible to the outside world, nor can
it be created from outside the assembly. Basically apps are object domains
from a root; unless you have access to a reference you can't see any part of
the app. E.g. if you have 2 objects in an AppDomain, they have no access to
each other unless something passed a reference; this is the core of the
isolation.



In terms of CIL hacks, I think they will be picked up by the AOT compiler
or the dynamic loader.



call SystemFunc1    ; this is the IL; if the user has no visibility the
                    ; Dynamic Loader won't patch the address

jmp 1020102         ; this will either be illegal or will need to be checked

jmp myLabel         ; this is legal

jne/branch -10000   ; we will need to ensure all branches stay within the
                    ; method; I'm not sure if this is even legal in CIL

. If the 2 processes trust each other, e.g. 2 kernel processes, do we
treat them the same as if there was no trust between them?

I think trust is a really bad idea. I can't imagine there being a
significant performance gain, but there will be a significant loss in
isolation ( if one crashes, how do you know which is responsible )



Yes, I'm leaning towards this as well (and the processes are already designed
with no pre-assumed trust); also, with our fast immutable send-the-reference,
performance should not be an issue.



Regards,



Ben
Chad Z. Hower
2009-09-23 12:02:32 UTC
Permalink
Post by Ben Kloosterman
However Intel cant run without at least one set of tables?
To run in 32/64 you need to set up at least one set of segments for
CS/DS/SS. After that it becomes "invisible" but you have to do it.
Post by Ben Kloosterman
This isn’t 100% correct remember the 32/64 bit real mode trick I linked.
Real mode has other issues though. Can you repost the link though? :)
Post by Ben Kloosterman
Every Windows system I use I turn off swapping and I have never had an
issue. I don't think page swapping is necessary in a modern OS, nor do
I think it's an area that's worth a lot of time. Since voice activated
I think we are mostly in agreement Ben, when I say swapping I'm using
the term very loosely.
Post by Ben Kloosterman
The only thing I will miss is memory mapped file IO, but that shouldn't
be an issue with decent shared memory.
Exactly. We can even "swap" at the object level... ie more like HSM than
swapping....



Ben Kloosterman
2009-09-23 12:56:04 UTC
Permalink
Post by Chad Z. Hower
Post by Ben Kloosterman
However Intel cant run without at least one set of tables?
To run in 32/64 you need to set up at least one set of segments for
CS/DS/SS. After that it becomes "invisible" but you have to do it.
Post by Ben Kloosterman
This isn't 100% correct remember the 32/64 bit real mode trick I
linked.
Real mode has other issues though. Can you repost the link though? :)
It was in Cosmos-Dev; lost it (but it may be in my notes somewhere). I
think you set up 32-bit (or 64) protected mode, a GDT etc., then drop back
to real mode without setting up segments, and it still has the flat 32 (or
64) bit segment in real mode.
Post by Chad Z. Hower
Post by Ben Kloosterman
Every Windows system I use I turn off swapping and I have never had an
issue. I don't think page swapping is necessary in a modern OS, nor do
I think it's an area that's worth a lot of time. Since voice activated
I think we are mostly in agreement Ben, when I say swapping I'm using
the term very loosely.
Post by Ben Kloosterman
The only thing I will miss is memory mapped file IO, but that shouldn't
be an issue with decent shared memory.
Exactly. We can even "swap" at the object level... ie more like HSM than
swapping....
Interesting idea. You could implement this with the GC as a 4th generation:
when you promote from the 3rd to the 4th, you add a write barrier that
checks frequency of use. If frequent, you remove the write barrier. The GC
can then swap out the objects in batches. Still, the random access seek
times will probably be worse than page swapping.
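A rough sketch of that promotion policy (all names hypothetical; a real
barrier would be generated code, not a method call):

class Gen4Entry
{
    public object Obj;
    public int AccessCount;
    public bool BarrierInstalled = true;  // installed on promotion to gen 4
}

static class Gen4Policy
{
    const int HotThreshold = 16;

    // Called from the write barrier while it is still installed.
    public static void OnBarrierHit(Gen4Entry e)
    {
        if (++e.AccessCount >= HotThreshold)
            e.BarrierInstalled = false;  // frequent use: stop paying the barrier
    }

    // Cold objects that never tripped the barrier get batched out to disk.
    public static bool IsSwapCandidate(Gen4Entry e)
    {
        return e.BarrierInstalled && e.AccessCount == 0;
    }
}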

Regards,

Ben
Chad Z. Hower
2009-09-23 15:33:49 UTC
Permalink
It was in the Cosmos-Dev , Lost it ( but may be in my notes somewhere) . I
If you can find it I'd love to read it.
think you set up 32 bit ( or 64) protected mode a GDT etc you then drop back
to real mode without setting up segments and it still has the Flat 32 ( or
64 ) bit segment in real mode.
Which is what we want. IIRC though, the GDT requires one descriptor for
each, no? IIRC that's the segments I was talking about. That is, you
might only have one, but you still have to use it, and it still involves
the extra hardware. Though there are no misses. :)
Interesting idea.. You could implement this with the GC as a 4th Generation
, when you promote from the 3rd to the 4th you add a write barrier that
checks frequency of use. If frequent you remove the write barrier. The GC
We can also add attributes to objects and let user code specify that the
object will be a long life object, short life, large size, etc... all
these hints can help the MM a lot.
can then swap out the objects in batches. Still the Random access seek
times will prb be worse than page swapping. .
The net effect I believe will be in our favor because of our gains
elsewhere, and stack access internally can be direct.
Ben Kloosterman
2009-09-24 00:03:02 UTC
Permalink
It was in the Cosmos-Dev , Lost it ( but may be in my notes somewhere) . I
If you can find it I'd love to read it.

I'll post it if we find it; maybe we need some sort of permanent thing for
Cosmos-Dev, a lot of the stuff posted there would be nice to have searchable.
think you set up 32 bit ( or 64) protected mode a GDT etc you then drop back
to real mode without setting up segments and it still has the Flat 32 ( or
64 ) bit segment in real mode.
Which is what we want. IIRC though the GDT requires one descriptor for
each, no? IIRC thats the segments I was talking about. That is, you
might only have one, but you still have to use it and it still involves
the extra hardware. Though there are no misses. :)



The thing is the GDT and LDT are no longer used, so you can reuse the memory;
you can do this in flat protected mode. Still, I don't think we should go
down this path, too many bugs in the CPU, but it would be an interesting
research topic down the track (and we can change VESA graphics modes).



Ben
Ben Kloosterman
2009-09-23 01:51:32 UTC
Permalink
He may just be talking about an x86 indirect memory reference. Which is a
pointer.



Regards,



Ben



From: Cosmos-Dev-***@public.gmane.org [mailto:Cosmos-Dev-***@public.gmane.org] On
Behalf Of Chad Z. Hower
Sent: Wednesday, September 23, 2009 9:23 AM
To: Cosmos-Dev-***@public.gmane.org
Subject: Re: [Cosmos-Dev] various design topics (performance, GC, security)
Post by David Jeske
You are misunderstanding me. When I'm talking about pointers, I'm
talking about Cosmos generated instructions using pointers. These
pointers very much do exist.
No, I don't misunderstand you. Cosmos doesn't need to generate pointers.
.NET now doesn't generate pointers into the code, and we don't need to
either. We can use handles, and even mix them. For example, we can use
indirect pointers and easily relocate, and we can allow pointers only
onto the stack, as the stack is not likely to relocate, and we can even
track the stack and relocate it, or index of something...

In short, the code could be completely handles or indirect pointers
completely solving the issue, but I favor a multi solution hybrid for
many reasons.

We are still a ways off from even building POCs, but the point is that
there are many many solutions, many of which are very feasible.
Chad Z. Hower
2009-09-23 02:05:58 UTC
Permalink
It's a pointer to a pointer... I expect we will use these in many places,
but not exclusively.
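As an illustration of the indirection (a hypothetical sketch, not the
planned implementation), a handle table lets the MM relocate an object by
updating one slot, with no patching of generated code:

using System;

static class HandleTable
{
    static readonly object[] slots = new object[1024];

    public static int Allocate(object obj)   // hand out an index, not an address
    {
        for (int i = 0; i < slots.Length; i++)
            if (slots[i] == null) { slots[i] = obj; return i; }
        throw new InvalidOperationException("handle table full");
    }

    public static object Deref(int handle)   // every access pays one indirection
    {
        return slots[handle];
    }

    public static void Relocate(int handle, object movedCopy)  // GC moved it
    {
        slots[handle] = movedCopy;
    }
}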
Post by Ben Kloosterman
He may just be talking about an x86 indirect memory reference. Which is
a pointer.
Regards,
Ben
*On Behalf Of *Chad Z. Hower
*Sent:* Wednesday, September 23, 2009 9:23 AM
*Subject:* Re: [Cosmos-Dev] various design topics (performance, GC,
security)
Post by David Jeske
You are misunderstanding me. When I'm talking about pointers, I'm
talking about Cosmos generated instructions using pointers. These
pointers very much do exist.
No, I don't misunderstand you. Cosmos doesn't need to generate pointers.
.NET now doesn't generate pointers into the code, and we don't need to
either. We can use handles, and even mix them. For example, we can use
indirect pointers and easily relocate, and we can allow pointers only
onto the stack, as the stack is not likely to relocate, and we can even
track the stack and relocate it, or index of something...
In short, the code could be completely handles or indirect pointers
completely solving the issue, but I favor a multi solution hybrid for
many reasons.
We are still a ways off from even building POCs, but the point is that
there are many many solutions, many of which are very feasible.
Ben Kloosterman
2009-09-23 04:31:59 UTC
Permalink
I will respond also, as my ideas are a bit different in terms of
implementation from Chad's.



Thanks for the conversation by the way; always good to have peer reviews.



Please note.



1) Better performance is NOT the main goal; the real goal is to
improve reliability and security without reducing performance. That said, I
believe we will achieve better performance (esp. through better parallelism
/ fast IPC), and it will be great for marketing; that said, it doesn't matter
if we don't, as long as we are close. Note Singularity has the same goals; I
place performance as more important than they do. For example, they showed a
10% improvement with IPC by services calling the device drivers directly
without going through IPC (e.g. a direct call); they wore this cost as
acceptable, I probably will not.



2) Remember when OSes went from real mode to protected (e.g. DOS to
Windows 9x/NT) and how much performance we lost? All the games and some
CAD systems stayed with real mode for performance. Why did we break
real mode? 1. Not enough memory (needed by GUI apps); 2. Reliability for
multiple process support. Type-safe managed applications can provide the
reliability, and the memory issue is also no big issue, as it was mainly a
DOS / 16-bit real mode issue. Note you can even go 64-bit real mode.



3) Cosmos is a kit; we will be able to chop and change major parts of
the OS easily, and so can anyone. For example, the whole MM I implemented so
far is changeable, could be restarted, and can even sit on a different
machine (though that's not a good idea in case of network problems). With
all the kernel code being OO and C#, someone can just replace my software MM
with a hardware VMM. This is only just possible with Linux and is a major
effort, as it's mature (bloated) and lacks clearly defined interfaces.



MMU/TLB vs software paging - What ideas do you have for implementing paging
in software that is better than paging in hardware? If code or data-pages
are removed, previously compiled code would fail when it tried to access
invalid pointers without MMU faults.



Memory from the software paging system is only allocated in large blocks to
trusted components; so far the only things that request memory are the type
code library loader, allocators (GC), the shared memory allocator and DMA.
It is not possible to have an invalid pointer.



For data, programs can't create or modify pointers (except for setting one
to null). Only the GC can create a pointer. The only form of corruption
possible is that a user could set the pointer to another object he manages
of the same type, but he only has access to objects created by that process.



For code, C function pointers are not supported. The IL library has the
method names it wants to call, and the OS (dynamic loader) replaces them
with the appropriate addresses, if valid, when it loads. Self-modifying code
and reflection are not supported.



The process which has pages swapped out would need to run in a special mode
which verified each code or data pointer (to see if that data was swapped
out) before it was accessed. This could be a VM interpreter, or it could be
an entirely different set of compiled code -- in either case it creates more
memory pressure because of different versions of code that need to exist. Is
this more or less expensive than an MMU/TLB? That's hard to say, I'd guess
it to be more expensive. If processes (or smaller granularity 'units') are
simply frozen and not allowed to run while swapped it would eliminate the
need for dual-codepaths, but it would also reduce the amount of reclamation
swapping could provide if units are still handling small frequent tasks.



No, for a software manager it will work this way: the scheduler or UI
service sees all threads of an application module are idle and, if the
machine is under memory pressure, swaps the process to disk. When a thread
needs to wake up it will resume the process. Please note these processes are
not the large UNIX/Windows processes, but small single-threaded object
spaces that use IPC to talk to other processes, like an OO service. Because
the IPC cost is almost equivalent to an internal call, you can modularize
apps and increase reuse, i.e. breaking an app into multiple
domains/processes is cheap and carries no penalty. I also intend to provide
these swap routines to the application developer, as he knows the inactive
windows and can swap them (browsers, anyone?). Personally I think swapping
is just used because of bad programming; not a single small device supports
it, and smart phones and PDAs run happily with 64 meg. Supporting these
devices in the short term is probably more important than PCs.



Software paging does have the nice property that it smartly "makes the
common case fast". (i.e. a system which isn't swapping has no overhead. Only
when we incur the HUGE overhead of secondary-storage swapping would we incur
the penalty of virtual memory checks.



Yes, exactly. Also note there is an additional technique here: when there is
no memory pressure you can run a garbage collector that is fast, spends
little time compacting and runs infrequently; if memory pressure changes you
can swap GCs, and the new GC collects more frequently and can use expensive
best-fit methods.



2. Application API security. My thoughts in this space have been about
eliminating most (or all) requests to the user to approve heightened
security. Users simply click "yes" on most of these dialogs, so I think a
security system will be more effective if they never or seldom happen. (the
opposite of Vista UAC) Further, I think the primary compromise to prevent is
malicious code uploading unauthorized user data to a third party server.



One idea I've had is to create a set of interlocking isolation rules. For
example: consider two assemblies: (A) can access the network, and only it's
own private local storage; (B) can access all local data but not the network
or other processes. Assemblies of these two types can be intermixed on the
system without fear that unauthorized user data can be uploaded to the
Internet. -- Adding a secure open/save panel (like silverlight or a browser)
to authorize an application to access a specific piece of data, "A" becomes
similar to website security, since different websites can't access
eachother's backend servers. A surprising number of applications fit into
these two categories.



Chad mentioned "see rings in our docs". I couldn't find this reference on
the cosmos website. What is this a refernece to?



The rings is the default Cosmos OS; I don't think this will be good enough.
Cosmos B, which I'm proposing, will be a full POLA/POLP capability object
system, i.e. there is no ACL and limited background authority. References to
objects (with no public members) will serve as keys; these keys will be
persisted and can be passed to another process the process trusts. Without
the key you can't create or forge the object; keys can be cross-machine.



3. Potential performance improvements. Ben mentioned "if a loaded webserver
spends 50% of time in kernel". This sounds to me like a static webserver.
It's no surprise that solution which removes OS abstration layer cost can
optimize a dumb "data in data out" workload like this. Many tiny systems
(including the previously mentioned MIT Exokernel) have done this.



Managed OSes have a LOT in common with exokernels, especially Cosmos B.
However, they remove the main weakness of exokernels. Most web servers spend
a lot of time in the OS cache, NIC, network stack, file system or disk
driver.




We arn't using them in the industry because this optimization isn't
important. Dynamic applications represent a much larger percentage of system
development costs, and they spend most of their time in userland. These
applications are not dominated by syscall or context-switch performance. I
argue that human resource (programmer) costs are the dominating factor in
most software applications. Reducing these costs is more important than a
fractional improvement to performance. This is why the trend is towards
programming systems which are inefficient for the machines and efficient for
the programmer. (like Ruby, Python on the extreme, Java, or CLR with garbage
collection in the less-extreme)



No argument on human resources; however Cosmos supports .NET, so the
development costs will be equivalent (removing the exokernel's weakness).
Obviously we have an issue in admin costs for a server, but the reliability
will be the major argument: no one is happy when servers have issues. Also
remember the security, military and handheld device markets.

Provided an OS does parallelism well (which Windows and Linux don't, but
Sing and Cosmos can; 32+ hardware threads, which we will soon see in dual
servers, will be a major issue) I agree with you. However, note your earlier
comments on spending time optimizing (batching) kernel calls, probably
user-level caching etc. Why did you do this? If a server like IIS is 30%
more efficient, you're probably looking at a hardware cost saving in the
billions; sure, this doesn't matter on an individual site, but it does
matter for 3rd party software such as SQL servers, web servers etc. An OS is
a one-off development investment; if a .NET cloud runs an OS that is 10%
faster and runs all .NET apps, you have a customer pretty quickly, as it's a
big hardware cost for some.



Overall, I'm skeptical of claims that this new architecture will have
similar capabilities to old architectures but with better performance. There
are tradeoffs, and while we can point at the inefficiencies of existing
systems, we can't yet point out all the tradeoffs and inefficiencies of a
full real-world system built in a Singularity-like model.



Note our major trade-off: languages must be safe (no pointers) and strongly
typed, and applications must be installed before they run. What you are
doing is pushing the issues earlier in the development cycle; we do a lot of
the heavy lifting at compile time and impose more restrictions on the
developer: no pointers, and strongly typed.

There are rumours some of MSDN runs on Singularity. And MS is building
Midori, hiring some major players like Shap. I don't buy the trade-off
argument; sure, there are trade-offs, but sometimes you pay a small price
for a big gain, e.g. caching provides a big perf gain but costs a small
amount of policy.





Pervasive garbage collection is a significant one of these costs.



The GC costs will be about the same as a .NET app running on Windows.



Similar to the quoted figures about how TLB misses are becoming more
expensive as CPU speeds increase, garbage collection is becoming more
expensive as CPU speeds and memory sizes increase. In other words, it's very
possible Singularity would have a similar overhead to produce the same
functionality, simply in different places.



I don't think this holds true. The alternative to a GC is manual memory
management with malloc and free, which gets into the human resource costs,
stray pointers etc. If the situation gets too bad we can do ref counting
GCs, but at the moment the CPU cost of doing this is not justified (except
for real-time code). Lastly, the best way to speed things up is to use less
memory and increase L2 cache use; the old 70s designs are having trouble
with parallelism and the fact that memory access on these systems is hitting
100 cycles. Cosmos/Singularity apps will be smaller and have less code
(since all devs will be forced to use the .NET libs), and we can optimize
things to suit the changes in HW architecture (e.g. async is probably better
than sync now for IPC; Minix was years ahead of its time, but the Ring 3/0
model is its limitation).



Perhaps the winning argument for an architecture like Singularity/Cosmos is
similar to the Aspect-Oriented-Programming argument. Currently we're
spending programmer time optimizing these boundaries, whereas if we make the
boundaries very cheap we don't have to. This will make our code smaller,
easier to write, easier to maintain, and as a nice side effect, faster in
the common cases. The code required to handle the uncommon cases might even
still exist, but if it can only exist in centralized cross-cutting places,
then the overall codesize can be simpler.



Yes, exactly. Please note we have just touched on performance and haven't
discussed many of the topics; you can see a modern OS based on this has
much promise, with the biggest gains coming in parallelism, reliability,
security and maintainability (of the OS). Consider a TDD kernel also.



4. Passing data between domains. From the Singularity paper, it seems they
conceived a way to make a code-constraint that enforces that a given pointer
handoff is the only way to access a data-subtree.



This is the nature of strongly typed systems (which IL is): if you don't
have a reference you can't create or forge the pointer. This is well
discussed in capability systems.



This allows them to safetly hand the pointer to another domain without
worry that it can be modified by the original process. They still have to
track lifetime for this subtree with some kind of foreign GC handle. I don't
understand how they would implement a zero-copy buffer cache using this
system, as once they handoff the pointer, they can't retain a reference to
the buffer anymore.



Once they set it to null it's gone, but the receiver can send the reference
back; hence the reference is a key. The GC is the main issue.



I like Ben's suggestion to encourage 'immutable data'. One challenge lies in
considering both heap-allocated buffers and structures which are designed to
be marked immutable (though they start out mutable). The buffer would allow
writes (or DMA) until it was marked immutable.



Yep, but you're thinking old style :) Use the OO type-safe language... The
creator can be the only one with access (and as for the HW, you can't stop
DMA, but note IO can only be instantiated via trusted code; IL has no in/out
instructions) by using internal/private methods for changes; when finished,
just tell the creator or make the Finish method public.



The structures, even while mutable, would only allow pointers to immutable
data. This constraint assures the structure can be safely made immutable at
any time and the entire subtree is immutable. This pattern is an
"incremental construction of subtrees of immutable data", which can be
handed to a cross-domain call accepting only an immutable pointer. A buffer
manager could keep around immutable blocks and hand them out to different
callers as necessary. Cross-GC references would manage lifetimes for the
data. Structures could cheaply aggregate the total memory size of the
subtree for accounting the size of a cross-heap reference.



Note .NET really helps here as all the value types + string are immutable.



It seems one challenge of this system is cheaply tracking the cross-GC
domain references. Perhaps all immutable pointers would need to be boxed
into cross-GC handles



Note however you can get the compiler to do the work for you by marking
shared memory objects with an attribute at compile time; when a local or
stack reference is assigned to a shared structure you can be informed of
this. You can just use a separate GC for the shared memory which walks the
stack for both processes.



There are many solutions here. This is what I have now:



Shared memory

There are 3 levels of shared memory.

Level 1

A process can directly access a sub-process's memory (and could pass a
pointer to itself); this is completely controlled by the process. Note it is
only possible within process trees, else it requires very careful management
of the GC.

Level 2

Shared memory managed through the IPC shared memory system. All such classes
must be marked with SharedMemory attributes. This forces the compiler to use
the shared memory allocator. To use this shared memory, a pointer is simply
passed to a process. Locks are placed, and all access should be done through
locks. If a process terminates, the state of the memory should be fine if it
did not hold a lock. This method is used by the IPC system (e.g. the
messages are placed in a level 2 shared memory space and the lock is
normally held by the sender). It relies on both parties doing the right
thing with the lock. It is fast and should only be used when sender and
receiver are controlled (e.g. with IPC, it controls the sender and
receiver). A control block exposes an EventWaitHandle.

Level 3

As level 2 but more carefully managed: the pointers are indirect and will
deny access if the user is not the current owner, i.e. there is only 1 user
who has visibility of the shared memory at one time. The pointer contains a
reference to a control block which exposes an EventWaitHandle. This is the
recommended method; it is secure and reliable.



5. Real-time garbage collection. Ben made reference to "real time or
counting collectors for device drivers". Counting collectors introduce the
possibility of memory leaks through cycles, unless they have a sweeping
check which eliminates their real-time nature. Real-time sweeping/copying
collectors are a highly unsolved problem. Typical "success stories" I've
seen involve hard bounds on heap-size to limit pause times. When heap sizes
are not bounded they are not real-time or hardware support is required. Do
you have a reference to something you feel is promising?



No issue with bounding heap size. In fact all apps will have a memory and
CPU time limit in their metadata, which may or may not be used. Note each
driver has its own heap.


IMO, the most promising work in real-time collectors is the triad of MS
Research I mentioned in my earlier post (chicken, stopless, and clover).
These are soft real-time collectors, that produce acceptable delays in most
real-world situations. However, they either require the overhead of
introducing a write-barrier in all code, or having two versions of code, one
with a write barrier.



Again, this speaks to the performance overhead argument I made above. It may
be that we're happier spending ~35% of our CPU on write barriers and garbage
collection scans, rather than on MMU/TLB costs, because our code can be
simpler. However, it's hard to argue that a radical redesign like this will
have superior performance before the issues are worked out.



I will avoid write barriers like the plague, or anything else interfering
with the execution stream. Device drivers will use little memory (most will
even be shared memory); if in the worst case they stop for 1ms, it just
means it's not a good real-time driver (as long as the real-time scheduler
is aware of it, it can still meet its commitments). Most real-time systems
are dog slow; they just work within the time frame advertised (hence, if you
limit the heap size, you know the time). I probably misused the real-time
term as well; I'm more interested in short interruptions for device drivers
rather than doing things in deterministic times, though a real-time Cosmos
is intriguing. Also note device drivers do no IO; they communicate with the
system-specific HAL, which does it all.



As the GC marks can be done on idle threads (which is the most
time-consuming part) and don't need to stop the world, I don't think it will
be a big deal; the actual sweep does need a stop-the-world, and I'm not
smart enough to solve that problem. I suspect the actual cost of GC to the
running process will be about 1% (though it will probably use about 4% of
resources on idle threads); however, I think the nursery heap allocation,
being 3 x86 instructions, will be much faster than traditional C heap or
.NET nursery allocations.
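For reference, a sketch of why a nursery allocation can be that cheap: a
bump pointer plus a limit check (unsafe C# used to mirror the machine-level
idea; CollectAndRetry is a hypothetical slow path, not modelled here):

using System;

static unsafe class Nursery
{
    static byte* next;   // current bump pointer
    static byte* limit;  // end of the nursery

    public static byte* Allocate(int size)
    {
        byte* result = next;          // 1. load the bump pointer
        byte* after = result + size;  // 2. advance it
        if (after > limit)            // 3. compare against the limit
            return CollectAndRetry(size);  // rare slow path: collect, retry
        next = after;
        return result;
    }

    static byte* CollectAndRetry(int size)
    {
        throw new InvalidOperationException("collection not modelled in this sketch");
    }
}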



I'm sort of toying with the idea of having a system-wide shared nursery for
allocations, as a large nursery size can give you a big benefit. When a
collection is needed you go down the most-active-process list calling their
collectors to update their heaps; if not enough is cleaned you go to the
next one (a collector knows a reference belongs to it if it's in its
allocated memory ranges), etc. Needs a lot more thought before being
practical.



Regards,



Ben
David Jeske
2009-09-23 06:59:21 UTC
Permalink
Post by Ben Kloosterman
No for a software manager it will work this way: the
scheduler or UI service sees all threads of an application module are idle
and, if the machine is under memory pressure, swaps the process to disk.
This is very different than what a VM MMU/TLB system does. Swapping an
entire process has some pretty limiting degenerate cases. For example,
applications with very large datasets are likely to have them in a single
large heap, and require paging subsections if another application needs
memory.

Further, one of the benefits of paging is that currently unused portions of
code or data can be removed from memory without actually stopping the
process. This isn't provided by the idea above. While it might be sufficient
for some situations, it seems unfair to consider this similar functionality
to a traditional MMU/TLB based VM setup.

However, this explanation of yours has helped me understand where the
miscommunication was. I was probing at how a software-based
solution might produce VM-paging functionality, but you're discussing a
system with more limited capabilities than typical VM-paging. I understand
now.
Post by Ben Kloosterman
When a thread needs to wake up it will resume the process. Please note
these process are not the large UNIX /Windows processes but small single
threaded object spaces that use IPC to talk to other processes.
While I can see there will be many small components, the larger heaps in a
typical system (in my experience) tend to be dominated by a smaller number
of codeblocks managing large amounts of data.
Post by Ben Kloosterman
Lastly the best way to speed things up is use less memory and increase
L2 cache use
This point I fully agree with. Memory speed has been dramatically lagging
CPU speed, so anything we can do to better utilize cache and reduce real
hits to memory will help.
Post by Ben Kloosterman
4. *Passing data between domains.* From the Singularity paper, it seems
they conceived a way to make a code-constraint that enforces a given pointer
handoff is the only way to access a data-subtree.
This is the nature of strongly typed systems (which IL is): if you don't
have a reference you can't create or forge the pointer. This is well
discussed in capability systems.
I was making a stronger point. If I understand what I read about Singularity
correctly, they created a code-constraint such that when you are handed a
particular type of pointer for IPC, you are guaranteed that no other
reference to this pointer exists within the SIP. This seems to be the basis
of their "proof" that the datastructures behind the pointer can't be changed
by the SIP that provided it.
Post by Ben Kloosterman
I like Ben's suggestion to encourage 'immutable data'. One challenge lies in
considering both heap-allocated buffers and structures which are
designed to be marked immutable (though they start out mutable). The buffer
would allow writes (or DMA) until it was marked immutable.
Yep, but you're thinking old style :) . Use the OO type-safe language. The
creator can be the only one with access (and as for the HW, you can't stop
DMA; note IO can only be instantiated via trusted code, IL has no in/out
instructions); by using internal/private methods for changes, when finished
just tell the creator or make the Finish method public.
I'm not thinking old style. I'm considering how one would verify that a
potentially large subtree of data is *provably* immutable. Asserting that
only methods in a class can touch the memory is not a proof unless it is
proved that nobody else has the ability to call the methods of that class
which can change the data.
Post by Ben Kloosterman
Note .NET really helps here as all the value types + string are immutable.
This is helpful, but most interesting IPC will include many aggregate
datatypes; returning a directory listing is a simple example which has come
up recently. Using a proof-by-induction system to prove that a subtree of
data is immutable would allow it to be safely handed over the IPC channel
without fear of thread synchronization issues.
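As a sketch of what such a proof-by-induction might check at the type level
(hypothetical C# types, not from Singularity or Cosmos): strings and value
types are the immutable base case, and a class whose fields are all readonly
and of immutable types is immutable by the inductive step, so the whole
subtree is safe to hand across the channel.

    using System;
    using System.Collections.ObjectModel;

    // Base case: string and long are immutable. Inductive step: every field
    // is readonly and of an already-proven-immutable type.
    public sealed class DirectoryEntry
    {
        public readonly string Name;
        public readonly long SizeBytes;
        public DirectoryEntry(string name, long sizeBytes)
        { Name = name; SizeBytes = sizeBytes; }
    }

    public sealed class DirectoryListing
    {
        // The collection rejects adds/removes, its elements are immutable by
        // the argument above, and the constructor copies the array so the
        // caller cannot keep a mutable alias to it.
        public readonly ReadOnlyCollection<DirectoryEntry> Entries;
        public DirectoryListing(DirectoryEntry[] entries)
        { Entries = Array.AsReadOnly((DirectoryEntry[])entries.Clone()); }
    }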
Post by Ben Kloosterman
No issue with bounding heap size. In fact all apps will have a MetaData
memory and cpu time limit which may or may not be used. Note each driver has
its own heap.
If the driver is interacting with shared heaps (which I think is what you
are implying), then we need to worry about GC pauses for all these heaps, as
any code which can touch a shared heap would need to be stopped during its
sweep/compact phase in a stop-the-world collector.

Regardless of exactly which layer of abstraction is impacted, presumably
there will be desire for a relatively large buffer cache with some type of
LRU replacement scheme. The path to retrieve data from disk will interact
with this buffer cache, so doing a full-GC on the buffer cache heap will
involve stopping any processes with zero-copy access to the buffer cache --
which could be quite a lot of processes.

We're probably getting a little too deep in the details for this discussion.
My major point is that GC is not a panacea for all problems; it also has
many challenges and problems of its own.

Post by Ben Kloosterman
I will avoid write barriers like the plague, or anything else interfering
with the execution stream.
You may reconsider when stop-the-world collections on buffer caches are
stopping too many processes. :) Letting those processes continue at slower
speeds with write barriers may be preferable to stopping them.
Post by Ben Kloosterman
Device drivers will use little memory ( most will be shared memory even)
The argument that the driver itself has little memory but accesses a shared
heap feels like it's just shifting the problem. The driver would need to be
stopped when the shared-heap was GCed. If there is truly zero-copy from the
hardware HAL to user-land, then all these folks are going to be touching
some shared-buffer-cache-heap, and all of them would need to be stopped for
a stop-the-world collection of the shared-buffer-cache-heap. That heap is
likely to be quite large.
Post by Ben Kloosterman
the actual sweep does need a stop-the-world; I'm not smart enough to solve
that problem.
The stop-the-world sweep/compact pause time is a huge problem in real
application level GCed systems today. It's one of the reasons games and
other soft-real-time applications have trouble with GC. Even on
high-performance webserving pause times become unacceptable (for me) on
heaps over 256MB.

If you haven't already, I recommend you read the MS Research paper I
referenced. Their collectors are really interesting and they allow code to
run during compaction (though they need a write barrier). You can find the
paper here:

http://portal.acm.org/citation.cfm?id=1375587

Post by Ben Kloosterman
I'm sort of toying with the idea of having a system-wide shared nursery for
allocations, as a large nursery size can give you a big benefit. When a
collection is needed, you go down the most-active-process list calling their
collectors to update their heaps; if not enough is cleaned, you go to the
next one (a collector knows a reference belongs to it if it's in its
allocated memory ranges), etc. Needs a lot more thought before being
practical.
I can see the draw of a system wide Nursery. This would reduce memory lost
to allocated but unused Nursery space across many processes/units. However,
the Nursery would have to have a very aggressive non-blocking parallel
allocator, otherwise it would become a serious performance bottleneck for
threaded systems.

Would those objects graduate from the nursery back to an assembly/process
specific heap?
Ben Kloosterman
2009-09-23 11:37:28 UTC
Permalink
On Tue, Sep 22, 2009 at 9:31 PM, Ben Kloosterman <bklooste-***@public.gmane.org> wrote:

No for a software manager it will work this way: the
scheduler or UI service sees all threads of an application module are idle
and, if the machine is under memory pressure, swaps the process to disk.

This is very different than what a VM MMU/TLB system does. Swapping an
entire process has some pretty limiting degenerate cases. For example,
applications with very large datasets are likely to have them in a single
large heap, and require paging subsections if another application needs
memory.



Further, one of the benefits of paging is that currently unused portions of
code or data can be removed from memory without actually stopping the
process. This isn't provided by the idea above. While it might be sufficient
for some situations, it seems unfair to consider this similar functionality
to a traditional MMU/TLB based VM setup.



I agree swapping whole processes is not as ideal as swapping pages, but it
only applies to the problem of running more programs than you need; if
physical memory is greater than the memory required, there is no issue. For
the problem of memory required exceeding physical memory, there are many
solutions.



15-20 years ago I was a paging advocate, but I don't think it's relevant
anymore. Swapping whole processes has some efficiencies for smaller
processes, with sequential reads/writes.

Consider this: why is it quicker in many cases to reload an OS from
hibernate than it is to activate an application that has been paged? Why,
when I swap to an application I haven't used for 5 minutes, does it take
longer to activate than to start a new copy? To me this defeats the whole
purpose, and it even happens on systems that are not under memory pressure,
with disk cache pages and app pages having the same priority.



However, this explanation of yours has helped me understand where the
miscommunication was. I was probing at how a software-based
solution might produce VM-paging functionality, but you're discussing a
system with more limited capabilities than typical VM-paging. I understand
now.



I think Chad has some plans to support paging with a reference pointer
indirection, but that's not where I'm coming from.



When a thread needs to wake up it will resume the process. Please note
these process are not the large UNIX /Windows processes but small single
threaded object spaces that use IPC to talk to other processes.

While I can see there will be many small components, the larger heaps in a
typical system (in my experience) tend to be dominated by a smaller number
of codeblocks managing large amounts of data.



It doesn't matter if it has a lot of data, as long as the code hasn't run
for a long time. The largest heaps, such as web and database servers, can
and do have limits on their heap; this memory is VERY volatile and gets many
hits. (With regard to page swapping, it doesn't really make much sense to
swap a DB or web data page that is used as a cache, as the least-used entry
will be thrown out by the DB/web server and replaced; now if the OS has
swapped it out, you have a major performance issue.) The cache code is often
small code using large data. As can be seen, for caches and buffers page
swapping will reduce performance.



On most desktop systems the browsers are now the biggest hogs, but these
can be swapped on a per-tab basis. I think most traditional apps can be
broken down into 10 or more single-threaded processes and hence swap more
efficiently when these components are waiting for work.




Lastly the best way to speed things up is use less memory and increase L2
cache use

This point I fully agree with. Memory speed has been dramatically lagging
CPU speed, so anything we can do to better utilize cache and reduce real
hits to memory will help.

4. Passing data between domains. From the Singularity paper, it seems they
conceived a way to make a code-constraint that enforces a given pointer
handoff is the only way to access a data-subtree.

This is the nature of strongly typed systems (which IL is): if you don't
have a reference you can't create or forge the pointer. This is well
discussed in capability systems.



I was making a stronger point. If I understand what I read about Singularity
correctly, they created a code-constraint such that when you are handed a
particular type of pointer for IPC, you are guaranteed that no other
reference to this pointer exists within the SIP. This seems to be the basis
of their "proof" that the datastructures behind the pointer can't be changed
by the SIP that provided it.



If there is no reference, that's 100% correct. Note you can use a reference
to a control block which then references the IPC; that way you can have many
references in your code accessing the shared memory, but if you don't have
the "lock" the references are null (or your threads block).

I like Ben's suggestion to encourage 'immutable data'. One challenge lies in
considering both heap-allocated buffers and structures which are
designed to be marked immutable (though they start out mutable). The buffer
would allow writes (or DMA) until it was marked immutable.

Yep, but you're thinking old style :) . Use the OO type-safe language. The
creator can be the only one with access (and as for the HW, you can't stop
DMA; note IO can only be instantiated via trusted code, IL has no in/out
instructions); by using internal/private methods for changes, when finished
just tell the creator or make the Finish method public.

I'm not thinking old style. I'm considering how one would verify that a
potentially large subtree of data is *provably* immutable. Asserting that
only methods in a class can touch the memory is not a proof unless it is
proved that nobody else has the ability to call the methods of that class
which can change the data.



The compiler can do things like that: put an attribute on it and check the
code at compile time. There are a few other options. The control block just
mentioned is proof. I don't think we need to get into full kernel proofs;
something simple will be enough. One way I do this is by having an
application divided into many small processes, each with 1 thread; with less
code it's easier for the developer to check.



Why do we need to prove this? If the coder writes some code badly, it will
appear just like a threading/synchronization bug. If the shared memory has
an interface and the calling process creates a piece of code with all
properties, all we have to do is make the constructor and property setters
internal, and no one else can change it. This is not immutable, as the
caller can change it when he wants, but no one else can; it is enforced
through the language. The only thing you have to do is manage destruction of
the object, for which there are a number of ways.
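
A minimal C# illustration of the pattern Ben describes (hypothetical names,
not Cosmos types): public getters with internal setters and an internal
constructor mean only code compiled into the creator's assembly can mutate
the object, while everyone else sharing it can only read.

    // Only the creator's assembly can construct or mutate this message;
    // other assemblies sharing the region get read-only access.
    public sealed class IpcMessage
    {
        public int Length { get; internal set; }
        public string Payload { get; internal set; }

        internal IpcMessage() { }                    // creation is restricted

        public bool IsFinished { get; private set; }

        internal void Finish() => IsFinished = true; // creator signals "done"
    }

As David notes, this is language-level enforcement rather than a proof: it
holds only so long as no untrusted code ends up inside the creator's
assembly.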

Note .NET really helps here as all the value types + string are immutable.

This is helpful, but most interesting IPC will include many aggregate
datatypes; returning a directory listing is a simple example which has come
up recently. Using a proof-by-induction system to prove that a subtree of
data is immutable would allow it to be safely handed over the IPC channel
without fear of thread synchronization issues.



Agree 100%, but even a fast byte[] or string is very useful, as it can pass
from device driver to service to user app. This is why I think we will have
2 types of shared memory: a fast/immutable/simple one (e.g. could be used
for a Mutex) and a more carefully managed one with defined objects.



I don't think shared memory should ever contain functioning objects; it
should be data. It may be wise to enforce no methods on classes with a
Shared Memory attribute.



Extending the immutable idea has some merit. I was going to just do it via a
lightweight lock (which can be on the root memory object of the tree);
avoiding locking would be nice.



No issue with bounding heap size. In fact all apps will have a MetaData
memory and cpu time limit which may or may not be used. Note each driver has
its own heap.

If the driver is interacting with shared heaps (which I think is what you
are implying), then we need to worry about GC pauses for all these heaps, as
any code which can touch a shared heap would need to be stopped during its
sweep/compact phase in a stop-the-world collector.



I was viewing shared memory spaces as small (each with its
own GC), between a few processes. I don't like the idea of a single huge
shared heap, which Singularity uses, exactly for the reasons you mentioned.
Though it may be OK to have a single large one and an option for a shared
memory region to have its own, you are still open to DOS-type attacks
through a bug.





Regardless of exactly which layer of abstraction is impacted, presumably
there will be desire for a relatively large buffer cache with some type of
LRU replacement scheme. The path to retrieve data from disk will interact
with this buffer cache, so doing a full-GC on the buffer cache heap will
involve stopping any processes with zero-copy access to the buffer cache --
which could be quite a lot of processes.



You could have a large buffer cache in the driver or service and just expose
one or a few buffers to other processes via shared memory, but not the
pool itself, e.g. pass the buffer reference into the shared memory. Or you
can put the whole buffer in shared memory; note such buffers may simply be
byte[BLOCK_SIZE][NumBuffers]. This is static memory and will never be
collected; rather than replacing a block you can just overwrite the existing
blocks. This is typically how such things work; they don't require a lot of
GC interaction. There is, though, the practical consideration of how to
expose some buffers and not all.
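
A sketch of the kind of pool described here (hypothetical names and sizes):
the blocks are allocated once at startup, never collected, and recycled by
overwriting in place, so after startup the GC never has to touch them.

    using System;

    // Fixed pool of reusable blocks, e.g. byte[BLOCK_SIZE] x NumBuffers.
    // Individual blocks can be handed to other processes; the pool is not.
    public sealed class BufferPool
    {
        public const int BlockSize = 4096;
        private readonly byte[][] _blocks;
        private int _next;                   // naive round-robin replacement

        public BufferPool(int numBuffers)
        {
            _blocks = new byte[numBuffers][];
            for (int i = 0; i < numBuffers; i++)
                _blocks[i] = new byte[BlockSize];
        }

        // Hand out the next block for reuse; the caller overwrites it in place.
        public byte[] NextBlock()
        {
            int i = _next;
            _next = (i + 1) % _blocks.Length;
            return _blocks[i];
        }
    }

A real cache would replace the round-robin counter with an LRU policy and
track which blocks are currently exposed, but the memory behaviour is the
point: static arrays, no per-request allocation.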



We're probably getting a little too deep in the details for this discussion.
My major point is that GC is not a panacea for all problems; it also has
many challenges and problems of its own.



GCs are well understood, and we don't see them as a solution to any of these
problems. Singularity proved managed device drivers do work under sustained
loads (though I don't like their single kernel GC and large shared heap);
you can always go back to a self-managed heap (like the Sing shared heap),
which has its own issues (e.g. slow allocation times). Not sure whether
you're saying we think GCs allow a managed OS to work; this is far from the
case. What allows them to work is a type-safe, pointer-free language.

I will avoid write barriers like the plague, or anything else interfering
with the execution stream.

You may reconsider when stop-the-world collections on buffer caches are
stopping too many processes. :) Letting those processes continue at slower
speeds with write barriers may be preferable to stopping them.

Device drivers will use little memory ( most will be shared memory even)

The argument that the driver itself has little memory but accesses a shared
heap feels like it's just shifting the problem.



Partially true, but shared memory can have a memory solution that is more
appropriate to the need. Device drivers probably should be just code.



The driver would need to be stopped when the shared-heap was GCed.



Only the thread that talked to the shared memory, but yes. Note 1 driver may
have 50 shared memory regions talking to different services, each with its
own thread.



If there is truly zero-copy from the hardware HAL to user-land, then all
these folks are going to be touching some shared-buffer-cache-heap, and all
of them would need to be stopped for a stop-the-world collection of the
shared-buffer-cache-heap. That heap is likely to be quite large.



Why? If we are talking buffers, we are talking immutable byte[]; I don't see
the GC being involved at all, so why involve it? When you are finished with
a buffer, tell the device driver.



the actual sweep does need a stop-the-world; I'm not smart enough to solve
that problem.

The stop-the-world sweep/compact pause time is a huge problem in real
application level GCed systems today. It's one of the reasons games and
other soft-real-time applications have trouble with GC. Even on
high-performance webserving pause times become unacceptable (for me) on
heaps over 256MB.



You sure this isn't the mark time? The majority of the time is spent in
the mark, and in most GCs (.NET and Java) this is single-threaded
stop-the-world. The problem with games and such is that Windows processes
are large and IPC is expensive; for example, if you broke the code up into
smaller services (remember the IPC time) you could use a different GC for
each part with far less work. It is also good from a coding point of view.
For critical processes such as the GUI you can use a ref-counting GC or even
manually call free. Oh, I forgot to mention that you can manually free
memory by calling GC.Free(ref), so in theory you could turn off collections
and manage it yourself just like a traditional system, but I haven't given
this much thought. The free is basically there since we have an Allocator
but no Collector.



You using the MS CLR? I have had high-performance systems with up to 1 GB
heaps, and the worst case I have seen is maybe 2ms tops. For nursery to
generation 1 there is no need to stop the world (but we do, because of the
mark and the single nursery); this covers most collection work. From gen 1
to gen 2 there is normally a mark and sweep. Only from gen 2 to 3 do you
normally see a compaction.





One major issue with a ref-counting GC is we need to recompile the app to
change from a mark-and-sweep to ref counting and vice versa, as we don't
want expensive ref-counting checks when not needed.



If you haven't already, I recommend you read the MS Research paper I
referenced. Their collectors are really interesting and they allow code to
run during compaction (though they need a write barrier). You can find the
paper here:



http://portal.acm.org/citation.cfm?id=1375587



For shared memory, I suppose a write barrier is not such a big deal, as the
data is created then read; I don't see many writes going on for receivers.
The shared memory control block mentioned above can maybe serve as the write
barrier.

I'm sort of toying with the idea of having a system-wide shared nursery for
allocations, as a large nursery size can give you a big benefit. When a
collection is needed, you go down the most-active-process list calling their
collectors to update their heaps; if not enough is cleaned, you go to the
next one (a collector knows a reference belongs to it if it's in its
allocated memory ranges), etc. Needs a lot more thought before being
practical.

I can see the draw of a system wide Nursery. This would reduce memory lost
to allocated but unused Nursery space across many processes/units. However,
the Nursery would have to have a very aggressive non-blocking parallel
allocator, otherwise it would become a serious performance bottleneck for
threaded systems.



Would those objects graduate from the nursery back to an assembly/process
specific heap?





I'm thinking something like this: put 10% of system memory into 2 nurseries,
so on a 4 GB system that's 2 x 200 MB.




Allocator (note this is all it does, and is the same as the currently
planned alloc; it will probably be 3 hand-coded x86 instructions):

    Interlocked-compare the alloc request size to the memory left;
        if empty, call NurseryEmpty()

    Interlocked-add the size of the object to the nursery pointer

    return the nursery pointer


NurseryEmpty()

    Critical Section
    {
        If (all processes have not finished with the backup nursery)
            Force remaining processes to update, with a stop-the-world.

        Set the backup nursery pointer to the start. (this is the sweep)

        Swap Nursery 1 and 2

        Call BackgroundUpdateNursery()

        Return
    }


BackgroundUpdateNursery()

    Call the collector for each process active since the last
    collection, preferably when they are idle.

    Reset the pointer on the nursery when all collects are done.


ProcessCollectGeneration1

    // Normal tri-colour mark and sweep

    Find all references in the stack and heap that point to the
    nursery. Note the size is at the reference location.

    Move these items to the heap.

    Signal complete to BackgroundUpdateNursery.



The whole idea is to exploit parallelism more and reduce collects, and give
objects more time to expire and hence not incur a collect cost (e.g. avoid
the situation where under memory stress you use more collects and hence make
the problem worse).

Should make new about the same speed as the stack; this will be very nice
for strings :) You can allocate and use 400 MB of strings fast!
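
Translating the allocator sketch above into C# (my rendering, not Cosmos
code): the fast path is an interlocked add plus a bounds check, which is why
it can compete with stack allocation.

    using System;
    using System.Threading;

    // Bump-pointer nursery: allocation is one Interlocked.Add on a shared
    // offset, safe across threads without taking a lock.
    public sealed class Nursery
    {
        private readonly byte[] _space;
        private int _offset;                 // next free byte

        public Nursery(int sizeBytes) { _space = new byte[sizeBytes]; }

        // Returns the start offset of the reserved region, or -1 when the
        // nursery is full and NurseryEmpty() (the swap/sweep above) must run.
        public int Allocate(int sizeBytes)
        {
            int end = Interlocked.Add(ref _offset, sizeBytes);
            if (end > _space.Length)
                return -1;                   // caller triggers NurseryEmpty()
            return end - sizeBytes;          // start of the new object
        }

        // The "sweep" really is just resetting the pointer.
        public void Reset() => _offset = 0;
    }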



Anyway this optimization is a LONG way off and may be counterproductive.



On the other hand, I will post something more on shared memory options
so we can discuss the pros and cons.
David Jeske
2009-09-23 18:51:01 UTC
Permalink
On Wed, Sep 23, 2009 at 4:37 AM, Ben Kloosterman <bklooste-***@public.gmane.org> wrote:


Why do we need to prove this [that data passed through IPC won't change
later] ? If the coder writes some code badly it will appear just like a
Threading/Synch bug.


First off, Singularity seems to be using this proof as part of how they
assure SIPs can be independently GCed
[PDF<http://research.microsoft.com/pubs/69431/osr2007_rethinkingsoftwarestack.pdf>
].

Second, the bugs that would occur would be extremely hard to track down. I
might be handed a structure from IPC. I might store that structure. If
someone else can change the structure later, then my application would
strangely break at some weird time in the future because of the structure
changing. This seems really bad. I think it will be very important to assure
that data passing across IPC boundaries does not change on the other side
(unless that is specifically desired, as in shared-memory regions).
Post by Ben Kloosterman
Extending the immutable idea has some merit. I was going to just do it via a
lightweight lock (which can be on the root memory object of the tree);
avoiding locking would be nice.
How would a lock on the root of the object tree work? If I had a handle to a
random place in the subtree, how would it inexpensively enforce the lock?
Post by Ben Kloosterman
I was viewing shared memory spaces as small (each with its
own GC), between a few processes. I don't like the idea of a single huge
shared heap, which Singularity uses, exactly for the reasons you mentioned.
Though it may be OK to have a single large one and an option for a shared
memory region to have its own, you are still open to DOS-type attacks
through a bug.
If you don't like the idea of a large shared heap, then how would you
implement a large buffer cache?
Post by Ben Kloosterman
You could have a large buffer cache in the driver or service and just
expose one or a few buffers to other processes via shared memory, but not
the pool itself.
... of course clients only have references to the few objects (blocks)
they've been given as return result. Nobody can touch an object in the
buffer unless they've been given a handle. However, the pool is still
large.

I suspect a scheme exploiting the fact that data is immutable and
introducing a temporary write barrier of some kind on handles into the
shared memory would be able to avoid the need to stop-the-world.

It may also be practical to subdivide the buffer cache into several shared
pools based on a hash of a cache key. By using a dynamic hash, the
individual pools could be repartitioned when they reached a certain maximum
size. This somehow feels kludgy, but perhaps it makes sense. This would
allow the buffer-cache to be collected in smaller pieces.
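
A sketch of that partitioning idea (hypothetical code, with fixed partitions
rather than the dynamic hash mentioned): the key's hash picks one of several
small independent pools, so each pool can be locked, and in principle
collected, on its own.

    using System;
    using System.Collections.Generic;

    // The cache key's hash selects one of several small pools, so no single
    // collection or lock has to cover the whole buffer cache at once.
    public sealed class PartitionedBufferCache
    {
        private readonly Dictionary<string, byte[]>[] _pools;
        private readonly object[] _locks;

        public PartitionedBufferCache(int partitions)
        {
            _pools = new Dictionary<string, byte[]>[partitions];
            _locks = new object[partitions];
            for (int i = 0; i < partitions; i++)
            {
                _pools[i] = new Dictionary<string, byte[]>();
                _locks[i] = new object();
            }
        }

        private int PartitionOf(string key) =>
            (key.GetHashCode() & int.MaxValue) % _pools.Length;

        public void Put(string key, byte[] block)
        {
            int p = PartitionOf(key);
            lock (_locks[p]) _pools[p][key] = block;
        }

        public byte[] Get(string key)
        {
            int p = PartitionOf(key);
            lock (_locks[p])
                return _pools[p].TryGetValue(key, out var block) ? block : null;
        }
    }

Repartitioning on growth, as suggested, would add a level of indirection so
a partition could be split without rehashing every key at once.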
Post by Ben Kloosterman
Only the thread that talked to the Shared memory but yes. Note 1 driver may
have 50 Shared Memory regions talking to different services each with its
own thread.
I don't see how this fixes the problem. How are the shared memory regions
GCed? If the driver needs to be stopped when any of the 50 are stopped, then
we have a pause problem. I'm beginning to respect the Singularity design
more and more.

Post by Ben Kloosterman
If there is truly zero-copy from the hardware HAL to user-land, then all
these folks are going to be touching some shared-buffer-cache-heap, and all
of them would need to be stopped for a stop-the-world collection of the
shared-buffer-cache-heap. That heap is likely to be quite large.
Why? If we are talking buffers, we are talking immutable byte[]; I don't
see the GC being involved at all, so why involve it? When you are finished
with a buffer, tell the device driver.
In a typesafe system like this, memory can only be freed when the GC can
prove it's not referenced. That can only happen as part of a trusted GC
process. People "saying they are finished" is not enough. The GC needs to
prove they have no more handles.

Perhaps one solution is to support reference-counting on immutable
leaf-objects in a shared heap. (leaf-object ==> object with no handles to
other objects) Since they are leaf-objects there can be no cycles and
freeing the object should be constant time. Since they are immutable
everyone can share them with no danger.
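
One way such a refcounted immutable leaf object might look (a hypothetical
sketch, not a proposed API): the payload never changes after construction,
the count is maintained with Interlocked, and release is constant time, with
no sweep.

    using System;
    using System.Threading;

    // Immutable leaf object: no outgoing references, so no cycles are
    // possible, and refcounting alone can reclaim it in constant time.
    public sealed class RefCountedBuffer
    {
        private readonly byte[] _data;            // immutable after construction
        private readonly Action<byte[]> _returnToPool;
        private int _refCount = 1;

        public RefCountedBuffer(byte[] data, Action<byte[]> returnToPool)
        { _data = data; _returnToPool = returnToPool; }

        public void AddRef() => Interlocked.Increment(ref _refCount);

        public void Release()
        {
            if (Interlocked.Decrement(ref _refCount) == 0)
                _returnToPool(_data);             // constant-time free, no sweep
        }

        public byte this[int i] => _data[i];      // read-only access
    }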
Post by Ben Kloosterman
the actual sweep does need a stop-the-world; I'm not smart enough to
solve that problem.
The stop-the-world sweep/compact pause time is a huge problem in real
application level GCed systems today. It's one of the reasons games and
other soft-real-time applications have trouble with GC. Even on
high-performance webserving pause times become unacceptable (for me) on
heaps over 256MB.
You sure this isn't the mark time? The majority of the time is spent in
the mark, and in most GCs (.NET and Java) this is single-threaded
stop-the-world.
You are right, mark (and remark) time. I also didn't realize Java's
concurrent collector
<http://chaoticjava.com/posts/parallel-and-concurrent-garbage-collectors/>
is non-compacting. It seems to achieve concurrency also only with a
write-barrier. [PDF <http://research.sun.com/techrep/2000/smli_tr-2000-88.pdf>]
My point still stands: large heaps create large GC pause times.
Post by Ben Kloosterman
The problem with games and such is that Windows processes are large and IPC
is expensive; for example, if you broke the code up into smaller services
(remember the IPC time) you could use a different GC for each part with far
less work. It is also good from a coding point of view.
Smaller services do not solve this problem for many applications. Consider a
3d modeling package or computer game with a very large scene. The data is a
very large number of vertices, and the graph that connects them into edges
and faces. There is no "module" boundary to create. This is a graph
partitioning problem and it's NP-complete.
Post by Ben Kloosterman
For critical processes such as the GUI you can use a ref-counting GC or
even manually call free. Oh, I forgot to mention that you can manually free
memory by calling GC.Free(ref), so in theory you could turn off collections
and manage it yourself just like a traditional system, but I haven't given
this much thought. The free is basically there since we have an Allocator
but no Collector.
I don't understand this suggestion. It is not safe to release memory until
the GC has proven no other references exist. GC.Free(ref) in a safe GC
system can only null out the handle. It can't actually release memory. If it
did, some other handle to that same memory would become invalid.
Post by Ben Kloosterman
You using the MS CLR? I have had high-performance systems with up to 1 GB
heaps, and the worst case I have seen is maybe 2ms tops.
I have not tested the MS CLR, but I don't believe 2ms is the worst case
pause time for a typical 1GB heap. I could imagine a 1GB heap that collects
quickly because it has very large data-buffers. However, if there is even
512MB of pointers, I expect the worst case pause would be much longer.

Here is a quote from the person writing the new CLR 4.0 background
GC<http://blogs.msdn.com/maoni/archive/2008/11/19/so-what-s-new-in-the-clr-4-0-gc.aspx>,
explaining why they are still working on this problem: "*as people start
writing larger applications with larger heaps that handle more stressful
situations, the latency can be unacceptable.*" Also, on the servers they
have a notification API because they recognize this problem. From the same
article above... "*Basically you register to get notified when a full GC is
approaching and when it’s finished. This allows you to do software load
balancing between different server instances – when a full GC is about to
happen in one of the server instances, you can redirect new requests to
other instances.*"

GC pausing is a real problem. It's not solved.

Here are some other interesting references:
PPT-gc-challenges-overview
<http://www.symbolic-computation.org/science/images/7/73/Jones.ppt> ;
CLR garbage collection
<http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry> ;
PDF-gc-overhead <http://www.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf> ;
PDF-coordinating-gc-and-paging
<http://www.cs.umass.edu/~emery/pubs/f034-hertz.pdf>
Post by Ben Kloosterman
One major issue with a ref-counting GC is we need to recompile the app to
change from a mark-and-sweep to ref counting and vice versa, as we don't
want expensive ref-counting checks when not needed.
Ref-counting will leak memory when cycles occur, so in the end there would
need to be some kind of sweeping GC even in a refcounted pool (like Python,
refcounting + GC to find cycles). As I mentioned above, refcounting might be
a good solution for shared immutable leaf-objects, as they can't create
cycles.

Post by Ben Kloosterman
For shared memory, I suppose a write barrier is not such a big deal, as the
data is created then read; I don't see many writes going on for receivers.
The shared memory control block mentioned above can maybe serve as the write
barrier.
Agreed. AFAIK, the CLR concurrent collector already has a similar write
barrier anyhow.
Post by Ben Kloosterman
BackgroundUpdateNursery()
Call the collector for each process active since the last
collection, preferably when they are idle.
Reset the pointer on the nursery when all collects are done.
If I understand correctly, you are expecting this "collection of all
processes with active data in the nursery" to require them to either free
the memory or copy it out of the nursery into their own heap. It seems this
would require a write-barrier also, as you would want to cheaply keep track
of all references into nursery objects from other generations. (Otherwise
how would you cheaply scavenge the nursery without also sweeping those
generations?)

I expect this would also have subdivision of the Nursery for each
hardware-thread, to avoid locking and concurrency problems fighting over a
single top-of-heap pointer. Overall the single nursery looks to have some
interesting advantages. It has the same disadvantages of
generational/concurrent schemes, namely the write barrier, and of compacting
schemes, namely copy overhead. However, these probably already exist to
support those other features, so they might be available 'for free'.

Regardless of the details, there are many interesting and unsolved
challenges in GC for a system like this. I've been particularly interested
in worst-case pause-reduction because it's been such a problem in the
server-side and desktop applications I've written and managed. IMO,
unacceptable and unpredictable worst-case pause-times are one of the few
remaining reasons that some applications are still written with manual
memory allocation.
Ben Kloosterman
2009-09-24 05:27:56 UTC
Permalink
Ø Shared memory is a bit hairy; I will start a separate thread to break
things down, it's hard on the brain. I think we need a separate thread for
GC as well.






Why do we need to prove this [that data passed through IPC won't change
later] ? If the coder writes some code badly it will appear just like a
Threading/Synch bug.



First off, Singularity seems to be using this proof as part of how they
assure SIPs can be independently GCed [PDF
<http://research.microsoft.com/pubs/69431/osr2007_rethinkingsoftwarestack.pdf>].



Ø They are trying to avoid the issue of a programmer assigning a reference
to an object in the SIP's object space into shared memory. This is bad
programming and could result in the object being collected despite the
shared memory reference. A reference into shared memory can be easily
handled, e.g. call a different allocator at compile time for object trees
marked shared. In their architecture a SIP has only one GC; in one of my
suggestions in a thread there is a different allocator used for shared
memory objects, only one of which is active at a time. If a thread wants to
talk to multiple shared regions it will need to change regions via a using
block.

Ø I'm thinking of allowing references into shared memory, since most shared
memory areas should be small.

Ø Maybe by using a factory to create shared memory objects and replacing
all references with indirect ones. Except for buffers and IO, which will be
single level (see other post with Buffer Manager), an additional
indirection on shared memory references is perfectly valid. This opens up a
lot of options. This indirect pointer can be set to null if the user does
not have the global lock, and you don't incur this cost for structs and flat
objects.





Second, the bugs that would occur would be extremely hard to track down. I
might be handed a structure from IPC. I might store that structure. If
someone else can change the structure later, then my application would
strangely break at some weird time in the future because of the structure
changing. This seems really bad. I think it will be very important to assure
that data passing across IPC boundaries does not change on the other side
(unless that is specifically desired, as in shared-memory regions).





Ø But this is just standard multi-threaded code. Locking on the object or
shared memory region would prevent this. I agree there is a valid argument
in terms of untrusted IPC vs trusted IPC, and IMHO they should have
different mechanisms. It is entirely appropriate to say that for untrusted
code it must be immutable (which may be achieved by a single complex
object, not a tree) and for trusted partners to lock on trees or even
objects within the tree.



Extending the immutable idea has some merit. I was going to just do it via a
lightweight lock (which can be on the root memory object of the tree);
avoiding locking would be nice.

How would a lock on the root of the object tree work? If I had a handle to a
random place in the subtree, how would it inexpensively enforce the lock?





Ø If you want your code to work, you obtain the lock, as with any
multi-threaded code within your process. Even with traditional shared
memory, if it's active between 2 processes they need some way of locking,
though they can use a Mutex.
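
In plain C#, that convention looks like the sketch below (hypothetical
types): every writer is supposed to take the root's monitor before touching
any node, even when it holds a handle deep inside the subtree.

    // Convention-based locking: writers must lock the tree root before
    // mutating any node reached through any handle into the subtree.
    public sealed class Node
    {
        public Node Root;     // back-pointer to the tree's root
        public int Value;
        public Node Child;
    }

    public static class TreeLocking
    {
        public static void SetValue(Node anywhereInTree, int value)
        {
            lock (anywhereInTree.Root)   // cheap, but purely cooperative
            {
                anywhereInTree.Value = value;
            }
        }
    }

Nothing in the type system forces a caller holding a Child reference to go
through SetValue, which is the fragility David points out later in the
thread.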



I was viewing shared memory spaces as small (each with its
own GC), between a few processes. I don't like the idea of a single huge
shared heap, which Singularity uses, exactly for the reasons you mentioned.
Though it may be OK to have a single large one and an option for a shared
memory region to have its own, you are still open to DOS-type attacks
through a bug.

If you don't like the idea of a large shared heap, then how would you
implement a large buffer cache?



Ø Large in terms of big object spaces/GC collection, not data. A buffer
cache is only 1 level deep, byte[BufferSize][NumBuffers], so it can be
managed with a global lock. Note the buffer is a single GC object (see other
post for more).

You could have a large buffer cache in the driver or service and just expose
one or a few buffers to other processes via shared memory, but not the
pool itself.

... of course clients only have references to the few objects (blocks)
they've been given as return result. Nobody can touch an object in the
buffer unless they've been given a handle. However, the pool is still large.




I suspect a scheme exploiting the fact that data is immutable and
introducing a temporary write barrier of some kind on handles into the
shared memory would be able to avoid the need to stop-the-world.



Ø Yes, this will work for buffers, since they are immutable and there are
no writes except for creation. I think buffers need to be handled
differently from deeper N>1 object spaces. Though since they are not written
after creation, a write barrier incurs no cost and could apply to all shared
memory spaces.



It may also be practical to subdivide the buffer cache into several shared
pools based on a hash of a cache key. By using a dynamic hash, the
individual pools could be repartitioned when they reached a certain maximum
size. This somehow feels kludgy, but perhaps it makes sense. This would
allow the buffer-cache to be collected in smaller pieces.



Ø Yes, you are right; here we need to do something like that, not because
of a big pool, but because exposing all buffers to "user land" is a security
issue.

Only the thread that talked to the Shared memory but yes. Note 1 driver may
have 50 Shared Memory regions talking to different services each with its
own thread.

I don't see how this fixes the problem. How are the shared memory regions
GCed? If the driver needs to be stopped when any of the 50 are stopped, then
we have a pause problem. I'm beginning to respect the Singularity design
more and more.



Ø Each shared memory region has its own GC, so why do you need to stop the
device driver? There are many options here, e.g. flick the shared lock
before the collect. With a multi-threaded mark you know all the objects
before you do a sweep. I can't see the driver or the GC sweep being a big
deal. Ref counting is also viable for shared memory (and what I was
thinking); to do a collect you get the GC to acquire the lock on the region.
This may block a single thread on a driver, but is no big deal for the whole
driver.

Ø Don't discount the Sing design; they had lots of really smart people and
worked on it for 5-6 years. The main things Sing suffers from are lack of
people to make real device drivers/services, a poor security model
(which is why they hired Shap), poor Visual Studio integration, and a bit
of conservatism in the OS design (which they must have if they want to
replace Windows).



If there is truly zero-copy from the hardware HAL to user-land, then all
these folks are going to be touching some shared-buffer-cache-heap, and all
of them would need to be stopped for a stop-the-world collection of the
shared-buffer-cache-heap. That heap is likely to be quite large.



Why? If we are talking buffers, we are talking immutable byte[]; I don't see
the GC being involved at all, so why involve it? When you are finished with
a buffer, tell the device driver.

In a typesafe system like this, memory can only be freed when the GC can
prove it's not referenced. That can only happen as part of a trusted GC
process. People "saying they are finished" is not enough. The GC needs to
prove they have no more handles.

Ø That's up to the original creator; the creator may have a dummy handle
like Buffer outstanding which gets cleaned up when the receiver says "I'm
finished". If there are other references you don't need to collect. I think
we should avoid buffers when talking about GCs, as they are not really
collected at any time; they normally live for the duration of the
process/shared memory and are just reused.



Perhaps one solution is to support reference-counting on immutable
leaf-objects in a shared heap. (leaf-object ==> object with no handles to
other objects) Since they are leaf-objects there can be no cycles and
freeing the object should be constant time. Since they are immutable
everyone can share them with no danger.



Ø The .NET mark and sweep already treats leaf objects differently (e.g.
all strings and boxed structs); it doesn't scan them in the mark, etc. Why
not use ref counting on all shared objects? It doesn't solve the shared
memory pointer issue, but it solves the GC dead-time issue.

the actual sweep does need a stop-the-world; I'm not smart enough to solve
that problem.

The stop-the-world sweep/compact pause time is a huge problem in real
application level GCed systems today. It's one of the reasons games and
other soft-real-time applications have trouble with GC. Even on
high-performance webserving pause times become unacceptable (for me) on
heaps over 256MB.

You sure this isn't the mark time? The majority of the time is spent in
the mark, and in most GCs (.NET and Java) this is single-threaded
stop-the-world.

You are right, mark (and remark) time. I also didn't realize Java's
concurrent collector
<http://chaoticjava.com/posts/parallel-and-concurrent-garbage-collectors/>
is non-compacting. It seems to achieve concurrency also
only with a write-barrier. [PDF
<http://research.sun.com/techrep/2000/smli_tr-2000-88.pdf>] My point still
stands: large heaps create large GC pause times.



Ø If the pause time is an issue, switch the app to a ref-counting one.
For 95% of apps a concurrent mark and a stop-the-world sweep and compact
will be sufficient. Also, SOA program structures for most apps will avoid
massive heaps in a single process. I certainly acknowledge 3D, video, and
some heavy calc apps are not in that category.

The problem with games and such is that Windows processes are large and IPC
is expensive; for example, if you broke the code up into smaller services
(remember the IPC time) you could use a different GC for each part with far
less work. It is also good from a coding point of view.

Smaller services do not solve this problem for many applications. Consider a
3d modeling package or computer game with a very large scene. The data is a
very large number of vertices, and the graph that connects them into edges
and faces. There is no "module" boundary to create. This is a graph
partitioning problem and it's NP-complete.





Ø Fair point, but it really only affects the 3D driver which aggregates it
all; threads can add background meshes, foreground, etc. I would imagine
such a driver would use ref counting or even a manual collector. A write
barrier here is not bad either, since the meshes are immutable; 99% of the
time the data is created once and read MANY times, which sounds like a good
use of the MS GC you mentioned.

For critical processes such as the GUI you can use a ref-counting GC or
even manually call free. Oh, I forgot to mention that you can manually free
memory by calling GC.Free(ref), so in theory you could turn off collections
and manage it yourself just like a traditional system, but I haven't given
this much thought. The free is basically there since we have an Allocator
but no Collector.

I don't understand this suggestion. It is not safe to release memory until
the GC has proven no other references exist. GC.Free(ref) in a safe GC
system can only null out the handle. It can't actually release memory. If it
did, some other handle to that same memory would become invalid.





Ø It's not safe, but nothing stops a forced free, and just like C++, if the
developer leaves a dangling pointer the process may crash. It's an option of
last resort, though I do use it at the moment, since the Collector is not
working yet.

You using the MS CLR? I have had high-performance systems with up to 1 GB
heaps, and the worst case I have seen is maybe 2ms tops.

I have not tested the MS CLR, but I don't believe 2ms is the worst case
pause time for a typical 1GB heap. I could imagine a 1GB heap that collects
quickly because it has very large data-buffers. However, if there is even
512MB of pointers, I expect the worst case pause would be much longer.



Here is a quote from the person writing the new CLR 4.0 background GC
<http://blogs.msdn.com/maoni/archive/2008/11/19/so-what-s-new-in-the-clr-4-0-gc.aspx>,
explaining why they are still working on this
problem: "as people start writing larger applications with larger heaps that
handle more stressful situations, the latency can be unacceptable." Also,
on the servers they have a notification API because they recognize this
problem. From the same article above... "Basically you register to get
notified when a full GC is approaching and when it’s finished. This allows
you to do software load balancing between different server instances – when
a full GC is about to happen in one of the server instances, you can
redirect new requests to other instances."





Ø Even 5-10 ms is an issue for some roles (e.g. graphics, ping/perf
checks). I note they use a double nursery as well (but obviously not a
global one). Interesting that this concurrent GC is only available as the
workstation GC, not the server one.



GC pausing is a real problem. It's not solved.





Ø Memory pauses are a problem that exists in C as well. Calls to malloc can
take a long time on a large heap, and if the pause is there waiting on the
memory manager, then other threads will pause as well (unless they don't
allocate). The same techniques that work there also work with us, e.g.
permanent self-managed buffers, reallocate data from a factory, use the
local stack more, use workers that don't allocate memory (and in our case
don't need a GC, etc.). The real problem is if the memory gets overloaded
(e.g. a very large object space for a single GC); then the problem gets
worse and worse. There will always be tiny pauses, but I can't see how, say,
a ref-counting GC that frequently releases will have issues; its behaviour
would be similar to a native C app.



Anyway, if you exclude buffers, I think most device drivers will have tiny
heaps. Large 3D apps are an issue and would probably need a write barrier GC
if they can't make do with a ref-counting one.



Here are some other interesting references: PPT-gc-challenges-overview
<http://www.symbolic-computation.org/science/images/7/73/Jones.ppt> ;
CLR garbage collection
<http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry> ;
PDF-gc-overhead
<http://www.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf> ;
PDF-coordinating-gc-and-paging
<http://www.cs.umass.edu/~emery/pubs/f034-hertz.pdf>



Ø I like removing paging; it is an issue, as your pause gets hit with disk
accesses, since a GC scans infrequently-used memory. That won't be hard to
implement ;-P . The idea of using lots of threads at high priority to do the
sweep/compact also has merit, as this is the stop-the-world phase. Not much
info on ref counting, except interest is returning. It will be interesting
to compare the cost of the MS research one with a ref-counting one (yes, the
ref counting will have small stops).

One major issue with a ref-counting GC is we need to recompile the app to
change from a mark-and-sweep to ref counting and vice versa, as we don't
want expensive ref-counting checks when not needed.

Ref-counting will leak memory when cycles occur, so in the end there would
need to be some kind of sweeping GC even in a refcounted pool (like Python,
refcounting + GC to find cycles). As I mentioned above, refcounting might be
a good solution for shared immutable leaf-objects, as they can't create
cycles.



Ø Not sure if this is a big issue; it just means you waste more memory.
You can also schedule compactions when most threads are idle. You can also
run the background mark to determine how bad the cycle problem is and act
accordingly, i.e. if less than 20%, ignore; if threads are idle, clean up
now; etc.



For shared memory I suppose a write barrier is not such a big deal as the
code is created then read. I don’t see many writes going on for receivers.
The shared memory Control block mentioned above can maybe serve as the write
barrier.

Agreed. AFAIK, the CLR concurrent collector already has a similar write
barrier anyhow.





Ø Yes, a very lightweight one (even in the existing Collector).

BackgroundUpdateNursery()

Call the collector for each process active since the last
collection, preferably when they are idle.

Reset the pointer on the nursery when all collects are done.



If I understand correctly, you are expecting this "collection of all
processes with active data in the nursery" to require them to either free
the memory or copy it out of the nursery into their own heap. It seems this
would require a write-barrier also, as you would want to cheaply keep track
of all references into nursery objects from other generations. (Otherwise
how would you cheaply scavenge the nursery without also sweeping those
generations?)





Ø Yes, sort of. The current CLR nursery works as I outlined; it does use a
sort of write barrier (when an object is modified, the JIT emits the code
to update a bit field to state the object has a change), but only for
objects that exist in generations 1 and 2. Just these items are checked for
refs into the nursery.
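
The mechanism Ben describes is essentially a card table. A simplified sketch
of the idea (conceptual C#, not the actual JIT-emitted barrier, and it
assumes a flat 256 MB heap starting at address 0):

    using System;
    using System.Collections.Generic;

    // The heap is divided into fixed-size "cards"; every reference store
    // marks the card containing the updated object. A nursery collection
    // then scans only dirty cards in the older generations for references
    // into the nursery, instead of scanning the whole heap.
    public static class CardTable
    {
        public const ulong CardSizeBytes = 256;
        private static readonly byte[] _cards = new byte[1 << 20]; // 256 MB / 256

        // Conceptually emitted after every reference-field store.
        public static void WriteBarrier(ulong updatedObjectAddress)
        {
            _cards[updatedObjectAddress / CardSizeBytes] = 1; // mark card dirty
        }

        // At collection time, only dirty cards get scanned for nursery refs.
        public static IEnumerable<ulong> DirtyCards()
        {
            for (ulong i = 0; i < (ulong)_cards.Length; i++)
                if (_cards[i] == 1) yield return i;
        }
    }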



The different processes may have different Collectors also which will sweep
or compact in different ways.



2 issues:

- Ref-counting GCs can't use the nursery

- The cache hit rate will drop a bit. (I don't think it will be such a big
deal, as it only affects gen 0, and most of these will be recently accessed
anyway)



I expect this would also have subdivision of the Nursery for each
hardware-thread, to avoid locking and concurrency problems fighting over a
single top-of-heap pointer.



Ø Nope, not needed. The top-of-heap pointer is done via an interlock, so a
single CPU instruction; it won't really hold other threads up. (Note I
haven't solved the issue of storing this in a register for multiple CPUs;
for a single CPU you could be more efficient, but what I outlined will work
for multi-core systems.) You do incur a memory access cost, but it's pretty
good compared to malloc or even the CLR.



Overall the single nursery looks to have some interesting advantages. It has
the same disadvantages of generational/concurrent schemes, namely the write
barrier, and of compacting schemes, namely copy overhead. However, these
probably already exist to support those other features, so they might be
available 'for free'.





Ø 85% of allocations are gone before 1 MB is allocated. With a large
double nursery this may mean 95-98% (my guess); this means the actual
collection sweep will work on 1/8-1/3 of the number of objects. Note there
will be a smaller improvement in the gen 1-2 collections, as most of these
are long lived (but note fewer collections means fewer gen 2 objects). The
biggest gains IMHO will be web servers and XML, with their huge collections
of short-term strings.



Consider a video game: if the 2 nurseries are large enough to hold all
temporary scene data, say a minute's worth, then you incur no collection
cost (since the mark is fixed) for 95%-98% of this 400 MB. That's a huge
gain and will really help GC stoppages. (Games also have the advantage of
being the only heavily active process when running.)





Regardless of the details, there are many interesting and unsolved
challenges in GC for a system like this. I've been particularly interested
in worst-case pause-reduction because it's been such a problem in the
server-side and desktop applications I've written and managed. IMO,
unacceptable and unpredictable worst-case pause-times are one of the few
remaining reasons that some applications are still written with manual
memory allocation.



You're welcome to write our GCs … :) I'm focussing on the
Scheduler/Process/Shared Memory/Security and the MM, which is already far
too much.



One more thing: I don't think a strongly typed system needs to zero memory?
Any disagreements?



Regards,



Ben
David Jeske
2009-09-24 17:46:11 UTC
Permalink
Post by David Jeske
First off, Singularity seems to be using this proof as part of how they
assure SIPs can be independently GCed [PDF<http://research.microsoft.com/pubs/69431/osr2007_rethinkingsoftwarestack.pdf>
].
Ø They are trying to avoid the issue of a programmer assigning a
reference to an object in the SIP's object space into shared memory. This is
bad programming and could result in the object being collected despite the
shared memory reference.
I don't understand this comment above. If I understand
correctly, Singularity does not GC the shared memory at all, but relies on
the invariant that only one SIP can point to an object in shared memory.
This design means that when they sweep an individual SIP, they also sweep
the objects that SIP owns in shared memory. Therefore, ensuring only a
single SIP references a shared object is a fundamental part of their GC.
Without this property, the only way they could know that all references to a
shared memory object were removed would be to do a full sweep across all
SIPs.

In other words, this isn't simply a matter of 'avoiding bad programming'.
Their entire GC concept depends on this design element. Without it, they
wouldn't know when to collect the shared memory objects either.
Post by David Jeske
A reference into shared memory can be easily handled, e.g. call a different
allocator at compile time for object trees marked shared.
Multiple allocators for multiple reasons only solves the trivial problem
of where the object is placed. The hard work is inexpensively determining
reachability. That's what GC is all about. Singularity's design has a crafty
elegance to it. They effectively amortize the cost of computing
reachability for IPC objects across all SIPs by assuring that only one SIP
can see an IPC object at once.
Post by David Jeske
Ø I'm thinking of allowing references into shared memory, since most shared
memory areas should be small.
I don't see how the size of a shared memory region is relevant. If sweeping
GC is used, and multiple SIPs can reference the same object, then a sweep
would need to stop-and-sweep ALL SIPs connected to that shared memory to
determine reachability for an object.
Post by David Jeske
How would a lock on the root of the object tree work? If I had a handle
to a random place in the subtree, how would it inexpensively enforce the
lock?
Ø If you want your code to work, you obtain the lock as with any
multi-threaded code within your process. Even with traditional shared
memory, if it's active between 2 processes they need some way of locking,
though they can use a Mutex.
You are proposing a concept of IPC data sharing which only works if
programmers "do the right thing" all the time. That philosophy sounds
incompatible with the idea of relying on a provably correct VM system for
isolation. Getting multi-threaded locks correct is difficult for most
programmers. If system stability is reliant on all programmers, then the
system will not be stable.
Post by David Jeske
Ø Each shared memory region has its own GC, so why do you need to stop
the device driver?
Even in concurrent GC it's necessary to stop all mutators for a period of
time. Then all mutators need to be scanned. If a driver may have a pointer
into shared memory, then it is a mutator and would need to be stopped during
any non-concurrent phases.
Post by David Jeske
Ref counting is also viable for Shared Memory (and what I was thinking):
to do a collect you get the GC to acquire the lock on the region. This may
block a single thread on a driver but is no big deal for the whole driver.
Ref-counting shared memory does sound like a possibly good idea IF the
objects allowed in shared memory also disallow cycles. For example, the
proof-by-induction immutability concept I explained earlier would disallow
cycles, since it is provably impossible to end up with two immutable
objects that point to each other (assuming proper thread locks).
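
To make that induction concrete, here is a minimal C# sketch (hypothetical
types, not Cosmos code): an immutable node can only reference nodes that
were fully constructed before it, so a cycle can never be formed without
mutation.

// Hypothetical illustration of the induction argument: every child must
// already exist when the parent is constructed, and no field can be
// reassigned afterwards, so no node can ever point "forward" in
// construction order -- which is exactly what a cycle would require.
sealed class ImmutableNode
{
    public readonly string Label;
    public readonly ImmutableNode Left;   // may be null (a leaf)
    public readonly ImmutableNode Right;  // may be null (a leaf)

    public ImmutableNode(string label, ImmutableNode left, ImmutableNode right)
    {
        Label = label;
        Left = left;    // both children were constructed earlier
        Right = right;
    }
}

Two ImmutableNode instances pointing at each other would require passing a
node to its own descendant's constructor before that node exists, which the
language cannot express (absent reflection or unsafe code).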
Post by David Jeske
Ø If the pause time is an issue, switch the app to a ref-counting one.
This is non-viable. Apps with very large datastructures which produce GC
pause problems have MANY cyclic references. Sticking with the 3d example,
there are many reasons for cyclic references between vertices and objects
(polygon reduction, to name one). A refcounting GC would still have to
periodically sweep, recreating the pauses.
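
A tiny illustration of the problem (illustrative names, C#): once mutable
objects can reference each other, a pure reference count never reaches
zero for the pair, even when nothing else can reach them.

// Illustrative only: why a pure refcounting collector leaks cycles.
sealed class MeshNode
{
    public MeshNode Neighbor;  // a mutable link permits back-references
}

static class CycleDemo
{
    public static void LeakACycle()
    {
        var a = new MeshNode();
        var b = new MeshNode();
        a.Neighbor = b;   // b is now referenced by 'a' and by the local
        b.Neighbor = a;   // a is now referenced by 'b' and by the local
    }   // the locals die here, but each node still holds the other at
        // count 1, so neither count ever reaches zero: a tracing sweep
        // (with its pause) is still needed to reclaim them
}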

If you want a more common application, consider MS Excel or MS Word. Both
have graphs with many cyclic references which become very large for large
documents. Large documents are very common, especially in Excel.
Post by David Jeske
Ø Fair point, but really it only affects the 3D driver, which aggregates
it all; threads can add background meshes, foreground, etc.
?? This does not affect the 3d driver at all. The driver only sees the 3d
information for brief moments during the rendering process. This affects
the end-application, which has a giant tree of information in a huge soup.
Your comment to "use threads for background meshes, etc" doesn't make sense
to me. Even individual models can be large enough to cause GC pauses.
Likewise for the Excel document example. Again, this is graph partitioning;
it's NP-complete. There is no "easy" way to partition pieces into different
places.
Post by David Jeske
I don't understand this suggestion. It is not safe to release memory
until the GC has proven no other references exist. GC.Free(ref) in a safe GC
system can only null out the handle. It can't actually release memory. If it
did, some other handle to that same memory would become invalid.
Ø It's not safe, but nothing stops a force free, and just like C++, if
the developer leaves a dangling pointer the process may crash. It's an
option of last resort, though I do use it at the moment since the Collector
is not working yet.
This explanation of yours contradicts everything I've heard Cosmos is about.
The entire premise of a system which uses a "typesafe software VM for
safety" is that the software VM is provably correct. There is one huge
memory space with no hardware protection. You can't allow a force free! The
worst case is NOT a crash. The worst case is that I exploit the force-free,
get a different object or byte-buffer allocated into the same place, and use
a leftover pointer to violate every security and protection on the system.

You can stick a force-free concept into Cosmos today because nobody has
implemented a GC, but Cosmos isn't going to be able to use software for
reliability and security until every force-free is removed.
Post by David Jeske
Ø Memory pauses are a problem that exists in C as well.
The problem with GC is not pausing, which I agree exists in any allocator.
The problem is _unpredictable_ and _uncontrollable_ pauses. A C program can
write its own allocator, and can choose when to allocate or free objects to
avoid long pauses. A program with a GC no longer has control over this
element of the program, so it becomes unpredictable and uncontrollable.
Post by David Jeske
There will always be tiny pauses, but I can't see how, say, a ref-counting
GC that frequently releases will have issues; its behaviour would be
similar to a native C app.
ref-counting GCs are only usable if there are no cycles. Many, many
applications have cyclic data-structures that are not usable in a
refcounting system.
Post by David Jeske
I expect this would also have subdivision of the Nursery for each
hardware-thread, to avoid locking and concurrency problems fighting over a
single top-of-heap-pointer.
Ø Nope, not needed. The top-of-heap pointer is updated via an Interlock,
so it's a single CPU instruction and won't really hold other threads up.
An interlock holds up other hardware threads! It may be a single
instruction, but that single instruction causes a coherency protocol to
occur on the processor bus. If several hardware threads are frequently
doing the interlock on the same value (because they are frequently
allocating), this can be a bottleneck.
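
For concreteness, a sketch of the contended path (hypothetical allocator
with an assumed 64 MB nursery, C#): every allocating hardware thread does
a compare-and-swap on the same word, so the cache line holding 'top'
bounces between cores under allocation-heavy load.

using System.Threading;

// Sketch of a lock-free bump allocator over a single nursery. The CAS is
// one instruction, but it forces the cache line holding 'top' into the
// allocating core's cache in exclusive state every time, which is the
// contention described above.
static class Nursery
{
    const long Limit = 64L << 20;  // assumed 64 MB nursery
    static long top;               // offset of the next free byte

    // Returns the new object's offset, or -1 when a collection is needed.
    public static long Allocate(long size)
    {
        while (true)
        {
            long old = Interlocked.Read(ref top);
            long next = old + size;
            if (next > Limit) return -1;
            if (Interlocked.CompareExchange(ref top, next, old) == old)
                return old;  // won the race; otherwise loop and retry
        }
    }
}

Per-thread sub-nurseries would replace the shared 'top' with one word per
hardware thread, trading some memory for the loss of that contention.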
Post by David Jeske
Consider a video game: if the 2 Nurseries are large enough to hold all
temporary scene data, say a minute's worth, then you incur no collection cost
This doesn't make sense to me.

Scene data is rendered in a streaming fashion. The persistent data for the
entire scene is always in memory. The temporary data is created in a
pipeline to the driver, and the pieces only need to live for a short time. A
10MB nursery is probably large enough to avoid stalling the driver pipeline.
These temporary allocations and the Nursery have nothing to do with the GC
pause.

The GC pause occurs because eventually enough longer-lived garbage is
generated. There are many reasons for this; one would be streaming terrain,
where the terrain loaded is dependent on where the user's character is in
the world. The reasons are not important, but eventually there will be a
full GC of the oldest generation. This contains the entire scene, and the
GC pause that occurs is too long for an interactive application.

The "C" version of this program would be manually freeing these longer lived
pieces of garbage along the way, because it would have manually coded rules
about when to free things. (The C code effectively is using induction proofs
of reachability). It would never need to sweep the entire scene to determine
reachability, so it would never incur this pause.
Post by David Jeske
You're welcome to write our GCs... :) I'm focusing on the Scheduler
/Process/Shared Memory/Security and the MM, which is already far too much.
I'm very interested in GC. I've never written a collector, but I know quite
a bit about them. I may take a shot at it. Keep in mind that there are many
interactions between Shared Memory/Security, overall design, and the GC.
Post by David Jeske
One more thing: I don't think a strongly typed system needs to zero
memory? Any disagreements?
Memory will need to be either zeroed or assigned to a known valid value
before a constructor returns.

A strongly typed system that controls all code could do this entirely
during object construction instead of before/during allocation. This would
prevent redundant writes for values the constructor is going to fill in
anyhow. However, zeroing a large block in bulk is much faster than writing
individual values, so it may be less overhead simply to make sure the
allocator returns zeroed memory. It's certainly simpler.
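
A small sketch of the trade-off (hypothetical type, C#): the constructor
below definitely assigns every field, so pre-zeroing its memory is pure
overhead; proving that, though, requires definite-assignment analysis of
every constructor, whereas bulk-zeroing the nursery needs no analysis at
all.

// Hypothetical example: every field is assigned in the constructor, so a
// zeroed allocation would write each word twice. A compiler that proves
// this per-constructor could skip the zeroing; otherwise the allocator
// must hand out zeroed (or known-valid) memory to keep type safety -- no
// object may ever observe another object's stale bits.
sealed class Packet
{
    readonly int length;
    readonly byte flags;
    readonly long timestamp;

    public Packet(int length, byte flags, long timestamp)
    {
        this.length = length;
        this.flags = flags;
        this.timestamp = timestamp;
    }
}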
Ben Kloosterman
2009-09-25 07:06:37 UTC
Permalink
On Wed, Sep 23, 2009 at 10:27 PM, Ben Kloosterman <bklooste-***@public.gmane.org>
wrote:

First off, Singularity seems to be using this proof as part of how they
assure SIPs can be independently GCed [PDF
<http://research.microsoft.com/pubs/69431/osr2007_rethinkingsoftwarestack.pdf>].

Ø They are trying to avoid the issue of a programmer assigning a
reference to an object in the SIP's object space into Shared Memory. This
is bad programming and could result in the object being collected despite
the Shared Memory reference.



I don't understand this comment above. If I understand correctly,
Singularity does not GC the shared memory at all but relies on the
invariant that only one SIP can point to an object in shared memory.
This design means that when they sweep an individual SIP, they also sweep
the objects that SIP owns in shared memory. Therefore, ensuring only a
single SIP references a shared object is a fundamental part of their GC.
Without this property, the only way they could know that all references to
a shared-memory object were removed would be to do a full sweep across all
SIPs.



Yes, they are trying to avoid the dangling pointer issue, but also this
issue: since Shared Memory has no GC, if a ref comes from a SIP into the
Shared mem, what does the GC do with it? Or, even worse, Shared Memory into
a SIP, in which case the SIP's GC will remove it. And the issue only exists
when shared memory exists between multiple SIPs. I must say I was seeing
this as more 1:1 relationships, in which case it's easy to set all such
references to null.



That raises the question: should shared memory allow multiple processes?



Also, Singularity passes the Shared Memory around from SIP to SIP with the
enclosed data, rather than a poor permanent IPC-type structure. Buffers can
be passed around via a buffer manager and are normally 1:1 (Device – Cache
– FileSystem – User). However, Object Caches are more 1:N, and the data is
certainly not immutable.





In other words, this isn't simply a matter of 'avoiding bad programming'.
Their entire GC concept depends on this design element. Without it, they
wouldn't know when to collect the shared memory objects either.



Aren't their shared memory objects only collected when the Shared Memory
is finished? What I was saying is you can leave this to the programmer,
which is rather awkward.



A reference into Shared Memory can be easily handled, e.g. call a
different Allocator at compile time for object trees marked shared.

Multiple allocators for multiple reasons only solves the trivial problem
of where the object is placed.

It's trivial but important, as it will get collected if it's in the wrong place.



The hard work is inexpensively determining reachability. That's what GC is
all about. Singularity's design has a crafty elegance to it. They
effectively amortize the cost of computing reachability for IPC objects
across all SIPs by assuring that only one SIP can see an IPC object at once.




But this can be done via an indirect pointer and a lock, e.g. set the
pointer to null when you don't have the lock; even better, this indirect
pointer can be a control block with threading events.
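
A rough sketch of that indirect-pointer idea (hypothetical shape, not a
Cosmos API, C#): the raw reference is reachable only through a slot that is
non-null while the lock is held, so a collector that observes the lock to
be free knows no mutator holds the reference.

using System;

// Sketch: an indirect pointer that is null whenever the lock is free.
// 'stored' is owned by the shared region's own collector; 'Current' is
// the only slot a client thread ever sees.
sealed class SharedHandle<T> where T : class
{
    readonly object gate = new object();
    readonly T stored;
    public T Current;  // null except while a holder owns the lock

    public SharedHandle(T target) { stored = target; }

    public void WithLock(Action<T> body)
    {
        lock (gate)
        {
            Current = stored;            // expose the reference
            try { body(Current); }
            finally { Current = null; }  // drop it before unlocking
        }
    }
}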



Ø I'm thinking of allowing references into shared memory since most
shared memory areas should be small.

I don't see how the size of a shared memory region is relevant. If sweeping
GC is used, and multiple SIPs can reference the same object, then a sweep
would need to stop-and-sweep ALL SIPs connected to that shared memory to
determine reachability for an object.



Yes, I got it backwards; the size is irrelevant. But if you allow
references from Shared into the immutable objects of SIPs, it is relevant.
I also was going to have an owner lock/indirect pointer as above, and for
this case (Shared into SIPs) you could block the GC until it held the lock
(i.e. it had sole access to the Shared Mem) and then do a Mark. (It's
probably better handled with dummy objects to prevent it getting
collected.)

How would a lock on the root of the object tree work? If I had a handle to a
random place in the subtree, how would it inexpensively enforce the lock?

Ø If you want your code to work, you obtain the lock as with any
multi-threaded code within your process. Even with traditional shared
memory, if it's active between 2 processes they need some way of locking,
though they can use a Mutex.



You are proposing a concept of IPC data sharing which only works if
programmers "do the right thing" all the time. That philosophy sounds
incompatible with the idea of relying on a provably correct VM system for
isolation. Getting multi-threaded locks correct is difficult for most
programmers. If system stability is reliant on all programmers, then the
system will not be stable.



Sort of... I'm proposing different levels of Shared Memory, e.g. one
between trusted partners and a different one for untrusted. It will not
make the whole system unstable, only the SIPs which share that shared
memory piece. So between Device Driver and File System you could rely on
things working correctly; between File System and the User you need a
safer method. I originally was thinking of a forced locking system to be
enforced only for the Untrusted, but am now thinking the manual lock is
not worth it. (There is a different method without shared memory we can
use for immutable objects, e.g. just pass the reference and create a dummy
until the receiver is finished.)



Ø Each shared memory region has its own GC so why do you need to stop the
device driver ?



Even in concurrent GC it's necessary to stop all mutators for a period of
time. Then all mutators need to be scanned. If a driver may have a pointer
into shared memory, then it is a mutator and would need to be stopped during
any non-concurrent phases.



Again, if it doesn't hold the lock there is no need, as there is no mutator.



Ref counting is also viable for Shared Memory (and what I was thinking):
to do a collect you get the GC to acquire the lock on the region. This may
block a single thread on a driver but is no big deal for the whole driver.

Ref-counting shared memory does sound like a possibly good idea IF the
objects allowed in shared memory also disallow cycles. For example, the
proof-by-induction immutability concept I explained earlier would disallow
cycles, since it is provably impossible to end up with two immutable
objects that point to each other (assuming proper thread locks).



Yes, good point; a cycle can be checked for at the same time.



Ø If the pause time is an issue, switch the app to a ref-counting one.

This is non-viable. Apps with very large datastructures which produce GC
pause problems have MANY cyclic references. Sticking with the 3d example,
there are many reasons for cyclic references between vertices and
(polygon reduction, to name one). A refcounting GC would still have to
periodically sweep, recreating the pauses.



If you want a more common application, consider MS Excel or MS Word. Both
have graphs with many cyclic references which become very large for large
documents. Large documents are very common, especially in Excel.



I'm not saying to eliminate the pauses, just to make them small enough not
to matter. Also, you can detect cyclic references concurrently and free the
memory, except for fractional thread stack scans. The main issue here is
memory fragmentation, for which you either need 2x the memory or you need
compaction, which will create a bigger pause.

Ø Fair point, but really it only affects the 3D driver, which aggregates
it all; threads can add background meshes, foreground, etc.

?? This does not affect the 3d driver at all. The driver only sees the 3d
information for brief moments during the rendering process. This affects the
end-application, which has a giant tree of information in a huge soup.
Your comment to "use threads for background meshes, etc" doesn't make sense
to me. Even individual models can be large enough to cause GC pauses.
Likewise for the Excel document example. Again, this is graph partitioning,
it's NP-complete. There is no "easy" way to partition pieces into different
places.



Maybe I'm thinking of DirectX, where the DirectX lib does all that; once
you send it the data you forget about it in the user app, but the key graph
is only a scene. Anyway, I don't think it will be that different from the
malloc/free heap pauses with a ref-counting GC. Lastly, you can probably
use static memory for this and manage it yourself; vertex buffers and most
3D things are done this way.



I don't understand this suggestion. It is not safe to release memory until
the GC has proven no other references exist. GC.Free(ref) in a safe GC
system can only null out the handle. It can't actually release memory. If it
did, some other handle to that same memory would become invalid.

Ø It's not safe, but nothing stops a force free, and just like C++, if
the developer leaves a dangling pointer the process may crash. It's an
option of last resort, though I do use it at the moment since the Collector
is not working yet.

This explanation of yours contradicts everything I've heard Cosmos is about.
The entire premise of a system which uses a "typesafe software VM for
safety" is that the software VM is provably correct. There is one huge
memory space with no hardware protection. You can't allow a force free! The
worst case is NOT a crash. The worst case is that I exploit the force-free,
get a different object or byte-buffer allocated into the same place, and use
a leftover pointer to violate every security and protection on the system.



Except you can only free your own app's pointers (the GC will check the
mem range), and hence you only see bits of your own app (which, granted,
may have interesting information in it)... Anyway, it's an option of last
resort; every OS in history has had its compromises. Managed OSes have
unmanaged pointer-based Memory Managers and GCs; if a 3D lib does the same,
so be it, though I don't think it's needed.





You can stick a force-free concept into Cosmos today because nobody has
implemented a GC, but Cosmos isn't going to be able to use software for
reliability and security until every force-free is removed.

Ø Memory pauses are a problem that exists in C as well.

The problem with GC is not pausing, which I agree exists in any allocator.
The problem is _unpredictable_ and _uncontrollable_ pauses. A C program can
write its own allocator, and can choose when to allocate or free objects to
avoid long pauses. A program with a GC no longer has control over this
element of the program, so it becomes unpredictable and uncontrollable.



In the past C was like this, but now it's hard to know when a lib's malloc
will make a kernel allocation call. To handle this you manage memory
yourself. This is still sort of possible in C#: you can create static
buffers and objects which you manage yourself (you can recycle unused
objects, etc.). It is worth noting this is exactly what is used with most
3D systems.
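
A minimal sketch of that recycling idea (illustrative names, not an
existing API, C#): by renting and returning fixed-size buffers instead of
allocating fresh ones, the steady state produces no garbage for the GC to
trace.

using System.Collections.Generic;

// Sketch of a self-managed buffer pool: allocation happens up front or on
// first use; afterwards the GC never sees churn from these buffers
// because they are recycled rather than dropped.
sealed class BufferPool
{
    readonly Stack<float[]> free = new Stack<float[]>();
    readonly int bufferSize;

    public BufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    public float[] Rent()
    {
        lock (free)
            return free.Count > 0 ? free.Pop() : new float[bufferSize];
    }

    public void Return(float[] buffer)
    {
        lock (free) free.Push(buffer);  // caller must stop using it now
    }
}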



There will always be tiny pauses, but I can't see how, say, a ref-counting
GC that frequently releases will have issues; its behaviour would be
similar to a native C app.



ref-counting GCs are only usable if there are no cycles. Many, many
applications have cyclic data-structures that are not usable in a
refcounting system.



I expect this would also have subdivision of the Nursery for each
hardware-thread, to avoid locking and concurrency problems fighting over a
single top-of-heap-pointer.

Ø Nope, not needed. The top-of-heap pointer is updated via an Interlock,
so it's a single CPU instruction and won't really hold other threads up.

An interlock holds up other hardware threads! It may be a single
instruction, but that single instruction causes a coherency protocol to
occur on the processor bus. If several hardware threads are frequently doing
the interlock on the same value (because they are frequently allocating),
this can be a bottleneck.



It can be, but it is not much worse than existing systems; e.g. the CLR
has a multi-threaded allocator (except in this case more threads are in
contention). However, I don't think many threads will be allocating at the
same time; in fact, when an app is really active, all the threads of one
app are active, and hence it's not much different. It would be nice to do
some research to test the impact, e.g. use 4 Nurseries and load-balance
the threads on 2; however, my gut feel is the lightweight method will be
close to best.



Consider a video game: if the 2 Nurseries are large enough to hold all
temporary scene data, say a minute's worth, then you incur no collection cost

This doesn't make sense to me.



Scene data is rendered in a streaming fashion. The persistent data for the
entire scene is always in memory. The temporary data is created in a
pipeline to the driver, and the pieces only need to live for a short time. A
10MB nursery is probably large enough to avoid stalling the driver pipeline.
These temporary allocations and the Nursery have nothing to do with the GC
pause.



The GC pause occurs because eventually enough longer lived garbage is
generated. There are many reasons for this, one reason would be streaming
terrain, where the terrain loaded is dependent on where the user's character
is in the world. The reasons are not important, but eventually there will be
a full GC of the oldest generation. This contains the entire scene, and the
GC pause that occurs is too long for an interactive application.



What if there is a completely new scene (e.g. mobs, effects, particles)
before the Nursery is collected, or say 70% of the data (textures, meshes,
etc.) is changed? A much smaller piece will then end up in the Gen2
collection... Why would all the textures and meshes which would appear in a
10 MB Nursery be in the Gen2 collection if they are no longer used or
referenced by the time a 400 MB Nursery is collected? (Though obviously a
3D app is a special case and may just cache it all and keep permanent
references.)



Note I don't think 3D is a good fit for Mark and Sweep because of the
pauses, but it does show that with a large Nursery the pauses will be
smaller.



The "C" version of this program would be manually freeing these longer lived
pieces of garbage along the way, because it would have manually coded rules
about when to free things. (The C code effectively is using induction proofs
of reachability). It would never need to sweep the entire scene to determine
reachability, so it would never incur this pause.



Like I said, there are non-GC techniques available to a Managed OS too,
like self-managed objects. The catch is the objects have to be of the same
type/size, but Vertex Buffers and Materials fit nicely in here.

You're welcome to write our GCs... :) I'm focusing on the Scheduler
/Process/Shared Memory/Security and the MM, which is already far too much.

I'm very interested in GC. I've never written a collector, but I know quite
a bit about them. I may take a shot at it. Keep in mind that there are many
interactions between Shared Memory/Security, overall design, and the GC.



We're just starting... I will send my design docs over for comment in a
few weeks; you may want to start on the GC design. The really nice bit
about Cosmos is the ease of use; it will be a fantastic teaching tool. Hit
Debug from Visual Studio and you are debugging the OS; that is great.



One more thing: I don't think a strongly typed system needs to zero
memory? Any disagreements?

Memory will need to be either zeroed or assigned to a known valid value
before a constructor returns.



A strongly typed system that controls all code could do this entirely
during object construction instead of before/during allocation. This would
prevent redundant writes for values the constructor is going to fill in
anyhow. However, zeroing a large block in bulk is much faster than writing
individual values, so it may be less overhead simply to make sure the
allocator returns zeroed memory. It's certainly simpler.



Since all allocations are from the Nursery, zeroing is not a big deal and
can be done in the background. One 200 MB zero coming up.
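
A sketch of how that background zeroing might look (assumed design, not
existing Cosmos code, C#): once a nursery semispace has been evacuated, a
low-priority thread clears it so the bump allocator can hand out pre-zeroed
memory again.

using System;
using System.Threading;

// Sketch: clear an evacuated nursery off the hot path. Array.Clear is a
// bulk zeroing primitive, far cheaper per byte than field-by-field
// constructor writes.
static class NurseryZeroing
{
    public static Thread ZeroInBackground(byte[] evacuatedNursery)
    {
        var worker = new Thread(() =>
            Array.Clear(evacuatedNursery, 0, evacuatedNursery.Length))
        {
            IsBackground = true,
            Priority = ThreadPriority.Lowest  // stay off the app's path
        };
        worker.Start();
        return worker;  // the allocator must wait on this before reuse
    }
}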



Regards,



Ben
Royce Mitchell III
2009-09-25 07:57:46 UTC
Permalink
A couple thoughts.

If you only allow references to shared memory to be stored in TLS, then
you only have to stop a thread and not a whole process. Maybe this is a
solution to shared memory.
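
A tiny sketch of that restriction (hypothetical, using the CLR's
[ThreadStatic] as a stand-in for whatever the kernel would enforce, C#):
if the only legal home for a shared-memory reference is a thread-local
slot, scanning for such references means suspending one thread, not every
thread in the process.

using System;

// Sketch: references into shared memory may live only in thread-local
// storage, so each thread's roots can be scanned independently.
static class SharedMemoryRoots
{
    [ThreadStatic]
    static object currentRegion;  // never visible to other threads

    public static void Pin(object region) { currentRegion = region; }
    public static void Release() { currentRegion = null; }
}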

Also, perhaps it would be a good idea to implement a ref-counted GC until
the pausing problem is solved. This way you can make forward progress. Did
I understand correctly that ref counting negates all our performance gains?
But if it's at least not slower than a monolithic kernel, we've gained the
security, stability and hardware portability of a managed system without a
performance loss. Honestly, I don't care if it's no faster as long as we
gain the rest of those benefits.
Ben Kloosterman
2009-09-25 09:33:03 UTC
Permalink
Post by Royce Mitchell III
A couple thoughts.
If you only allow references to shared memory to be stored in TLS then
you only have to stop a thread and not a whole process. Maybe this is
a solution to shared memory.
Nice idea... Though in my case a process is a single thread, similar to
EROS/Coyotos (though it can share an allocator with other processes); hence
the part communicating with shared memory is normally a single thread
(unless multiple threads are needed for performance). I'm trying to
encourage devs to break their apps into pieces (not so much for the reasons
just discussed, but more for concurrency), though you can still make a
traditional app. Any reference created will be thread-local, so I don't
even need TLS.
Post by Royce Mitchell III
Also, perhaps it would be a good idea to implement a ref-counted GC until
the pausing problem is solved. This way you can make forward progress. Did
I understand correctly that ref counting negates all our performance gains?
But if it's at least not slower than a monolithic kernel, we've gained the
security, stability and hardware portability of a managed system without a
performance loss. Honestly, I don't care if it's no faster as long as we
gain the rest of those benefits.
Yes, the gains are wiped out. But you only have to use the ref-counting GC
where pausing causes an issue; other apps should use a concurrent
generational mark and sweep. Hence it won't have a major effect. Instead of
ref counting you can also self-manage your objects.

Regards,

Ben
Royce Mitchell III
2009-09-25 14:55:16 UTC
Permalink
Post by Ben Kloosterman
Post by Royce Mitchell III
A couple thoughts.
If you only allow references to shared memory to be stored in TLS then
you only have to stop a thread and not a whole process. Maybe this is
a solution to shared memory.
Nice idea... Though in my case a process is a single thread, similar to
EROS/Coyotos (though it can share an allocator with other processes); hence
the part communicating with shared memory is normally a single thread
(unless multiple threads are needed for performance). I'm trying to
encourage devs to break their apps into pieces (not so much for the reasons
just discussed, but more for concurrency), though you can still make a
traditional app. Any reference created will be thread-local, so I don't
even need TLS.
I think maybe you didn't understand what I was suggesting. If a driver
creates a thread for each process that communicates with it, then the whole
driver doesn't have to be stopped to run GC on behalf of a specific
process; you only have to stop the thread. The API could facilitate this by
registering a special entrypoint that gets called any time a new process
starts communicating with it. Since the thread should pretty much stay in a
blocked state, there shouldn't be any performance hit, though perhaps a
small hit on memory usage for thread state.
Post by Ben Kloosterman
Post by Royce Mitchell III
Also, perhaps it would be a good idea to implement a ref-counted gc
until the pausing problem is solved. This way you can make forward
progress. Did I understand correctly that ref counting negates all our
performance gains, but if it's at least not slower than a monolithic
kernel we've gained the security, stability and hardware portability
of a managed system without a performance loss. Honestly I don't care
if it's no faster as long as we gain the rest of those benefits.
Yes, the gains are wiped out. But you only have to use the ref-counting GC
where pausing causes an issue; other apps should use a concurrent
generational mark and sweep. Hence it won't have a major effect. Instead of
ref counting you can also self-manage your objects.
Self-managed objects would involve effort from the application
programmers, a definite step backwards. I'm just saying that ref-counting
is well understood, and would be a good choice to use by default until the
GC problem is understood/solved. Personally I don't mind an app/OS that's
perhaps a bit slower than it potentially can be, but one that experiences
1/4-second or longer stutters would drive me away after about the third one.
Post by Ben Kloosterman
Regards,
Ben
Ben Kloosterman
2009-09-25 22:46:51 UTC
Permalink
Post by Royce Mitchell III
A couple thoughts.
If you only allow references to shared memory to be stored in TLS then
you only have to stop a thread and not a whole process. Maybe this is
a solution to shared memory.
Nice idea... Though in my case a process is a single thread, similar to
EROS/Coyotos (though it can share an allocator with other processes); hence
the part communicating with shared memory is normally a single thread
(unless multiple threads are needed for performance). I'm trying to
encourage devs to break their apps into pieces (not so much for the reasons
just discussed, but more for concurrency), though you can still make a
traditional app. Any reference created will be thread-local, so I don't
even need TLS.



I think maybe you didn't understand what I was suggesting. If a driver
creates a thread for each process that communicates with it, then the whole
driver doesn't have to be stopped to run GC on behalf of a specific process,
you only have to stop the thread. The API could facilitate this by
registering a special entrypoint that gets called any time a new process
starts communicating with it. Since the thread should pretty much stay in a
blocked state, there shouldn't be any performance hit, though perhaps a
small hit on memory usage for thread state.





I understood alright; it's just that since processes are always a single
thread, that's already covered. Each driver (app) may have a number of
processes with different roles, some of which will be IPC; it's up to the
developer whether he wants 1 thread doing all IPC to different parts, 1
thread per IPC partner, or even multiple. Doing 1 thread per process you
communicate with is the most trivial. The only issue is we may need some
sort of expandable stack so threads don't burn through too much memory.
Post by Royce Mitchell III
Also, perhaps it would be a good idea to implement a ref-counted gc
until the pausing problem is solved. This way you can make forward
progress. Did I understand correctly that ref counting negates all our
performance gains, but if it's at least not slower than a monolithic
kernel we've gained the security, stability and hardware portability
of a managed system without a performance loss. Honestly I don't care
if it's no faster as long as we gain the rest of those benefits.
Yes, the gains are wiped out. But you only have to use the ref-counting GC
where pausing causes an issue; other apps should use a concurrent
generational mark and sweep. Hence it won't have a major effect. Instead of
ref counting you can also self-manage your objects.



self-managed objects would involve effort from the application programmers,
a definite step backwards.



This is only for areas where pauses are critical, and it is no different
from what you have to do in C, as malloc can also cause issues.



I'm just saying that ref-counting is well-understood, and would be a good
choice to use by default until the GC problem is understood/solved.
Personally I don't mind an app/OS that's perhaps a bit slower than it
potentially can be, but one that experiences 1/4-second or longer stutters
would drive me away after about the third one.



Absolutely, you would use ref counting first (which will do for most
drivers), but if you were writing a 3D lib you would be pretty mad not to
manage the vertex buffers yourself and avoid the pauses; even 10 ms is bad
for 3D.



Regards,



Ben
David Jeske
2009-09-25 18:22:13 UTC
Permalink
From this discussion... I like the idea of a large shared memory pool like
Singularity's, but using ref-counted immutable structures (which can only
point to other ref-counted immutable structures).
Aren't their shared memory objects only collected when the Shared Memory
is finished? What I was saying is you can leave this to the programmer,
which is rather awkward.
As far as I can tell from reading their papers, no. Their 'shared memory' is
never finished, it's a big shared pool that lives forever. A
shared-memory-object is only referenced by one SIP at a time. Even though it
lives in the shared-memory area, it is part of the GC process of the SIP
that owns it. In order to allow another SIP to see a shared-memory-object,
you must lose access to it. The shared-memory-object is freed when the SIP
GC says it's no longer reachable. When you 'destroy' a SIP, you need to both
free its memory block and all shared memory objects it owns.
A reference into Shared Memory can be easily handled, e.g. call a
different Allocator at compile time for object trees marked shared.
Multiple allocators for multiple reasons only solves the trivial
problem of where the object is placed.
It's trivial but important, as it will get collected if it's in the wrong place.
I don't understand what collection model you are thinking of when you make
this statement. Regardless of where you put it, GC can only collect things
which become non-reachable. Depending on where objects are, the GC may use
different techniques to determine reachability.
Sort of... I'm proposing different levels of Shared Memory, e.g. one
between trusted partners and a different one for untrusted. It will not
make the whole system unstable, only the SIPs which share that shared
memory piece.
I don't think I've seen enough detail to understand what model you are
proposing. Doesn't this mean you can't release the shared memory back to the
operating system? (because SIPs might be holding pointers to it?)
Even in concurrent GC it's necessary to stop all mutators for a period of
time. Then all mutators need to be scanned. If a driver may have a pointer
into shared memory, then it is a mutator and would need to be stopped during
any non-concurrent phases.
Again if it doesn’t hold the lock there is no need as there is no mutator.
Earlier you said these were voluntary locks. I understand how a
non-voluntary (i.e. compiler-generated) lock could work here. It seems like
this path might lead to deadlock problems (both GC and mutator are
acquiring locks, and get in a state where neither can continue).
Maybe I'm thinking of DirectX, where the DirectX lib does all that; once
you send it the data you forget about it in the user app, but the key graph
is only a scene.
What you are describing is called 'Direct3d Retained Mode'. It was hardly
ever used and was dropped after DirectX v3.0.
Lastly, you can probably use static memory for this and manage it
yourself; vertex buffers and most 3D things are done this way.
3d modelers use dynamically allocated structures. Old game engines did tend
to use statically allocated structures, but those days are gone. Octrees,
streaming terrain, and multiplayer networking structures are just a few of
the many techniques which require lots of dynamic allocation in modern 3d.

Also, this isn't limited to 3d, that was just one example. Lots of apps are
still written in non-garbage collected systems. It's not because people are
lazy, or bored, or don't care about this detail. One reason is that garbage
collectors create pause problems. We're getting much closer to solving these
problems. My point is that it's very important to be mindful of the issue
and acknowledge it.
Except you can only free your own app's pointers (the GC will check the mem range)
Earlier you were talking about explicit free of shared-memory-objects, which
I hope you now see is unsafe in the design space we're talking about. If you
allow apps to free their own pointers then you have to be careful about not
releasing any memory back to the OS while the app might have a pointer to
it. I think pervasive GC is a better route.
What if there is a completely new scene (e.g. mobs, effects, particles)
before the Nursery is collected, or say 70% of the data (textures, meshes,
etc.) is changed?
3d apps are dominated by geometry and texture data. Consider an app with
500MB of 'fairly stable' data. The data is fed to the 3d card every frame,
which generates Nursery allocations only for minor things. The
vertex-buffers, for example, are already in the 'fairly stable data', they
are just fed to the card with some new parameters each frame. Then, after
the user has stood in one spot for 2 minutes (long enough for all the
'fairly stable data' to get into the oldest-generation), then the user moves
somewhere else in the world. As they move, the system is streaming terrain
information in from disk. This causes 10MB of churn every minute for 20
minutes. Now we have 200MB of dead data in the oldest-generation. It needs
to be collected, and the only way to do it is a full-GC. This is where
low-pause GC is really important, and to-date commercial systems have
failed. Today's apps solve this by using manual allocators.

Don't make the mistake of thinking this is a 'special case in 3d apps'.
Applications which allocate large amounts of memory (in cache or otherwise)
tend to want to do something interesting with that data. If they have
interactivity requirements, then we must bound worst-case pause times, and
GC systems are not very good at this. It's a real interesting problem.

I like Singularity's design for Shared Memory and GC. It makes a SIP's
worst-case pause time related to the amount of memory it references. This
should allow low-level pieces such as drivers, filesystem, and networking to
have low worst-case pauses because they reference small amounts of memory
(shared or not).

A similar goal should be achievable by using a large shared-memory pool of
immutable refcounted subtrees. This would allow GC for a SIP to be
proportional to its non-shared memory. It would also allow large immutable
buffer caches to be seen by multiple SIPs. There will be a cost to the
refcounting, but for large subtrees this should be less than the cost of
copying. It might be practical to allow smaller data to be copied to avoid
the refcounting overhead.
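
A compact sketch of such a subtree (hypothetical API, C#): because the
nodes are immutable and therefore acyclic, plain reference counts are
sound, and releasing a root cascades down the tree without any cross-SIP
sweep.

using System.Threading;

// Sketch: a ref-counted immutable node for the shared pool. Children are
// counted up at construction and released recursively when a parent
// dies; immutability guarantees there are no cycles to defeat the counts.
sealed class SharedNode
{
    int refs = 1;                       // the creator holds the first ref
    public readonly byte[] Payload;
    public readonly SharedNode[] Children;

    public SharedNode(byte[] payload, SharedNode[] children)
    {
        Payload = payload;
        Children = children;
        foreach (var child in children) child.AddRef();
    }

    public void AddRef() { Interlocked.Increment(ref refs); }

    public void Release()
    {
        if (Interlocked.Decrement(ref refs) == 0)
        {
            foreach (var child in Children) child.Release();
            // at count zero the pool would reclaim Payload here
        }
    }
}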
Ben Kloosterman
2009-09-26 01:03:15 UTC
Permalink
From this discussion... I like the idea of a large shared memory pool like
Singularity's, but using ref-counted immutable structures (which can only
point to other ref-counted immutable structures).





Are the immutable structures really shared memory? (Should we call it
that? It's a funny concept; with Singularity-type systems it's trivial.)
They are VERY useful, but they allow only data to be sent, so they are
great for zero-copy IPC. I see no reason why they can't be created and
managed by the sender's GC; all the sender needs is a ref collection, and
when the receiver (or receivers) are finished he removes the reference
(you can count, or you can store multiple references, 1 against each
receiver).



I suspect we will still need some form of mutable structures for
back-and-forward comms? Comments? How do we handle editing or writing
files efficiently?



I must admit everything inter-AppDomain being immutable is attractive, but
is it practical? I think in most cases the immutable objects will just be
wrapper objects for internal data.



Regards,



Ben
Ben Kloosterman
2009-09-26 06:05:59 UTC
Permalink
Aren't their shared memory objects only collected when the Shared Memory
is finished? What I was saying is you can leave this to the programmer,
which is rather awkward.

As far as I can tell from reading their papers, no. Their 'shared memory' is
never finished, it's a big shared pool that lives forever. A
shared-memory-object is only referenced by one SIP at a time. Even though it
lives in the shared-memory area, it is part of the GC process of the SIP
that owns it. In order to allow another SIP to see a shared-memory-object,
you must lose access to it. The shared-memory-object is freed when the SIP
GC says it's no longer reachable.



No, this isn't entirely correct. They break the shared memory into regions
with a single owner; this region is attached to different SIPs as it goes
along (1 at a time). Each SIP can alter data in the region, but once the
last SIP relinquishes ownership without transferring it, it is finished.



When you 'destroy' a SIP, you need to both free its memory block and all
shared memory objects it owns.



Correct. However, if the owner SIP is destroyed it can't transfer
ownership, and you simply destroy the region.





A reference into Shared Memory can be easily handled, e.g. call a
different Allocator at compile time for object trees marked shared.

Multiple allocators for multiple reasons only solves the trivial problem
of where the object is placed.

It's trivial but important, as it will get collected if it's in the wrong place.

I don't understand what collection model you are thinking of when you make
this statement. Regardless of where you put it, GC can only collect things
which become non-reachable. Depending on where objects are, the GC may use
different techniques to determine reachability.



I was stating exactly that. If a reference is null or unavailable (let's
say a single ref in shared memory to an object managed by the process's
GC), the GC will collect the object. (The reverse is also possible if the
Shared mem has its own GC.) So I am just stating it's important that
things end up in the right place.



Sort of... I'm proposing different levels of Shared Memory, e.g. one
between trusted partners and a different one for untrusted. It will not
make the whole system unstable, only the SIPs which share that shared
memory piece.

I don't think I've seen enough detail to understand what model you are
proposing. Doesn't this mean you can't release the shared memory back to the
operating system? (because SIPs might be holding pointers to it?)



I will detail it, but yes, through indirect pointers that go to null when
they don't have a lock. Hence it can safely be released. Note this is in
addition to our proposed Immutable solution.



Even in concurrent GC it's necessary to stop all mutators for a period of
time. Then all mutators need to be scanned. If a driver may have a pointer
into shared memory, then it is a mutator and would need to be stopped during
any non-concurrent phases.

Again if it doesn't hold the lock there is no need as there is no mutator.

Earlier you said these were voluntary locks. I understand how a
non-voluntary (i.e. compiler generated) lock could work here. It seems like
this path might lead to deadlock problems. (both GC and mutator are
acquiring locks, and get in a state where neither can continue)



I was thinking voluntary between trusted and involuntary between
non-trusted. We need a far more detailed design before considering the
deadlocks.



Lastly, you can probably use static memory for this and manage it
yourself; vertex buffers and most 3D things are done this way.

3d modelers use dynamically allocated structures. Old game engines did tend
to use statically allocated structures, but those days are gone. Octrees,
streaming terrain, and multiplayer networking structures are just a few of
the many techniques which require lots of dynamic allocation in modern 3d.



True at the higher levels, but many things are still static, e.g. vertex
buffers. Note you can combine self-managed static structures, such as
buffers, and objects into a dynamic structure.



Earlier you were talking about explicit free of shared-memory-objects, which
I hope you now see is unsafe in the design space we're talking about. If you
allow apps to free their own pointers then you have to be careful about not
releasing any memory back to the OS while the app might have a pointer to
it. I think pervasive GC is a better route.



It was always a last resort, something I don't think we will be doing.

What if there is a completely new scene (e.g. mobs, effects, particles)
before the nursery is collected, or say 70% of the data (textures,
meshes, etc.) is changed?

3d apps are dominated by geometry and texture data. Consider an app with
500MB of 'fairly stable' data. The data is fed to the 3d card every
frame, which generates nursery allocations only for minor things. The
vertex-buffers, for example, are already in the 'fairly stable' data;
they are just fed to the card with some new parameters each frame. Then,
after the user has stood in one spot for 2 minutes (long enough for all
the 'fairly stable' data to get into the oldest generation), the user
moves somewhere else in the world. As they move, the system streams
terrain information in from disk. This causes 10MB of churn every minute
for 20 minutes. Now we have 200MB of dead data in the oldest generation.
It needs to be collected, and the only way to do it is a full GC. This is
where low-pause GC is really important, and to date commercial systems
have failed. Today's apps solve this by using manual allocators.



Don't make the mistake of thinking this is a 'special case in 3d apps'.
Applications which allocate large amounts of memory (in cache or
otherwise) tend to want to do something interesting with that data. If
they have interactivity requirements, then we must bound worst-case pause
times, and GC systems are not very good at this. It's a really
interesting problem.



Agree with this, though I do think 3D is a bad example. Many things will
benefit from a large nursery, based on research stating 85% of objects
are dead by the time you allocate 1 MB.



I was thinking of things like caches. These are often self-managed static
data even in the unmanaged world; let's say buffers. Buffers are no
issue, as the GC does not manage them.



Obviously object caches are different. Since GCs find managing long-term
objects trivial, most of the optimizations like generations/nursery are
for short-term objects. So for long-term objects, why don't we mark them
with an attribute, so that the GC allocates them immediately in Gen 2 and
packs them? (You could even get very smart about the repacking.) At least
you save the gen0-to-gen1-to-gen2 sweep costs. This way, since marks are
concurrent, you avoid most of the expensive sweeps and pauses. Developers
don't have to use it, but it helps.
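
A rough sketch of the attribute idea ([LongLived] and KernelGc.New are
both hypothetical; today's CLR offers no way to pretenure straight into
Gen 2):

using System;

[AttributeUsage(AttributeTargets.Class)]
public sealed class LongLivedAttribute : Attribute { }

[LongLived]
public sealed class TextureCacheEntry
{
    public byte[] Pixels = new byte[4096];
}

public static class KernelGc
{
    public static T New<T>() where T : class, new()
    {
        // A real GC would bump-allocate marked types in the old generation,
        // skipping the gen0 -> gen1 -> gen2 promotion copies entirely.
        if (Attribute.IsDefined(typeof(T), typeof(LongLivedAttribute)))
            return NewInGen2<T>();
        return new T();
    }

    private static T NewInGen2<T>() where T : class, new()
    {
        return new T(); // placeholder for old-generation allocation
    }
}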





I like Singularity's design for shared memory and GC. It makes a SIP's
worst-case pause time related to the amount of memory it references. This
should allow low-level pieces such as drivers, the filesystem, and
networking to have low worst-case pauses, because they reference small
amounts of memory (shared or not).



A similar goal should be achievable by using a large shared-memory pool
of immutable refcounted subtrees. This would allow GC for a SIP to be
proportional to its non-shared memory. It would also allow large
immutable buffer caches to be seen by multiple SIPs. There will be a cost
to the refcounting, but for large subtrees this should be less than the
cost of copying. It might be practical to allow smaller data to be copied
to avoid the refcounting overhead.
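
A minimal sketch of what one refcounted immutable subtree could look
like (SharedTree, AddRef, and Release are invented names; a real shared
pool would also return the freed memory to the region):

using System.Threading;

public sealed class SharedTree
{
    public readonly byte[] Payload;        // immutable after construction
    public readonly SharedTree[] Children;
    private int refCount = 1;              // the creator holds the first ref

    public SharedTree(byte[] payload, SharedTree[] children)
    {
        Payload = payload;
        Children = children ?? new SharedTree[0];
    }

    // Each SIP that receives the subtree takes a reference.
    public void AddRef() { Interlocked.Increment(ref refCount); }

    // When the count hits zero the subtree (and its children) can go
    // back to the shared pool, independent of any SIP's GC.
    public void Release()
    {
        if (Interlocked.Decrement(ref refCount) == 0)
        {
            foreach (SharedTree child in Children)
                child.Release();
        }
    }
}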



Why not just deliver the reference (to the immutable object) to other
SIPs? Why place it in a different pool? It would be far simpler and more
elegant to manage it from the sender's GC. You can remove the dummy
reference when the receiver is dead (the IPC channel closes). Is the
concern that these structures will be that big? I see the issue that the
reference must be pinned and can't be moved by the GC (maybe mark it as
long-lived, see above, or treat it like large objects, which are always
pinned). From a programming perspective these immutable objects can be
just normal objects, and only when you need to share do you send the
reference.



The large disk cache buffer is covered, as the entire buffer array sits
in a SIP and it just hands out pointers to its internal data.



Thanks for the comments, btw; these are difficult issues and critical to
the way the system will work.



Regards,



Ben
David Jeske
2009-09-26 07:27:46 UTC
Permalink
Post by Ben Kloosterman
Why not just deliver the reference (to the immutable object) to other
SIPs? Why place it in a different pool? It would be far simpler and more
elegant to manage it from the sender's GC.
Even if it is in the sender's pool, it won't be managed by the sender's
GC as soon as someone else references it.

I'm assuming the SIPs are going to have a compacting GC. If these
immutable structures are refcounted and referenced by other SIPs, it may
be impractical to move/compact them, so they would be in the way, causing
problems. Further, their liveness would need to be unrelated to the SIP's
memory. If the SIP is destroyed, these immutable structures still need to
live as long as anyone has references to them. For these reasons, I don't
see why these immutable objects would ever be created in a SIP's GC pool.
They would only ever be allocated and managed in the shared memory
region.
Post by Ben Kloosterman
You can remove the dummy reference when the receiver is dead (the IPC
channel closes).
Not if the other SIP is using a pointer to the memory. Before shared
objects can be reclaimed we have to prove nobody holds their pointers
anymore. This is the primary issue. We can't allow memory which
instructions might be touching to be reused.
Post by Ben Kloosterman
Is the concern that these structures will be that big?
This is not the primary concern, though I do expect they may often be big
(2-100MB). If they are always small, then there is no reason for
shared-anything as they could just be copied faster than all of this sharing
machinery.
Ben Kloosterman
2009-09-26 08:56:28 UTC
Permalink
Why not just deliver the reference (to the immutable object) to other
SIPs? Why place it in a different pool? It would be far simpler and more
elegant to manage it from the sender's GC.

Even if it is in the sender's pool, it won't be managed by the sender's
GC as soon as someone else references it.



A dummy reference can be used for the ref count.



I'm assuming the SIPs are going to have a compacting GC. If these immutable
structures are refcounted and referenced by other SIPs, it may be
impractical to move/compact them so they would be in the way causing
problems.



I think I said somewhere that they need to be fixed (pinned); most GCs
have fixed objects, e.g. large objects. So that's not a major issue.



Further, their liveness would need to be unrelated to the SIP's memory.
If the SIP is destroyed, these immutable structures still need to live as
long as anyone has references to them.

There is always the option of terminating all receivers; that's pretty
much what has to happen with shared mem at the moment, as the sender may
have corrupted it. Or, even better, freeze the receivers, scan them for
any reference within the bounds of the crashed proc's GC, and replace the
ref with a ref to null. I don't see an issue if any receiver is
terminated (except that the ref count needs to be updated), only if the
sender is. The fact that the data is immutable means it would normally be
used only for short periods anyway.



Because of these reasons, I don't see why these immutable objects would ever
be created in a SIP's GC pool. They would only ever be allocated and managed
in the shared memory region.

While theoretically there is no issue, there are a number of practical
issues in how to do this. While complete independence is attractive, I'm
not sure it's worth it (increased complexity, more complex API). Maybe
there is a case for both.



You can remove the dummy reference when the receiver is dead (the IPC
channel closes).

Not if the other SIP is using a pointer to the memory.

How can it be using a pointer if it has been killed?



Before shared objects can be reclaimed we have to prove nobody has their
pointers anymore. This is the primary issue. We can't allow memory which
instructions might be touching to be reused.

Is the concern that these structures will be that big ?

This is not the primary concern, though I do expect they may often be big
(2-100MB). If they are always small, then there is no reason for
shared-anything as they could just be copied faster than all of this sharing
machinery.



Sending references directly will be much faster than copying, especially
for things like a few buffers. I just can't see the value of large blobs
of immutable data, but this may be a game-changing thing; do any systems
out there have large amounts of immutable data? (Note the alternative
method to handle this.)



Regards,



Ben
Ben Kloosterman
2009-09-25 14:08:59 UTC
Permalink
OK, thinking this through further:



Immutable objects would be great for IPC and parallelism.



OK, I don't see how any class with public methods or any delegate fields
can ever be tested for being immutable (even though it might be).



That said, we can create an IImmutable interface and check types at
IL->x86 compile time. A type qualifies if it has:

No public methods

No delegate fields

No public set properties

No public fields

If so, it's immutable. If it has the interface and the above does not
hold, abort the build. This is quite useful as it reinforces the idea of
sending data, not references, between applications.
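
A sketch of that check, expressed with runtime reflection for
readability (the real test would run in the IL->x86 compiler, and
IImmutable is the proposed marker, not an existing interface):

using System;
using System.Linq;
using System.Reflection;

public interface IImmutable { }

public static class ImmutabilityCheck
{
    // True when a type meets the four rules above. Property getters are
    // compiler-generated special-name methods, so they are not counted
    // as 'public methods'.
    public static bool IsVerifiablyImmutable(Type t)
    {
        bool noPublicMethods = !t
            .GetMethods(BindingFlags.Public | BindingFlags.Instance |
                        BindingFlags.DeclaredOnly)
            .Any(m => !m.IsSpecialName);
        bool noDelegateFields = !t
            .GetFields(BindingFlags.Public | BindingFlags.NonPublic |
                       BindingFlags.Instance)
            .Any(f => typeof(Delegate).IsAssignableFrom(f.FieldType));
        bool noPublicSetters = !t
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Any(p => p.GetSetMethod() != null);
        bool noPublicFields =
            t.GetFields(BindingFlags.Public | BindingFlags.Instance).Length == 0;

        return noPublicMethods && noDelegateFields &&
               noPublicSetters && noPublicFields;
    }
}

A type that declares IImmutable but fails this test would abort the
build.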



We can also hack string to support IImmutable, and all value types (via
copy).



So: void SendReference(IImmutable root);



Note we probably still need some form of real shared memory, but the
above will do for most uses and is very high-performance and reliable.



Collections would be IImmutable[].



So a buffer cache would be a process, not shared memory. The device
driver gets the HAL to write via DMA directly into the buffers. When the
file system needs data, the buffer cache just constructs some immutable
wrappers and sends them to the file system, which in turn creates an
immutable wrapper over the top for someone to read. Writing is done in
reverse. (It works great for reads; I'm not sure whether this will work
well for writes.)
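
A rough sketch of that read path (ReadOnlyBuffer and BufferCache are
invented names; note, as discussed later in this thread, the wrapper
alone does not prove the underlying bytes stop changing):

public sealed class ReadOnlyBuffer
{
    private readonly byte[] data;
    public ReadOnlyBuffer(byte[] data) { this.data = data; }
    public byte this[int i] { get { return data[i]; } } // get-only view
    public int Length { get { return data.Length; } }
}

public sealed class BufferCache
{
    private readonly byte[][] buffers;

    public BufferCache(int count, int size)
    {
        buffers = new byte[count][];
        for (int i = 0; i < count; i++) buffers[i] = new byte[size];
    }

    // The driver writes into the raw array (stand-in for DMA).
    public byte[] RawBufferForDma(int n) { return buffers[n]; }

    // The file system only ever sees the immutable wrapper.
    public ReadOnlyBuffer WrapForReader(int n)
    {
        return new ReadOnlyBuffer(buffers[n]);
    }
}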



Regards,



Ben
Royce Mitchell III
2009-09-25 15:00:52 UTC
Permalink
One possible solution for immutable objects would be to have a ULONG_PTR
in each object that contains either:
0: root object, mutable
1: root object, immutable
(any other value): non-root object, pointer to the root object's
mutability flag.

In other words, any root object has a 0 or 1 indicating whether it's
mutable. For any sub-reference, the value is a pointer to the root's
mutability flag. Checking an object's mutability would be something like
this pseudo-code:

bool is_immutable ( Object* o )
{
    ULONG_PTR immutable = o->immutable;
    if ( immutable > 1 )              // non-root: follow the pointer
        immutable = *(ULONG_PTR*)immutable;
    return (bool)immutable;           // 0 = mutable, 1 = immutable
}
Matthijs ter Woord
2009-09-27 15:07:34 UTC
Permalink
Ben, forget about strings, they are always immutable in .net
Ben Kloosterman
2009-09-28 00:40:50 UTC
Permalink
Ben, forget about strings, they are always immutable in .net



With strings (and basic value types) we have to get it working with the
following API:

void SendReference(IImmutable root);

We don't want to test

is IImmutable || is string

all the time. It should be possible, though; we could add IsImmutable to
the Object base class.


Regards,

Ben
Matthijs ter Woord
2009-09-28 09:57:49 UTC
Permalink
Won't be possible: we use the C# compiler to do C# -> MSIL, so we would
need to compile against a changed mscorlib to get this working, which IMO
is not wanted.
Ben Kloosterman
2009-09-28 10:52:14 UTC
Permalink
Why can't we go: if type == string, add the XYZ interface?



Matthijs ter Woord
2009-09-28 11:28:03 UTC
Permalink
Compilation of your C# project is as follows:

1: compile your C# file(s) using Microsoft's compiler, which uses
System.String as the string type, which is not modified by us
2: compile the .dll/.exe (and related files) to binary
3: make the iso/pxe/etc to boot

Step 1 is NOT modifiable, unless we want to make a custom build of Mono's
mscorlib, but this will have other (HUGE) implications.
David Jeske
2009-09-28 16:25:39 UTC
Permalink
Passing small arguments like individual strings can probably be handled just
by copying. The challenge is larger trees of data, such as a large directory
listing, or large blocks of data, such as disk or network packets.
During this immutable discussion, I had envisioned a specific assembly of
'freezable types' created specifically for this purpose. For example, a
"FreezableByteArray" that would be used by a low-level DMA buffer and
then handed off. Something like:

fba = FreezableByteArray.allocFromSharedMemory(SIZE);
{
    byte* fba_ptr = fba.pinned_pointer_for_dma(); // pinned while DMA runs
    driver_dma_into(fba_ptr);
}
fba.freeze();  // makes it immutable
return fba;    // returning to another SIP
Ben Kloosterman
2009-09-29 05:47:02 UTC
Permalink
Passing small arguments like individual strings can probably be handled just
by copying. The challenge is larger trees of data, such as a large directory
listing, or large blocks of data, such as disk or network packets.

Strings can be pretty big, like a web page. We really should use UTF-8
strings rather than UTF-16, though.



During this immutable discussion, I had envisioned a specific assembly of
'freezable types' created specifically for this purpose. For example, a
"FreezableByteArray" that would be used by a low-level DMA buffer and
then handed off. Something like:

fba = FreezableByteArray.allocFromSharedMemory(SIZE);
{
    byte* fba_ptr = fba.pinned_pointer_for_dma(); // pinned while DMA runs
    driver_dma_into(fba_ptr);
}
fba.freeze();  // makes it immutable
return fba;    // returning to another SIP



We will probably create some of these, but we can still support
user-created types. Is there any reason for the fba.freeze()? Like
string: if all properties are public get-only and no fields are exposed,
mutation is impossible.



Regards,



Ben
David Jeske
2009-09-29 17:15:49 UTC
Permalink
Post by Ben Kloosterman
We will probably create some of these, but we can still support
user-created types. Is there any reason for the fba.freeze()? Like
string: if all properties are public get-only and no fields are exposed,
mutation is impossible.
Consider how such an object is created. For example, a large 10MB string
of a webpage does not "materialize out of thin air". The bytes get there
by being copied from a mutable byte[]. If this is the model for
everything (immutable on initialization), then initializing all these
immutable types will require unnecessary data copies. Consider two
alternatives for initializing a data-buffer:

option A) immutable on allocation:

byte[] rawdata = new byte[SIZE];
fetch_data(rawdata);
ima = new ImmutableArray(rawdata); // this has to copy the entire data
into the immutable array
return ima;

option B) freezing model:

ima = new ImmutableArray(SIZE);
fetch_data(ima.underlyingbytes); // data is put directly into the array,
no copy necessary
ima.freeze();
return ima;
Ben Kloosterman
2009-09-30 13:27:06 UTC
Permalink
On Mon, Sep 28, 2009 at 10:47 PM, Ben Kloosterman <bklooste-***@public.gmane.org>
wrote:

We will probably create some of these, but we can still support
user-created types. Is there any reason for the fba.freeze()? Like
string: if all properties are public get-only and no fields are exposed,
mutation is impossible.

Consider how such an object is created. For example, a large 10MB string of
a webpage does not "materialize out of thin air". The bytes get there by
being copied from a mutable byte[].



Yep the byte[] will need to be encoded.



If this is the model for everything (immutable on initialization), then
initializing all these immutable types will require unnecessary data
copies. Consider two alternatives for initializing a data-buffer:

option A) immutable on allocation:

byte[] rawdata = new byte[SIZE];

fetch_data(rawdata);

ima = new ImmutableArray(rawdata); // has to copy the entire data into
the immutable array

return ima;



Why not just wrap the data with a read-only wrapper and send that?



Eg



byte[][] buffers = new byte[NUM_BUFFERS][]; // each entry BUFFER_SIZE bytes

ByteReader GetBufferPointer(uint bufNumber)
{
    byte[] buff = buffers[bufNumber];
    return new ReadOnlyByteReader(buff); // read-only view over the live buffer
}





Ahh, I see; you are theoretically correct, because the BufferManager
needs to mark the buffer in use. So logically, if you mark the buffer in
use beforehand, it's equivalent.



Regards,



Ben
David Jeske
2009-09-30 15:12:00 UTC
Permalink
Post by David Jeske
byte[] rawdata = new byte[SIZE];
fetch_data(rawdata);
ima = new ImmutableArray(rawdata); // has to copy the entire data into
the immutable array
return ima;
Post by Ben Kloosterman
Why not just wrap the data with a read-only wrapper and send that?
How do you reasonably prove that the underlying data will not change? If you
wrap the underlying byte[], then code can still change values in the byte[]
after the data is sent.

Post by Ben Kloosterman
Ahh, I see; you are theoretically correct, because the BufferManager
needs to mark the buffer in use. So logically, if you mark the buffer in
use beforehand, it's equivalent.
You seem to trust the programmer to make sure he doesn't modify something
after it's sent. I'm not so trusting.
Matthijs ter Woord
2009-09-30 16:24:32 UTC
Permalink
Post by Ben Kloosterman
Ahh, I see; you are theoretically correct, because the BufferManager
needs to mark the buffer in use. So logically, if you mark the buffer in
use beforehand, it's equivalent.
Post by David Jeske
You seem to trust the programmer to make sure he doesn't modify something
after it's sent. I'm not so trusting.
Is this a form of self knowledge? :-) (no offense)
David Jeske
2009-09-30 22:32:30 UTC
Permalink
On Wed, Sep 30, 2009 at 9:24 AM, Matthijs ter Woord wrote:
Post by Matthijs ter Woord
Is this a form of self knowledge? :-) (no offense)
It's realism from years of software engineering management. You can
discipline yourself to do the 'right thing'. You might be able to
discipline a small team. It's very hard to discipline a large team. An
external community of programmers will do anything they can. I have a
rule: if you give programmers two ways to do something, one right and one
wrong, they will invent a third wrong way. Therefore, if something needs
to be done a certain way, enforce it. It's similar to the XP test-first
philosophy: if something isn't tested, it doesn't work.
Ben Kloosterman
2009-10-01 02:55:20 UTC
Permalink
However, you're trusting the sender to set it as immutable, though the
receiver can check.



Note an array will never pass the immutable test, so my example was
wrong. (A public set indexer will fail the test.)



ima = new ImmutableArray(SIZE);

fetch_data(ima.underlyingbytes); // data is put directly into the array,
no copy necessary

ima.freeze();

return ima;



Excluding DMA, how do we actually retain the flexibility of passing a
reference into the constructor to create a read-only object while
allowing the original to be altered? I think this is what you are
suggesting with the freeze.







Eg





public class ImmutableArray<T> : IImmutable
{
    // Declare an array to store the data elements.
    private T[] arr = null;

    public ImmutableArray(T[] data)
    {
        arr = data;
    }

    public T this[int i]
    {
        get
        {
            return arr[i];
        }
        // A set accessor could be supported if we allow unlock, but that
        // behaviour could not be verified and would need this to be a
        // system assembly.
    }

    public bool Freeze()
    {
        return true; // not needed: instances are born immutable
    }

    public bool IsFrozen()
    {
        return true;
    }
}





While the above class is immutable, the data is not (unless it's read in
via DMA). Excluding the DMA case, the original can always be altered
because there may be more references to the data.



Unless you:

A) Guarantee the source is IImmutable

B) Copy the data

C) Read it via DMA

D) Guarantee there is no pointer in the heap to the original, i.e. only 1
pointer to the data in the system (this gets back to the Singularity
idea).

e.g.



var ints = new int[10];
// ... fill ints with data ...

var immutableInts = new ImmutableArray<int>(ints);



// this is safe provided you can prove the original data cannot be altered







public ImmutableArray(T[] data)
{
    if (data is IImmutableArray /* || there are no heap references to data */)
        arr = data;
    else
        arr = (T[])data.Clone();
}





I don't think it's possible to prove cost-effectively that there are no
heap references, at least for non-refcounting GCs. You also need to set
the passed-in pointer to null after loading the data (this is possible
with some system assemblies).



The above will work but would limit the value of an existing runtime. In
itself this is not too bad, as capabilities will require significant
changes anyway: you can use a newer, faster, more secure lib, or use the
old .NET lib and run less secure/slower.



Once the data is loaded you can pass it around with no issues.



You could have more flexible system types that implement IImmutable and
bypass the no-public-set-methods/fields check. These will be trusted.







Regards,



Ben











David Jeske
2009-10-01 08:25:41 UTC
Permalink
Post by Ben Kloosterman
Excluding DMA, how do we actually retain the flexibility of passing a
reference into the constructor to create a read-only object while
allowing the original to be altered? I think this is what you are
suggesting with the freeze.
My example did not include the ability to pass in a reference to the
underlying data. I was anticipating the actual array would be hidden inside
the ImmutableArray and classes outside would not get a reference to it. My
hope was for DMA to have special support to interact with the underlying
data in an ImmutableArray without leaking the internal handle out. At some
level this would be necessary anyhow given that DMA would need a pinned
pointer at a valid PCI address target. My hope is that the DMA details could
be contained in the Immutable classes. More like:

class ImmutableArray<T> {
    private T[] data;
    private bool isFrozen = false;
    public ImmutableArray(int size) { data = new T[size]; }
    public T this[int i] {
        get { return data[i]; }
        set {
            if (isFrozen) throw new ImmutableException();
            data[i] = value;
        }
    }
    public void Freeze() { isFrozen = true; }
    // Only accepts the special internal class 'DmaHandler'.
    public void SetupDma(DmaHandler hndl) { /* ... */ }
}

I have no idea how Singularity implements their 'single pointer to data'
invariant, but from what I read it sounds different than this. I interpreted
Singularity to be trying to assure only a single SIP has a handle to a block
of 'shared' memory, handing that pointer to another SIP means losing the
pointer yourself. The scheme above is intended to allow multiple SIPs to
hold onto the same datablock, since the data underneath is immutable.

One motivation for the above is to support zero-copy data transfer from DMA
to cache to application code as this is a bottleneck for high-throughput
systems. I recognize this has tradeoffs. Hiding the real bytes inside the
class means non-DMA transfers incur a penalty, either the overhead of the
setter, or the overhead of copying out of another T[]. This cost could be
avoided by allowing someone outside to see the raw T[] and expecting them to
"be careful" to use it properly.

For structs, the readonly keyword would be used to ensure immutability
after construction, and some IL verification would assure all
ImmutableStruct subclasses had only readonly fields.
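
For example, a struct like the following would pass such verification;
every field is readonly, so it cannot change after construction
(ImmutablePoint is just an illustration):

public struct ImmutablePoint
{
    public readonly int X;
    public readonly int Y;
    public ImmutablePoint(int x, int y) { X = x; Y = y; }
}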

... I understand this is an abstract design discussion. There may be other
constraints that actually steer the design when dealing with the details. I
hope it's more helpful than distracting.

I'd like to add that since thinking about some of the ideas from this list,
I see the lure of possible performance advantages of a software-isolation
system. I don't think it's obvious what all the tradeoffs might be,
potential GC challenges or otherwise, but I see the lure.
Matthijs ter Woord
2009-10-01 08:45:11 UTC
Permalink
I think in this discussion we're missing the possibility of doing
compiler magic: we're assuming here that a public setter like

public T this[int index] { set { if (isFrozen) throw ...; data[index] = value; } }

incurs a method call. It does currently, but we could (and likely will)
do inlining and other sorts of stuff.
Ben Kloosterman
2009-10-01 11:43:27 UTC
Permalink
My example did not include the ability to pass in a reference to the
underlying data. I was anticipating the actual array would be hidden inside
the ImmutableArray and classes outside would not get a reference to it. My
hope was for DMA to have special support to interact with the underlying
data in an ImmutableArray without leaking the internal handle out. At some
level this would be necessary anyhow given that DMA would need a pinned
pointer at a valid PCI address target. My hope is that the DMA details could
be contained in the Immutable classes. More like:



DMA will work, and that does help high-performance systems; I'm just
concerned how well the rest will work. Having immutable IPC is very
potent, as we also remove locking; obviously, as core count goes up,
there is less degradation with fewer locks.



class ImmutableArray<T> {
    private T[] data;
    private bool isFrozen = false;
    public ImmutableArray(int size) { data = new T[size]; }
    public T this[int i] {
        get { return data[i]; }
        set {
            if (isFrozen) throw new ImmutableException();
            data[i] = value;
        }
    }
    public void Freeze() { isFrozen = true; }
    // Only accepts the special internal class 'DmaHandler'.
    public void SetupDma(DmaHandler hndl) { /* ... */ }
}





Why even have the set accessor?



If it can change back to mutable, you are trusting the source. If not,
the constructor is fine. We still want an interface.





I have no idea how Singularity implements their 'single pointer to data'
invariant, but from what I read it sounds different than this. I interpreted
Singularity to be trying to assure only a single SIP has a handle to a block
of 'shared' memory, handing that pointer to another SIP means losing the
pointer yourself. The scheme above is intended to allow multiple SIPs to
hold onto the same datablock, since the data underneath is immutable.



It's different, but I find it interesting that the same issues and
solutions keep popping up in different ways.



One motivation for the above is to support zero-copy data transfer from DMA
to cache to application code as this is a bottleneck for high-throughput
systems. I recognize this has tradeoffs. Hiding the real bytes inside the
class means non-DMA transfers incur a penalty, either the overhead of the
setter, or the overhead of copying out of another T[]. This cost could be
avoided by allowing someone outside to see the raw T[] and expecting them to
"be careful" to use it properly.



For structs, the readonly keyword would be used to ensure immutability
after construction, and some IL verification would assure all
ImmutableStruct subclasses had only readonly fields.



Obviously the system types will work (and don't need to be verified).

Yes, we can verify user objects and structs (only public readonly fields,
no public set properties, no methods).



... I understand this is an abstract design discussion. There may be other
constraints that actually steer the design when dealing with the details. I
hope it's more helpful than distracting.



It's very helpful. I think I will go with 3 levels of IPC (one of them is
only inter-app, though). This one looks like the best candidate for IPC.



I'd like to add that since thinking about some of the ideas from this list,
I see the lure of possible performance advantages of a software-isolation
system. I don't think it's obvious what all the tradeoffs might be,
potential GC challenges or otherwise, but I see the lure.



This is the exact trade-off we want, which Minix and the other
microkernels could not make: we have the opportunity of greater
performance, and we can trade it for reliability and security gains.

No one knows all the trade-offs except M (and even they would only have a
basic idea). Again, the goal is to get Linux-like performance (not
better) but a real nice TDD C# kernel that's easy to maintain and runs
applications securely and reliably.



Regards,



Ben
Ben Kloosterman
2009-10-02 13:59:04 UTC
Permalink
One more thing: what about an immutable object like a
ForwardStreamReader? The object is immutable, but the Read method can act
on a buffer which belongs to the sender. Since no data is exposed, and
the creator of the object trusts the object to be safe (maybe via a
lock), I see no reason not to allow them.
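
A minimal sketch of the idea (ForwardStreamReader here is an invented
shape, not an existing type): the reader exposes no fields, never writes
to the sender's buffer, and only copies bytes out:

using System;

public sealed class ForwardStreamReader
{
    private readonly byte[] senderBuffer;  // still owned by the sending SIP
    private int position;                  // reader-local cursor, private

    public ForwardStreamReader(byte[] senderBuffer)
    {
        this.senderBuffer = senderBuffer;
    }

    // Forward-only read; never mutates the sender's buffer.
    public int Read(byte[] destination, int count)
    {
        int n = Math.Min(count, senderBuffer.Length - position);
        Array.Copy(senderBuffer, position, destination, 0, n);
        position += n;
        return n;
    }
}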



Comments ?



Also



I'd like to add that since thinking about some of the ideas from this list,
I see the lure of possible performance advantages of a software-isolation
system. I don't think it's obvious what all the tradeoffs might be,
potential GC challenges or otherwise, but I see the lure.



The lack of research is dispiriting; there are probably 20 papers a year
on improving Linux locking or LRU algorithms. If we build it, we should
publish a few papers, for good or bad.



Regards,



Ben






Chad Z. Hower
2009-09-23 12:07:21 UTC
Permalink
Post by David Jeske
This is very different than what a VM MMU/TLB system does. Swapping an
entire process has some pretty limiting degenerate cases. For example,
applications with very large datasets are likely to have them in a
single large heap, and require paging subsections if another application
needs memory.
We have the option to "swap" per object.
Post by David Jeske
Further, one of the benefits of paging is that currently unused portions
of code or data can be removed from memory without actually stopping the
We can do this without using hardware paging. We can decide to just
store assemblies, or even discard and reload them, thus saving the time
to write them out. Same for read-only data.
Post by David Jeske
solution might produce VM-paging functionality, but you're discussing a
system with more limited capabilities than typical VM-paging. I
understand now.
Actually, I think our proposals are less limited, not more limited. We
could even turn an in-memory data block into disk, and just redirect IO
to disk without the caller ever knowing, and if it gets active, reload it
back to RAM. HSM for memory. VM can't do that.
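
A rough sketch of that redirection, using ordinary file I/O as a
stand-in for the pager's backing store (SwappableBlock is an invented
name):

using System.IO;

public sealed class SwappableBlock
{
    private byte[] data;          // null while swapped out
    private string backingFile;   // non-null while swapped out

    public SwappableBlock(byte[] data) { this.data = data; }

    public byte Read(int offset)
    {
        if (data == null)
        {
            data = File.ReadAllBytes(backingFile); // reload on access
            File.Delete(backingFile);
            backingFile = null;
        }
        return data[offset];
    }

    public void SwapOut(string path)
    {
        if (data == null) return;
        File.WriteAllBytes(path, data);
        backingFile = path;
        data = null;              // the memory can now be reclaimed
    }
}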
Post by David Jeske
zero-copy from the hardware HAL to user-land, then all these folks are
going to be touching some shared-buffer-cache-heap, and all of them
would need to be stopped for a stop-the-world collection of the
shared-buffer-cache-heap. That heap is likely to be quite large.
Not necessarily, collection can be done incrementally.





Chad Z. Hower
2009-09-23 11:57:06 UTC
Permalink
Post by Ben Kloosterman
1) Better performance is NOT the main goal; the real goal is to improve
reliability and security without reducing performance. That said, I
believe we will achieve better performance (esp. through better
Ben I fully agree here. Speed is not our main goal, however I too
believe we can do quite well here.
Post by Ben Kloosterman
some CAD systems all stayed with real mode for performance. Why did we
break real mode? 1. Not enough memory (needed by GUI apps). 2.
Reliability for multiple-process support. Through a type-safe
Much of this was because of a very crappy implementation by Intel and
their need to maintain backwards compat. Intel never learns, and now we
have layers of this garbage. Motorola and others had much better
architectures, but just like VHS, Intel won because of the available
software.
Post by Ben Kloosterman
Software paging does have the nice property that it smartly "makes the
common case fast". (i.e. a system which isn't swapping has no overhead.
We can also do "smart swapping", deciding and organizing data better.
Processes can even mark how important data is. Currently Windows
supports only "swap or no swap", and most programs don't use it. We can
even group memory allocation types and move them around so they swap
together, which Windows and others can't.
Post by Ben Kloosterman
The GC costs will be about the same as a.NET app running on Windows .
I think less. I've done some investigation here, and because it has to
sit on top of the Windows mem alloc etc... I think we can do better.
Post by Ben Kloosterman
reliability, security and maintainability (of the OS). Consider a TDD
kernel also.
I love TDD... and it will be a huge part of Cosmos.





Ben Kloosterman
2009-09-23 12:40:55 UTC
Permalink
Post by Chad Z. Hower
Much of this was because of a very crappy implementation by Intel and
their need to maintain backwards compat. Intel never learns, and now we
have layers of this garbage. Motorola and others had much better
architectures, but just like VHS, Intel won because of the available
software.
Don't forget the 35B they plug into a plant to have the most transistors
and hence the fastest CPU. I don't think it's entirely Intel's fault:
they tried a new approach with the i860, and later HP did with IA-64
Itanium; it's just hard to move the market. Intel wanted everyone to go
Itanium for 64-bit, but AMD prevented that with x64, which Intel licensed
to prevent loss of market share; at least we got 8 extra GP registers in
64-bit mode. I still like the rep instruction :-)
Post by Chad Z. Hower
Post by David Jeske
Software paging does have the nice property that it smartly "makes the
common case fast". (i.e. a system which isn't swapping has no overhead.
We can also do "smart swapping", deciding and organizing data better.
Processes can even mark how important data is. Currently Windows
supports only "swap or no swap", and most programs don't use it. We can
even group memory allocation types and move them around so they swap
together, which Windows and others can't.
I was thinking exactly this, but would even put it in the API: e.g. when
a browser tab goes into the background and is not active for more than 1
minute, it should be a strong candidate for swap; the GUI can't really do
this itself. Honestly, I don't think swapping is a high priority, but
it's a counter to VMM swapping. A VMM with swapping also has a lot of
bugs, makes the OS a dog's breakfast, and has numerous issues, e.g. it
gets its hooks into the dispatcher, file system, disk driver, etc.
(paging a page in any of these things would deadlock the system). The
loss in maintainability IMHO is not worth it.
Post by Chad Z. Hower
Post by David Jeske
The GC costs will be about the same as a .NET app running on Windows.
I think less. I've done some investigation here, and because it has to
sit on top of the Windows mem alloc etc... I think we can do better.
Agree. Note the shared nursery I proposed: it would be very fast (new
string() would cost the same as int i) but is not viable on Windows.
Another optimization I use is that mem pages are 1 MB (there is a
small-page mem service which manages large pages and breaks them down,
but your app incurs a CPU overhead using these). By using these large but
equally sized blocks we will have less of a problem with memory
fragmentation (I use a best-fit algorithm), and the GCs will not call the
memory manager frequently. This is also not really possible without all
memory being subdivided by GCs.
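
A sketch of the best-fit bookkeeping over fixed 1 MB pages (illustrative
only; no real memory is managed here):

using System;
using System.Collections.Generic;

public sealed class PageAllocator
{
    public const int PageSize = 1 << 20; // 1 MB

    private struct Run { public int Start; public int Count; }
    private readonly List<Run> freeRuns = new List<Run>();

    public PageAllocator(int totalPages)
    {
        freeRuns.Add(new Run { Start = 0, Count = totalPages });
    }

    // Returns the first page index of 'count' contiguous pages.
    public int AllocatePages(int count)
    {
        int best = -1;
        for (int i = 0; i < freeRuns.Count; i++)
            if (freeRuns[i].Count >= count &&
                (best < 0 || freeRuns[i].Count < freeRuns[best].Count))
                best = i; // best fit: smallest run that is still big enough

        if (best < 0) throw new OutOfMemoryException();

        Run run = freeRuns[best];
        if (run.Count == count)
            freeRuns.RemoveAt(best);
        else
            freeRuns[best] = new Run { Start = run.Start + count,
                                       Count = run.Count - count };
        return run.Start;
    }
}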
Post by Chad Z. Hower
Post by David Jeske
reliability, security and maintainability (of the OS). Consider a TDD
kernel also.
I love TDD... and it will be a huge part of Cosmos.
I think it should be mandated for the kernel, device drivers, and
services. A lot of academic interest is in proving the microkernel; I
always say I'd rather have TDD on 100K of code, including the kernel and
key services, than a difficult-to-change and expensive mathematical proof
of a 10K microkernel.

Regards,

Ben
Chad Z. Hower
2009-09-23 15:43:39 UTC
Permalink
Post by Ben Kloosterman
Don't forget the 35B they plug into a plant to have the most transistors and
hence the fastest CPU . I don't think its entirely Intels fault they tried a
Yes, but I disagree that the strategy of more transistors is the best.
That's like dumping bigger engines in cars while ignoring efficiency.
Intel CPUs are horribly inefficient, and they have to compensate with too
much silicon.

Cosmos however will be CPU independent... so maybe we can spur on
another CPU to be popular. :)
Post by Ben Kloosterman
new approach with i860 and later HP with IA-64 Itanium it just hard to move
the market. Intel wanted everyone to go Itanium for 64 bit but AMD
prevented that with the x64 which Intel licensed to prevent loss of market
share at least we got extra 15 GP registers in 64 bit mode. I still like
the rep instruction :-)
Didn't Intel also lock up IA-64 etc. so as to try to exclude AMD? Given
that, AMD had no choice, and at least they innovate on the x86 platform;
in fact they led the x64 one.
Post by Ben Kloosterman
Post by Chad Z. Hower
I love TDD... and it will be a huge part of Cosmos.
I think it should be mandated for the kernel, device drivers, and
services. A lot of academic interest is in proving the microkernel; I
always say I'd rather have TDD on 100K of code, including the kernel and
key services, than a difficult-to-change and expensive mathematical proof
of a 10K microkernel.
Yes, TDD will be. We need to get past this next phase, then build a
decent testing framework and TDD will be a mandate.