Discussion:
GC refactor ahead
Allison Randal
2009-11-24 22:06:16 UTC
Permalink
It's beginning to look like the next big Parrot task (along the lines of
the calling conventions refactor) will be the GC, as an important
performance enhancement for the compiler tools. See:

http://irclog.perlgeek.de/parrot/2009-11-20#i_1751189

Likely after 2.0. This is just a general message to get people started
thinking about it.

Allison
Donald Hunter
2009-11-25 21:19:50 UTC
Permalink
Allison,

I hacked out the arena-based GC and hacked in Boehm GC several months back.
IIRC, for an execution run of compiled Rakudo code the performance difference
was minimal. Reading the IRC log, it chimes with what I remember - it appears
that because so many short-lived objects are created, any GC is going to be
under huge pressure.

It will take me a while to resurrect my Boehm hack since I am in the middle
of a Windows 7 upgrade but I can gather some specific figures if it will
help.

Cheers,
Donald.
Kevin Tew
2009-11-25 21:54:28 UTC
Permalink
Generational collection with mmap protection barriers is probably a good
strategy at this point.

Kevin
Allison Randal
2009-11-25 23:23:13 UTC
Permalink
Aye. In this case it's the long-lived objects that are killing us;
getting them off into older generations would be a big win.

Allison
Patrick R. Michaud
2009-11-26 14:51:00 UTC
Permalink
Post by Donald Hunter
Allison,
I hacked out arena based GC and hacked in Boehm GC several months back. IIRC
for an execution run of compiled Rakudo code the performance difference was
minimal. Reading the IRClog it chimes with what I remember - it would
appear to be due to the fact that so many short-lived objects are
created, any GC is going to be under huge pressure.
I don't quite understand this last comment. Do short-lived objects
have a big impact on GC if they are created and then die before a GC
run is triggered?

Pm
chromatic
2009-11-26 18:01:03 UTC
Permalink
Post by Patrick R. Michaud
I don't quite understand this last comment. Do short-lived objects
have a big impact on GC if they are created and then die before a GC
run is triggered?
They do, because we have to trace the entire live set on every GC run, even if
only 10% of the GCables created are live and 90% are short-lived.

-- c
Geoffrey Broadwell
2009-11-26 18:29:00 UTC
Permalink
Post by chromatic
Post by Patrick R. Michaud
I don't quite understand this last comment. Do short-lived objects
have a big impact on GC if they are created and then die before a GC
run is triggered?
They do, because we have to trace the entire live set on every GC run, even if
only 10% of the GCables created are live and 90% are short-lived.
I am somewhat surprised by this. My expectation (and it sounds like
Patrick's as well) was that GC froth (items both created and destroyed
between GC runs) would be removed from the "live set" before the GC run
sees them -- and thus would not affect the execution time of the GC run.

It sounds like instead you're saying that creating an item adds it to
the live set, but dying does *not* remove it from the live set -- and
that only a full GC run can do this removal.

That sounds less than awesome, performance wise.

If the problem is that we don't know if the GCable has references to or
from it, and need to do a full GC run to resolve this, then it would be
nice if we could address this either through hinting or algorithmically.

If that's not the problem, then what is?


-'f
chromatic
2009-11-26 19:40:26 UTC
Permalink
Post by Geoffrey Broadwell
... we have to trace the entire live set on every GC run,
even if only 10% of the GCables created are live and 90% are short-lived.
I am somewhat suprised by this. My expectation (and it sounds like
Patrick's as well) was that GC froth (items both created and destroyed
between GC runs) would be removed from the "live set" before the GC run
sees them -- and thus would not affect the execution time of the GC run.
How do you identify items that were once but now aren't live?

There are a couple of options.

1) refcounting
2) create all new GCables from young generation pools and ...
2a) use pervasive write barriers to mark the pool as containing GCables
pointed to from an older generation, so you know they need marking
2b) use pervasive write barriers to promote GCables pointed to from an older
generation to a generation not immediately collectable
3) trace the whole live set from the roots to all reachable objects

#1 is out.
We do #3.

We should move to a #2 system.

(I've left out copying and compacting because that's an implementation detail
for reclaiming, not marking live/dead.)
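
To make that concrete, here's a toy C sketch of #3 (the types and names are
invented for the example, not Parrot's real structures). A GCable that died
between runs still costs us: nothing notices it's dead until a full trace
from the roots fails to mark it and the sweep walks the whole pool.

#include <stdlib.h>

/* Illustrative only: not Parrot's real GC structures. */
typedef struct GCable {
    int            marked;
    struct GCable *child;         /* one pointer field, for illustration */
    struct GCable *next_in_pool;  /* allocation pool linkage             */
} GCable;

static void gc_mark(GCable *obj) {
    if (obj == NULL || obj->marked)
        return;
    obj->marked = 1;
    gc_mark(obj->child);          /* trace everything reachable */
}

/* One run marks from every root, then sweeps the *entire* pool: anything
 * left unmarked is garbage, whether it died a microsecond or ten minutes
 * ago. There is no cheaper way to notice that it died. */
static void gc_run(GCable *roots[], size_t nroots, GCable **pool_head) {
    for (size_t i = 0; i < nroots; i++)
        gc_mark(roots[i]);

    for (GCable **p = pool_head; *p != NULL; ) {
        GCable *obj = *p;
        if (!obj->marked) {       /* dead: unlink and reclaim */
            *p = obj->next_in_pool;
            free(obj);
        }
        else {                    /* live: clear the mark for the next run */
            obj->marked = 0;
            p = &obj->next_in_pool;
        }
    }
}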
Post by Geoffrey Broadwell
It sounds like instead you're saying that creating an item adds it to
the live set, but dying does *not* remove it from the live set -- and
that only a full GC run can do this removal.
Yes; that's how non-refcounting GC works. You can always identify what's
live. You can identify what's dead by complement: everything you
haven't marked as live is dead.
Post by Geoffrey Broadwell
That sounds less than awesome, performance wise.
If the problem is that we don't know if the GCable has references to or
from it, and need to do a full GC run to resolve this, then it would be
nice if we could address this either through hinting or algorithmically.
We do need write barriers, but it's going to be a pain to add them to the
entire system. *Every place* that stores a STRING or a PMC has to use the
appropriate macro or function call.

That includes all dynops, extension code, and custom PMCs.

-- c
Andrew Whitworth
2009-11-26 21:01:48 UTC
Permalink
Post by chromatic
How do you identify items that were once but now aren't live?
There are a couple of options.
1) refcounting
2) create all new GCables from young generation pools and ...
2a) use pervasive write barriers to mark the pool as containing GCables
pointed to from an older generation, so you know they need marking
2b) use pervasive write barriers to promote GCables pointed to from an older
generation to a generation not immediately collectable
3) trace the whole live set from the roots to all reachable objects
#1 is out.
We do #3.
We should move to a #2 system.
Another option, similar in idea but different in implementation, would
be to use escape analysis to allocate certain groups of short-lived
PMCs on the stack instead of the heap. This would exclude those items
from GC entirely, likely reducing a lot of pressure from certain
internal-only operations.

Of the options chromatic mentioned, I think something along the lines
of 2b is best. PMCs which are attached to other PMCs in a
semi-permanent way could all be treated together by the GC as a single
"super-object". This is similar to how we currently handle PMCs and
their associated attributes structures. In this case, the write
barriers don't need to be pervasive throughout the system; they would
only need to be implemented inside aggregate PMC types (any PMC which
contains a reference to another PMC).

Once we switch to Lorito for most things, I think we can avoid messy
operations like stack-walking and move to a more precise (as opposed
to a conservative) approach.

Here are some of the high-priority things that we need to do if we
want to improve GC performance:

1) Decrease the total number of PMCs created in the flow of normal
operations (various optimizations, precise GC instead of conservative
GC, etc)
2) Decrease the number of times a long-lived PMC needs to be marked
(Generational GC and write barriers)
3) Decrease the number of items we have to examine during a sweep

Once we have these improvements in place, we can talk about improving
the efficiency of each operation by tweaking the implementation
details (compacting GC, etc).
Post by chromatic
We do need write barriers, but it's going to be a pain to add them to the
entire system.  *Every place* that stores a STRING or a PMC has to use the
appropriate macro or function call.
That includes all dynops, extension code, and custom PMCs.
I think we can get away with a more limited approach by only demanding
write barriers inside aggregate PMCs.

--Andrew Whitworth
Kevin Tew
2009-11-28 05:57:53 UTC
Permalink
Many generational GCs allocate from mmapped "pages" of, say, 4k-16k in
size. mprotect and handling SIGSEGV can then be used as a write barrier
without having to modify the VM code; the write barriers are, of course,
per page.
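
Here's a minimal, POSIX-only sketch of the mechanism (purely illustrative: a
real collector would map the faulting address back to its GC page and record
it in a remembered set rather than just set a flag):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *gc_page;        /* one "old generation" page, for illustration */
static size_t page_size;
static volatile sig_atomic_t page_dirtied = 0;

/* A store hit a write-protected page: remember it, re-open the page, and let
 * the faulting instruction restart. (mprotect() is not formally
 * async-signal-safe, but this is the usual trick such collectors rely on.) */
static void barrier_handler(int sig, siginfo_t *info, void *ctx) {
    char *addr = (char *)info->si_addr;
    (void)sig; (void)ctx;
    if (addr >= gc_page && addr < gc_page + page_size) {
        page_dirtied = 1;
        mprotect(gc_page, page_size, PROT_READ | PROT_WRITE);
    }
    else
        signal(SIGSEGV, SIG_DFL);  /* not ours: let a real crash crash */
}

int main(void) {
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    gc_page   = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = barrier_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Seal" the page after a collection; the next store trips the barrier. */
    mprotect(gc_page, page_size, PROT_READ);

    gc_page[0] = 1;                            /* faults once, gets recorded */
    printf("page dirtied: %d\n", (int)page_dirtied);
    return 0;
}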

Kevin
Andrew Whitworth
2009-11-28 13:31:43 UTC
Permalink
Yes, it is a good strategy too, assuming all our targeted platforms
properly support the mmap options we need. I can't think of any off
the top of my head that don't.

I've seen some very interesting papers recently, especially links from
the "unladen swallow" project page where good compacting GCs were
written to take advantage of a similar mechanism (mmap and robust sig
handling) to handle moves of data objects. So, if we have a good
system in place to handle these issues, we could reuse it to implement
features like fast compacting.

It also gives us the ability to combine semi-space-like GC algorithms
with iterative behavior. Very attractive indeed.

--Andrew Whitworth
Patrick R. Michaud
2009-11-27 00:09:07 UTC
Permalink
Post by chromatic
Post by Patrick R. Michaud
I don't quite understand this last comment. Do short-lived objects
have a big impact on GC if they are created and then die before a GC
run is triggered?
They do, because we have to trace the entire live set on every
GC run, even if only 10% of the GCables created are live and
90% are short-lived.
Thanks, this helps explain it a bit.

Pm
Patrick R. Michaud
2009-12-01 18:06:24 UTC
Permalink
This morning while attempting to answer a different question related
to Parrot performance I ran into a very disturbing (to me) result
about Parrot GC, which I share below.

Here's an NQP version of the fibonacci benchmarks that appear in
examples/benchmarks/.

$ cat fib.nqp

sub fib($n) {
    ($n < 2) ?? $n !! fib($n-1) + fib($n-2);
}

my $N := 28;

pir::say("fib($N) = " ~ fib($N));


Running the above program with NQP takes about 20.5 seconds:

$ time ./parrot-nqp fib.nqp
fib(28) = 317811
real 0m20.598s
user 0m20.500s
sys 0m0.090s

Okay, so let's figure out where the time is being used. Let's see
how long it takes to compile the above into PIR:

$ time ./parrot-nqp --target=pir fib.nqp >fib-nqp.pir
real 0m0.258s
user 0m0.220s
sys 0m0.030s

Hmm, compilation only requires 0.25 seconds -- not too bad. So the
remaining time (20.25 seconds) must be spent executing the program,
right? Let's try running the PIR file we just generated:

$ time ./parrot fib-nqp.pir
fib(28) = 317811
real 0m9.879s
user 0m9.870s
sys 0m0.010s

Huh?! Only 9.8 seconds? So, if fib.nqp takes 0.25 seconds to compile
and 9.8 seconds to run, where are the other 10 seconds coming from in
the original execution? Let's try running the original program again,
but this time disabling GC at the beginning:

$ cat fib-2.nqp

Q:PIR { sweepoff };
Q:PIR { collectoff };

sub fib($n) {
    ($n < 2) ?? $n !! fib($n-1) + fib($n-2);
}

my $N := 28;

pir::say("fib($N) = " ~ fib($N));

$ time ./parrot-nqp fib-2.nqp
fib(28) = 317811
real 0m16.715s
user 0m10.320s
sys 0m2.600s

Okay, it appears that simply turning off the GC gives us a 10 second
improvement in execution speed -- or, put another way, over half of our
execution time for this simple benchmark is spent dealing with GC overhead
for things not directly related to the code being run.

Presumably the extra GC overhead is coming from the parse/past/post
tree structures produced during compilation, as well as the nqp-rx/regex/
pct/past/etc. libraries themselves. If this is true, then what
is especially worrying is that the data structures produced and retained
from compiling the fib.nqp program are almost trivially small, yet
their existence (but not their creation) is resulting in a relatively
huge degradation in runtime performance -- in this case, effectively
doubling the overall runtime cost.

Some may suggest that we can improve the situation by reducing the
number of objects created or retained as a result of compilation.
While I agree it will be useful to make the compilation process
more efficient, my underlying point is that with the current GC
*any* modest-sized data structure that exists at runtime appears
to incur an unacceptable runtime cost (and this would include
data structures created in applications as well).

Given what I've just found above, I'm really hoping we can get
some significant effort placed on GC *prior* to 2.0. It's okay
with me if nothing _lands_ before 2.0, but Allison's message that
started this thread sounded much more to me like "we'll start
working on it in earnest shortly after 2.0" as opposed to
"it will land shortly after 2.0". I fear "start working on it
after 2.0" will ultimately mean that we're seriously hampered by
GC performance through at least the first half of 2010.

Pm
chromatic
2009-12-01 18:22:43 UTC
Permalink
Post by Patrick R. Michaud
While I agree it will be useful to make the compilation process
more efficient, my underlying point is that with the current GC
any modest-sized data structure that exists at runtime appears
to incur an unacceptable runtime cost (and this would include
data structures created in applications as well).
I don't believe it's the cost of those data structures created for NQP. I
believe instead it's the cost of the data structures created for PCC.

Those are short-lived, mostly-garbage headers that we currently can't reclaim
without a full stop the world mark and sweep run.

-- c
chromatic
2009-12-01 20:33:22 UTC
Permalink
Post by chromatic
I don't believe it's the cost of those data structures created for NQP. I
believe instead it's the cost of the data structures created for PCC.
Is there any way we can verify this presumption?
Profile NQP-rx on Actions.pm with Callgrind and look at the callees of pmc_new.
Of 4,780,813 PMCs created, the top 10 are:

* 916,675 Context
* 890,230 CallSignature
* 879,236 RetContinuation
* 851,059 CallSignatureReturns
313,394 String
249,716 ResizablePMCArray
* 208,173 FixedIntegerArray
187,444 Integer
109,101 Hash

Half of the ten are directly part of PCC. They combine to represent 3,745,373
PMCs -- 78.34% of all PMCs created. (I suspect that a lot of those String
PMCs autobox S registers for slurpy parameters -- which go into the RPAs as
well).

If ~80% of all PMCs created in the system are only there to account for
internal data structures for calling functions (and they are), and if there's
no complex control flow such as continuations and coroutines which necessitate
keeping sub-graphs of the call graph alive (and in this benchmark, there
aren't), then I'm very confident that PCC is expensive here.
Instead of increased
GC pressure caused by PCC, I am starting to suspect that our mark
implementation is more inefficient than I previously had thought. I'm
going to take a look around to see what I can find.
That's probably also true.

-- c
Vasily Chekalkin
2009-12-01 21:01:05 UTC
Permalink
Post by chromatic
Is there any way we can verify this presumption?
Profile NQP-rx on Actions.pm with Callgrind and look at the callees of pmc_new.
* 890,230 CallSignature
* 851,059 CallSignatureReturns
These two are easy to merge: just move the attributes of CallSignatureReturns
into CallSignature, and the VTABLE functions into src/call/args.c.
--
Bacek
Vasily Chekalkin
2009-12-03 13:45:44 UTC
Permalink
Post by Vasily Chekalkin
Post by chromatic
Is there any way we can verify this presumption?
Profile NQP-rx on Actions.pm with Callgrind and look at the callees of
* 890,230 CallSignature
* 851,059 CallSignatureReturns
This 2 are easy to merge. Just move attributes of CallSignatureReturns
into CallSignature, VTABLE functions into src/call/args.c.
I've created the cs_csr_merge branch, which merges CallSignature and
CallSignatureReturns into a single PMC. The branch is ready to merge back
into trunk. Unfortunately it doesn't help much: 31,593,623,631 vs.
29,891,188,893 instructions according to callgrind.

Sigh... We need proper Generational GC.
--
Bacek
Jonathan Leto
2009-12-03 19:07:42 UTC
Permalink
Howdy,
Post by Vasily Chekalkin
I've created cs_csr_merge branch with merged CallSignature and
CallSignatureReturns into single PMC. Branch is ready to merge back into
trunk. Unfortunately it doesn't help much. 31593623631 vs 29891188893
instructions according to callgrind.
Sigh... We need proper Generational GC.
Should we merge it in? Does it at least give us a small performance
improvement?

Duke
--
Jonathan "Duke" Leto
***@leto.net
http://leto.net
chromatic
2009-12-03 19:51:48 UTC
Permalink
Post by Jonathan Leto
Should we merge it in? Does it at least give us a small performance
improvement?
It gives 5.389% on the branch. Combine that with the GC tunings of the other
day and it'll be a lot more impressive. That should finally make the code
faster than before the PCC merge.

-- c
chromatic
2009-12-01 21:07:10 UTC
Permalink
Post by chromatic
If ~80% of all PMCs created in the system are only there to account for
internal data structures for calling functions (and they are), and
if there's no complex control flow such as continuations and coroutines which
necessitate keeping sub-graphs of the call graph alive (and in this
benchmark, there aren't), then I'm very confident that PCC is expensive here.
For further review, I annotated the GC with printf() calls to track PMC
destruction during GC. The attached log shows how many PMCs of which type get
recycled during every GC run, as well as the percentage of PMCs (Context,
CallSig, RetCallSig, FIA) I believe to be PCC-related.

At least two out of three PMCs recycled in every run are PCC-related.

-- c
Andrew Whitworth
2009-12-01 22:33:46 UTC
Permalink
Post by chromatic
Post by chromatic
If ~80% of all PMCs created in the system are only there to account for
internal data structures for calling functions (and they are), and
if there's no complex control flow such as continuations and coroutines which
necessitate keeping sub-graphs of the call graph alive (and in this
benchmark, there aren't), then I'm very confident that PCC is expensive here.
For further review, I annotated the GC with printf() calls to track PMC
destruction during GC.  The attached log shows how many PMCs of which type get
recycled during every GC run, as well as the percentage of PMCs (Context,
CallSig, RetCallSig, FIA) I believe to be PCC-related.
At least two out of three PMCs recycled in every run are PCC-related.
Thanks for this, c. It's very informative. So I won't doubt anymore
that PCC is expensive! I make a few observations from this:

1) We go through 15 GC runs before we ever find any trash. That suggests
to me that we are allocating a lot of things at startup. When we try to
allocate a new PMC at startup and there are none in the free list, we
first do a GC mark/sweep and then we allocate a new arena. This suggests
that we need to significantly increase the initial size of our pools to
prevent this, or else use heuristics to prevent GC runs from occurring so
early in the execution cycle.

2) The high number of String PMCs is a little troublesome, and you're
probably right that most of those are autoboxed S registers for PCC
calls. I think CallSignature needs to be made much more intelligent
about this and avoid unnecessary autoboxing. STRINGs do have a flag
set in the PObj flags field, so we should be able to lazily box these
only when an S is passed and a P is needed. But the numbers for this
are comparatively small.

3) Seems like there are huge gains to be had if we can merge
CallSignature/Context/CallSignatureReturns together. I think the
VTABLE interface of the resulting PMC will be somewhat clumsy,
though maybe we can reuse the same object and just replace its VTABLE*
when it's being used in call or return mode.

4) It seems like we are making ~4000-8000 CallSignatures and Contexts
between most GC runs, which implies that we're making that many
function calls between GC runs. So, I don't think GC is running too
often, but when it is running it is very expensive. Of course, there
are a comparatively small number of RetContinuation PMCs collected, so
it seems the vast majority of expenses come from calling NCI
functions. This seems like a case that we should be able to optimize
further.

I also seem to remember you saying a while back that Capture.mark()
was eating up a huge amount of execution time; is this a true
recollection?

--Andrew Whitworth
Patrick R. Michaud
2009-12-01 21:31:40 UTC
Permalink
Post by chromatic
Post by Patrick R. Michaud
While I agree it will be useful to make the compilation process
more efficient, my underlying point is that with the current GC
any modest-sized data structure that exists at runtime appears
to incur an unacceptable runtime cost (and this would include
data structures created in applications as well).
I don't believe it's the cost of those data structures created for NQP. I
believe instead it's the cost of the data structures created for PCC.
Those are short-lived, mostly-garbage headers that we currently can't reclaim
without a full stop the world mark and sweep run.
My testing -- which may be based on a flawed understanding on my part --
doesn't seem to support this conclusion. For this analysis I divide
the gc-ables into two groups: those created during compilation
of fib.nqp, and those created during the runtime of fib.nqp.

The first GC run that occurs during the runtime of fib.nqp ought
to catch all of the short-lived PCC structures that are in the first
group... leaving only long-lived PMCs. Any PCC structures that get
created after runtime begins should be the same for both the .nqp
and generated-.pir versions of the program, thus any observed
differences in execution time would seem to have to be due to
long-lived gc-ables remaining from the compilation process.

I wrote a short test to check this hypothesis -- at the beginning
of fib.nqp execution I force a GC run (to clear up any "mostly-garbage
headers" left over from compilation), and then time the resulting
execution:

$ cat fib-4.nqp

pir::say(pir::time__N);
Q:PIR {
    sweep 1
    $P0 = getinterp
    $P0.'run_gc'()
};
pir::say(pir::time__N);

sub fib($n) {
    ($n < 2) ?? $n !! fib($n-1) + fib($n-2);
}

my $N := 28;

pir::say("fib($N) = " ~ fib($N));
pir::say(pir::time__N);

$ ./parrot-nqp fib-4.nqp
1259694448.26969
1259694448.28342
fib(28) = 317811
1259694469.60673

As you can see, execution time after a forced GC run at the beginning
is still about 21 seconds (69.60673 - 48.28342). Running the .pir
version of the above takes about 9.6 seconds. This tells me that
it must be the long-lived gc-ables that are making the difference
in execution time, and not the short-lived structures resulting
from PCC.

Pm

P.S.: Further analysis reveals that many of our "short-lived"
PCC structures may in fact end up being long-lived. But the original
point remains that it appears to be GC that is significantly slowing
things down during runtime.
chromatic
2009-12-01 23:05:00 UTC
Permalink
Post by Patrick R. Michaud
The first GC run that occurs during the runtime of fib.nqp ought
to catch all of the short-lived PCC structures that are in the first
group... leaving only long-lived PMCs. Any PCC structures that get
created after runtime begins should be the same for both the .nqp
and generated-.pir versions of the program, thus any observed
differences in execution time would seem to have to be due to
long-lived gc-ables remaining from the compilation process.
I wrote a short test to check this hypothesis -- at the beginning
of fib.nqp execution I force a GC run (to clear up any "mostly-garbage
headers" left over from compilation), and then time the resulting
$ cat fib-4.nqp
pir::say(pir::time__N);
Q:PIR {
sweep 1
$P0 = getinterp
$P0.'run_gc'()
};
pir::say(pir::time__N);
sub fib($n) {
($n < 2) ?? $n !! fib($n-1) + fib($n-2);
}
my $N := 28;
pir::say("fib($N) = " ~ fib($N));
pir::say(pir::time__N);
$ ./parrot-nqp fib-4.nqp
1259694448.26969
1259694448.28342
fib(28) = 317811
1259694469.60673
As you can see, execution time after a forced GC run at the beginning
is still about 21 seconds (69.60673 - 48.28342). Running the .pir
version of the above takes about 9.6 seconds. This tells me that
it must be the long-lived gc-ables that are making the difference
in execution time, and not the short-lived structures resulting
from PCC.
I can't reproduce this; the results are opposite for me.

$ time ./parrot ext/nqp-rx/nqp-rx.pbc fib4.nqp
1259707187.22421
fib(26) = 121393
1259707190.75702

real 0m3.693s
user 0m3.612s
sys 0m0.076s

$ time ./parrot fib4.pir
1259707220.46122
fib(26) = 121393
1259707223.66419

real 0m3.214s
user 0m3.184s
sys 0m0.012s

Yet when I run Callgrind, I get a cost of 7,703,189,522 instructions for the
NQP version and 8,134,562,432 for the PIR version.

One significant difference is that the GC runs 588 times for the NQP version and
6801 times for the PIR version.

-- c
Allison Randal
2009-12-04 10:52:19 UTC
Permalink
Post by chromatic
Those are short-lived, mostly-garbage headers that we currently can't reclaim
without a full stop the world mark and sweep run.
I've wondered for a long time if we're using an incredibly stupid
tree-traversal algorithm during the mark phase. But, to look at the
problem more abstractly, any tree-traversal algorithm is going to have a
cost relative to n, where n is the number of elements in the tree. No
matter what traversal algorithm we use, the biggest gain will be
reducing the number of elements we traverse in a mark. Which means:

A) Create a short-lived object pool that's marked/swept frequently, and
only use it for call signatures, contexts, etc. We can do this right now
without implementing full generational GC using the same strategy we use
for constant PMCs: call a separate function pmc_new_adonis that
allocates from the "fast dying pool". We can also use a function
pmc_new_methuselah for objects we know are created at interpreter
startup that will live until interpreter death (there's no reason to run
GC on those during operation at all). Like constants, we require that
all children of adonis or methuselah PMCs must be allocated from the
same pool(s). Splitting off adonis and methuselah PMC pools will mean
far fewer mark runs on the ordinary GC pools. (Names chosen for
illustration purposes.)

B) Create fewer PMCs: Merge CallSignature/Context/RetContinuation/etc.
Avoid boxing until absolutely necessary (chromatic's changes to make
CallSignature a typed array should have taken care of that for
positional arguments, but we still need typed storage for named arguments).

Allison
Andrew Whitworth
2009-12-04 13:31:08 UTC
Permalink
Post by Allison Randal
I've wondered for a long time if we're using an incredibly stupid
tree-traversal algorithm during the mark phase. But, to look at the problem
more abstractly, any tree-traversal algorithm is going to have a cost
relative to n, where n is the number of elements in the tree. No matter what
traversal algorithm we use, the biggest gain will be reducing the number of
elements we traverse in a mark. Which means:
A) Create a short-lived object pool that's marked/swept frequently, and only
use it for call signatures, contexts, etc. We can do this right now without
implementing full generational GC using the same strategy we use for
constant PMCs: call a separate function pmc_new_adonis that allocates from
the "fast dying pool". We can also use a function pmc_new_methuselah for
objects we know are created at interpreter startup that will live until
interpreter death (there's no reason to run GC on those during operation at
all). Like constants, we require that all children of adonis or methuselah
PMCs must be allocated from the same pool(s). Splitting off adonis and
methuselah PMC pools will mean far fewer mark runs on the ordinary GC pools.
(Names chosen for illustration purposes.)
So you're thinking like a generational scheme but with the ability to
explicitly declare which generation to use when we allocate an object?
I think we're going to want at least three explicit groups: Items that
are rapidly recycled, items that are stable but mortal, and items
which are absolutely immortal. On top of that, we're going to want
just a normal allocation where we don't know the lifespan ahead of
time and allow the system to migrate the items depending on how many
mark phases they survive.

I'm also becoming very enamored with the idea of allowing stack-based
allocations for items that have very finite lifespans (think about
where we use constant_pmc_new now). Some macros to allocate simple,
short-lived PMCs and their attributes on the stack would take off a
lot of GC pressure in these cases. We wouldn't even need to call
VTABLE_destroy in most cases. I suspect that a relatively large number
of PMC allocations could be transformed in this way for a major
performance savings.
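
A very rough sketch of what such a macro could look like (every name and the
flag here are invented; nothing like this exists in Parrot today, and such a
PMC must never escape its C frame):

typedef struct FakePMC {
    unsigned flags;
    /* ... vtable pointer, data ... */
} FakePMC;

#define PMC_ON_STACK_FLAG  0x01u   /* "not GC-managed: never swept or freed" */

/* Declares 'name' as a pointer to a stack-resident PMC header. The object
 * must not outlive the current C frame and must not be handed to anything
 * that expects a pool-allocated, GC-traceable PMC. */
#define DECLARE_STACK_PMC(name)                         \
    FakePMC name##_storage = { PMC_ON_STACK_FLAG };     \
    FakePMC *name = &name##_storage

static unsigned use_temp_pmc(void) {
    DECLARE_STACK_PMC(tmp);   /* never enters a GC pool, no sweep, no free */
    return tmp->flags;        /* ... use tmp, then let it vanish ... */
}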
Post by Allison Randal
B) Create fewer PMCs: Merge CallSignature/Context/RetContinuation/etc. Avoid
boxing until absolutely necessary (chromatic's changes to make CallSignature
a typed array should have taken care of that for positional arguments, but
we still need typed storage for named arguments).
Explicitly merging PMCs is a good idea, but I also am liking the idea
of implicitly merging them into "super objects". If we can take a PMC
and lump it together into a memory block with the complete tree of all
its delegate PMCs and STRINGs, we could get away with only marking
the block as a whole (not all the individual parts). Barring the
ability to store the complete tree in one place, we could store groups
of related PMCs together along with an array of escape pointers. The
mark process would be simplified to marking the block as a whole and
the array of escape pointers only. This idea is similar to generations
and inter-generational pointer lists, but uses locality to group items
together that are used together, instead of grouping by the number of
mark phases survived (which will tend to have the same result for
stable groups, but with worse performance).
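
Roughly, the mark side of that idea might look like this (structures invented
for illustration only):

typedef struct FakePMC { int marked; } FakePMC;   /* stand-in for a real PMC */

/* Related PMCs co-allocated in one block: the mark phase sets a single bit
 * for the block, then traces only the "escape" pointers, i.e. the references
 * that leave the group. */
typedef struct SuperBlock {
    int       marked;
    FakePMC  *members[8];     /* headers living inside this block */
    FakePMC **escapes;        /* pointers out of the block        */
    int       n_escapes;
} SuperBlock;

static void mark_pmc(FakePMC *pmc) {
    if (pmc && !pmc->marked)
        pmc->marked = 1;      /* stand-in for the ordinary recursive mark */
}

static void mark_super_block(SuperBlock *blk) {
    if (blk->marked)
        return;
    blk->marked = 1;                       /* one bit covers every member   */
    for (int i = 0; i < blk->n_escapes; i++)
        mark_pmc(blk->escapes[i]);         /* only the escapes need tracing */
}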

--Andrew Whitworth
Patrick R. Michaud
2009-12-04 14:36:14 UTC
Permalink
Post by Allison Randal
[...] We can
also use a function pmc_new_methuselah for objects we know are
created at interpreter startup that will live until interpreter
death (there's no reason to run GC on those during operation at
all). Like constants, we require that all children of adonis or
methuselah PMCs must be allocated from the same pool(s).
...when we're allocating a new PMC, how can we know in advance
that it will end up being a child of a methuselah PMC? Or do we
not need to know this (if we don't need to know it, how does
the requirement get met)?

Pm
Kevin Tew
2009-12-04 16:00:33 UTC
Permalink
Pmichaud is hinting at my primary concern.

Intergenerational pointer invariants should not be enforced manually.

Constant PMCs are a headache. We have had lots of "GC bugs" because
developers hung non-constant objects off of a constant PMC.

This is even worse if a specific PMC type is used in both constant and
non-constant settings.

1. You can't separate objects into separate generations unless you can
track intergenerational pointers.

2. Tracking (or eliminating) intergenerational pointers based on
"developer coding policy invariants" (i.e. constant PMCs can ONLY contain
or point to constant PMCs) is essentially reverting memory management
responsibilities back to the developer.

Coding policy based intergenerational management will be violated
(creating GC bugs) by adding new features and refactoring old code.

Unless I'm missing something obvious, this seems likely to multiply the
types of bugs that constant PMCs introduce.

Kevin
Andrew Whitworth
2009-12-04 16:26:12 UTC
Permalink
Post by Kevin Tew
Coding policy based intergenerational management will be violated (creating
GC bugs) by adding new features and refactoring old code.
Unless I'm missing something obvious this seems likely to multiply the types
of bugs that constant PMCS introduce.
Let me try and explain better what I am thinking:

We'll have at least two functions to allocate a new PMC header:
Parrot_new_normal_pmc and Parrot_new_immortal_pmc. On any field store
we'll have a write barrier set up that can identify the generation of
the child PMC, and if the generation of the child is the same as the
generation of the parent, we don't need to update the
intergenerational list. We only need to update the intergenerational
pointer list when a pointer is "intergenerational".

So if we call Parrot_new_immortal_pmc() to create the parent, and we
call it again to create the child, we've manually shown the
relationship that the two are in the same generation and we don't need
integenerational pointers. However, if the parent is immortal and the
child is normal, the write barrier will detect that and add the
intergenerational pointer record.

We could detect the mortal/immortal status of a PMC by checking a flag.

So there would be no manual modifications to the intergenerational
pointer list, and there would be fewer bugs because the differences
would be detected automatically (currently, differences aren't
detected or accounted for at all, hence all the bugs). We can all keep
in mind that this is an optimization, something that we should be able
to apply after we've created the generational GC and after we've
inserted write barriers in all the proper places. There are lots of
similar optimizations we could try once we have a good base system in
place.
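
In code, the barrier I'm describing might look roughly like this (the
allocator names above are proposals, and everything in this sketch is
illustrative rather than existing Parrot API):

#include <stdlib.h>

enum { GEN_NORMAL = 0, GEN_IMMORTAL = 1 };

typedef struct FakePMC {
    int             gen;      /* generation stamped at allocation time       */
    struct FakePMC *slot;     /* a single PMC-valued field, for illustration */
} FakePMC;

#define IGP_MAX 4096
static FakePMC *igp_list[IGP_MAX];    /* intergenerational pointer records */
static size_t   igp_count = 0;

static FakePMC *new_pmc(int gen) {
    FakePMC *pmc = calloc(1, sizeof (FakePMC));
    pmc->gen = gen;
    return pmc;
}
#define Parrot_new_normal_pmc()    new_pmc(GEN_NORMAL)
#define Parrot_new_immortal_pmc()  new_pmc(GEN_IMMORTAL)

/* The write barrier on a PMC-valued field store: only a pointer that
 * crosses generations gets recorded. */
static void pmc_set_slot(FakePMC *parent, FakePMC *child) {
    if (parent->gen != child->gen && igp_count < IGP_MAX)
        igp_list[igp_count++] = parent;   /* remember the crossing */
    parent->slot = child;
}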

--Andrew Whitworth
Kevin Tew
2009-12-04 16:39:35 UTC
Permalink
Allison was talking about solutions we could implement before writing a
generational collector.

Write barriers are expensive; in PLT Scheme we limit write barriers
to memory page granularity.

Kevin
Patrick R. Michaud
2009-12-04 19:52:09 UTC
Permalink
Post by Andrew Whitworth
[...]
So if we call Parrot_new_immortal_pmc() to create the parent, and we
call it again to create the child, we've manually shown the
relationship that the two are in the same generation and we don't need
intergenerational pointers. However, if the parent is immortal and the
child is normal, the write barrier will detect that and add the
intergenerational pointer record.
Speaking from the perspective of an HLL developer and user, almost
nothing in the above paragraph makes any sense to me. What would I do
in a language like Perl 5, Python, Perl 6, or even PIR to say "hey,
this object I'm creating is going to be part of an immortal data
structure, use the 'immortal PMC' creator instead of the 'mortal'
one"? Or, if my compiler is supposed to be smart enough to figure it
out, then how will it know? (This sounds like a variation of the
halting problem to me.)

I also think the term "immortal PMC" is highly misleading and
open for misinterpretation -- long-lived (but still potentially
mortal) PMCs appear to be the real issue here. I'd prefer we stick
with the "long-lived" phrase instead of "immortal".

And to err on the side of being extra explicit, I should be
clear that *all* of the long-lived PMCs that cause the slowdown
in the fib.nqp benchmark are created from PIR, at :load :init time.
There's no special-purpose PMCs involved, nor any calls to NCI
subroutines, nor anything that would yet be able to somehow
make use of a "Parrot_new_immortal_pmc()" interface.

Pm
Andrew Whitworth
2009-12-04 20:15:40 UTC
Permalink
Post by Patrick R. Michaud
Post by Andrew Whitworth
[...]
So if we call Parrot_new_immortal_pmc() to create the parent, and we
call it again to create the child, we've manually shown the
relationship that the two are in the same generation and we don't need
intergenerational pointers. However, if the parent is immortal and the
child is normal, the write barrier will detect that and add the
intergenerational pointer record.
Speaking from the perspective of an HLL developer and user, almost
nothing in the above paragraph makes any sense to me.  What would I do
in a language like Perl 5, Python, Perl 6, or even PIR to say "hey,
this object I'm creating is going to be part of an immortal data
structure, use the 'immortal PMC' creator instead of the 'mortal'
one"?  Or, if my compiler is supposed to be smart enough to figure it
out, then how will it know?  (This sounds like a variation of the
halting problem to me.)
Ah, good question. We're really talking about two different things
here and I'm glossing over the differences. There are really two
environments where PMCs and STRINGs are used: internal to Parrot, and
in "PIR-land".

In the general sense of a generational GC, we create all objects into
a "nursery" where we presume all objects are short-lived. We mark and
sweep the nursery regularly. However, if items in the nursery survive
mark and sweep (or survive it X times), we bump it up to the next
generation. Each successively older generation is marked and swept
less often. The heuristic being that the longer an item has survived,
the longer it is likely to survive. We gain huge savings because we
can make presumptions about the longevity of entire generations at
once and waste less effort on objects that statistically aren't going
anywhere.

So in a generational GC under normal operation, objects are promoted
to higher generations automatically. You as the HLL developer just
allocate like normal and GC takes care of the rest.

What I was talking about above are details of Parrot internals where,
as an optimization, we can skip the nursery and allocate directly from
older generations. This saves us the effort for the "trial by fire"
where GC determines the item is long-lived, because we know in some
cases ahead of time that certain PMCs are long-lived and some are
short-lived.
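
As a toy illustration of the promotion rule described above (the field names
and the threshold are made up):

enum { NUM_GENS = 3, PROMOTE_AFTER = 2 };

typedef struct FakePMC {
    int gen;           /* 0 = nursery; higher generations are swept less often */
    int survivals;     /* collections of its own generation survived so far    */
} FakePMC;

/* Called for each object found live when its generation is collected. */
static void on_survived_collection(FakePMC *pmc) {
    if (pmc->gen + 1 < NUM_GENS && ++pmc->survivals >= PROMOTE_AFTER) {
        pmc->gen += 1;          /* "trial by fire" passed: promote it */
        pmc->survivals = 0;
    }
}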
Post by Patrick R. Michaud
I also think the term "immortal PMC" is highly misleading and
open for misinterpretation -- long-lived (but still potentially
mortal) PMCs appear to be the real issue here.  I'd prefer we stick
with the "long-lived" phrase instead of "immortal".
Noted.

--Andrew Whitworth
Patrick R. Michaud
2009-12-04 17:45:30 UTC
Permalink
Post by Kevin Tew
Pmichaud is hinting at my primary concern.
Intergenerational pointer invariants should not be enforced manually.
Constant PMCS are a headache. We have had lots of "GC bugs" because
developers hung non-constant objects off of a constant pmc.
This is even worse if a specific PMC type is used in both constant
and non-constant settings.
[...]
In my previous message I failed to mention that "long lived PMC"
is not always the same as "constant PMC". I'd like for us to
make that clear distinction now, though, and not fall into the
trap of thinking that we're only talking about constants or PMCs
that are invariant throughout their lifetime. Certainly someone's
HLL program that parses a large file and builds a very large data
structure would end up with long-lived PMCs that aren't constant.
Indeed, this is exactly what happens with compilers such as NQP
at the moment.

The scary bit is the amount of performance hit we appear to be
taking even though the structures are not all that large.

Pm
Allison Randal
2009-12-05 11:34:22 UTC
Permalink
Post by Patrick R. Michaud
..when we're allocating a new PMC, how can we know in advance
that it will end up being a child of a methuselah PMC? Or do we
not need to know this (if we don't need to know it, how does
the requirement get met)?
They would have to be purely internal and only used when we can
guarantee parents and children are from the same pool.

It's a compromise solution, and like most compromises involves...
compromises. But, it's worth exploring some partial solutions if they
might help you before 2.0.
Post by Kevin Tew
Pmichaud is hinting at my primary concern.
Intergenerational pointer invariants should not be enforced manually.
Constant PMCS are a headache. We have had lots of "GC bugs" because
developers hung non-constant objects off of a constant pmc.
Agreed, and would eventually like to get rid of them.
Post by Kevin Tew
1. You can't separate objects into separate generations unless you can
track intergenerational pointers.
2. Tracking (or eliminating) intergenerational pointers based on
"developer coding policy invariants" (ie Constant PMCs can ONLY contain
or point to constant PMCS) is essentially reverting memory management
responsibilities back to the developer.
Coding policy based intergenerational management will be violated
(creating GC bugs) by adding new features and refactoring old code.
Unless I'm missing something obvious this seems likely to multiply the
types of bugs that constant PMCS introduce.
The key question in weighing the compromise is whether the majority of
the short-lived or long-lived PMCs are leaves or containers. Leaves work
well in separate pools; containers fall afoul of intergenerational
management, which could cost more than the gain of separate pools. From
chromatic's list we've got

Leaves:

* 208,173 FixedIntegerArray (contains non-GC elements)
187,444 Integer
313,394 String

Containers:

* 916,675 Context
* 890,230 CallSignature
* 879,236 RetContinuation
* 851,059 CallSignatureReturns
249,716 ResizablePMCArray
109,101 Hash

Of those, Context and RetContinuation could likely have all their
elements allocated from a short-lived pool. There's no way of knowing
from this list what ResizablePMCArray and Hash are used for.
CallSignature and CallSignatureReturns are explicitly short-lived
containers for user-space variables (in this model, user-space variables
are always the "general" pool), so heavily intergenerational. And if we
merge Context, CallSignature, CallSignatureReturns, and RetContinuation,
that leaves one heavily intergenerational container PMC at the top of
the chart.
Post by Andrew Whitworth
I'm also becoming very enamored with the idea of allowing stack-based
allocations for items that have very finite lifespans (think about
where we use constant_pmc_new now). Some macros to allocate simple,
short-lived PMCs and their attributes on the stack would take off a
lot of GC pressure in these cases. We wouldn't even need to call
VTABLE_destroy in most cases. I suspect that a relatively large number
of PMC allocations could be transformed in this way for a major
performance savings.
Again, this works okay for leaves, but not so well for containers (which
may have GC-able children). I don't immediately see advantages in purely
stack-based allocations over a young generation. Did you have more
thoughts on that?
Post by Andrew Whitworth
Explicitly merging PMCs is a good idea, but I also am liking the idea
of implicitly merging them into "super objects". If we can take a PMC
and lump it together into a memory block with the complete tree of all
it's delegate PMCs and STRINGs, we could get away with only marking
the block as a whole (not all the individual parts). Barring the
ability to store the complete tree in one place, we could store groups
of related PMCs together along with an array of escape pointers. The
mark process would be simplified to marking the block as a whole and
the array of escape pointers only. This idea is similar to generations
and inter-generational pointer lists, but uses locality to group items
together that are used together, instead of grouping by the number of
mark phases survived (which will tend to have the same result for
stable groups, but with worse performance).
It has potential... It would require careful analysis of how often we'll
see a gain from the grouping. We wouldn't see a benefit in the
CallSignature case, because it's specifically a container for non-local
children (from user-space). There's a certain amount of complexity
involved in deciding which PMCs can be grouped.
Context/CallSignature/RetContinuation could be manually grouped, but
then, would we get the same gains (and ultimately a cleaner
implementation of the behavior of those PMCs) by merging them into a
single PMC anyway?


Do people have other thoughts on part-way solutions we might implement
before full generational GC? We're just in the initial green-lighting
phase here; it doesn't have to be a fully-formed idea.

Allison
Vasily Chekalkin
2010-02-11 11:44:31 UTC
Permalink
Patrick R. Michaud wrote:
...
Post by Patrick R. Michaud
$ cat fib.nqp
Apparently we need a better test now:

***@icering:~/src/parrot$ time ./parrot-nqp fib.nqp
fib(28) = 317811

real 0m13.586s
user 0m13.413s
sys 0m0.060s
Post by Patrick R. Michaud
fib-nqp.pir
real 0m0.273s
user 0m0.256s
sys 0m0.020s
***@icering:~/src/parrot$ time ./parrot fib-nqp.pir
fib(28) = 317811

real 0m12.501s
user 0m12.033s
sys 0m0.048s

Less than a second of difference after the latest PCC improvements. Can
anyone provide a test which will favour generational GC? I'm going to
bring Boehm GC into Parrot again.
--
Bacek