Discussion:
Unusually high polymorphic dispatch costs?
Charles Oliver Nutter
2011-04-28 19:58:58 UTC
I'm trying to figure out why polymorphic dispatch is incredibly slow
in JRuby + indy. Take this benchmark, for example:

class A; def foo; end; end
class B; def foo; end; end

a = A.new
b = B.new

5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
b.foo } } }

a.foo and b.foo are bimorphic here. Under stock JRuby, using
CachingCallSite, this benchmark runs in about 0.13s per iteration.
Using invokedynamic, it takes 9s!!!

This is after a patch I just committed that caches the target method
handle for direct paths. I believe the only thing created when GWT
fails now is a new GWT.

Is it expected that rebinding a call site or constructing a GWT would
be very expensive? If yes...I will have to look into having a hard
failover to inline caching or a PIC-like handle chain for polymorphic
cases. That's not necessarily difficult. If no...I'm happy to update
my build and play with patches to see what's happening here.
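For readers following along, the rebinding pattern under discussion can be sketched roughly like this in plain Java. This is a minimal stand-alone illustration, not JRuby's actual call site code; the toString dispatch stands in for real Ruby method lookup, and all names are illustrative:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// A call site that guards on the receiver's class and, on every guard
// miss, resolves the target again and installs a brand-new GWT. The
// fresh GuardWithTest allocated per miss is the cost under discussion:
// a bimorphic site that ping-pongs between two classes rebinds forever.
public class GwtInlineCache {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    final MutableCallSite site =
        new MutableCallSite(MethodType.methodType(String.class, Object.class));
    final MethodHandle fallback;

    public GwtInlineCache() {
        try {
            fallback = LOOKUP.bind(this, "lookupAndRebind",
                MethodType.methodType(String.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        site.setTarget(fallback);
    }

    // Guard test: is the receiver of the cached class?
    public static boolean sameClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // Slow path: resolve the target (toString stands in for real method
    // lookup), replace the site's target with a new GWT, and call through.
    public String lookupAndRebind(Object receiver) throws Throwable {
        Class<?> klass = receiver.getClass();
        MethodHandle target = LOOKUP
            .findVirtual(klass, "toString", MethodType.methodType(String.class))
            .asType(MethodType.methodType(String.class, Object.class));
        MethodHandle test = LOOKUP
            .findStatic(GwtInlineCache.class, "sameClass",
                MethodType.methodType(boolean.class, Class.class, Object.class))
            .bindTo(klass);
        // A new GWT on every miss: guard hit -> cached target, miss -> here.
        site.setTarget(MethodHandles.guardWithTest(test, target, fallback));
        return (String) target.invoke(receiver);
    }

    public String call(Object receiver) {
        try {
            return (String) site.dynamicInvoker().invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

With two receiver classes alternating, every second call takes the slow path and allocates a fresh GWT, which is exactly the bimorphic worst case in the benchmark above.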

A sampled profile produced the following output:

          Stub + native   Method
  57.6%     0  +  5214    java.lang.invoke.MethodHandleNatives.init
  30.9%     0  +  2798    java.lang.invoke.MethodHandleNatives.init
   2.1%     0  +   189    java.lang.invoke.MethodHandleNatives.getTarget
   0.1%     0  +     7    java.lang.Object.getClass
   0.0%     0  +     3    java.lang.Class.isPrimitive
   0.0%     0  +     3    java.lang.System.arraycopy
  90.7%     0  +  8214    Total stub

Of course we all know how accurate sampled profiles are, but this is
a pretty dismal result.

I suspect that this polymorphic cost is a *major* factor in slowing
down some benchmarks under invokedynamic. FWIW, the above benchmark
without the a,b swap runs in 0.06s, more than 2x faster than stock
JRuby (yay!).

- Charlie
Christian Thalinger
2011-04-29 09:23:13 UTC
Post by Charles Oliver Nutter
I'm trying to figure out why polymorphic dispatch is incredibly slow
class A; def foo; end; end
class B; def foo; end; end
a = A.new
b = B.new
5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
b.foo } } }
a.foo and b.foo are bimorphic here. Under stock JRuby, using
CachingCallSite, this benchmark runs in about 0.13s per iteration.
Using invokedynamic, it takes 9s!!!
This is after a patch I just committed that caches the target method
handle for direct paths. I believe the only thing created when GWT
fails now is a new GWT.
Is it expected that rebinding a call site or constructing a GWT would
be very expensive?
Looking at the compiled methods, it seems so. There is a lot going on when creating a new GWT.
Post by Charles Oliver Nutter
If yes...I will have to look into having a hard
failover to inline caching or a PIC-like handle chain for polymorphic
cases. That's not necessarily difficult. If no...I'm happy to update
my build and play with patches to see what's happening here.
Stub + native Method
57.6% 0 + 5214 java.lang.invoke.MethodHandleNatives.init
30.9% 0 + 2798 java.lang.invoke.MethodHandleNatives.init
2.1% 0 + 189 java.lang.invoke.MethodHandleNatives.getTarget
0.1% 0 + 7 java.lang.Object.getClass
0.0% 0 + 3 java.lang.Class.isPrimitive
0.0% 0 + 3 java.lang.System.arraycopy
90.7% 0 + 8214 Total stub
Of course we all know how accurate sampled profiles are, but this is
a pretty dismal result.
But that seems to be correct. java.lang.invoke.MethodHandleImpl$GuardWithTest::<init> gets compiled and the inline tree is:

8892 135 java.lang.invoke.MethodHandleImpl$GuardWithTest::<init> (22 bytes)
@ 2 java.lang.invoke.BoundMethodHandle::<init> (37 bytes) inline (hot)
@ 2 java.lang.invoke.MethodHandle::type (5 bytes) inline (hot)
@ 7 java.lang.invoke.MethodType::dropParameterTypes (162 bytes) already compiled into a big method
@ 10 java.lang.invoke.MethodHandle::<init> (15 bytes) inline (hot)
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 5 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 20 java.lang.invoke.MethodHandle::type (5 bytes) inline (hot)
@ 24 java.lang.invoke.MethodType::parameterSlotDepth (30 bytes) inline (hot)
@ 26 java.lang.invoke.MethodTypeForm::parameterToArgSlot (9 bytes) inline (hot)
@ 33 java.lang.invoke.BoundMethodHandle::initTarget (7 bytes) inline (hot)
@ 3 java.lang.invoke.MethodHandleNatives::init (0 bytes) native method

Obviously that is VERY expensive.

-- Christian
Post by Charles Oliver Nutter
I suspect that this polymorphic cost is a *major* factor in slowing
down some benchmarks under invokedynamic. FWIW, the above benchmark
without the a,b swap runs in 0.06s, better than 2x faster than stock
JRuby (yay!).
- Charlie
_______________________________________________
mlvm-dev mailing list
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
Rémi Forax
2011-04-29 13:59:47 UTC
Post by Charles Oliver Nutter
I'm trying to figure out why polymorphic dispatch is incredibly slow
class A; def foo; end; end
class B; def foo; end; end
a = A.new
b = B.new
5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
b.foo } } }
a.foo and b.foo are bimorphic here. Under stock JRuby, using
CachingCallSite, this benchmark runs in about 0.13s per iteration.
Using invokedynamic, it takes 9s!!!
This is after a patch I just committed that caches the target method
handle for direct paths. I believe the only thing created when GWT
fails now is a new GWT.
If you want to emulate a bimorphic cache, you should have two GWTs,
so that no new GWT is constructed after all possible targets for the
two call sites have been discovered.

Relying on a mutable MethodHandle, i.e. a method handle that changes
on every call, will not work well because the JIT will not be able to
inline through it.
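A stand-alone sketch of that shape: once both targets are known, the two GWTs are built once and the resulting handle is never rebound, so the JIT can inline straight through the chain. The sameClass tests and stub targets are illustrative stand-ins, not anyone's real guards or methods:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// A fixed bimorphic chain: guard(String) -> guard(Integer) -> slow path.
public class BimorphicChain {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    public static boolean sameClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // Stand-ins for the two resolved targets and the generic slow path.
    public static String onString(Object o)  { return "String target"; }
    public static String onInteger(Object o) { return "Integer target"; }
    public static String slowPath(Object o)  { return "slow path"; }

    static final MethodHandle CHAIN = build();

    static MethodHandle build() {
        try {
            MethodType sig = MethodType.methodType(String.class, Object.class);
            MethodType testSig =
                MethodType.methodType(boolean.class, Class.class, Object.class);
            MethodHandle testString =
                LOOKUP.findStatic(BimorphicChain.class, "sameClass", testSig)
                      .bindTo(String.class);
            MethodHandle testInteger =
                LOOKUP.findStatic(BimorphicChain.class, "sameClass", testSig)
                      .bindTo(Integer.class);
            // Inner GWT: Integer check, falling through to the slow path.
            MethodHandle inner = MethodHandles.guardWithTest(testInteger,
                LOOKUP.findStatic(BimorphicChain.class, "onInteger", sig),
                LOOKUP.findStatic(BimorphicChain.class, "slowPath", sig));
            // Outer GWT: String check first, then the inner GWT.
            return MethodHandles.guardWithTest(testString,
                LOOKUP.findStatic(BimorphicChain.class, "onString", sig),
                inner);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static String call(Object receiver) {
        try {
            return (String) CHAIN.invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```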
Post by Charles Oliver Nutter
Is it expected that rebinding a call site or constructing a GWT would
be very expensive? If yes...I will have to look into having a hard
failover to inline caching or a PIC-like handle chain for polymorphic
cases. That's not necessarily difficult. If no...I'm happy to update
my build and play with patches to see what's happening here.
Yes, it's expensive.
The target of a CallSite should be stable.
So yes, it's expensive, and yes, it's intended.
Rémi
Ola Bini
2011-04-29 19:59:28 UTC
Hi,

Given that creating GWTs is expensive, is it a really bad idea to
create and bind them on a cache miss, then? My current logic for
call sites looks something like this:

invoke call site
  if fallback, check if current morphism is < 10
    if so, create a new GWT with the currently found method and
    an appropriate test

How would you recommend doing this without creating GWTs at runtime?
Having ten slots in the call site and pre-creating the GWTs that use them?
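That logic might look roughly like this as a self-contained Java sketch; the depth threshold, the sameClass test, and the toString dispatch are all illustrative stand-ins, not Ola's actual implementation:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// On a miss, prepend one more GWT for the newly seen class until the
// site has seen MAX_MORPHISM classes; after that the chain stops
// growing and further unknown classes always take the slow path.
public class ChainingSite {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    static final int MAX_MORPHISM = 10; // the "< 10" from the post

    final MutableCallSite site =
        new MutableCallSite(MethodType.methodType(String.class, Object.class));
    int depth = 0;

    public ChainingSite() {
        try {
            site.setTarget(LOOKUP.bind(this, "fallback",
                MethodType.methodType(String.class, Object.class)));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static boolean sameClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    public String fallback(Object receiver) throws Throwable {
        Class<?> klass = receiver.getClass();
        // toString stands in for real method lookup.
        MethodHandle target = LOOKUP
            .findVirtual(klass, "toString", MethodType.methodType(String.class))
            .asType(MethodType.methodType(String.class, Object.class));
        if (depth < MAX_MORPHISM) {
            MethodHandle test = LOOKUP
                .findStatic(ChainingSite.class, "sameClass",
                    MethodType.methodType(boolean.class, Class.class, Object.class))
                .bindTo(klass);
            // Prepend a guard for this class; misses fall through to the
            // existing chain, which still ends in this fallback.
            site.setTarget(MethodHandles.guardWithTest(test, target,
                site.getTarget()));
            depth++;
        }
        return (String) target.invoke(receiver);
    }

    public String call(Object receiver) {
        try {
            return (String) site.dynamicInvoker().invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public int depth() { return depth; }
}
```

A second call with an already-seen class hits its guard and never reaches the fallback, so the chain depth stays put.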

Cheers
--
Ola Bini (http://olabini.com)
Ioke - JRuby - ThoughtWorks

"Yields falsehood when quined" yields falsehood when quined.
Charles Oliver Nutter
2011-04-29 20:12:03 UTC
I think my strategy is going to be something like this:

* Bind first method encountered straight through with a new GWT
* Upon encountering nth new method, rebuild GWT chain to check first,
second, nth PIC-style
* For some N encountered methods, failover to a simple IC

Some variation of these (profiling GWT misses versus hits, direct call
+ IC chain, etc) should work around the GWT creation overhead.

I suppose you could pre-create GWTs, but that seems more cumbersome
since in many/most cases the test will be time-sensitive (e.g. in
JRuby where it's based on the current "generation" of a class, which
will change over time).

For monomorphic, GWT + direct handle will certainly be faster than
CachingCallSite. For some N-morphic, a GWT "PIC chain" will still be
faster than CCS. And then the failover should be no slower than CCS.
Not a bad trade-off.
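A generation-based guard of the kind described above might be sketched like this; RubyClass and its generation counter are hypothetical stand-ins for JRuby's actual metaclass machinery, and the fast/slow targets are stubs:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// The guard captures the class's generation at bind time; any later
// modification of the class bumps the generation and fails the guard,
// which is why such GWTs are hard to pre-create: the captured value
// is only valid for as long as the class stays unchanged.
public class GenerationGuard {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    // Hypothetical stand-in for a Ruby metaclass with a generation counter.
    public static final class RubyClass {
        public volatile int generation;
        public void modify() { generation++; } // e.g. monkey-patching
    }

    public static boolean generationMatches(RubyClass k, int cached,
                                            Object self) {
        return k.generation == cached;
    }

    public static String fast(Object self) { return "cached"; }
    public static String slow(Object self) { return "lookup"; }

    public static MethodHandle guard(RubyClass k) {
        try {
            MethodType sig = MethodType.methodType(String.class, Object.class);
            MethodHandle test = LOOKUP.findStatic(GenerationGuard.class,
                "generationMatches",
                MethodType.methodType(boolean.class, RubyClass.class,
                                      int.class, Object.class));
            // Bake in the class and its *current* generation.
            test = MethodHandles.insertArguments(test, 0, k, k.generation);
            return MethodHandles.guardWithTest(test,
                LOOKUP.findStatic(GenerationGuard.class, "fast", sig),
                LOOKUP.findStatic(GenerationGuard.class, "slow", sig));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static String call(MethodHandle mh, Object self) {
        try {
            return (String) mh.invoke(self);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```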

- Charlie
Charles Oliver Nutter
2011-04-30 07:27:14 UTC
I have pushed a change to JRuby master that fails over to a simple
inline cache after one failed GWT. This is not how I would want to do
it long-term, but it does bring some benchmarks back in line with
non-indy JRuby. Specifically, bench_richards 1000000 now runs in a
normal amount of time...but it's not faster than non-indy just yet (it
was *much* slower before). The simple bimorphic benchmark I posted
earlier also performs well...and runs slightly faster than non-indy.

In case it wasn't obvious: yay!

I have not yet implemented a bimorphic (or greater) PIC chain, but
that will come next.

Regarding why this is not sufficient: Since it's a hard failure, any
class modification that cascades into the cached method's class will
kill it permanently. I need some wiggle room in place to avoid the
chaotic early boot of an application from nuking a bunch of call
sites. For example...if someone monkey-patches String#gsub, any call
sites that have cached the non-patched gsub will permanently fall back to
the IC. Ideally, they'd only fail permanently if a call site remained
forever polymorphic...and even better would be to go back to GWT if a
call site appears to be polymorphic for a while but is actually
monomorphic in steady state (I don't think Hotspot does this...deopt
is only triggered when guards fail, and there's no speculative
re-profiling after reaching steady state).

I'd be interested in discussing fail strategies with others on this
list. Specifically, I'm debating a number of aspects of JRuby's
invokedynamic use:

* Number of GWT in a "PIC chain"
* Number of failed GWT to give up on the PIC chain
* Combination of PIC chain and IC
* MH usage in the IC (currently the IC "fail" logic in JRuby's indy
stuff is the same in general as CachingCallSite logic)
* Strategies for preventing the chaotic boot time from damaging call
sites unnecessarily

Basically the challenge is balancing inline caches, GWT-based indy
dispatch, GWT-based PIC, and class structure volatility to get the
best possible performance once a stable state is reached (and not
having late-run class structure changes from doing permanent damage).

I'd really love to hear from John or anyone else familiar with JVM
strategies for deopt. I know Hotspot will inline monomorphic and
bimorphic sites, and anything higher ends up turning into a CALL
(vtable? IC?).

This is fun :)

- Charlie

Charles Oliver Nutter
2011-04-30 07:34:45 UTC
On Sat, Apr 30, 2011 at 2:27 AM, Charles Oliver Nutter
Post by Charles Oliver Nutter
I have pushed a change to JRuby master that fails over to a simple
inline cache after one failed GWT. This is not how I would want to do
it long-term, but it does bring some benchmarks back in line with
non-indy JRuby. Specifically, bench_richards 1000000 now runs in a
normal amount of time...but it's not faster than non-indy just yet (it
was *much* slower before). The simple bimorphic benchmark I posted
earlier also performs well...and runs slightly faster than non-indy.
Just to clarify...I meant that indy WITH my fix is slightly slower
than normal JRuby, but indy WITHOUT my fix was *really* slow.

- Charlie
Charles Oliver Nutter
2011-04-30 21:13:37 UTC
Well it seems like the GWT-based PIC could be a *great* success, at
least compared to falling back on a single-entry IC. I will shortly
commit another patch to JRuby master that provides a configurable GWT
chain size before giving up and going to a simple IC, and the results
are spectacular.

First, recall that my little bimorphic bench took 9s to run when
constantly rebinding. GWT construction is expensive, CallSite
rebinding is expensive...we've clearly established that.

JRuby with an immediate failover to a simple IC performs roughly like
stock JRuby, with each million-cycle iteration taking about 0.13s, a massive
improvement over rebinding. No real surprise there, since it's
essentially just giving up and doing the same thing as JRuby's
CachingCallSite.

Now for the surprise. With a 2-deep GWT chain simulating a PIC, the
bimorphic benchmark performs almost *identically* to one that's
monomorphic. The iterations take roughly 0.054s when monomorphic and
0.056s when bimorphic. If it goes trimorphic, the hard failure kicks
in and it drops back to IC performance again.

Naturally I'm quite pleased by this result.

For those of you following along at home, the commit is 9890000 and
the property to configure is jruby.invokedynamic.maxfail, which should
equal the number of GWTs the site will attempt to chain before failing
over to a simple IC.

Awesome stuff.

- Charlie

Rémi Forax
2011-05-01 15:33:34 UTC
Post by Charles Oliver Nutter
Now for the surprise. With a 2-deep GWT chain simulating a PIC, the
bimorphic benchmark performs almost *identically* to one that's
monomorphic. The iterations take roughly 0.054s when monomorphic and
0.056s when bimorphic.
I remember seeing a patch from John to group
a cascade of typechecks on the same object into one check.
Post by Charles Oliver Nutter
If it goes trimorphic, the hard failure kicks
in and it drops back to IC performance again.
Naturally I'm quite pleased by this result.
For those of you following along at home, the commit is 9890000 and
the property to configure is jruby.invokedynamic.maxfail, which should
equal the number of GWT it will attempt to chain before failing over
to simple IC.
Awesome stuff.
Also, instead of using an IC as the fallback, you can use a hashmap
to handle megamorphic calls that are not mega-megamorphic.
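A sketch of that idea: past the PIC depth limit, the fallback consults a per-site class-to-handle map instead of a single-entry cache, so a megamorphic site still avoids a full method lookup on every call. The toString dispatch stands in for real method resolution:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.concurrent.ConcurrentHashMap;

// A per-call-site map from receiver class to resolved target handle.
// Unlike a single-entry IC, a megamorphic site keeps every target it
// has seen, at the cost of a map lookup per call on the slow path.
public class MapFallback {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    final ConcurrentHashMap<Class<?>, MethodHandle> cache =
        new ConcurrentHashMap<>();

    MethodHandle lookup(Class<?> klass) {
        return cache.computeIfAbsent(klass, k -> {
            try {
                // Stand-in for real method resolution.
                return LOOKUP.findVirtual(k, "toString",
                        MethodType.methodType(String.class))
                    .asType(MethodType.methodType(String.class, Object.class));
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        });
    }

    public String call(Object receiver) {
        try {
            return (String) lookup(receiver.getClass()).invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public int size() { return cache.size(); }
}
```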
cheers,
Rémi
Mark Roos
2011-05-01 18:01:07 UTC
With respect to chaining GWT handles:

Any thoughts on at what depth the chain should be converted to some
other form of dispatch? I am currently just letting them build.

In Smalltalk it seems that around 10-20 inline tests are about the same
as a hash lookup, but I would think it could be different after the JIT
gets to it.

Thanks
mark
Charles Oliver Nutter
2011-05-01 19:58:36 UTC
Post by Mark Roos
With respect to chaining gwt handles.
Any thoughts on at what depth the chain should be converted to some other
form of dispatch?
I am currently just letting them build
In smalltalk it seems that around 10-20 inline tests are about the same as a
hash lookup but I
would think it could be different after the JIT gets to it.
I think it's going to vary widely by how much code is in the target
methods, how expensive the test is, how "balanced" the chain is (e.g.
most frequently-hit at the front) and so on. Obviously if you have a
megamorphic site that's seeing hundreds of methods you wouldn't want
to chain GWT to that length.

I'm going to write a benchmark with variable-morphology(?) to see
where the line is in JRuby.

The other question is whether a mostly mono or bimorphic call site
should keep those at the head of the chain, speeding up those calls,
and just have the IC at the end of the chain as a catch-all. I'm also
trying to think of low-cost ways of tracking which branches are
getting followed in the chain, so they could potentially be rebalanced
periodically.
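One possible shape for that kind of tracking: fold a hit counter into the guard test itself, so each chain entry records how often it matches. All names here are illustrative, and note the trade-off this sketch makes explicit: the counter update adds work (and a side effect the JIT cannot eliminate) to the fast path whose cost is being weighed:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.concurrent.atomic.AtomicLong;

// A GWT whose test records a hit each time it succeeds, giving the
// runtime per-entry frequencies it could later use to rebalance the
// chain (most frequently hit guards moved to the front).
public class CountingGuard {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    public static final class Entry {
        public final AtomicLong hits = new AtomicLong();
        final Class<?> klass;
        public Entry(Class<?> klass) { this.klass = klass; }

        // Guard test that also counts successful matches.
        public boolean test(Object receiver) {
            boolean hit = receiver.getClass() == klass;
            if (hit) hits.incrementAndGet();
            return hit;
        }
    }

    // Stand-ins for the cached target and the rest of the chain.
    public static String hit(Object o)  { return "hit"; }
    public static String miss(Object o) { return "miss"; }

    public static MethodHandle guard(Entry entry) {
        try {
            MethodType sig = MethodType.methodType(String.class, Object.class);
            MethodHandle test = LOOKUP.bind(entry, "test",
                MethodType.methodType(boolean.class, Object.class));
            return MethodHandles.guardWithTest(test,
                LOOKUP.findStatic(CountingGuard.class, "hit", sig),
                LOOKUP.findStatic(CountingGuard.class, "miss", sig));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static String call(MethodHandle mh, Object receiver) {
        try {
            return (String) mh.invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```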

- Charlie
Mark Roos
2011-05-01 21:01:40 UTC
Permalink
In the Smalltalk I am porting, the solution they use is to just drop the
entire chain and let it reform. The assumption is that for any short
time period in the life of a program no sites are megamorphic. The
metric they use is the total amount of active code. Since they actually
drop all compiled code when it hits 1 meg, they have an easy way of
determining the amount of active code.

What I was thinking of was just dropping the GWT chain when it reached
some depth. The question is what depth. I think I'll add a counter to
the call site just to see what happens.

It would be nice to have a way to walk the chain, but I don't see a
method to get the original MH from an MH adapter.
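Since a composed handle can't be walked, the depth could live in the call site itself. A hypothetical sketch (MAX_DEPTH is an arbitrary placeholder for the "what depth?" question):

```java
import java.lang.invoke.*;

// Illustrative sketch: count chain links in the call site, and when the
// count crosses a threshold, drop the whole chain back to the naked
// fallback and let it reform, as in the Smalltalk approach.
public class DepthTrackedSite extends MutableCallSite {
    static final int MAX_DEPTH = 8;   // placeholder threshold
    private final MethodHandle fallback;
    private int depth = 0;

    public DepthTrackedSite(MethodHandle fallback) {
        super(fallback.type());
        this.fallback = fallback;
        setTarget(fallback);
    }

    // Use instead of setTarget() when prepending a new GWT link.
    public void addLink(MethodHandle newChainHead) {
        if (++depth > MAX_DEPTH) {
            depth = 0;
            setTarget(fallback);   // drop the chain; it will rebuild
        } else {
            setTarget(newChainHead);
        }
    }

    public int depth() { return depth; }
}
```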

regards
mark
Rémi Forax
2011-05-02 14:43:25 UTC
Permalink
Post by Mark Roos
In the Smalltalk I am porting the solution they use is to just drop
the entire chain and let it
reform. The assumption would be that for any short time period in the
life of a program no sites
are megamorphic. the metric they use is the total amount of active
code. Since they actually
drop all compiled code when it hits 1 meg they have an easy way of
determining the active
code amount.
There are some well-known APIs for which this assumption is not true:
GUI APIs like Swing (think setVisible) or parser AST visitors (think accept).

I don't know if it's still the case, but V8 (Google's JavaScript engine)
was using a similar idea: it flushed all call sites when a full GC
occurred.
Post by Mark Roos
What I was thinking of was just dropping the gwt chain when it reached
some depth. The
question is what depth. I think I'll add a counter to the call site
just to see what happens.
It would be nice to have a way to walk the chain but I don't see an
method to get the original
mh from a mh adapter.
MethodHandle reflection is one item on the issue list for "JSR 292, the
return".
Post by Mark Roos
regards
mark
cheers,
Rémi
Mark Roos
2011-05-02 17:07:17 UTC
Permalink
The system I am porting redoes its compilation based on the amount of
code generated, not the number of objects generated. So if the object
code in active execution is less than 1 meg, no flushes/redos occur, and
the examples you gave would not cause a code cache flush. I could see
where a system based on objects created or total object size would have
issues.

The closest metric I will have in the JVM version will be active call
sites. As I need to keep a collection of them in order to perform an
invalidate on code replacement, my thought was to dump them all at some
point, perhaps when it reaches the 2-5K range.

By the way, a code cache flush is pretty fast today: it adds about 25 ms
to rebuild 500 call sites. Once I add the invalidate I'll see how it
fares in Java.

regards
mark





Rémi Forax
2011-05-02 14:48:27 UTC
Permalink
Post by Charles Oliver Nutter
Post by Mark Roos
With respect to chaining gwt handles.
Any thoughts on at what depth the chain should be converted to some other
form of dispatch?
I am currently just letting them build
In smalltalk it seems that around 10-20 inline tests are about the same as a
hash lookup but I
would think it could be different after the JIT gets to it.
I think it's going to vary widely by how much code is in the target
methods, how expensive the test is, how "balanced" the chain is (e.g.
most frequently-hit at the front) and so on. Obviously if you have a
megamorphic site that's seeing hundreds of methods you wouldn't want
to chain GWT to that length.
I'm going to write a benchmark with variable-morphology(?) to see
where the line is in JRuby.
The other question is whether a mostly mono or bimorphic call site
should keep those at the head of the chain, speeding up those calls,
and just have the IC at the end of the chain as a catch-all. I'm also
trying to think of low-cost ways of tracking which branches are
getting followed in the chain, so they could potentially be rebalanced
periodically.
Could you also include a test with -XX:+TieredCompilation?
I wonder how profile information is propagated between tiers.
Post by Charles Oliver Nutter
- Charlie
cheers,
Rémi
Christian Thalinger
2011-05-05 11:14:15 UTC
Permalink
Post by Charles Oliver Nutter
I have pushed a change to JRuby master that fails over to a simple
inline cache after one failed GWT. This is not how I would want to do
it long-term, but it does bring some benchmarks back in line with
non-indy JRuby. Specifically, bench_richards 1000000 now runs in a
normal amount of time...but it's not faster than non-indy just yet (it
was *much* slower before).
richards looks much better now. Today's indy version is even faster on SPARC than non-indy:

$ bin/jruby.sh --server bench/bench_richards.rb 10 1000000
10.777000 0.000000 10.777000 ( 10.716000)
9.040000 0.000000 9.040000 ( 9.040000)
8.422000 0.000000 8.422000 ( 8.422000)
7.948000 0.000000 7.948000 ( 7.949000)
7.965000 0.000000 7.965000 ( 7.965000)
7.974000 0.000000 7.974000 ( 7.974000)
8.007000 0.000000 8.007000 ( 8.007000)
8.009000 0.000000 8.009000 ( 8.009000)
7.984000 0.000000 7.984000 ( 7.984000)
8.000000 0.000000 8.000000 ( 7.999000)

$ bin/jruby.sh --server -Xcompile.invokedynamic=true bench/bench_richards.rb 10 1000000
10.935000 0.000000 10.935000 ( 10.874000)
8.682000 0.000000 8.682000 ( 8.682000)
7.672000 0.000000 7.672000 ( 7.672000)
7.573000 0.000000 7.573000 ( 7.573000)
7.543000 0.000000 7.543000 ( 7.543000)
7.571000 0.000000 7.571000 ( 7.572000)
7.583000 0.000000 7.583000 ( 7.583000)
7.583000 0.000000 7.583000 ( 7.583000)
7.590000 0.000000 7.590000 ( 7.590000)
7.584000 0.000000 7.584000 ( 7.584000)

Exciting!

-- Christian

Rémi Forax
2011-05-01 15:06:06 UTC
Permalink
Post by Ola Bini
Hi,
Given that creating GWTs is expensive, is it a really bad idea to
create them and bind them on a cache miss then? My current logic for the
invoke call site is: on fallback, check if the current morphism is < 10.
If so, create a new GWT with the currently found method and an
appropriate test.
How would you recommend doing this without creating GWTs at runtime?
Having ten slots in the call site and precreating the GWTs that use them?
Cheers
Creating GWTs at runtime is fine unless you create one for each call.
So your logic is fine.

Rémi
Post by Ola Bini
Post by Rémi Forax
Post by Charles Oliver Nutter
I'm trying to figure out why polymorphic dispatch is incredibly slow
class A; def foo; end; end
class B; def foo; end; end
a = A.new
b = B.new
5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
b.foo } } }
a.foo and b.foo are bimorphic here. Under stock JRuby, using
CachingCallSite, this benchmark runs in about 0.13s per iteration.
Using invokedynamic, it takes 9s!!!
This is after a patch I just committed that caches the target method
handle for direct paths. I believe the only thing created when GWT
fails now is a new GWT.
If you want to emulate a bimorphic cache, you should have two GWTs,
so there is no construction of a new GWT after discovering all possible
targets for the two call sites.
Relying on a mutable MethodHandle, i.e. a method handle that changes
on every call, will not work well because the JIT will not be able to
inline through it.
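As a sketch of the two-GWT shape (A and B stand in for the Ruby classes in the benchmark; the slow path is a placeholder): the chain is built once, after both receiver classes are known, and is never rebound, so the JIT can inline through the stable composed handle.

```java
import java.lang.invoke.*;
import static java.lang.invoke.MethodType.methodType;

// Illustrative sketch of a bimorphic cache as two nested GWTs.
public class Bimorphic {
    public static class A { public String foo() { return "A#foo"; } }
    public static class B { public String foo() { return "B#foo"; } }

    static boolean isA(Object o) { return o instanceof A; }
    static boolean isB(Object o) { return o instanceof B; }
    static String slow(Object o) {
        throw new IllegalStateException("went megamorphic: " + o.getClass());
    }

    public static MethodHandle buildChain() throws ReflectiveOperationException {
        MethodHandles.Lookup l = MethodHandles.lookup();
        MethodType obj2str = methodType(String.class, Object.class);
        MethodHandle fooA = l.findVirtual(A.class, "foo", methodType(String.class)).asType(obj2str);
        MethodHandle fooB = l.findVirtual(B.class, "foo", methodType(String.class)).asType(obj2str);
        MethodHandle testA = l.findStatic(Bimorphic.class, "isA", methodType(boolean.class, Object.class));
        MethodHandle testB = l.findStatic(Bimorphic.class, "isB", methodType(boolean.class, Object.class));
        MethodHandle slow  = l.findStatic(Bimorphic.class, "slow", obj2str);
        // Outer GWT handles A, inner handles B; built once, never rebound.
        return MethodHandles.guardWithTest(testA, fooA,
               MethodHandles.guardWithTest(testB, fooB, slow));
    }
}
```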
Post by Charles Oliver Nutter
Is it expected that rebinding a call site or constructing a GWT would
be very expensive? If yes...I will have to look into having a hard
failover to inline caching or a PIC-like handle chain for polymorphic
cases. That's not necessarily difficult. If no...I'm happy to update
my build and play with patches to see what's happening here.
Yes, it's expensive.
The target of a CallSite should be stable.
So yes, it's expensive, and yes, it's intended.
Post by Charles Oliver Nutter
Stub + native Method
57.6% 0 + 5214 java.lang.invoke.MethodHandleNatives.init
30.9% 0 + 2798 java.lang.invoke.MethodHandleNatives.init
2.1% 0 + 189 java.lang.invoke.MethodHandleNatives.getTarget
0.1% 0 + 7 java.lang.Object.getClass
0.0% 0 + 3 java.lang.Class.isPrimitive
0.0% 0 + 3 java.lang.System.arraycopy
90.7% 0 + 8214 Total stub
Of course we all know how accurate sampled profiles are, but this is
a pretty dismal result.
I suspect that this polymorphic cost is a *major* factor in slowing
down some benchmarks under invokedynamic. FWIW, the above benchmark
without the a,b swap runs in 0.06s, better than 2x faster than stock
JRuby (yay!).
- Charlie
Rémi
_______________________________________________
mlvm-dev mailing list
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev