Discussion:
RFC: LRA for x86/x86-64 [0/9]
Vladimir Makarov
2012-09-27 22:56:45 UTC
Permalink
Originally I was to submit LRA at the very beginning of stage1 for
gcc4.9 as it was discussed on this summer GNU Tools Cauldron. After
some thinking, I've decided to submit LRA now but only switched on for
*x86/x86-64* target. The reasons for that are
o I am already pretty confident in LRA for this target with the
point of reliability, performance, code size, and compiler speed.
o I am confident that I can fix LRA bugs and pitfalls which might be
recognized and reported during stage2 and 3 of gcc4.8.
o Wider LRA testing for x86/x86-64 will make smoother a hard
transition of
other targets to LRA during gcc4.9 development.

During development of gcc4.9, I'd like to switch major targets to
LRA as it was planned before. I hope that all targets will be
switched for the next release after gcc4.9 (although it will be
dependent mostly on the target maintainers). When/if it is done,
reload and reload oriented machine-dependent code can be removed.

LRA project was reported on 2012 GNU Tools Cauldron
(http://gcc.gnu.org/wiki/cauldron2012). The presentation contains a
high-level description of LRA and the project status.

The following patches makes LRA working for x86/x86-64. Separately
patches mostly do nothing until the last patch switches on LRA for
x86/x86-64. Although compiler is bootstrapped after applying each
patch in given order, the division is only for review convenience.

Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.

The patches were successfully bootstrapped and tested for x86/x86-64.
Richard Guenther
2012-09-28 08:20:49 UTC
Permalink
Post by Vladimir Makarov
Originally I was to submit LRA at the very beginning of stage1 for
gcc4.9 as it was discussed on this summer GNU Tools Cauldron. After
some thinking, I've decided to submit LRA now but only switched on for
*x86/x86-64* target. The reasons for that are
o I am already pretty confident in LRA for this target with the
point of reliability, performance, code size, and compiler speed.
o I am confident that I can fix LRA bugs and pitfalls which might be
recognized and reported during stage2 and 3 of gcc4.8.
o Wider LRA testing for x86/x86-64 will make smoother a hard transition of
other targets to LRA during gcc4.9 development.
During development of gcc4.9, I'd like to switch major targets to
LRA as it was planned before. I hope that all targets will be
switched for the next release after gcc4.9 (although it will be
dependent mostly on the target maintainers). When/if it is done,
reload and reload oriented machine-dependent code can be removed.
LRA project was reported on 2012 GNU Tools Cauldron
(http://gcc.gnu.org/wiki/cauldron2012). The presentation contains a
high-level description of LRA and the project status.
The following patches makes LRA working for x86/x86-64. Separately
patches mostly do nothing until the last patch switches on LRA for
x86/x86-64. Although compiler is bootstrapped after applying each
patch in given order, the division is only for review convenience.
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
From a release-manager point of view the patch is "in time" for 4.8, in that
it is during stage1 (which I expect to last another two to four weeks). Note
that there is no such thing as "stage2" anymore but we go straight to
"stage3" (bugfixing mode, no new features) from stage1. After three months
of stage3 we go into regression-fixes only mode for as long as there are
release-blocking bugs (regressions with priority P1). You will have roughly
half a year to fix LRA for 4.8.0 after stage1 closes.

Thanks,
Richard.
Post by Vladimir Makarov
The patches were successfully bootstrapped and tested for x86/x86-64.
Steven Bosscher
2012-09-28 08:21:26 UTC
Permalink
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).

Ciao!
Steven
Vladimir Makarov
2012-09-28 15:21:10 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
I should look at this, Steven. Unfortunately, the compiler @ trunk
(without my patch) crashes on PR54156:

../../../trunk2/slow.cc: In function ‘void check_() [with NT =
CGAL::Gmpfi; int s = 3]’:
../../../trunk2/slow.cc:95489:6: internal compiler error: Segmentation fault
void check_(){
^
0x888adf crash_signal
/home/vmakarov/build1/trunk/gcc/gcc/toplev.c:335
0x8f4718 gimple_code
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:1126
0x8f4718 gimple_nop_p
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:4851
0x8f4718 walk_aliased_vdefs_1
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2204
0x8f50ed walk_aliased_vdefs(ao_ref_s*, tree_node*, bool (*)(ao_ref_s*,
tree_node*, void*), void*, bitmap_head_def**)
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2240
0x9018b5 propagate_necessity
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:909
0x9027b3 perform_tree_ssa_dce
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:1584
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.


PR26854 will take a lot of time to get the data. So I inform you when I
get them.

Related to compilation time. I reported the compilation time on GCC
Cauldron for -O2/-O3 and Richard asked me about -O0. I did not have the
answer that time. I checked the compilation time for
all_cp2k_fortran.f90 (500K lines of fortran). The compilation time (usr
and real time) was the same (no visible differences) for GCC with reload
and for GCC with LRA for -O0.

When I started LRA project, my major design decision was to reflect LRA
decision in RTL as much as possible. This simplifies LRA and make it
easy for maintanence and this is quite different from reload design. I
realized that time that LRA will be slower reload because of this
decision as reload works on specialized very fast representation and
roughly speaking changes RTL only once at the end of its work when it
decides that it can a generate a right RTL from the representation while
LRA takes most info from RTL (a bit simplified picture) and changes RTL
many times during its work.

For me it was a surprise that I managed the same GCC speed (or even 2-3%
faster for all_cp2k_fortran.f90 on x86) as reload after some hard work.
But if you check LRA through valgrind --tool=lackey, you will see that
LRA still, as I guessed before, executes more insns than reload. I think
that the same or better speed of LRA is achieved by better data and code
locality, and smaller code size which is translated in faster work of
the subsequent passes.
Steven Bosscher
2012-09-28 19:23:58 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
../../../trunk2/slow.cc: In function ‘void check_() [with NT = CGAL::Gmpfi;
../../../trunk2/slow.cc:95489:6: internal compiler error: Segmentation fault
void check_(){
^
0x888adf crash_signal
/home/vmakarov/build1/trunk/gcc/gcc/toplev.c:335
0x8f4718 gimple_code
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:1126
0x8f4718 gimple_nop_p
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:4851
0x8f4718 walk_aliased_vdefs_1
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2204
0x8f50ed walk_aliased_vdefs(ao_ref_s*, tree_node*, bool (*)(ao_ref_s*,
tree_node*, void*), void*, bitmap_head_def**)
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2240
0x9018b5 propagate_necessity
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:909
0x9027b3 perform_tree_ssa_dce
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:1584
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
Works for me on gcc17 at r191835, with a gcc configured like so:

"../trunk/configure --with-mpfr=/opt/cfarm/mpfr-latest
--with-gmp=/opt/cfarm/gmp-latest --with-mpc=/opt/cfarm/mpc-latest
--with-isl=/opt/cfarm/isl-latest --with-cloog=/opt/cfarm/cloog-latest
--enable-languages=c,c++ --disable-bootstrap --enable-checking=release
--with-gnu-as --with-gnu-ld
--with-as=/opt/cfarm/binutils-latest/bin/as
--with-ld=/opt/cfarm/binutils-latest/bin/ld"

Top 10 time consumers:
integrated_RA 191.66
df_live&initialized_regs 73.43
df_live_regs 72.25
out_of_ssa 45.21
tree_PTA 35.44
tree_SSA_incremental 26.53
remove_unused_locals 18.78
combiner 16.54
dominance_computation 14.44
register_information 14.20
TOTAL : 732.10

Note I'm using the simplified test case, see comment #14 in the PR.
You can just take the original test case
(http://gcc.gnu.org/bugzilla/attachment.cgi?id=27912) and apply this
patch:

--- slow.cc.orig 2012-09-28 21:07:58.000000000 +0200
+++ slow.cc 2012-09-28 21:08:38.000000000 +0200
@@ -95503,6 +95503,7 @@
check_<NT,Eigen::Dynamic>();
}
int main(){
+#if 0
{
typedef CGAL::Interval_nt<true> I1;
I1::Protector p1;
@@ -95517,11 +95518,14 @@
check<CGAL::Gmpz>();
check<CGAL::Gmpq>();
check<CGAL::Gmpfr>();
+#endif
check<CGAL::Gmpfi>();
+#if 0
check<CGAL::Quotient<CGAL::Gmpz> >();
check<CGAL::Lazy_exact_nt<CGAL::Gmpq> >();
check<CORE::BigInt>();
check<CORE::BigRat>();
check<CORE::BigFloat>();
check<CORE::Expr>();
+#endif
}

You can compile the test case with:
"./xgcc -B. -S -std=gnu++11 -O1 -frounding-math -ftime-report slow.cc"

Even with this path, the test case really is a great scalability
challange for the compiler :-) I never got the full test case to work
at -O1, and the simpler test case still blows up the compiler at -O2
and higher. At -O1 you need a machine with at least 8GB of memory.
More than half of that is for IRA+reload...

Ciao!
Steven
Markus Trippelsdorf
2012-09-28 19:47:36 UTC
Permalink
Post by Vladimir Makarov
Post by Steven Bosscher
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
../../../trunk2/slow.cc: In function ‘void check_() [with NT =
../../../trunk2/slow.cc:95489:6: internal compiler error: Segmentation fault
void check_(){
^
0x888adf crash_signal
/home/vmakarov/build1/trunk/gcc/gcc/toplev.c:335
0x8f4718 gimple_code
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:1126
0x8f4718 gimple_nop_p
/home/vmakarov/build1/trunk/gcc/gcc/gimple.h:4851
0x8f4718 walk_aliased_vdefs_1
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2204
0x8f50ed walk_aliased_vdefs(ao_ref_s*, tree_node*, bool (*)(ao_ref_s*,
tree_node*, void*), void*, bitmap_head_def**)
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-alias.c:2240
0x9018b5 propagate_necessity
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:909
0x9027b3 perform_tree_ssa_dce
/home/vmakarov/build1/trunk/gcc/gcc/tree-ssa-dce.c:1584
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54735
--
Markus
Vladimir Makarov
2012-09-29 01:01:31 UTC
Permalink
Post by Vladimir Makarov
On Fri, Sep 28, 2012 at 12:56 AM, Vladimir Makarov
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
PR26854 will take a lot of time to get the data. So I inform you when
I get them.
Related to compilation time. I reported the compilation time on GCC
Cauldron for -O2/-O3 and Richard asked me about -O0. I did not have
the answer that time. I checked the compilation time for
all_cp2k_fortran.f90 (500K lines of fortran). The compilation time
(usr and real time) was the same (no visible differences) for GCC with
reload and for GCC with LRA for -O0.
When I started LRA project, my major design decision was to reflect
LRA decision in RTL as much as possible. This simplifies LRA and make
it easy for maintanence and this is quite different from reload
design. I realized that time that LRA will be slower reload because of
this decision as reload works on specialized very fast representation
and roughly speaking changes RTL only once at the end of its work when
it decides that it can a generate a right RTL from the representation
while LRA takes most info from RTL (a bit simplified picture) and
changes RTL many times during its work.
For me it was a surprise that I managed the same GCC speed (or even
2-3% faster for all_cp2k_fortran.f90 on x86) as reload after some hard
work. But if you check LRA through valgrind --tool=lackey, you will
see that LRA still, as I guessed before, executes more insns than
reload. I think that the same or better speed of LRA is achieved by
better data and code locality, and smaller code size which is
translated in faster work of the subsequent passes.
Here is the numbers for PR26854: I used GCC with -O2 on Corei7-2600
(3.4Gzh) with 8GB memory.

The results are strange to me. Although user time is less for reload
but the real time is less for LRA. I checked it several times the
results are always the same with insignificant noise. Also the
difference in reload and LRA times reported by -ftime-report is
sometimes less than whole compile times difference.

The code size with LRA is always smaller (up to 15%). The data and
code locality is better with LRA (according reported major and minor
page faults).

I added logs for more details.

----------------------------------32-bit------------------------------------
Reload:
581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata
7730628maxresident)k
17558136inputs+21888outputs (307229major+8039646minor)pagefaults 0swaps
text data bss dec hex filename
2102836 213032 1568 2317436 235c7c rall.o
LRA:
629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata
7739172maxresident)k
13219464inputs+21432outputs (230546major+6874087minor)pagefaults 0swaps
text data bss dec hex filename
1707980 213032 1568 1922580 1d5614 lall.o

----------------------------------64-bit:-----------------------------------
Reload:
503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
7755104maxresident)k
23111632inputs+19824outputs (399464major+8804300minor)pagefaults 0swaps
text data bss dec hex filename
1595546 423792 3040 2022378 1edbea rall.o
LRA:
598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
7749424maxresident)k
19382904inputs+19752outputs (333306major+7343974minor)pagefaults 0swaps
text data bss dec hex filename
1581178 423792 3040 2008010 1ea3ca lall.o

Here is the numbers for PR54146 on the same machine with -O1 only for
64-bit (compiler reports error for -m32).

Reload:
350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata
7289044maxresident)k
8825824inputs+92800outputs (167226major+4631676minor)pagefaults 0swaps
text data bss dec hex filename
6556628 16 607 6557251 640e43 rs.o
LRA:
468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata
7728200maxresident)k
7407936inputs+91552outputs (140025major+5271594minor)pagefaults 0swaps
text data bss dec hex filename
6277934 16 607 6278557 5fcd9d ls.o
Steven Bosscher
2012-09-29 20:26:23 UTC
Permalink
Hi Vlad,

Thanks for the testing and the logs. You must have good hardware, your
timings are all ~3 times faster than mine :-)
Post by Vladimir Makarov
----------------------------------32-bit------------------------------------
581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata
629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata
This is a ~8% slowdown.
Post by Vladimir Makarov
----------------------------------64-bit:-----------------------------------
503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
This is a ~19% slowdown
Post by Vladimir Makarov
Here is the numbers for PR54146 on the same machine with -O1 only for
64-bit (compiler reports error for -m32).
Right, the test case is for 64-bits only, I think it's preprocessed
code for AMD64.
Post by Vladimir Makarov
350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata
468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata
This is a ~34% slowdown.

To put it in another perspective, here are my timings of trunk vs lra
(both checkouts done today):

trunk:
integrated RA : 181.68 (24%) usr 1.68 (11%) sys 183.43
(24%) wall 643564 kB (20%) ggc
reload : 11.00 ( 1%) usr 0.18 ( 1%) sys 11.17 (
1%) wall 32394 kB ( 1%) ggc
TOTAL : 741.64 14.76 756.41
3216164 kB

lra branch:
integrated RA : 174.65 (16%) usr 1.33 ( 8%) sys 176.33
(16%) wall 643560 kB (20%) ggc
reload : 399.69 (36%) usr 2.48 (15%) sys 402.69
(36%) wall 41852 kB ( 1%) ggc
TOTAL :1102.06 16.05 1120.83
3231738 kB

That's a 49% slowdown. The difference is completely accounted for by
the timing difference between reload and LRA.
(Timings done on gcc17, which is AMD Opteron(tm) Processor 8354 with
15GB ram, so swapping is no issue.)

It looks like the reload timevar is used for LRA. Why not have
multiple timevars, one per phase of LRA? Sth like the patch below
would be nice. This gives me the following timings:

integrated RA : 189.34 (16%) usr 1.84 (11%) sys 191.18
(16%) wall 643560 kB (20%) ggc
LRA non-specific : 59.82 ( 5%) usr 0.22 ( 1%) sys 60.12 (
5%) wall 18202 kB ( 1%) ggc
LRA virtuals eliminatenon: 56.79 ( 5%) usr 0.03 ( 0%) sys 56.80 (
5%) wall 19223 kB ( 1%) ggc
LRA reload inheritance : 6.41 ( 1%) usr 0.01 ( 0%) sys 6.42 (
1%) wall 1665 kB ( 0%) ggc
LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44
(15%) wall 2761 kB ( 0%) ggc
LRA hard reg assignment : 130.85 (11%) usr 0.20 ( 1%) sys 131.17
(11%) wall 0 kB ( 0%) ggc
LRA coalesce pseudo regs: 2.54 ( 0%) usr 0.00 ( 0%) sys 2.55 (
0%) wall 0 kB ( 0%) ggc
reload : 6.73 ( 1%) usr 0.20 ( 1%) sys 6.92 (
1%) wall 0 kB ( 0%) ggc

so the LRA "slowness" (for lack of a better word) appears to be due to
scalability problems in all sub-passes.

The code size changes are impressive, but I think that this kind of
slowdown should be addressed before making LRA the default for any
target.

Ciao!
Steven




Index: lra-assigns.c
===================================================================
--- lra-assigns.c (revision 191858)
+++ lra-assigns.c (working copy)
@@ -1261,6 +1261,8 @@ lra_assign (void)
bitmap_head insns_to_process;
bool no_spills_p;

+ timevar_push (TV_LRA_ASSIGN);
+
init_lives ();
sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
@@ -1312,5 +1314,6 @@ lra_assign (void)
free (sorted_pseudos);
free (sorted_reload_pseudos);
finish_lives ();
+ timevar_pop (TV_LRA_ASSIGN);
return no_spills_p;
}
Index: lra.c
===================================================================
--- lra.c (revision 191858)
+++ lra.c (working copy)
@@ -2193,6 +2193,7 @@ lra (FILE *f)

lra_dump_file = f;

+ timevar_push (TV_LRA);

init_insn_recog_data ();

@@ -2271,6 +2272,7 @@ lra (FILE *f)
to use a constant pool. */
lra_eliminate (false);
lra_inheritance ();
+
/* We need live ranges for lra_assign -- so build them. */
lra_create_live_ranges (true);
live_p = true;
@@ -2343,6 +2345,8 @@ lra (FILE *f)
#ifdef ENABLE_CHECKING
check_rtl (true);
#endif
+
+ timevar_pop (TV_LRA);
}

/* Called once per compiler to initialize LRA data once. */
Index: lra-eliminations.c
===================================================================
--- lra-eliminations.c (revision 191858)
+++ lra-eliminations.c (working copy)
@@ -1297,6 +1297,8 @@ lra_eliminate (bool final_p)
struct elim_table *ep;
int regs_num = max_reg_num ();

+ timevar_push (TV_LRA_ELIMINATE);
+
bitmap_initialize (&insns_with_changed_offsets, &reg_obstack);
if (final_p)
{
@@ -1317,7 +1319,7 @@ lra_eliminate (bool final_p)
{
update_reg_eliminate (&insns_with_changed_offsets);
if (bitmap_empty_p (&insns_with_changed_offsets))
- return;
+ goto lra_eliminate_done;
}
if (lra_dump_file != NULL)
{
@@ -1349,4 +1351,7 @@ lra_eliminate (bool final_p)
process_insn_for_elimination (insn, final_p);
}
bitmap_clear (&insns_with_changed_offsets);
+
+lra_eliminate_done:
+ timevar_pop (TV_LRA_ELIMINATE);
}
Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 191858)
+++ lra-lives.c (working copy)
@@ -962,6 +962,8 @@ lra_create_live_ranges (bool all_p)
basic_block bb;
int i, hard_regno, max_regno = max_reg_num ();

+ timevar_push (TV_LRA_CREATE_LIVE_RANGES);
+
complete_info_p = all_p;
if (lra_dump_file != NULL)
fprintf (lra_dump_file,
@@ -1016,6 +1018,7 @@ lra_create_live_ranges (bool all_p)
sparseset_free (pseudos_live_through_setjumps);
sparseset_free (pseudos_live);
compress_live_ranges ();
+ timevar_pop (TV_LRA_CREATE_LIVE_RANGES);
}

/* Finish all live ranges. */
Index: timevar.def
===================================================================
--- timevar.def (revision 191858)
+++ timevar.def (working copy)
@@ -223,10 +223,16 @@ DEFTIMEVAR (TV_REGMOVE , "
DEFTIMEVAR (TV_MODE_SWITCH , "mode switching")
DEFTIMEVAR (TV_SMS , "sms modulo scheduling")
DEFTIMEVAR (TV_SCHED , "scheduling")
-DEFTIMEVAR (TV_IRA , "integrated RA")
-DEFTIMEVAR (TV_RELOAD , "reload")
+DEFTIMEVAR (TV_IRA , "integrated RA")
+DEFTIMEVAR (TV_LRA , "LRA non-specific")
+DEFTIMEVAR (TV_LRA_ELIMINATE , "LRA virtuals eliminatenon")
+DEFTIMEVAR (TV_LRA_INHERITANCE , "LRA reload inheritance")
+DEFTIMEVAR (TV_LRA_CREATE_LIVE_RANGES, "LRA create live ranges")
+DEFTIMEVAR (TV_LRA_ASSIGN , "LRA hard reg assignment")
+DEFTIMEVAR (TV_LRA_COALESCE , "LRA coalesce pseudo regs")
+DEFTIMEVAR (TV_RELOAD , "reload")
DEFTIMEVAR (TV_RELOAD_CSE_REGS , "reload CSE regs")
-DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload")
+DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload")
DEFTIMEVAR (TV_REE , "ree")
DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue")
DEFTIMEVAR (TV_IFCVT2 , "if-conversion 2")
Index: lra-coalesce.c
===================================================================
--- lra-coalesce.c (revision 191858)
+++ lra-coalesce.c (working copy)
@@ -221,6 +221,8 @@ lra_coalesce (void)
bitmap_head involved_insns_bitmap, split_origin_bitmap;
bitmap_iterator bi;

+ timevar_push (TV_LRA_COALESCE);
+
if (lra_dump_file != NULL)
fprintf (lra_dump_file,
"\n********** Pseudos coalescing #%d: **********\n\n",
@@ -371,5 +373,6 @@ lra_coalesce (void)
free (sorted_moves);
free (next_coalesced_pseudo);
free (first_coalesced_pseudo);
+ timevar_pop (TV_LRA_COALESCE);
return coalesced_moves != 0;
}
Index: lra-constraints.c
===================================================================
--- lra-constraints.c (revision 191858)
+++ lra-constraints.c (working copy)
@@ -4859,6 +4859,8 @@ lra_inheritance (void)
basic_block bb, start_bb;
edge e;

+ timevar_push (TV_LRA_INHERITANCE);
+
lra_inheritance_iter++;
if (lra_dump_file != NULL)
fprintf (lra_dump_file, "\n********** Inheritance #%d: **********\n\n",
@@ -4907,6 +4909,8 @@ lra_inheritance (void)
bitmap_clear (&live_regs);
bitmap_clear (&check_only_regs);
free (usage_insns);
+
+ timevar_pop (TV_LRA_INHERITANCE);
}

^L
Richard Guenther
2012-09-30 16:01:23 UTC
Permalink
Post by Steven Bosscher
Hi Vlad,
Thanks for the testing and the logs. You must have good hardware, your
timings are all ~3 times faster than mine :-)
Post by Vladimir Makarov
----------------------------------32-bit------------------------------------
581.85user 29.91system 27:15.18elapsed 37%CPU (0avgtext+0avgdata
629.67user 24.16system 24:31.08elapsed 44%CPU (0avgtext+0avgdata
This is a ~8% slowdown.
Post by Vladimir Makarov
----------------------------------64-bit:-----------------------------------
503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
This is a ~19% slowdown
I think both measurements run into swap (low CPU utilization), from the LRA
numbers I'd say that LRA uses less memory but the timings are somewhat
useless with the swapping.
Post by Steven Bosscher
Post by Vladimir Makarov
Here is the numbers for PR54146 on the same machine with -O1 only for
64-bit (compiler reports error for -m32).
Right, the test case is for 64-bits only, I think it's preprocessed
code for AMD64.
Post by Vladimir Makarov
350.40user 21.59system 17:09.75elapsed 36%CPU (0avgtext+0avgdata
468.29user 21.35system 15:47.76elapsed 51%CPU (0avgtext+0avgdata
This is a ~34% slowdown.
To put it in another perspective, here are my timings of trunk vs lra
integrated RA : 181.68 (24%) usr 1.68 (11%) sys 183.43
(24%) wall 643564 kB (20%) ggc
reload : 11.00 ( 1%) usr 0.18 ( 1%) sys 11.17 (
1%) wall 32394 kB ( 1%) ggc
TOTAL : 741.64 14.76 756.41
3216164 kB
integrated RA : 174.65 (16%) usr 1.33 ( 8%) sys 176.33
(16%) wall 643560 kB (20%) ggc
reload : 399.69 (36%) usr 2.48 (15%) sys 402.69
(36%) wall 41852 kB ( 1%) ggc
TOTAL :1102.06 16.05 1120.83
3231738 kB
That's a 49% slowdown. The difference is completely accounted for by
the timing difference between reload and LRA.
(Timings done on gcc17, which is AMD Opteron(tm) Processor 8354 with
15GB ram, so swapping is no issue.)
It looks like the reload timevar is used for LRA. Why not have
multiple timevars, one per phase of LRA? Sth like the patch below
integrated RA : 189.34 (16%) usr 1.84 (11%) sys 191.18
(16%) wall 643560 kB (20%) ggc
LRA non-specific : 59.82 ( 5%) usr 0.22 ( 1%) sys 60.12 (
5%) wall 18202 kB ( 1%) ggc
LRA virtuals eliminatenon: 56.79 ( 5%) usr 0.03 ( 0%) sys 56.80 (
5%) wall 19223 kB ( 1%) ggc
LRA reload inheritance : 6.41 ( 1%) usr 0.01 ( 0%) sys 6.42 (
1%) wall 1665 kB ( 0%) ggc
LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44
(15%) wall 2761 kB ( 0%) ggc
LRA hard reg assignment : 130.85 (11%) usr 0.20 ( 1%) sys 131.17
(11%) wall 0 kB ( 0%) ggc
LRA coalesce pseudo regs: 2.54 ( 0%) usr 0.00 ( 0%) sys 2.55 (
0%) wall 0 kB ( 0%) ggc
reload : 6.73 ( 1%) usr 0.20 ( 1%) sys 6.92 (
1%) wall 0 kB ( 0%) ggc
so the LRA "slowness" (for lack of a better word) appears to be due to
scalability problems in all sub-passes.
It would be nice to see if LRA just has a larger constant cost factor
compared to reload or if it has bigger complexity.
Post by Steven Bosscher
The code size changes are impressive, but I think that this kind of
slowdown should be addressed before making LRA the default for any
target.
Certainly if it shows bigger complexity, not sure for the constant factor
(but for sure improvements are welcome).

I suppose there is the option to revert back to reload by default for
x86_64 as well for 4.8, right? That is, do both reload and LRA
co-exist for each target or is it a definite decision target by target?

Thanks,
Richard.
Post by Steven Bosscher
Ciao!
Steven
Index: lra-assigns.c
===================================================================
--- lra-assigns.c (revision 191858)
+++ lra-assigns.c (working copy)
@@ -1261,6 +1261,8 @@ lra_assign (void)
bitmap_head insns_to_process;
bool no_spills_p;
+ timevar_push (TV_LRA_ASSIGN);
+
init_lives ();
sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
@@ -1312,5 +1314,6 @@ lra_assign (void)
free (sorted_pseudos);
free (sorted_reload_pseudos);
finish_lives ();
+ timevar_pop (TV_LRA_ASSIGN);
return no_spills_p;
}
Index: lra.c
===================================================================
--- lra.c (revision 191858)
+++ lra.c (working copy)
@@ -2193,6 +2193,7 @@ lra (FILE *f)
lra_dump_file = f;
+ timevar_push (TV_LRA);
init_insn_recog_data ();
@@ -2271,6 +2272,7 @@ lra (FILE *f)
to use a constant pool. */
lra_eliminate (false);
lra_inheritance ();
+
/* We need live ranges for lra_assign -- so build them. */
lra_create_live_ranges (true);
live_p = true;
@@ -2343,6 +2345,8 @@ lra (FILE *f)
#ifdef ENABLE_CHECKING
check_rtl (true);
#endif
+
+ timevar_pop (TV_LRA);
}
/* Called once per compiler to initialize LRA data once. */
Index: lra-eliminations.c
===================================================================
--- lra-eliminations.c (revision 191858)
+++ lra-eliminations.c (working copy)
@@ -1297,6 +1297,8 @@ lra_eliminate (bool final_p)
struct elim_table *ep;
int regs_num = max_reg_num ();
+ timevar_push (TV_LRA_ELIMINATE);
+
bitmap_initialize (&insns_with_changed_offsets, &reg_obstack);
if (final_p)
{
@@ -1317,7 +1319,7 @@ lra_eliminate (bool final_p)
{
update_reg_eliminate (&insns_with_changed_offsets);
if (bitmap_empty_p (&insns_with_changed_offsets))
- return;
+ goto lra_eliminate_done;
}
if (lra_dump_file != NULL)
{
@@ -1349,4 +1351,7 @@ lra_eliminate (bool final_p)
process_insn_for_elimination (insn, final_p);
}
bitmap_clear (&insns_with_changed_offsets);
+
+ timevar_pop (TV_LRA_ELIMINATE);
}
Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 191858)
+++ lra-lives.c (working copy)
@@ -962,6 +962,8 @@ lra_create_live_ranges (bool all_p)
basic_block bb;
int i, hard_regno, max_regno = max_reg_num ();
+ timevar_push (TV_LRA_CREATE_LIVE_RANGES);
+
complete_info_p = all_p;
if (lra_dump_file != NULL)
fprintf (lra_dump_file,
@@ -1016,6 +1018,7 @@ lra_create_live_ranges (bool all_p)
sparseset_free (pseudos_live_through_setjumps);
sparseset_free (pseudos_live);
compress_live_ranges ();
+ timevar_pop (TV_LRA_CREATE_LIVE_RANGES);
}
/* Finish all live ranges. */
Index: timevar.def
===================================================================
--- timevar.def (revision 191858)
+++ timevar.def (working copy)
@@ -223,10 +223,16 @@ DEFTIMEVAR (TV_REGMOVE , "
DEFTIMEVAR (TV_MODE_SWITCH , "mode switching")
DEFTIMEVAR (TV_SMS , "sms modulo scheduling")
DEFTIMEVAR (TV_SCHED , "scheduling")
-DEFTIMEVAR (TV_IRA , "integrated RA")
-DEFTIMEVAR (TV_RELOAD , "reload")
+DEFTIMEVAR (TV_IRA , "integrated RA")
+DEFTIMEVAR (TV_LRA , "LRA non-specific")
+DEFTIMEVAR (TV_LRA_ELIMINATE , "LRA virtuals eliminatenon")
+DEFTIMEVAR (TV_LRA_INHERITANCE , "LRA reload inheritance")
+DEFTIMEVAR (TV_LRA_CREATE_LIVE_RANGES, "LRA create live ranges")
+DEFTIMEVAR (TV_LRA_ASSIGN , "LRA hard reg assignment")
+DEFTIMEVAR (TV_LRA_COALESCE , "LRA coalesce pseudo regs")
+DEFTIMEVAR (TV_RELOAD , "reload")
DEFTIMEVAR (TV_RELOAD_CSE_REGS , "reload CSE regs")
-DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload")
+DEFTIMEVAR (TV_GCSE_AFTER_RELOAD , "load CSE after reload")
DEFTIMEVAR (TV_REE , "ree")
DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue")
DEFTIMEVAR (TV_IFCVT2 , "if-conversion 2")
Index: lra-coalesce.c
===================================================================
--- lra-coalesce.c (revision 191858)
+++ lra-coalesce.c (working copy)
@@ -221,6 +221,8 @@ lra_coalesce (void)
bitmap_head involved_insns_bitmap, split_origin_bitmap;
bitmap_iterator bi;
+ timevar_push (TV_LRA_COALESCE);
+
if (lra_dump_file != NULL)
fprintf (lra_dump_file,
"\n********** Pseudos coalescing #%d: **********\n\n",
@@ -371,5 +373,6 @@ lra_coalesce (void)
free (sorted_moves);
free (next_coalesced_pseudo);
free (first_coalesced_pseudo);
+ timevar_pop (TV_LRA_COALESCE);
return coalesced_moves != 0;
}
Index: lra-constraints.c
===================================================================
--- lra-constraints.c (revision 191858)
+++ lra-constraints.c (working copy)
@@ -4859,6 +4859,8 @@ lra_inheritance (void)
basic_block bb, start_bb;
edge e;
+ timevar_push (TV_LRA_INHERITANCE);
+
lra_inheritance_iter++;
if (lra_dump_file != NULL)
fprintf (lra_dump_file, "\n********** Inheritance #%d: **********\n\n",
@@ -4907,6 +4909,8 @@ lra_inheritance (void)
bitmap_clear (&live_regs);
bitmap_clear (&check_only_regs);
free (usage_insns);
+
+ timevar_pop (TV_LRA_INHERITANCE);
}
^L
Steven Bosscher
2012-09-30 16:47:09 UTC
Permalink
On Sun, Sep 30, 2012 at 6:01 PM, Richard Guenther
Post by Richard Guenther
Post by Steven Bosscher
Post by Vladimir Makarov
----------------------------------64-bit:-----------------------------------
503.26user 36.54system 30:16.62elapsed 29%CPU (0avgtext+0avgdata
598.70user 30.90system 27:26.92elapsed 38%CPU (0avgtext+0avgdata
This is a ~19% slowdown
I think both measurements run into swap (low CPU utilization), from the LRA
numbers I'd say that LRA uses less memory but the timings are somewhat
useless with the swapping.
Not on gcc17. It has almost no swap to begin with, but the max.
resident size is less than half of the machine's RAM (~7GB max.
resident vs 16GB machine RAM). It obviously has to do with memory
behavior, but it's probably more a matter of size (>200,000 basic
blocks, >600,000 pseudos, etc., basic blocks with livein/liveout sets
with a cardinality in the 10,000s, etc.), not swapping.
Post by Richard Guenther
It would be nice to see if LRA just has a larger constant cost factor
compared to reload or if it has bigger complexity.
It is complexity in all typical measures of size (basic blocks, number
of insns, number of basic blocks), that's easily verified with
artificial test cases.
Post by Richard Guenther
Post by Steven Bosscher
The code size changes are impressive, but I think that this kind of
slowdown should be addressed before making LRA the default for any
target.
Certainly if it shows bigger complexity, not sure for the constant factor
(but for sure improvements are welcome).
I suppose there is the option to revert back to reload by default for
x86_64 as well for 4.8, right? That is, do both reload and LRA
co-exist for each target or is it a definite decision target by target?
Do you really want to have two such bug-sensitive paths through of the compiler?

Ciao!
Steven
Andi Kleen
2012-09-30 16:50:22 UTC
Permalink
Post by Richard Guenther
I think both measurements run into swap (low CPU utilization), from the LRA
numbers I'd say that LRA uses less memory but the timings are somewhat
useless with the swapping.
On Linux I would normally recommend to use

/usr/bin/time -f 'real=%e user=%U system=%S share=%P%% maxrss=%M ins=%I
outs=%O mfaults=%R waits=%w'

instead of plain time. It gives you much more information
(especially maxrss and waits), so it's more reliable to tell if you
have a memory problem or not.

-Andi
--
***@linux.intel.com -- Speaking for myself only
Steven Bosscher
2012-09-30 16:52:53 UTC
Permalink
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.

IRA already has scalability problems, let's not add more of that with LRA.

Ciao!
Steven
Richard Guenther
2012-09-30 17:03:21 UTC
Permalink
Post by Steven Bosscher
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.
That figure indeed makes IRA + LRA look bad. Did you by chance identify
anything obvious that can be done to improve the situation?

Thanks,
Richard.
Post by Steven Bosscher
IRA already has scalability problems, let's not add more of that with LRA.
Ciao!
Steven
Steven Bosscher
2012-09-30 20:42:39 UTC
Permalink
On Sun, Sep 30, 2012 at 7:03 PM, Richard Guenther
Post by Richard Guenther
Post by Steven Bosscher
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.
That figure indeed makes IRA + LRA look bad. Did you by chance identify
anything obvious that can be done to improve the situation?
Not really. It was what I was looking/hoping for with the multiple
timevars, but no cheese.

Ciao!
Steven
Vladimir Makarov
2012-09-30 22:55:05 UTC
Permalink
Post by Steven Bosscher
On Sun, Sep 30, 2012 at 7:03 PM, Richard Guenther
Post by Richard Guenther
Post by Steven Bosscher
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.
That figure indeed makes IRA + LRA look bad. Did you by chance identify
anything obvious that can be done to improve the situation?
Not really. It was what I was looking/hoping for with the multiple
timevars, but no cheese.
I spent a lot of time to speed up LRA code. So I don't think there is a
simple solution. The problem can be solved by using by simpler
algorithms which results in generation of worse code. It is not one
week work even for me.
Vladimir Makarov
2012-09-30 22:50:50 UTC
Permalink
Post by Richard Guenther
Post by Steven Bosscher
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.
That figure indeed makes IRA + LRA look bad. Did you by chance identify
anything obvious that can be done to improve the situation?
As I wrote, I don't see that LRA has a problem right now because even
on 8GB machine, GCC with LRA is 10% faster than GCC with reload with
real time point of view (not saying that LRA generates 15% smaller
code). And real time is what really matters for users.

But I think that LRA cpu time problem for this test can be fixed. But
I don't think I can fix it for 2 weeks. So if people believe that
current LRA behaviour on this PR is a stopper to include it into gcc4.8
than we should postpone its inclusion until gcc4.9 when I hope to fix it.

As for IRA, IRA uses Chaitin-Briggs algorithm which scales worse than
most other optimizations. So the bigger test, the bigger percent of IRA
in compilation time. I don't believe that somebdoy can achieve a better
code using other faster RA algorithms. LLVM has no such problem because
even their new RA (a big improvement for llvm3.0) is not based on CB
algorithm. It is still based on modification of linear-scan RA. It
would be interesting to check how other compilers behave on this test.
Particually Intel one is most interesting (but I have doubts that it
will be doing well because I saw programs when Intel compiler with
optimizations struggled more than 40 mins on a file compilation).

Still we can improve IRA behaviour from simple solutions like using a
fast algorithm (currently used for -O0) for huge functions or by
implementing division of function on smaller regions (but it is hard to
implement and it will not work well for tests when most pseudos have
very long live range). I will work on it when I am less busy.

About 14 years ago in Cygnus time, I worked on some problem from a
customer. GCC was not able to compile a big program by that time.
Fixing GCC would have required a lot of efforts. Finally, the customer
modifies its code to generate smaller functions and problem was gone. I
mean that we could spend a lot of efforts to fix corner cases, ignoring
improvements for majority of users. But it seems to me that I'll have
to work on these PRs.
Steven Bosscher
2012-09-30 23:15:05 UTC
Permalink
As I wrote, I don't see that LRA has a problem right now because even on
8GB machine, GCC with LRA is 10% faster than GCC with reload with real time
point of view (not saying that LRA generates 15% smaller code). And real
time is what really matters for users.
For me, those compile times I reported *are* real times.

But you are right that the test case is a bit extreme. Before GCC 4.8
other parts of the compiler also choked on it. Still, the test case
comes from real user's code (combination of Eigen library with MPFR),
and it shows scalability problems in LRA (and IRA) that one can't just
"explain away" with an "RA is just expensive" claim. The test case for
PR26854 is Brad Lucier's Scheme interpreter, that is also real user's
code.

FWIW, I had actually expected IRA to extremely well on this test case
because IRA is supposed to be a regional allocator and I had expected
that would help for scalability. But most of the region data
structures in IRA are designed to hold whole functions (e.g. several
per region arrays of size max_reg_num / max_insn_uid / ...) and that
appears to be a problem for IRA's memory foot print. Perhaps something
similar is going on with LRA?

Ciao!
Steven
Vladimir Makarov
2012-10-01 00:13:17 UTC
Permalink
Post by Steven Bosscher
As I wrote, I don't see that LRA has a problem right now because even on
8GB machine, GCC with LRA is 10% faster than GCC with reload with real time
point of view (not saying that LRA generates 15% smaller code). And real
time is what really matters for users.
For me, those compile times I reported *are* real times.
Sorry, I missed your data (it was buried in calculations of percents
from my data). I saw that on my machine maxrss was 8GB with a lot of
page faults and small cpu utillizations (about 30%). I guess you used
16GB machine and 16GB is enough for this test. Ok, I'll work on this
problem although I think it will take some time to solve it or make it
more tolerable. Although I think it is not right to pay attention only
to compilation time. See my reasons below.
Post by Steven Bosscher
But you are right that the test case is a bit extreme. Before GCC 4.8
other parts of the compiler also choked on it. Still, the test case
comes from real user's code (combination of Eigen library with MPFR),
and it shows scalability problems in LRA (and IRA) that one can't just
"explain away" with an "RA is just expensive" claim. The test case for
PR26854 is Brad Lucier's Scheme interpreter, that is also real user's
code.
I myself wrote a few interpreters, so I looked at the code of Scheme
interpreter.

It seems to me it is a computer generated code. So the first
solution would be generate a few functions instead of one. Generating a
huge function is not wise for performance critical applications because
compilers for this corner cases use simpler faster algorithms for
optimization generating worse code. By the way, I can solve the
compilation time problem by using simpler algorithms harming
performance. The author will be happy with compilation speed but will
be disappointed by saying 10% slower interpreter. I don't think it is a
solution the problem, it is creating a bigger problem. It seems to me I
have to do this :) Or if I tell him that waiting 40% more time he can
get 15% smaller code, I guess he would prefer this. Of course it is
interesting problem to speed up the compiler but we don't look at whole
picture when we solve compilation time by hurting the performance.

Scalability problem is a problem of computer generated programs and
usually there is simpler and better solution for this by generating
smaller functions.

By the way, I also found that the author uses label values. It is
not the best solution although there are still a lot of articles
recommending it. One switch is faster for modern computers. Anton Ertl
proposed to use several switches (one switch after each interpreter
insn) for better branch predictions but I found this work worse than one
switch solution at least for my interpreters.
Jakub Jelinek
2012-10-01 05:48:16 UTC
Permalink
Post by Vladimir Makarov
But I think that LRA cpu time problem for this test can be fixed.
But I don't think I can fix it for 2 weeks. So if people believe
that current LRA behaviour on this PR is a stopper to include it
into gcc4.8 than we should postpone its inclusion until gcc4.9 when
I hope to fix it.
I think this testcase shouldn't be a show stopper for LRA inclusion into
4.8, but something to look at for stage3.

I think a lot of GCC passes have scalability issues on that testcase,
that is why it must be compiled with -O1 and not higher optimization
options, so perhaps it would be enough to choose a faster algorithm
generating worse code for the huge functions and -O1.

And I agree it is primarily a bug in the generator that it creates such huge
functions, that can't perform very well.

Jakub
Jakub Jelinek
2012-10-01 07:16:53 UTC
Permalink
The test case compiles just fine at -O2, only VRP has trouble with it.
Let's try to stick with facts, not speculation.
I was talking about the other PR, PR26854, which from what I remember when
trying it myself and even the latest -O3 time reports from the reduced
testcase show that IRA/reload aren't there very significant (for -O3 IRA
takes ~ 6% and reload ~ 1%).
I've put a lot of hard work into it to fix almost all scalability problems
on this PR for gcc 4.8. LRA undoes all of that work. I understand it is
painful for some people to hear, but I remain of opinion that LRA cannot be
considered "ready" if it scales so much worse than everything else in the
compiler.
Judging the whole implementation from just these corner cases and not how it
performs on other testcases (SPEC, rebuild a distro, ...) is IMHO not the
right thing, if Vlad thinks the corner cases are fixable during stage3; IMHO
we should allow LRA in, worst case it can be disabled by default even for
i?86/x86_64.

Jakub
Steven Bosscher
2012-10-01 09:55:14 UTC
Permalink
Post by Jakub Jelinek
The test case compiles just fine at -O2, only VRP has trouble with it.
Let's try to stick with facts, not speculation.
I was talking about the other PR, PR26854, which from what I remember when
trying it myself and even the latest -O3 time reports from the reduced
testcase show that IRA/reload aren't there very significant (for -O3 IRA
takes ~ 6% and reload ~ 1%).
OK, but what does LRA take? Vlad's numbers for 64-bits and looking at user time:

Reload: 503.26user
LRA: 598.70user

So if reload is ~1% of 503s then that'd be ~5s. And the only
difference between the two timings is LRA instead of reload, so LRA
takes ~100s, or 20%.
Post by Jakub Jelinek
I've put a lot of hard work into it to fix almost all scalability problems
on this PR for gcc 4.8. LRA undoes all of that work. I understand it is
painful for some people to hear, but I remain of opinion that LRA cannot be
considered "ready" if it scales so much worse than everything else in the
compiler.
Judging the whole implementation from just these corner cases and not how it
performs on other testcases (SPEC, rebuild a distro, ...) is IMHO not the
right thing, if Vlad thinks the corner cases are fixable during stage3; IMHO
we should allow LRA in, worst case it can be disabled by default even for
i?86/x86_64.
I'd be asked to do a guest lecture on compiler construction (to be
clear: I'd be highly surprised if anyone would ask me to, but for sake
of argument, bear with me ;-) then I'd start by stating that
algorithms should be designed for the corner cases, because the
devil's always in the details.

But more to the point regarding stage3: It will already be a busy
stage3 if the other, probably even more significant, scalability
issues have to be fixed, i.e. var-tracking and macro expansion. And
there's also the symtab work that's bound to cause some interesting
bugs still to be shaken out. With all due respect to Vlad, and,
seriously, hats off to Vlad for tacking reload and coming up with a
much easier to understand and nicely phase-split replacement, I just
don't believe that these scalability issues can be addressed in
stage3.

It's now very late stage1, and LRA was originally scheduled for GCC
4.9. Why the sudden hurrying? Did I miss the 2 minute warning?

Ciao!
Steven
Steven Bosscher
2012-10-01 08:18:35 UTC
Permalink
[ Sorry for re-send, it seems that mobile gmail sends text/html and
the sourceware mailer daemon rejects that. ]
Post by Jakub Jelinek
I think this testcase shouldn't be a show stopper for LRA inclusion into
4.8, but something to look at for stage3.
I think a lot of GCC passes have scalability issues on that testcase,
that is why it must be compiled with -O1 and not higher optimization
options,
The test case compiles just fine at -O2, only VRP has trouble with it. Let's
try to stick with facts, not speculation.

And the test case is not generated, it is the Eigen template library applied
to mpfr.

I've put a lot of hard work into it to fix almost all scalability problems
on this PR for gcc 4.8. LRA undoes all of that work. I understand it is
painful for some people to hear, but I remain of opinion that LRA cannot be
considered "ready" if it scales so much worse than everything else in the
compiler.

Ciao!
Steven
Richard Guenther
2012-10-01 09:52:06 UTC
Permalink
Post by Jakub Jelinek
Post by Vladimir Makarov
But I think that LRA cpu time problem for this test can be fixed.
But I don't think I can fix it for 2 weeks. So if people believe
that current LRA behaviour on this PR is a stopper to include it
into gcc4.8 than we should postpone its inclusion until gcc4.9 when
I hope to fix it.
I think this testcase shouldn't be a show stopper for LRA inclusion into
4.8, but something to look at for stage3.
I agree here.
Post by Jakub Jelinek
I think a lot of GCC passes have scalability issues on that testcase,
that is why it must be compiled with -O1 and not higher optimization
options, so perhaps it would be enough to choose a faster algorithm
generating worse code for the huge functions and -O1.
Yes, we spent quite some time in making basic optimization work for
insane testcases (basically avoid quadratic or bigger complexity in any
IL size variable (number of basic-blocks, edges, instructions, pseudos, etc.)).

And indeed if you use -O2 we do have issues with existing passes (and even
at -O1 points-to analysis can wreck things, or even profile guessing - see
existing bugs for that). Basically I would tune -O1 towards being able to
compile and optimize insane testcases with memory and compile-time
requirements that are linear in any of the above complexity measures.

Thus, falling back to the -O0 register allocating strathegy at certain
thresholds for the above complexity measures is fine (existing IRA
for example has really bad scaling on the number of loops in the function,
but you can tweak with flags to make it not consider that).
Post by Jakub Jelinek
And I agree it is primarily a bug in the generator that it creates such huge
functions, that can't perform very well.
Well, not for -O2+, yes, but at least we should try(!) hard.

Thanks,
Richard.
Post by Jakub Jelinek
Jakub
Steven Bosscher
2012-10-01 10:01:36 UTC
Permalink
On Mon, Oct 1, 2012 at 11:52 AM, Richard Guenther
Post by Richard Guenther
Post by Jakub Jelinek
I think this testcase shouldn't be a show stopper for LRA inclusion into
4.8, but something to look at for stage3.
I agree here.
I would also agree if it were not for the fact that IRA is already a
scalability bottle-neck and that has been known for a long time, too.
I have no confidence at all that if LRA goes in now, these scalability
problems will be solved in stage3 or at any next release cycle. It's
always the same thing with GCC: Once a patch is in, everyone moves on
to the next fancy new thing dropping the not-quite-broken but also
not-quite-working things on the floor.

Ciao!
Steven
Jakub Jelinek
2012-10-01 10:14:18 UTC
Permalink
Post by Steven Bosscher
I would also agree if it were not for the fact that IRA is already a
scalability bottle-neck and that has been known for a long time, too.
I have no confidence at all that if LRA goes in now, these scalability
problems will be solved in stage3 or at any next release cycle. It's
always the same thing with GCC: Once a patch is in, everyone moves on
to the next fancy new thing dropping the not-quite-broken but also
not-quite-working things on the floor.
If we open a P1 bug for it for 4.8, then it will need to be resolved some
way before branching. I think Vlad is committed to bugfixing LRA, after
all the intent is for 4.9 to enable it on more (all?) targets, and all the
bugfixing and scalability work on LRA is needed for that anyway.

Jakub
Steven Bosscher
2012-10-01 10:30:03 UTC
Permalink
Post by Jakub Jelinek
Post by Steven Bosscher
I would also agree if it were not for the fact that IRA is already a
scalability bottle-neck and that has been known for a long time, too.
I have no confidence at all that if LRA goes in now, these scalability
problems will be solved in stage3 or at any next release cycle. It's
always the same thing with GCC: Once a patch is in, everyone moves on
to the next fancy new thing dropping the not-quite-broken but also
not-quite-working things on the floor.
If we open a P1 bug for it for 4.8, then it will need to be resolved some
way before branching. I think Vlad is committed to bugfixing LRA, after
all the intent is for 4.9 to enable it on more (all?) targets, and all the
bugfixing and scalability work on LRA is needed for that anyway.
I don't question Vlad's commitment, but the last time I met him, he
only had two hands just like everyone else.

But I've made my point and it seems that I'm not voting with the majority.

Ciao!
Steven
Bernd Schmidt
2012-10-01 10:30:48 UTC
Permalink
Post by Jakub Jelinek
Post by Steven Bosscher
I would also agree if it were not for the fact that IRA is already a
scalability bottle-neck and that has been known for a long time, too.
I have no confidence at all that if LRA goes in now, these scalability
problems will be solved in stage3 or at any next release cycle. It's
always the same thing with GCC: Once a patch is in, everyone moves on
to the next fancy new thing dropping the not-quite-broken but also
not-quite-working things on the floor.
If we open a P1 bug for it for 4.8, then it will need to be resolved some
way before branching. I think Vlad is committed to bugfixing LRA, after
all the intent is for 4.9 to enable it on more (all?) targets, and all the
bugfixing and scalability work on LRA is needed for that anyway.
Why can't this be done on the branch? We've made the mistake of rushing
things into mainline too early a few times before, we should have
learned by now. And adding more half transitions is not something we
really want either.


Bernd
Vladimir Makarov
2012-10-01 17:51:43 UTC
Permalink
Post by Bernd Schmidt
Post by Jakub Jelinek
Post by Steven Bosscher
I would also agree if it were not for the fact that IRA is already a
scalability bottle-neck and that has been known for a long time, too.
I have no confidence at all that if LRA goes in now, these scalability
problems will be solved in stage3 or at any next release cycle. It's
always the same thing with GCC: Once a patch is in, everyone moves on
to the next fancy new thing dropping the not-quite-broken but also
not-quite-working things on the floor.
If we open a P1 bug for it for 4.8, then it will need to be resolved some
way before branching. I think Vlad is committed to bugfixing LRA, after
all the intent is for 4.9 to enable it on more (all?) targets, and all the
bugfixing and scalability work on LRA is needed for that anyway.
Why can't this be done on the branch? We've made the mistake of rushing
things into mainline too early a few times before, we should have
learned by now. And adding more half transitions is not something we
really want either.
I should clearly express that the transition will be not happen for
short time because of the task complexity. I believe that lra will
coexist with reload for 1-2 releases. I only ported LRA for 9 major
targets. The transition completion will be dependent on secondary
target maintainers too because I alone can not do porting LRA for all
supported targets. It was discussed with a lot of people on 2012 GNU
Tools Cauldron. Maintenance of LRA on the branch is a big burden, even
x86-64 is sometimes broken after merge with the trunk.

When I proposed merge LRA to gcc4.8, I had in mind that:
o moving most changes from LRA branch will help LRA maintenance on
the branch and I'll have more time to work on other targets and problems.
o the earlier we start the transition, the better it will be for LRA
because LRA on the trunk will have more feedback and better testing.

I've chosen x86/x86-64 for this because I am confident in this port. On
majority of tests, it generates faster, smaller code (even for these two
extreme tests it generates 15% smaller code) for less time. IMO, the
slow compilation of the extreme tests are much less important than what
I've just mentioned.

But because I got clear objections from at least two people and no clear
support for the LRA inclusion (there were just no objections to include
it), I will not insists on LRA merge now.

I believe in the importance of this work as LLVM catches GCC on RA front
by implementing a new RA for LLVM3.0. I believe we should get rid off
reload as outdated, hard to maintain, and preventing implementation of
new RA optimizations.

In any case submitting the patches was a good thing to do because I got
a lot of feedback. I still appreciate any comments on the patches.
Ian Lance Taylor
2012-10-01 18:55:56 UTC
Permalink
o moving most changes from LRA branch will help LRA maintenance on the
branch and I'll have more time to work on other targets and problems.
o the earlier we start the transition, the better it will be for LRA
because LRA on the trunk will have more feedback and better testing.
I've chosen x86/x86-64 for this because I am confident in this port. On
majority of tests, it generates faster, smaller code (even for these two
extreme tests it generates 15% smaller code) for less time. IMO, the slow
compilation of the extreme tests are much less important than what I've just
mentioned.
But because I got clear objections from at least two people and no clear
support for the LRA inclusion (there were just no objections to include it),
I will not insists on LRA merge now.
I believe that we should proceed with the LRA merge as Vlad has
proposed, and treat the compilation time slowdowns on specific test
cases as bugs to be addressed.

Clearly these slowdowns are not good. However, requiring significant
work like LRA to be better or equal to the current code in every
single case is making the perfect the enemy of the good. We must
weigh the benefits and drawbacks, not require that there be no
drawbacks at all. In this case I believe that the benefits of LRA
significantly outweigh the drawbacks.

Steven is correct in saying that there is a tendency to move on and
never address GCC bugs. However, there is also a counter-vailing
tendency to fix GCC bugs. Anyhow I'm certainly not saying that in all
cases it's OK to accept a merge with regressions; I'm saying that in
this specific case it is OK.

(I say all this based on Vlad's descriptions, I have not actually
looked at the patches.)

Ian
David Miller
2012-10-01 19:19:24 UTC
Permalink
From: Ian Lance Taylor <***@google.com>
Date: Mon, 1 Oct 2012 11:55:56 -0700
Post by Ian Lance Taylor
Steven is correct in saying that there is a tendency to move on and
never address GCC bugs. However, there is also a counter-vailing
tendency to fix GCC bugs. Anyhow I'm certainly not saying that in all
cases it's OK to accept a merge with regressions; I'm saying that in
this specific case it is OK.
I think it's more important in this case to recognize Steven's real
point, which is that for an identical situation (IRA), and with an
identical patch author, we had similar bugs. They were promised to be
worked on, and yet some of those regressions are still very much with
us.

The likelyhood of a repeat is therefore very real.

I really don't have a lot of confidence given what has happened in
the past. I also don't understand what's so evil about sorting this
out on a branch. It's the perfect carrot to get the compile time
regressions fixed.
Vladimir Makarov
2012-10-01 19:51:48 UTC
Permalink
Post by David Miller
Date: Mon, 1 Oct 2012 11:55:56 -0700
Post by Ian Lance Taylor
Steven is correct in saying that there is a tendency to move on and
never address GCC bugs. However, there is also a counter-vailing
tendency to fix GCC bugs. Anyhow I'm certainly not saying that in all
cases it's OK to accept a merge with regressions; I'm saying that in
this specific case it is OK.
I think it's more important in this case to recognize Steven's real
point, which is that for an identical situation (IRA), and with an
identical patch author, we had similar bugs. They were promised to be
worked on, and yet some of those regressions are still very much with
us.
That is not true. I worked on many compiler time regression bugs. I
remeber one serious degradation of compilation time on
all_cp2k_gfortran.f90. I solved the problem and make IRA working faster
and generating much better code than the old RA.

http://blog.gmane.org/gmane.comp.gcc.patches/month=20080501/page=15

About other two mentioned PRs by Steven:

PR26854. I worked on this bug even when IRA was on the branch and make
again GCC with IRA 5% faster on this test than GCC with the old RA.

PR 54146 is 3 months old. There were a lot work on other optimizations
before IRA became important. It happens only 2 months ago. I had no
time to work on it but I am going to.

People sometimes see that RA takes a lot of compilation time but it is
in the nature of RA. I'd recommend first to check how the old RA
behaves and then call it a degradation.

And please, don't listen just one side.
Post by David Miller
The likelyhood of a repeat is therefore very real.
I really don't have a lot of confidence given what has happened in
the past. I also don't understand what's so evil about sorting this
out on a branch. It's the perfect carrot to get the compile time
regressions fixed.
Wrong assumptions result in wrong conclusions.
Steven Bosscher
2012-10-01 20:24:34 UTC
Permalink
Post by David Miller
I think it's more important in this case to recognize Steven's real
point, which is that for an identical situation (IRA), and with an
identical patch author, we had similar bugs. They were promised to be
worked on, and yet some of those regressions are still very much with
us.
That is not true. I worked on many compiler time regression bugs. I remeber
one serious degradation of compilation time on all_cp2k_gfortran.f90. I
solved the problem and make IRA working faster and generating much better
code than the old RA.
http://blog.gmane.org/gmane.comp.gcc.patches/month=20080501/page=15
PR26854. I worked on this bug even when IRA was on the branch and make
again GCC with IRA 5% faster on this test than GCC with the old RA.
PR 54146 is 3 months old. There were a lot work on other optimizations
before IRA became important. It happens only 2 months ago. I had no time to
work on it but I am going to.
This is also not quite true, see PR37448, which shows the problems as
the test case for PR54146.

I just think scalability is a very important issue. If some pass or
algorithm scales bad on some measure, then users _will_ run into that
at some point and report bugs about it (if you're lucky enough to have
a user patient enough to sit out the long compile time :-) ). Also,
good scalability opens up opportunities. For example, historically GCC
has been conservative on inlining heuristics to avoid compile time
explosions. I think it's better to address the causes of that
explosion and to avoid introducing new potential bottlenecks.
People sometimes see that RA takes a lot of compilation time but it is in
the nature of RA. I'd recommend first to check how the old RA behaves and
then call it a degradation.
There's no question that RA is one of the hardest problems the
compiler has to solve, being NP-complete and all that. I like LRA's
iterative approach, but if you know you're going to solve a hard
problem with a number potentially expensive iterations, there's even
more reason to make scalability a design goal!

As I said earlier in this thread, I was really looking forward to IRA
at the time you worked on it, because it is supposed to be a regional
allocator and I had expected that to mean it could, well, allocate
per-region which is usually very helpful for scalability (partition
your function and insert compensation code on strategically picked
region boundaries). But that's not what IRA has turned out to be.
(Instead, its regional nature is one of the reasons for its
scalability problems.) IRA is certainly not worse than old global.c
in very many ways, and LRA looks like a well thought-through and
welcome replacement of old reload. But scalability is an issue in the
design of IRA and LRA looks to be the same in that regard.

Ciao!
Steven
Vladimir Makarov
2012-10-02 01:11:54 UTC
Permalink
Post by Steven Bosscher
Post by David Miller
I think it's more important in this case to recognize Steven's real
point, which is that for an identical situation (IRA), and with an
identical patch author, we had similar bugs. They were promised to be
worked on, and yet some of those regressions are still very much with
us.
That is not true. I worked on many compiler time regression bugs. I remeber
one serious degradation of compilation time on all_cp2k_gfortran.f90. I
solved the problem and make IRA working faster and generating much better
code than the old RA.
http://blog.gmane.org/gmane.comp.gcc.patches/month=20080501/page=15
PR26854. I worked on this bug even when IRA was on the branch and make
again GCC with IRA 5% faster on this test than GCC with the old RA.
PR 54146 is 3 months old. There were a lot work on other optimizations
before IRA became important. It happens only 2 months ago. I had no time to
work on it but I am going to.
This is also not quite true, see PR37448, which shows the problems as
the test case for PR54146.
I worked on this PR too and did some progress but it was much less than
on other PRs.

Sometimes I think that I should have maintained the old RA to compare
with it and show that IRA is still a step ahead. Unfortunately,
comparison with old releases has no sense now because more aggressive
optimizations (inlining).
Post by Steven Bosscher
I just think scalability is a very important issue. If some pass or
algorithm scales bad on some measure, then users _will_ run into that
at some point and report bugs about it (if you're lucky enough to have
a user patient enough to sit out the long compile time :-) ). Also,
good scalability opens up opportunities. For example, historically GCC
has been conservative on inlining heuristics to avoid compile time
explosions. I think it's better to address the causes of that
explosion and to avoid introducing new potential bottlenecks.
As I wrote, scalability sometimes misleading as in case PR (a Scheme
interpreter) where 30% more compilation time with LRA is translated into
15% decrease in code size and most probably better performance. Now we
can manage to achieve scalability with worse performance but that is not
what user expects and even worse he did know it. It is even a bigger
problem IMO.

Ideally, we should have more scalable algorithms as fallback but we
should warn the programmer that worse performance algorithms are used
and he could achieve a better performance by dividing a huge function
into several ones.
Post by Steven Bosscher
People sometimes see that RA takes a lot of compilation time but it is in
the nature of RA. I'd recommend first to check how the old RA behaves and
then call it a degradation.
There's no question that RA is one of the hardest problems the
compiler has to solve, being NP-complete and all that. I like LRA's
iterative approach, but if you know you're going to solve a hard
problem with a number potentially expensive iterations, there's even
more reason to make scalability a design goal!
When I designed IRA I kept this goal in my mind too. Although my first
priority was the performance. The old RA was ok when the register
pressure was not high. I knew aggressive inlining and LTO were coming.
And my goal was to design RA generating a good code for bigger programs
with much higher register pressure where the old RA drawbacks would be
obvious.
Post by Steven Bosscher
As I said earlier in this thread, I was really looking forward to IRA
at the time you worked on it, because it is supposed to be a regional
allocator and I had expected that to mean it could, well, allocate
per-region which is usually very helpful for scalability (partition
your function and insert compensation code on strategically picked
region boundaries). But that's not what IRA has turned out to be.
(Instead, its regional nature is one of the reasons for its
scalability problems.) IRA is certainly not worse than old global.c
in very many ways, and LRA looks like a well thought-through and
welcome replacement of old reload. But scalability is an issue in the
design of IRA and LRA looks to be the same in that regard.
Regional allocator means that allocation is made taking mostly a region
into account to achieve a better allocation. But a good regional RA
(e.g. Callahan-Koblenz or Intel fusion based RA) takes interactions with
other regions too and such allocators might be even slower than
non-regional. Although I should say that in my impressions CK-allocator
(I tried and implemented this about 6 years ago) is probably better
scalable than IRA but it generates worse code.
Steven Bosscher
2012-10-01 20:03:29 UTC
Permalink
Post by David Miller
Date: Mon, 1 Oct 2012 11:55:56 -0700
Post by Ian Lance Taylor
Steven is correct in saying that there is a tendency to move on and
never address GCC bugs. However, there is also a counter-vailing
tendency to fix GCC bugs. Anyhow I'm certainly not saying that in all
cases it's OK to accept a merge with regressions; I'm saying that in
this specific case it is OK.
I think it's more important in this case to recognize Steven's real
point, which is that for an identical situation (IRA), and with an
identical patch author, we had similar bugs. They were promised to be
worked on, and yet some of those regressions are still very much with
us.
My point is not to single out Vlad here! I don't think this patch
author is any worse or better than the next one. There are other
examples enough, e.g. VRP is from other contributors and it has had a
few horrible pieces of code from the start that just don't get
addressed, or var-tracking for which cleaning up a few serious compile
time problems will be a Big Job for stage3. It's the general pattern
that worries me.

Ciao!
Steven
Steven Bosscher
2012-10-01 10:10:40 UTC
Permalink
On Sun, Sep 30, 2012 at 7:03 PM, Richard Guenther
Post by Richard Guenther
Post by Steven Bosscher
Hi,
Post by Steven Bosscher
integrated RA : 189.34 (16%) usr
LRA non-specific : 59.82 ( 5%) usr
LRA virtuals eliminatenon: 56.79 ( 5%) usr
LRA create live ranges : 175.30 (15%) usr
LRA hard reg assignment : 130.85 (11%) usr
The IRA pass is slower than the next-slowest pass (tree PRA) by almost
a factor 2.5. Each of the individually-measured *phases* of LRA is
slower than the complete IRA *pass*. These 5 timevars together make up
for 52% of all compile time.
That figure indeed makes IRA + LRA look bad. Did you by chance identify
anything obvious that can be done to improve the situation?
The " LRA create live range" time is mostly spent in merge_live_ranges
walking lists. Perhaps the live ranges can be represented better with
a sorted VEC, so that the start and finish points can be looked up on
log-time instead of linear.

Ciao!
Steven
Steven Bosscher
2012-10-01 11:19:55 UTC
Permalink
Post by Steven Bosscher
The " LRA create live range" time is mostly spent in merge_live_ranges
walking lists.
Hmm no, that's just gcc17's ancient debugger telling me lies.
lra_live_range_in_p is not even used.

/me upgrades to something newer than gdb 6.8...

Ciao!
Steven
Steven Bosscher
2012-09-30 23:11:25 UTC
Permalink
Actually, I don't see there is a problem with LRA right now. I think we
should first to solve a whole compiler memory footprint problem for this
test because cpu utilization is very small for this test. On my machine
with 8GB, the maximal resident space achieves almost 8GB.
Sure. But note that up to IRA, the max. resident memory size of the
test case is "only" 3.6 GB. IRA/reload allocate more than 4GB,
doubling the foot print. If you want to solve that first, that'd be
great of course...

Ciao!
Steven
Steven Bosscher
2012-10-02 22:36:47 UTC
Permalink
Post by Steven Bosscher
Actually, I don't see there is a problem with LRA right now. I think we
should first to solve a whole compiler memory footprint problem for this
test because cpu utilization is very small for this test. On my machine
with 8GB, the maximal resident space achieves almost 8GB.
Sure. But note that up to IRA, the max. resident memory size of the
test case is "only" 3.6 GB. IRA/reload allocate more than 4GB,
doubling the foot print. If you want to solve that first, that'd be
great of course...
BTW, I get these numbers from a hack I've made in passes.c to trace
the resident memory size and the size of the resident bitmap obstacks
when I was working on reducing the memory foot print of the test case
a couple of months ago. It's obviously not something I'd propose for
including in the trunk, but I've found this hack to be quite helpful
to identify where all the memory goes.

The output looks like this (for the LRA branch on PR54146):

...
current pass = mode_sw (193) 362020
3625951232 3581693952 9289728 17849088 16256
current pass = asmcons (194) 362020
3625951232 3581693952 9289728 17849088 16256
current pass = ira (197) 362020
3625951232 3581693952 9289728 17849088 16256
current pass = reload (198) 362020
6812741632 6732029952 9289728 17849088 105664
...

Note the big jump in the 2nd and 3rd number from ira to reload. That's
a big jump in memory foot print between the start of the IRA pass and
the start of the reload pass, the memory foot print almost doubles.

BTW In the same output I'm now including the live range compression
results from LRA. For this test case:
Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%
LRA_iter_stats:220333;1335056;457327;2;3
(1st number is # of basic blocks, 2nd is max_uid, 3rd is max_reg_num,
4rd and 5th are iteration counts on the main outer and inner loops of
LRA). So LRA isn't really iterating much on this test case.

Ciao!
Steven

Index: passes.c
===================================================================
--- passes.c (revision 191858)
+++ passes.c (working copy)
@@ -79,15 +79,69 @@ struct opt_pass *current_pass;

static void register_pass_name (struct opt_pass *, const char *);

+typedef struct
+{
+ unsigned long size,resident,share,text,lib,data,dt;
+} statm_t;
+
+static void
+read_off_memory_status (statm_t &result)
+{
+ const char* statm_path = "/proc/self/statm";
+
+ FILE *f = fopen(statm_path,"r");
+ if (!f)
+ {
+ perror (statm_path);
+ gcc_unreachable ();
+ }
+ if (7 != fscanf (f, "%lu %lu %lu %lu %lu %lu %lu",
+ &result.size, &result.resident, &result.share,
+ &result.text, &result.lib, &result.data,
+ &result.dt))
+ {
+ perror (statm_path);
+ gcc_unreachable ();
+ }
+ fclose(f);
+}
+
/* Call from anywhere to find out what pass this is. Useful for
printing out debugging information deep inside an service
routine. */
+
+#include "bitmap.h"
+#include "regset.h"
+
+static size_t // NB difference from obstack_memory_used
+obstack_memory_used2 (struct obstack *h)
+{
+ struct _obstack_chunk* lp;
+ size_t nbytes = 0;
+
+ for (lp = h->chunk; lp != 0; lp = lp->prev)
+ {
+ nbytes += (size_t) (lp->limit - (char *) lp);
+ }
+ return nbytes;
+}
+
void
print_current_pass (FILE *file)
{
if (current_pass)
- fprintf (file, "current pass = %s (%d)\n",
- current_pass->name, current_pass->static_pass_number);
+ {
+ statm_t statm;
+ int pagesize = getpagesize ();
+ unsigned bos = obstack_memory_used2 (&bitmap_default_obstack.obstack);
+ unsigned ros = obstack_memory_used2 (&reg_obstack.obstack);
+ read_off_memory_status (statm);
+ fprintf (file, "current pass = %32s (%3d) %8d %12lu %12lu %12lu
%12u %12u\n",
+ current_pass->name, current_pass->static_pass_number,
+ max_reg_num (),
+ statm.size * pagesize, statm.resident * pagesize,
+ statm.share * pagesize, bos, ros);
+ }
else
fprintf (file, "no current pass.\n");
}
@@ -2113,7 +2167,7 @@ execute_one_pass (struct opt_pass *pass)
current_pass = NULL;
return false;
}
-
+print_current_pass (stderr);
/* Pass execution event trigger: useful to identify passes being
executed. */
invoke_plugin_callbacks (PLUGIN_PASS_EXECUTION, pass);
Index: lra.c
===================================================================
--- lra.c (revision 191858)
+++ lra.c (working copy)
@@ -2249,10 +2243,13 @@ lra (FILE *f)
bitmap_initialize (&lra_split_pseudos, &reg_obstack);
bitmap_initialize (&lra_optional_reload_pseudos, &reg_obstack);
live_p = false;
+ int _inner_loop = 0, _outer_loop = 0;
for (;;)
{
+ _outer_loop++;
for (;;)
{
+ _inner_loop++;
bitmap_clear (&lra_optional_reload_pseudos);
/* We should try to assign hard registers to scratches even
if there were no RTL transformations in
@@ -2271,6 +2268,7 @@ lra (FILE *f)
to use a constant pool. */
lra_eliminate (false);
lra_inheritance ();
+
/* We need live ranges for lra_assign -- so build them. */
lra_create_live_ranges (true);
live_p = true;
@@ -2304,6 +2302,7 @@ lra (FILE *f)
bitmap_clear (&lra_matched_pseudos);
lra_constraint_iter_after_spill = 0;
}
+ fprintf (stderr, "\nLRA_iter_stats:%u;%u;%u;%u;%u\n",
n_basic_blocks, get_max_uid (), max_reg_num (), _outer_loop,
_inner_loop);
restore_scratches ();
lra_eliminate (true);
lra_hard_reg_substitution ();
Steven Bosscher
2012-10-01 13:03:53 UTC
Permalink
Post by Steven Bosscher
LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44
(15%) wall 2761 kB ( 0%) ggc
I've tried to split this up a bit more:

process_bb_lives ~50%
create_start_finish_chains ~25%
remove_some_program_points_and_update_live_ranges ~25%

The latter two have a common structure with loops that look like this:

for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
{
for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)

Perhaps it's possible to do some of the work of compress_live_ranges
during process_bb_lives, to create shorter live_ranges chains.

Also, maybe doing something else than a linked list of live_ranges
will help (unfortunately that's not a trivial change, it seems,
because the lra_live_range_t type is used everywhere and there are no
iterators or anything abstracted out like that -- just chain
walks...).

Still it does seem to me that a sorted VEC of lra_live_range objects
probably would speed things up. Question is of course how much... :-)

Ciao!
Steven
Vladimir Makarov
2012-10-02 01:14:06 UTC
Permalink
Post by Steven Bosscher
Post by Steven Bosscher
LRA create live ranges : 175.30 (15%) usr 2.14 (13%) sys 177.44
(15%) wall 2761 kB ( 0%) ggc
process_bb_lives ~50%
create_start_finish_chains ~25%
remove_some_program_points_and_update_live_ranges ~25%
for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
{
for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
Perhaps it's possible to do some of the work of compress_live_ranges
during process_bb_lives, to create shorter live_ranges chains.
It is aready done. Moreover program points are compressed to minimal
to guarantee right conflict resolution. For example, if only 2 pseudos
live in 1..10 and 2..14, they actually will have the same range like 1..1.

Analogous live ranges are used in IRA as intermidiate step to build a
conflict graph. Actually, the first approach was to use IRA code to
assign hard registers to pseudos (e.g. Jeff Law tried this approach)
but it was rejected as requiring much more compilation time. In some
way, one can look at the assignment in LRA is a compromise between
quality (which could achieved through repeated buidling conflict graphs
and using graph coloring) and compilation speed.
Post by Steven Bosscher
Also, maybe doing something else than a linked list of live_ranges
will help (unfortunately that's not a trivial change, it seems,
because the lra_live_range_t type is used everywhere and there are no
iterators or anything abstracted out like that -- just chain
walks...).
Still it does seem to me that a sorted VEC of lra_live_range objects
probably would speed things up. Question is of course how much... :-)
My experience shows that these lists are usually 1-2 elements.
Although in this case, there are pseudos with huge number elements
(hundreeds). I tried -fweb for this tests because it can decrease the
number elements but GCC (I don't know what pass) scales even worse:
after 20 min of waiting and when virt memory achieved 20GB I stoped it.

I guess the same speed or close to reload one on this test can be
achieved through implementing other simpler algorithms generating worse
code. I need some time to design them and implement this.
Jeff Law
2012-10-02 04:22:43 UTC
Permalink
Post by Vladimir Makarov
Analogous live ranges are used in IRA as intermidiate step to build a
conflict graph. Actually, the first approach was to use IRA code to
assign hard registers to pseudos (e.g. Jeff Law tried this approach)
but it was rejected as requiring much more compilation time. In some
way, one can look at the assignment in LRA is a compromise between
quality (which could achieved through repeated buidling conflict graphs
and using graph coloring) and compilation speed.
Not only was it slow (iterating IRA), guaranteeing termination was a
major problem. There's some algorithmic games that have to be played
(they're at least discussed in literature, but not under the heading of
termination) and there's some issues specific to the IRA implementation
which make ensuring termination difficult.

I got nearly as good of results by conservative updates of the conflicts
after splitting ranges and (ab)using ira's reload hooks to give the new
pseudos for the split range a chance to be allocated again.

The biggest problem with that approach was getting the costing right for
the new pseudos. That requires running a fair amount of IRA a second
time. I'd still like to return to some of the ideas from that work as I
think some of the bits are still relevant in the IRA+LRA world.
Post by Vladimir Makarov
My experience shows that these lists are usually 1-2 elements.
That's been my experience as well. The vast majority of the time the
range lists are very small.

Jeff
Vladimir Makarov
2012-10-02 14:57:14 UTC
Permalink
Post by Jeff Law
Post by Vladimir Makarov
Analogous live ranges are used in IRA as intermidiate step to build a
conflict graph. Actually, the first approach was to use IRA code to
assign hard registers to pseudos (e.g. Jeff Law tried this approach)
but it was rejected as requiring much more compilation time. In some
way, one can look at the assignment in LRA is a compromise between
quality (which could achieved through repeated buidling conflict graphs
and using graph coloring) and compilation speed.
Not only was it slow (iterating IRA), guaranteeing termination was a
major problem. There's some algorithmic games that have to be played
(they're at least discussed in literature, but not under the heading
of termination) and there's some issues specific to the IRA
implementation which make ensuring termination difficult.
Chaitin-Briggs literature does not discuss the termination, just saying
that live-ranges shortening will result to assigning hard regs to all
necessary pseudos which is not clearly guaranteed. There is the same
problem in LRA. So LRA checks that too many passes are done or to many
reloads for one insn are made and abort LRA. Porting LRA is mostly
fixing such aborts.

Another thing omitted by literature is inheritance which is very
important for performance. Although it could be considered as a special
case of live-range splitting. There are also a lot of small important
details (e.g. what to do in case of displacement constraints, or when
non-load/store insns permits memory and registers etc) not discussed
well or at all in the literature I read.
Post by Jeff Law
I got nearly as good of results by conservative updates of the
conflicts after splitting ranges and (ab)using ira's reload hooks to
give the new pseudos for the split range a chance to be allocated again.
The biggest problem with that approach was getting the costing right
for the new pseudos. That requires running a fair amount of IRA a
second time. I'd still like to return to some of the ideas from that
work as I think some of the bits are still relevant in the IRA+LRA world.
Post by Vladimir Makarov
My experience shows that these lists are usually 1-2 elements.
That's been my experience as well. The vast majority of the time the
range lists are very small.
Peter Bergner
2012-10-11 21:53:24 UTC
Permalink
Post by Vladimir Makarov
Chaitin-Briggs literature does not discuss the termination, just saying
that live-ranges shortening will result to assigning hard regs to all
necessary pseudos which is not clearly guaranteed. There is the same
problem in LRA. So LRA checks that too many passes are done or to many
reloads for one insn are made and abort LRA. Porting LRA is mostly
fixing such aborts.
IIRC, talking with the guys from Rice, they had a limit on the number of
color/spill iterations (20) before aborting, since anything more than
that would be due to a bug. I believe the largest number of iterations
I ever saw in my allocator was about 6 iterations of color/spill. I hit
a few cases that iterated forever, but those were always due to bugs in
my code or special hairy details I hadn't handled. You're correct that
the hairy details are never discussed in papers. :)
Post by Vladimir Makarov
Another thing omitted by literature is inheritance which is very
important for performance. Although it could be considered as a special
case of live-range splitting. There are also a lot of small important
details (e.g. what to do in case of displacement constraints,
To handle displacement constraints, instead of spilling to stack slots,
we spilled to spill pseudos which look like normal register pseudos.
We would then color them just like normal pseudos, but the colors
represent stack slots and not registers. If "k" becomes too big, it
means you surpassed the maximum displacement, and you'll have to spill
the spill pseudo. For small displacement cpus, coloring the spill pseudos
does a good job reusing stack slots which reduces the largest displacement
you'll see. For cpus with no displacement issues, you could just give
each spill pseudo a different color which would mean you wouldn't have
to compute a interference graph of the spill pseudos and all the work
and space that goes into building that.

Note, with spill pseudos, you can perform dead code elimination, coalescing
and other optimizations on them just like normal pseudos to reduce the
amount of spill code generated.

Peter
Vladimir Makarov
2012-10-12 03:34:40 UTC
Permalink
Post by Peter Bergner
Post by Vladimir Makarov
Chaitin-Briggs literature does not discuss the termination, just saying
that live-ranges shortening will result to assigning hard regs to all
necessary pseudos which is not clearly guaranteed. There is the same
problem in LRA. So LRA checks that too many passes are done or to many
reloads for one insn are made and abort LRA. Porting LRA is mostly
fixing such aborts.
IIRC, talking with the guys from Rice, they had a limit on the number of
color/spill iterations (20) before aborting, since anything more than
that would be due to a bug. I believe the largest number of iterations
I ever saw in my allocator was about 6 iterations of color/spill. I hit
a few cases that iterated forever, but those were always due to bugs in
my code or special hairy details I hadn't handled. You're correct that
the hairy details are never discussed in papers. :)
Interesting. The max number passes is very dependent on the target.
The biggest I saw was about 20 on a test on m68k
Post by Peter Bergner
Post by Vladimir Makarov
Another thing omitted by literature is inheritance which is very
important for performance. Although it could be considered as a special
case of live-range splitting. There are also a lot of small important
details (e.g. what to do in case of displacement constraints,
To handle displacement constraints, instead of spilling to stack slots,
we spilled to spill pseudos which look like normal register pseudos.
We would then color them just like normal pseudos, but the colors
represent stack slots and not registers. If "k" becomes too big, it
means you surpassed the maximum displacement, and you'll have to spill
the spill pseudo. For small displacement cpus, coloring the spill pseudos
does a good job reusing stack slots which reduces the largest displacement
you'll see. For cpus with no displacement issues, you could just give
each spill pseudo a different color which would mean you wouldn't have
to compute a interference graph of the spill pseudos and all the work
and space that goes into building that.
Interesting approach to spill a spilled pseudo.

Although it is not wise to give a different color for spilled pseudos on
targets without displacement issues. Small space for pseudos (by
reusing slots) gives better data locality and smaller displacements
which are important to reduce code size for targets having different
size of insn displacement field for different displacements (e.g. x86).
I know only one drawback of reusing stack slots. It is less freedom for
insn-scheduling after RA. But still it is more important to reuse the
stack than better 2nd insn scheduling at least for most widely used
target x86/x86-64.

For targets with different displacement sizes, in my experience coloring
is also not best algorithm for this. It usually results in smaller
stack space but it has a tendency to evenly spread pseudos to different
slots instead of putting important pseudos into slots with smaller
displacements to generate smaller insns.

Just my observations, coloring is pretty good for register allocation.
It works particularly well for unknown (or varying) execution profiles.
But if you have exact execution profiles, there are heuristic approaches
which could work better coloring.
Post by Peter Bergner
Note, with spill pseudos, you can perform dead code elimination, coalescing
and other optimizations on them just like normal pseudos to reduce the
amount of spill code generated.
True.

Steven Bosscher
2012-10-02 07:28:17 UTC
Permalink
My experience shows that these lists are usually 1-2 elements. Although in
this case, there are pseudos with huge number elements (hundreeds). I tried
-fweb for this tests because it can decrease the number elements but GCC (I
don't know what pass) scales even worse: after 20 min of waiting and when
virt memory achieved 20GB I stoped it.
Ouch :-)

The webizer itself never even runs, the compiler blows up somewhere
during the df_analyze call from web_main. The issue here is probably
in the DF_UD_CHAIN problem or in the DF_RD problem.

Ciao!
Steven
Paolo Bonzini
2012-10-02 08:29:21 UTC
Permalink
Post by Steven Bosscher
My experience shows that these lists are usually 1-2 elements. Although in
this case, there are pseudos with huge number elements (hundreeds). I tried
-fweb for this tests because it can decrease the number elements but GCC (I
don't know what pass) scales even worse: after 20 min of waiting and when
virt memory achieved 20GB I stoped it.
Ouch :-)
The webizer itself never even runs, the compiler blows up somewhere
during the df_analyze call from web_main. The issue here is probably
in the DF_UD_CHAIN problem or in the DF_RD problem.
/me is glad to have fixed fwprop when his GCC contribution time was more
than 1-2 days per year...

Unfortunately, the fwprop solution (actually a rewrite) was very
specific to the problem and cannot be reused in other parts of the compiler.

I guess here it is where we could experiment with region-based
optimization. If a loop (including the parent dummy loop) is too big,
ignore it and only do LRS on smaller loops inside it. Reaching
definitions is insanely expensive on an entire function, but works well
on smaller loops.

Perhaps something similar could be applied also to IRA/LRA.

Paolo
Steven Bosscher
2012-10-02 08:49:34 UTC
Permalink
Post by Paolo Bonzini
Post by Steven Bosscher
My experience shows that these lists are usually 1-2 elements. Although in
this case, there are pseudos with huge number elements (hundreeds). I tried
-fweb for this tests because it can decrease the number elements but GCC (I
don't know what pass) scales even worse: after 20 min of waiting and when
virt memory achieved 20GB I stoped it.
Ouch :-)
The webizer itself never even runs, the compiler blows up somewhere
during the df_analyze call from web_main. The issue here is probably
in the DF_UD_CHAIN problem or in the DF_RD problem.
/me is glad to have fixed fwprop when his GCC contribution time was more
than 1-2 days per year...
I thought you spent more time on GCC nowadays, working for RedHat?
Who's your manager, perhaps we can coerce him/her into letting you
spend more time on GCC :-P
Post by Paolo Bonzini
Unfortunately, the fwprop solution (actually a rewrite) was very
specific to the problem and cannot be reused in other parts of the compiler.
That'd be too bad... But is this really true? I thought you had
something done that builds chains only for USEs reached by multiple
DEFs? That's the only interesting kind for web, too.
Post by Paolo Bonzini
I guess here it is where we could experiment with region-based
optimization. If a loop (including the parent dummy loop) is too big,
ignore it and only do LRS on smaller loops inside it. Reaching
definitions is insanely expensive on an entire function, but works well
on smaller loops.
Heh, yes. In fact I have been working on a region-based version of web
because it is (or at least: used to be) a useful pass that only isn't
enabled by default because the underlying RD problem scales so badly.
My current collection of hacks doesn't bootstrap, doesn't even build
libgcc yet, but I plan to finish it for GCC 4.9. It's based on
identifying SEME regions using structural analysis, and DF's partial
CFG analysis (the latter is currently the problem).

FWIW: part of the problem for this particular test case is that there
are many registers with partial defs (vector registers) and the RD
problem doesn't (and probably cannot) keep track of one partial
def/use killing another partial def/use. This handling of vector regs
appears to be a general problem with much of the RTL infrastructure.

Ciao!
Steven
Paolo Bonzini
2012-10-02 09:33:59 UTC
Permalink
Post by Paolo Bonzini
Post by Steven Bosscher
My experience shows that these lists are usually 1-2 elements. Although in
this case, there are pseudos with huge number elements (hundreeds). I tried
-fweb for this tests because it can decrease the number elements but GCC (I
don't know what pass) scales even worse: after 20 min of waiting and when
virt memory achieved 20GB I stoped it.
Ouch :-)
The webizer itself never even runs, the compiler blows up somewhere
during the df_analyze call from web_main. The issue here is probably
in the DF_UD_CHAIN problem or in the DF_RD problem.
/me is glad to have fixed fwprop when his GCC contribution time was more
than 1-2 days per year...
I thought you spent more time on GCC nowadays, working for Red Hat?
No, I work on QEMU most of the time. :) Knowing myself, if I had
GCC-related assignments you'd see me _a lot_ on upstream mailing lists!
Post by Paolo Bonzini
Unfortunately, the fwprop solution (actually a rewrite) was very
specific to the problem and cannot be reused in other parts of the compiler.
That'd be too bad... But is this really true? I thought you had
something done that builds chains only for USEs reached by multiple
DEFs? That's the only interesting kind for web, too.
No, it's the other way round. I have a dataflow problem that recognizes
USEs reached by multiple DEFs, so that I can use a dominator walk to
build singleton def-use chains. It's very similar to how you build SSA,
but punting instead of inserting phis.

Another solution is to build factored use-def chains for web, and use
them instead of RD. In the end it's not very different from regional
live range splitting, since the phi functions factor out the state of
the pass at loop (that is region) boundaries. I thought you had looked
at FUD chains years ago?
FWIW: part of the problem for this particular test case is that there
are many registers with partial defs (vector registers) and the RD
problem doesn't (and probably cannot) keep track of one partial
def/use killing another partial def/use.
So they are subregs of regs? Perhaps they could be represented with
VEC_MERGE to break the live range:

(set (reg:V4SI 94) (vec_merge:V4SI (reg:V4SI 94)
(const_vector:V4SI [(const_int 0)
(const_int 0)
(const_int 0)
(reg:SI 95)])
(const_int 7)))

And then reload, or something after reload, would know how to split
these when spilling V4SI to memory.

Paolo
Steven Bosscher
2012-10-04 20:56:23 UTC
Permalink
Post by Vladimir Makarov
Analogous live ranges are used in IRA as intermidiate step to build a
conflict graph.
Right, ira-lives.c and lra-lives.c look very much alike, the only
major difference is that the object of interest in an IRA live range
is an ira_object_t, and in an LRA live range it's just a regno. But
the code of create_start_finish_chains,
ira_rebuild_start_finish_chains,
remove_some_program_points_and_update_live_ranges, and most the live
range printing functions, are almost identical between lra-lives.c and
ira-lives.c. That looks like unnecessary code duplication. Do you
think some of that code be shared (esp. in a C++ world where maybe a
live range can be a container for another type)?

Ciao!
Steven
Vladimir Makarov
2012-10-05 03:37:50 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Analogous live ranges are used in IRA as intermidiate step to build a
conflict graph.
Right, ira-lives.c and lra-lives.c look very much alike, the only
major difference is that the object of interest in an IRA live range
is an ira_object_t, and in an LRA live range it's just a regno. But
the code of create_start_finish_chains,
ira_rebuild_start_finish_chains,
remove_some_program_points_and_update_live_ranges, and most the live
range printing functions, are almost identical between lra-lives.c and
ira-lives.c. That looks like unnecessary code duplication. Do you
think some of that code be shared (esp. in a C++ world where maybe a
live range can be a container for another type)?
Yes, C++ could help here. The simplest solution is inheritance from a
common class. But it is not a high priority task now.
Steven Bosscher
2012-10-04 18:43:26 UTC
Permalink
Post by Steven Bosscher
To put it in another perspective, here are my timings of trunk vs lra
integrated RA : 181.68 (24%) usr 1.68 (11%) sys 183.43
(24%) wall 643564 kB (20%) ggc
reload : 11.00 ( 1%) usr 0.18 ( 1%) sys 11.17 (
1%) wall 32394 kB ( 1%) ggc
TOTAL : 741.64 14.76 756.41
3216164 kB
integrated RA : 174.65 (16%) usr 1.33 ( 8%) sys 176.33
(16%) wall 643560 kB (20%) ggc
reload : 399.69 (36%) usr 2.48 (15%) sys 402.69
(36%) wall 41852 kB ( 1%) ggc
TOTAL :1102.06 16.05 1120.83
3231738 kB
That's a 49% slowdown. The difference is completely accounted for by
the timing difference between reload and LRA.
With Vlad's patch to switch off expensive LRA parts for extreme
functions ([lra revision 192093]), the numbers are:

integrated RA : 154.27 (17%) usr 1.27 ( 8%) sys 155.64
(17%) wall 131534 kB ( 5%) ggc
LRA non-specific : 69.67 ( 8%) usr 0.79 ( 5%) sys 70.40 (
8%) wall 18805 kB ( 1%) ggc
LRA virtuals elimination: 55.53 ( 6%) usr 0.00 ( 0%) sys 55.49 (
6%) wall 20465 kB ( 1%) ggc
LRA reload inheritance : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.02 (
0%) wall 57 kB ( 0%) ggc
LRA create live ranges : 80.46 ( 4%) usr 1.05 ( 6%) sys 81.49 (
4%) wall 2459 kB ( 0%) ggc
LRA hard reg assignment : 1.78 ( 0%) usr 0.05 ( 0%) sys 1.85 (
0%) wall 0 kB ( 0%) ggc
reload : 6.38 ( 1%) usr 0.13 ( 1%) sys 6.51 (
1%) wall 0 kB ( 0%) ggc
TOTAL : 917.42 16.35 933.78
2720151 kB
Post by Steven Bosscher
TOTAL : 741.64 14.76 756.41
the slowdown due to LRA is down from 49% to 23%, with still room for
improvement (even without crippling LRA further). Size with the
expensive LRA parts switched off is still better thank trunk:
$ size slow.o*
text data bss dec hex filename
3499938 8 583 3500529 3569f1 slow.o.00_trunk_r191835
3386117 8 583 3386708 33ad54 slow.o.01_lra_r191626
3439755 8 583 3440346 347eda slow.o.02_lra_r192093

The lra-branch outperforms trunk on everything else I've thrown at it,
in terms of compile time and code size at least, and also e.g. on
Fortran polyhedron runtime.

Ciao!
Steven
Andi Kleen
2012-09-28 17:48:00 UTC
Permalink
Post by Steven Bosscher
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
I would be interested in some numbers how much the new XMM spilling
helps on x86 and how it affects code size.

Unfortunately not really qualified to review the code.

-Andi
--
***@linux.intel.com -- Speaking for myself only
Vladimir Makarov
2012-10-01 00:53:41 UTC
Permalink
Post by Andi Kleen
Post by Steven Bosscher
Post by Vladimir Makarov
Any comments and proposals are appreciated. Even if GCC community
decides that it is too late to submit it to gcc4.8, the earlier reviews
are always useful.
I would like to see some benchmark numbers, both for code quality and
compile time impact for the most notorious compile time hog PRs for
large routines where IRA performs poorly (e.g. PR54146, PR26854).
I would be interested in some numbers how much the new XMM spilling
helps on x86 and how it affects code size.
I have some results which I got after implementation of spilling into
SSE regs:

Average code size change: Corei7 Bulldozer
SPECInt 32-bit -0.15% -0.14%
SPECFP 32-bit -0.36% -0.24%
SPECInt 64-bit -0.03% -0.07%
SPECFP 64-bit -0.11% -0.09%

Rate change: Corei7 Bulldozer
SPECInt 32-bit +0.6% -1.2%
SPECFP 32-bit +0.3% 0%
SPECInt 64-bit 0% 0%
SPECFP 64-bit 0% 0%

I used -O3 -mtune=corei7 -march=corei7 for Corei7 and -O3
-mtune=bdver1 -march=bdver1 for Bulldozer processor. Additionally I
enabled inter unit moves for Bulldozer when the optimization works
because without this spilling general regs into SSE regs is not
possible.
Loading...