Dave Mitchell
2024-02-25 14:30:47 UTC
For years we've had intermittent timeout smoke failures on two of Tie::File's
test files, 29a_upcopy.t and 29_downcopy.t.
Currently it would appear that they only fail (via test timeouts) on Arm
platform smokers. The failing test(s) vary.
The alarm timeout is per test, initially at 5 seconds; then increased by
me to 10s in Nov 2022, then to 20s for parallel builds by Yves in Feb 2023.
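For context, the timeout is just a SIGALRM guard wrapped around each test;
roughly along these lines (a simplified sketch of the mechanism, not the
literal code in those test files):

    # Simplified per-test alarm guard (illustrative sketch only).
    use strict;
    use warnings;

    my $TIMEOUT = 20;   # seconds per test; the current parallel-build value

    sub run_with_timeout {
        my ($name, $code) = @_;
        local $SIG{ALRM} = sub { die "test '$name' timed out after ${TIMEOUT}s\n" };
        alarm($TIMEOUT);
        my $ok = eval { $code->(); 1 };
        alarm(0);       # cancel the pending alarm whether the test passed or not
        warn $@ unless $ok;
        return $ok;
    }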
For comparison, on my fast Intel laptop on a debugging build,
29_downcopy.t takes 0.08s to run - that's about 300 microseconds per
test - yet a single test is taking more than 20s on a smoker and timing
out.
In terms of memory usage (again on my debugging build), just loading the
test script but running no tests has a memory footprint of 12.5MB;
running all the tests too increases that to 13.1MB.
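For anyone wanting to reproduce that sort of figure on Linux, reading
VmRSS out of /proc is one quick way (an assumed approach, shown only for
convenience; not necessarily how the numbers above were obtained):

    # Rough RSS check on Linux (assumes /proc; hypothetical helper).
    sub rss_kb {
        open my $fh, '<', '/proc/self/status' or return;
        while (<$fh>) {
            return $1 if /^VmRSS:\s+(\d+)\s+kB/;
        }
        return;
    }

    printf "RSS: %d kB\n", rss_kb();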
So it's really not using much memory or CPU. So why does the smoke fail?
I'm guessing at least one of the smokers is a Raspberry Pi based on the OS
("Raspbian GNU/Linux 10.13"). Now I'd expect an rPi to be slower, but that
much slower?
So the question really is:
is it reasonable for the tests to run that slowly, in which case we should
just disable the test files on slow platforms, or is something else afoot
here?
Does anyone have an rPi they could run those test scripts on and see how
much memory and CPU those two files use?
Each failing test does something like: create a test file consisting of a
few 8k blocks, then do some line inserts, which behind the scenes triggers
a few OS calls along the lines of lseek(i); read(8k block); lseek(j);
write(8k block). But overall, not really a lot of IO activity, nor on
large files. Even with tests running in parallel I wouldn't expect the CPU
to be over-burdened to that extent.
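To make that pattern concrete, each test boils down to something like this
(a simplified sketch of the workload, not a copy of any particular test;
the file name is made up):

    use strict;
    use warnings;
    use Tie::File;

    my $file = 'tf_demo.txt';

    # Build a test file spanning a few 8k blocks: 16 records of ~2k each.
    open my $fh, '>', $file or die "open: $!";
    print $fh ('x' x 2000) . "\n" for 1 .. 16;
    close $fh;

    # Tie the file to an array and splice new lines into the middle.
    # Behind the scenes Tie::File shifts the trailing blocks down with
    # lseek/read/lseek/write to make room for the inserted records.
    tie my @lines, 'Tie::File', $file or die "tie: $!";
    splice @lines, 8, 0, ('inserted line') x 4;
    untie @lines;
    unlink $file;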
--
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
-- Dennis, "Monty Python and the Holy Grail"
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
-- Dennis, "Monty Python and the Holy Grail"