[Toybox] Toybox test image / fuzzing
Andy Chu
2016-03-11 23:12:11 UTC
What is the best way to run the toybox tests? If I just run "make test", I
get a lot of failures, some of which are probably because I'm not running
as root, while some I don't understand, like:

PASS: pgrep -o pattern
pgrep: bad -s '0'
^
FAIL: pgrep -s

I'm on an Ubuntu 14.04 machine, running against the master branch. I
didn't try running as root since it seems like there is a non-zero chance
that it will mess up my machine.

I saw in the ELC YouTube talk that test infrastructure is a TODO.

http://landley.net/talks/celf-2015.txt

Is this something I can help with? I guess if you can tell me what
environment you use to get all tests to pass, it shouldn't be too hard to
make a shell script to create that environment, probably with Aboriginal
Linux. I have built Aboriginal Linux before (like a year ago).

One of the reasons I ran into this was because I wanted to distill a test
corpus for fuzzing from the shell test cases. afl-fuzz has a utility to
minimize a test corpus based on code path coverage. So getting a stable
test environment seems like a prerequisite for that.
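
To be concrete, I was picturing something like this (untested; the corpus
directories and the sed invocation are just placeholders, and the toybox
binary would need to be built with afl-gcc so afl-cmin can measure coverage):

# placeholders throughout; assumes ./toybox is an afl-instrumented build
mkdir -p corpus.raw corpus.min
# gather candidate inputs (e.g. sed scripts pulled from the test cases) into
# corpus.raw, then keep only the ones that exercise distinct code paths:
afl-cmin -i corpus.raw -o corpus.min -- ./toybox sed -f - /dev/null
# optionally shrink each surviving input too:
for f in corpus.min/*; do
  afl-tmin -i "$f" -o "$f.min" -- ./toybox sed -f - /dev/null
done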

FWIW, I had a different approach for fuzzing each arg:

https://github.com/andychu/toybox/commit/ff937e97881bfdf4b1221618c38857b75c9534e0

This seems to be a little laborious, because I have to manually write shell
scripts to fuzz individual inputs (and I didn't find anything beyond that
one crash yet). I think the mass fuzzing thing might work better, but I'm
not sure.

thanks,
Andy
Rob Landley
2016-03-12 17:51:25 UTC
Post by Andy Chu
What is the best way to run the toybox tests? If I just run "make
test", I get a lot of failures, some of which are probably because I'm
PASS: pgrep -o pattern
pgrep: bad -s '0'
^
FAIL: pgrep -s
Unfortunately, the test suite needs as much work as the command
implementations do. :(

Ok, backstory!

When I started toybox I had a big todo list, and was filling it in.
Some got completed and some (like sh.c, mdev.c, or mke2fs.c) got
partially finished and put on hold for a long time.

Then I started getting contributions of new commands from other
developers, some of which were easy to verify, polish up, and declare
done, and some of which required extensive review (and in several cases
an outright rewrite). I used to just merge stuff and track the state of
it in a local text file, but that didn't scale, and I got overwhelmed.

So I created toys/pending and moved all the unfinished command
implementations there. (And a lib/pending.c for shared infrastructure
used by toys/pending which needs its own review/cleanup pass.) After a
while I wrote a page (http://landley.net/toybox/cleanup.html) explaining
about the "pending" directory and the work I do to promote stuff _out_
of the pending directory, in hopes other people would be interested in
doing some of the cleanup for me.

But people kept asking how they could help other than implementing new
commands that would go into the giant toys/pending pile, or doing
cleanup, and the next logical thing for me was "test suite". So I
suggested that.

And got a lot of test suite entries full of tests that don't pass, tests
that don't actually test anything interesting in toybox (some test the
kernel, most don't test the interesting edge cases, none of them were
written with a thorough reading of the relevant standards document and/or
man page...)

Really, I need a tests/pending. :(

There's a missing layer of test suite infrastructure, which isn't just
"this has to be tested as root" but "this has to be tested on a known
system with a known environment". Preferably a synthetic one running
under an emulator, which makes it a good fit for my aboriginal linux
project with its build control images:

http://landley.net/aboriginal/about.html
http://landley.net/aboriginal/control-images

Unfortunately, when I tried to do this, the first one I did was "ps" and
making the process list that ps -a sees reproducible is hard, because the
kernel launches a bunch of kernel threads based on driver configuration
and kernel version, so getting stable behavior out of that was enough of
a head-scratcher it went back on the todo list. I should try again with
"mount" or something...

Anyway, I've done a few minor cleanup passes over the test suite, but an
awful lot of it is still tests that fail because the test is wrong, or
lack of test coverage.

One example of a test I did some cleanup on was tests/chmod.test, a "git
log" of that might be instructive? That said, the result isn't remotely
_complete_. (Endless cut and paste of "u+r" style checks against ls output
rather than a loop, but no tests for the sticky bit? Nothing sets the
executable bit on a script and then tests we can run it? Removes exec permission
from a directory and checks we can't ls it? Removes read permission from
a file and checks we can't read it? No, all it tests is ls output over
and over...)
Post by Andy Chu
I'm on an Ubuntu 14.04 machine, running against the master branch. I
didn't try running as root since it seems like there is a non-zero
chance that it will mess up my machine.
Very much so!

That's why I need to do an aboriginal linux test harness that boots
under qemu and runs tests in a known chroot.
Post by Andy Chu
I saw in the ELC YouTube talk that test infrastructure is a TODO.
http://landley.net/talks/celf-2015.txt
Is this something I can help with?
If you could just triage the test suite and tell me the status of the
tests, that would be great. (I've been meaning to do that forever, but
every time I try I get distracted by fixing up a specific test and the
related command...)

First pass, you could sort the tests into:

1) this command is hard to test due to butterfly effects (run it twice and
get different output, so even a known emulated environment won't help;
top, ps, bootchartd, vmstat...)

2) This command could produce reliable output under an emulated
environment. This includes everything requiring root access. (Properly
testing oneit probably requires containers _within_ an emulator, but
let's burn that bridge when we come to it.)

3) This command can have a good test now. (Whether it _does_ is separate.)

Then let's put #1 and #2 aside for the moment and concentrate on filling
out #3.
Post by Andy Chu
I guess if you can tell me what
environment you use to get all tests to pass, it shouldn't be too hard
to make a shell script to create that environment, probably with
Aboriginal Linux.
Unfortunately, there isn't one. The test suite has bit-rotted ever since I
started getting significant contributions to it without having a
"pending" directory to separate curated from wild tests. :(
Post by Andy Chu
I have built Aboriginal Linux before (like a year ago).
One of the reasons I ran into this was because I wanted to distill a
test corpus for fuzzing from the shell test cases. afl-fuzz has a
utility to minimize a test corpus based on code path coverage. So
getting a stable test environment seems like a prerequisite for that.
Looking at the tests, I suspect my recent changes to the dirtree
infrastructure broke "mv". (Something did, anyway...)

There's also the issue that "make test_mv" and "make tests" actually
test slightly different things. The first builds the command standalone,
and not all commands build correctly standalone. (That might be why
"make test_mv" didn't work, if it's not building standalone...)

Sometimes the command needs fixing, sometimes the build infrastructure
needs fixing, sometimes the test needs fixing...
Post by Andy Chu
https://github.com/andychu/toybox/commit/ff937e97881bfdf4b1221618c38857b75c9534e0
This seems to be a little laborious, because I have to manually write
shell scripts to fuzz individual inputs (and I didn't find anything
beyond that one crash yet). I think the mass fuzzing thing might work
better, but I'm not sure.
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.

There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.

But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."

Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...

In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" that because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... That's
just testing the output with no arguments.
Post by Andy Chu
thanks,
Andy
Rob
Andy Chu
2016-03-13 08:34:46 UTC
Post by Rob Landley
Unfortunately, the test suite needs as much work as the command
implementations do. :(
Ok, backstory!
OK, thanks a lot for all the information! That helps. I will work on
this. I think a good initial goal is just to triage the tests that
pass and make sure they don't regress (i.e. make it easy to run the
tests, keep them green, and perhaps have a simple buildbot). For
example, the factor bug is trivial but it's a lot easier to fix if you
get feedback in an hour or so rather than a month later, when you have
to load it back into your head.
Post by Rob Landley
Really, I need a tests/pending. :(
Yeah I have some ideas about this. I will try them out and send a
patch. I think there does need to be more than 2 categories as you
say though, and perhaps more than one kind of categorization.
Post by Rob Landley
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.
Right, it is work that the fuzzing should be able to piggy back on...
so I was trying to find a way to leverage the existing test cases,
pretty much like this:

http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html

But the difference is that unlike sqlite, fuzzing toybox could do
arbitrarily bad things to your system, so it really needs to be
sandboxed. It gives really nasty inputs -- I wouldn't be surprised if
it can crash the kernel too.

Parsers in C are definitely the most likely successful targets for a
fuzzer, and sed seems like the most complex parser in toybox so far.
The regex parsing seems to be handled by libraries, and I don't think
those are instrumented (because they are in a shared library not
compiled with afl-gcc). I'm sure we can find a few more bugs though.
Post by Rob Landley
There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.
Yeah there are definitely a lot of inputs besides the argv values, like
the file system state and kernel state. Those are harder to test, but
I like that you are testing with Aboriginal Linux and LFS. That is
already a great torture test.

FWIW I think the test harness is missing a few concepts:

- exit code
- stderr
- file system state -- the current method of putting setup at the
beginning of foo.test *might* be good enough for some commands, but
probably not all

But this doesn't need to be addressed initially.
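
Roughly the shape I had in mind, as an untested sketch ("testing3" is a
made-up name; the real thing would presumably sit next to the existing
"testing" helper):

# hypothetical helper, not in the tree; arguments are:
# testing3 "name" "command" "expected stdout" "expected stderr" "expected exit code"
testing3()
{
  OUT=$(eval "$2" 2>stderr.tmp)
  CODE=$?
  ERR=$(cat stderr.tmp); rm -f stderr.tmp
  if [ "$OUT" = "$3" ] && [ "$ERR" = "$4" ] && [ "$CODE" = "$5" ]
  then echo "PASS: $1"
  else echo "FAIL: $1 (stdout='$OUT' stderr='$ERR' exit=$CODE)"
  fi
}

# e.g. testing3 "false exits 1" "false" "" "" "1"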

By the way, is there a target language/style for shell and make? It
looks like POSIX shell, and I'm not sure about the Makefile -- is it
just GNU make or something more restrictive? I like how you put most
stuff in scripts/make.sh -- that's also how I like to do it.

What about C? Clang is flagging a lot of warnings that GCC doesn't,
mainly -Wuninitialized.
Post by Rob Landley
But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."
Some code coverage stats might help? I can probably set that up as
it's similar to making an ASAN build. (Perhaps something like this
HTML http://llvm.org/docs/CoverageMappingFormat.html)

The build patch I sent yesterday will help with that as well since you
need to set CFLAGS.
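
For reference, what I'd try first (sketchy, and it assumes CFLAGS/LDFLAGS can
be injected into the build, which is what that patch is about):

# assumes the build honors CFLAGS/LDFLAGS from the environment
CFLAGS="-fprofile-arcs -ftest-coverage" LDFLAGS="--coverage" make defconfig toybox
make tests          # run whatever currently passes, generating .gcda files
lcov --capture --directory . --output-file coverage.info
genhtml coverage.info --output-directory coverage-html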
Post by Rob Landley
Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...
In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" that it because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... That's
just testing the output with no arguments.)
Those are definitely hard ones... I agree with the strategy of
classifying the tests, and then we can see how many of the hard cases
there are. I think detecting trivial breakages will be an easy first step,
and it should allow others to contribute more easily.

thanks,
Andy
enh
2016-03-13 18:06:26 UTC
#include <cwhyyoushouldbedoingunittestinginstead>

only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-03-13 18:55:05 UTC
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files; we can add test_proc commands if
you can think of good individual pieces to test.

I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?

Rob
Andy Chu
2016-03-13 19:52:56 UTC
Post by Rob Landley
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files, we can add test_proc commands if
you can think of good individual pieces to test.
I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?
The toys/example/test_*.c files seem to print to stdout, so I guess
they still need a shell wrapper to test correctness. That's
technically still an integration test rather than a unit test --
roughly I would say integration tests involve more than one
process (e.g. for a system of servers) whereas unit tests are run
entirely within the language using a unit test framework in that
language.

Google uses gunit/googletest for testing, and I guess Android does too:

https://github.com/google/googletest

Example: https://android.googlesource.com/platform/system/core.git/+/master/libziparchive/zip_archive_test.cc

You basically write a bunch of functions wrapped in TEST_ macros and
they are linked into a binary with a harness and run.

I guess toybox technically could use it if the tests were in C++ but
the code is in C, though it seems like it clashes with the style of
the project pretty badly.

I think the main issue that Elliott is pointing to is that there are no
internal interfaces to test against or mock out so you don't hose your
system while running tests (i.e. you can "reify" the file system state
and kernel state, and then substitute them with fake values in tests).

I agree it would be nicer if there were such interfaces, but it's
fairly big surgery, and somewhat annoying to do in C. I think you
would have to get rid of most global vars, and use a strategy like Lua
or sqlite, where they pass around a context struct everywhere, which
can have system functions like open()/read()/write()/malloc()/etc.
sqlite has a virtual file system (VFS) abstraction for extensive tests
and Lua lets you plug in malloc/free at least. They are libraries and
not programs so I guess that is more natural.

I think this is worth keeping in mind perhaps, but it seems like there
is a lot of other low hanging fruit to address beforehand.

Andy
Rob Landley
2016-03-13 21:56:52 UTC
Post by Andy Chu
Post by Rob Landley
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files, we can add test_proc commands if
you can think of good individual pieces to test.
I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?
The toys/example/test_*.c files seem to print to stdout, so I guess
they still need a shell wrapper to test correctness.
cat tests/test_human_readable.test

scripts/test.sh test_human_readable

(I didn't hook up the scripts/examples directory in the script that
makes the "make test_blah" targets. I should add that, although "make
test_test_human_readable" is an awkward name...)
Post by Andy Chu
That's technically still an integration test rather than a unit test --
roughly I would say integration tests involve more than one
process (e.g. for a system of servers) whereas unit tests are run
entirely within the language using a unit test framework in that
language.
/me goes to look up the definitions of integration test and unit test...

https://en.wikipedia.org/wiki/Unit_testing
https://en.wikipedia.org/wiki/Integration_testing

And the second of those links to "validation testing" which redirects to
https://en.wikipedia.org/wiki/Software_verification_and_validation which
implies that testing (like documentation) is something done badly by a
third party team in Bangalore after the original team scatters to the
four winds, so no.

I do "unit testing" while developing, but then I repeatedly refactor
that code as I go. I just split xexit() into _xexit() and xexit() while
redoing sigatexit() to replace atexit(). (Because the toys.rebound
longjmp stuff needs to happen after "atexit" but the standard C
functions don't give you a way of triggering the list early, nor of
removing things from it short of actually exiting.)

If I had a unit test suite for xexit(), I would have made more work for
myself updating them. I'm still trying to get toys/code.html to have
decent coverage of lib, so that other people can use these tools. A test
suite that tests things that have no external visibility in a running
program proves what exactly?

My test suite is _deeply_ unfinished, and testing a moving target, but
its eventual goals are:

1) Regression testing.
2) Standards compliance.
3) Coverage of all code paths.

#3 is non-obvious: how does signal delivery work in here, or disk full?
If sed -i receives a kill signal while saving it should leave the old
file in place, which means write a new .file.tmp and then mv it atomically
over the old one, but kill -9 means when re-run it needs to cleanup
.file.tmp but sed -i doesn't get re-run a lot (like vi would) which
means what we WANT to do is open our tempfile, delete it, write to it,
and then hardlink the /proc/$$/fd/fileno into the new location taking
advantage of proc's special case behavior, but is that portable enough
(sed should work if /proc isn't mounted) and other things may want to
use that so that code should live in /lib and have a fallback path with
atexit() stuff (see lib/lib.c copy_tempfile())...

So if we _do_ make this plumbing, do we test it in sed or do we have a
test_copy_tempfile in toys/examples that specifically tests this part of
the plumbing, and then it's just a question of whether sed uses it? But
if sed _didn't_ use it, we wouldn't notice unless we tested it...

Another coverage vs duplication issue is the fact that every command
should be calling xexit() at the end (including return from main) which
means it does a fflush(0) and checks ferror() and does a perror_exit()
if there was a problem. (Which is why I've been pruning back use of
xprintf() and similar, those cause the program to exit early rather than
producing endless output when writing to a full disk or closed socket,
but the fflush() affects performance and the exit path should notice anyway.)

Possibly what I need is a shell function a test can call that says "this
command line modifies/replaces file $BLAH, make sure it handles disk
full and being interrupted and so on sanely", and it can run it in its
own directory and make sure there are no leftover files if it gets a
kill signal while running" (have it read from a fifo, once we're sure
it's blocked send it non -9 kill signal and then read the directory to
make sure there's only one file in there...)
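
Something like this, maybe (completely untested, every name in it is made up):

# hypothetical helper:
# check_atomic_replace "name" "command that rewrites ./target, reading stdin"
check_atomic_replace()
{
  rm -rf atomic.dir && mkdir atomic.dir && cd atomic.dir || return
  echo original > target
  mkfifo hang
  eval "$2" < hang &
  exec 3> hang              # lets the command get past open() and block in read()
  sleep 1                   # crude "it's definitely blocked by now"
  kill $!                   # plain SIGTERM, not -9
  wait $! 2>/dev/null
  exec 3>&-
  # after interruption there should be no leftover .target.tmp style files
  [ -z "$(ls -A | grep -v '^hang$' | grep -v '^target$')" ] &&
    echo "PASS: $1" || echo "FAIL: $1"
  cd .. && rm -rf atomic.dir
}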

This is the kind of thing I'm worried about in future. My idea of "full
coverage" is full of that sort of thing. Things which are externally
visible from the command can be tested by running the command in the
right environment.
Post by Andy Chu
https://github.com/google/googletest
Example: https://android.googlesource.com/platform/system/core.git/+/master/libziparchive/zip_archive_test.cc
You basically write a bunch of functions wrapped in TEST_ macros and
they are linked into a binary with a harness and run.
This is a set of command line utilities, not a C library.

If, after the 1.0 release, somebody wants to make a C library out of it,
have fun. But until then, infrastructure bits are subject to change
without notice. (I'm currently banging on dirtree to try to get that
infinite depth thing rm wants, for example. Commit 8d95074b7d03 changed
some of the semantics, adjusted the callers, and updated the
documentation. What would altering a test suite at that level
accomplish? Either the behavior is visible to the outside world when the
command runs, or it isn't. I can make a test_dirtree wrapper to check
specific dirtree corner cases, but we should also have _users_ of all
those corner cases, and should be testing the visible behavior of those
users...)
Post by Andy Chu
I guess toybox technically could use it if the tests were in C++ but
the code is in C, though it seems like it clashes with the style of
the project pretty badly.
The toybox shared C infrastructure isn't exported to the outside world
for use outside of toybox. If its semantics change, we adjust the users
in-tree.

Instrumenting the build to show that in allyesconfig this function is
never used from anywhere is interesting (and can probably be done with
readelf and sed).
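
Something shaped like this, maybe (untested; it assumes a build that leaves
per-file .o files lying around, and the path is made up):

# path to the .o files is a guess, adjust to the real build output
for sym in $(nm --defined-only generated/obj/lib_*.o | awk '$2=="T"{print $3}')
do
  nm generated/obj/*.o | grep -q " U $sym\$" || echo "never used: $sym"
done

readelf -s would work just as well; the point is cross-referencing defined
symbols against undefined references.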
Post by Andy Chu
I think the main issue that Elliot is pointing to is that there are no
internal interfaces to test against or mock out so you don't hose your
system while running tests (i.e. you can "reify" the file system state
and kernel state, and the substitute them with fake values in tests).
I've always planned to test these commands under an emulator in a
virtual system. (Aboriginal Linux is a much older project than toybox.)

Heck, back under busybox I was using User Mode Linux as my emulator
(qemu wasn't available yet):

https://git.busybox.net/busybox/tree/testsuite/umlwrapper.sh?h=1_2_1&id=f86a5ba510ef

Before that, I added a chroot mode to busybox tests:

https://git.busybox.net/busybox/tree/testsuite/testing.sh?h=1_2_1&id=f86a5ba510ef#n95

I am aware of that problem, but rather than dissecting the code and
sticking pins in it, I prefer to run the tests under an emulator in an
environment it can trash without repercussions. I just haven't finished
implementing it yet because it doesn't solve the "butterfly effect"
tests. (It's on the todo list!)

Note: solving the butterfly effect tests _is_ possible by providing a
fake /proc instead of a real one, --bind mounting a directory of known
data over /proc for the duration of the test so it produces consistent
results. It's all solvable, it's just a can of worms I haven't opened
yet because I've got six cans of worms going in parallel already.
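
In shell terms, inside the emulated system, it'd be something like this (the
snapshot directory is hypothetical):

# /path/to/canned-proc is a hypothetical snapshot of /proc with known contents
mount --bind /path/to/canned-proc /proc
ps -o pid,tty,time,cmd > actual.txt
umount /proc
diff -u expected.txt actual.txt

Where canned-proc is a copied /proc with known pids, stat files, and so on.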
Post by Andy Chu
I agree it would be nicer if there were such interfaces, but it's
fairly big surgery, and somewhat annoying to do in C. I think you
would have to get rid of most global vars,
I mostly have. All command-specific global variables should go in
GLOBALS() (which means they go in "this", which is a union of structs), and
everything else should be in the global "toys" union except for toybuf
and libbuf.

Let's see

nm --size-sort toybox_unstripped | sort -k2,2

and ignoring the "r", "t", and "T" entries gives us:

0000000000000001 b completed.6973

grep -r is not finding "completed" as a variable name? Odd...

0000000000000008 b tempfile2zap

In lib/lib.c so copy_tempfile() can let tempfile_handler() know what
file to delete atexit(). Bit of a hack, largely because there's only
_one_ file at a time it can store (not a list, but no users need it to
be a list yet). I think code.html mentions this? (If not it should.)

0000000000000028 b chattr

Blah, that's garbage I missed when cleaning up this contribution. That
should go in GLOBALS(), you can have a union of structs in there to have
per-command variables when sharing a file. (But why do they share a
file? I'd have to dig...)

0000000000000004 B __daylight@@GLIBC_2.2.5
0000000000000008 B __environ@@GLIBC_2.2.5
0000000000000008 B stderr@@GLIBC_2.2.5
0000000000000008 B stdin@@GLIBC_2.2.5
0000000000000008 B stdout@@GLIBC_2.2.5
0000000000000008 B __timezone@@GLIBC_2.2.5

glibc vomited forth these for no apparent reason.

0000000000000048 B toys
0000000000001000 B libbuf
0000000000001000 B toybuf
0000000000002028 B this

The ones I mentioned above, these are _expected_.

0000000000000150 d e2attrs

More lsattr stuff. You'll note that 2013 was before I had "pending",
looks like I missed some cleanup in this command.

0000000000001600 D toy_list

That could probably be "r" with a little more work, although I vaguely
recall adding it made lots of spurious "a const was passed to a
non-const thing! Alas and alack, woe betide! Did you know that string
constants are in the read-only section and segfault if you try to write
to them but the compiler doesn't complain if you pass them to a
non-const argument yet it all works out fine? Oh doom and gloom!"

That's probably why I didn't.

0000000000000004 V daylight@@GLIBC_2.2.5
0000000000000008 V environ@@GLIBC_2.2.5
0000000000000008 V timezone@@GLIBC_2.2.5

glibc again.

000000000000001f W mknod

What is a "W" type symbol?

And of course there's buckets of violations needing to be fixed in
pending, this is just defconfig...
Post by Andy Chu
and use a strategy like Lua
or sqlite, where they pass around a context struct everywhere, which
can have system functions like open()/read()/write()/malloc()/etc.
A) On nommu systems you have a limited stack size.

B) I looked into rewriting this in Lua back around 2009 or so. I chose
to stick with C. If you'd like to write a version in Lua, feel free.

If you're proposing that I extensively reengineer the project so you can
use a different style of test architecture, could you please explain
what those tests could test that the way I'm doing it couldn't?
Post by Andy Chu
sqlite has a virtual file system (VFS) abstraction for extensive tests
Linux has --bind and union mounts, containers, and I can run the entire
system under QEMU.
Post by Andy Chu
and Lua lets you plug in malloc/free at least. They are libraries and
not programs so I guess that is more natural.
The first time I wrote my own malloc/free wrapper to intercept and track
all allocations in a program was under OS/2 in 1996. I expect all
programmers do that at some point. I can only think of one person who
lists such a wrapper as one of his major life accomplishments on his
resume, and I have longstanding disagreements with that man.

I've been thinking for a long time about making generic recovery
infrastructure so you can nofork() any command and clean up after it.
And when I say "a long time" I mean a decade now:

http://lists.busybox.net/pipermail/busybox/2006-March/053270.html

And the conclusion I came to is "let the OS do it". I'm sure I blogged
about this in like 2011, but if you look at "man execve" and scroll down
to "All process attributes are preserved during an execve(), except the
following:" there's a GIANT list of things we'd need to clean up, and
that's just a _start_. It could mess with environment variables, or mess
with umask... it's just not worth it.

So I've since wandered to "fork a child process, have the child recurse
into the other_main() function to avoid an exec if there's enough stack
space left, and then have it exit with the parent waiting for it" as the
standard way of dealing with stuff that's not already easy. A command
can refrain from altering the process's state and be marked
TOYBOX_NOFORK or it can be a child process the OS cleans up after. I'm
not going for a case in between.

That said, it's important for long-lived processes not to leak. Init or
httpd can't leak, grep and sed can't leak per-file or per-line because
they can have unbounded input size... But again, people are looking for
that with valgrind, and I can make a general test for memory and open
filehandles and such in xexit() under TOYBOX_DEBUG.
Post by Andy Chu
I think this is worth keeping in mind perhaps, but it seems like there
is a lot of other low hanging fruit to address beforehand.
If we need to test C functions in ways that aren't easily allowed by the
users of those C functions, we can write a toys/example command that
calls those C functions in the way we want to check. But if the behavior
isn't already accessible from an existing user of that function in one
of the commands, why do we care again?
Post by Andy Chu
Andy
Rob
Rob Landley
2016-03-13 18:18:58 UTC
Post by Andy Chu
Post by Rob Landley
Unfortunately, the test suite needs as much work as the command
implementations do. :(
Ok, backstory!
OK, thanks a lot for all the information! That helps. I will work on
this. I think a good initial goal is just to triage the tests that
pass and make sure they don't regress (i.e. make it easy to run the
tests, keep them green, and perhaps have a simple buildbot).
I fixed "make test_mv" last night. The problem is that
scripts/singleconfig.sh was creating a "mv" that acted like "cp". (I
should write up a blog entry explaining the plumbing. This may fix one
or two other tests, I haven't checked. It should change the "make tests"
build which tests the multiplexer version, which depends on make
menuconfig to tell it what to test.)
Post by Andy Chu
For
example, the factor bug is trivial but it's a lot easier to fix if you
get feedback in an hour or so rather than a month later, when you have
to load it back into your head.
Indeed, but I did most of the fix yesterday and can check it in today.

(I special cased "-" is the first character, to print out a -1 and skip
it, then the rest of the math is unsigned for the larger range. This
means that "-" by itself is treated as -1, I'm not sure how to catch
that without an ugly special case test for that...)

I also switched it to long long, which should make no difference on 64
bit platforms (with current compilers, anyway; there's nothing STOPPING
128 bit long long the way they wrote LP64, but nobody does it). On 32
bit platforms, it slows it down up to 50%.
Post by Andy Chu
Post by Rob Landley
Really, I need a tests/pending. :(
Yeah I have some ideas about this. I will try them out and send a
patch. I think there does need to be more than 2 categories as you
say though, and perhaps more than kind of categorization.
Eventually it should all collapse back into one category, but there's a
lot of work to do between now and then. But tests/posix and tests/lsb
and such make a certain amount of sense, and that would both get us
tests/pending and not have to be undone later.
Post by Andy Chu
Post by Rob Landley
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.
Right, it is work that the fuzzing should be able to piggy back on...
so I was trying to find a way to leverage the existing test cases,
http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html
But the difference is that unlike sqlite, fuzzing toybox could do
arbitrarily bad things to your system, so it really needs to be
sandboxed. It gives really nasty inputs -- I wouldn't be surprised if
it can crash the kernel too.
I have plans to sandbox it using
http://landley.net/aboriginal/about.html but haven't finished that yet
because Giant TODO List.

(If I go off to my corner and focus on my todo list, I vanish for
months. Things like sed and ps can easily soak up a couple months each.
If I prioritize interrupts, I jump from topic to topic and wind up with
giant heaps of half-finished stuff, but at least other people can sort
of follow along. :)
Post by Andy Chu
Parsers in C are definitely the most likely successful targets for a
fuzzer, and sed seems like the most complex parser in toybox so far.
lib/args.c is a pretty complicated parser, and toys/*/find.c is also
moderately horrid in that regard (because it can't leverage lib/args to
do anything in a common way.)

I want to genericize find.c plumbing to have expr.c and maybe test.c do
parenthesization and prioritization and such the same way, but despite
sitting down to think it through more than once haven't come up with a
clean way to factor out the common code yet. I should just do expr.c and
then try to cleanup common code (if any) afterwards. (Yes there's an
expr.c in pending, and when I sat down to try to clean it up I hit
http://landley.net/notes-2014.html#02-12-2014 and then
http://landley.net/notes-2015.html#30-01-2015 and it's on the todo
list.)

This might help:

ls -loSr toys/{android,example,other,lsb,posix}/*.c

The size of ps is partly illusory, I implemented "ps", "top", "iotop",
"pgrep", and "pkill" in the same command because I hadn't cleaned out
the common infrastructure to move it to lib/proc.c yet. (I should do
that. It can't use any of the GLOBALS/TT stuff and can't use any FLAG_
macros, because neither are available in lib. Oh, and it also shouldn't
ever check toys.which->name to see which command is running. I've got
that mostly cleaned out, need to factor it out into lib. It's on the
todo list.)
Post by Andy Chu
The regex parsing seem to be handled by libraries, and I don't think
those are instrumented (because they are in a shared library not
compiled with afl-gcc). I'm sure we can find a few more bugs though.
I'd prioritize musl and bionic. As far as I'm concerned uClibc is dead
(and uClibc-ng is necromancy, not a fresh start), and glibc is big iron
along with the rest of the GNU/nonsense.
Post by Andy Chu
Post by Rob Landley
There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.
Yeah there are definitely a lot of inputs beside the argv values, like
the file system state and kernel state.
I'm working on tests/files. I need directory traversal weirdness with
some symlinks and different permissions and fifos and such, but I
suspect I need a tarball and/or script to set those up because trying to
check intentional filesystem corner cases into git is not a happy thought.
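
Something like a setup script along these lines (rough sketch, the contents
are invented):

# example contents, invented; run before the tests, torn down afterwards
mkdir -p files/tree/sub
echo hello > files/tree/plain
ln -s plain files/tree/link
ln -s ../.. files/tree/sub/loop         # symlink pointing back up the tree
ln -s nonexistent files/tree/dangling   # dangling symlink
mkfifo files/tree/fifo
touch files/tree/unreadable && chmod 000 files/tree/unreadable
mkdir files/tree/noexec && chmod 600 files/tree/noexec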
Post by Andy Chu
Those are harder to test, but
I like that you are testing with Aboriginal Linux and LFS. That is
already a great torture test.
Indeed, and ~2 weeks ago I was churning through LFS 7.8 packages until I
got distracted. I should get back to that. It's on the todo list.
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
Post by Andy Chu
- file system state -- the current method of putting setup at the
beginning of foo.test *might* be good enough for some commands, but
probably not all
I mentioned the need for a standard directory of files everything can
assume is there, and tests/files being a start of that. For testing by
hand I just use the toybox source du jour, but that's obviously
unsuitable for automated testing.

That said, these test scripts are shell scripts. You can do any
setup/teardown you need to. The automated stuff is a convenience.

That said, right now the tests are run by sourcing them, which means
there's potential leftover crap if you define shell functions and such.
I need to make sure there's an appropriate ( ) subshell at the right
places. (When I first wrote this, I knew the answers to that sort of
thing off the top of my head. That was in... 2005? Now I have to go back
and confirm and add comments, but that's what other people have to do
looking at my code so probably a net win...)
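
i.e. the runner doing roughly this, so nothing a test defines can leak into
the next one:

for f in tests/*.test
do
  (
    . scripts/runtest.sh   # or wherever the testing() helpers live
    . "$f"
  )
done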
Post by Andy Chu
But this doesn't need to be addressed initially.
By the way, is there a target language/style for shell and make?
I'm targeting bash (but older bash, like bash 2 with only a couple bash
3 features like =~), because toybox's shell should be a proper bash
replacement, and toybox building itself is an obvious smoketest.

That said, there's a bootstrapping problem on weird systems. If I could
carve out the toysh.c and sed.c standalone builds so they can be run on
systems that haven't got acceptable versions of those commands, I'd
increase the portability of toybox a lot. (It still mucks about in /proc
and /sys looking for stuff, and calls some linux-only syscalls and
ioctls, but everybody and their dog has a linux emulation layer these
days. Large chunks of posix is still stuck in the 1970's, and they
always chickened out about standardizing things like "mount" or "init"
so you can't _boot_ a system that doesn't go beyond posix.)
Post by Andy Chu
It looks like POSIX shell, and I'm not sure about the Makefile -- is it
just GNU make or something more restrictive? I like how you put most
stuff in scripts/make.sh -- that's also how I like to do it.
In theory make is only there to provide the expected API. In practice,
the kconfig subdirectory was copied from Linux 2.6.12 and I need to
write a new one from scratch. (It's on the todo list! Note we only use
the generated .config file which is produced from our Config.in source,
so washing data through that plumbing doesn't affect the copyright and
thus license of the resulting binary. But it's an ugliness that really
should go bye-bye, and now that I've broken open the
lib/interestingtimes.c and lib/linestack.c can of worms... It's on the
todo list.)
Post by Andy Chu
What about C? Clang is flagging a lot of warnings that GCC doesn't,
mainly -Wuninitialized.
The Android guys build with clang against bionic. I need to set up a
local clang toolchain, but my netbook is still ubuntu 12.04 and AOSP's
moved on to 14.04. It's on the todo list.

That said, gcc produces buckets of _spurious_ "may be used uninitialized
but never actually is" warnings, which I sometimes silence with "int
a=a;" in the declarations. (Generates no code but shuts up the warning.)

Are these _real_ uninitialized warnings? I'm very interested in those,
but find wading through large quantities of false positives tiresome.
(That's why I'm not a big fan of static analysis either. False positives
as far as the eye can see.)
Post by Andy Chu
Post by Rob Landley
But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."
Some code coverage stats might help? I can probably set that up as
it's similar to making an ASAN build. (Perhaps something like this
HTML http://llvm.org/docs/CoverageMappingFormat.html)
Ooh, that sounds interesting.
Post by Andy Chu
The build patch I sent yesterday will help with that as well since you
need to set CFLAGS.
I lost it in the noise, I need to do a pass over the mailing list web
archive again today and see what's fallen through the cracks...
Post by Andy Chu
Post by Rob Landley
Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...
In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" that it because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... That's
just testing the output with no arguments.)
Those are definitely hard ones... I agree with the strategy of
classifying the tests, and then we can see how many of the hard cases
are. I think detecting trivial breakages will be an easy first step,
and it should allow others to contribute more easily.
Initially I was only adding tests that either passed or showed something
interesting I needed to fix. This left large holes in the test suite
that I didn't know how to fill in yet, and when other people filled them
in I don't necessarily know how to fix them yet.

I'm glad somebody's taking a look. :)
Post by Andy Chu
thanks,
Andy
No, thank _you_,

Rob
Samuel Holland
2016-03-13 19:13:29 UTC
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code. This is as
simple as having a separate output variable for each type of output.

Granted, it will usually be unambiguous as to the correctness of the
program, but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test? It's a lot of duplication
to write "echo $?" in all of the test cases.

As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.

--
Regards,
Samuel Holland <***@sholland.org>
Andy Chu
2016-03-13 19:54:04 UTC
Post by Samuel Holland
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code. This is as
simple as having a separate output variable for each type of output.
Granted, it will usually be unambiguous as to the correctness of the
program, but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test? It's a lot of duplication
to write "echo $?" in all of the test cases.
As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.
Yes, that is exactly what I was getting at. Instead of "testing",
there could be another function "testing-errors" or something. But
it's not super important right now.

Andy
Rob Landley
2016-03-13 20:32:48 UTC
Post by Samuel Holland
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code.
You can do this now. "2>&1 > /dev/null" gives you only stderr, or
2>file... There are many options.
Post by Samuel Holland
simple as having a separate output variable for each type of output.
Granted, it will usually be unambiguous as to the correctness of the
program,
Yes, adding complexity to every test that isn't usually needed.
Post by Samuel Holland
but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things? Because "we can so clearly we should" is the gnu
approach to things? Because you're arguing for adding complexity to the
test suite to do things it can already do, and in many cases is already
doing? Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to make
the test suite easier to maintain and encourage additional contributions?

I'm not saying it's _not_ a good extension, but you did ask. Complexity
is a cost, spend it wisely. You're saying each test should test more
things, which means potential false positives and more investigation
about why a test failed (and/or more complex reporting format).

Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have said
why to stderr already). Yesterday I was poking at mv tests which have a
lot of "mv one two && [ -e two ] && [ ! -e one ] && echo yes" sort of
constructs. What is "the exit code" from that?
Post by Samuel Holland
It's a lot of duplication
to write "echo $?" in all of the test cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the return
code is specifically what I'm testing, and sometimes checking the return
code and checking the output are two separate tests.

Keep in mind that error_msg() and friends produce output, and the tests
don't catch stderr by default but pass it through. If we catch stderr by
default and a test DOESN'T check it, then it's ignored instead of
visible to the caller.

Also, keep in mind I want the host version to pass most of these tests
too, and if there are gratuitous differences in behavior I don't WANT
the test to fail based on something I don't care about and wasn't trying
to test. You're arguing for a tighter sieve with smaller holes when I've
already received a bunch of failing tests that were written against gnu
and never _tried_ against the toybox version, and failed for reasons
that aren't real failures.
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running the
test. There shouldn't be any by default, when there is it's pretty
obvious because those lines aren't colored.

I'm all for improving the test suite, but "what I think the test suite
should be trying to do differs from what you think the test suite should
be trying to do, therefore I am right" is missing some steps.

Rob

- All syllogisms have three parts, therefore this is not a syllogism.
Samuel Holland
2016-03-13 22:04:07 UTC
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code.
You can do this now. "2>&1 > /dev/null" gives you only stdout, or
2>file... There are many options.
Post by Samuel Holland
simple as having a separate output variable for each type of
output.
Granted, it will usually be unambiguous as to the correctness of
the program,
Yes, adding complexity to every test that isn't usually needed.
Post by Samuel Holland
but having the return code in the output string can be confusing
to the human looking at the test case. Plus, why would you not want
to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
Post by Rob Landley
Because "we can so clearly we should" is the gnu approach to things?
Because you're arguing for adding complexity to the test suite to do
things it can already do, and in many cases is already doing?
find tests -name '*.test' -print0 | xargs -0 grep 'echo.*yes' | wc -l
182

Considering how many times this pattern is already used, I don't see it
adding much complexity. It's trading an ad hoc pattern used in ~17% of
the tests for something more consistent and well-defined. Never mind the
fact that using &&/|| doesn't tell you _what_ the return code was, only
a binary success or failure.

I have seen a couple of tests that pass because they expect failure, but
the command is failing for the wrong reason.
Post by Rob Landley
Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to
make the test suite easier to maintain and encourage additional
contributions?
Personally, yes, I think so. If everything is explicit, there is less
"hmmm, how do I test that? I guess I could throw something together
using the shell. Now I have to sift through the existing tests to see if
there is precedent, and what that is." Instead, you put the inputs here,
the outputs there, and you're done.

I admit that filesystem operations are a whole new can of worms, and I
do not have a good answer to those.
Post by Rob Landley
I'm not saying it's _not_ a good extension, but you did ask.
Complexity is a cost, spend it wisely. You're saying each test
should test more things, which means potential false positives and
more investigation about why a test failed (and/or more complex
reporting format).
On the other hand, splitting up the outputs you check gives you _more_
information to help you investigate the problem. Instead of
"FAIL: foobar" you get something like "FAIL (stdout mismatch): foobar".

(As a side note, the test harness I've written recently even gives you a
diff of the expected and actual outputs when the test fails.)
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have
said why to stderr already). Yesterday I was poking at mv tests
which have a lot of "mv one two && [ -e two ] && [ ! -e one ] && echo
yes" sort of constructs. What is "the exit code" from that?
Well, if we are testing mv, then the exit code is the exit code of mv.

The rest is just checking filesystem state after the fact. I claimed to
not know what to do about it, but in the interest of avoiding punting
the question, here's the answer off the top of my head (even though I
know you aren't going to like it much):

Add an argument to the testing command that contains a predicate to eval
after running the test. If the predicate returns true, the test
succeeded; if the predicate returns false, the test failed. That way,
the only command that is ever in the "command" argument is the toy to test.
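
A rough sketch of what I mean (a hypothetical "testing_p" helper with
invented names, not the existing testing() in scripts/runtest.sh):

testing_p()
{
  NAME="$1" CMD="$2" EXPECTED="$3" PREDICATE="$4"
  # run only the toy under test, capturing its stdout
  ACTUAL="$(eval "$CMD" 2>/dev/null)"
  # then evaluate the filesystem-state predicate separately
  if [ "$ACTUAL" = "$EXPECTED" ] && eval "$PREDICATE"
  then echo "PASS: $NAME"
  else echo "FAIL: $NAME"
  fi
}

# usage: the command argument contains only mv, the check goes in the predicate
touch one
testing_p "mv" "mv one two" "" "[ -e two ] && [ ! -e one ]"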
Post by Rob Landley
Post by Samuel Holland
It's a lot of duplication to write "echo $?" in all of the test
cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the
return code is specifically what I'm testing, and sometimes checking
the return code and checking the output are two separate tests.
Okay, so it's a lot of duplication to write "&& yes" all over the place. :)
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we catch
stderr by default and a test DOESN'T check it, then it's ignored
instead of visibile to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings. If you
want to pass it through, nothing prevents that.
Post by Rob Landley
Also, keep in mind I want the host version to pass most of these
tests too, and if there are gratuitous differences in behavior I
don't WANT the test to fail based on something I don't care about
and wasn't trying to test. You're arguing for a tighter sieve with
smaller holes when I've already received a bunch of failing tests
that were written against gnu and never _tried_ against the toybox
version, and failed for reasons that aren't real failures.
If you want to do that, then yes, you definitely need a looser sieve. I
get 60 more failures here with TEST_HOST than I do with allyesconfig. I
agree that checking stderr on toybox vs. busybox or GNU is going to be
impossible because of differing error messages. Possible solutions
include not checking stderr by default (only formalizing the exit code
check), or simply not checking stderr when TEST_HOST=1.
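
Concretely, I'm picturing something like this inside the harness
(variable names invented):

# hypothetical: only compare stderr when we're testing toybox itself
if [ -z "$TEST_HOST" ] && [ "$ACTUAL_STDERR" != "$EXPECTED_STDERR" ]
then
  echo "FAIL (stderr mismatch): $NAME"
fi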
Post by Rob Landley
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are
going to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running
the test. There shouldn't be any by default, when there is it's
pretty obvious because those lines aren't colored.
I'm all for improving the test suite, but "what I think the test
suite should be trying to do differs from what you think the test
suite should be trying to do, therefore I am right" is missing some
steps.
Then I guess it's not exactly clear to me what you are trying to do with
the test suite. My interpretation of the purpose was to verify
correctness (mathematically, string transformations, etc.) and standards
compliance. In that sense, for each set of inputs to a command
(arguments, stdin, filesystem state), there is exactly one set of
correct outputs (exit code, stdout/stderr, filesystem state), and
therefore the goal of the test suite is to compare the actual output to
the correct output and ensure they match. If you don't check the exit
code, you are missing part of the output.



I won't tell you that you have to do it any one way. It's your project.
Of course, complexity is to some extent a value judgment. If you think
it adds too much complexity/strictness for your taste, that's fine. I
was just trying to explain the reasoning behind the suggestion, and why
I think it's a reasonable suggestion.
Post by Rob Landley
Rob
P.S. Your other reply came in just as I had finished typing. Sorry if
some of this is already addressed.

--
Regards,
Samuel Holland <***@sholland.org>
Rob Landley
2016-03-14 05:52:39 UTC
Permalink
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
but having the return code in the output string can be confusing
to the human looking at the test case. Plus, why would you not want
to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
I want any complexity to justify itself. Complexity is a cost and I
want to get a reasonable return for it.

That said, what specifically was the suggestion about unit testing? "We
should have some?" We should export a second C interface to something
that isn't a shell command for the purpose of telling us... what,
exactly?
Post by Samuel Holland
Post by Rob Landley
Because "we can so clearly we should" is the gnu approach to things?
Because you're arguing for adding complexity to the test suite to do
things it can already do, and in many cases is already doing?
find tests -name '*.test' -print0 | xargs -0 grep 'echo.*yes' | wc -l
182
Considering how many times this pattern is already used, I don't see it
adding much complexity. It's trading an ad hoc pattern used in ~17% of
the tests for something more consistent and well-defined.
Because 17% of the tests use it, 100% of the tests should get an extra
argument?
Post by Samuel Holland
Never mind the
fact that using &&/|| doesn't tell you _what_ the return code was, only
a binary success or failure.
In most cases I don't CARE what the return code was.

The specification of
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/false.html
says it returns "a non-zero error code". It never specifies what that
error code should _be_. If I return 3 I'm conformant. If the test suite
FAILS because it was expecting 1 and got 2, it's a bad test.
Post by Samuel Holland
I have seen a couple of tests that pass because they expect failure, but
the command is failing for the wrong reason.
Point them out please?
Post by Samuel Holland
Post by Rob Landley
Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to
make the test suite easier to maintain and encourage additional
contributions?
Personally, yes, I think so.
I don't.
Post by Samuel Holland
If everything is explicit, there is less
"hmmm, how do I test that? I guess I could throw something together
using the shell. Now I have to sift through the existing tests to see if
there is precedent, and what that is."
The test is a shell script with some convenience functions. It's always
been a shell script with some convenience functions, because it's
testing commands most commonly called from the shell.
Post by Samuel Holland
Instead, you put the inputs here,
the outputs there, and you're done.
You want the test not to be obviously a shell script.

One of the failure cases I've seen in contributed tests is they're
testing what toybox does, not what the command is expected to do. The
command could produce different conformant behavior, but the test would
break.
Post by Samuel Holland
I admit that filesystem operations are a whole new can of worms, and I
do not have a good answer to those.
I could list testing cans of worms here for a long time, but I've been
listing cans of worms all day and am tired. The more rigid your
infrastructure is, the narrower the range of situations it copes with.
Post by Samuel Holland
Post by Rob Landley
I'm not saying it's _not_ a good extension, but you did ask.
Complexity is a cost, spend it wisely. You're saying each test
should test more things, which means potential false positives and
more investigation about why a test failed (and/or more complex
reporting format).
On the other hand, splitting up the outputs you check gives you _more_
information to help you investigate the problem. Instead of
"FAIL: foobar" you get something like "FAIL (stdout mismatch): foobar".
Right now every FAIL is a stdout mismatch, since that's the only thing
it's checking, and you can go:

VERBOSE=fail make test_ls

And have it not only stop at the first failure, but show you the diff
between actual and expected, plus show you the command line it ran.

Your solution is to load more information into each test and write more
infrastructure to report which piece of information failed. I.E. make everything
bigger and more complicated without actually adding new capabilities.
Post by Samuel Holland
(As a side note, the test harness I've written recently even gives you a
diff of the expected and actual outputs when the test fails.)
So does this one, VERBOSE=1 shows the diff for all of them, VERBOSE=fail
stops after the first failure. It's not the DEFAULT output because it's
chatty.

Type "make help" and look at the "test" target. I think it's some of the
web documentation too, and it's also in the big comment block at the
start of scripts/runtest.sh.
Post by Samuel Holland
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have
said why to stderr already). Yesterday I was poking at mv tests
which have a lot of "mv one two && [ -e two ] && [ ! -e one ] && echo
yes" sort of constructs. What is "the exit code" from that?
Well, if we are testing mv, then the exit code is the exit code of mv.
Not in the above test it isn't. "mv" isn't necessarily the first thing
we run, or the last thing we run, in a given pipeline.

We have a test for "xargs", which is difficult to run _not_ in a
pipeline. When you test "nice" or "chroot" or "time", the command has an
exit code and its child could have an exit code. It's NOT THAT SIMPLE.
Post by Samuel Holland
The rest is just checking filesystem state after the fact. I claimed to
not know what to do about it, but in the interest of avoiding punting
the question, here's the answer off the top of my head (even though I
Add an argument to the testing command that contains a predicate to eval
after running the test. If the predicate returns true, the test
succeeded; if the predicate returns false, the test failed. That way,
the only command that is ever in the "command" argument is the toy to test.
If you would like to write a completely different test suite from the
one I've done, feel free. I'm not stopping you.
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
It's a lot of duplication to write "echo $?" in all of the test
cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the
return code is specifically what I'm testing, and sometimes checking
the return code and checking the output are two separate tests.
Okay, so it's a lot of duplication to write "&& yes" all over the place. :)
"&& echo yes", and no it isn't really. Compared to having an extra
argument in the other 3/4 of the tests that don't currently care about
the exit code, plus an exception mechanism for "we don't actually care
what this exit code is, just that it's nonzero"...
Post by Samuel Holland
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we catch
stderr by default and a test DOESN'T check it, then it's ignored
instead of visibile to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings.
By default the test intercepts stdout and stderr goes to the terminal.
The shell won't care what gets produced on stderr if the resulting exit
code is 0, either.
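
For reference, the usual redirection choices within a test's command
string look something like:

command 2>/dev/null                  # keep stdout, discard stderr
command > stdout.txt 2> stderr.txt   # capture the two streams separately
command 2>&1                         # fold stderr into the captured stdout
command 2>&1 >/dev/null              # see only stderr (delivered via stdout)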
Post by Samuel Holland
If you want to pass it through, nothing prevents that.
I don't understand what you're saying here. I already pointed out you
can redirect and intercept it and make it part of your test. That said,
perror_msg appends a translated error string so exact matches on english
will fail in other locales. Plus kernel version changes have been known
to change what errno a given syscall failure returns. Heck, different
filesystem types sometimes do that too. (Reiserfs was notorious for that.)
Post by Samuel Holland
Post by Rob Landley
Also, keep in mind I want the host version to pass most of these
tests too, and if there are gratuitous differences in behavior I
don't WANT the test to fail based on something I don't care about
and wasn't trying to test. You're arguing for a tighter sieve with
smaller holes when I've already received a bunch of failing tests
that were written against gnu and never _tried_ against the toybox
version, and failed for reasons that aren't real failures.
If you want to do that, then yes, you definitely need a looser sieve. I
get 60 more failures here with TEST_HOST than I do with allyesconfig. I
agree that checking stderr on toybox vs. busybox or GNU is going to be
impossible because of differing error messages.
I've annotated a few tests with 'expected to fail with non-toybox', grep
for SKIP_HOST=1 in tests/*.test.
Post by Samuel Holland
Possible solutions
include not checking stderr by default (only formalizing the exit code
check), or simply not checking stderr when TEST_HOST=1.
I.E. add another special case context with different behavior.
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are
going to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running
the test. There shouldn't be any by default, when there is it's
pretty obvious because those lines aren't colored.
I'm all for improving the test suite, but "what I think the test
suite should be trying to do differs from what you think the test
suite should be trying to do, therefore I am right" is missing some
steps.
Then I guess it's not exactly clear to me what you are trying to do with
the test suite.
I had a list of 3 reasons in a previous email.

I run a bunch of tests manually when developing a command to determine
whether or not I'm happy with the command's behavior. My rule of thumb
is if I run a test command line during development, I should have an
equivalent test in the test suite for regression testing purposes. (I
had to run this test to check the behavior of the command, therefore it
is a necessary test and should be in the regression test suite. Usually
I cut and paste the command lines I ran to a file, along with the
output, and throw it on the todo list. Running "find" or "sed" tests
against the toybox source isn't stable for reasons listed earlier, or
grep -ABC against the README with all the weird possible -- placements,
so translating the tests into the format I need isn't always obvious.)

Then I want to do a second pass reading the specs closely (posix, man
page, whatever the spec I'm implementing from is) and do tests checking
every constraint in the spec. (That's one of my "run up to the 1.0
release" todo items.)

This is why so much of the test suite is still on the todo list. There's
a lot of work in doing it _right_...
Post by Samuel Holland
My interpretation of the purpose was to verify
correctness (mathematically, string transformations, etc.) and standards
compliance.
Alas, as I know from implementing a lot of this stuff, determining what
"correctness" _means_ is often non-obvious and completely undocumented.
Post by Samuel Holland
In that sense, for each set of inputs to a command
(arguments, stdin, filesystem state), there is exactly one set of
correct outputs (exit code, stdout/stderr, filesystem state)
This isn't how reality works.

See "false" above. 3 is an acceptable return code, and that's a TRIVIAL
case. The most interesting things to test are the error paths, and the
error output is almost never rigidly specified, and in the toybox case
perror_exit output is partly translated. And then you get into "toybox,
busybox, and ubuntu produce different output but I want at least _most_
of the tests to pass in common"...
Post by Samuel Holland
, and
therefore the goal of the test suite is to compare the actual output to
the correct output and ensure they match. If you don't check the exit
code, you are missing part of the output.
Remember the difference between android and toybox uptime output? Or how
about pmap, what should its output show? The only nonzero return code
base64 can do is failure to write to stdout, but I recently added tests
to check that === was being wrapped by -w properly (because previously
it wasn't). Is error return code the defining characteristic of an
nbd-client test? (How _do_ you test that?)

Here is a cut and paste of the _entire_ man page of setsid:

SETSID(1) User Commands SETSID(1)

NAME
setsid - run a program in a new session

SYNOPSIS
setsid program [arg...]

DESCRIPTION
setsid runs a program in a new session.

SEE ALSO
setsid(2)

AUTHOR
Rick Sladkey <***@world.std.com>

AVAILABILITY
        The setsid command is part of the util-linux package and is
        available from ftp://ftp.kernel.org/pub/linux/utils/util-linux/.

util-linux November 1993 SETSID(1)

Now tell me: what error return codes should it produce, and under what
circumstances? Are the error codes the man page doesn't bother to
mention an important part of testing this command, or is figuring out
how to distinguish a session leader (possibly with some sort of pty
wrapper plumbing to signal it through) more important to testing this
command?
Post by Samuel Holland
I won't tell you that you have to do it any one way. It's your project.
Of course, complexity is to some extent a value judgment. If you think
it adds too much complexity/strictness for your taste, that's fine. I
was just trying to explain the reasoning behind the suggestion, and why
I think it's a reasonable suggestion.
I'd like to figure out how to test the commands we've got so that if
they break in a way we care about, the test suite tells us rather than
us having to find it out. I don't care if false returns 3 and nothing
will ever notice. I _do_ care that the perl build broke because sed
wasn't doing a crazy thing I didn't know it had to do, which is why
commit 32b3587af261 added a test. If somebody has to implement a new sed
in future, that test shows them a thing it needs to do to handle that
crazy situation. (Unless perl gets fixed, which seems unlikely. But if
so, git annotate on the test suite shows why the test was added,
assuming the comment itself before the test isn't enough.)

Part of what the test suite does is make me re-think through what the
correct behavior _is_ in various corner cases, and I'm not sure setsid's
current behavior is remotely correct. I always meant to revisit it when
doing the shell...
Post by Samuel Holland
Post by Rob Landley
Rob
P.S. Your other reply came in just as I had finished typing. Sorry if
some of this is already addressed.
It's fine. Figuring out the right thing to do is often hard.

Rob
Samuel Holland
2016-03-15 01:58:55 UTC
Permalink
Your previous email definitely clarified how you want the test suite to
work, thank you.

I tried to answer your questions while avoiding duplication. I realize
this thread is getting towards bikeshedding territory, so I've attempted
to focus on the more factual/neutral/useful parts.
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
That said, what specifically was the suggestion about unit testing?
"We should have some?" We should export a second C interface to
something that isn't a shell command for the purpose of
telling us... what, exactly?
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something
all the existing competition in this space gets wrong too, but it's
the most obvious argument for the creation of the _next_ generation
tool...
There is only so much variable-reduction you can do if you test the
whole program at once. If you want to, as you suggested, "test specific
things", like the command infrastructure, thoroughly, they have to be
tested apart from the limits of the commands they are used in.
Post by Rob Landley
If we need to test C functions in ways that aren't easily allowed by
the users of those C functions, we can write a toys/example command
that calls those C functions in the way we want to check.
I think we actually agree with each other here.
Post by Rob Landley
Post by Samuel Holland
Considering how many times this pattern is already used, I don't
see it adding much complexity. It's trading an ad hoc pattern used
in ~17% of the tests for something more consistent and
well-defined.
Because 17% of the tests use it, 100% of the tests should get an
extra argument?
It's not adding any more features, just refactoring the existing
behavior behind a common function instead of repeating it throughout the
test suite. For how to avoid adding complexity where it's not used, I'll
point back to this earlier suggestion:
Post by Rob Landley
Yes, that is exactly what I was getting at. Instead of "testing",
there could be another function "testing-errors" or something. But
it's not super important right now.
Post by Samuel Holland
I have seen a couple of tests that pass because they expect
failure, but the command is failing for the wrong reason.
Point them out please?
I don't remember specifics at this point. I haven't looked at the test
suite in much detail (other than reading the mailing list) since the end
of 2014 or so when I was working on using it in a toy distro.

http://thread.gmane.org/gmane.linux.toybox/1709
https://github.com/smaeul/escapist/commits/master

If I remember correctly, one of them failed because it got a SIGSEGV,
but to a shell that's just false. The other one was not crashing, but
failing for another reason than expected. If I had to guess, one of them
was cp, but that's because it's the one I spent the most time on. I'm
positive they are both fixed now.
Post by Rob Landley
you can go
VERBOSE=fail make test_ls
And have it not only stop at the first failure, but show you the diff
between actual and expected, plus show you the command line it ran.
<snip>
Post by Samuel Holland
(As a side note, the test harness I've written recently even gives
you a diff of the expected and actual outputs when the test
fails.)
So does this one, VERBOSE=1 shows the diff for all of them,
VERBOSE=fail stops after the first failure. It's not the DEFAULT
output because it's chatty.
Type "make help" and look at the "test" target. I think it's some of
the web documentation too, and it's also in the big comment block at
the start of scripts/runtest.sh.
Okay, to some extent, I actually like that way better than mine. It
gives you an overview of how close you are to conformance (you can count
the failing tests, instead of quitting at the first failure), yet lets
you drill down when desired. Like I said, I haven't studied the test
infrastructure recently; I should go do that.
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines,
or multi-stage "do thing && examine thing" (which _already_ fails
if do thing returned failure, and with the error_msg() stuff
would have said why to stderr already). Yesterday I was poking at
mv tests which have a lot of "mv one two && [ -e two ] && [ ! -e
one ] && echo yes" sort of constructs. What is "the exit code"
from that?
Well, if we are testing mv, then the exit code is the exit code of mv.
Not in the above test it isn't. "mv" isn't necessarily the first
thing we run, or the last thing we run, in a given pipeline.
Right. The whole point was that (in my ideal test suite) mv (or any
other program being tested) should never _be_ in a pipeline. That way
you don't have to even consider how the pipeline works in xyz shell.
Post by Rob Landley
We have a test for "xargs", which is difficult to run _not_ in a
pipeline.
Redirecting stdin from a file (which could be temporary) doesn't do
weird things with return values like a shell pipeline does.
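
For example (illustrative, not an existing test), an xargs test can read
from a file rather than sit at the end of a pipe:

printf 'one two\nthree\n' > input
xargs echo < input          # the exit status here is xargs's own
echo $?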
Post by Rob Landley
When you test "nice" or "chroot" or "time", the command has an exit
code and its child could have an exit code. It's NOT THAT SIMPLE.
"nice true", "xargs echo", "chroot . true", etc. I'm not sure how "true"
or "echo" would have any other exit code than 0. (If it does, your
shell/echo/true is majorly broken, and you might as well give up.)
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we
catch stderr by default and a test DOESN'T check it, then it's
ignored instead of visibile to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings.
By default it intercepts stdout and stderr goes to the terminal. The
shell won't care what gets produced on stderr if the resulting exit
code is then 0 either.
Post by Samuel Holland
If you want to pass it through, nothing prevents that.
I don't understand what you're saying here. I already pointed out you
can redirect and intercept it and make it part of your test.
I should have been more clear: I was confused by why you were
considering "If we catch stderr by default and a test DOESN'T check
it..." If stderr is caught by the test infrastructure the test doesn't
specify anything for it, it would be compared against the empty string.
The test would have to actively throw it away (2>/dev/null or something)
for it to not be checked.

I am aware you can pass stderr through to the terminal without checking
it, and that that's what the toybox test suite currently does. "If you
want to pass it through, nothing prevents that." was meant to point out
that, even if stderr was caught (for checking) by default, it could
_also_ be sent to the terminal if you wanted to.

(I think a lot of my writing suffers from "it makes sense to me...".)
Post by Rob Landley
That said, perror_msg appends a translated error string so exact
matches on english will fail in other locales.
Set LC_MESSAGES=C in the test infrastructure? By this time, I've
realized that checking stderr for an expected value is often going to be
impossible...
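
Forcing the locale itself is cheap, something like:

LC_ALL=C mv missing nowhere 2>&1

but the message text still differs between implementations, which is the
real problem.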
Post by Rob Landley
Plus kernel version changes have been known to change what errno a
given syscall failure returns. Heck, different filesystem types
sometimes do that too. (Reiserfs was notorious for that.)
...and apparently errno isn't reliable either. I thought the kernel
didn't break userspace? I guess that contract doesn't include "why you
can't do that."

Okay, point taken. I wasn't aware that return codes were so loosely
specified. I was under the impression that programs would generally just
exit with the last errno (or 1 for some other error), and that errno
values were well-specified at the libc/kernel level.
Post by Rob Landley
Post by Samuel Holland
and therefore the goal of the test suite is to compare the actual
output to the correct output and ensure they match. If you don't
check the exit code, you are missing part of the output.
Remember the difference between android and toybox uptime output? Or
how about pmap, what should its output show? The only nonzero return
code base64 can do is failure to write to stdout, but I recently
added tests to check that === was being wrapped by -w properly
(because previously it wasn't). Is error return code the defining
characteristic of an nbd-client test? (How _do_ you test that?)
SETSID(1) User Commands SETSID(1)
NAME setsid - run a program in a new session
SYNOPSIS setsid program [arg...]
DESCRIPTION setsid runs a program in a new session.
SEE ALSO setsid(2)
AVAILABILITY The setsid command is part of the util-linux package
and is available from
ftp://ftp.kernel.org/pub/linux/utils/util-linux/.
util-linux November 1993 SETSID(1)
Now tell me: what error return codes should it produce, and under
what circumstances? Are the error codes the man page doesn't bother
to mention an important part of testing this command, or is figuring
out how to distinguish a session leader (possibly with some sort of
pty wrapper plumbing to signal it through) more important to testing
this command?
Of course I don't claim that return codes are the most important, by any
means. I just think^Wthought they were a relatively low-overhead thing
to test _in_addition_ to the important stuff, that might catch some
additional corner cases. As for "setsid", in my opinion, it should
return the errno from setsid() or exec*() if either fails. After it
execs, it doesn't really have a say.

Amusingly, my setsid (from util-linux 2.26.2, which has two real
options!) manages to fail rather spectacularly:

setsid: child 8695 did not exit normally: Success
setsid: failed to execute htop: Invalid or incomplete multibyte or wide
character
Post by Rob Landley
I'd like to figure out how to test the commands we've got so that if
they break in a way we care about, the test suite tells us rather
than us having to find it out. I don't care if false returns 3 and
nothing will ever notice.
Hmmm, difference of viewpoint. I see the command line interface of these
programs as an API, just like any other. You mention their use in shell
scripts. It would be a regression to gratuitously change the output of a
command, even if it is still within the relevant standard. The argument
against that is that shell scripts should follow the standard, not a
specific implementation; but as you often bring up, the standards are
Post by Rob Landley
One of the failure cases I've seen in contributed tests is they're
testing what toybox does, not what the command is expected to do.
and (as I see it) such is the difference between regression testing and
testing for conformance. And both are useful. I'm all for continuous
refactoring of internal logic, but externally-visible behavior makes
more promises. Of course, toybox isn't 1.0 yet, so users should expect
changes in behavior... I give up.

What would be nice is if there were a POSIX test suite for commands and
utilities... Apparently there was at one point, some website mentioned
it, but it's not listed on the downloads page:

http://www.opengroup.org/testing/downloads.html

Doing some URL fiddling got me here:

http://www.opengroup.org/testing/downloads/vsclite.html

It's gone. Of course you can't just use the new one, you have to try to
get certified to even download it.
Post by Rob Landley
If you would like to write a completely different test suite from the
one I've done, feel free. I'm not stopping you.
I will probably end up trying that, at least for POSIX, because a freely
available test suite is generally useful (and I'm young enough to enjoy
writing something for the educational value even if it gets all thrown
away). How? grep, regexec(), I'll figure something out.

Again, thank you for setting me straight.

--
Regards,
Samuel Holland
