Discussion:
[Toybox] awk seen in the wild
enh
2016-07-08 16:36:48 UTC
https://github.com/android-ndk/ndk/issues/133#issuecomment-231318129
is the first time in about a decade i've seen awk more complicated
than "print $2" in the wild...

readelf -sW toolkit/library/libxul.so |grep FUNC |awk '$2 ~ /[048c]$/ {print}'

ruins my genius plan to "implement" an awk that only supports "print
$(\d+)"... :-)
Ed Maste
2016-07-09 01:01:36 UTC
Post by enh
https://github.com/android-ndk/ndk/issues/133#issuecomment-231318129
is the first time in about a decade i've seen awk more complicated
than "print $2" in the wild...
readelf -sW toolkit/library/libxul.so |grep FUNC |awk '$2 ~ /[048c]$/ {print}'
ruins my genius plan to "implement" an awk that only supports "print
$(\d+)"... :-)
Heh. If you want some contemporary "more complicated" samples there
are 85 awk scripts in the FreeBSD tree that should serve that purpose.

Here's a few interesting ones for your amusement:
https://svnweb.freebsd.org/base/head/sys/dev/bhnd/tools/nvram_map_gen.awk?view=markup
https://svnweb.freebsd.org/base/head/sys/tools/sound/feeder_rate_mkfilter.awk?view=markup
https://svnweb.freebsd.org/base/head/sys/tools/makeobjops.awk?view=markup
https://svnweb.freebsd.org/base/head/tools/tools/nanobsd/mtree-dedup.awk?view=markup
Andy Chu
2016-07-09 01:26:45 UTC
Post by Ed Maste
Post by enh
https://github.com/android-ndk/ndk/issues/133#issuecomment-231318129
is the first time in about a decade i've seen awk more complicated
than "print $2" in the wild...
readelf -sW toolkit/library/libxul.so |grep FUNC |awk '$2 ~ /[048c]$/ {print}'
ruins my genius plan to "implement" an awk that only supports "print
$(\d+)"... :-)
Heh. If you want some contemporary "more complicated" samples there
are 85 awk scripts in the FreeBSD tree that should serve that purpose.
https://svnweb.freebsd.org/base/head/tools/tools/nanobsd/mtree-dedup.awk?view=markup
OK sure, this apparently processes a log file and gives you some kind
of file system tree (202 lines)
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/tools/makeobjops.awk?view=markup
C code generation (505 lines). OK sure, I've noticed that most shell
implementations use some awk for code generation. There are lots of
enums and string arrays, and awk is great for eliminating redundancy.
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/dev/bhnd/tools/nvram_map_gen.awk?view=markup
OK wow, a full parser for a recursive language, and C code generator.
Not unheard of, since Kernighan's book demos recursive descent parsing
in Awk. Has a qsort() implementation and other "library" functions.
1162 lines.
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/tools/sound/feeder_rate_mkfilter.awk?view=markup
WOW. OK that's a creative use of awk. It's still a C code generator,
but it's doing lots of floating point math to create audio filters.
899 lines.

Where exactly is this used? Is it just a random user space tool or
actually fundamental to BSD in some way?

Andy
Ed Maste
2016-07-09 01:58:46 UTC
Post by Andy Chu
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/tools/sound/feeder_rate_mkfilter.awk?view=markup
WOW. OK that's a creative use of awk. It's still a C code generator,
but it's doing lots of floating point math to create audio filters.
899 lines.
Where exactly is this used? Is it just a random user space tool or
actually fundamental to BSD in some way?
It's used in the FreeBSD kernel build process:
https://svnweb.freebsd.org/base/head/sys/conf/files?revision=301778&view=markup#l40

Perhaps not "fundamental to BSD" in general, but important for FreeBSD
users who want sound :-)
Rob Landley
2016-07-10 17:28:06 UTC
Post by Andy Chu
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/tools/sound/feeder_rate_mkfilter.awk?view=markup
WOW. OK that's a creative use of awk. It's still a C code generator,
but it's doing lots of floating point math to create audio filters.
899 lines.
Where exactly is this used? Is it just a random user space tool or
actually fundamental to BSD in some way?
Awk's better than bc.

A few years back Peter Anvin added perl as a build dependency to every
package he maintained (linux, syslinux, klibc, etc), at the same time.
I'm not sure exactly what agenda he was pushing, but there was a
definite theme. When I complained about the damage to the kernel and
submitted perl removal patches rewriting them in shell and C, he called
my objection to builds growing gratuitous new dependencies "academic":

http://lkml.iu.edu/hypermail/linux/kernel/0802.1/4400.html

Years later when I got serious about pushing the perl removal patches
upstream, I.E. not just periodically posting updated versions (which I
did for something like 5 years) but actually responding to people
explaining WHY they should go upstream:

http://lkml.iu.edu/hypermail/linux/kernel/1302.3/01520.html
http://lkml.iu.edu/hypermail/linux/kernel/1302.3/02061.html

Peter replaced timeconst.pl with timeconst.bc, which blocked my C
rewrite by porting it to a language that busybox didn't support, and
which would require an arbitrary precision math library to implement
_correctly_. (Which is why I started looking at the public domain libtommath
fork in dropbear, which I found utterly unintelligible.) And of course
Peter didn't post it to the list, but instead checked it straight in and
sent it to Linus in a pull request.

His bc version did successfully block my C rewrite of the same code and
keep my "academic" simple "build the kernel with busybox" goal from going
upstream, so I still have to maintain a patch out of tree to
do that:

https://github.com/landley/aboriginal/blob/master/sources/patches/linux-noperl-timeconst.patch

If he'd ported it to awk, busybox could have handled it. Which as far as
I can tell is why he _didn't_.
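
(For context, all that script really does is precompute multiply-and-shift
pairs so the kernel can convert between HZ and msec/usec without doing
runtime division. A hand-wavy C sketch of the core idea, with made-up names,
not the actual timeconst script or my patch:)

#include <stdint.h>
#include <stdio.h>

/* Made-up names: turn "x * num / den" into "(x * mul) >> shift" so the
 * kernel never has to divide at runtime. The real generator also emits
 * an ADJ rounding term and separate 32/64 bit variants; this is just
 * the shape of the math. */
static void emit(const char *name, uint64_t num, uint64_t den, int shift)
{
  uint64_t mul = ((num << shift) + den - 1)/den;  /* round up */

  printf("#define %s_MUL32 0x%llx\n", name, (unsigned long long)mul);
  printf("#define %s_SHR32 %d\n", name, shift);
}

int main(void)
{
  int hz = 100;  /* would come from CONFIG_HZ */

  emit("HZ_TO_MSEC", 1000, hz, 32);
  emit("MSEC_TO_HZ", hz, 1000, 32);

  return 0;
}

The bc version does that rounding with arbitrary precision, which is the
part a small C or awk replacement has to be careful to get right.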

Rob
enh
2016-07-11 16:46:06 UTC
Post by Rob Landley
Post by Andy Chu
Post by Ed Maste
https://svnweb.freebsd.org/base/head/sys/tools/sound/feeder_rate_mkfilter.awk?view=markup
WOW. OK that's a creative use of awk. It's still a C code generator,
but it's doing lots of floating point math to create audio filters.
899 lines.
Where exactly is this used? Is it just a random user space tool or
actually fundamental to BSD in some way?
Awk's better than bc.
A few years back Peter Anvin added perl as a build dependency to every
package he maintained (linux, syslinux, klibc, etc), at the same time.
I'm not sure exactly what agenda he was pushing, but there was a
definite theme. When I complained about the damage to the kernel and
submitted perl removal patches rewriting them in shell and C, he called
http://lkml.iu.edu/hypermail/linux/kernel/0802.1/4400.html
Years later when I got serious about pushing the perl removal patches
upstream, I.E. not just periodically posting updated versions (which I
did for something like 5 years) but actually responding to people
http://lkml.iu.edu/hypermail/linux/kernel/1302.3/01520.html
http://lkml.iu.edu/hypermail/linux/kernel/1302.3/02061.html
Peter replaced timeconst.pl with timeconst.bc, which blocked my C
rewrite by porting it to a language that busybox didn't support and
which _correctly_ implementing would require an arbitrary precision math
library. (Which is why I started looking at the public domain libtommath
fork in dropbear, which I found utterly unintelligible.)
another plug for supporting a libcrypto dependency: the *sum utilities
are orders of magnitude faster with libcrypto, the SSL support in
things like netcat/wget would be something i could actually use on
Android (there's no way i'll be able to ship an alternative SSL
implementation) and you'd get arbitrary precision integers too.

i don't think you're likely to go this route, but i do like to keep
bringing it up so the idea of being API-compatible enough that it's
possible to use toybox with either your backend or *ssl is in the back
of your mind...

(no one's complained about the slow *sum commands yet, but if you're
interested i'm happy to send a patch.)
Post by Rob Landley
And of course
Peter didn't post it to the list, but instead checked it straight in and
sent it to Linus in a pull request.
His bc version did successfully block my C rewrite of the same code and
keep my "academic" simple "build the kernel with busybox" from going up
into the mainstream, so I still have to maintain a patch out of tree to
https://github.com/landley/aboriginal/blob/master/sources/patches/linux-noperl-timeconst.patch
If he'd ported it to awk, busybox could have handled it. Which as far as
I can tell is why he _didn't_.
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-07-11 19:45:58 UTC
Post by enh
another plug for supporting a libcrypto dependency: the *sum utilities
are orders of magnitude faster with libcrypto, the SSL support in
things like netcat/wget would be something i could actually use on
Android (there's no way i'll be able to ship an alternative SSL
implementation) and you'd get arbitrary precision integers too.
What I want to do is take the approach Isaac Dunham suggested, of using
"openssl s_client -quiet -connect" as an alternative to netcat. So
toybox wget should call out to that to get https support, and that would
be provided by something external.

Lipi Lee did a first pass at this already, which I didn't immediately
apply because for some reason the patch he sent me didn't apply to the
wget he sent me (I don't think I'd modified it). When I tested the wget
it was corrupting the files it downloaded (outputting numbers in the
middle of the data), and that sent me down the road of rewriting the
thing...

I'm balancing some competing design goals here: 'self-contained' vs
'people use this and need speed out of some tightly optimized
algorithms'. The way busybox dealt with this was by having multiple
implementations (CONFIG_MD5_SMALL has 4 settings), which I very much
don't want to do...

The problem with having an internal and an external implementation is the
internal one gets much less testing that way. I suspect the right answer
is to just lump it and have the actual unrolled fast version in toybox,
because the simple one isn't good enough for the userbase. That said, a
lot of these external libraries have assembly optimized versions for
various platforms, and I KNOW I'm not going there...

Hmmm. Which lib is "libcrypto", by the way?
Post by enh
i don't think you're likely to go this route, but i do like to keep
bringing it up so the idea of being API-compatible enough that it's
possible to use toybox with either your backend or *ssl is in the back
of your mind...
(no one's complained about the slow *sum commands yet, but if you're
interested i'm happy to send a patch.)
People have sent patches to speed up md5sum and sha1sum and it boils
down to lots of loop unrolling that makes the algorithm harder to
understand. It was back around here:
http://lists.landley.net/pipermail/toybox-landley.net/2014-May/006638.html

I applied the first few, but the code got very large and very unreadable
and I kept hoping there was a way the compiler's darn optimizer could do
that for me. I should go back and look at those patches again, but it's
competing with 60 other todo items...

According to http://valerieaurora.org/hash.html both md5sum and sha1sum
are semi-obsolete. I need to do sha256 and sha3 and so on, and adding
those is a higher priority todo item for me than speeding up the old
ones. But despite being obsolete as _crypto_, rsync moved from md4sum
to md5sum a couple years back, and git is based on sha1, and neither is
moving off those because you wrap the transport in https and sign the
commits if you care about security...

My deflate implementation is also half the speed it should be, although
the first pass was focusing on correctness rather than any kind of
optimization. (In theory zlib started life as a public domain
implementation, but that version seems to have fallen off the net and
doesn't have a lot of modern optimizations anyway.)

One thing I've been meaning to do with deflate/inflate is add the SMP
mode (where it outputs zero byte packets at each dictionary reset so you
can scan ahead for them and do blocks in parallel).
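
(Presumably the same trick zlib exposes as Z_FULL_FLUSH: reset the
dictionary and emit an empty stored block, which shows up as a 00 00 ff ff
marker you can scan for. Minimal zlib sketch of the producer side, not
toybox code:)

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress two chunks, ending each with Z_FULL_FLUSH. The full flush
 * resets the dictionary and writes an empty stored block (00 00 ff ff),
 * which is the kind of seam a parallel decompressor can scan ahead for
 * and hand to another thread without needing the preceding history. */
int main(void)
{
  const char *chunks[] = { "hello hello hello\n", "world world world\n" };
  unsigned char out[1024];
  z_stream strm = {0};
  int i;

  if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK) return 1;
  strm.next_out = out;
  strm.avail_out = sizeof(out);

  for (i = 0; i < 2; i++) {
    strm.next_in = (unsigned char *)chunks[i];
    strm.avail_in = strlen(chunks[i]);
    if (deflate(&strm, Z_FULL_FLUSH) != Z_OK) return 1;
  }
  deflate(&strm, Z_FINISH);
  printf("%lu compressed bytes\n",
         (unsigned long)(sizeof(out) - strm.avail_out));

  return deflateEnd(&strm);
}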

So many todo items...

Rob
enh
2016-07-11 20:10:51 UTC
Post by Rob Landley
Post by enh
another plug for supporting a libcrypto dependency: the *sum utilities
are orders of magnitude faster with libcrypto, the SSL support in
things like netcat/wget would be something i could actually use on
Android (there's no way i'll be able to ship an alternative SSL
implementation) and you'd get arbitrary precision integers too.
What I want to do is take the approach Isaac Dunham suggested, of using
"openssl s_client -quiet -connect" as an alternative to netcat. So
toybox wget should call out to that to get https support, and that would
be provided by something external.
Lipi Lee did a first pass at this already, which I didn't immediately
apply because for some reason the patch he sent me didn't apply to the
wget he sent me (I don't think I'd modified it). When I tested the wget
it was corrupting the files it downloaded (outputting numbers in the
middle of the data), and that sent me down the road of rewriting the
thing...
I'm balancing some competing design goals here: 'self-contained' vs
'people use this and need speed out of some tightly optimized
algorithms'. The way busybox dealt with this was by having multiple
implementations (CONFIG_MD5_SMALL has 4 settings), which I very much
don't want to do...
The problem with having an internal and an external implementation is the
internal one gets much less testing that way. I suspect the right answer
is to just lump it and have the actual unrolled fast version in toybox,
because the simple one isn't good enough for the userbase. That said, a
lot of these external libraries have assembly optimized versions for
various platforms, and I KNOW I'm not going there...
Hmmm. Which lib is "libcrypto", by the way?
any flavor of openssl/libressl/boringssl. they all share the same API
for this subset.

boringssl -- which Google uses -- is basically just openssl cut down
to "what you actually need in the modern world" anyway. boringssl has
an equivalent that would make the s_client approach work, so that's
fine by me for netcat/wget.
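
(for reference, the shared subset is just the classic init/update/final
digest API -- something like this, sketched from memory rather than copied
out of the patch:)

#include <stdio.h>
#include <openssl/md5.h>

/* rough sketch of the digest API that openssl, libressl, and boringssl
 * all share; not the actual toybox patch. SHA1_*, SHA256_* and friends
 * in <openssl/sha.h> follow the same pattern with their own context
 * types. */
int main(void)
{
  MD5_CTX ctx;
  unsigned char md[MD5_DIGEST_LENGTH];
  char buf[4096];
  size_t len;
  int i;

  MD5_Init(&ctx);
  while ((len = fread(buf, 1, sizeof(buf), stdin)) > 0) MD5_Update(&ctx, buf, len);
  MD5_Final(md, &ctx);

  for (i = 0; i < MD5_DIGEST_LENGTH; i++) printf("%02x", md[i]);
  printf("\n");

  return 0;
}

build with -lcrypto and feed it a file on stdin to compare against md5sum.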
Post by Rob Landley
Post by enh
i don't think you're likely to go this route, but i do like to keep
bringing it up so the idea of being API-compatible enough that it's
possible to use toybox with either your backend or *ssl is in the back
of your mind...
(no one's complained about the slow *sum commands yet, but if you're
interested i'm happy to send a patch.)
People have sent patches to speed up md5sum and sha1sum and it boils
down to lots of loop unrolling that makes the algorithm harder to
http://lists.landley.net/pipermail/toybox-landley.net/2014-May/006638.html
I applied the first few, but the code got very large and very unreadable
and I kept hoping there was a way the compiler's darn optimizer could do
that for me. I should go back and look at those patches again, but it's
competing with 60 other todo items...
oh, no, my patch just left your portable-but-slow implementations in
place and called *ssl if configured that way. so it's basically one or
two extra lines per toy (an if and a function call).
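
(roughly this shape, with invented names -- a self-contained toy sketch
rather than anything lifted from the attached patch:)

#include <stdio.h>

/* Invented, self-contained stand-in: the per-command change is one test
 * plus one call, keeping the existing portable code as the fallback. */
#define CFG_LIBCRYPTO 0   /* would come from the build configuration */

static void hash_libcrypto(const char *name) { printf("fast %s via libcrypto\n", name); }
static void hash_builtin(const char *name)   { printf("portable built-in %s\n", name); }

static void do_hash(const char *name)
{
  if (CFG_LIBCRYPTO) hash_libcrypto(name);
  else hash_builtin(name);
}

int main(void)
{
  do_hash("md5");

  return 0;
}

(keeping the configuration as a constant in a regular if rather than an
#ifdef means both paths still get compiled and type-checked, and the dead
one just gets eliminated.)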
Post by Rob Landley
According to http://valerieaurora.org/hash.html both md5sum and sha1sum
are semi-obsolete. I need to do sha256 and sha3 and so on, and adding
well, i did sha256, sha384, and sha512 too. no reason not to when
they're all there for free.

plus being able to say "if you care about performance, link against
*ssl and get hand-optimized assembler for your specific target
platform" gets all the loop unrollers off your lawn forever :-)
Post by Rob Landley
those is a higher priority todo item for me than speeding up the old
ones. But despite being obsolete as _crypto_, rsync moved from md4sum
to md5sum a couple years back, and git is based on sha1, and neither is
moving off those because you wrap the transport in https and sign the
commits if you care about security...
My deflate implementation is also half the speed it should be, although
the first pass was focusing on correctness rather than any kind of
optimization. (In theory zlib started life as a public domain
implementation, but that version seems to have fallen off the net and
doesn't have a lot of modern optimizations anyway.)
One thing I've been meaning to do with deflate/inflate is add the SMP
mode (where it outputs zero byte packets at each dictionary reset so you
can scan ahead for them and do blocks in parallel).
So many todo items...
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-07-11 21:21:10 UTC
Post by enh
Post by Rob Landley
Post by enh
another plug for supporting a libcrypto dependency: the *sum utilities
are orders of magnitude faster with libcrypto, the SSL support in
things like netcat/wget would be something i could actually use on
Android (there's no way i'll be able to ship an alternative SSL
implementation) and you'd get arbitrary precision integers too.
What I want to do is take the approach Isaac Dunham suggested, of using
"openssl s_client -quiet -connect" as an alternative to netcat. So
toybox wget should call out to that to get https support, and that would
be provided by something external.
Lipi Lee did a first pass at this already, which I didn't immediately
apply because for some reason the patch he sent me didn't apply to the
wget he sent me (I don't think I'd modified it). When I tested the wget
it was corrupting the files it downloaded (outputting numbers in the
middle of the data), and that sent me down the road of rewriting the
thing...
I'm balancing some competing design goals here: 'self-contained' vs
'people use this and need speed out of some tightly optimized
algorithms'. The way busybox dealt with this was by having multiple
implementations (CONFIG_MD5_SMALL has 4 settings), which I very much
don't want to do...
The problem with having an internal and an external implementation is the
internal one gets much less testing that way. I suspect the right answer
is to just lump it and have the actual unrolled fast version in toybox,
because the simple one isn't good enough for the userbase. That said, a
lot of these external libraries have assembly optimized versions for
various platforms, and I KNOW I'm not going there...
Hmmm. Which lib is "libcrypto", by the way?
any flavor of openssl/libressl/boringssl. they all share the same API
for this subset.
boringssl -- which Google uses -- is basically just openssl cut down
to "what you actually need in the modern world" anyway. boringssl has
an equivalent that would make the s_client approach work, so that's
fine by me for netcat/wget.
Post by Rob Landley
Post by enh
i don't think you're likely to go this route, but i do like to keep
bringing it up so the idea of being API-compatible enough that it's
possible to use toybox with either your backend or *ssl is in the back
of your mind...
(no one's complained about the slow *sum commands yet, but if you're
interested i'm happy to send a patch.)
People have sent patches to speed up md5sum and sha1sum and it boils
down to lots of loop unrolling that makes the algorithm harder to
http://lists.landley.net/pipermail/toybox-landley.net/2014-May/006638.html
I applied the first few, but the code got very large and very unreadable
and I kept hoping there was a way the compiler's darn optimizer could do
that for me. I should go back and look at those patches again, but it's
competing with 60 other todo items...
oh, no, my patch just left your portable-but-slow implementations in
place and called *ssl if configured that way. so it's basically one or
two extra lines per toy (an if and a function call).
Ok, I'll bite. What does your patch look like?

(If I can shove it in portability.c, retain the option of _not_ doing it
via menuconfig, and probably only support static linking of this extra
library... I'm still mildly concerned about the built-in version getting
less testing, but hashes aren't likely to fail in a _subtle_ manner...)

Rob
enh
2016-07-13 22:38:32 UTC
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)

attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one so
we can remove the 10 lines of duplication. (the majority of the lines added
to this patch is just the same help text duplicated over and over.)

this actually makes toybox faster than the version in ubuntu 14.04.


[PATCH] Add support for libcrypto for MD5/SHA.

Orders of magnitude faster (for architectures where OpenSSL/BoringSSL
has optimized assembler).

Also adds sha224sum, sha256sum, sha384sum, and sha512sum for folks
building with libcrypto.

The fallback portable C implementations could easily be refactored
to be API-compatible, but I don't know whether they'd stay here or
move to lib/ so I've left that part alone for now.
---
 Config.in         |   6 +++
 scripts/make.sh   |   2 +-
 toys/lsb/md5sum.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 135 insertions(+), 6 deletions(-)
Post by Rob Landley
Post by enh
Post by Rob Landley
Post by enh
another plug for supporting a libcrypto dependency: the *sum utilities
are orders of magnitude faster with libcrypto, the SSL support in
things like netcat/wget would be something i could actually use on
Android (there's no way i'll be able to ship an alternative SSL
implementation) and you'd get arbitrary precision integers too.
What I want to do is take the approach Isaac Dunham suggested, of using
"openssl s_client -quiet -connect" as an alternative to netcat. So
toybox wget should call out to that to get https support, and that would
be provided by something external.
Lipi Lee did a first pass at this already, which I didn't immediately
apply because for some reason the patch he sent me didn't apply to the
wget he sent me (I don't think I'd modified it). When I tested the wget
it was corrupting the files it downloaded (outputting numbers in the
middle of the data), and that sent me down the road of rewriting the
thing...
I'm balancing some competing design goals here: 'self-contained' vs
'people use this and need speed out of some tightly optimized
algorithms'. The way busybox dealt with this was by having multiple
implementations (CONFIG_MD5_SMALL has 4 settings), which I very much
don't want to do...
The problem with having an internal and an external implementation is the
internal one gets much less testing that way. I suspect the right answer
is to just lump it and have the actual unrolled fast version in toybox,
because the simple one isn't good enough for the userbase. That said, a
lot of these external libraries have assembly optimized versions for
various platforms, and I KNOW I'm not going there...
Hmmm. Which lib is "libcrypto", by the way?
any flavor of openssl/libressl/boringssl. they all share the same API
for this subset.
boringssl -- which Google uses -- is basically just openssl cut down
to "what you actually need in the modern world" anyway. boringssl has
an equivalent that would make the s_client approach work, so that's
fine by me for netcat/wget.
Post by Rob Landley
Post by enh
i don't think you're likely to go this route, but i do like to keep
bringing it up so the idea of being API-compatible enough that it's
possible to use toybox with either your backend or *ssl is in the back
of your mind...
(no one's complained about the slow *sum commands yet, but if you're
interested i'm happy to send a patch.)
People have sent patches to speed up md5sum and sha1sum and it boils
down to lots of loop unrolling that makes the algorithm harder to
http://lists.landley.net/pipermail/toybox-landley.net/2014-May/006638.html
Post by enh
Post by Rob Landley
I applied the first few, but the code got very large and very unreadable
and I kept hoping there was a way the compiler's darn optimizer could do
that for me. I should go back and look at those patches again, but it's
competing with 60 other todo items...
oh, no, my patch just left your portable-but-slow implementations in
place and called *ssl if configured that way. so it's basically one or
two extra lines per toy (an if and a function call).
Ok, I'll bite. What does your patch look like?
(If I can shove it in portability.c, retain the option of _not_ doing it
via menuconfig, and probably only support static linking of this extra
library... I'm still mildly concerned about the built-in version getting
less testing, but hashes aren't likely to fail in a _subtle_ manner...)
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-07-14 21:35:10 UTC
Post by enh
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)
attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one
so we can remove the 10 lines of duplication. (the majority of the lines
added to this patch is just the same help text duplicated over and over.)
this actually makes toybox faster than the version in ubuntu 14.04.
Grumble grumble.

Sigh.

Fidget.

Grrr...

[Sleeps on it; next afternoon...]

This is almost certainly the right thing to do. Lemme figure out where
the new edges of this can of worms are.

I know that no external library should ever be REQUIRED by the build.
And I still want the hermetic/selfhosting builds to work with defconfig
toybox, so may wind up adding builtin slow/simple implementations of
more stuff that libraries provide optimized versions of.

But yes, I should apply this. Lemme finish up my pending md5sum changes
first though. (I'm 2/3 of the way through adding -c support.)

Rob
enh
2016-07-14 21:43:23 UTC
Post by Rob Landley
Post by enh
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)
attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one
so we can remove the 10 lines of duplication. (the majority of the lines
added to this patch is just the same help text duplicated over and over.)
this actually makes toybox faster than the version in ubuntu 14.04.
Grumble grumble.
Sigh.
Fidget.
Grrr...
[Sleeps on it; next afternoon...]
This is almost certainly the right thing to do. Lemme figure out where
the new edges of this can of worms are.
I know that no external library should ever be REQUIRED by the build.
And I still want the hermetic/selfhosting builds to work with defconfig
toybox, so may wind up adding builtin slow/simple implementations of
more stuff that libraries provide optimized versions of.
yeah, that's actually a very good argument for the portable
implementations that hadn't occurred to me.
Post by Rob Landley
But yes, I should apply this. Lemme finish up my pending md5sum changes
first though. (I'm 2/3 of the way through adding -c support.)
no hurry. like i said, i've had a patch like this hanging around for
so long that i literally couldn't find it any more :-) at least
there's something in the list archives now!
Post by Rob Landley
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-07-15 09:54:02 UTC
Post by enh
Post by Rob Landley
Post by enh
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)
attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one
so we can remove the 10 lines of duplication. (the majority of the lines
added to this patch is just the same help text duplicated over and over.)
this actually makes toybox faster than the version in ubuntu 14.04.
Grumble grumble.
Sigh.
Fidget.
Grrr...
[Sleeps on it; next afternoon...]
This is almost certainly the right thing to do. Lemme figure out where
the new edges of this can of worms are.
I know that no external library should ever be REQUIRED by the build.
And I still want the hermetic/selfhosting builds to work with defconfig
toybox, so may wind up adding builtin slow/simple implementations of
more stuff that libraries provide optimized versions of.
yeah, that's actually a very good argument for the portable
implementations that hadn't occurred to me.
Post by Rob Landley
But yes, I should apply this. Lemme finish up my pending md5sum changes
first though. (I'm 2/3 of the way through adding -c support.)
no hurry. like i said, i've had a patch like this hanging around for
so long that i literally couldn't find it any more :-) at least
there's something in the list archives now!
One rough edge is that if you build a statically linked standalone sha1
app, it'll suck in the hash functions for all of these:

HASH_INIT("md5sum", MD5), HASH_INIT("sha1sum", SHA1),
HASH_INIT("sha224sum", SHA224), HASH_INIT("sha256sum", SHA256),
HASH_INIT("sha384sum", SHA384), HASH_INIT("sha512sum", SHA512),

I dunno if they share plumbing internally (probably not if they're
optimized), but the compiler can't tell that this table entry's string
won't be matched this build.

If I'd kept the individual functions and initialized the structure
there, then maybe it could tell...

Anyway, I wrapped the relevant USE_ macro around each one.

(I'm amused that even doing that, the dynamically linked -libcrypto is
slightly _larger_ than the one with the built-in hash function.)

(The segfault that took a while to track down is because SHA512_CTX and
friends are bigger than SHA_CTX so it was stomping the stack when I
tested those. Fixed now, and seems to be working...)
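
(The shape of the fix, roughly: size the scratch space for the biggest
context instead of assuming everything fits in a SHA_CTX. Sketch, not the
actual commit:)

#include <stdio.h>
#include <openssl/md5.h>
#include <openssl/sha.h>

/* SHA_CTX only holds SHA-1 state; SHA512_CTX is a fair bit bigger, so
 * handing the SHA-384/512 functions a SHA_CTX-sized buffer scribbles
 * past the end of it. One fix: a union automatically sized for the
 * largest member. */
typedef union {
  MD5_CTX md5;
  SHA_CTX sha1;
  SHA256_CTX sha256;  /* SHA-224 shares this */
  SHA512_CTX sha512;  /* SHA-384 shares this, and it's the biggest */
} hash_context;

int main(void)
{
  printf("SHA_CTX %zu vs SHA512_CTX %zu, union %zu\n",
         sizeof(SHA_CTX), sizeof(SHA512_CTX), sizeof(hash_context));

  return 0;
}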

Rob
enh
2016-07-15 15:14:00 UTC
Post by Rob Landley
Post by enh
Post by Rob Landley
Post by enh
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)
attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one
so we can remove the 10 lines of duplication. (the majority of the lines
added to this patch is just the same help text duplicated over and over.)
this actually makes toybox faster than the version in ubuntu 14.04.
Grumble grumble.
Sigh.
Fidget.
Grrr...
[Sleeps on it; next afternoon...]
This is almost certainly the right thing to do. Lemme figure out where
the new edges of this can of worms are.
I know that no external library should ever be REQUIRED by the build.
And I still want the hermetic/selfhosting builds to work with defconfig
toybox, so may wind up adding builtin slow/simple implementations of
more stuff that libraries provide optimized versions of.
yeah, that's actually a very good argument for the portable
implementations that hadn't occurred to me.
Post by Rob Landley
But yes, I should apply this. Lemme finish up my pending md5sum changes
first though. (I'm 2/3 of the way through adding -c support.)
no hurry. like i said, i've had a patch like this hanging around for
so long that i literally couldn't find it any more :-) at least
there's something in the list archives now!
One rough edge is that if you build a statically linked standalone sha1
HASH_INIT("md5sum", MD5), HASH_INIT("sha1sum", SHA1),
HASH_INIT("sha224sum", SHA224), HASH_INIT("sha256sum", SHA256),
HASH_INIT("sha384sum", SHA384), HASH_INIT("sha512sum", SHA512),
I dunno if they share plumbing internally (probably not if they're
optimized), but the compiler can't tell that this table entry's string
won't be matched this build.
If I'd kept the individual functions and initialized the structure
there, then maybe it could tell...
Anyway, I wrapped the relevant USE_ macro around each one.
(I'm amused that even doing that, the dynamically linked -libcrypto is
slightly _larger_ than the one with the built-in hash function.)
for the single toy, not because of the duplicated help text? that seems odd.
Post by Rob Landley
(The segfault that took a while to track down is because SHA512_CTX and
friends are bigger than SHA_CTX so it was stomping the stack when I
tested those. Fixed now, and seems to be working...)
sorry... i remember i made that mistake last time too, assuming that
SHA meant "generic" not "SHA1 but for some reason we only included the
numbers on all the other variants". even if it was a historical
accident, i wish they'd gone back and typedef/#define'd over the mess.
Post by Rob Landley
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-07-14 22:30:08 UTC
Post by enh
looks like i never sent my previous patch to the list, and looks like i
lost it on disk too. (it's probably still *somewhere*, but at some point
it's cheaper to stop looking and just do it again...)
attached is a rewrite that shows the basic idea, but misses the step of
refactoring your portable implementation to look like the libcrypto one
so we can remove the 10 lines of duplication. (the majority of the lines
added to this patch is just the same help text duplicated over and over.)
this actually makes toybox faster than the version in ubuntu 14.04.
The md5sum -c stuff I just checked in means this patch doesn't remotely
apply to current, so I'm doing my cleanup while reimplementing the bits
that need changing anyway.

One advantage of OLDTOY() is that multiple commands share help text,
which kinda comes up here. :) I had problems building oldtoys standalone
but I think I fixed them, and if there's still issues I should deal with
them rather than repeating the same largeish block of help text a half
dozen times. (Larger for me than it was for you because I added -c.)

(I pondered adding some sort of macro substitution so the three
differing places in the text could be swapped and noped right back out
of there. I still need to fix the help text collation so pgrep and such
show their common parts properly, but complicating it more seems like a
bad idea.)
Post by enh
The fallback portable C implementations could easily be refactored
to be API-compatible, but I don't know whether they'd stay here or
move to lib/ so I've left that part alone for now.
Hmmm... Probably move them to lib. md5sum would be needed by rsync,
sha1sum would be needed by git/repo. Both are post-1.0, but still...

Rob
Andy Chu
2016-07-17 19:08:50 UTC
Post by Rob Landley
Awk's better than bc.
That's interesting... I had no idea bc was a language with functions and loops!

https://github.com/jck/822kernel/blob/master/kernel/time/timeconst.bc

This is the problem with DSLs... shell, make, awk, and presumably bc
all started out as very specific languages, for different purposes.
Over time, they all grew a C-like imperative language. And nobody
wants to remember 3 or 4 different syntaxes for that:

f() { echo hi $1; }
f "bob"

function f(name) { print "hi" name }
f("bob")

define f
$(info hi $(1))
endef
$(call f,"bob")

(And repeat this mess for every other construct in a language...)

It does seem that if you rule out Python/Perl, awk is the winner with
respect to code generation, based on the fact that a lot of C code
uses it (many shells, Android, FreeBSD, etc.) And I agree with the
idea of minimizing build dependencies.

However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language. For example,
you could imagine writing "ls" or "xargs" or even a shell in Awk, sort
of like the idea to write tools in Lua.

But then I ran into some big limitations, like you can't return
associative arrays from a function, or pass or return functions
to/from functions. Awk looks very similar to JavaScript -- C syntax
with associative arrays, but is semantically much more limited.

I lost interest in awk because of these limitations. awk is used, but
seems to be waning, and it's not really evolving. (But I haven't lost
interest in the shell.)

I did however automate and slightly rewrite Kernighan's EXTENSIVE test
suite here, which AFAICT is not in the other Git reconstructions:

https://www.cs.princeton.edu/~bwk/btl.mirror/

https://github.com/onetrueawk/awk

I think you mentioned you were looking for an awk test suite. Well
there it is -- there are hundreds or thousands of test cases,
including for the regex language. I actually ran it under LLVM
sanitizers (ASAN/MSAN/etc.), just as I did for toybox, and it revealed
the expected C coding bugs, in this code being maintained by one
person for 30 years... (BTW you never responded to my last message
about that)

I will publish the combined repo at some point, and if anyone has a
burning need I can accelerate that. I should make a blog post at some
point, demonstrating the sanitizers on old C code... though
unfortunately writing about code takes just as much time as coding
itself. And I didn't actually fix any of the memory problems that I
found, as I did for toybox, since I don't have any plans for that code
in the future.

The bottom line is that LLVM sanitizers are mandatory if you care
about bugs... nobody is careful enough (even Kernighan, with his
astoundingly thorough tests, much more thorough than toybox!). toybox
wget, tar, and crypto/compression libraries especially need this,
because they process untrusted input.

The other point about Kernighan's Awk is that if I were building
something like Aboriginal Linux, I would just use that for now, and
put toybox awk at low priority. Elliott showed me that the Android NDK
actually uses a copy of Kernighan's Awk and not the system awk for its
builds.

I get why you don't like GNU stuff. But Kernighan's Awk is like 7
files of pure ANSI C, POSIX yacc, POSIX makefile, etc. that builds
anywhere. Kernighan also expanded the yacc to c, so you don't need
yacc as a build dependency... that is a little "unprincipled" but I
think fine given that awk is changing so slowly and will likely not
need any maintenance.

Andy
dmccunney
2016-07-17 19:58:47 UTC
Post by Andy Chu
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language. For example,
you could imagine writing "ls" or "xargs" or even a shell in Awk, sort
of like the idea to write tools in Lua.
Er, why?

I attended a talk decades ago by Peter Weinberger, the W in awk. He
stated that awk was originally designed for writing "one liners" to be
run from the command line in a terminal, and described his shock the
first time he saw a multi-page awk script.

There are lots of domain specific little languages, intended to
address particular problems, and attempts to expand them rapidly grow
out of scope. You reach a point where what you are trying to do is
something else's job, and you should simply use the something else.

That's the case with awk. It should not *be* a general purpose
scripting language in which you might implement a shell. It's
intended to query and modify the contents of arbitrary text files,
based on supplied parameters and conditionals in the awk script. If
you need to do more than that, awk isn't what you use.

The original design of Unix was one tool for one job, with shell
scripts the glue to tie together disparate tools to perform complex
operations, and text files as the common medium to pass things between
them. That ideal has been increasingly lost, and perl might be
considered an example. Its "there's more than one way to do it"
paradigm is likely a greater weakness than a strength. There is more
than one way, but which is best? How do you know? And the nature of
the perl language leads to the old joke in classifying languages by
how you shoot yourself in the foot:

Perl: You shoot yourself in the foot, but no one else can understand
how you did it. Six weeks later, neither can you.

Awk was created to perform a particular job and fill a particular
niche. Attempting to expand it beyond that is likely an error.
______
Dennis
https://plus.google.com/u/0/105128793974319004519
Andy Chu
2016-07-21 04:41:16 UTC
Post by dmccunney
Post by Andy Chu
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language. For example,
you could imagine writing "ls" or "xargs" or even a shell in Awk, sort
of like the idea to write tools in Lua.
Er, why?
I attended a talk decades ago by Tom Weinberger, the W in awk. He
stated that awk was originally designed for writing "one liners" to be
run from the command line in a terminal, and described his shock the
first time he saw a multi-page awk script.
There are lots of domain specific little languages, intended to
address particular problems, and attempts to expand them rapidly grow
out of scope. You reach a point where what you are trying to do is
something else's job, and you should simply use the something else.
I'm well aware of that point of view, and it's totally valid if you're
a language *user*... I'm taking the point of view of a language
implementer. Kernighan's Awk is only 6K lines of code and not very
difficult to modify.

The motivation is exactly the same as writing Unix tools in Lua --
shorter code, smaller binary size, and eliminating certain classes of
bugs a priori. Awk is not that far semantically from Lua or
JavaScript -- it's an interpreted language with hash tables. The main
difference is that it doesn't have an object system, but I viewed that
as an advantage, because I don't like either Lua's or JavaScript's OOP
support. But as mentioned, after a little hacking, I didn't think it
was feasible.
Post by dmccunney
The original design of Unix was one tool for one job, with shell
scripts the glue to tie together disparate tools to perform complex
operations, and text files as the common medium to pass things between
them. That ideal has been increasingly lost, and perl might be
considered an example. Its "there's more than one way to do it"
paradigm is likely a greater weakness than a strength. There is more
than one way, but which is best? How do you know? And the nature of
the perl language leads to the old joke in classifying languages by
Perl: You shoot yourself in the foot, but no one else can understand
how you did it. Six weeks later, neither can you.
Awk was created to perform a particular job and fill a particular
niche. Attempting to expand it beyond that is likely an error.
It's *already* been expanded, and so have make and shell. That ship
sailed decades ago. As of a couple years ago, Android used 200K+
lines of Make. It seems to include the "Gnu Make Standard Library",
which is basically a Lisp threaded through a tiny hole in Make syntax.

That misfeature in the way Unix evolved was the point of the first
part of my message, with the function call examples. The idea I have
floating around is to turn my shell into a tool that also contains
make and awk. 75% of the lines in a typical Makefile ARE shell. And
Awk has function calls, loop, boolean expressions, arithmetic,
builtins, regexes, etc. just like shell. If you squint, it's almost a
different syntax for the same language.

After hacking on Kernighan's awk and Mozilla's pymake, I figured out a
lot of the semantic differences... and abandoned the idea of doing
anything highly compatible with awk and make.

Shell actually has good semantics (it's a thin wrapper around classic
Unix syscalls), but terrible syntax. Awk has a good syntax (it was
written with an actual parser, unlike shell and make) but bad
semantics. Make has both bad syntax and bad semantics. Happy to go
into detail if anyone is interested :)

Andy
Rob Landley
2016-07-18 19:17:50 UTC
Post by Andy Chu
Post by Rob Landley
Awk's better than bc.
That's interesting... I had no idea bc was a language with functions and loops!
Neither did I until I tried to implement it.
Post by Andy Chu
https://github.com/jck/822kernel/blob/master/kernel/time/timeconst.bc
This is the problem with DSLs... shell, make, awk, and presumably bc
all started out as very specific languages, for different purposes.
Over time, they all grew a C-like imperative language. And nobody
f() { echo hi $1; }
f "bob"
function f(name) { print "hi" name }
f("bob")
define f
$(info hi $(1))
endef
$(call f,"bob")
(And repeat this mess for every other construct in a language...)
It does seem that if you rule out Python/Perl,
And ruby, php, java, javascript, tcl, lithp, go, swift, rust... A friend
of mine was doing a job programming haskell a few years ago. I follow
somebody on twitter maintaining a snobol compiler. Microsoft created C#
because Java hadn't been invented there (and visual basic got taken out
by internal politics during
http://www.joelonsoftware.com/articles/APIWar.html), back under OS/2
there was something called Rexx, Apple had AppleScript...

Awk is in posix and actually gets used. Heck, the linux kernel top level
Makefile has:

$(Q)$(AWK) '!x[$$0]++' $(vmlinux-dirs:%=$(objtree)/%/modules.order) > $(objtree)/modules.order
$(Q)$(AWK) '!x[$$0]++' $^ > $(objtree)/modules.builtin

And of course:
$ find . -name "*.awk"
./tools/perf/util/intel-pt-decoder/gen-insn-attr-x86.awk
./tools/perf/arch/x86/tests/gen-insn-x86-dat.awk
./tools/objtool/arch/x86/insn/gen-insn-attr-x86.awk
./arch/x86/tools/gen-insn-attr-x86.awk
./arch/x86/tools/distill.awk
./arch/x86/tools/chkobjdump.awk
./Documentation/arm/Samsung/clksrc-change-registers.awk
./lib/raid6/unroll.awk
./net/wireless/genregdb.awk
Post by Andy Chu
awk is the winner with
respect to code generation, based on the fact that a lot of C code
uses it (many shells, Android, FreeBSD, etc.)
And half of autoconf, apparently.
Post by Andy Chu
And I agree with the idea of minimizing build dependencies.
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language.
Another one?

why?

Presumably you wouldn't remove anything significant from the base
language, since that would break compatability with existing awk
scripts, so your reaction to awk was "how could I fork this to make it
bigger"?

This feels like a variant of https://xkcd.com/927/ somehow.
Post by Andy Chu
For example,
you could imagine writing "ls" or "xargs" or even a shell in Awk, sort
of like the idea to write tools in Lua.
awk can readlink()? (ls -l needs it.)

The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.

(Then again java 1.1 didn't have any way to truncate a file until I
reported the lack to a guy named Mark English and he added it to 1.2.
Languages get usable when they get used. Most code has to be broken in.)
Post by Andy Chu
But then I ran into some big limitations, like you can't return
associative arrays from a function, or pass or return functions
to/from functions. Awk looks very similar to JavaScript -- C syntax
with associative arrays, but is semantically much more limited.
There are an awful lot of scripting languages.
Post by Andy Chu
I lost interest in awk because of these limitations. awk is used, but
seems to be waning, and it's not really evolving. (But I haven't lost
interest in the shell.)
Linus Torvalds recently said (https://lwn.net/Articles/687916/):

Yeah, I know, I should have used 'awk' for this. Sue me. It's been
too long since I did awk state machines. There's a reason there's a
"git grep" but not a "git awk" command.

One of the reasons some of these tools fall out of use is the
documentation for them is terrible. (Gnu man pages often point to "info"
pages nobody will EVER read and which sometimes _still_ aren't online.)
I'm trying to write --help text for each command that's sufficient to
learn to use the command just from that. It's not easy and I'm not happy
with a lot of the results (terse vs complete vs easy to read, pick 2).
Working on it...

It would be nice if there were some youtube clips on "an introduction to
sed", "an introduction to awk", and so on. I might wind up doing them
someday if nobody else beats me to it.
Post by Andy Chu
I did however automate and slightly rewrite Kernighan's EXTENSIVE test
https://www.cs.princeton.edu/~bwk/btl.mirror/
https://github.com/onetrueawk/awk
The git repo doesn't include the test suite; it seems to be the same
https://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz source tarball from
the first page. Commit history is two commits, last touched 2012.

Did you post the automated version anyway?
Post by Andy Chu
I think you mentioned you were looking for an awk test suite. Well
there it is -- there are hundreds or thousands of test cases,
including for the regex language.
Which is provided by libc.

Let's see... https://www.cs.princeton.edu/~bwk/btl.mirror/awktest.a is
an ar archive, ar x awktest.a gives a directory full of files,
README.TESTS says REGRESS controls the testing process, running that does...

$ sh ./REGRESS
Linux driftwood 4.2.0-38-generic #45~14.04.1-Ubuntu SMP Thu Jun 9
09:27:51 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
echo compiled
oldawk=awk, awk=../a.out
./REGRESS: 11: ./REGRESS: Compare.t: not found
167 tests

./REGRESS: 14: ./REGRESS: Compare.p: not found
58 tests

./REGRESS: 17: ./REGRESS: Compare.T: not found
252 tests

./REGRESS: 20: ./REGRESS: Compare.tt: not found
21 tests

Right, maybe I'll dig into this later but it's not obvious to me how to
get it to work.
Post by Andy Chu
I actually ran it under LLVM
sanitizers (ASAN/MSAN/etc.), just as I did for toybox, and it revealed
the expected C coding bugs, in this code being maintained by one
person for 30 years... (BTW you never responded to my last message
about that)
My laptop rebooted during txlf and I lost my open windows. I have a todo
item to look at your test suite suggestions, but when I glanced at the
start of it, it was things like adding "only run these tests as root"
guards to some files which are part of any testing triage, so I just
started doing test suite triage until I ran out of time that day, and
haven't gotten back to it yet...
Post by Andy Chu
I will publish the combined repo at some point, and if anyone has a
burning need I can accelerate that.
I'm interested.
Post by Andy Chu
I should make a blog post at some
point, demonstrating the sanitizers on old C code... though
unfortunately writing about code takes just as much time as coding
itself.
Often more. :)
Post by Andy Chu
And I didn't actually fix any of the memory problems that I
found, as I did for toybox, since I don't have any plans for that code
in the future.
I tried to link to your March 7 2016 email about the sed -f segfault and
found out the mailing list archive is down again. Has it been a year
already? (Answer: no, it's just been 7 months since
http://landley.net/toybox/#12-21-2015 .)

I wonder if Dreamhost will delete another chunk of history restoring
from a stale backup? (My actual WEBSITE is still up. But of course they
won't let me run mailman on that.)
Post by Andy Chu
The bottom line is that LLVM sanitizers are mandatory if you care
about bugs...
You said the sed -f thing was "literally the first thing you tried" and
was found with a fuzzer.

The other thing you found outside of pending was commit c73947814aab
(which was a thinko on my part, I was trusting the -1/2 to be zero, but
was testing <= not = so it still went through the loop body then), which
I can't find your submission email for (might have been on irc?) so I
dunno how you found it.

The other stuff you patched was in pending so hadn't BEEN reviewed.
Post by Andy Chu
nobody is careful enough (even Kernighan, with his
astoundingly thorough tests, much more thorough than toybox!). toybox
wget, tar, and crypto/compression libraries especially need this,
because they process untrusted input.
Feel free to run it. I've never had much interest in false positive
generators myself.
Post by Andy Chu
The other point about Kernighan's Awk is that if I were building
something like Aboriginal Linux, I would just use that for now, and
put toybox awk at low priority. Elliott showed me that the Android NDK
actually uses a copy of Kernighan's Awk and not the system awk for its
builds.
Understood.

Busybox had awk back when I maintained it so I'd get comparisons if I
didn't have one in toybox, and I need it for dependency reduction for my
mythical "four packages" goal, but I need make for that too and that's
not even in the 1.0 goal list. :)
Post by Andy Chu
I get why you don't like GNU stuff. But Kernighan's Awk is like 7
files of pure ANSI C, POSIX yacc, POSIX makefile, etc.
Closing the circle means you have yacc as a dependency. (Although he
mitigates it by shipping the yacc output...)
Post by Andy Chu
that builds
anywhere. Kernighan also expanded the yacc to c, so you don't need
yacc as a build dependency... that is a little "unprincipled" but I
think fine given that awk is changing so slowly and will likely not
need any maintenance.
It's pretty common to ship generated files for prerequisites you don't
want to demand from the end user. Everybody and their dog has a
./configure produced by autoconf, the Linux kernel has
scripts/kconfig/zconfig*_shipped and so on.

And of course Android's toybox git has the generated/ directory checked
in. :)
Post by Andy Chu
Andy
Rob
dmccunney
2016-07-18 21:03:18 UTC
Post by Rob Landley
Post by Andy Chu
This is the problem with DSLs... shell, make, awk, and presumably bc
all started out as very specific languages, for different purposes.
Over time, they all grew a C-like imperative language. And nobody
<...>
Post by Rob Landley
Post by Andy Chu
It does seem that if you rule out Python/Perl,
And ruby, php, java, javascript, tcl, lithp, go, swift, rust...
<...>
Post by Rob Landley
back under OS/2 there was something called Rexx, Apple had AppleScript...
REXX began on mainframes, supplied with IBM's VM/CMS OS, and is now in
a lot of places. It got ported to the Commodore Amiga as its system
scripting language. IBM now calls it Open Object REXX, and it's
available as open source for Windows, Linux, and OS X. Mark Hessling
maintains a variant called Regina REXX that is also open source and
cross-platform, and there are others like BRexx. I have a flavor of
REXX on my Android tablet.
Post by Rob Landley
Post by Andy Chu
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language.
Another one?
why?
Presumably you wouldn't remove anything significant from the base
language, since that would break compatability with existing awk
scripts, so your reaction to awk was "how could I fork this to make it
bigger"?
This feels like a variant of https://xkcd.com/927/ somehow.
<grin>

The reason is presumably to reduce the number of languages a
programmer must learn and remember. But Andy's comments earlier about
remembering those different syntaxes overstate the issue. As a rule,
you learn and recall the syntaxes you actually use. You probably have
to learn and use more than one, but you know that going in. You
likely aren't going to learn/use a lot of different scripting
languages, because they address specific domains you may not be
working in.

Under Unix, I gained reasonable fluency in the Bourne shell language,
and added the extensions in ksh when it got to the point where it
could be installed as /bin/sh. Under Linux, I use bash, but it might
as well be the Korn shell for my purposes. I've simply had no need
for the additional bells and whistles bash added. I occasionally used
awk and sed for domain specific stuff not intended for doing in the
shell, and learned just enough perl to be dangerous because it got
deeply embedded in system administration chores. These days, I'm
watching a few distros move to doing everything in Python with some
dismay.
Post by Rob Landley
The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.
I rather like Lua, but I bear in mind the domain it addresses. It's
intended to be an embedded scripting language you can call from within
your application. You can't write stand alone Lua code because it
lacks the infrastructure. It assumes the application it's embedded in
will handle that, and it doesn't need to. It's gotten a fair amount of
pick up as the script engine for games, and there are a couple of IDEs
intended for writing Lua used in games.

It's also popping up as the script language in editors. One of the
things I'm poking at is TextAdept, an extensible text editor. It
uses the Scintilla edit control, and Lua for scripting. The C code
is under 2,000 lines and will remain so (and is basically the
framework to embed Scintilla and Lua.) The actual editing functions
are written in Lua, and you can extend it all over the map by writing
Lua code, in a manner akin to extending Gnu Emacs in eLisp. The Lua
code is currently about 4,000 lines.
Post by Rob Landley
There are an awful lot of scripting languages.
And new ones emerging all the time, with a Darwinian natural selection
going on as they struggle to define the specific problems they are
addressing and see whether enough other people have those problems to
gain traction.
Post by Rob Landley
One of the reasons some of these tools fall out of use is the
documentation for them is terrible. (Gnu man pages often point to "info"
pages nobody will EVER read and which sometimes _still_ aren't online.)
I'm trying to write --help text for each command that's sufficient to
learn to use the command just from that. It's not easy and I'm not happy
with a lot of the results (terse vs complete vs easy to read, pick 2).
Working on it...
This is something I was unhappy about from the days when I was
learning Unix using SysVR2. Documentation was man pages, but the man
pages were references which assumed you already had background
knowledge. They were not tutorials. (And they also implicitly
assumed there was a guru on site to supply the background knowledge
you might lack. If you *were* the guru on site and didn't grok the
man page, you had trouble right here in River City.)

Given how Unix was developed, it was likely inevitable. Many Unix
tools were the products of developers at Bell Labs scratching personal
itches, and everyone was at Bell Labs, so you could talk directly to
the guy who wrote a tool to clear up misunderstanding. Once Unix
escaped into the wild, that happy circumstance vanished.
Post by Rob Landley
It would be nice if there were some youtube clips on "an introduction to
sed", "an introduction to awk", and so on. I might wind up doing them
someday if nobody else beats me to it.
Oh, crap. Sorry, but one of my pet peeves these days is the
increasing use of video as documentation. Video has a place, but I
can *read* much faster than I can *watch*. Give me a bleeping written
tutorial, that I can have open in one window while I experiment in
another.
Post by Rob Landley
Rob
______
Dennis
https://plus.google.com/u/0/105128793974319004519
David Seikel
2016-07-19 02:03:19 UTC
Permalink
On Mon, 18 Jul 2016 17:03:18 -0400 dmccunney
<snip>
Post by dmccunney
Post by Rob Landley
The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.
I rather like Lua, but I bear in mind the domain it addresses. It's
intended to be an embedded scripting language you can call from within
your application. You can't write stand alone Lua code because it
lacks the infrastructure. It assumes the application it's embedded in
will handle that, and it doesn't need to. It's gotten a fair amount of
pick up as the script engine for games, and there are a couple of IDEs
intended for writing Lua used in games.
It's also popping up as the script language in editors. One of the
things I'm poking at is TextAdept, an extensible text editor. It
uses the Scintilla edit control, and Lua for scripting. The C code
is under 2,000 lines and will remain so (and is basically the
framework to embed Scintilla and Lua.) The actual editing functions
are written in Lua, and you can extend it all over the map by writing
Lua code, in a manner akin to extending Gnu Emacs in eLisp. The Lua
code is currently about 4,000 lines.
This is the sort of thing I might do to boxes, the editor I was trying
to get into toybox. Implementing a bunch of editors needed some sort
of internal language that wraps editing primitives, and I'm a fan of
Lua. The original wasn't going to use Lua, since Lua isn't in toybox
for the reasons Rob gives above.

I'm still not sure what direction I should take boxes, now that it's
not going into toybox, and it's currently waiting on someone else to
write some code anyway.

<snip>
Post by dmccunney
Post by Rob Landley
It would be nice if there were some youtube clips on "an
introduction to sed", "an introduction to awk", and so on. I might
wind up doing them someday if nobody else beats me to it.
Oh, crap. Sorry, but one of my pet peeves these days is the
increasing use of video as documentation. Video has a place, but I
can *read* much faster than I can *watch*. Give me a bleeping written
tutorial, that I can have open in one window while I experiment in
another.
This I fully agree with. You can't copy and paste text from a video,
which makes them useless for teaching shell stuff. It's a lot easier and
quicker to back your eyes up and reread a sentence you didn't grok the
first time than it is to rewind the video and deal with the
inaccuracy of the rewind process. Try searching a two hour video for a
bit of text, grep's gonna find it a hell of a lot quicker in a written
tutorial.

You can slow down or speed up your reading depending on your ability to
keep up, but that's just not possible in a video; the thing will drone
on at its fixed pace.

They say a picture is worth a thousand words, but it's worth way more
than that in bandwidth requirements, and videos even more so. My
bandwidth is precious to me, I don't have a lot. It's expensive here
in Oz. Consider that an entire average novel would fill an entire
average floppy disk back in the day, then consider how big the average
digital photo is; you're not gonna squeeze an entire novel into a photo.

I'm having that trouble now: the gubermit has enrolled me in an online
course, and I calculate that downloading all the videos for that course
will fully soak up five months of my bandwidth for the six month course.
I want to do other things with my bandwidth. I'm trying to negotiate
this video silliness away, but they aren't talking. The crazy thing is
that after poking around their web site, I have stumbled across what
looks like a written version of their first video, carefully hidden
away. I'm not sure yet if it actually matches the videos in the subject
matter that's to be tested; they won't tell me. I'll have to watch the
first video and compare I guess. There may be more of these text
versions.
--
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.
Andy Chu
2016-07-21 05:57:50 UTC
Permalink
Post by dmccunney
The reason is presumably to reduce the number of languages a
programmer must learn and remember. But Andy's comments earlier about
remembering those different syntaxes overstates the issue. As a rule,
you learn and recall the syntaxes you actually use. You probably have
to learn and use more than one, but you know that going in. You
likely aren't going to learn/use a lot of different scripting
languages, because they address specific domains you may not be
working in.
That assumption doesn't make sense unless you only work on your own code.

Look at something like Aboriginal Linux -- even if toybox were
complete, it would still be mostly other people's code (LLVM,
binutils, and whatnot). Pretty much EVERY package uses a bunch of
shell and make, lots of it probably autoconf generated, and awk is
pretty common as well. Those languages add up to about 150% of
functionality as far as I can tell, not 300%.

Rob apparently has no desire to learn awk, bc, m4 and all the rest
(and I don't blame him). But he's learning those languages precisely
because he's dealing with other people's code.

Andy
Andy Chu
2016-07-21 05:25:52 UTC
Permalink
Post by Rob Landley
Post by Andy Chu
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language.
Another one?
why?
Because if you're going to rule out Python/Perl/etc. on a minimal
Unix, which I mostly agree with, then you still need a decent
scripting language at that level of abstraction. Awk is by far the
closest out of any "classic Unix" language.
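A quick illustration of that claim: POSIX awk already has user-defined
functions, associative arrays, and printf, which covers most of what a
small scripting job needs. A toy word-frequency counter, reading stdin:

awk '
function bump(word) { count[word]++ }            # user-defined function
{ for (i = 1; i <= NF; i++) bump(tolower($i)) }  # associative array keyed by word
END { for (w in count) printf "%6d %s\n", count[w], w }
'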
Post by Rob Landley
Presumably you wouldn't remove anything significant from the base
language, since that would break compatibility with existing awk
scripts, so your reaction to awk was "how could I fork this to make it
bigger"?
The overall system would be smaller if you expanded awk (add
readlink(), etc.) and then wrote the core utilities in it.
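In that spirit, here's a rough wc(1) lookalike in plain awk -- stdin
only, and it counts bytes as length($0)+1, so treat it as a sketch
rather than a drop-in replacement:

awk '{ lines++; words += NF; bytes += length($0) + 1 }
     END { printf "%7d %7d %7d\n", lines, words, bytes }'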

But as mentioned in my previous message, through my hacking I
determined that both Awk and Make have bad semantics (while shell has
good semantics). So my current idea is to have "extreme
compatibility" for my shell, but add some awk and make features to it,
so you don't have to remember 3 different syntaxes for loops,
conditionals, and function calls.
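To make the "3 different syntaxes" point concrete, here's the same
trivial loop in shell, awk, and make, wrapped up as one runnable shell
snippet (the make version uses $(foreach) with $(info), and GNU make's
-f - reads the makefile from stdin, so no tab-indented recipe is needed):

words="alpha beta gamma"

# shell: for/do/done
for w in $words; do echo "$w"; done

# awk: C-style for loop over split()
echo "$words" | awk '{ n = split($0, a, " "); for (i = 1; i <= n; i++) print a[i] }'

# make: $(foreach ...) expanded at parse time
make -f - <<'EOF'
$(foreach w,alpha beta gamma,$(info $(w)))
all: ;
EOF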

In other words, the awk and make parts aren't compatible with actual
awk and make -- they just share the same architecture (row-wise
streaming of data and data-oriented parallel builds.) But the shell
part is compatible.

To remind everyone, the basic beef is that Unix has a good
architecture, but horrible syntax. And there are too many languages.
Nobody younger than me learns awk or make anymore. Or even shell.
It just feels old. The average engineer at Google doesn't know any of
those things if they started their career in the last 10 years.

The xkcd somewhat applies, but is mitigated by 2 things:

1) You can replace an entrenched technology/language if you make your
new thing a superset of the old thing, i.e. retaining a high degree of
compatibility.

After my research on ksh, it's clear that this is how bash gained
popularity. ksh was the most popular implementation at the time of
the POSIX standard, and was probably the biggest influence on the
standard. bash was playing catch up -- it aggressively implemented
POSIX *and* the non-POSIX parts of ksh. So eventually people ported
their ksh scripts to bash.

It's basically embrace-and-extend in the open source world... there's
a reason that Microsoft used that strategy -- it works. You implement
something bug-for-bug and then you extend it with useful features.

2) awk has a much smaller user base than shell. You do see big awk
scripts, but you see MANY more big shell scripts. And there are more
shell scripts altogether.

awk and make are also at least 5x smaller and 5x easier to implement
than the shell (if you look at bash/zsh vs GNU awk/make, as well as
other implementations)

If you can manage to fold some awk functionality into shell, then you
could possibly decrease the total number of languages (at least in a
given system). As I said, nobody needs 3 different syntaxes for
loops, function calls, and expressions. (And you cannot avoid them,
at least if you are looking at real systems...)
Post by Rob Landley
The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.
Sure but you can find bindings or write them yourself. That's the
whole point of Lua!
Post by Rob Landley
Post by Andy Chu
I think you mentioned you were looking for an awk test suite. Well
there it is -- there are hundreds or thousands of test cases,
including for the regex language.
Which is provided by libc.
Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
there is an argument to keep it. It uses the Thompson linear-time
NFA/DFA algorithm rather than exponential backtracking. See the note
here:

https://github.com/andychu/bwk

Coincidentally StackOverflow was down today for a related reason...
matching regexes on user input can blow up CPU on your servers:
https://news.ycombinator.com/item?id=12131909 (I linked to my bwk repo
there). And I have seen this bug before elsewhere.

It matters what algorithm you use, and awk/sed/grep are all used with
big input data and (I think?) big regexes. I use them on gigabytes of
text. It probably doesn't matter for bash [[, because there you are
just matching a short string against a regex.
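A hedged way to see the difference for yourself (timings depend
entirely on which engine your grep/awk were built against): feed a
pattern with nested quantifiers some input that almost matches.
Linear-time NFA/DFA engines shrug it off; backtracking engines go
exponential as you add more a's.

s=$(printf 'a%.0s' $(seq 1 30))x    # thirty a's plus a tail that breaks the match
time echo "$s" | grep -E '(a+)+$'   # DFA/NFA-based grep: returns instantly, no match
time echo "$s" | awk '/(a+)+$/'     # same pattern through awk's regex engine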

I think GNU awk/sed/grep all have their own regex implementation and
don't use libc, but I could be wrong (?). I thought busybox had some
of its own regex support too.

Andy
Andy Chu
2016-07-21 08:28:12 UTC
Permalink
Post by Andy Chu
Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
there is an argument to keep it. It uses the Thompson linear-time
NFA/DFA algorithm rather than exponential backtracking. See the note
Never mind about this tangent... I *think* GNU libc actually uses the
linear time algorithm, with a possible exception for backreferences.
I was in the middle of some research on that but didn't finish (musl
libc uses a fork of the TRE regex engine, etc.).

But oddly, GNU grep, awk, sed, and coreutils all have a copy of the
GNU libc regex engine? That is just annoying.

$ wc -l gawk-*/reg* */lib/reg*.[ch] | sort -n
81 coreutils-8.22/lib/regex.c
81 grep-2.24/lib/regex.c
81 sed-4.2.2/lib/regex.c
85 gawk-4.1.3/regex.c
591 gawk-4.1.3/regex.h
664 grep-2.24/lib/regex.h
667 coreutils-8.22/lib/regex.h
668 sed-4.2.2/lib/regex.h
834 gawk-4.1.3/regex_internal.h
868 sed-4.2.2/lib/regex_internal.h
910 coreutils-8.22/lib/regex_internal.h
912 grep-2.24/lib/regex_internal.h
1742 grep-2.24/lib/regex_internal.c
1744 sed-4.2.2/lib/regex_internal.c
1746 coreutils-8.22/lib/regex_internal.c
1759 gawk-4.1.3/regex_internal.c
3927 coreutils-8.22/lib/regcomp.c
3941 sed-4.2.2/lib/regcomp.c
3958 gawk-4.1.3/regcomp.c
3962 grep-2.24/lib/regcomp.c
4391 gawk-4.1.3/regexec.c
4412 grep-2.24/lib/regexec.c
4418 coreutils-8.22/lib/regexec.c
4421 sed-4.2.2/lib/regexec.c

Andy
Rob Landley
2016-07-21 10:57:34 UTC
Permalink
Post by Andy Chu
Post by Rob Landley
Post by Andy Chu
However, I did a bunch of research and hacking on Kernighan's Awk. I
was trying to morph it into a "proper" modern language.
Another one?
why?
Because if you're going to rule out Python/Perl/etc. on a minimal
Unix, which I mostly agree with, then you still need a decent
scripting language that level of abstraction. Awk is by far the
closest out of any "classic Unix" language.
I'm fairly certain awk wasn't intended to be Turing complete, but I'd
have to dig into the Computerphile youtube interviews with Kernighan to
try to find his actual quote.
Post by Andy Chu
Post by Rob Landley
Presumably you wouldn't remove anything significant from the base
language, since that would break compatibility with existing awk
scripts, so your reaction to awk was "how could I fork this to make it
bigger"?
The overall system would be smaller if you expanded awk (add
readlink(), etc.) and then wrote the core utilities in it.
You are aware that perl started life as an attempt to combine awk, sed,
and shell into a single tool, right?

Is this really a model you want to emulate? (I say this as someone who
happily used C++ back when it was "C with classes" before templates went
into the language. Doesn't mean I think another "C with classes" fork
would be worth doing, which is why I never tried to learn objective C.)
Post by Andy Chu
But as mentioned in my previous message, through my hacking I
determined that both Awk and Make have bad semantics (while shell has
good semantics). So my current idea is to have "extreme
compatibility" for my shell, but add some awk and make features to it,
so you don't have to remember 3 different syntaxes for loops,
conditionals, and function calls.
So instead of awk+sed+sh you're doing awk+make+sh.
Post by Andy Chu
In other words, the awk and make parts aren't compatible with actual
awk and make -- they just share the same architecture (row-wise
streaming of data and data-oriented parallel builds.) But the shell
part is compatible.
You're creating yet another programming language (because clearly we
haven't got enough of those yet) and if a system needs a script written
in your tool they'll install yet another language alongside ruby and lua
and python and so on.

People write programs in a language. If it's not a compiled language,
then the runtime of that language becomes a runtime dependency of that
thing. If it is compiled, it's still a build-time dependency, and
knowledge of it becomes a maintenance dependency. It can only be
modified/extended/ported/fixed by people who understand that language.

Everybody who creates a new language dreams that their language will
cause a net simplification of the world by displacing OTHER languages
and causing them to die out, but just about the only time this has EVER
happened was C and it took several decades to do it, and by that time C
itself was under embrace-and-extend attack by C++ (which still hasn't
got any sort of boundaries to stop its endless feature creep), not to
mention also-rans like objective C, or the modern crop of go/rust/swift
that is each sure it will become the new C and somehow get all the
kernels and other language runtimes rewritten in it.

I ran "apropos interpreter" on my ubuntu 14.04 netbook (reasonably
stock, I try not to install extra build dependencies on it) and it found
erb (ruby), perl, dash, and tclsh. This DIDN'T pull up bash or python,
which I know are on here, so clearly isn't a complete list.

Your new thing, if _wildly_ successful, would displace none of those.
Python 3 can't even displace python 2.
Post by Andy Chu
To remind everyone, the basic beef is that Unix has a good
architecture, but horrible syntax. And there are too many languages.
Nobody younger than me learns awk or make anymore.
I didn't learn make until I had to (after graduating from college).
Post by Andy Chu
Or even shell. It just feels old.
I sat though a couple decades being vastly outnumbered by MCSEs doing
visual basic. They also pointed and laughed at shell scripts. Guess
which outlived which?
Post by Andy Chu
The average engineer at Google doesn't know any of
those things if they started their career in the last 10 years.
If their job is making chrome render faster on windows, I'm not
surprised? You guys do everything from web infrastructure through
self-driving cars, on systems that got installed for you. You're
writing apps in Python and C
and such, doing cluster load balancing and AI research into semantic web
analysis. Banging on the OS is not their domain expertise.

(Do you know how hard it is to google for android _system_ information?
You try to find information about Android programming and it's apps in java
all the way down. Clearly this means there IS no part of the system
written in C, it's just java, as JavaOS demonstrated the feasibility of
20 years ago...)

Also, from 2005 to 2012 Guido Van Rossum worked at Google, during which
Google made a big deal about writing stuff in Python. Then Guido left
for Dropbox and Google stopped talking about python so much (at least
externally), but Ken Thompson and Rob Pike took over providing an
in-house language (Go) for Google to have Invented Here. (Apple's is
Swift, apparently because Objective C was so totally overshadowed by C++
they wanted to try again.)

As I said, a few years ago a friend worked for a company called Basho
that did everything in Haskell. A decade back I worked a contract at a
company that did everything in Perl 4. (Yes, significantly post-y2k!)
The culture of a specific company, even Google, isn't necessarily
representative of the world at large. "We use this set of technologies
here" != "this is the universally important set of technologies".

Years ago I noticed a corollary to Moore's Law, which is that 50% of
what you know about software becomes obsolete every 18 months. The nice
thing about unix is it's mostly been the same 50% cycling out over and
over for many decades now.

I've learned lots of things that only lasted 5 years, and decided to
just wait for others to go away. I waited out AOL. I'm currently waiting
out Facebook. It took a LONG TIME to wait out Windows but at this point
I no longer have to care about it. I can wait out systemd.

I'm not sure what parts of Android's infrastructure will cycle out in my
lifetime and which new generations will embrace, but "it builds under
itself" is necessary for the long-term health of any platform. Right now
it builds under a semi-posix environment that's enumerable and can be
bounded, and I'm trying to transplant that.
Post by Andy Chu
1) You can replace an entrenched technology/language if you make your
new thing a superset of the old thing, i.e. retaining a high degree of
compatibility.
The way C++ included the whole of C and was therefore clearly superior
to C, you mean? Or the way perl combined awk and sed and shell?

Who was it who said any problem can be solved by adding another layer of
indirection, except the problem of too many layers of indirection?

I'm not sure "let's use a bigger tool to solve the same problems" is a
rallying cry I'm really comfortable getting behind. A couple core ideas
of unix were small tools connected by pipes (communicating via mostly
textual interfaces because humans can read that), and "do one thing and
do it well".

That's part of the reason it's survived so well, it's made from
decoupled parts you can individually swap out. (more->less and so on.)
The point of a shell script is to easily call lots of external commands
which are not, strictly speaking, part of the shell. Yes busybox and
toybox blur those lines, but not execing itself again out of the $PATH
is _mostly_ a performance hack, I.E. entirely optional and can be disabled.

There are a set of shell builtin commands that can't be implemented
externally (cd/exit/export/read are process-local because they modify
process attributes like environment variables and cwd), but when the
kernel guys gave me the tools to do "ulimit" as a standalone command, I did.
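The cd case in one screenful: a child process can only change its own
working directory, never its parent's, which is why cd has to live in
the shell itself.

cd /tmp
sh -c 'cd / && pwd'   # prints /, but only inside the child shell
pwd                   # still /tmp: the parent's cwd never changed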
Post by Andy Chu
After my research on ksh, it's clear that this is how bash gained
popularity. ksh was the most popular implementation at the time of
the POSIX standard, and was probably the biggest influence on the
standard. bash was playing catch up -- it aggressively implemented
POSIX *and* the non-POSIX parts of ksh. So eventually people ported
their ksh scripts to bash.
Yes and no.

Bash was the first program Linux ever ran. Linus created the Linux
kernel by extending his boot-from-floppy terminal program to handle bash
system calls so he didn't have to keep rebooting into minix to
list/rename/move/delete files and directories. (He had a tiny hard drive
and was constantly clearing off space to download more files from usenet
via the university microvax).

The initial release of bash was June 8, 1989, meaning when Linus posted
the first Linux announcement on August 25, 1991, bash was 2 years old.

Bash did not become the default shell of solaris (tcsh), freebsd (also
tcsh), or aix (korn). As far as I can tell, its popularity was largely
driven by the fact that it was the default shell of every single Linux
installation for fifteen years, until Ubuntu decided in 2006 that its
init scripts ran too slow. No really, that's the reason they gave for
the switch:

https://wiki.ubuntu.com/DashAsBinSh

Even then diversifying away from bash was gradual, even Debian (which
Ubuntu was basically carrying at that point, hiring full-time people to
work on what was otherwise a badly struggling distro) took 3 years to
bow to Ubuntu's will here:

https://lwn.net/Articles/343924/
Post by Andy Chu
It's basically embrace-and-extend in the open source world... there's
a reason that Microsoft used that strategy -- it works. You implement
something bug-for-bug and then you extend it with useful features.
No, their strategy was bundling. There were two entire antitrust trials
about this, and a quote about a ham sandwich. From windows coming free
with every copy of DOS (and per-motherboard licensing so you couldn't
buy the hardware without getting their OS) to Office (can't use their
spreadsheet without their word processor and powerpoint) to making their
browser un-removable from their OS (which still didn't let them change
HTML much, no matter how they tried).

They've embraced and extended all sorts of stuff people utterly ignored
or which got traction and then lost it again, from the Zune embracing
and extending MP3, Microsoft's "J" language (then C# when they got sued)
embracing and extending Java... Outlook tried to embrace and extend
email but their SUCCESS in that area was bundling calendaring with email
so you needed one to use the other (both tied to an exchange server
using protocols they tried to make very hard to reverse engineer).

Bundling can't really be forced in the open source world, the closest
you get is de-facto standards, such as bash and gcc were for linux. As
Linux succeeded, bash succeeded, and got installed on other systems
because people wanted their familiar Linux environment there too. Then
ubuntu bundled dash instead (because stupid) and pushed the other way,
but it still took a while even with Ubuntu having the same 50%
workstation market share that Red Hat gave up a few years earlier...
Post by Andy Chu
2) awk has a much smaller user base than shell. You do see big awk
scripts, but you see MANY more big shell scripts. And there are more
shell scripts altogether.
awk and make are also at least 5x smaller and 5x easier to implement
than the shell (if you look at bash/zsh vs GNU awk/make, as well as
other implementations)
Which is why I plan to implement awk and make as their own commands?
Post by Andy Chu
If you can manage to fold some awk functionality into shell, then you
could possibly decrease the total number of languages (at least in a
given system).
No, seriously, this is how Larry Wall created perl:

http://www.shlomifish.org/lecture/Perl/Newbies/lecture1/intro/history.html

That way lies the Emperor of all Cosmos having a drunken bender and
making you push a ball around Japanese living rooms to surprisingly
catching theme music in order to create replacement stars.
Post by Andy Chu
As I said, nobody needs 3 different syntaxes for
loops, function calls, and expressions. (And you cannot avoid them,
at least if you are looking at real systems...)
Those who do not know the history of loop syntaxes are doomed to repeat.
Post by Andy Chu
Post by Rob Landley
The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.
Sure but you can find bindings or write them yourself. That's the
whole point of Lua!
If I have to write code in C and cross-compile it to every supported
target in order to bootstrap a system, what am I bothering with Lua for?
It's just another unnecessary prerequisite package, I might as well just
write the whole thing in C. (So I did.)

(And no, "You can install 7 externally maintained prerequisite packages
instead" is not an improvement.)
Post by Andy Chu
Post by Rob Landley
Post by Andy Chu
I think you mentioned you were looking for an awk test suite. Well
there it is -- there are hundreds or thousands of test cases,
including for the regex language.
Which is provided by libc.
Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
there is an argument to keep it. It uses the Thompson linear-time
NFA/DFA algorithm rather than exponential backtracking.
If musl or bionic should have a better regex implementation, fine. But I see
no need to reinvent this particular wheel. (And I've reinvented a lot of
wheels. In fact I wrote my own regex engine for OS/2 feature install in
1996, although that used glob syntax rather than regex syntax because I
did way more DOS than Unix back then. But I have NOT written one for
toybox. Libc exists and posix says it should have this.)
Post by Andy Chu
https://github.com/andychu/bwk
Coincidentally StackOverflow was down today for a related reason...
https://news.ycombinator.com/item?id=12131909 (I linked to my bwk repo
there). And I have seen this bug before elsewhere.
I collated identical * runs even in my old glob implementation because
otherwise it was obvious N^X complexity dealing with them. I assume Rich
has decent stuff in musl and bionic can rip anything they haven't
already got from there. If not, I'm aware of a couple smart guys
maintaining those things who can handle this so it's not _my_ problem. :)
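For reference, the star-collapsing idea sketched with sed (the glob
code itself isn't quoted here): consecutive stars match exactly what a
single star matches, so squeeze the runs down before matching.

printf '%s\n' 'a*****b' | sed 's/\*\{2,\}/*/g'    # prints a*b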
Post by Andy Chu
It matters what algorithm you use, and awk/sed/grep are all used with
big input data and (I think?) big regexes. I use them on gigabytes of
text. It probably doesn't matter for bash [[, because there you are
just matching a short string against a regex.
Don't assume what your inputs look like. Modern Linux removed the 128k
environment space limitation almost a decade ago (commit b6a2fea3931
which went into 2.6.22 released July 2007) and it was never there for
local shell variables anyway.

So there's no reason shell can't read X, do [[ "$X" =~ blah ]], and churn
through as big an input as anything else.
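Something like this in bash (big.log is a made-up filename, and
BASH_REMATCH is bash-specific):

X=$(cat big.log)                       # slurp the whole file into a variable
if [[ "$X" =~ ERROR\ [0-9]+ ]]; then   # ERE match, no awk involved
    echo "found: ${BASH_REMATCH[0]}"
fi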
Post by Andy Chu
I think GNU awk/sed/grep all have their own regex implementation and
don't use libc, but I could be wrong (?).
Gnu bash 2.x has its own malloc implementation, which is why I had to
say --without-bash-malloc in configure. (Dunno what more current bash is
doing, I haven't built any of the gplv3 versions from source.)

The epic "not invented here" of the gnu/dammit brigades is kind of
impressive. Me, I've consistently said if your libc is broken fix your libc.

(I'm also aware of the performance hacks their grep does with block
reads instead of line reads, and then backing up to find line context
after the match. In theory I could do that with libc's regex stuff too.
In practice, I haven't gone there. My big todo item is making it work
with embedded NUL bytes, which is delaying my current round of grep
replumbing...)

(So you can "grep string /bin/blah" out of executables, of course. Has
to find the string after the NUL byte. No, I can't use libc's regex
engine to cross null bytes, but I can't cross newlines either.)
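For anyone playing along at home, GNU grep's existing behavior on that
case (blob is just a throwaway test file, and -a forces text mode):

printf 'hello\0target\n' > blob
grep target blob       # GNU grep reports a binary match instead of the line
grep -a target blob    # -a treats the file as text and prints the matching line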
Post by Andy Chu
I thought busybox had some of its own regex support too.
Not that I recall, but it's been 10 years. The sed I wrote for busybox
way back when used libc's regex though. (There was a wrapper but I
believe it was just the standard "turn regcomp() failure into exit()"
sort of thing.)
Post by Andy Chu
Andy
Rob

Rob Landley
2016-07-10 16:52:51 UTC
Permalink
Post by enh
https://github.com/android-ndk/ndk/issues/133#issuecomment-231318129
is the first time in about a decade i've seen awk more complicated
than "print $2" in the wild...
readelf -sW toolkit/library/libxul.so |grep FUNC |awk '$2 ~ /[048c]$/ {print}'
ruins my genius plan to "implement" an awk that only supports "print
$(\d+)"... :-)
A lot of autoconf stuff uses fairly horrible and elaborate awk. Here's a
bug I extracted from the glibc build many moons ago:

http://lists.busybox.net/pipermail/busybox/2004-February/044743.html

And this was freeswan:

http://lists.busybox.net/pipermail/busybox/2004-July/011979.html

And here was debian's port of the Defective Annoying SHell:

http://lists.busybox.net/pipermail/busybox/2007-January/025743.html

And so on...

It does get used. I have the posix awk spec and a copy of "The AWK
Programming Language" and intend to do a thing with this when I get
time, but it's a bit down the todo list.

I met Steve French at txlf again (I had dinner with him a couple years
ago but couldn't remember his _name_) and he walked me through the Linux
kernel source files that implement the smb3 server in its simplest mode
and I'm pretty sure I can implement a posix-only latest protocol only
tiny smb server in toybox (it needs to implement something like 19
commands, and can reply "no" to all the weird flags asking it to do
non-posix things like case insensitivity).

He thinks it'll take 3000 lines, I'm hoping to keep it under 1000 lines.
So that got added to the todo list this weekend, presumably before awk. :)

Rob