Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
Ironically, I am getting slightly better reach on average with (scaled)
9-bit (and 10-bit) displacements than RISC-V gets with 12 bits...
Say:
DWORD:
12s, Unscaled: +/- 2K
9u, 4B Scale : + 2K
10s, 4B Scale: +/- 2K (XG2)
QWORD:
12s, Unscaled: +/- 2K
9u, 8B Scale : + 4K
10s, 8B Scale: +/- 4K (XG2)
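For reference, the reach figures above follow directly from the field
width, signedness, and scale; say, as a quick illustrative helper (not
actual ISA code):

/* One-sided reach of a displacement field.                          */
/* Signed fields cover +/- reach(); unsigned fields cover 0..reach(). */
static long reach(int bits, int is_signed, int scale)
{
    long steps = 1L << (is_signed ? (bits - 1) : bits);
    return steps * scale;
}
/* reach(12, 1, 1) == 2048  -> 12s unscaled:  +/- 2K */
/* reach( 9, 0, 4) == 2048  ->  9u, 4B scale: +  2K  */
/* reach(10, 1, 8) == 4096  -> 10s, 8B scale: +/- 4K */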
It was a pretty tight call between 10s and 10u, but 10s won out by a
slight margin mostly because the majority of structs and stack-frames
tend to be smaller than 4K (but, does create an incentive to use larger
storage formats for on-stack storage).
Though, for integer immediate instructions, RISC-V would have a slight
advantage. Where, say, roughly 9% of 3R integer immediate values miss
with the existing Imm9u/Imm9n scheme; but the sliver of "misses with 9
bits, but would hit with 12 bits" is relatively small (most of the
"miss" cases are much larger constants).
However, a fair chunk of these "miss" cases, could be handled with a
bit-set/bit-clear instruction, say:
y=x|0x02000000;
z=x&0xFBFFFFFF;
Turning into, say:
BIS R4, 25, R6
BIC R4, 26, R7
Unclear if this case is quite common enough to justify adding these
instructions though (granted, a case could be made for them).
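Say, a rough sketch of how a compiler might detect these cases (the
helper name is made up, and __builtin_ctz is the GCC/Clang intrinsic;
any count-trailing-zeros would do):

#include <stdint.h>

/* Returns the bit index if 'mask' has exactly one bit set, else -1. */
static int single_bit_index(uint32_t mask)
{
    if (mask != 0 && (mask & (mask - 1)) == 0)
        return __builtin_ctz(mask);
    return -1;
}
/* y = x | 0x02000000;  single_bit_index( 0x02000000) == 25 -> BIS x, 25 */
/* z = x & 0xFBFFFFFF;  single_bit_index(~0xFBFFFFFF) == 26 -> BIC x, 26 */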
However, a few cases do typically need larger displacements:
PC relative, such as branches.
GBR relative, namely constant loads.
For PC relative, 20-bits is "mostly enough", but one program has hit the
20-bit limit (+/- 1MB). Recently, via a tweak, in current forms of the
ISA, the effective branch-displacement limit (for a 32-bit instruction
form) has been increased to 23 bits (+/- 8MB).
Baseline+XGPR: Unconditional BRA and BSR only.
Conditional branches still limited to 20 bits.
XG2: Also includes conditional branches.
In these cases, it was mostly because the bits that were being used to
extend the GPRs to 6 bits were N/A for their original purpose with
branch-ops, and so could be repurposed for the displacement. The main
other alternatives would have been 22 bits + an alternate link register,
or a 3-bit LR field; however, the cost of supporting these would have
been higher than that of simply reassigning the bits to make the
displacement bigger.
Potentially a similar role could have been served by a conjoined "MOV
LR, R1 | BSR Disp" instruction (and/or allowing "MOV LR, R1" in Lane 2
as a special case for this, even if it would not otherwise be allowed
within the ISA rules). Though, this would defeat the point if the
encoding foils the branch predictor.
Recently, had ended up adding some Disp11s Compare-with-Zero branches,
mostly as these branches turn out to be useful (in the face of 2-cycle
CMPxx), and 8 bits "wasn't quite enough". Say, Disp11s can cover a much
bigger if/else block or loop body (+/- 2K) than Disp8s (+/- 256B).
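Say, roughly the sort of source pattern that maps onto these (just a
generic C sketch; the struct and function are made up):

struct node { long val; struct node *next; };

/* Loops whose control test is a compare against zero (p != NULL, n > 0, */
/* n != 0, etc.) can branch directly on the register value, skipping the */
/* separate 2-cycle CMPxx; the Disp11s form then covers larger bodies.   */
long sum_list(const struct node *p)
{
    long sum = 0;
    while (p != 0) {      /* pointer-nonzero test: compare-with-zero */
        sum += p->val;
        p = p->next;
    }
    return sum;
}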
For GBR Relative:
The default 9-bit displacement was Byte scaled (for "reasons");
But, a 512B range isn't terribly useful;
Later forms ended up with Disp10u Scaled:
This gives 4K or 8K of range (in Baseline)
This increases to 8K and 16K in XG2.
If the compiler sorts primitive global variables by descending-usage
(and emits the top N specially, at the start of ".data"), then the
Scaled GBR cases can access a majority of the global variables (around
75-80% with a scaled 10-bit displacement).
Effectively, the remaining 20-25% or so need to be handled as one of:
Jumbo Disp33s (if Jumbo prefixes are available, most profiles);
2-op Disp25s (no jumbo, '.data'+'.bss' less than 16MB);
3-op Disp33s (else).
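Say, a minimal sketch of the sort-by-usage layout idea (the symbol
record and fields here are made up for illustration):

#include <stdlib.h>

/* Hypothetical per-symbol record; 'uses' would be a reference count */
/* gathered during code generation.                                  */
struct gsym {
    const char *name;
    int         size;   /* in bytes */
    long        uses;   /* static reference count */
};

static int by_uses_desc(const void *a, const void *b)
{
    const struct gsym *x = a, *y = b;
    return (y->uses > x->uses) - (y->uses < x->uses);
}

/* Sort descending by use count and emit in that order, so the most-used */
/* symbols land at the start of ".data", within reach of the scaled GBR  */
/* forms.                                                                 */
static void layout_globals(struct gsym *syms, size_t n)
{
    qsort(syms, n, sizeof *syms, by_uses_desc);
    /* ... emit syms[0..n-1] to ".data" in this order ... */
}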
Though, as with the stack frames, these instructions do create an
incentive to effectively promote any small global variables to a larger
storage type (such as 'char' or 'short' to 'int'); just with implicit
sign (or zero) extensions to preserve the expected behavior of the
smaller type (though, strictly speaking, only zero-extensions would be
required by the C standard, given signed overflow is technically UB; but
there would be something "deeply wrong" with a 'char' variable being
able to hold, say, -4495213, or similar).
Though, does mean for normal variables, "just use int or similar" is
typically faster (say, because there are dedicated 32-bit sign and zero
extending forms of some of the common ALU ops, but not for 8 or 16 bit
cases).
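In source-level terms, the promotion amounts to roughly the following
(a sketch; whether the compiler does it internally or the programmer
just uses 'int' to begin with, the effect is much the same):

/* Before: a 'short' global, needing 16-bit load/store forms.           */
/* short g_count; */

/* After: 32-bit storage, with explicit narrowing on store so the value */
/* still behaves like a 'short' when read back (sign-extended here).    */
int g_count;

void bump_count(int delta)
{
    g_count = (short)(g_count + delta);   /* truncate, then sign-extend */
}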
A Disp16u case could maybe reach 256K or 512K, which could cover much of
a combined data+bss section. While in theory this could be better, to
make effective use of this would require effectively folding much of
".bss" into ".data", which is not such a good thing for the program
loader (as opposed to merely folding the top N most-used variables into
".data").
Then again, uninitialized global arrays could probably still be left in
".bss", which tend to be the main "bulking factor" for this section (as
opposed to normal variables).
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
Yeah.
If you want a Load/Store to have two 5-bit registers and a 16-bit
displacement, only 6 bits are left in a 32-bit instruction word. This
is not a whole lot...
For a full set of Load/Store ops, this is 4 bits;
For a set of basic ALU ops, this is another 3 bits.
So, just for Load/Store and basic ALU ops, half the encoding space is
gone...
Would it be worth it?...
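Say, as a rough bit budget (a purely hypothetical field layout, not any
actual encoding):

#include <stdint.h>

/* A [Rd, (Rb, Disp16)] load/store packed into 32 bits:                */
/*   6-bit major opcode | 5-bit Rb | 5-bit Rd | 16-bit displacement    */
/*   6 + 5 + 5 + 16 = 32, so everything else shares the 6 opcode bits. */
static uint32_t enc_ld_disp16(unsigned op6, unsigned rb, unsigned rd,
                              unsigned disp16)
{
    return ((op6 & 0x3Fu) << 26) | ((rb & 0x1Fu) << 21) |
           ((rd & 0x1Fu) << 16) | (disp16 & 0xFFFFu);
}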
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
I had to change the instructions longer than 32 bits to get them in the
basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Such is a long standing issue...
I am also annoyed sometimes at how complicated my design has gotten.
Still, it is within reason, and not too far outside the scope of many
existing RISCs.
But, as noted, the reason XG2 exists as-is was sort of a compromise:
I couldn't come up with any encoding which could actually give
everything I wanted, and the "most practical" option was effectively to
dust off an idea I had originally rejected:
Having an alternate encoding which dropped 16-bit ops in favor of
reusing these bits for more GPRs.
At first glance, RISC-V seems cleaner and simpler, but this falls on its
face once one goes outside the scope of RV64IM or similar.
And, it isn't tempting when, at least from my POV, RV64 seems "less
good" than what I have already (others may disagree; but at least to me,
some parts of RISC-V's design seem like kind of a trash fire).
The main tempting thing RV64 has is that, maybe, if one goes and
implements RV64GC and clones a bunch of SiFive's hardware interfaces,
then potentially one can run a mainline Linux on it.
There have apparently been some people who have gotten NOMMU Linux
working on RV32IM targets, which is possible (and, ironically, seemingly
based on the SuperH branch of the Linux kernel, from what I had
seen...).
Seemingly, AMD/Xilinx is jumping over from MicroBlaze to an RV32
variant. But, granted, RV32 isn't too far from what MicroBlaze is
typically used for, so not really a huge stretch.
I sometimes wonder if maybe I would be better off jumping to RV, but
then I end up seeing examples where cores running at somewhat higher
clock speeds still manage to deliver relatively poor framerates in Doom.
Like, as-is, my MIPs scores are kinda weak, but I am still getting
around 30 fps in Doom at around 20-24 MIPs.
RV64IM seemingly needs significantly higher MIPs to get similar
framerates in Doom.
Say, for Doom:
BJX2 needs ~ 800k instructions / frame;
RV64IM seemingly needs nearly 2 million instructions / frame.
Not entirely sure what all is going on, but I have my suspicions.
Though, it does seem to be the inverse situation with Dhrystone.
Say:
BJX2: around 1.3 DMIPS per BJX2 instruction;
RV64: around 3.8 DMIPS per RV64 instruction.
Though, I can note that there seems to be "something weird" with
Dhrystone and GCC (in multiple scenarios, GCC gives Dhrystone scores
that are significantly above what could be "reasonably expected" or what
would agree with the scores given by other compilers, seemingly as if it
is optimizing away a big chunk of the benchmark...).
But, these results don't typically extend to other programs (where
scores are typically much closer together).
Actually, I have noted that if comparing BGBCC with MSVC and BJX2 with
my Ryzen, performance relations seem to scale pretty close to linearly
relative to clock-speed, albeit with some outliers.
There are cases where deviation has been noted:
Speed differences for TKRA-GL's software rasterizer backend are smaller
than the difference in clock-speed (74x clock-speed delta; 20x fill-rate
delta);
And cases where it is bigger: The performance delta for things like LZ4
decompression or some of my image codecs is somewhat larger than the
clock-speed delta (say: 74x clock-speed delta, 115x performance delta, *1).
*1: Though, LZ4 still operates near memcpy() speed in both cases; the
issue is mostly that, relative to MHz, my BJX2 core has comparatively
slower memory access.
Albeit somehow, this trend reverses for my early 2000s laptop, which has
slower RAM access. However, the SO-DIMM is 4x the width (64b vs 16b),
and 133MHz vs 50MHz; and this leads to a theoretical 10.64x ratio, which
isn't too far off from the observed memcpy() performance of the laptop.
So, laptop has 10.64x faster RAM, relative to 28x more MHz.
Whereas, say, my Ryzen has 2.64x more MHz (3.7 vs 1.4), but around 40x
more memory bandwidth (12.7x for single-thread memcpy).
Well, and if I did jump over to RV64, it would render much of what I
am doing entirely moot.
I *could* do a dedicated RV64 core, but would be unlikely to make it
"notable" enough to be worthwhile.
So, it seems like my options are either:
Continue on doing stuff mostly as is;
Drop it and probably go off to doing something else entirely.
...
But, I don't have much better to be doing, considering the typically
"meh" response to most of my 3D engine attempts. And my general
lackluster skills at most types of "creative" endeavors (I suspect
"affective alexithymia" probably doesn't help too much for artistic
expression).
Well, and I have also recently noted other oddities, for example:
It seems I may have "reverse slope hearing loss", and my hearing is
seemingly notably poor for sounds much lower than about 1.5 or 2kHz
(lower-frequency sine waves are nearly inaudible, but I can still hear
square/triangle/sawtooth waves well; most of what I perceive as
low-frequency sound seems to be based on higher-frequency harmonics
of those sounds).
So, say:
2kHz..4kHz, loud, heard easily;
4kHz..8kHz, also heard readily;
8..15kHz, fades away and disappears.
But, OTOH, for sine waves:
1kHz: much quieter than 2kHz
500Hz: fairly mild at full volume
250Hz: relatively quiet
125Hz: barely audible.
But, for sounds much under around 200Hz, I can feel the vibrations, and
can associate these with sound (but this effect is not localized to the
ears, it also works with hands and similar; it seems strongest at around
50-100 Hz, but has a lower range of around 6-8Hz; below this point,
touch becomes less sensitive to it, but visual perception can take over).
I can take audio and apply a fairly aggressive 2kHz high-pass filter
(say, -48dB per octave, applied several times), and for the most part it
doesn't sound that much different, though it does sound a little more
tinny. This "tinny" effect is reduced with a 1kHz high-pass filter.
Most of what I had perceived as low-frequency sounds are still present
even after the filtering (while being entirely absent from a spectrum plot).
Zooming in generally shows higher-frequency vibrations following
similar patterns to the low-frequency vibrations, which I seemingly
perceive "as" the low-frequency vibration.
And, in all this, I hadn't noticed that anything was amiss until looking
into it for other reasons.
I am left to wonder if some of this could be related to my preference
for the sound of ADPCM compression over that of MP3 at lower quality
levels (low bitrate MP3 sounds particularly awful, whereas ADPCM tends
to fare better; but seemingly other people disagree).
Does possibly explain some other past difficulties:
I can make a noise and hear the walls within a room;
But, trying to hit a metal tank to determine how much sand was in the
tank by hearing was quite a bit more difficult (best I could do was hit
the tank, and then try to hear what parts of the tank had reduced echo;
but results were pretty mixed as the sand level did not significantly
change the echoes).
Apparently, it turns out, people were listening for "thud" vs "not
thud", but like, I couldn't really hear this part, and wasn't even
really aware there should be a "thud" (or even really what a "thud"
sounds like apart from the effects of, say, something hitting a chunk of
wood; hitting a sand-filled steel tank with a rubber mallet was nearly
silent, but, knuckles or tapping it with a screwdriver was easier to
hear, ...).
Well, also can't really understand what anyone is saying over the phone
(as the phone reduces everything to difficult-to-understand muffled noises).
Or, like the sound effects in Wolfenstein 3D theoretically being voice
clips saying stuff, but coming across more as things like "aaaa uunn" or
"aaaauuuu" or "uu aa uu" or similar, owing to the poor audio quality.
Well, and my past failures to achieve any kind of intelligibility in
experiments messing with formant synthesis.
And some experiments with vocoder-like designs, noting that I could
seemingly discard pretty much everything much below 500Hz or 1kHz
without much ill effect; but theoretically there is "relevant stuff" in
these frequency ranges. Didn't really think much of it at the time (it
seemed like all of this was a "base frequency" where the combined
amplitude of everything could be averaged together and treated like a
single channel).
Had noted that one thing that did sort of work was, say (rough sketch
further below):
Split the audio into 32 frequency bands;
Pick the top 2 or 3 bands, ignoring low-frequency or adjacent bands;
Say, anything below 1kHz is ignored.
Record the band number and relative volume.
Then, regenerate waveforms at each of these bands with the measured
volume (along with alternate versions spread across different octaves;
it worked better if higher power-of-2 frequencies were also synthesized,
albeit at lower intensities). Get back "mostly intelligible" speech.
IIRC, had mostly used 32 bands spread across 2 octaves (say, 1-2 kHz and
2-4kHz, or 2-4 kHz and 4-8 kHz).
Can also mix in sounds from the same relative position in other octaves.
Seemed to have best results with mostly evenly-spread frequency bands.
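For what it's worth, a rough sketch of that band-picking scheme (the
frame size, band spacing, and amplitude scaling are guesses here; the
octave-spread copies and phase continuity between frames are left out):

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define NBANDS  32
#define NPICK    3
#define FRAME  512               /* samples per analysis frame (assumed) */

/* Goertzel magnitude of one band over a frame (rough, unnormalized). */
static double band_mag(const float *x, int n, double freq, double rate)
{
    double w = 2.0 * M_PI * freq / rate;
    double c = 2.0 * cos(w), s0 = 0.0, s1 = 0.0, s2 = 0.0;
    for (int i = 0; i < n; i++) {
        s0 = x[i] + c * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    return sqrt(s1 * s1 + s2 * s2 - c * s1 * s2);
}

/* Analyze one frame: 32 band centers spread over ~1..4 kHz (so anything */
/* below 1 kHz is ignored by construction), keep the strongest NPICK     */
/* bands while skipping neighbors of already-picked bands, then          */
/* resynthesize sines at those centers with the measured volumes.        */
static void resynth_frame(const float *in, float *out, double rate)
{
    double mag[NBANDS], freq[NBANDS];
    int picked[NBANDS] = {0};

    for (int b = 0; b < NBANDS; b++) {
        freq[b] = 1000.0 * pow(2.0, 2.0 * b / (double)NBANDS);
        mag[b]  = band_mag(in, FRAME, freq[b], rate);
    }
    for (int i = 0; i < FRAME; i++)
        out[i] = 0.0f;

    for (int k = 0; k < NPICK; k++) {
        int best = -1;
        for (int b = 0; b < NBANDS; b++) {
            if (picked[b]) continue;
            if ((b > 0 && picked[b - 1]) || (b + 1 < NBANDS && picked[b + 1]))
                continue;               /* adjacent to an earlier pick */
            if (best < 0 || mag[b] > mag[best])
                best = b;
        }
        if (best < 0)
            break;
        picked[best] = 1;
        for (int i = 0; i < FRAME; i++) /* crude amplitude scaling */
            out[i] += (float)((2.0 * mag[best] / FRAME) *
                              sin(2.0 * M_PI * freq[best] * i / rate));
    }
}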
...