Discussion:
Benchmarks for high-end Qualcomm ARM server vs Intel
William Edwards
2017-11-09 06:31:27 UTC
Permalink
This is an excellent test and an excellent writeup and an interesting description of modern real-world workloads:

https://blog.cloudflare.com/arm-takes-wing/

My workloads in DB and SMSC world are actually very similar, although I don't see much golang.
Bruce Hoult
2017-11-12 14:17:44 UTC
Permalink
Post by William Edwards
https://blog.cloudflare.com/arm-takes-wing/
My workloads in DB and SMSC world are actually very similar, although I don't see much golang.
Very interesting. I want to see how the ThunderX goes too. I've used a rented one in "the cloud", and it seemed pretty good to me.

I've always thought that while the "ISA doesn't matter" view may hold, and the "x86 decoder tax" is minor and affordable, if you only want a handful of cores in a desktop or laptop system, it's nuts to think it doesn't matter in handheld/IoT devices and in massive server farms, where electricity and cooling are major operating costs.

These results seem to bear out that ARM64 has a real advantage.

It will be interesting to see how RISC-V systems do, once SoCs of this core count hit the market. I think they'll beat ARM64 on power/performance.
Quadibloc
2017-11-12 16:39:47 UTC
Permalink
And, of course, this is profoundly obscured for consumer desktops and laptops by the fact that the "x86
decoder tax" is an absolutely necessary expense to run Microsoft Windows, which is absolutely
necessary to do many of the things one is likely to want to do with a computer.
John Dallman
2017-11-12 17:13:00 UTC
Permalink
Post by Quadibloc
And, of course, this is profoundly obscured for consumer desktops
and laptops by the fact that the "x86 decoder tax" is an absolutely
necessary expense to run Microsoft Windows, which is absolutely
necessary to do many of the things one is likely to want to do with a computer.
The rumours about ARM-based Windows laptops, with full desktop Windows,
keep coming and going. Intel don't seem keen on binary translation of x86
to ARM as part of Windows.

<https://arstechnica.com/gadgets/2017/05/qualcomm-microsoft-announce-snapdragon-835-pcs-with-gigabit-lte/>

John
Melzzzzz
2017-11-12 21:46:33 UTC
Permalink
Post by Quadibloc
And, of course, this is profoundly obscured for consumer desktops and laptops by the fact that the "x86
decoder tax" is an absolutely necessary expense to run Microsoft Windows, which is absolutely
necessary to do many of the things one is likely to want to do with a computer.
Well, I run Linux, and keep all my family's computers Windows-free
;)
--
press any key to continue or any other to quit...
Andreas Eder
2017-11-14 06:50:16 UTC
Permalink
Post by Quadibloc
And, of course, this is profoundly obscured for consumer desktops and
laptops by the fact that the "x86 decoder tax" is an absolutely
necessary expense to run Microsoft Windows, which is absolutely
necessary to do many of the things one is likely to want to do with a computer.
Nonsense, you can do almost anything that is worth doing on a computer
without windows - and for the tiny rest there is virtualization.

'Andreas
living without windows since 1994
Quadibloc
2017-11-14 08:00:30 UTC
Permalink
Post by Andreas Eder
Nonsense, you can do almost anything that is worth doing on a computer
without windows - and for the tiny rest there is virtualization.
That is true in one sense...

but for many, if not most, ordinary consumers who use computers - for more
than checking E-mail and surfing the Web, which certainly can be done from
Linux - what's important to them is 100% confidence that when a new popular
application (including computer games) comes out, their system can run it.

Playing, oh, say, Witcher 3 may not be "worth doing" in some senses, but
it's that kind of thing I have in mind.

Large corporate computer users may not be worried about being able to run
popular computer games, but they will be concerned about things like the
cost of managing their computer systems, strict observance of license
agreements, and the availability of software for terminal emulation, thin
client operation, and the like from the supplier of their back-end systems
such as servers and mainframes. Obviously, things like RHEL will _also_ be
useful for the large corporate computer user, since it has considerable
support.

John Savard
Quadibloc
2017-11-14 08:37:02 UTC
Permalink
And this news story

https://www.theregister.co.uk/2017/11/13/munich_committee_says_all_windows_2020/

describes how one attempt encountered difficulties that led to it being
abandoned.

John Savard
Anssi Saari
2017-11-14 13:44:32 UTC
Permalink
Post by Quadibloc
And this news story
https://www.theregister.co.uk/2017/11/13/munich_committee_says_all_windows_2020/
describes how one attempt encountered difficulties that led to it being
abandoned.
Well, the frustrating thing about the Munich project has been that
there's very little information about anything. What did they do, what
were the problems? That story from the Register doesn't exactly help.

Of course, it's clear when a city starts a Linux migration by creating a
Linux distribution themselves that things were already very wrong at
that point. It looks to me like actually having a plan might have worked, but I'm
not convinced they had any. Although I have no idea what kind of IT a
city needs or what's available for Linux.
Anton Ertl
2017-11-17 09:21:40 UTC
Permalink
Post by Anssi Saari
Post by Quadibloc
And this news story
https://www.theregister.co.uk/2017/11/13/munich_committee_says_all_windows_2020/
describes how one attempt encountered difficulties that led to it being
abandoned.
Well, the frustrating thing about the Munich project has been that
there's very little information about anything. What did they do, what
were the problems? That story from the Register doesn't exactly help.
There have been several articles, one of them recently:

https://lwn.net/Articles/737818/

I remember (but cannot find) a more technical one, which discussed
various problems that Munich's IT has (typical large-organization
issues), which the new mayor takes as justification for returning to
Windows (although they would also have to solve these problems if they
switch to Windows).

Of course, Microsoft has a big interest in having this high-profile
case declared a failure, and they have the money to support
their interests, and to make sure that any move to Windows is well
publicized.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Nick Maclaren
2017-11-17 12:04:36 UTC
Permalink
Post by Anton Ertl
Post by Anssi Saari
Post by Quadibloc
And this news story
https://www.theregister.co.uk/2017/11/13/munich_committee_says_all_windows_2020/
describes how one attempt encountered difficulties that led to it being
abandoned.
Well, the frustrating thing about the Munich project has been that
there's very little information about anything. What did they do, what
were the problems? That story from the Register doesn't exactly help.
I doubt that they will have been spelled out, for the following
reasons.
Post by Anton Ertl
https://lwn.net/Articles/737818/
I remember (but cannot find) a more technical one, which discussed
various problems that Munich's IT has (typical large-organization
issues), that the new mayor takes as justification for returning to
Windows (although they also have to solve these problems if they
switch to Windows).
In my experience, those dominate such decisions. And it isn't rare
for the decision to be taken without any consideration of the
technical issues, and the latter to be used as excuses, er,
justification of that.
Post by Anton Ertl
Of course, Microsoft has a big interest in having this high-profile
case being declared a failure, and they have the money to support
their interests, and to make sure that any move to windows is well
publicized.
One wonders whether any pressure was put on the application vendors,
as has happened in the past. I doubt that, in the EU, but one still
wonders.


Regards,
Nick Maclaren.
Bruce Hoult
2017-11-14 11:10:40 UTC
Permalink
Post by Quadibloc
Post by Andreas Eder
Nonsense, you can do almost anything that is worth doing on a computer
without windows - and for the tiny rest there is virtualization.
That is true in one sense...
but for many, if not most, ordinary consumers who use computers - for more
than checking E-mail and surfing the Web, which certainly can be done from
Linux - what's important to them is 100% confidence that when a new popular
application (including computer games) comes out, their system can run it.
For 90% of people now, that means Android or iOS, not Windows.

As for myself, I've been using computers for almost 40 years (and owned one myself for 30) without ever once owning a computer that had either MSDOS or Windows installed as the OS [1]. I don't consider that I've missed out on much.

I have, alas, at a few places had to run Windows in a virtual machine from time to time to access corporate systems.

[1] I have had to boot MSDOS from a floppy or flash drive sometimes in order to install BIOS updates.
Anton Ertl
2017-11-14 10:57:32 UTC
Permalink
Post by Quadibloc
Large corporate computer users may not be worried about being able to run
popular computer games, but they will be concerned about things like the
cost of managing their computer systems, strict observance of license
agreements, and the availability of software for terminal emulation, thin
client operation, and the like from the supplier of their back-end systems
such as servers and mainframes.
All of which are weaknesses of the Windows ecosystem. E.g., I have
been using "thin clients" on Unix since 1993 (or, if you count
classical text terminals, since 1986), before they were called "thin
clients", and before Windows had anything in that direction (however,
recent developments in the Linux ecosystem (Wayland) are directed at
getting rid of this advantage).

No, just like with games, the thing that keeps Windows alive there is
the fear that some specialty application does not work elsewhere. But
it seems that there are now also specialty applications that only have
native Linux support, therefore Windows 10 has also acquired some
amount of Linux compatibility. I have no idea how well it works,
though.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
William Edwards
2017-11-14 15:13:25 UTC
Permalink
All anyone wants to do these days seems doable online.

I live in a country with a lot of junk mail, and regularly flick through several fliers. I have noticed that chromebooks now get the lion's share of space, followed by tablets/pads, with PCs now a distant fourth. I recently went to buy a new laptop and was dismayed at how shrunken and underwhelming the available options were (and barely faster than my old mbp2013).

Given how hard it is to buy a Windows computer, and how increasingly the only app anyone installs on a new Windows computer is the Chrome browser, I think of x86+win as "legacy". Most hardware shipping today doesn't seem to run it, yet it still sells...
Megol
2017-11-14 10:51:03 UTC
Permalink
Post by Andreas Eder
Post by Quadibloc
And, of course, this is profoundly obscured for consumer desktops and
laptops by the fact that the "x86 decoder tax" is an absolutely
necessary expense to run Microsoft Windows, which is absolutely
necessary to do many of the things one is likely to want to do with a computer.
Nonsense, you can do almost anything that is worth doing on a computer
without windows - and for the tiny rest there is virtualization.
You mean emulation? Virtualization doesn't help reduce the decoder tax
mentioned above. Virtualization still requires an x86 processor and it still
requires a Windows installation.

Given that many people play games on Windows PCs and
that many popular games continue to push graphics and processing, emulation
may not be a good solution. The overhead of emulating the processor adds to
the overhead of emulating DX (though some games support the Vulkan API).

If alternatives continue to improve compared to x86, game developers will of
course make sure that their code runs on those alternatives.
But for now it's unrealistic.

Of course, for most people simply having a web browser is enough. A waste of CPU
cycles IMO.
Post by Andreas Eder
'Andreas
living without windows since 1994
a***@yahoo.com
2017-11-12 19:45:54 UTC
Permalink
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
Bruce Hoult
2017-11-12 20:21:29 UTC
Permalink
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's assuming a design by the same team, using the same technology, similar micro-architecture, etc.

There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. Icache size and bandwidth are among the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
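
To make that arithmetic concrete, here is a back-of-envelope sketch (my own illustration; the fraction of instructions that can use the 16-bit encodings is an assumption and varies by program and compiler):

    # Rough estimate of instruction-fetch bytes: fixed 32-bit encoding vs
    # mixed 16/32-bit encoding, for the same number of instructions.
    def fetch_bytes(n_insns, frac_16bit):
        fixed32 = 4 * n_insns                                   # e.g. ARM64: 4 bytes per instruction
        mixed = 2 * n_insns * frac_16bit + 4 * n_insns * (1 - frac_16bit)
        return fixed32, mixed

    for frac in (0.5, 0.6, 0.65):                               # assumed compressible fractions
        fixed32, mixed = fetch_bytes(1_000_000, frac)
        print(f"{frac:.0%} compressible: fixed-32 fetches {fixed32 / mixed:.2f}x the bytes")
    # prints ~1.33x, ~1.43x, ~1.48x -- i.e. roughly the 30%-50% extra quoted above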

The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
a***@yahoo.com
2017-11-12 21:33:50 UTC
Permalink
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's given a design by the same team, using the same technology, similar micro-architecture etc.
There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. icache size and bandwidth is one of the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
O.k.
I am a fan of the code density of Thumb2 myself. But were high-performance implementations of the mixed-length variant of RISC-V ever attempted? Or planned?
I was under the impression that mixed-length encoding is defined mostly for the benefit of MCU-class implementations.

And how about a number of small and not so small things in which aarch64 is better than MIPS/RISC-V?
#1 in my mind is FAR more powerful addressing modes.
#2 is load/store register pair.
#3 is longer reach of conditional branches, but that's already far less important than #2.
Bruce Hoult
2017-11-12 23:11:46 UTC
Permalink
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's given a design by the same team, using the same technology, similar micro-architecture etc.
There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. icache size and bandwidth is one of the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
O.k.
I am a fun of code density of Thumb2 myself. But were high-performance implementations of mixed-length variant of RISC-V ever attempted? Or planned?
It's a little early to use words such as "ever" when the base ISA was only firmly nailed down two years ago, the 16-bit instructions more recently than that, and several other bits important for competing with high end Intel are not yet finalised: bit manipulation instructions, crypto instructions, and vector instructions for example. I'm not sure about crypto but the other two are well under way and expected to have first drafts and reference implementations in the next couple of quarters.

SiFive's U54-MC is I guess medium performance, around 1.5 - 2 GHz, single issue in order. It supports the 16 bit instructions.

"BOOM" (Berkeley Out Of Order Machine doesn't currently have a decoder for 16 bit instructions. I talked to Chris Celio at a conference last month and he says 16 bit instruction support will be added.

We don't know whether there are truly high performance commercial implementations under way. They don't have to say.
Post by a***@yahoo.com
I was under impression that mixed-length encoding is defined mostly for the benefit of MCU-class implementations.
Current expectations are that any chip capable of running Linux will implement mixed-length encoding, and prebuilt Linux distributions will use it. The current Fedora distro doesn't use it, but the Fedora guys now say that was a mistake and future releases will.
Post by a***@yahoo.com
And how about a number of small and not so small things in which aarch64 is better than MIPS/RISC-V ?
#1 in my mind is FAR more powerful addressing modes.
#2 is load/store register pair.
#3 is longer reach of conditional branches, but that's already far less important than #2.
#1 I think those (as well as a number of other things in aarch64) are cracked into multiple uops in most implementations.

#2 load/store pair is a bit of a hack to get better instruction density with only 32 bit instructions available. The common cases (e.g. save/restore to the stack) use 16 bit instructions in RISC-V so the code density is the same. Higher-end implementations can easily recognize adjacent pairs of loads or stores and fuse them into a single uop if that is advantageous.

#3 RISC-V has +/-4k reach on conditional branches. Aarch64 has +/-1M. That's quite a difference but I'm not sure it's significant in most code. Machines with +/-128 bytes are a pain, for sure! Not a lot of functions are over 4k in size, let alone hot loops. RISC-V unconditional branches have +/-1M reach, so you can branch past one on the opposite condition and then do an unconditional branch to the far target. This is, again, an easy pattern to recognise in the decode stage in a big high performance implementation.
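
For what it's worth, that far-branch rewrite is mechanical enough that an assembler or a compiler's branch-relaxation pass does it automatically. A rough Python sketch of the decision (reach limits as discussed above; mnemonics and relocation syntax are illustrative, not a spec):

    # Conditional-branch relaxation for an RV64-style ISA.
    COND_REACH = 1 << 12    # conditional branch reach: about +/-4 KiB
    JUMP_REACH = 1 << 20    # unconditional jump reach:  about +/-1 MiB
    INVERT = {"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt", "ltu": "geu", "geu": "ltu"}

    def relax_branch(cond, offset):
        if -COND_REACH <= offset < COND_REACH:
            return [f"b{cond}  target"]                          # fits in one instruction
        if -JUMP_REACH <= offset < JUMP_REACH:
            # Branch past the jump on the opposite condition, then jump to the far target.
            return [f"b{INVERT[cond]}  skip", "j  target", "skip:"]
        # Even further away: skip over a full 32-bit-offset jump sequence instead.
        return [f"b{INVERT[cond]}  skip", "auipc  t0, %pcrel_hi(target)",
                "jalr  x0, %pcrel_lo(target)(t0)", "skip:"]

    print(relax_branch("eq", 100_000))   # -> ['bne  skip', 'j  target', 'skip:']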
a***@yahoo.com
2017-11-12 23:51:17 UTC
Permalink
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's given a design by the same team, using the same technology, similar micro-architecture etc.
There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. icache size and bandwidth is one of the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
O.k.
I am a fun of code density of Thumb2 myself. But were high-performance implementations of mixed-length variant of RISC-V ever attempted? Or planned?
It's a little early to use words such as "ever" when the base ISA was only firmly nailed down two years ago, the 16-bit instructions more recently than that, and several other bits important for competing with high end Intel are not yet finalised: bit manipulation instructions, crypto instructions, and vector instructions for example. I'm not sure about crypto but the other two are well under way and expected to have first drafts and reference implementations in the next couple of quarters.
SiFive's U54-MC is I guess medium performance, around 1.5 - 2 GHz, single issue in order. It supports the 16 bit instructions.
By ARMv8-A standards it's not medium, it's the absolute bottom end. Even the A35 aims higher.
Post by Bruce Hoult
"BOOM" (Berkeley Out Of Order Machine doesn't currently have a decoder for 16 bit instructions. I talked to Chris Celio at a conference last month and he says 16 bit instruction support will be added.
We don't know whether there are truly high performance commercial implementations under way. They don't have to say.
Post by a***@yahoo.com
I was under impression that mixed-length encoding is defined mostly for the benefit of MCU-class implementations.
Current expectations are that any chip capable of running Linux will implement mixed-length encoding, and prebuilt Linux distributions with use it. The current Fedora distro doesn't use it, but the Fedora guys now say that was a mistake and future releases will.
Post by a***@yahoo.com
And how about a number of small and not so small things in which aarch64 is better than MIPS/RISC-V ?
#1 in my mind is FAR more powerful addressing modes.
#2 is load/store register pair.
#3 is longer reach of conditional branches, but that's already far less important than #2.
#1 I think those (as well as a number of other things in aarch64) are cracked into multiple uops in most implementations.
According to my understanding/guesses, it goes like this:

• Base plus a scaled 12-bit unsigned immediate offset or base plus an unscaled 9-bit signed immediate offset.

RISC-V has a close equivalent.
Not cracked.

• Base plus a 64-bit register offset, optionally scaled.

RISC-V has nothing like that.
Not cracked.
Maybe on some OoO implementations, in the case of an integer store, one register read port is "stolen" from a neighboring issue port.
In the case of integer loads and FP loads/stores it's not necessary. Also, it is not necessary on implementations that have an Intel-style dedicated "store data" issue port.

• Base plus a 32-bit extended register offset, optionally scaled.

Same as above.

• Pre-indexed by an unscaled 9-bit signed immediate offset.

RISC-V has nothing like that.
Probably cracked on the majority of OoO implementations.
Probably not cracked on in-order implementations.

• Post-indexed by an unscaled 9-bit signed immediate offset.

Same as above.
Especially good fit for in-order implementations with skewed pipeline.

• PC-relative literal for loads of 32 bits or more

RISC-V has nothing like that.
Not cracked.
Here is a point where RISC-V is inferior to its ancestor MIPS and to close MIPS derivatives like Altera Nios2: while those also don't have it, they at least have 16-bit immediates that reduce the need for "literal pools".
Post by Bruce Hoult
#2 load/store pair is a bit of a hack to get better instruction density with only 32 bit instructions available. The common cases (e.g. save/restore to the stack) use 16 bit instructions in RISC-V so the code density is the same. Higher-end implementations can easily recognize adjacent pairs of loads or stores and fuse them into a single uop if that is advantageous.
That sort of fusion does not sound easy at all.
Post by Bruce Hoult
#3 RISC-V has +/4k reach on conditional branches. Aarch64 has +/-1M. That's quite a difference but I'm not sure it's significant in most code. Machines with +/-128 bytes are a pain, for sure! Not a lot of functions are over 4k in size, let alone hot loops. RISC-V unconditional branches have +/-1M reach, so you can branch to .+4 on the opposite condition and then do an unconditional branch. This is, again, an easy pattern to recognise in the decode stage in a big high performance implementation.
I agree that it is a minor advantage for aarch64.
Ivan Godard
2017-11-13 01:50:41 UTC
Permalink
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's given a design by the same team, using the same technology, similar micro-architecture etc.
There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. icache size and bandwidth is one of the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
O.k.
I am a fun of code density of Thumb2 myself. But were high-performance implementations of mixed-length variant of RISC-V ever attempted? Or planned?
It's a little early to use words such as "ever" when the base ISA was only firmly nailed down two years ago, the 16-bit instructions more recently than that, and several other bits important for competing with high end Intel are not yet finalised: bit manipulation instructions, crypto instructions, and vector instructions for example. I'm not sure about crypto but the other two are well under way and expected to have first drafts and reference implementations in the next couple of quarters.
SiFive's U54-MC is I guess medium performance, around 1.5 - 2 GHz, single issue in order. It supports the 16 bit instructions.
By ARMv8-A standards it's not a medium, it's absolute bottom end. Even A35 aims higher.
Post by Bruce Hoult
"BOOM" (Berkeley Out Of Order Machine doesn't currently have a decoder for 16 bit instructions. I talked to Chris Celio at a conference last month and he says 16 bit instruction support will be added.
We don't know whether there are truly high performance commercial implementations under way. They don't have to say.
Post by a***@yahoo.com
I was under impression that mixed-length encoding is defined mostly for the benefit of MCU-class implementations.
Current expectations are that any chip capable of running Linux will implement mixed-length encoding, and prebuilt Linux distributions with use it. The current Fedora distro doesn't use it, but the Fedora guys now say that was a mistake and future releases will.
Post by a***@yahoo.com
And how about a number of small and not so small things in which aarch64 is better than MIPS/RISC-V ?
#1 in my mind is FAR more powerful addressing modes.
#2 is load/store register pair.
#3 is longer reach of conditional branches, but that's already far less important than #2.
#1 I think those (as well as a number of other things in aarch64) are cracked into multiple uops in most implementations.
• Base plus a scaled 12-bit unsigned immediate offset or base plus an unscaled 9-bit signed immediate offset.
RISC-V has a close equivalent.
Not cracked.
• Base plus a 64-bit register offset, optionally scaled.
RISC-V has nothing like that.
Not cracked.
May be, on some OoO implementations in case of integer store one register read port is "stolen" from neighbor issue port.
In case of integer loads and FP loads/store it's not necessary. Also, it is not necessary on implementations that have Intel-style dedicated "store data" issue port.
• Base plus a 32-bit extended register offset, optionally scaled.
Same as above.
• Pre-indexed by an unscaled 9-bit signed immediate offset.
RISC-V has nothing like that.
Probably cracked on majority of ooO implementations.
Probably not cracked on in-order implementations.
• Post-indexed by an unscaled 9-bit signed immediate offset.
Same as above.
Especially good fit for in-order implementations with skewed pipeline.
• PC-relative literal for loads of 32 bits or more
RISC-V has nothing like that.
Not cracked.
Here is a point where RISC-V is inferior to its ancestor MIPS and to close MIPS derivatives like Altera Nios2, because while those also don't have it, but at least they have 16-bit immediates that reduce the need for "literal pools".
Post by Bruce Hoult
#2 load/store pair is a bit of a hack to get better instruction density with only 32 bit instructions available. The common cases (e.g. save/restore to the stack) use 16 bit instructions in RISC-V so the code density is the same. Higher-end implementations can easily recognize adjacent pairs of loads or stores and fuse them into a single uop if that is advantageous.
That sort of fusion does not sound easy at all.
Post by Bruce Hoult
#3 RISC-V has +/4k reach on conditional branches. Aarch64 has +/-1M. That's quite a difference but I'm not sure it's significant in most code. Machines with +/-128 bytes are a pain, for sure! Not a lot of functions are over 4k in size, let alone hot loops. RISC-V unconditional branches have +/-1M reach, so you can branch to .+4 on the opposite condition and then do an unconditional branch. This is, again, an easy pattern to recognise in the decode stage in a big high performance implementation.
I agree that it is a minor advantage for aarch64.
Store: {Base (specReg or pointer), 0/8/16/32 bit signed literal offset,
optional scaled or unscaled index} -> 3-input address adder, plus stored
operand and optional predicate operand. Tin, with no index, offset or
predicate: 15 bits. Gold, with index, 32 bit offset, and predicate: 57
bits. Both exclusive of instruction-level overhead.
Bruce Hoult
2017-11-13 09:49:14 UTC
Permalink
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
Post by a***@yahoo.com
Post by Bruce Hoult
I think they'll beat ARM64 on power/performance.
Why do you think so?
That's given a design by the same team, using the same technology, similar micro-architecture etc.
There are a number of small things that add up, but the #1 in my mind is that the ARM64 has to fetch 30% - 50% more instruction bytes than a RV64 for the same program with the same number of instructions. icache size and bandwidth is one of the most expensive things in a high performance CPU. A RISCV with mixed 16- and 32-bit instructions can perform the same with, say, 24k of L1 icache as the ARM64 does with 32k.
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU in this regard, beating PowerPC by a good bit and MIPS and Alpha by a lot, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
O.k.
I am a fun of code density of Thumb2 myself. But were high-performance implementations of mixed-length variant of RISC-V ever attempted? Or planned?
It's a little early to use words such as "ever" when the base ISA was only firmly nailed down two years ago, the 16-bit instructions more recently than that, and several other bits important for competing with high end Intel are not yet finalised: bit manipulation instructions, crypto instructions, and vector instructions for example. I'm not sure about crypto but the other two are well under way and expected to have first drafts and reference implementations in the next couple of quarters.
SiFive's U54-MC is I guess medium performance, around 1.5 - 2 GHz, single issue in order. It supports the 16 bit instructions.
By ARMv8-A standards it's not a medium, it's absolute bottom end. Even A35 aims higher.
I've never come across one of those in the wild. U54 looks like it's going to come in comparable to the vast majority of A53s out there. (Yes, I know A53 is dual issue ... I struggle to see where it gets an advantage from that)
Post by a***@yahoo.com
Post by Bruce Hoult
#1 I think those (as well as a number of other things in aarch64) are cracked into multiple uops in most implementations.
• Base plus a scaled 12-bit unsigned immediate offset or base plus an unscaled 9-bit signed immediate offset.
RISC-V has a close equivalent.
Not cracked.
• Base plus a 64-bit register offset, optionally scaled.
RISC-V has nothing like that.
Not cracked.
May be, on some OoO implementations in case of integer store one register read port is "stolen" from neighbor issue port.
In case of integer loads and FP loads/store it's not necessary. Also, it is not necessary on implementations that have Intel-style dedicated "store data" issue port.
• Base plus a 32-bit extended register offset, optionally scaled.
Same as above.
• Pre-indexed by an unscaled 9-bit signed immediate offset.
RISC-V has nothing like that.
Probably cracked on majority of ooO implementations.
Probably not cracked on in-order implementations.
• Post-indexed by an unscaled 9-bit signed immediate offset.
Same as above.
Especially good fit for in-order implementations with skewed pipeline.
• PC-relative literal for loads of 32 bits or more
RISC-V has nothing like that.
Not cracked.
Here is a point where RISC-V is inferior to its ancestor MIPS and to close MIPS derivatives like Altera Nios2, because while those also don't have it, but at least they have 16-bit immediates that reduce the need for "literal pools".
All well and good, but how much actual advantage does it give on real programs?
Post by a***@yahoo.com
Post by Bruce Hoult
#2 load/store pair is a bit of a hack to get better instruction density with only 32 bit instructions available. The common cases (e.g. save/restore to the stack) use 16 bit instructions in RISC-V so the code density is the same. Higher-end implementations can easily recognize adjacent pairs of loads or stores and fuse them into a single uop if that is advantageous.
That sort of fusion does not sound easy at all.
Why? Decoding the pair of 16 bit instructions as a single pseudo 32-bit instruction is no harder than decoding Aarch64 or Thumb2 32 bit instructions. The encoding gets a little bit redundant because the opcode fields should be the same, and they both have offsets, which should probably be adjacent.

The Aarch64 approach says every implementation must recognize complex instructions, and simple implementations might crack them into multiple uops (two loads or two stores in this case).

The RISC-V approach says they are two valid instructions and simple implementations can remain very simple. Only big implementations that might contemplate running both instructions as one uop are burdened with recognizing the 32-bit pattern.
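
To sketch what that recognition amounts to (field names here are simplified placeholders for the already-decoded fields, not the real RVC bit layout):

    # Simplified macro-op fusion check: two adjacent loads that read consecutive
    # doublewords off the same base register can issue as one "load pair" uop.
    from collections import namedtuple

    Insn = namedtuple("Insn", "op rd base offset")     # fields after expanding the 16-bit forms

    def try_fuse(a, b):
        if (a.op == b.op == "ld"
                and a.base == b.base
                and b.offset == a.offset + 8           # consecutive 64-bit slots
                and b.rd != a.rd                       # distinct destinations
                and b.base != a.rd):                   # second load must not depend on the first
            return ("ld_pair", (a.rd, b.rd), a.base, a.offset)
        return None                                    # otherwise issue them separately

    # e.g. restoring two callee-saved registers from the stack frame:
    print(try_fuse(Insn("ld", "s0", "sp", 0), Insn("ld", "s1", "sp", 8)))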

You touched on another aspect: some Aarch64 instructions require more than two integer read ports and one write port. No current RISC-V instruction does, and a strong effort is being made not to add any; for example, this has influenced the design of the bit-manipulation extension instructions.


In general here, I'm getting the impression that your definition of better and inferior hinges on convenience for the programmer, on not needing an extra instruction in situations that are not actually all that common, certainly dynamically. For example the vast majority of conditional branch offsets are small, as are the vast majority of load/store immediate offsets.
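
That claim is easy enough to check for load/store offsets on your own binaries. A quick sketch, assuming GNU objdump and its usual "offset(reg)" operand syntax for a RISC-V build (the cross-objdump name is an assumption; substitute whatever toolchain you have):

    # Histogram load/store immediate offsets in a disassembled RISC-V binary.
    import collections, re, subprocess

    def offset_histogram(path):
        asm = subprocess.run(["riscv64-linux-gnu-objdump", "-d", path],
                             capture_output=True, text=True, check=True).stdout
        buckets = collections.Counter()
        for off in re.findall(r"(-?\d+)\((?:[sat]\d+|sp|gp|tp|ra|fp)\)", asm):
            buckets["within +/-2 KiB" if abs(int(off)) < 2048 else "larger"] += 1
        return buckets

    # e.g. offset_histogram("/usr/bin/ls") on a RISC-V system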

The proof is in the pudding, and we will see how it all works out when higher end implementations start coming out.

Bear in mind that RISCV development is at least five years behind Aarch64. The RISCV effort was started (publicly) before Aarch64 was announced (including a manual with the completed ISA design), but presumably Aarch64 was being developed in secret for quite some years before that. It took from architecture announcement in October 2011 until early 2015 for the first devices (e.g. Galaxy S6) with ARM Ltd cores to be released. (The iPhone 5s was a complete shock to the industry in September 2013, and there was also a 64 bit Tegra in the Nexus 9 in 2014)
MitchAlsup
2017-11-12 22:50:35 UTC
Permalink
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU <snip>, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
Perhaps ARM actually measured the scenario space and came to a decision based on data.
Bruce Hoult
2017-11-12 23:14:04 UTC
Permalink
Post by MitchAlsup
The ARM64 is by far the best pure 32-bit instruction 64 bit CPU <snip>, but it just boggles me that ARM knew the benefits of Thumb2 and then totally forgot them when they went 64 bit.
Perhaps ARM actually measured the scenario space and came to a decision based on data.
Perhaps the guy who co-wrote "Computer Architecture: A Quantitative Approach" did too?
Piotr Wyderski
2017-11-14 18:53:56 UTC
Permalink
Post by Bruce Hoult
These results seem to bear out that ARM64 has a real advantage.
Another advantage is that when you learn about the ARM architecture,
you are basically able to program devices ranging from microwave
ovens through mobile phones to these beefy servers. When you learn
x64, you are bound to PCs in one form or another.

Best regards, Piotr
Bruce Hoult
2017-11-14 19:52:44 UTC
Permalink
Post by Piotr Wyderski
Post by Bruce Hoult
These results seem to bear out that ARM64 has a real advantage.
Another advantage is that when you learn about the ARM architecture,
you are basically able to program devices ranging from a microwave
ovens through mobile phones to these beefy servers. When you learn
x64, you are bound to program the PCs in this form or another.
True for RISCV, which has the same basic ISA whether for a 32 bit CPU or for 64 bit servers now and 128 bit in the future (maybe not so far in the future ... Krste Asanović says they included 128 bit support as a bit of a joke but they've been surprised to have some actual demand for it immediately, as it's an absolutely unique capability at the moment). If you happen to want a small low power low performance 64 bit CPU it's easy to do that e.g. the SiFive E51 core. That's good as an embedded controller in things that need more than 32 bit address space.

In ARM, the 64 bit ISA is totally different to the 32 bit ISA, and the chips don't extend very far down the size/speed/power consumption range. Conversely 32 bit doesn't extend very far up the performance range -- it's not too bad, as the out of order A15 works pretty well and does 2 GHz or a bit more. I do enjoy using the Odroid XU4 with quad 2.0 A15 plus quad 1.4 A7 using the Samsung Exynos 5422 SoC (same as the non-LTE version of the Galaxy S5) as it makes quite a decent desktop system -- night and day compared to a Raspberry Pi. So there is some overlap between high power 32 bit ARM and low power 64 bit ARM, but you can't (currently) use one ISA all the way from a microwave oven to a supercomputer.
Piotr Wyderski
2017-11-14 22:18:01 UTC
Permalink
Post by Bruce Hoult
True for RISCV, which has the same basic ISA whether for a 32 bit CPU or for 64 bit servers now and 128 bit in the future
The ISA might be the same, but the low-end chips don't exist. A bit
different in the case of an ARM:

https://pl.mouser.com/Semiconductors/Integrated-Circuits-ICs/Embedded-Processors-Controllers/Microcontrollers-MCU/ARM-Microcontrollers-MCU/_/N-a86nc

Just 8699 matches. There is even an ARM in DIP8, some people really DO
have unbounded imagination/sense of humor...

https://pl.mouser.com/ProductDetail/NXP/LPC810M021FN8FP/?qs=cbprxTG2Yq%2fZWYQ2Vx854A%3d%3d

Best regards, Piotr
Bruce Hoult
2017-11-14 23:44:36 UTC
Permalink
Post by Piotr Wyderski
Post by Bruce Hoult
True for RISCV, which has the same basic ISA whether for a 32 bit CPU or for 64 bit servers now and 128 bit in the future
The ISA might be the same, but the low-end chips don't exist. A bit
https://pl.mouser.com/Semiconductors/Integrated-Circuits-ICs/Embedded-Processors-Controllers/Microcontrollers-MCU/ARM-Microcontrollers-MCU/_/N-a86nc
A fairly low end chip is the *only* one currently publicly for sale:

https://www.crowdsupply.com/sifive/hifive1

(the page is mostly for a dev board with the FE310-G000 chip, but also lists bare chips for sale)

That will change over time. But meantime, if you want some other particular package or peripheral set go and talk to SiFive (or others) and they'll be happy to make it for you.

Of course they won't be $1 each unless you order a *lot* of them.
Piotr Wyderski
2017-11-17 09:14:09 UTC
Permalink
Post by Bruce Hoult
That will change over time. But meantime, if you want some other particular package or peripheral set go and talk to SiFive (or others) and they'll be happy to make it for you.
Of course they won't be $1 each unless you order a *lot* of them.
But why should I care if the ARMs are already available and, moreover,
from many vendors? In any quantity I want, be it 1 or 1e8. The software
chain is also available and mature. Vendor lock-in is one of the most
stupid things one can do.

Best regards, Piotr
Bruce Hoult
2017-11-17 11:54:54 UTC
Permalink
Post by Piotr Wyderski
Post by Bruce Hoult
That will change over time. But meantime, if you want some other particular package or peripheral set go and talk to SiFive (or others) and they'll be happy to make it for you.
Of course they won't be $1 each unless you order a *lot* of them.
But why should I care if the ARMs are already available and, moreover,
from many vendors? In any quantity I want, be it 1 or 1e8. The software
chain is also available and mature. Vendor lock-in is one of the most
stupid things one can do.
If you see an ARM-based SoC that does exactly what you want then of course use it. There are a lot of great products out there.

But, yes, be careful of vendor lock-in! The ARM cores are standard -- licensees are not allowed to change them in any way -- but the peripherals and the interfaces to program them are very different between an NXP, an STM, and an Atmel SAM. Assuming you're not programming in assembler, that has a much bigger effect on the portability of your software than the CPU instruction set.

As well as the CPU cores, the RISC-V guys are assembling a set of open and license-free peripherals. That's going to take a while, of course, and in the short term it's going to be easier to add licenced peripherals to the open cores, just as people using Linux had to tolerate for a long time certain peripherals having only binary blob drivers (as NVIDIA video cards still do today).

Linux is a great comparison, I think. RISC-V now is in about the same stage as Linux was in 1994 or 1995. Linux was pretty easy to ignore then, and indeed not using Linux was the correct choice for many or most people then. But it slowly got critical mass, high quality, and runs the majority of mobile phones, internet servers, and supercomputers today. It's far from certain that RISC-V will follow the same path -- there are a thousand ways to fail -- but I think it's got a chance and a great start.

Recent moves by ARM appear to be a direct reaction to RISC-V, adding the Cortex M3 to the program, and removing upfront fees:

https://www.allaboutcircuits.com/news/arm-announces-update-to-its-designstart-program-for-custom-soc-designers/

That's a good start, but it would be much better if they allowed the use of *any* of their standard cores .. including the A-series.

Look for ARM to do that if RISC-V gains more momentum. Everyone wins .. including ARM customers.
Piotr Wyderski
2017-11-18 20:32:07 UTC
Permalink
Post by Bruce Hoult
As well as the CPU cores, the RISC-V guys are assembling a set of open and license-free peripherals. That's going to take a while, of course, and in the short term it's going to be easier to add licenced peripherals to the open cores, just as people using Linux had to tolerate for a long time certain peripherals having only binary blob drivers (as NVIDIA video cards still do today).
But the peripherals' license fees are not important to the end users;
they are hidden somewhere in the cost of the chip, and the final decision
is based on whether you can afford the entire chip or not. The price of
an ARM is X, including the licenses; the cost of a RISC-V is Y for some
other reasons; if both fit, select the cheapest. Besides, the embedded
world already knows that one size does not fit all, and the variety of
peripherals attached to the RISC-V core will cause exactly the same
portability problems as in the case of the ARMs.
Post by Bruce Hoult
Linux is a great comparison, I think.
Exactly. Today you have so many Linuxes that portability becomes a
serious problem even between them.
Post by Bruce Hoult
Look for ARM to do that if RISC-V gains more momentum. Everyone wins .. including ARM customers.
Or switch to a mix of an FPGA and an MCU (like the Zynq) and take your
favourite peripherals with you, whoever is the producer of the hybrid.
IMO this is what the world aims at. Call it hardware-level
virtualization if you wish.

Best regards, Piotr
a***@yahoo.com
2017-11-18 21:29:17 UTC
Permalink
Post by Piotr Wyderski
Post by Bruce Hoult
As well as the CPU cores, the RISC-V guys are assembling a set of open and license-free peripherals. That's going to take a while, of course, and in the short term it's going to be easier to add licenced peripherals to the open cores, just as people using Linux had to tolerate for a long time certain peripherals having only binary blob drivers (as NVIDIA video cards still do today).
But the peripherals' license fees are not important to the end users,
they are hidden somewhere in the cost of the chip and the final decision
is based on whether you can afford the entire chip or not. The price of
an ARM is X, including the licenses, the cost of a RISC-V is Y for some
other reasons, if both fit, select the cheapest. Besides, the embedded
world already knows that one size does not fit all and the variety of
peripherals attached to the RISC-V core will cause exactly the same
portability problems as in the case of the ARMs.
Post by Bruce Hoult
Linux is a great comparison, I think.
Exactly. Today you have so many Linuxes that portability becomes a
serious problem even between them.
Post by Bruce Hoult
Look for ARM to do that if RISC-V gains more momentum. Everyone wins .. including ARM customers.
Or switch to a mix of an FPGA and an MCU (like the Zynq) and take your
favourite peripherals with you, whoever is the producer of the hybrid.
IMO this is what the world aims at. Call it hardware-level
virtualization if you wish.
Best regards, Piotr
Zynq (or the HPS variants of Cyclone-V) does not cover the <$20 space, nor <500 mW.
Max10 (and the smaller members of the Cyclone10-LP family) do play here, but neither includes "hard" cores.
Piotr Wyderski
2017-11-18 22:12:02 UTC
Permalink
Post by a***@yahoo.com
Zynq (or HPS variants of Cyclone-V) does not cover <20$ space, nor <500mW.
You also have the PSOC5LP here, but its documentation and support quality are
ridiculously substandard. Nonetheless, I don't claim that Zynq is the
solution. I claim that a Zynq-like FPGA+MCU hybrid is the way to go.
Why would you care about portability issues if you can port your hardware?
Post by a***@yahoo.com
Max10 (and the smaller members of Cyclone10-LP family) do play here, but neither includes "hard" cores.
And the analog subsystem of the current hybrids is non-existent or at
most laughable (Zynq or Microsemi Fusion2). But nothing prevents
including this functionality, at least technologically. They want to
target the premium MCU market now, which is OK. The analog subsystem
of the PSOC5LP + Zynq digital peripherals + Xilinx-like documentation =
guaranteed success.

Best regards, Piotr
Bruce Hoult
2017-11-18 23:30:58 UTC
Permalink
Post by Piotr Wyderski
Exactly. Today you have so many Linuxes that portability becomes a
serious problem even between them.
Very minor compared to porting between different proprietary OSes!
Post by Piotr Wyderski
Post by Bruce Hoult
Look for ARM to do that if RISC-V gains more momentum. Everyone wins .. including ARM customers.
Or switch to a mix of an FPGA and an MCU (like the Zynq) and take your
favourite peripherals with you, whoever is the producer of the hybrid.
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.

You can take your peripherals with you and have higher performance in a small custom low power SoC for a lower price than Zynq at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
Piotr Wyderski
2017-11-19 08:48:41 UTC
Permalink
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
A pure effect of the lack of competition and novelty premium.
There is no reason for the Zynqs to cost $60 per chip. Nor do you
need a dual core 1GHz ARM + so many PL cells in many
applications, but this technology can be easily downscaled.
Just a matter of your marketing target group.
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
This is another good option: basically the good old ASICs, but this time
developed by the end users. But this time you sacrifice the flexibility
of the FPGA fabric and the ability to make in-the-field hardware
updates. Nonetheless, I see conventional MCU/SoC technology as
hopelessly obsolete. The hybrids are able to kill the entire branch
of real-time systems' design and implementation techniques, where
the CPU-based solution is good only because we have no alternatives.
Throw in as many PL cells as you need, get nanosecond-to-microsecond
reaction times EASILY and forget about EDF scheduling and related garbage.

Best regards, Piotr
a***@yahoo.com
2017-11-19 09:58:34 UTC
Permalink
Post by Piotr Wyderski
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
Pure effect of the lack of competitition and novelty premium.
There is no reason for the Zynqs to cost 60$ per chip. You don't
either need a dual core 1GHz ARM + so many PL cells in many
applications, but this technology can be easily downscaled.
Just a matter of your marketing target group.
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
This is another good option: basically the old good ASICs, but this time
developed by the end users. But this time you sacrifice the flexibility
of the FPGA fabric and the ability to make in the field hardware
updates. Nonetheless, I see the conventional MCU/SoC technology
hopelessly obsolete. The hybrids are able to kill the entire branch
of real-time systems' design and implementation techniques, where
the CPU-based solution is good only because we have no alternatives.
Throw as many PL cells as you need, get nanosecod..microsecond
reaction times EASILY and forget about EDF scheduling and related garbage.
Best regards, Piotr
If the device that you are dreaming about is made at a relatively modern process node, like TSMC 40nm or better, then I don't see how a "hard" MCU-class core adds value. That is, it adds some value, but not really enough. An MCU-class soft core consumes less than 2000 LUTs, and at these nodes LUTs are cheap.

For me the two main reasons why I still sometimes prefer MCUs over FPGAs with soft cores are analog peripherals and on-chip NOR flash. Power consumption is a distant #3. The CPU core is no higher than #4.

And on-chip NOR flash is an important factor only because on-chip SRAM is relatively expensive. And on-chip SRAM is relatively expensive only because my FPGA vendor of choice makes no 40-28nm low end devices at all. If they gave me 1Mbit of SRAM cheaply, which is perfectly possible technically at finer nodes, then I would be happy to load my program at boot time from a 50c SPI flash.

So, in the end, for me it's *only* about analog.
YMMV.
t***@gmail.com
2017-11-19 17:07:16 UTC
Permalink
Post by a***@yahoo.com
If the device that you are dreaming about is made at relatively modern process node, like TSMC 40nm or better, then I don't see how "hard" MCU-class core adds value. That is, it adds some value, but not really enough. MCU-class soft core consumes less 2000 LUTs and at this nodes LUTs are cheap.
I agree if you need Cortex-Mx-like performance. But if you want something like a Zynq with dual Cortex-A9, DDR controller, Ethernet and USB MAC, etc., you would need many more LUTs and would not get anywhere near the speed that you get with a hard core. And then there are also other disadvantages like compile time, etc.

I think this SoC + FPGA approach is a very good concept. They should get the power down, however, and the pricing for the high-end parts.

Regards,

Thomas
David Brown
2017-11-19 17:22:49 UTC
Permalink
Post by a***@yahoo.com
Post by Piotr Wyderski
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
Pure effect of the lack of competitition and novelty premium.
There is no reason for the Zynqs to cost 60$ per chip. You don't
either need a dual core 1GHz ARM + so many PL cells in many
applications, but this technology can be easily downscaled.
Just a matter of your marketing target group.
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
This is another good option: basically the old good ASICs, but this time
developed by the end users. But this time you sacrifice the flexibility
of the FPGA fabric and the ability to make in the field hardware
updates. Nonetheless, I see the conventional MCU/SoC technology
hopelessly obsolete. The hybrids are able to kill the entire branch
of real-time systems' design and implementation techniques, where
the CPU-based solution is good only because we have no alternatives.
Throw as many PL cells as you need, get nanosecod..microsecond
reaction times EASILY and forget about EDF scheduling and related garbage.
Best regards, Piotr
If the device that you are dreaming about is made at relatively modern process node, like TSMC 40nm or better, then I don't see how "hard" MCU-class core adds value. That is, it adds some value, but not really enough. MCU-class soft core consumes less 2000 LUTs and at this nodes LUTs are cheap.
For me two main reasons why I still sometimes prefer MCUs over FPGAs with soft core are analog peripherals and on-chip NOR flash. Power consumption is distant #3. CPU core is no higher than #4.
And on-chip NOR flash is an important factor only because on-chip SRAM is relatively expensive. And on-chip SRAM is relatively expensive only because my FPGA vendor of choice makes no 40-28nm low end devices at all. If they give me 1Mbit of SRAM cheaply, which is perfectly technically possibly at finer nodes, then I will be happy with loading my program in boot time from 50c SPI flash.
So, at the end, for me it's *only* about analog.
YMMV.
For high-end microcontrollers, there seems to be a modern trend to drop
the on-chip flash. Having flash on the die places severe restrictions on
the kind of process and die setup you can use - that in turn makes the
chip more expensive, especially if you want a lot of on-board RAM. For
devices like NXP's new "i.MX RT" family, dropping the internal flash
means the device is a lot faster and a lot cheaper than their other
high-end microcontrollers - even when you include the cost of an
external QSPI flash chip.
t***@gmail.com
2017-11-19 16:58:31 UTC
Permalink
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
Depending on quantity, small Zynqs are *MUCH* cheaper than $100...
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
Are you talking about an ASIC here? I would be very surprised if the break-even for something with Zynq-like performance is below 100k chips... In fact, I think it is more like 1M. See also: http://anysilicon.com/fpga-vs-asic-choose/

Or do I miss something here?

And, of course, you lose the flexibility to add "hardware" changes with firmware updates (i.e. changes in the FPGA design).

Regards,

Thomas
Bruce Hoult
2017-11-19 19:05:01 UTC
Permalink
Post by t***@gmail.com
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
Depending on quantity, small Zynqs are *MUCH* cheaper then $100...
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
Are you talking about an ASIC here? I would be very surprised if the break-even for something with Zynq-like performance is below 100k chips... I in fact, I think it is more like 1M. See also: http://anysilicon.com/fpga-vs-asic-choose/
Rather excessive $4m NRE on the ASIC there!

SiFive, at least, think they can do it a lot lower than that.

https://www.sifive.com/designshare/

They're talking $100k to do the NRE and get a batch of test chips, for a synthesisable SoC using standard (but parameterised) IP components.

I believe it's about $30k a wafer after that.
t***@gmail.com
2017-11-19 21:42:57 UTC
Permalink
Post by Bruce Hoult
Post by t***@gmail.com
Post by Bruce Hoult
Whoa! From $1 off the shelf ARM to $100 Zynq! Quite the difference.
Depending on quantity, small Zynqs are *MUCH* cheaper then $100...
Post by Bruce Hoult
You can take your peripherals with you and have higher performance in a small custom low power SoC for lower price than Zync at a surprisingly low volume of parts -- maybe as low as 300, but certainly lower than 1000 now. It's going to take 3 months to get them, and of course you'll want to prototype in an FPGA.
Are you talking about an ASIC here? I would be very surprised if the break-even for something with Zynq-like performance is below 100k chips... I in fact, I think it is more like 1M. See also: http://anysilicon.com/fpga-vs-asic-choose/
Rather excessive $4m NRE on the ASIC there!
I read $1.5m NRE, but that does not change the story too much...
Post by Bruce Hoult
SiFive, at least, think they can do it a lot lower than that.
https://www.sifive.com/designshare/
Hmm, interesting business concept. How attractive this really is depends on the detailed numbers, of course. I could not find details on their website. But a low financial-risk approach could really lower the entrance hurdle for many....
Post by Bruce Hoult
They're talking $100k to do the NRE and get a batch of test chips, for a synthesisable SoC using standard (but parameterised) IP components.
I assume this is for a rather old process technology, and/or only for prototypes from a multi-project wafer batch. (Then you will still have to pay big $ for the final masks?)

And what about IP pricing? (For their 64b single-core RISC V they show $600k on their website. Is this still valid for their design share approach, or is this all included in the wafer-price? Of course you will also need stuff like PLLs and a memory controller...)
Post by Bruce Hoult
I believe it's about $30k a wafer after that.
Process technology? Wafer size? IP included? Packaging included? Testing included?

It would be great news to get a let's say 28nm ASIC for 100k NRE + 30k/wafer, but I seriously doubt this...
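For scale, here is what a wafer price means per die; the die area and yield are my assumptions, only the $30k/wafer figure is from above:

/* Cost-per-die sketch.  Die area and yield are assumed; wafer price
   is the $30k figure discussed above.  Uses the usual first-order
   dies-per-wafer approximation.  Link with -lm. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double pi    = 3.141592653589793;
    double wafer_price = 30000.0;  /* $ per 300 mm wafer */
    double d           = 300.0;    /* wafer diameter, mm */
    double die_area    = 25.0;     /* assumed small-SoC die, mm^2 */
    double yield       = 0.9;      /* assumed yield */

    double dpw = pi * (d / 2) * (d / 2) / die_area
               - pi * d / sqrt(2.0 * die_area);
    printf("~%.0f gross dies, ~$%.2f per good die\n",
           dpw, wafer_price / (dpw * yield));
    return 0;
}

Even at $30k/wafer a 25 mm^2 die comes out around $12 before packaging and test, so the per-wafer figure is not the scary part; the NRE and masks are.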

Regards,

Thomas
EricP
2017-11-20 16:46:40 UTC
Permalink
Post by Bruce Hoult
They're talking $100k to do the NRE and get a batch of test chips, for a synthesisable SoC using standard (but parameterised) IP components.
I believe it's about $30k a wafer after that.
That $30k per wafer sounds pricey.
A bit of poking about finds that foundries like Samsung charge
$2500/wafer for 14nm, and TSMC $3500/wafer for 16nm.
Or are these apples and oranges?

https://www.fool.com/investing/2017/08/15/this-mobile-chipmaker-is-cutting-prices.aspx


Eric
Bruce Hoult
2017-11-20 18:21:47 UTC
Permalink
Post by EricP
Post by Bruce Hoult
They're talking $100k to do the NRE and get a batch of test chips, for a synthesisable SoC using standard (but parameterised) IP components.
I believe it's about $30k a wafer after that.
That $30k per wafer sounds pricey.
A bit of poking about finds that foundries like Samsung charge
$2500/wafer for 14nm, and TSMC $3500/wafer for 16nm.
Or are these apples and oranges?
https://www.fool.com/investing/2017/08/15/this-mobile-chipmaker-is-cutting-prices.aspx
Maybe I was confused and that includes making masks?

All the better!!

SiFive's first products were using TSMC 180nm but they're moving to TSMC 28nm at the moment.

I note that in 180nm they get a "guaranteed" 320 MHz on the FE310-G000 (32-bit, single dispatch, classic 5-stage pipe), whereas Motorola and Intel on a similar process were getting 1.0 or 1.2 GHz in the Pentium III and PowerPC G4.

I think that's largely down to the difference between spending millions on laying everything out by hand, or letting your PC spend a few hours doing it.

In 28nm they're looking at 1.5 GHz with basically the same processor core (but widened to 64 bit) and automatic layout with maybe a few hours or days of hand layout (I believe basically finding an initial layout that gives the best local minimum under automatic tweaking).

I notice Intel was getting 3 - 3.5 GHz at similar feature size with Sandy Bridge, though only 1 - 1.5 GHz for Atom.
m***@gmail.com
2017-11-14 19:12:03 UTC
Permalink
Post by Bruce Hoult
I've always thought that while "ISA doesn't matter" and the "x86 decoder tax" is minor and affordable if you only want a handful of cores in a desktop or laptop system, it's nuts to think that it doesn't matter in handheld/IoT and massive server farms where electricity and cooling are major operating costs.
These results seem to bear out that ARM64 has a real advantage.
With all the comparison of AArch64 to RISC-V in this thread, maybe the better comparison is PowerPC/POWER to AArch64. PowerPC and AArch64 are closer in ISA similarities (RISC, 32 GP registers, fixed-length 32-bit encodings, robust/fat instruction set, rigid but not too wide SIMD units). AArch64 is like a re-encoded PowerPC which gained 5%-15% in code density and dropped some legacy baggage, yet all of a sudden this is enough to be competitive and even have a "real advantage"?
IBM has POWER chips for servers which are good at multi-threading but otherwise are lackluster, and many-core PowerPC processors have been available for some time. The x86_64 chips do have good single-core performance (many programs execute faster), and powerful single-core performance can do more work for the energy used, as well as being an advantage for the many tasks which cannot be fully parallelized. Of course, the biggest advantage a server chip can have is economies of scale, and perhaps POWER/PowerPC lacked the commitment Qualcomm is giving to AArch64 servers (where IBM is failing with POWER?).
I would not bet the server farm on AArch64 from these benchmarks, though. It's not a bad early showing, but Qualcomm is a die shrink ahead and still couldn't match the single-core performance of x86_64 server processors. It may be enough to wake up Intel, like Alpha did when they thought they had a "real advantage". I expect there is plenty of fat to cut in those x86_64 processors, but I don't know how much of an advantage single-core performance is to server processors (unmatched single-core performance allows x86_64 to dominate desktop/laptop gaming). Intel could probably cut some of the legacy hardware support and add more cores, but the reduced compatibility might erode their software advantage.
Post by Bruce Hoult
It will be interesting to see how RISC-V systems do, once SoCs of this core count hit the market. I think they'll beat ARM64 on power/performance.
A simpler RISC-V CPU can probably have more cores with the same amount of logic, and reduced energy use when sitting idle, but that may be a better fit for mobile/embedded than for servers and laptops/desktops. It is easy to add many simple cores to a CPU, but it is still a challenge for software to fully take advantage of them. Maximizing the number of cores would be as simple as adding many shallow-pipeline in-order RISC cores and using an ISA with as good code density as possible (least logic used per core). Many people think this will allow performance to be scaled up from the simplest to the most powerful processors.
ARM realized they needed some single-core punch when designing AArch64, and Qualcomm likely did benchmarks in deciding to use more complex OoO cores which provide better single-core performance. It is interesting that Qualcomm decided to forgo sub-core threading (if I'm reading correctly), which is the opposite conclusion of the IBM POWER engineers, who increased sub-core threading.
Bruce Hoult
2017-11-14 20:20:16 UTC
Permalink
Post by m***@gmail.com
Post by Bruce Hoult
I've always thought that while "ISA doesn't matter" and the "x86 decoder tax" is minor and affordable if you only want a handful of cores in a desktop or laptop system, it's nuts to think that it doesn't matter in handheld/IoT and massive server farms where electricity and cooling are major operating costs.
These results seem to bear out that ARM64 has a real advantage.
With all the comparison of AArch64 to RISC-V in this thread, maybe the better comparison is PowerPC/POWER to AArch64. PowerPC and AArch64 are closer in ISA similarities (RISC, 32 GP registers, fixed length 32 bit encodings, robust/fat instruction set, rigid but not too wide of SIMD units). AArch64 is like a re-encoded PowerPC which gained 5%-15% in code density and dropped some legacy baggage
Mostly agree. I always enjoyed the PowerPC. Code density sucked a little bit, but not as badly as MIPS or Alpha. IBM's high-end models were just fine against contemporary Intel chips in the G5 towers, and the G4 was competitive against the Pentium III as far as it went (about 1.4 GHz). The problem for Apple was that Motorola was interested in low-power embedded things (using the G4), and Apple didn't have anything available in either vendor's roadmap to compete with Centrino and its successors (Core).

A lot of the current situation may come down to licensing, with ARM being easier to deal with.

No licensing is even easier to deal with :-)
Post by m***@gmail.com
A simpler RISC-V CPU can probably have more cores with the same amount of logic and reduced energy use when sitting idle but that may be a better fit for mobile/embedded than servers and laptops/desktops.
Yes, except that servers fit with embedded, not with laptops/desktops. Typical servers are *mostly* about total throughput of many independent tasks (and total power dissipation), not about latency on any one task, assuming you're in the same ballpark at least. Xeons with lots of cores are running at half the MHz compared to desktops.

Software development is an exception. It usually has some massively parallel parts, alternating with single-threaded parts such as the "make" program itself (and/or configure) and final linking. The project I currently work on takes about 12 minutes for a full build on the i7 6700 at work, 7 minutes on the i7 6700K at home, and 6 minutes on a 16 core two socket Xeon server. So use the server, right? Except most of the builds aren't full builds and the i7 6700K takes maybe 30 seconds while the Xeon takes a minute. That's enough to make a serious difference in productivity and concentration.
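The build-time pattern is the usual Amdahl picture. A sketch, with a made-up parallel fraction rather than anything measured from that project:

/* Amdahl's-law sketch: speedup = 1 / ((1-p) + p/N).
   The parallel fraction p is illustrative only. */
#include <stdio.h>

int main(void)
{
    double p = 0.90;                 /* assumed parallel fraction of a full build */
    int n[] = { 4, 8, 16, 32 };
    for (int i = 0; i < 4; i++)
        printf("%2d cores: %.1fx\n", n[i], 1.0 / ((1.0 - p) + p / n[i]));
    return 0;
}

For an incremental build the parallel fraction is close to zero, so the box with the fastest single core wins regardless of how many sockets the server has.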

The new i9 looks like a great solution to this quandary. I hope in a couple of weeks to have an 18 core one for my team. It should hopefully be quite a bit faster than the big old Ivy Bridge Xeon on full builds (the base clock is higher), but just as fast as the 6700K on incremental builds. Well worth the $2k for the CPU if so (and total cost of about $4k for the whole box, way less than the name brand dual socket Xeon "real" server).
Stephen Fuld
2017-11-14 20:52:46 UTC
Permalink
Post by Bruce Hoult
Yes, except that servers fit with embedded, not with laptops/desktops. Typical servers are *mostly* about total throughput of many independent tasks (and total power dissipation), not about latency on any one task, assuming you're in the same ballpark at least. Xeons with lots of cores are running at half the MHz compared to desktops.
Yes. Looking at the paper, it seems that Intel won most of the single
thread comparisons. Given their primary market, it is quite reasonable
for Intel to concentrate on high single thread performance. So perhaps
the answer is for Intel to do a better job of "refrigerator engineering"
to better dissipate the heat in multi core mode to allow for higher
clock rates in that mode. That won't eliminate the X86 tax, but I
suspect there is room for improvement if Intel were to concentrate on it.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Terje Mathisen
2017-11-15 07:28:10 UTC
Permalink
Post by Stephen Fuld
Post by Bruce Hoult
Yes, except that servers fit with embedded, not with
laptops/desktops. Typical servers are *mostly* about total
throughput of many independent tasks (and total power dissipation),
not about latency on any one task, assuming you're in the same
ballpark at least. Xeons with lots of cores are running at half the
MHz compared to desktops.
Yes. Looking at the paper, it seems that Intel won most of the
single thread comparisons. Given their primary market, it is quite
reasonable for Intel to concentrate on high single thread
performance. So perhaps the answer is for Intel to do a better job
of "refrigerator engineering" to better dissipate the heat in multi
core mode to allow for higher clock rates in that mode. That won't
eliminate the X86 tax, but I suspect there is room for improvement if
Intel were to concentrate on it.
I think laptops are much closer to servers, in that performance/watt is
much more critical than peak performance.

Since a server has a much higher total power budget, it can use a _lot_
more cores than a laptop, but each core is still quite similar.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Stephen Fuld
2017-11-15 08:06:58 UTC
Permalink
Post by Terje Mathisen
Post by Bruce Hoult
Yes, except that servers fit with embedded, not with
laptops/desktops. Typical servers are *mostly* about total
throughput of many independent tasks (and total power dissipation),
not about latency on any one task, assuming you're in the same
ballpark at least. Xeons with lots of cores are running at half the
MHz compared to desktops.
Yes.  Looking at the paper, it seems that Intel won most of the
single thread comparisons.  Given their primary market, it is quite
reasonable for Intel to concentrate on high single thread
performance.  So perhaps the answer is for Intel to do a better job
of "refrigerator engineering" to better dissipate the heat in multi
core mode to allow for higher clock rates in that mode.  That won't
eliminate the X86 tax, but I suspect there is room for improvement if
Intel were to concentrate on it.
I think laptops are much closer to servers, in that performance/watt is
much more critical than peak performance.
Since a server has a much higher total power budget, it can use a _lot_
more cores than a laptop, but each core is still quite similar.
Yes, but. The number of cores wasn't that much different. I think they
even said somewhere in the paper that the difference was the frequency
when all cores were active.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Stefan Monnier
2017-11-15 16:47:30 UTC
Permalink
I think laptops are much closer to servers, in that performance/watt is much
more critical than peak performance.
I think a laptop with 256 cores running at 500MHz will not be very
successful, no matter how low its power consumption, whereas that could
still be competitive for a server if the power consumption is
sufficiently low.


Stefan
Bruce Hoult
2017-11-15 18:52:07 UTC
Permalink
Post by Stefan Monnier
I think laptops are much closer to servers, in that performance/watt is much
more critical than peak performance.
I think a laptop with 256 cores running at 500MHz will not be very
successful, no matter how low its power consumption, whereas that could
still be competitive for a server if the power consumption is
sufficiently low.
Also, laptops are mostly running at 1% CPU utilisation, with the screen using more power than the CPU. They need to burst to high speed for a few seconds, but if someone wants to max them out continuously then they'll plug them in.
Stephen Fuld
2017-11-20 19:09:19 UTC
Permalink
Post by Bruce Hoult
Yes, except that servers fit with embedded, not with laptops/desktops.
Typical servers are *mostly* about total throughput of many
independent tasks (and total power dissipation), not about latency on
any one task, assuming you're in the same ballpark at least. Xeons
with lots of cores are running at half the MHz compared to desktops.
Yes.  Looking at the paper, it seems that Intel won most of the single
thread comparisons.  Given their primary market, it is quite reasonable
for Intel to concentrate on high single thread performance.  So perhaps
the answer is for Intel to do a better job of "refrigerator engineering"
to better dissipate the heat in multi core mode to allow for higher
clock rates in that mode.  That won't eliminate the X86 tax, but I
suspect there is room for improvement if Intel were to concentrate on it.
Sorry for following up my own post, but this seemed to be better than
starting a new thread.

I spent some time searching around about this, and there seemed to be
a lot of discussion about a change that Intel made regarding the IHS
(Integrated Heat Spreader) in some of their chips. The thought is that
this may have contributed to poorer heat dissipation, and thus some
thermal throttling. Searching for information on that, I came across
the following:

https://overclocking.guide/the-truth-about-cpu-soldering/

I don't have the knowledge to know if this makes sense and could be the
cause for poorer multi-core performance. Can someone who knows about
this comment?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Melzzzzz
2017-11-20 19:38:45 UTC
Permalink
Post by Stephen Fuld
Post by Bruce Hoult
Yes, except that servers fit with embedded, not with laptops/desktops.
Typical servers are *mostly* about total throughput of many
independent tasks (and total power dissipation), not about latency on
any one task, assuming you're in the same ballpark at least. Xeons
with lots of cores are running at half the MHz compared to desktops.
Yes.  Looking at the paper, it seems that Intel won most of the single
thread comparisons.  Given their primary market, it is quite reasonable
for Intel to concentrate on high single thread performance.  So perhaps
the answer is for Intel to do a better job of "refrigerator engineering"
to better dissipate the heat in multi core mode to allow for higher
clock rates in that mode.  That won't eliminate the X86 tax, but I
suspect there is room for improvement if Intel were to concentrate on it.
Sorry for following up my own post, but this seemed to be better than
starting a new thread.
I spent some time searching around about this, and there seemed to be a
lot of discussion about a changed that Intel made regarding the IHS
(Internal Heat Sink) in some of their chips. The thought is that this
may have contributed to poorer heat dissipation, this some thermal
https://overclocking.guide/the-truth-about-cpu-soldering/
I don't have the knowledge to know if this makes sense and could be the
cause for poorer multi-core performance. Can someone who knows about
this comment?
I follow an overclockers' forum; delidding Intel CPUs brings
temperatures down somewhat. But my i7 4790 with the stock cooler
overheats with FMA3 and data in L1 cache when running all 8 threads,
*even with the stock frequency reduced by 200 MHz*. Normal temperature
is around 75C when running all threads at stock frequency.
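A minimal sketch of the kind of kernel that triggers this: FMA throughput on an L1-resident buffer, one copy per thread. This is my own illustration (compile with gcc -O2 -mavx2 -mfma), not the exact code I ran:

/* FMA/L1 heat-load sketch.  The buffer stays in L1D; several
   independent accumulator chains keep the FMA units busy. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    static float buf[4096] __attribute__((aligned(32)));  /* 16 KiB < L1D */
    for (int i = 0; i < 4096; i++) buf[i] = 1.0f;

    __m256 a0 = _mm256_setzero_ps(), a1 = _mm256_setzero_ps();
    __m256 a2 = _mm256_setzero_ps(), a3 = _mm256_setzero_ps();
    __m256 k  = _mm256_set1_ps(1.000001f);

    for (long it = 0; it < 1000000L; it++) {     /* raise this to sustain the load */
        for (int i = 0; i < 4096; i += 32) {
            a0 = _mm256_fmadd_ps(_mm256_load_ps(buf + i),      k, a0);
            a1 = _mm256_fmadd_ps(_mm256_load_ps(buf + i + 8),  k, a1);
            a2 = _mm256_fmadd_ps(_mm256_load_ps(buf + i + 16), k, a2);
            a3 = _mm256_fmadd_ps(_mm256_load_ps(buf + i + 24), k, a3);
        }
    }
    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_add_ps(a0, a1),
                                        _mm256_add_ps(a2, a3)));
    printf("%g\n", out[0]);   /* keep the compiler from optimising it away */
    return 0;
}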
--
press any key to continue or any other to quit...
Anton Ertl
2017-11-23 06:17:39 UTC
Permalink
Post by Stephen Fuld
I spent some time searching around about this, and there seemed to be a
lot of discussion about a changed that Intel made regarding the IHS
(Internal Heat Sink) in some of their chips. The thought is that this
may have contributed to poorer heat dissipation, this some thermal
https://overclocking.guide/the-truth-about-cpu-soldering/
With all these complications about soldering the integrated heat
spreader (IHS) (which is made of copper), two questions come to mind:

1) Given that the solder is 1mm thick to avoid too much thermal
stress, I would expect that using a liquid thermal interface material
while reducing the distance from the IHS to the die to, say, 0.01mm
would give better cooling (the Indium solder is only 8-16 times better
than liquid TIMs, not 100 times). And indeed, from what I read, Intel
now uses a thermal paste, not solder, even for the (484mm^2) HCC Core
i9 CPUs.
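A quick sketch of the numbers behind point 1; the conductivities and the die area are rough assumptions, not datasheet values:

/* Bond-line thermal resistance: R = t / (k * A).
   k values and die area are rough assumptions. */
#include <stdio.h>

int main(void)
{
    double area     = 150e-6;  /* ~150 mm^2 die, assumed */
    double k_solder = 80.0;    /* indium solder, W/(m*K), approx. */
    double k_tim    = 8.0;     /* liquid TIM, W/(m*K), ~10x worse, approx. */
    double t_solder = 1e-3;    /* 1 mm bond line */
    double t_tim    = 1e-5;    /* 0.01 mm bond line */

    double r_s = t_solder / (k_solder * area);
    double r_t = t_tim    / (k_tim    * area);
    printf("solder: %.3f K/W, thin TIM: %.4f K/W\n", r_s, r_t);
    printf("at 100 W: %.1f K vs %.1f K across the joint\n",
           100 * r_s, 100 * r_t);
    return 0;
}

The 100 times thinner bond line wins even though the material conducts roughly 10 times worse, which is the point.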

2) Why bother with the IHS at all? Just put the base plate of the
cooler (made of copper) directly on the die (with a thin layer of
liquid TIM), as we did in the good old days of Athlon and Pentium III,
and you get rid of the thermal resistance of the solder and the IHS.
What has changed that brought us IHSs on desktop and server CPUs (GPUs
and laptop CPUs do without IHS to this day)?

[Many-core CPUs]
Post by Stephen Fuld
I don't have the knowledge to know if this makes sense and could be the
cause for poorer multi-core performance. Can someone who knows about
this comment?
If you have ~20W/core on a 4-core CPU with idle GPU, you can get
higher clocks than if you only have 6W/core on a fully loaded 28-core
CPU. And even the 4-core CPUs get higher clocks if only one or two of
the CPUs are active. Conversely, if you try to drive a, say, 28-core
CPU to the same clock rates on all cores as a 4-core CPU, this would
burn >560W for the whole CPU. We have seen some of this in the tests
of the Core i9s, which consumed way more power than their TDP thanks
to loose default BIOS settings. I also remember that in some test
such a Core i9 ran into thermal throttling with such too-high settings
and performed better at more conservative settings; less thermal
resistance between die and cooler would have avoided or reduced this
problem.

Can >560W from 698mm^2 (the size of Intel's 28-core die) be cooled
away? Would it be economical? Or do most server buyers favour
performance/W over raw performance? It seems that the latter is the
case.
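Spelling out that arithmetic; the per-core figures and the die size are the ones quoted above:

/* Per-core power budget vs. total package power; numbers from the post. */
#include <stdio.h>

int main(void)
{
    int    cores   = 28;
    double die_mm2 = 698.0;             /* 28-core die size */
    double desktop_w_per_core = 20.0;   /* ~20 W/core, lightly loaded 4-core part */
    double server_w_per_core  = 6.0;    /* ~6 W/core, fully loaded 28-core part */

    printf("desktop clocks: %.0f W total, %.2f W/mm^2\n",
           cores * desktop_w_per_core, cores * desktop_w_per_core / die_mm2);
    printf("server clocks:  %.0f W total, %.2f W/mm^2\n",
           cores * server_w_per_core, cores * server_w_per_core / die_mm2);
    return 0;
}

That ~0.8 W/mm^2 at desktop clocks is what would have to be pulled through the TIM and IHS discussed earlier in the thread.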

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
paul wallich
2017-11-23 13:30:57 UTC
Permalink
On 11/23/17 1:17 AM, Anton Ertl wrote:
[...]
Post by Anton Ertl
If you have ~20W/core on a 4-core CPU with idle GPU, you can get
higher clocks than if you only have 6W/core on a fully loaded 28-core
CPU. And even the 4-core CPUs get higher clocks if only one or two of
the CPUs are active. Conversely, if you try to drive a, say, 28-core
CPU to the same clock rates on all cores as a 4-core CPU, this would
burn >560W for the whole CPU. We have seen some of this in the tests
of the Core i9s, which consumed way more power than their TDP thanks
to loose default BIOS settings. I also remember that in some test
such a Core i9 ran into thermal throttling with such too-high settings
and performed better at more conservative settings; less thermal
resistance between die and cooler would have avoided or reduced this
problem.
Can >560W from 698mm^2 (the size of Intel's 28-core die) be cooled
away? Would it be economical? Or do most server buyers favour
performance/W over raw performance? It seems that the latter is the
case.
What's been published suggests you could do that with serious liquid
cooling, but scaling might get interesting.

Anton Ertl
2017-11-15 09:31:46 UTC
Permalink
PowerPC and AArch64 are closer in ISA similarities (RISC, 32 GP
registers, fixed length 32 bit encodings, robust/fat instruction set,
rigid but not too wide of SIMD units). AArch64 is like a re-encoded
PowerPC
Yes, when I saw the description, I got the impression that AArch64 is
closer to PowerPC than to ARM.
which gained 5%-15% in code density and dropped some legacy baggage
yet all of a sudden this is enough to be competitive and even have a
"real advantage"?
The advantage seems to be in the business model and the current
implementation situation, not the architecture per se. It's ARM's
business to sell licenses for the architecture and for cores to
semiconductor manufacturers, and apparently the cores are good enough
for many, but weak enough that some feel the need to improve on them
(either by building from scratch, like Apple, or by improving the
existing cores). There is also an established application area in
mobile phones etc., and the server market with its focus on
performance/W looks like it is compatible with the strengths of
existing ARM implementations.

By contrast, IBM and Mot^H^HFree^H^H^H^H NXP have focused more on
making and selling implementations, and licensing to others has not
been pursued much for quite some time. Recently IBM is trying to
change this, but it's a little late. Also, IBM itself is a strong
competitor in the server marketplace, and NXP in the embedded space*,
making this ecosystem not very attractive for newcomers.

* However, NXP now seems to be focussing more on ARM and considers
PowerPC legacy; there must be something in the ARM ecosystem that
makes it very attractive despite the ARM tax.

As for RISC-V, maybe the lack of the ARM tax will make it attractive,
or maybe ARM provides enough infrastructure to make paying the tax
worthwhile.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Bruce Hoult
2017-11-15 10:23:22 UTC
Permalink
Post by Anton Ertl
As for RISC-V, maybe the lack of the ARM tax will make it attractive,
or maybe ARM provides enough infrastructure to make paying the tax
worthwhile.
Yes. Only time will tell, not discussions here.

One thing worth mentioning: many people fear that using an open source thing with no owners may leave them open to litigation in the future, with no Intel or ARM or IBM to defend them.

The RISC-V Foundation now has over 100 members, some individuals, but a number of companies. I'll list just a few names: Google, IBM, MicroSemi, NVIDIA, Qualcomm, Rambus, Western Digital, AMD, BAE Systems, Espressif, Lattice, Syntacore, Huawei, Micron, NXP, Samsung, Siemens, GLOBALFOUNDRIES, MediaTek, Seagate, SEGGER.

Being a member of the foundation doesn't necessarily mean that company plans to use RISC-V. It doesn't even mean they've paid very much.

One thing it *does* mean is that they have signed a declaration that RISC-V does not infringe any of their patents.
Bruce Hoult
2017-11-16 12:05:02 UTC
Permalink
Post by Bruce Hoult
Post by Anton Ertl
As for RISC-V, maybe the lack of the ARM tax will make it attractive,
or maybe ARM provides enough infrastructure to make paying the tax
worthwhile.
Yes. Only time will tell, not discussions here.
One thing worth mentioning: many people fear that using an open source thing with no owners may leave them open to litigation in the future, with no Intel or ARM or IBM to defend them.
The RISC-V Foundation now has over 100 members, some individuals, but a number of companies. I'll list just a few names: Google, IBM, MicroSemi, NVIDIA, Qualcomm, Rambus, Western Digital, AMD, BAE Systems, Espressif, Lattice, Syntacore, Huawai, Micron, NXP, Samsung, Siemens, GLOBALFOUNDRIES, MediaTek, Seagate, SEGGER.
Being a member of the foundation doesn't necessarily mean that company plans to use RISC-V. It doesn't even mean they've paid very much.
One thing is *does* mean is they have signed a declaration that RISC-V does not infringe any of their patents.
In other news: the RISC-V Linux kernel patches were merged about 16 hours ago, and will be in 4.15 (rather than only in a forked version, as until now). That should be released in January. glibc with RISC-V support should be released soon after. This follows binutils and gcc six months ago.

Fedora (and others) are ready to do official builds/releases when that happens (they've had unofficial for a year). And hopefully SiFive's quad core 1.5 GHz dev board will be out about the same time too.
Andy Valencia
2017-11-16 14:39:36 UTC
Permalink
Post by Bruce Hoult
In other news: the RISC-V Linux kernel patches were merged about 16
hours ago,
...
And hopefully SiFive's
quad core 1.5 GHz dev board will be out about the same time too.
I, for one, am very ready to sacrifice some speed, and even reliability,
if I can start to move towards transparent and trustworthy platforms.
TBD is if SiFive can deliver, or if they will fall to the (apparently)
implacable call of the Binary Blobs and Secret Servers....

Andy
David Brown
2017-11-15 13:31:28 UTC
Permalink
Post by Anton Ertl
PowerPC and AArch64 are closer in ISA similarities (RISC, 32 GP
registers, fixed length 32 bit encodings, robust/fat instruction set,
rigid but not too wide of SIMD units). AArch64 is like a re-encoded
PowerPC
Yes, when I saw the description, I got the impression that Aarch64 is
closer to PowerPC than to ARM.
which gained 5%-15% in code density and dropped some legacy baggage
yet all of a sudden this is enough to be competitive and even have a
"real advantage"?
The advantage seems to be in the business model and the current
implementation situation, not the architecture per se. It's ARMs
business to sell licenses for the architecture and for cores to
semiconductor manufacturers, and apparently the cores are good enough
for many, but weak enough that some feel the need to improve on them
(either by building from scratch, like Apple, or by improving the
existing cores). There is also an established application area in
mobile phones etc., and the server market with its focus on
performance/W looks like it is compatible with the strengths of
existing ARM implementations.
By contrast, IBM and Mot^H^HFree^H^H^H^H NXP have focused more on
making and selling implementations, and licensing to others has not
been pursued much for quite some time. Recently IBM is trying to
change this, but it's a little late. Also, IBM itself is a strong
competition in the server marketplace, and NXP in the embedded space*,
making this ecosystem not very attractive for newcomers.
* However, NXP now seems to be focussing more on ARM and considers
PowerPC legacy; there must be something in the ARM ecosystem that
makes it very attractive despite the ARM tax.
PowerPC has been popular in three areas of use. One is for processors
in workstations and Macs - that is long gone. Another is for network
processors, in high end firewalls, switches, routers, etc. That market
has always had a lot of competition from MIPS, and now ARM is taking
over. The final market is automotive microcontrollers - PPC
microcontrollers have traditionally been very solid and reliable, and
faster than most alternative cores. Again, this is shrinking.

Part of the reason for the drop in market share for PPC in embedded
systems is perception, part is due to Freescale/NXP strategy, part is
the tools, and part is the hardware. As has been noted in this thread,
PPC is not unlike AArch64 - it is big and complex. But PPC cores in the
automotive and industrial embedded world don't compete against 64-bit
ARM processors - they compete against 32-bit ARM microcontroller cores.

I have used a number of PPC based microcontroller devices over the years
- they are good devices, but they are difficult to use and complex to
work with in comparison to equally capable ARM Cortex-M or Cortex-R
devices. And for some aspects in embedded systems, such as interrupt
handling, PPC microcontrollers are terrible - and that is after getting
/much/ better in recent years. The number of people familiar with these
devices is /tiny/ in comparison to the number that are familiar with ARM
microcontrollers. This in turn means that the tools available are
expensive and/or limited and/or old-fashioned, and the software support
of operating systems, communications stacks, etc., is far weaker than in
the ARM world.
Post by Anton Ertl
As for RISC-V, maybe the lack of the ARM tax will make it attractive,
or maybe ARM provides enough infrastructure to make paying the tax
worthwhile.
- anton
Bruce Hoult
2017-11-14 11:11:45 UTC
Permalink
Post by William Edwards
https://blog.cloudflare.com/arm-takes-wing/
My workloads in DB and SMSC world are actually very similar, although I don't see much golang.
To return to the subject of this thread, Cray just announced that their highest-end machine will be based on Cavium ThunderX2.

https://www.nextplatform.com/2017/11/13/cray-arms-highest-end-supercomputer-thunderx2/