It's strange the extent to which programming interview questions reflect an 80s view of the cost of operations, particularly the overabundance of linked list and binary tree questions. Cache misses ain't free and memory scans are relatively cheap after you do the initial lookup.
Yep! The big thing people don't realize is that memory operations are the biggest cost in terms of both time and energy.
While it takes ~100 picojoules to do a double precision floating point operation on an Ivy Bridge Intel processor, it takes 4200 picojoules to move those 64 bits from DRAM to your registers. Most people assume that the huge power usage is because you need to move data from off the chip, but the reality (and a surprising fact to most people) is that over 60% (~2500 picojoules) of the energy cost of moving the data is consumed by the on-chip cache hierarchy. That doesn't mean the SRAM caches themselves, but all the additional logic that makes them hardware managed (TLBs, etc.) and gives you functionality like virtual memory translation, cache coherency, etc.
Getting rid of all of that cruft that has been added since the 80s to make programmers' lives easier would actually reduce power consumption and latency significantly... My startup is working on that problem by removing all of that additional logic from the hardware and instead having it managed at compile time. The best thing, though, would be having programmers really think about locality when writing their programs.
There are two (major) things that we are addressing with the compiler. The first is the traditional VLIW problem, which is the same problem Itanium faced, and the second is doing fully software managed memory, which no (non-academic) architecture has attempted to solve with a compiler.
1) As for the traditional VLIW problem, the simplified explanation of why it is difficult for most systems is that it is hard to know exactly when a functional unit will actually receive/have access to a piece of data (due to data hazards, latency in the memory system, or many other factors). We solve this at the hardware level by being the first architecture able to guarantee the latency between any two locations in memory. Once you can guarantee this in hardware, your compiler has a lot more information to make decisions with and does not need to needlessly insert nops that hurt performance.
2) When it comes to software memory management, we have some new proprietary techniques for determining memory usage at compile time, plus runtime tools. For obvious reasons I can't go into too much detail on how they work, but we will be publishing on it in the near future.
To summarize, we think the reason others have never made a "sufficiently smart compiler" is that the hardware never gave enough information to the compiler and vice versa. We decided to forgo virtual memory (which we think is unnecessary) and instead opted to have all of our cores access a shared memory space, which both simplifies the hardware and makes memory mapping easier for the compiler. Hardware features that guarantee the latency for both operating on and moving data, along with the entire system being non-blocking, are what really give our compiler the information necessary to efficiently pipeline things.
Our initial target markets are those where programs would either be running on the bare metal (the basic program instructions running right on the cores, like an embedded system) or at most a pretty basic RTOS. From the bare metal standpoint, we can still have memory segmentation just like any other system... I would say it is even easier for the compiler to do that on our system due to the fact that all of the physical memory addresses are part of a single global memory map.
So the target market is .. high mips/watt microcontroller or DSP?
If you have fully software managed memory it sounds like any binary running on the system has full access to any other memory? This is kind of the opposite of ARM "TrustZone".
Edit: I'm just asking these questions because novel architectures tend to sink without trace and the small-system world is currently dominated by ARM. You need a real "wow" factor to get people to change their tooling.
Our focus has been on floating point performance. Originally we were targeting high performance computing, but have since expanded to high end DSP applications (Think mobile base station processing for LTE-Advanced and "5G").
For memory protection, the most traditional way would be leaving it up to an RTOS or microkernel. Something very small and verifiably secure like seL4 is what we want to port.
We have not made it a huge priority to start off with, as our customers have a small number of applications that are being ported and are isolated on the system. As each application needs to be recompiled for our architecture, we think that memory segmentation done at compile time is good enough to start with (in these limited cases).
I'm not talking about microcontrollers in that case... Our chip has scratchpad for each core (128KB times 256 cores) with single cycle latency, along with attached DRAM. We have a whole lot more on chip memory that is 4x faster while using 1/5th of the power (on memory operations alone). Overall, we are aiming for a 10 to 25x energy efficiency improvement over CPU and GPUs for floating point heavy applications.
>managing scratch memory statically can only work for very high level (and very domain specific) languages. Hardly possible for C.
Well, we've solved that. Looking forward to sharing that in the near future.
If you really did, you've got much, much more than merely optimising a memory access. Consequences for the static analysis can be enormous. Looking forward to seeing your publication.
I hope you realize that managing memory latency has been the most fundamental motivation throughout Intel for 25 years. It isn't the kind of problem that yields to a mere revelation about the problem itself.
I speed up programs all the time by reorganizing how memory is laid out and accessed. Many times by factors of 12x or more.
I would think that making SIMD, parallelism, and multiple simple loops instead of one bigger loop much easier to program around would be much more realistic. Something like a fusion of ISPC, Rust, C++11, and Julia.
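As a toy illustration of that kind of layout reorganization (names and data are made up, and Python only shows the shape of the transformation, not the actual cache behavior): converting an array-of-structs into a struct-of-arrays, so a loop that reads one field streams through a single contiguous buffer instead of hopping between scattered records.

```python
from array import array

# Array-of-structs: each "particle" is its own object; the x fields of
# different particles are scattered across the heap.
particles_aos = [{"x": float(i), "y": 2.0 * i, "mass": 1.0} for i in range(1000)]

# Struct-of-arrays: each field lives in its own contiguous buffer, so a
# loop that only needs `x` touches one dense array of doubles.
particles_soa = {
    "x": array("d", (float(i) for i in range(1000))),
    "y": array("d", (2.0 * i for i in range(1000))),
    "mass": array("d", (1.0 for _ in range(1000))),
}

def total_x_aos(ps):
    return sum(p["x"] for p in ps)

def total_x_soa(ps):
    return sum(ps["x"])

# Same answer either way; only the memory layout differs.
assert total_x_aos(particles_aos) == total_x_soa(particles_soa)
```

In C or C++ the same transformation is what makes SIMD and prefetch-friendly loops fall out naturally.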
First off, I was referring primarily to memory latency to L1 cache, which has improved over the past 2.5 decades only through the combination of Moore's law getting the wires shorter (which is going to end soon, at least for silicon) and increasing clock speed (which really ended with the breakdown of Dennard scaling a little over 10 years ago). Intel's L1 cache latency has not improved in almost 10 years, with it still at 4 cycle latency (at best). The improvement has only been that there is more data you can access at L1, but the time to data hitting your registers has not improved at all.
Our scratchpad (the analogous term for software managed memory, in comparison to a traditional hardware managed L1/L2/L3 cache system) for instance has single cycle latency along with zero bus turnaround. Along with our ability to guarantee memory latencies between any locations in memory, our whole goal is to try to never have a wasted cycle.
>I would think that making SIMD, parallelism, and multiple simple loops instead of one bigger loop much easier to program around would be much more realistic. Something like a fusion of ISPC, Rust, C++11, and Julia.
What about functional programming? IMO the biggest benefit from programming without state is that order of execution does not matter. Thus programs can be parallelized trivially. Under the hood you end up with "multiple simple loops" without really even trying. I think when more people catch onto this, we're going to see a rise in functional language usage because of how easy it makes parallelism.
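A minimal sketch of that order-independence, assuming nothing beyond the standard library: a pure function can be mapped sequentially or across a thread pool, and the result is identical because no call touches shared state.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Pure: no shared state, no side effects, output depends only on input.
    return n * n

data = list(range(100))

# Sequential, in program order.
sequential = [square(n) for n in data]

# Parallel: the executor may run the calls in any order on any thread,
# but map() reassembles results in input order, and purity guarantees
# each individual result is the same regardless of scheduling.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))

assert sequential == parallel
```

The same property is what lets functional runtimes split one big loop into "multiple simple loops" behind the scenes.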
Those numbers are amazing, do you perhaps have a citation? I'm a machine learning researcher and would love to include them in talks about data-aware algorithms.
This is also true for GPUs - in cryptocurrency mining every memory access equals more money spent on power. If you can make equivalent (in terms of throughput) compute/memory trade-offs, you always go compute.
And it's not reasonable to abandon trees and graphs and everything just for the sake of cache locality. Algorithm first, CPU optimization second. Especially because you can control allocation very easily with something like an object pool, which will minimize cache misses.
You basically shouldn't use vanilla _binary_ trees ever. b-trees with appropriate node sizes provide much better cache locality without sacrificing any of the nice properties trees have. They are easier to implement too. You don't have to go full van Emde Boas layout trickery to get most of the benefits.
I think most such interview questions are informed by CS curricula, which also have only changed modestly since 1980 (some new languages and tools). The elaborate pointer-chasing structures available then have basically sufficed and are in many instances less useful now.
It's not really that hard. Open addressing + linear probing + magic value for deletions is something you can implement in 15 minutes if you know how hashmaps work.
Of course "cache friendly" is relative; any hashmap has pseudorandom memory accesses as a core part of its design so again, array scans will beat it below a certain number of elements.
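A rough Python sketch of the hashmap described above (fixed power-of-two capacity, no resizing, so it's only a toy; the names are made up): open addressing with linear probing, plus a tombstone sentinel as the "magic value for deletions" so probe chains stay intact.

```python
EMPTY = object()      # slot never used
DELETED = object()    # tombstone: the "magic value for deletions"

class FlatMap:
    """Open addressing + linear probing. Toy: no resize, assumes load < 1."""
    def __init__(self, capacity=16):
        self.keys = [EMPTY] * capacity
        self.vals = [None] * capacity
        self.mask = capacity - 1  # capacity must be a power of two

    def _probe(self, key):
        i = hash(key) & self.mask
        while True:
            yield i
            i = (i + 1) & self.mask  # linear probe: next slot, wrap around

    def put(self, key, val):
        slot = None
        for i in self._probe(key):
            if self.keys[i] is DELETED:
                if slot is None:
                    slot = i                 # remember first tombstone
            elif self.keys[i] is EMPTY:
                if slot is not None:
                    i = slot                 # reuse the tombstone slot
                self.keys[i], self.vals[i] = key, val
                return
            elif self.keys[i] == key:
                self.vals[i] = val
                return

    def get(self, key, default=None):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:        # never-used slot ends the probe
                return default
            if self.keys[i] is not DELETED and self.keys[i] == key:
                return self.vals[i]

    def delete(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:
                return
            if self.keys[i] is not DELETED and self.keys[i] == key:
                self.keys[i] = DELETED       # tombstone keeps probe chains intact
                return
```

The tombstone matters: deleting a key in the middle of a probe chain must not make later keys in that chain unreachable.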
What interested me in the design of the Mill CPU is how it throws out the usual design of machine language.
I'm not talking about assembly language, and I know the difference, BTW.
In the name of software compatibility, we're still trying to program CPUs using machine language that wouldn't be so strange to a programmer from the 1980's. Sure, there's more registers, and some fun new stuff, but it isn't all that different.
Except that in the 1980's, the CPU actually implemented those instructions. These days, it is all a lie, especially with regards to things like register sets and aliasing. Yes, of course, logically, what the programmer wanted to happen does, but today even programming at assembly level, you are far, far removed from what the CPU is actually doing.
You say this as if it is a bad thing (or am I misinterpreting here?), but compatibility is enormously valuable. That's why the strategy of choosing compatibility over cleanliness of architecture is so widespread in successful complex systems - ISAs, OSs, the Web, programming languages, etc etc.
It's hard to love the resulting complexities, but remaining compatible really is almost always the right thing to do.
If we were in the habit of recompiling from source with each CPU generation (this presumes source is available), that would allow much more innovation in machine language. And then the design of that machine language would more closely resemble the actual design of the CPU.
But there are such significant advantage to remaining binary compatible, that I'm not surprised with how things actually turned out.
This is an interesting design question - which layer should be backwards compatible and under which significant freedom of design is allowed.
I'm not at all sure that the best answer is "source code". For x86 CPUs the compatible layer has been the ISA and that has turned out pretty damn well, suggesting that source compatibility is not the only and may not be the most effective choice.
Particularly if the source is C, which is rather ambiguous compared to most machine languages.
> I'm not at all sure that the best answer is "source code". For x86 CPUs the compatible layer has been the ISA and that has turned out pretty damn well, suggesting that source compatibility is not the only and may not be the most effective choice.
I disagree, though I see your point.
If the x86 instruction set had been designed to be implemented in various ways depending on the available semiconductor technology, then that would be a much better situation. However, the basics of x86 were designed in the 1980s to be directly implemented in the hardware of the time.
If you were designing an ISA expressly to insulate software from the hardware, I'm sure there are dozens of design choices that would be made differently. Ditto for amd-64.
I think most of the opportunity for "invisible" performance improvement today is below the ISA -- new branch predictors, new prefetchers, smarter cache hierarchy eviction/replacement heuristics. These are problems that are fundamental to the code: predicting dynamic behavior and having the right data nearby. (At least, this is what I saw again and again while staring at pipeline traces while working on a commercial microarchitecture.) A new ISA won't fix that.
Or maybe said in a different way, even if we recompile from C to a new ISA, we still have to execute the code that (by virtue of its algorithm) has unpredictable branches and lots of cache misses.
I do think there's room for innovation at a higher semantic level, e.g. maybe a really fast thread message-passing/synchronization mechanism with a lot of hardware thread contexts. Or lots of other ideas, e.g. "helper threads" (user-provided speculative prefetch code) or "informing memory operations" (load-and-branch-on-cache-miss). (All of these come out of comparch research in the 90's and 00's.) But all of those would require programmer cooperation, since they introduce new semantics.
It is worth noting that many of the Mill features are also a "lie" (for example, the belt is still just registers and register renaming). It's just that the Mill uses lies designed for modern processor technology.
The belt isn't an abstraction of the structures in a modern OoO CPU corresponding to registers and renaming, but rather the bypass network further down the pipeline. The belt, like the bypass network, only has to store data for limited periods, and the translation is much simpler than what an OoO CPU has to do when renaming registers.
Yes, registers and the belt are just ISA features. Register renaming is a behind the scenes implementation tech to mitigate the problems of register semantics. It has unwanted costs that a belt machine doesn't have to pay, and it's incorrect to say that Mill does register renaming in the traditional sense of the word.
Indeed that's what the #1 paragraph at https://millcomputing.com/docs/belt/ says: "A large fraction of the power budget of modern superscalar CPUs is devoted to renaming registers: the CPU must track the dataflow of the executing program, assign physical registers and map them to the logical registers of the program, schedule operations when arguments are available, restore visible state in the event of an exception—all while avoiding register update hazards."
This is a wonderful post that no-one will care about. This may be the only post.
Today, programmers are more interested in the rate at which they can turn out "Just Works" code. These kinds of details are far, far too deep in the weeds for a continuous-development artist.
All: I had a negative reaction to this comment too at first, but on reflection it reads less like a jab and more like a rueful lament about mainstream programming culture. The fact that mgrennan loved the article and was so perfectly wrong about it not being appreciated here suggests that he or she just hasn't realized yet how much passion this community shares for the craft.
It's depressing to feel like you're the only one who cares, and when one has felt like that for a long time, curmudgeonly biases develop. So mgrennan, please get to know your fellow HNers, who love this stuff. And HN, let's be charitable to mgrennan, who may have been mistaken but whose heart is probably in the right place.
I think it is simply a business thing. For most businesses time to market is more important than performance or correctness. If you prefer working on performance or correctness, try a conservative niche: compilers, realtime, HPC, etc.
Two responses. First, once we find "Just Works" components that we like then they can be optimized. Second, we can use inspiration from articles like this to describe "Just Works" approaches and patterns that "go with the grain" of what Earth's real fabrication capabilities are actually producing.
For example, I found the discussion of how cores coordinate access to main memory on a shared bus to be quite fascinating; an easy insight there was that our programming patterns should support hard data partitions (less shared main memory, more parallel main memory). One naive way to get there is to use N processes where N is something like the number of cores on the machine, and one of them serves as a message router. Something like what `httpd` does.
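A toy sketch of that N-processes-plus-router idea, assuming only the standard library (the function names and the sum-of-squares workload are made up): each worker owns a contiguous partition of the data, so there is no shared writable memory, and the parent acts as the router collecting results.

```python
from multiprocessing import Process, Queue
from os import cpu_count

def partition(data, n):
    """Split data into contiguous chunks so each worker owns its own slice."""
    k = max(1, len(data) // n)
    return [data[i:i + k] for i in range(0, len(data), k)]

def worker(chunk, out):
    # Each worker touches only its private partition: no cache-line
    # ping-pong, no locks; results flow through the router queue.
    out.put(sum(x * x for x in chunk))

def run(data, n=None):
    n = n or cpu_count() or 2
    out = Queue()  # the "router": all results flow through here
    procs = [Process(target=worker, args=(c, out)) for c in partition(data, n)]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    assert run(list(range(1000))) == sum(x * x for x in range(1000))
```

Hard partitions plus message passing is the same shape that `httpd`-style prefork servers and shared-nothing designs use.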
I really wouldn't mind if someone who knows more about the JVM implementation could talk about how and why the JVM threading model is better than native processes, for example, especially in light of memory contention.
> I really wouldn't mind if someone who knows more about the JVM implementation could talk about how and why the JVM threading model is better than native processes, for example, especially in light of memory contention.
The JVM can assume that JVM threads cooperate. The operating system has to assume native processes are hostile.
> This is a wonderful post that no-one will care about. This may be the only post.
I think you are underestimating the crowd here. Last time it was posted it got quite a few responses: https://news.ycombinator.com/item?id=8873250 (already a while back, but might be interesting for reference/to bring topics up again)
There also seems to be a constant barrage of assembly/CPU posts (I counted 4 posts including this one earlier on the front page) , I'm not sure where GP gets the idea that no one here cares.
Seven months ago is well within duplicate range, but let's leave this one up: first to show how much the HN community does care about wonderful posts, and second to thank luu for tirelessly posting (not to mention writing) so many high-quality articles here.
Yes. I should have considered the crowd I was writing for. I'm here. 35+ years of systems work starting with a 4004. I was indicting the new culture of "released and poor is better than not released and good" and "if it runs slow just scale it over more processors". In 1988 I did more in 48k of memory with 500k of disk than can be done in 4G mem and 500M of disk today.
Highly technical posts on HN have a tendency to accrue a lot of upvotes before a discussion develops. I attribute this to (perhaps optimistically) people actually taking the time to read the post. When something is so technically dense, it can also be difficult for people to add anything to it; hence it takes a while before a real discussion starts.
I hate it when blog posts don't include the date. Judging by the linked question this blog post must be at most a few months old, but there was nothing on the page that would tell me that. One of the most important questions is whether the information in the article is still applicable... In this case it is, but it would be nice if readers knew it. Not to mention 10 years from now when somebody stumbles across this writeup. </rant>
It's not a static decision though - the memory accesses can still be reordered when %ebx != %esp, though of course this only ends up visible where there are multiple CPUs involved.
For example, consider the case above and assume that the initial conditions are:
    (%esp) == 0
    (%ebx) == 0
Now imagine we have a second CPU executing simultaneously, with the same %ebx and %esp as the first CPU, but executing this:
    mov $1, (%ebx)
    mov (%esp), %eax
Now if there was no reordering, either one or both CPUs must end with %eax == 1. However, the hoisting of loads before earlier stores means that you can actually end up with both CPUs having %eax == 0 after this executes.
You're correct about it being possible for other CPUs to see "crossed" loads/stores, but within one CPU/stream of instructions the programmer-visible ordering is absolutely preserved, because if it wasn't, a lot of existing software would break. In your example, if ebx == esp, and both CPUs executed those two instructions, then they must both see eax == 1. I think you had this scenario in mind instead (where A and B are different memory locations):
CPU 1:

    mov $1, (A)
    mov (B), %eax

CPU 2:

    mov $1, (B)
    mov (A), %eax

Where eax == 0 on both CPUs is definitely possible.
The point to note is that the decision on whether or not the reordering can occur, based on whether or not A and B are the same or not, is made dynamically at the point of execution.
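One way to see why both zeros are possible is a toy store-buffer model (a deliberately simplified sketch, not a real x86 simulator; all names are made up): each CPU's store sits in a private buffer that has not yet drained to memory when the other CPU's load executes, and loads see their own buffer but not the other CPU's.

```python
# Toy model of x86-style store buffering.
memory = {"A": 0, "B": 0}

class CPU:
    def __init__(self):
        self.store_buffer = {}

    def store(self, addr, val):
        self.store_buffer[addr] = val  # buffered; not yet globally visible

    def load(self, addr):
        # Store-to-load forwarding from own buffer, else main memory.
        return self.store_buffer.get(addr, memory[addr])

    def drain(self):
        memory.update(self.store_buffer)
        self.store_buffer.clear()

cpu1, cpu2 = CPU(), CPU()

# Program order on each CPU is store-then-load, but both stores are
# still sitting in the store buffers when the loads execute.
cpu1.store("A", 1)      # mov $1, (A)
cpu2.store("B", 1)      # mov $1, (B)
eax1 = cpu1.load("B")   # mov (B), %eax -> reads memory, sees 0
eax2 = cpu2.load("A")   # mov (A), %eax -> reads memory, sees 0
cpu1.drain()
cpu2.drain()

assert (eax1, eax2) == (0, 0)  # impossible without store->load reordering
```

When A and B are the same address, store-to-load forwarding kicks in and each CPU sees its own store, which is why the same-address case can't produce this result.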
Intel, IBM, mainframes, and embedded SoCs are all taking, to a degree, the same approach: combining 1-N general-purpose cores with dedicated hardware for performance-critical stuff, or just stuff that shouldn't add overhead. The Octeon line is an extreme example, with accelerators added till they hit around 500. The most modern variant is the "semi-custom" business of Intel and AMD, which is making more of this happen for those with the money.
This is peripheral to an improvement in computers known as network-on-a-chip. This plus extra layers of functionality in silicon lets the companies easily do stuff like that. The next step is incorporating FPGA logic in the processors. We already see it in the embedded scene. Just wait till Intel uses Altera technology in Xeons. SGI's Altix machines with FPGAs using NUMA were already quite powerful. Imagine the same benefit of no remote-memory access for the FPGA logic working side-by-side with CPU software. Will be badass.
This is a question from someone rather ignorant, so please don't hit me: Why didn't the Bulldozer affect the programmers? It (or Piledriver) seems to be doing quite well in applications with good threading.
In general all x86 chips are carefully designed to fulfill the abstraction that is the x86 ISA so to the programmer it shouldn't matter whether their code runs on an Atom or a Bulldozer or an i7.
I don't think "good at executing bad code" applies for embedded CPUs. Of course, good code is still more the province of the compiler (with PGO for example).
Literally everyone uses Intel syntax, except in those situations where they are forced to use AT&T syntax (inline assembly in C on Unix, somehow your box doesn't have NASM). Using AT&T syntax for examples just confuses people. Write assembler the right way. Destination, source. Come on.
The default output format of HotSpot is AT&T. The default output format of GCC is AT&T. The default output format of Clang is AT&T. Tools like the universal compiler output viewer https://gcc.godbolt.org use AT&T by default.
Intel isn't the universally accepted format you think it is. I'm a professional VM researcher and I use AT&T more often than Intel. In fact I most often see Intel when reading Intel documentation.
You're shouting about something no more empirical than tabs vs spaces, and even then I think your side is actually in the minority.
Not only is the all-caps and profanity uncalled for here, but the unfounded assertion that "everyone" uses "Intel syntax" is wrong. The Solaris assembler, as one example, uses the "AT&T syntax". Also, this is reflected in older UNIX APIs such as bcopy, which use src, dest.
Personally, I always preferred src,dest over dest,src but as long as the language is consistent I don't care.