
I find this article to be disingenuous. Yes, C isn't "how a computer really works". Neither is assembly. The way a computer works is based on transistors and the concepts built on top of them (ALUs, for example). However, there is no need to know about any of that, because you're presented with an abstraction (assembly). And that's really what people mean when they say C is closer to how a computer actually works: it's a language with fewer abstractions than many others (most notably, it lacks garbage collection and object-oriented behavior, and has a small runtime). That lack of abstraction means you have to implement those concepts yourself if you want to use them, which gives you an understanding of how those abstractions work in the languages that have them built in.


But in addition to the mismatch between the abstractions provided and the real hardware, C qua C is missing a huge number of abstractions that are how real hardware works, especially if we pick C99 as Steve did, which doesn't have threading since that came later. I don't think it has any support for any sort of vector instruction (MMX and its follow-ons), it doesn't know anything about your graphics card, which by raw FLOPS may well be the majority of your machine's actual power, I don't think C99 has a memory model, and the list of things it doesn't know about goes on a long ways.

You can get to them from C, but via extensions. They're often very thin and will let you learn a lot about how the computer works, but it's reasonable to say that it's not really "C" at that point.

C also has more runtime than most people realize, since it often functions as a de facto runtime for the whole system. It has particular concepts about how the stack works, how the heap works, how function calls work, and so on. It's thicker than you realize because you swim through its abstractions like a fish through water, but, technically, they are not fundamental to how a computer works. A computer does not need to implement a stack and a heap and have copy-based functions and so on. There's a lot more accidental history in C than a casual programmer may realize. If nothing else, compare CPU programming to GPU programming. Even when the latter uses "something rather C-ish", the resemblance to C only goes so deep.

There's also some self-fulfilling prophecy in the "C is how the computer works", too. Why do none of our languages have first-class understanding of the cache hierarchy? Well, we have a flat memory model in C, and that's how we all think about it. We spend a lot of silicon forcing our computers to "work like C does" (the specification for how registers work may as well read "we need to run C code very quickly no matter how much register renaming costs us in silicon"), and every year, that is becoming a slightly worse idea than last year as the divergence continues.


But that's fighting against a straw man. When people advise others to "learn C," they don't mean the C99 specification; they mean C as it is used in the real world. That includes pthreads, SIMD intrinsics, POSIX, and a whole host of other features that aren't really C but that every decent C programmer uses daily.

As you point out, modern architectures are actually designed to run C efficiently, so I'd say that's a good argument in favor of learning C to learn how (modern) computers work. Pascal is at the same level as C, but no one says "learn Pascal to learn how computers work" because architectures weren't adapted according to that language.


> Pascal is at the same level as C, but no one says "learn Pascal to learn how computers work" because architectures weren't adapted according to that language.

They used to say it though, before UNIX derived OSes took over the computing world.


> A computer does not need to implement a stack

What general purpose computer exists that doesn't have a stack? Push/pop have been fundamental to all the architectures I've used.


Hardware stacks are a relatively recent feature (in historical terms) even though subroutine calls go way back. Many systems did subroutine calls by saving the return address in a register. On the IBM 1401, when you called a subroutine, the subroutine would actually modify the jump instruction at the end of the subroutine to jump back to the caller. Needless to say, this wouldn't work with recursion. On the PDP-8, a jump to subroutine would automatically store the return address in the first word of the subroutine.

On many systems, if you wanted a stack, you had to implement it in software. For instance, on the Xerox Alto (1973), when you called a subroutine, the subroutine would then call a library routine that would save the return address and set up a stack frame. You'd call another library routine to return from the subroutine and it would pop stuff from the stack.

The 8008, Intel's first 8-bit microprocessor, had an 8-level subroutine stack inside the chip. There were no push or pop instructions.


I don't have direct experience, but I believe that system/360 programs have historically not had stacks, opting instead for a statically allocated 'program save area'.

https://people.cs.clemson.edu/~mark/subroutines/s360.html

XPLINK, a different linkage convention, also for MVS, does use a stack-based convention:

https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/...


RISC-type architectures with a branch-and-link instruction (as opposed to a jsr- or call-type instruction) generally have a stack by convention only, because the CPU doesn't need one to operate. (For handling interrupts and exceptions there is usually some other mechanism for storing the old program counter.)


Can you point me to a RISC architecture that doesn't have push and pop instructions?


Nearly all of them?

ARM's pop is really a generic ldmia with update. You can use the same instruction in a memcpy.

MIPS, PowerPC, and Alpha don't have anything like push and pop, splitting load/stores and SP increment/decrement into separate instructions.

AArch64 has a dedicated stack pointer, but no explicit push/pop instructions.

In general the RISC style is to allocate a stack frame and use regular loads and stores off of SP rather than push and pop.


ARM64 uses regular loads and stores to access the stack, I believe. It also is one of the architectures with branch-and-link. https://community.arm.com/processors/b/blog/posts/using-the-...


From the perspective of "stack" as an underlying computation structure: Have you ever played a Nintendo game? You've used a program that did not involve a stack. The matter of "stack" gets really fuzzy in Haskell, too; it certainly has something like one because there are functions, but the way the laziness gets resolved makes it quite substantially different from the way a C program's stack works.

If a time traveler from a century in the future came back and told me that their programming languages aren't anywhere as fundamentally based on stacks as ours are, I wouldn't be that surprised. I'm not sure any current AI technology (deep learning, etc.) is stack-based.

From the perspective of "stack" as in "stack vs. heap", there's a lot of existing languages where at the language level, there is no such distinction. The dynamic scripting language interpreters put everything in the heap. The underlying C-based VM may be stack based, but the language itself is not. The specification of Go actually never mentions stack vs. heap; all the values just exist. There is a stack and a heap like C, but what goes where is an implementation detail handled by the compiler, not like in C where it is explicit.

To some extent, I think the replies to my post actually demonstrate my point to a great degree. People's conceptions of "how a computer works" are really bent around the C model... but that is not "how a computer works". Computers do not care if you allocate all variables globally and use "goto" to jump between various bits of code. Thousands if not millions of programs have been written that way, and continue to be written in the embedded arena. In fact, at the very bottom-most level (machines with single-digit kilobytes), this is not just an option, but best practice. Computers do not care if you hop around like crazy as you unwind a lazy evaluation. Computers do not care if all the CPU does is ship a program to the GPU and its radically different paradigm. Computers do not care if they are evaluating a neural net or some other AI paradigm that has no recognizable equivalent to any human programming language. If you put an opcode or specialized hardware into something that really does directly evaluate a neural net without any C code in sight, the electronics do not explode. FPGAs do not explode if you do not implement a stack.

C is not how computers work.

It is a particularly useful local optimum, but nowadays a lot of that use isn't because "it's how everything works" but because it's a highly supported and polished paradigm used for decades that has a lot of tooling and utility behind it. But understanding C won't give you much insight into a modern processor, a GPU, modern AI, FPGAs, Haskell, Rust, and numerous other things. Because learning languages is generally good regardless, yes, learning C and then Rust will mean you learn Rust faster than just starting from Rust. Learn lots of languages. But there's a lot of ways in which you could equally start from Javascript and learn Rust; many of the concepts may shock the JS developer, but many of the concepts in Rust that the JS programmer just goes "Oh, good, that feature is different but still there" will shock the C developer.


> A computer does not need to implement a stack

C doesn't "have", or require, a stack, either. It has automatic variables, and I think I looked once and it doesn't even explicitly require support for recursive functions.


You might be thinking of Fortran? C does require support for recursive functions.


You're right. I searched the C99 standard for "recurs" and found the relevant section that briefly mentions recursive function calls.

That means static allocation is insufficient for an implementation of automatic variables in a conforming C compiler. Nevertheless, I still like to think of it as a valid implementation sometimes. In contemporary practice many stack variables are basically global variables, in the sense that they are valid during most of the program, and they are degraded to stack variables only as a by-product of a (technically unnecessary) splitting of the program into very fine-grained function calls.


> We spend a lot of silicon forcing our computers to "work like C does" (the specification for how registers work may as well read "we need to run C code very quickly no matter how much register renaming costs us in silicon")

Can you elaborate? I thought I knew what register renaming was supposed to do, but I don't see the tight connection between register renaming and C.


Register renaming was invented to give assembler code generated from C enough variables in the form of processor registers, so that calling functions wouldn't incur as much of a performance penalty.

The UltraSPARC processor is a stereotypical example, a RISC processor designed to cater to a C compiler as much as possible: with register renaming, it has 256 virtual registers!


Register renaming is older than C (the first computer with full modern OoOE was the 360/91 from 1964). It has more to do with scheduling dynamically based on the runtime data flow graph than anything about C (or any other high level language).


Agreed with your historical information, but the comment about function calls and calling conventions is not without merit. If you have 256 architectural registers you still can't have more than a couple callee-save registers (otherwise non-leaf functions need to unconditionally save/restore too many registers), and so despite the large number of registers you can't really afford many live values across function calls, since callers have to save/restore them. Register renaming solves this problem by managing register dependencies and lifetimes for the programmer across function calls. With a conventional architecture with a lot of architectural registers, the only way you can make use of a sizable fraction of them is with massive software pipelining and strip mining in fully inlined loops, or maybe with calls only to leaf functions with custom calling conventions, or other forms of aggressive interprocedural optimization to deal with the calling convention issue. It's not a good fit for general purpose code.

Another related issue is dealing with user/kernel mode transitions and context switches. User/kernel transitions can be made cheaper by compiling the kernel to target a small subset of the architectural registers, but a user mode to user mode context switch would generally require a full save and restore of the massive register file.

And there's also an issue with instruction encoding efficiency. For example, in a typical three-operand RISC instruction format with 32-bit instructions, with 8-bit register operand fields you only have 8 bits remaining for the opcode in an RRR instruction (versus 17 bits for 5-bit operand fields), and 16 bits remaining for both the immediate and opcode in an RRI instruction (versus 22 bits for 5-bit operand fields). You can reduce the average-case instruction size with various encoding tricks, cf. RVC and Thumb, but you cannot make full use of the architectural registers without incurring significant instruction size bloat.

To make the comparison fair, it should be noted that register renaming cannot resolve all dependency hazards that would be possible to resolve with explicit registers. You can still only have as many live values as there are architectural registers. (That's a partial truth because of memory.)

There are of course alternatives to register renaming that can address some of these issues (but as you say register renaming isn't just about this). Register windows (which come in many flavors), valid/dirty bit tracking per register, segregated register types (like data vs address registers in MC68000), swappable register banks, etc.

I think a major reason register renaming is usually married to superscalar and/or deeply pipelined execution is that when you have a lot of physical registers you run into a von Neumann bottleneck unless you have a commensurate amount of execution parallelism. As an extreme case, imagine a 3-stage in-order RISC pipeline with 256 registers. All of the data except for at most 2 registers is sitting at rest at any given time. You'd be better served with a smaller register file and a fast local memory (1-cycle latency, pipelined loads/stores) that can exploit more powerful addressing modes.


Why do you claim that local variables and function calls are specific to C? They seem to be very popular among programming languages in general.


Because I'm an assembler coder, and when one codes assembler by hand, one almost never uses the stack: it's easy for a human to write subroutines in such a way that only the processor registers are used, especially on elegant processor designs which have 16 or 32 registers.


I surely did use the stack a lot, back in the old days when coding Z80 and 80x86 Assembly.


You almost had to, because both Z80 and x86 processors have a laughably small number of general purpose registers. To write code which works along that limitation would have required lots of care and cleverness, far more than on UltraSPARC and MC68000.

On MOS 6502 we used self-modifying code rather than the stack because it was more efficient and that CPU has no instruction and data caches, so they couldn't be corrupted or invalidated.


> A computer does not need to implement a stack and a heap and have copy-based functions and so on.

AFAIK intel cpus have a hardware stack pointer


That's correct, but "does not need to implement" and "this kind of computer does implement" are not incompatible statements.


> I find this article to be disingenuous. Yes, C isnt "how a computer really works". Neither is assembly. [...]

He addresses all of that in the subsequent bullet points of his summary, and elaborates on it in the body of the article (your criticism stops at his first sentence of the summary). It goes into a nuanced discussion; it doesn't just make a blanket statement. I don't find it disingenuous at all.


"How a computer works" is both a moving target, and largely irrelevant, as shown by the number of ignorant (not stupid) replies in this very topic. RISC and ARM changed the game in the 90s, just as GP-GPUs did in the 00s and neural and quantum hardware will do in the future.

What's really needed are languages that make it easier to express what you want to accomplish - this is still a huge problem, as evidenced by the fact that even a simple data munging task can be easily explained to people in a few dozen words, but may require hundreds of LOC to convey to the machine...

(BTW, it's time for us to be able to assume that the computer and/or language can do base-10 arithmetic either directly or via emulation. The reasons for exposing binary in our programming languages became irrelevant decades ago - ordinary programmers should never have to care about how FP operations screw up their math, it should just work.)


Thank you.


> The way a computer works is based off of transistors and some concepts built on top of that (ALUs for example). However, there is no need to know about any of that because you're presented with an abstraction (assembly).

I'd argue you should know how that works.

Cache misses, pipelining, and all these subtle performance intricacies that seem to have no rhyme or reason just fall out of understanding exactly how you get shorter clock cycles, the different types of memory, and the tradeoffs they present.

One of the best things I ever did was go build a couple projects on an FPGA, when you get down to that level all of these leaky (performance) abstractions are clear as day.


I'd argue that you shouldn't be driving a car unless you've rebuilt a transmission yourself, but that argument wouldn't go far.


That'd be like saying your end user should understand cache misses, however if you're starting a car design/repair shop you might want to know about how transmissions work.

The primary reason for dropping down to native is to get better performance. If you're going to do that you'll leave 10x-50x performance on the table if you don't understand cache misses, prefetching and the other things that manual memory placement open up.


I'm not going to play any more metaphor/semantic games. It's nice that you did that project, but it's not at all necessary for someone to engage in that in order to understand performance issues.


You're the one that raised the metaphor, but okay?

I'm not saying that you can't do performance work without having done that. Just that you'll be at a disadvantage since you're at the mercy of whatever your HW vendor decides to disclose to you.

If you know this stuff you can work back from first principles. With a high-level memory architecture of a system (say, tiled vs. direct rendering GPU) you can reason about which operations will be fast and which will be slow.


> You're the one that raised the metaphor, but okay?

And your response was absurd. You don't rebuild a transmission in order to run a shop. You don't even rebuild a transmission as an engineer creating cars, you shop that out to an organization specializing in the extremely difficult task of designing and building transmissions. I wanted to avoid this waste of time, but here we are.

As for the rest of your comment about reasoning about performance, none of that requires the work you did. Again, neat project (for you), but completely unnecessary in general.


It would be valid, though. A computer programmer must understand how a computer works lest she or he write slow, bloated, inefficient software.


Given how many person-years have been saved and how much value has been produced by "slow, bloated, inefficient software", I must disagree in the strongest possible terms. Producing "slow, bloated, inefficient software" is far, far preferable to not producing software at all.


I would rather have no software or write the software myself than waste all the time of my life I've had to waste because of such shitty software, and it is indeed the case I've had to write such software from scratch because the alternatives were utter garbage. So we deeply, vehemently disagree.


I would go further and say that how modern computers work at the level that interests most people is somewhat based on C. It would be very different with another hardware abstraction paradigm such as Lisp Machines (as their name suggests), but that's not what we have :).

EDIT: I was going to update my comment a bit now that I thought more about it and that I'm on a computer rather than a mobile phone, but in the meantime jerf posted a very good reply to the parent comment so just read that ^^.


Except it isn't, not really.

Even just the distinction between the stack & heap is wrong. They aren't different things, just different conventions applied to otherwise identical memory. It's why things like Go work fine, because the stack isn't special. It's just memory.

malloc & free are also totally divorced from how your program interacts with the OS memory allocator, even. GC'd languages don't necessarily sit on malloc/free, so it's not like that's an underlying building block. It's simply a different building block.

So what are you trying to teach people, and is C really the way to get that concept across? Is the concept even _useful_ to know?

If you want to write fast code, which is what you'll commonly drop to C/C++ to do, then just learning C won't get you any closer to doing that. It won't teach you branch predictors, cache locality, cache lines, prefetching, etc... that are all super critical to going fast. It won't teach you data-oriented design, which is a hugely major thing for things like game engines. It won't teach you anything that matters about modern CPUs. You can _learn_ all that stuff in C, but simply learning C won't get you that knowledge at all. It'll just teach you about pointers and about malloc & free. And about heap corruption. And stack corruption.


> Even just the distinction between the stack & heap is wrong. They aren't different things, just different functions called on the otherwise identical memory. It's why things like Go work fine, because the stack isn't special. It's just memory.

To add to this: I have seen people who learned C and thought it to be "close to the metal" genuinely believe that stack memory was faster than heap memory. Not just allocation: they thought that stack and heap memory were somehow different kinds of memory with different performance characteristics.

And the C abstract machine maps just fine to computers where the heap and stack are separate, unrelated address spaces, so this isn't even necessarily mistaken reasoning for someone who just knows C.


Separate stack and data memory address spaces will make the machine incompatible with ISO C, due to the impossibility of converting between "pointer to void" and "pointer to object". The code address space is allowed to be separate.


> malloc & free are also totally divorced from how your program interacts with the OS memory allocator, even. GC'd languages don't necessarily sit on malloc/free, so it's not like that's an underlying building block. It's simply a different building block.

The realization that malloc is really just kind of a crappier heavy-manual-hinting-mandatory garbage collector was a real eye-opener in my college's "implement malloc" project unit.

(To clarify: the malloc library is doing a ton of housekeeping behind the scenes to act as a glue layer between the paging architecture the OS provides and high-resolution, fine-grained byte-range allocation within a program. There's a lot of meat on the bones of questions like sorting memory allocations to make free-block reunification possible, when to try reunification vs. keeping a lot of small blocks handy for fast handout in tight loops that have a malloc() call inside them, how much of the OS-requested memory the library reserves for itself as memory-bookkeeping overhead [the pointer you get back is probably to the middle of a data structure malloc itself maintains!], minimization of cache misses, etc. That can all be thought of as "garbage collection," in the sense that it prepares used memory for repurposing; Java et al. just add the feature that they keep track of used memory for you, without the heavy hinting via explicit calls to malloc() and free() about when you're done with a given region and it can be garbage-collected.)


Stack accesses _are_ different in hardware these days, which is why AArch64 brings the stack pointer into the ISA level vs AArch32, and why on modern x86 using RSP like a normal register devolves into slow microcoded instructions. There's a huge complex stack engine backing them that does in fact give you better access times averaged vs regular fetches to cache as long as you use it like a stack, with stack-like data access patterns. The current stack frame can almost be thought of as L½.


The stack pointer is just that, a pointer. It points to a region of the heap. It can point anywhere. It's a data structure the assembly knows how to navigate, but it's not some special thing. You can point it anywhere, and change that whenever you want. Just like you can with any other heap-allocated data structure.

It occupies the same L1/L2 cache as any other memory. There's no decreased access times or fetches other than the fact that it just happens to be more consistently in L1 due to access patterns. And this is a very critical aspect of the system, as it also means it page faults like regular memory, allowing the OS to do all sorts of things (grow on demand, various stack protections, etc...)


Google "stack engine". Huge portions of the chip are dedicated to this; if it makes you feel better you can think of it as fully associative store buffers optimized for stack like access. And all of this is completely separate from regular LSUs.

There's a reason why SP was promoted to a first class citizen in AArch64 when they were otherwise removing features like conditional execution.

That's also the reason why using RSP as a GPR on x86 gives you terrible perf compared to the other registers, it flips back and forth between the stack engine and the rest of the core and has to manually synchronize in ucode.

EDIT: Also, the stack is different to the OS generally too. On Linux you throw in the flags MAP_GROWSDOWN | MAP_STACK when building a new stack.


Would it be fair to say that it will teach you how to dictate the memory layout of your program, which is key to taking proper advantage of "cache locality, cache lines, prefetching, etc..."?


> It's why things like Go work fine, because the stack isn't special. It's just memory.

Care to elaborate?



