Writing Your Own Toy Compiler Using Flex, Bison and LLVM (gnuu.org)
96 points by sha90 on Sept 18, 2009 | 39 comments


For the record, I think it's a terrible idea to keep prototyping compilers in C in this day and age. There is already a DSL for compiler construction, and it's called Standard ML. With SML/NJ, the New Jersey Machine-Code Toolkit (Ramsey and Fernandez's excellent binary-frobbing-utility framework; yes, tools that generate profilers, debuggers, tracers, assemblers and disassemblers!), and material from the SUIF project, you can start writing very sophisticated industrial compilers in a fraction of the time it takes to debug just the front-end stuff in C. Even Perl, or your scripting language of choice, is better than C for compiler hacking.

http://suif.stanford.edu/papers/
http://www.cs.tufts.edu/~nr/toolkit/

(Devour Norman Ramsey's site and read his joint papers with Mary Fernandez; he, along with Monica Lam at Stanford and the Rice group (Linda Torczon, Keith Cooper et al.), is producing some of the most accessible, and certainly the most exciting, tools and papers. The Rice group is also responsible for the best introductory compiler-hacking text in recent publication: Engineering a Compiler. GET IT! If only for the carefully curated bibliography.)

Also, recently, the ACM published a list of the 20 most influential papers in programming language design and implementation (PLDI):

http://www.cs.utexas.edu/users/mckinley/20-years.html

I took the time to scrape as many of them as I could off the internet, wherever they were freely available (i.e., authors' websites), and I can say I have 18 of them. I would love to share them with hungry minds in one tarball, or they can Google the papers individually:

A Data Locality Optimizing Algorithm.pdf
A Safe Approximate Algorithm for Interprocedural Pointer Aliasing.pdf
An Evaluation of Staged Run-Time Optimizations in DyC.pdf
An Implementation of Lazy Code Motion for SUIF.pdf
Analysis of Pointers and Structures.pdf
Balanced Scheduling- Instruction Scheduling When Memory Latency is Uncertain.pdf
Complete Removal of Redundant Expressions.pdf
Global Register Allocation at Link Time.pdf
How To Read Floating Point Numbers Accurately.pdf
Improving Register Allocation for Subscripted Variables.pdf
Interprocedural Constant Propagation.pdf
Interprocedural Slicing Using Dependence Graphs.pdf
Lazy Code Motion.pdf
On-The-Fly Detection of Access Anomalies.pdf
Register Windows vs. Register Allocation.pdf
Soft Typing.pdf
Software Pipelining-- An Effective Scheduling Technique for VLIW Machines.pdf
The Design and Implementation of a Certifying Compiler.pdf

I'm guessing many people are interested in the whole PDF package. Can you perhaps put it online somewhere?


So far no one has requested it; I am guessing it's because people have the list above and can Google it themselves. I am gonna take a look at the papers and see if any of them have clauses forbidding free redistribution; if not, you should have a tarball on the front page RSN.


If you do so, could you also post a link to it here? Thanks in advance.


gizmo and silentbicycle, the papers are up:

http://news.ycombinator.com/item?id=834175


Awesome, thanks!


Maybe 'better' in some academic sense, but C is what all machines have. If you want to write a good native compiler useful to people without massive dependencies, you write it in C/C++.

Writing it in ML may be nice, but nobody wants to install ML just to get at another language. The kind of person that would do this is probably aspiring to a good language implementation of their own that could be considered a /peer/ of ML (or Perl, or Python, or Scheme, or Java, or....). High level languages written in yet another high level language are more of an academic curiosity than a tool anyone's likely to use.

While, yes, you're talking about prototyping, that assumes doing so in ML is much simpler. Maybe for you, but not for a C/C++ coder who doesn't know ML. There are a very, very large number of such coders.

Why write a prototype then have to rewrite the whole thing in C to have a "serious" language implementation? For small languages, the effort spent in prototyping then rewriting is going to be more effort than just writing a good stand-alone implementation from the outset. My hobby "prototype" compilers are small, portable, fast, depend on almost nothing, and integrate with the system and command line nicely. If the language was to become useful embedded in some program, say, I could do so immediately because I wrote the implementations to be good from the start. As a user, I'd take that over something I have to run in ML or some other massive high level language any day...

Would it be nice if the system's language was something better (somewhat) like ML? Sure. Is it? No.

Anyway, from the implementation point of view, when it comes time to write your GC, or any other low level details (parallelism?) writing in something like ML is a severe handicap. As far as performance goes, LLVM makes /fast/ code, and a lot of work is constantly being poured into it by a lot of people to make it even faster. This is a HUGE amount of work (i.e. far beyond what one person writing a language, or the people involved in the projects you have mentioned, can or will do), and it's already been done for you if you use LLVM (of course LLVM does have ML bindings).

Anyway, the point is, if you're trying to "prototype" or do PLT research, and you know ML well, and all you care about is the language itself you're implementing (and not the qualities/bloat/dependencies/performance of the implementation); sure, using ML probably makes sense. However these things are certainly not true of everyone writing a compiler.


I wrote a long reply then decided against it. I will ignore that "C is on all machines" fallacy, because it's not. People are accustomed to downloading viewers and codecs for even the most basic documents and multimedia files; witness the trouble people go through just to make an audio widget play on their MySpace page, or an MS Office version X template in version X++. Compiler users should be bloody well sophisticated and have no trouble installing the necessary runtimes; it couldn't be harder than installing Flash, Java or .NET, really.

Forget the word "compilers" for a second. Pick the most powerful language you can find for experimenting with graph algorithms.

That's it. That's all it boils down to. Compiler bloat, running speed, start-up time and other such stuff is just a programmer's wishful thinking. How many machines in the world run your software? How much of your users' time have you wasted by not implementing a piece of software as efficiently as you could? (Would your users trade delivery time for startup or running time? I.e., do they want to start using your program today, even if it runs 300% slower than the machine allows, or do they want to use it next week and expect it to run at maximum speed? :-)


I'm not entirely sure how machine code (or LLVM bytecode, if you'd prefer) produced by a program written in ML is at all inferior to the same machine code (actually, probably less optimized machine code, given that ML is a high-level language in which it's easier to code optimizations than in C) produced by a program written in C.

Comparing ML to Perl or Python is ridiculous. There are a number of native optimizing compilers for ML. Performance-wise, ML is closer to C++ or Java or CL than to those languages.


Well, you can write your own guideline on how to prototype a language in SML, post it to HN, and profit.

Right now we don't see your guideline, but we do see the Flex/Bison/LLVM one.


WTF is wrong with you? That was one of the most informative comments I've ever seen on this site. There was a ton of useful information in there, way more than the short little guide in the OP (which I'm not knocking at all).


I'd like it to be a stand-alone article. This way it would be a first-class citizen on HN.

I mean, more meat, examples, and soon quite a few more people will know how to prototype a language in SML.


Andrew Appel's _Modern Compiler Implementation in ML_ is probably what you want. (http://www.amazon.com/Modern-Compiler-Implementation-Andrew-...)

While it uses SML, it's not hard to follow if you use OCaml, and the bibliography / references are also quite helpful. The Ghuloum paper referenced elsewhere in the thread is also quite good (and uses Scheme, for better or worse).


It's a damn book, not a quick article.

Sad.


Compilers are a big topic. Any "quick article" of substance will have references to underlying tools and concepts the author will expect you to know.

If you're completely lost, try starting with interpreters first. Compilers build on many of the same ideas, but interpreters are generally simpler. Read "The Art of the Interpreter" and/or SICP, and do the exercises. The former is a short paper, the latter is an excellent book expanding on it.
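To give a feel for how small a starting point can be, here is a minimal tree-walking interpreter sketch in Python (an illustration only, not taken from either book): expressions are nested tuples, evaluated recursively.

```python
# Minimal tree-walking interpreter for arithmetic expressions,
# represented as nested tuples: ("+", 1, ("*", 2, 3)).
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(expr):
    """Recursively evaluate an expression tree."""
    if isinstance(expr, (int, float)):
        return expr                      # literals evaluate to themselves
    op, left, right = expr               # compound form: (op, lhs, rhs)
    return OPS[op](evaluate(left), evaluate(right))

print(evaluate(("+", 1, ("*", 2, 3))))  # prints 7
```

A compiler grows out of exactly this shape of code: same tree walk, but emitting instructions instead of computing values.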


I've written compilers, and even though I no longer have any reason to do so, I'd love to know more about writing better compilers in better languages.

There are dozens and dozens, if not hundreds of articles and "how to" style pages on the web showing a concrete implementation of a small compiler for a toy language using lex/flex and yacc/bison, yet when asked for something similar in ML, SML, Haskell, whatever, the response is to be pointed at books and scholarly articles.

Where are the popular accounts of better compiler building with ML? Are there any? If not, why not?

Yes, building an industrial-strength compiler for a real-world language is a big and complex undertaking, but I feel the field would advance more quickly if more hackers had access to more articles of the "Build a compiler in an afternoon" variety.

And I'd appreciate learning from it too, despite, or perhaps because of, my existing knowledge.


Part of it is that the culture surrounding ML is quite academic. Most of what I've found has been written for journals or books (the Appel compilers book above, _Purely Functional Data Structures_ by Okasaki, etc.), not blogs. Also, the ML community is rather small, though FWIW I'm only really familiar with the OCaml side of it.

Here's[1] a post about an OCaml / LLVM article by Jon Harrop, but it's just an intro for an article in his (pay) OCaml Journal.

   [1]: http://ocamlnews.blogspot.com/2008/09/writing-bytecode-compiler-using-llvm.html


And there, to some extent, is the point. People will continue to write "popular press" articles about building compilers using flex and bison, possibly targeting C as an intermediate language, until the ML community (or equivalent) becomes less clannish.

I'd love to see how clever they are, applaud them for it, and build on their work. They don't seem to care.


Indeed. I think the ML languages have a lot of ideas that should see more widespread use, but the community (with exceptions, naturally) seems to be more motivated by reputation within academia than on blogs and other open discussions. That's their prerogative, I guess, but it does mean that little activity around ML is visible outside certain circles.

I'm curious about writing interpreters / compilers in Prolog (another language that seems particularly suited to prototyping them), as well, but I've already got plenty of projects.


Please no books and no research papers.

Just a short article showing how to get things done.

Exactly as one we're discussing.


"Let's intentionally do things wrong because nobody posted details about the right way to HN."

Yeah, go for it...


How would one do things right if there are no examples of comparable quality?

Revisit the recent drama with Twisted vs. Tornado.


*How would one do things right if there are no examples of comparable quality?*

By researching independently when given sufficient pointers. I posted a link to Abdulaziz Ghuloum's paper, which shows a full compiler in small baby steps, bootstrapped from the first line!

Research and conference papers are fairly good for learning, almost as good as blog posts.


Many of the best articles posted here have very few comments. I wonder if news.yc's promotion algorithm could be altered to reflect this.


One of the reasons why I like HN is that for a good article, you don't get a ton of one-liner "that's cool", "awesome", etc. type comments.


Yeah, that's what happens when people actually have some competence: you know that you know very little, AND you know the crowd here will call you on your BS in these subjects. But politics or other crap like that will fill up with comments right quick.


[Summary: read this paper instead http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf]

The article is both acceptable and appreciated, but not good.

There are far better, not to mention easier, ways to start hacking a compiler quickly than doing it with Flex/Bison/LLVM and in C++. Look at this over-engineering:

http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/

A compiler should be written as a fluid, jelly-like organism; you will be changing it so much and so often that it's a waste of time to introduce any structure like that so early. The only place where you need a heavy design is the intermediate representation, and even there you want the most flexible "design": if you can get away with Lisp-like S-expressions, by all means do it.

You will be annotating the intermediate representation in multiple phases, so don't hesitate to copy deeply instead of mutating it with surgery. Don't bother with an elaborate symbol table design; just use the cheapest/easiest hash table you can find. Keep your IR human-readable or you will be forced to write binary analysis tools before you even settle on an IR format (a horrible chicken-and-egg problem, and that's what you get when you model your IR with a giant C union... you know, that trick, don't do it!)
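As a rough illustration of this advice (a Python sketch with made-up names, not anything from the article): an S-expression-style IR as nested tuples, an annotation pass that deep-copies instead of mutating, and a plain dict as the symbol table.

```python
# Sketch of an S-expression-style IR: plain nested tuples, so each
# annotation pass can copy the tree instead of mutating it in place.
import copy

# (op, *args) -- a tiny IR for the expression (x + 2) * y
ir = ("mul", ("add", ("var", "x"), ("const", 2)), ("var", "y"))

def annotate_types(node, symtab):
    """One pass: return a NEW tree with a type tag appended to each node.
    symtab is the cheapest symbol table there is -- a dict."""
    node = copy.deepcopy(node)           # never mutate the previous phase's IR
    op = node[0]
    if op == "const":
        return node + ("int",)
    if op == "var":
        return node + (symtab[node[1]],)
    # binary ops: annotate children first, then tag the node itself
    lhs, rhs = (annotate_types(c, symtab) for c in node[1:])
    return (op, lhs, rhs, "int")

typed = annotate_types(ir, {"x": "int", "y": "int"})
print(typed[-1])   # prints int -- the root node's type annotation
```

Each phase takes the previous tree and returns a richer one; the original `ir` is untouched, and the whole structure stays printable (human-readable) at every step.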

For the last 20+ years, Schemers have been losing their voices preaching the trivialization of compiler hacking. Listen to them; Schemers live in a parallel universe to the mainstream compiler community, which, even if it doesn't know it, is still hard at work improving the first Fortran compiler.

Have fun!


I also like Kragen's Ur-Scheme as a concrete readably-small example of a self-hosted compiler to x86, inspired by the Ghuloum paper you reference. [At http://www.canonical.org/~kragen/sw/urscheme/ ]


Sadly we don't all have these options. The project I'm working on right now, a prototype DSL for writing counters to check data, can't be written in a fancy language like ML, Haskell or hell even Python. They don't want any "weird languages" that someone else will have to maintain once I leave. So C/Lex/Yacc it is.


Prototype it in the language you know best, implement it in the language you're "required" to.

A typical compiler project takes at least 6 months from start to finish. Deliver something in Python in 2 weeks and see if they can resist it; you still have 5.5 months to flesh it out in C if you have to. I don't buy for a second that a software shop will refuse a working demo delivered immediately, even if it's in APL.
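To make the 2-week-demo idea concrete, here is a hypothetical sketch (the rule syntax and names are invented, not the poster's actual DSL) of how little a first cut of a counter-checking language can be in Python:

```python
# Hypothetical sketch: a "counter check" mini-DSL prototyped in Python.
# Rules like "errors < 10" are parsed into (name, op, threshold) and
# evaluated against a dict of counters -- enough for a working demo.
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}
RULE = re.compile(r"(\w+)\s*(<|>|==)\s*(\d+)")

def check(rule, counters):
    """Evaluate one rule string against the current counter values."""
    name, op, threshold = RULE.match(rule).groups()
    return OPS[op](counters[name], int(threshold))

counters = {"errors": 3, "retries": 12}
print(check("errors < 10", counters))   # prints True
print(check("retries < 10", counters))  # prints False
```

Twenty lines won't survive contact with real requirements, but as a demo it settles the language design questions long before the C/Lex/Yacc rewrite starts.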


What about Clojure? It's a Lisp and can be used the Scheme way. At the same time, it runs on the Java virtual machine and can therefore be controlled directly from Java. That is, your legacy code can be written so that it can be maintained by Java programmers (this is how to sell it to "them").


That might work if their concern is deployment of his code. But if their concern is actually maintenance (you know, patches, updates, bug fixes), then I don't see how Clojure, Scala, Jython, etc. would be any more acceptable. The concern is probably having legacy code in a language that nobody else on staff knows how to program in.


Correct: they will not be able to program in Clojure. But they should be able to interop Java with the classes created in Clojure. No REPL environment and all code compiled is a requirement for this kind of legacy work, but it can be done easily. (Or "should" be done easily.)


If anyone's interested in a similar project written in Haskell, a few months back I wrote a compiler for C-Minus, which has a similar syntax. It uses Parsec for the front-end and a custom backend that targets a simple virtual machine (this was a school project), so no LLVM unfortunately. An LLVM-based backend wouldn't be cool to add though.

Anyways, I figured someone may find it interesting.

http://github.com/michaelmelanson/cminus-compiler


I notice that Haskell has LLVM bindings, and Parsec makes writing parsers remarkably non-painful.

http://augustss.blogspot.com/2009/01/llvm-llvm-low-level-vir...
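The combinator idea behind Parsec isn't Haskell-specific; here's a toy sketch in Python (invented names, just to show the shape): a parser is a function from input to a (result, rest) pair, or None on failure, and bigger parsers are built by combining smaller ones.

```python
# Toy illustration of the parser-combinator idea behind Parsec:
# a parser is a function str -> (result, rest) or None on failure.

def char(c):
    """Parser that matches exactly one character c."""
    return lambda s: (c, s[1:]) if s[:1] == c else None

def seq(p, q):
    """Run parser p, then parser q on what's left; fail if either fails."""
    def parser(s):
        r1 = p(s)
        if r1 is None:
            return None
        a, rest = r1
        r2 = q(rest)
        if r2 is None:
            return None
        b, rest2 = r2
        return ((a, b), rest2)
    return parser

ab = seq(char("a"), char("b"))
print(ab("abc"))   # prints (('a', 'b'), 'c')
print(ab("xyz"))   # prints None
```

Real combinator libraries like Parsec add choice, repetition, and error reporting on top of exactly this pattern, which is why grammars written with them read almost like BNF.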


Oh, that'll teach me to double-check my posts before I submit them. I meant to say that an LLVM backend wouldn't be hard to add (or, would be cool to add).

Thanks for the link!


One advantage of Bison is that it can get rid of global variables. However, Flex still uses global variables to pass state. Thus, using Flex undermines the good part of Bison (thread safety).


Umm, both flex and bison can be either reentrant or not, and the default for both is "not reentrant". You need to specify '%option reentrant' in your flex scanner, and '%define api.pure' in your bison parser. The signature changes, naturally - yylex takes pointers to yylval and yylloc that it's supposed to fill in. It's not terribly complicated, but the documentation for it sucks.

I've got both my flex lexer and bison parser wrapped in a C++ class, which parses string input, handles all memory management by itself, and hides all the other implementation details from the outside world. I didn't use the built-in C++ wrappers, which suck, but it wasn't hard to throw a few C data structures inside a C++ class and call a few C functions from C++ methods.


Do you have a recommendation for a Flex replacement?



