
It's not just calling conventions. LLVM IR also bakes in various assumptions about the target platform, from endianness to structure alignment to miscellaneous facts like whether char is signed or unsigned.

For many of these there is no going back to the original representation; the lowering is one-way.

If you have IR that happens to compile and work across several architectures, that is luck in that particular case. It is not what LLVM IR was designed for, nor should it be expected to work in general.



I don't understand what this means. Could you please give an example of some code that loses information in this way when compiled with LLVM?


Say you have a struct type X with properties int, double, int. The offset of the last property depends on how alignment works on the target platform - it could be 12 or 16 on some common ones. LLVM IR can contain a read from offset 12, hardcoded. Whereas the C code contains X.propertyThree, which is more portable.


But that's not how LLVM works, at least when I worked with it a couple years ago. You would define a struct type in terms of primitive types (int64, ptr, etc), and then use getelementptr with the offset of the field path you wanted. Yes, it's a numeric offset, but it's a field offset within the struct, not a byte offset. LLVM handles packing, alignment, and pointer size issues for you automatically.


Yes, you can define structs and use getelementptr to access values. But, frontends can also bake in offsets calculated from getelementptr. They can also bake in sizeofs of structs, for example. And LLVM optimizations end up baking in hardcoded values in more subtle ways too.


Once you have defined a struct in terms of primitive types, it is platform dependent.

Consider C:

A C int can be 16 bits. Or 32. Or 64. Etc., as long as the constraints on its relation to the other types are met.

The moment the frontend specifies a primitive type for a field in the struct, that code is incompatible with a whole lot of platforms.


Your primitive types aren't LLVM's though, are they? I mean, I haven't looked at LLVM thoroughly (just enough to be familiar with it; a friend is writing a language he wanted some input on), but I would be surprised and disappointed if they had a C "int" type as opposed to "signed 32-bit integer" or whatever. At which point it's compatible with whatever else is throwing around a signed 32-bit integer.


But that is exactly the point - that LLVM IR is not platform independent.

The frontend must choose which specific integer type C's "int" maps to. At that point, the IR is no longer machine independent - if you pick 32-bit signed ints to represent C "int", your program will not match the C ABI on any platform that uses a 16-bit int as C "int", and you won't be able to directly make calls to libraries on that platform, for example.


So use uint32_t?


This misses the point. The point is that if you pass a C program that uses "int" through a C-compiler that spits out LLVM IR, the resulting LLVM IR is not portable.

You might not be able to change the C program - it might be using "int" because the libraries it needs to interface with use "int" in their signatures (and handle it appropriately) on the platforms you care about.


Ah, I think I see... you mean I could write non-portable IR code by doing that, although LLVM would never produce code like that? I guess there must always be IR that the frontend will never produce, then?


No, the implication is that the LLVM IR that the frontend produces changes depending on the ultimate target that the LLVM IR will be compiled to. In other words, the frontends aren't backend-agnostic.


Oh, right! That makes more sense. So you have to specify the backend when you start the process? I didn't know LLVM did that.


Yes, the frontend very much knows what target you are aiming for. It greatly affects the IR that is generated.

And once you generate that IR, you can't just build it for an arbitrary target; it must be the one it was generated for.


No, every LLVM frontend (e.g. Clang) has to do so all the time for things to work.


That makes no sense at all. You've just said that every LLVM frontend has to produce code that every LLVM frontend won't produce! If you mean something different, then could you please be clearer, as I really don't understand what you're talking about.

[EDIT: caf's post made it clearer. I know what you meant now]


     int a() { return sizeof(void *); }
Obviously a trivial example, but it's illustrative: the front-end compiler knows a bunch of things about your target and bakes that information into the IL. If you took IL generated by the compiler with "-arch i386" and then compiled the IL using "-arch x86_64" it's quite possible to get a non-working executable.

It's possible to carefully write IL that will work on multiple platforms as done in this blog post, but I'm not sure how useful that really is. You're still giving up the exact control that assembly gives you, so I don't know how much better you'd get than clang-produced IR. In other words, if you want "portable assembly language" use C.

Still, it's an interesting blog post. It's good to show people how the compiler works behind the curtain.



