DNA is object code

etal · on Sept 29, 2008

Easier said than done, as I'm sure Andres Yates knows. Biologists have DNA libraries, they can mark expressed genes for debugging, they can link and run this object code (with varying results) -- but there's no general compiler to generate this code, and it's nontrivial to write one without fully understanding the machine architecture.

There are about 30,000 human genes, spread over 3 billion base pairs, coding for tens of millions of different proteins. Given how proteins and RNA interact with DNA, and DNA interacts with itself, the language for describing molecular biology usually isn't discrete math; it's statistics.

I think this is why Yates is accusing biologists of treating DNA as the abstraction that represents proteins, when really, it makes more sense the other way around. It seems like in the past couple of years the best fundamental research has been in proteomics, rather than genomics, probably for this reason -- bioinformaticians are recognizing that analyzing the expressed proteins, done efficiently, can give better information than raw DNA about what's actually going on in a cell.

jaytee_clone · on Sept 29, 2008

This article somewhat touches upon the issue of "bio-programming" but it doesn't give credit where it is due.

People have been studying "bio-compiler/machine code" for decades. The difficulty is that there are many different compilers at work. Just to name a few: protein folding, chemical gradient regulation, inter/intra-cellular signaling, etc. All of these "compile" chemical composition into actual physical function. But each of these processes alone is so complex that it takes years to uncover a small portion of the black box, not to mention that most of these "compilers" are interdependent. So it is close to impossible to use the limited knowledge we have uncovered to predict other novel code -> function compilation. I worked in a protein folding lab before. Just to be blunt here, no one really has a clue.

ced · on Sept 29, 2008

I disagree completely.

For it to be object code, it would have to be compiled from something. If anything, DNA is the thing compiled (and modified) several times from the ACGT form down to the protein form. Nature doesn't operate on the "hidden abstraction layer"; it modifies the DNA directly, which is a very, very strong clue that the meaning is there, not elsewhere. Why does crossover mix large sequential swaths of chromosomes? Because genes (the meaningful units) are sequential, and are relatively unlikely to be messed up by such an operation. The requirements of evolution impose much structure on DNA.

I wouldn't call DNA source code either.

Furthermore, while the idea of building higher abstractions sounds nice in theory, it fails, because it's not an engineering problem. It's a science problem, and anyone who has some experience with physics or chemistry knows that models have to remain simple for them to work at all. Meteorological models suck. Climate change models suck. Modelling a single cell is super hard. I wish we could get better models for biology, but it just seems really unlikely. The current approach to discovering gene function seems very reasonable to me.

anamax · on Sept 29, 2008

> For it to be object code, it would have to be compiled from something.

Nope. In fact, object code doesn't even have to be assembled from something.

It just has to be made available to an execution engine.

And yes, folks modify object code.

And, even if there was some source somewhere else, that doesn't imply that the meaning isn't in the object code. (Meaning can be in multiple places.)

ced · on Sept 29, 2008

... right, but the point of the top post is that we should be looking for "the source"... Maybe that wasn't formulated properly.

The point comes down to science vs engineering, induction vs building from parts. Biologists are already building models (model != programming language!). Models are the holy grail. They are just damn hard to get right.

SapphireSun · on Sept 29, 2008

"I wouldn't call DNA source code either."

Well here's the thing. You are completely right. DNA interacts with its surroundings. It forks, splits, mutates. It is not at all like static source code. However, if you take the software analogy to the limits of our current understanding, you might be able to model it as polymorphic code that is also a quine that runs in a massively multithreaded, multiprocessor environment.

Good luck using that analogy for anything useful.

ntoshev · on Sept 29, 2008

This assumes that there is a simple, coherent way to describe life - the source code. There doesn't have to be such a description. Maybe we were just lucky with physics and the ability to describe the universe with simple equations (until we got to quantum mechanics and relativity theory and things became messy again). With genomics, things can very well stay messy: after all, life evolved at DNA level.

Herring · on Sept 29, 2008

There’s a reason why Windows’ object code is everywhere, but the source code is top secret.

Might be a bad example. You couldn't use the source even if it wasn't secret. That probably protects Windows' copyrights more than the other.

MaysonL · on Sept 29, 2008

No - DNA is not object code; it's much closer to Lisp.

Some of it is code, some of it is data, some of it is macros.

And it's all the self-modifying output of random genetic algorithms.

etal · on Sept 30, 2008

DNA is trillions of monkeys typing on typewriters for 4.5 billion years and throwing away every page that doesn't contain any readable words.

And sometimes a page happens to turn into another monkey, typewriter or page. Actually, I can't think of an analogy that comes close to describing the hairiness of biology. It's the origin of all hair; it eclipses all else.

jsmcgd · on Sept 30, 2008

Spaghetti code.

newt0311 · on Sept 29, 2008

Amazing article. Instead of limiting it to genomics, I would apply this sentiment to nearly all parts of biology. The field needs to grow up and start using the powerful principles of building abstractions and leveraging advanced mathematics, as physics did around Newton's time and before.

sungam · on Sept 29, 2008

I disagree. Biology has built complex abstractions; it is just that (outside of specific bioinformatic domains) mathematics has consistently proven not to be the appropriate language for describing complex, messy biological systems.

I like the DNA-as-code analogy. Getting the 'source code' is not the stumbling block - this is becoming far easier with high-throughput sequencing methodologies. The real difficulty is writing the code. I have spent the last 3 months constructing 15kb of DNA by classical molecular biology techniques - and this was largely achieved by cutting and pasting from existing DNA sequences. The cost of synthesizing large DNA fragments is currently >1 dollar / base. When this falls to trivial levels I think we will really start to see DNA programming taking off.

maxwell · on Sept 29, 2008

> Getting the 'source code' is not the stumbling block - this is becoming far easier with high throughput sequencing methodologies.

I don't really know anything about biology, but as a programmer, the article's suggestion that DNA is "byte code" and not source makes sense. To get actual high-level human comprehensible source, I think we'd need to build abstractions on top of DNA object code. Not that I have any idea what these abstractions might look like.

The article compares genomes to Windows, and I don't really know anything about biology, but it might actually be easier to reverse engineer ourselves than Microsoft's operating system, since (as far as I know) our "binary" is much smaller :)
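As a rough back-of-envelope check on the "binary" size, here is a minimal sketch. It assumes the 3 billion base pair figure quoted earlier in the thread, a naive 2 bits per base (four letters), and a single haploid copy; it ignores everything that is not raw sequence. The function name is made up for illustration.

```python
# Back-of-envelope size of one haploid human genome stored as raw data.
# Assumptions (not measurements): 3 billion base pairs, 2 bits per base.
BASE_PAIRS = 3_000_000_000  # figure quoted earlier in the thread
BITS_PER_BASE = 2           # four letters (A, C, G, T) fit in 2 bits


def genome_size_megabytes(base_pairs: int = BASE_PAIRS) -> float:
    """Return the naive storage size in decimal megabytes."""
    bits = base_pairs * BITS_PER_BASE
    return bits / 8 / 1_000_000


if __name__ == "__main__":
    print(f"{genome_size_megabytes():.0f} MB")
```

Under those assumptions the raw sequence comes to roughly 750 MB, which says nothing about how hard it is to interpret.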

sungam · on Sept 29, 2008

Ultimately it comes down to semantics but I think of DNA as source code. DNA is already human-comprehensible. The organisation of genes within the sequence is actually fairly straightforward and probably not that different from how you would invent it from first principles given the constraints of the storage medium. We can routinely mix and match existing sequence components with predictable consequences. There is no need to invoke a higher level of abstraction than this.

I would say that the object code is the set of RNAs present in the cell and the program is the state of all of the macromolecules in the cell - proteins, lipids, carbohydrates etc. Essentially the program is running on the organic chemistry virtual machine.

When it comes to 'writing' artificial DNA there are two complexities. Firstly, we do not understand the intricacies of how genes are turned on and off, although there is no reason that we should not come to understand this. This is the area that I am working in. The second major problem is that we cannot 'invent' new proteins, as we do not know how to predict the 3D shape ('folding') and chemical behaviour from the protein sequence. With massive increases in computational power and clever algorithms it is possible that this difficulty will be overcome.
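To make the source / object code layering above concrete, here is a toy sketch of the two well-understood steps: transcription (DNA coding strand to mRNA) and translation (mRNA to an amino-acid string) using the standard genetic code. The function names are invented for illustration, and the sketch deliberately ignores gene regulation, splicing and folding, which are exactly the hard parts described above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """'Compile' a DNA coding strand to mRNA by swapping T for U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is produced
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGTTTGGATGA")
    print(rna, "->", translate(rna))
```

The lookup table is the easy part; nothing here predicts what the resulting chain of amino acids actually does once it folds.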

maxwell · on Sept 30, 2008

Very interesting, thanks.

lst · on Sept 29, 2008

We are all a mystery to ourselves, and will remain so till the end of our days...

(Poor humans, they will never be able to stop God from smiling about their relatively poor scientific investigations...)

;)