Principles of Data Oriented Programming (klipse.tech)
375 points by viebel on Oct 5, 2020 | 134 comments


Very interesting!

In the context of C++ some of us have been calling these programming/design principles "Value Oriented Design". Some talks on the topic:

- Most Valuable Values (Juan Pedro Bolívar) https://www.youtube.com/watch?v=_oBx_NbLghY

- Squaring the circle, value oriented design in an object oriented system (Juanpe) https://www.youtube.com/watch?v=e2-FRFEx8CA

- Objects vs Values: Value Oriented Programming in an Object Oriented World (Tony van Eerd) https://www.youtube.com/watch?v=2JGH_SWURrI


Something I've found applying Value Oriented Design to C++ is that it often leads to freeing designs of arbitrary limitations that weren't clear beforehand but in hindsight you realize, "well, of course I should also be able to use it this (other) way."

For example, I'm porting a parser engine implemented in another (garbage-collected) language to C++, and it wasn't clear what the hierarchy of objects should be for the purpose of RAII, because there are circular dependencies. The original implementation loaded a grammar file and directly created objects with pointers (originally, references) to other objects.

I introduced an intermediate layer where the file is first read into data structs which hold the integer values from the file. So instead of objects pointing to objects, the links are implicit, because things share the same index. The representation of the loaded grammar file is now copyable, moveable, immutable, etc. A side effect is that it's trivial to tell whether the file-loading code is correct when the result is just the same data with structure applied to it, rather than having the added dimension of determining whether a graph of objects relates together correctly.
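To illustrate the shape of that intermediate layer, here's a minimal sketch in JavaScript rather than C++, with invented field names:

    // Plain data loaded from the grammar file: no pointers, only indices.
    // A rule's symbols live at [firstSymbol, firstSymbol + symbolCount).
    const grammarData = {
      rules:   [ { name: "expr", firstSymbol: 0, symbolCount: 2 },
                 { name: "term", firstSymbol: 2, symbolCount: 1 } ],
      symbols: [ { ruleIndex: 1 }, { ruleIndex: 1 }, { ruleIndex: 0 } ],
    };

    // Because it's just data, value semantics are trivial: copy and compare.
    const copy = structuredClone(grammarData);
    console.log(JSON.stringify(copy) === JSON.stringify(grammarData)); // true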

Then I construct the actual parser engine objects from the data representation structs. True, that didn't in itself solve the RAII-hierarchy problem, but it did make it easier to isolate that problem to the domain of how the objects are used, without commingling it with the problem of how the file is loaded.

The epiphany I spoke of is that after this refactor, it became clear: the file is arbitrary. For testing, or for use of the parser with a grammar which does not change, I could dispense with the file load step and just encode the grammar directly in value-structs.

Why I think this is significant is that "the way I was trained" to think of making code like this unit-testable is to mock the file reading interface. That's a lot of work for something that's only necessary because of an over-emphasis on objects and behaviors instead of thinking about data and values.



Unsurprisingly: David Abrahams, before moving to Apple, was an extremely influential member of the C++ Boost community, where value-based programming is practiced extensively.


Sean Parent has also written a lot on the topic.


Yes! Also the book Elements of Programming by Stepanov has a lot of "value orientation" in it.


Unsurprising, as I believe Sean is a "disciple" of Stepanov (I think they worked together at Adobe).


Principle #2 always bothers me.

"Model the data part of the entities of your application using generic data structures (mostly maps and arrays)."

and the example:

    function createAuthorData(firstName, lastName, books) {
      return {firstName: firstName, lastName: lastName, books: books};
    }

For a simple, obvious object like "person" it might work, but for a complicated domain object with many other composed objects this starts to be a pain.

I don't feel ready to memorize all the field names; I prefer to have an object with documentation for each field.


The author argues that DO reduces complexity, but he also notes its price, and reminds us that it's in the eye of the beholder whether it is of benefit.

Many of the early PHP applications were written in a style using generic data structures (arrays, that is, in PHP), with functions dealing with all of them.

There is no free lunch.

If you already have domains you can model in, the strategy for reducing complexity is perhaps more that of the DDD book, which looks like a higher-level concept to me and may therefore be more fitting for higher complexity levels.

(But again, one can for sure shoot oneself in the foot with DDD as well.)

What I like about the DO thing is that it maps well onto simple REST APIs sending and receiving JSON text, which is quite popular these days to say the least, not to mention Serverless.

A similar trend can also be seen in structured logging.

These systems are often distributed and complex.

Since it bothers you in your case, I would tend to say it's better to stick to a domain model if there is one. Across domain boundaries, though, I can imagine that using values to emit and receive data can work out very well. As so often, it depends.


> I don't feel ready to memorize all the field names; I prefer to have an object with documentation for each field.

Using data doesn't preclude documenting fields, particularly if those fields are assigned a unique name.

For example, in Clojure you'd namespace the keys, so instead of "lastName" one might write :author/last-name instead. This keyword can then be documented or even assigned a type/spec.

(That said, you can't currently assign a docstring to a keyword in Clojure without the use of a third party library, or using a comment or external document. Hopefully a future version of Clojure will make this part of the core language.)
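A rough JavaScript analogue of the same idea (key names invented): since each key carries its namespace, fields from different entities can share one map without colliding, and documentation can itself live in data, keyed by the same names.

    // Namespaced string keys, mimicking Clojure's :author/last-name.
    const record = {
      "author/first-name": "Isaac",
      "author/last-name":  "Asimov",
      "book/title":        "Foundation",
    };

    // Documentation is data too, keyed by the same unique names.
    const docs = { "author/last-name": "Family name as printed on the cover." };

    console.log(record["author/last-name"]); // "Asimov"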


If you use simple data structures without logic, it is basically what he is saying: `createAuthorData` is essentially a data-structure constructor. There are some advantages to using _.pick or something in JS land, but these don't work for other, more strictly typed languages.

Also, I would be careful with using it as much as his example does:

```
function createAuthorData(firstName, lastName, books) {
  return {firstName: firstName, lastName: lastName, books: books};
}

function fullName(data) {
  return data.firstName + " " + data.lastName;
}

function createArtistData(firstName, lastName, genre) {
  return {firstName: firstName, lastName: lastName, genre: genre};
}
```

You can also do `fullName({firstName: 'Elmond', not: 'really'})` and get a hard-to-catch bug (where an actual type system would catch it).

You can either use a Protocol for this (PersonName) or a base class (though those are going out of fashion nowadays)


Or when structural typing is available,

    type example = { firstName : string; lastName : string};;
    let name x = x.firstName;;
    name {firstName = "Joe"; lastName = "User"};;
Or to make it more specific

    let name { firstName : string} = firstName;;


Usually a map can be fine, but isn't it a maintenance nightmare? At least if you change an object the compiler will complain if an attribute is not found; with maps this instead leads to runtime errors or bad behaviour.


Yes, for sure it is. But what is the alternative? A compiler can only complain within a monolith; it can't scratch boundaries out of the system it compiles (and even within that monolithic, encapsulated system the data might be inconsistent).

For distributed systems this has been solved in the last two decades (if not longer), but still systems press against formality. How come?

Continue to write the code to deal with the (always changing) data; maybe that's the point of dividing data apart from code (OP/DO #1; Perlis epigram 9: "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.").


How has this been solved for distributed systems?


One way is to have an Interface Definition Language and rely on an IDL compiler to write the interfaces in the language of your choice. Senders and receivers will expect the same types to be marshalled/unmarshalled. If a sender attempts to send data that is different from the interface, either the compiler will catch it, or hopefully a runtime error will happen - either the sender will fail to marshal the data, or the receiver will fail to unmarshal it. Still, there are "garbage-in, garbage-out" situations that someone will have to debug, but starting with well-defined interfaces and tooling to generate code helps a lot.
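As a toy sketch of what the generated marshalling code buys you (hand-written JavaScript, not the output of any real IDL compiler):

    // The shared "interface definition": sender and receiver stubs are
    // both derived from this one schema.
    const AuthorMessage = { firstName: "string", lastName: "string" };

    function marshal(schema, data) {
      for (const [field, type] of Object.entries(schema)) {
        if (typeof data[field] !== type) {
          throw new TypeError(`field "${field}": expected ${type}`); // fails at the sender
        }
      }
      return JSON.stringify(data);
    }

    marshal(AuthorMessage, { firstName: "Isaac", lastName: "Asimov" }); // OK
    marshal(AuthorMessage, { firstName: "Isaac" });                     // throws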


Adding a version field can be useful to avoid this. Particularly if your entire state is stored in one object.


Yes. Personally I'm (generally) against version fields. However, in the OP's terms as I read them, if you add a version field it breaks the value (the version field invalidates value comparison for equality) and therefore ends up adding complexity. This runs contrary to the topic, as the OP clearly states that the major goal is to reduce complexity.

So adding a field that makes the value (inherently) incompatible with the value system breaks the benefits of a couple of the six points outlined in the OP (given the version field suggestion).

Just saying. Your mileage may vary. But again, introducing version attributes is most of the time (and that is a warning) _increasing_ complexity.

One of the articles referred to by the OP is [out-of-the-tar-pit], which is fundamentally about complexity: what it is, and how paradigms, syntax, and language support relate to it. A version field is a counter at a higher level on top of all of that (and therefore already largely off-topic), and it also ruins value comparison (the version field makes value inequality part of the versioning system), introducing metadata and IMHO ruining DO.

If you need to encapsulate state to take a short-cut, introduce state. Don't ruin value(s).

Just my 2 cents.

(/edit: better than version attributes are just plain attributes, as they work in both directions of change. Not that straightforward to deal with at first, but they offer more flexibility: some older value-handling functions would be incomplete [but compatible], and new ones would fit exactly. Similarly, namespacing for attributes is orthogonal [as in Clojure; it depends on whether the language at hand supports it], while version fields impose one general forward direction only. IMHO they are a last resort, for when every other kind of consistency is already considered lost beyond recognition [most often this is _not_ true in computer systems; it's just that everyone involved is too f-c-k'ing lazy to take care and then blames others instead of getting the job done themselves. When in doubt ask operations, they may tell you, if they have time].)

[out-of-the-tar-pit]: https://raw.githubusercontent.com/papers-we-love/papers-we-l... Moseley/Marks 2006


Agree. But then every function using the map has to query the version field?


Who says there is only one version field?


This is alluded to in Price #2: No Packaging. I’d argue that grouping functions into namespaces largely resolves this problem.

These two slides in the Clojure, Made Simple talk are my favorite counterpoint to the claim that even shallow objects are easier to work with than data. See starting at 50:00:

https://youtu.be/VSdnJDO-xdg

If you don’t want to click through, the list of HttpServletRequest methods is bigger than the map itself! Each one is its own little DSL, and they are not consistent with each other. To do anything with the data, you first have to figure out which method gets the bit you want out of the thing. And that’s a simple case—the complexity only goes up from there.

The map OTOH is just a map; you have hundreds of simple, composable functions that you can use to slice and dice the data however you need.

Want the keys? (keys request)

Want the keys of the headers? (keys (:headers request))

The user-agent value? (get-in request [:headers "user-agent"])

None of these are specific to a servlet request; they will work on any piece of data. And it’s trivial and transparent to build more specific functions out of them that fit your problem domain.


What bothers me more is the lack of a way to ensure that the data is valid. As always, things look great in simple examples where it doesn't matter, but fall apart when it gets tricky.

The main benefit of dedicated, encapsulated types is that they can preserve their invariants. Without that each function operating on "data" has to check that those invariants are kept. It's fine in trivial cases e.g. where this validation boils down to checking if a value is present or not - which might even be supported by the language. But I don't see this working in more complex cases.

In my experience many (most?) bugs are caused by the implementation assuming something incorrect about the shape of the data and I feel like this principle only makes it more likely to happen.
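For a concrete (made-up) example: an encapsulated constructor can check an invariant once at the boundary, whereas with bare data every function either re-checks it or silently assumes it.

    // Encapsulated: the invariant (start <= end) is enforced at construction.
    function createDateRange(start, end) {
      if (start > end) throw new RangeError("start must not exceed end");
      return Object.freeze({ start, end });
    }

    // Bare data: this quietly returns a negative duration if the invariant
    // was never enforced upstream.
    function durationMs(range) {
      return range.end - range.start;
    }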


You can still make helper methods that get the field data for a person from the Person class if you really want. It's not the end of the world, but you lose the benefits when you're operating on a single large object in that style.

The reason you would use this setup is so you have all your data in contiguous arrays you can SIMD through quickly. If you're doing a lot of operations where n is low, then it's not that useful (and even detrimental) to orient your data array-wise.


> for a complicated domain object with many other composed objects this starts to be a pain

It may be that complicated domain objects with many other composed objects are not compatible with this style of programming.

Whether or not that's a good thing could be the subject of an interesting discussion. For one thing, negative experiences with the "large complex classes" approach to domain modeling are a major factor in the backlash against object-oriented programming. Lately I've been trying to familiarize myself more with the early literature of OOP, and I'm discovering that even OOP's early pioneers had bad experiences with it and would warn against the temptation to do things that way. OTOH, there's undeniably a certain attractiveness to it, otherwise it wouldn't be so common.


It's a good idea to write a book about a data-oriented development style (I'm working on a methodology, a way of working in data-intensive projects called Data2Value).

However, JavaScript and Clojure are not ideal for demonstrating this methodology, in the sense that industrial applications are more likely to be built in C++, Java or Python. For example, C++ and Java support Apache UIMA, an industry standard for data-oriented systems. UIMA (originally developed at IBM before being open sourced, and used in their Watson system) manages data as immutable objects (e.g. text, videos) that are enriched with annotations (e.g. syntax graphs, topical tags, subtitles...).

Functional designs are often well-suited to data flow related processing, whereas in OOP, you end up with a pipeline object and various DataStream objects that it inputs and outputs.

In my experience, data-intensive systems often need:

- distributed processing due to large scale (which calls for Apache Spark, and then PySpark or Scala);

- compute-intensive work like machine learning, which may require GPUs or other bespoke hardware (TensorFlow supports GPUs; Google's cloud has TPUs); and

- special-purpose data structures (e.g. Bloom filters, huge persistent graphs, R* trees...) specific to the nature of the processing (this latter point, I guess, contradicts the author's claims).


FYI, Clojure has supported GPU processing for about five years now.

https://github.com/uncomplicate


Data oriented programming is orthogonal to the type of biggish data tools you're talking about, although I agree that in the present day, the latter is actually a more interesting subject for methodological discussion.


I don't think these principles add up to something useful. It's not complete and I think some of the principles don't align that well to the problem space.

The big one: "Data is immutable". The problem here is that data isn't actually immutable (generally) and mutability isn't actually the problem. The problem is unmanaged references or other dependencies on the mutable data. The "source of truth" becomes muddled, which creates the problem of how to keep the various instances in sync (or otherwise handle cases where they are or become out of sync). Immutability is a very useful tool, since you can have any kind of dependency (direct, indirect, implied, etc.) and there's no worry. But you still need a mechanism to manage mutating data. Maybe it comes out somewhere, but the principles of DO don't cover it, which is a rather serious omission.

Also, principle 2 isn't really a principle. I think what it's getting at is that you don't really know the precise type of your data, over time, in a distributed system, so it's good to include the flexibility to handle that. That makes sense to me, but generic data structures aren't necessarily always the right way to handle that.


Immutability here is not about whether something can change; it has more to do with the identity of a value. Every time you change anything inside the data structure you get a new reference; that is it!


Think of it as versioning for your data. Instead of referring to some data structure by its general concept, with immutable data structures we are a lot more specific about the identity a name represents: here the name represents a version.

This is useful because, for example, you stop having the "unmanaged references" you were talking about: since you are pointing to a version of the data and not the data itself, you can be sure of what you are talking about.


Also, remember that the promise here is that the data you are talking about won't suddenly change underneath you, not that the reference "is up to date".

It is not a solution for change over time; it is a solution for taking change over time out of the equation when we don't need to talk about it. With immutable data structures, when we talk about data we are talking about exactly that, and time is taken out of the picture. It's called immutable because you are now talking about facts, not the representation in time of those facts; since we are talking in versions, mutation simply does not make sense. The thing is, this data is a snapshot, so you are not guaranteed to have the most up-to-date snapshot. And that's OK, because here we deliberately take change over time out of the question in order to talk more precisely about the data. Tracking change over time is another story.

So, for example, you could have things like React: there you have snapshots of the world updating. When you talk about the data it is immutable, but then you change it and update the mutable variable where you keep track of change over time.
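A minimal sketch of that in plain JavaScript: every update yields a new frozen snapshot, and the only mutable thing is the variable that says which snapshot is "now".

    const v1 = Object.freeze({ count: 0 });
    const v2 = Object.freeze({ ...v1, count: v1.count + 1 }); // new version; v1 untouched

    let current = v1; // change over time lives here, not inside the data
    current = v2;

    console.log(v1.count, v2.count, current.count); // 0 1 1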


Honestly, I think this series of blog posts could have had a more complete title: "Principles of Data Oriented Programming (aka, Idiomatic Clojure)".

All of these ideas (and I think they are good ones) are inspired heavily by Rich Hickey's talks and the rationale behind developing the Clojure language (the author of the post states as much). And while you can use these techniques in other languages/paradigms/problem domains, they are really intended to work well inside the constructs of Clojure, and when applied to "information-driven situated programs" [0] (read: business applications with dynamic requirements).

As for some of the short-comings you mentioned:

"But you still need a mechanism to manage mutating data"

Clojure supports this through managed reference constructs like atoms. [1]
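For readers unfamiliar with atoms, here's a toy JavaScript analogue of the idea (Clojure's real implementation uses compare-and-swap; this sketch only shows the shape of the API):

    // A mutable reference holding immutable values, updated only by
    // applying a pure function to the current value.
    class Atom {
      #value;
      constructor(value) { this.#value = Object.freeze(value); }
      deref() { return this.#value; }
      swap(fn, ...args) {
        this.#value = Object.freeze(fn(this.#value, ...args));
        return this.#value;
      }
    }

    const state = new Atom({ count: 0 });
    state.swap(s => ({ ...s, count: s.count + 1 }));
    console.log(state.deref()); // { count: 1 }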

"I think what it's getting at is that you don't really know the precise type of your data, over time, in a distributed system, so it's good to include the flexibility to handle that. That makes sense to me. but generic data structures aren't necessarily always the right way to handle that."

Clojure attempts to bridge the gap between generic data-structures and strongly-typed constructs using run-time specifications. [2]

I mean, the ideas presented here can be generally useful, but your mileage may vary if the principles take you too far out of the idiomatic for your particular language/paradigm/problem domain. If that's the case, you could find yourself wasting energy swimming upstream.

[0] - https://www.youtube.com/watch?v=2V1FtfBDsLU [1] - https://clojure.org/reference/atoms [2] - https://clojure.org/about/spec


> Immutability is a very useful tool, since you can have any kind of dependency (direct, indirect, implied, etc.) and there's no worry. But you still need a mechanism to manage mutating data.

I agree. For me it feels like immutable data is a tool, not a universal principle. For example, in low-level C programming or direct control of a register in an embedded context, immutable data isn't a principle, it's just another approach.

What I mean by "universal" is something like SOLID. [1] (Most people only use SOLID for OOP, but Uncle Bob makes it clear in Clean Architecture that he thinks SOLID is universal, and not limited to OOP.)

That said: it does feel like immutable data should be the _default_ approach in many situations. But that's really hard to do in most languages.

[1] https://www.amazon.com/Clean-Architecture-Craftsmans-Softwar...


I think "data are immutable" should be qualified - usually it means "data are immutable through your codepaths" and if you are mutating data, you need explicit checkouts and checkins of the data.

This is the basic principle behind how sane database transactions work.

ORMs in some languages can be especially dangerous if they overload the getters/setters of the object in such a way that the checkins and checkouts are obscured. You could be passing your object to a function or method that expects to mutate a polymorphic class[0], i.e. the traditional "shared memory" form of objects. Hopefully you can imagine the chaos: redundant database transactions, consistency problems, failure modes, uncaught exceptions, etc., all of which will be a nightmare to debug.

[0] worse yet, imagine if it's someone else's code and they change the API from not mutating to mutating for performance reasons. Will you notice the documentation change or the changelog? It's bad enough when it's not an ORM and just a mutable object.


> I don't think these principles add up to something useful. It's not complete and I think some of the principles don't align that well to the problem space.

These are part of a book that is being written at the moment, as far as I can tell:

"This article is an excerpt from my upcoming book about Data Oriented Programming. The book will be published by Manning, once it is completed (hopefully in 2021). "


I think the principles do basically summarize data oriented programming. In particular, principle 1 is important, since it is the exact opposite of OOP's insistence on encapsulation, and is possibly the biggest reason DOP works so much better than OOP.


Evidence that "DOP works so much better than OOP" is scant imo.

The rise and fall of paradigms that present themselves as panaceas is instructive. You have "structured programming", "object oriented programming", "functional programming" and now "data oriented programming".

What I'd like to see is paradigms paired with "where this works well" rather than paradigms sold on "this will solve the software crisis", "this works better (unqualified by when)", and "if you're not doing this, you're doing it wrong". The latter two claims leave a bad taste in the mouth of casual users/observers, who gradually morph into active critics and sink the paradigm, wasting the good and useful parts of each (OOP has a vast universe of haters because it's the most successful of the panacea-paradigms so far, and as a panacea there's much to hate, but still).


That's because data oriented programming has already succeeded on a large scale; you just might not recognize it.

SQL is about as data-oriented as it gets: you have a programming model that's constrained and focused on data layout, structure and performance over being a general-purpose language. There was a good article a while back about ECS, which I can't find, that talks about how most performant ECS implementations start to mirror SQL and other row-based data engines.


It has succeeded but it has also failed. Using a shared database for many different functions/applications can be a disaster that eventually seizes up. Everyone is afraid to change the data structures because they don't know what it will break. It's a huge piece of global state with no isolation. Enter encapsulation and service orientation and bounded contexts.

So the parent is correct. It's not enough to say data orientation is a good thing. It needs to be compared with previous approaches (encapsulation) and then explain what the tradeoffs are.


What is ECS in this context?


Never mind: entity component system. I knew that.


It's not so much a rise and fall as a refinement.

You still use if/then/else and repetition and subfunctions in OOP. But you restrict and replace some of their previous usage.

Same for functional programming: you take OOP but remove all the side effects. You still have objects which encapsulate data that can only be accessed in certain ways and can behave polymorphically. Mind that inheritance was never really part of OOP in general.

Now, about DOP: I'm not sure what exactly it adds on top of FP, or whether it is even an established and well-defined term. It doesn't look like it to me; it looks more like a bundle of recommendations right now.


You're right about the sync and dependency issue. A big reason why in JS land you treat each mutation of the data as immutable destruction and refresh of an object is to eliminate any old references that might not have been cleaned up by the garbage collector.

Data Oriented Programming is both old and new, in that it does not have the same wealth of programming patterns that OOP has. It is a more bare-metal means of programming, without a ton of abstraction to ease most programmers into it.

Where I find the idea interesting is that concurrent and parallel processes are more natural in the data-oriented style, and that is through immutability and ownership as first principles.


You realize data is immutable when you first try to implement history.

Mutability is just a hack to save some memory.


I think when you start pulling on that string you eventually end up at event sourcing, because once you are keeping a comprehensive immutable history of all your changes, keeping a mutable record anywhere just starts to look like a liability.


Yes, and it is incredibly comfortable to work in a properly tooled event-sourced codebase.


> just a hack to save some memory

.. and time. If we needed to compute an account balance by summing all the debits and credits since the account was opened...


Persistent data structures already solved this issue.


You don't need mutability to save on time, you just need to use referential transparency effectively when calculating your projections.


This seems to be describing a style of programming, and you'll have to take these principles in the spirit that they were intended, i.e. in the context of that programming style. I recognize this style as a common style in OOP/FP languages such as Scala, where it's common to pass around immutable Plain Old Scala Objects (made from lists and hashmaps).


I was a self-taught programmer, and when I first learned to program I found the concept of a 'reference' very counter-intuitive. I assume people with no CS background may feel the same: why are this 5 and another 5 the same, but not two people with the same name and social security number?

Traditionally people use mutable structs or objects for non-primitive data structures, which implicitly allocate addresses in computer memory. That is a very implementation-driven approach; more specifically, a very Von Neumann approach.

For example, many systems have entities in their databases, like Person { id, name }. But if there's no database, in many programming languages people would write Person { name } only. If people write the in-memory version first, they have to rewrite their models. Systems with mutable references inconsistently force the referencing system on people by default, instead of giving people only what they ask for.

In modern days, a lot of computing goes beyond a single computer, there are many distributed systems, a lot of serialization is going on, and the reference in one computer's memory doesn't mean anything to another.

If we try to loosen the definition of a computer, we can think of many systems as bizarre computers. If we look at data processing systems like Spark or Flink, conceptually they also look like computers: bunches of hardware with 'operating systems' sitting on top of them, but in the form of clusters instead of single machines. In such cases there is no shared memory, and the memories are implementation details: users aren't aware of them the way they have to be aware of the underlying memory system in a Von Neumann computer. In systems of this kind, people only care about the computation, which is more substitution-based than mutable-reference-based; at most they need to be aware of nodes as a whole in the cluster.

It's close to the 'substitution model' vs the 'environment model', or the 'functional' vs the 'object-oriented' styles described in SICP. I find the 'substitution model' conceptually more fundamental, and simpler: it's basically elementary or middle school maths.

> The problem here is that data isn't actually immutable (generally) and mutability isn't actually the problem. The problem is unmanaged references or other dependencies on the mutable data.

So another novel way to look at it: if you don't have mutable references, you don't have to manage them, and the problem is eliminated as a whole. Like Haskell, Erlang, or Spark, as we just mentioned. At the end of the day you can have references in outer worlds, be it ST monads, processes, databases, or just our real world; sometimes the processing part of the program doesn't have to care about them.


> One could argue that the complexity of the system where code and data are mixed is due to a bad design, and that an experienced OO developer would have designed a simpler system, leveraging smart design patterns.

Indeed he (or she) would have made use of traits/protocols/categories/whatever, to separate behavior from data, while keeping the design extensible (via polymorphism).

This is something I usually find in OOP critiques: too much focus on class-driven implementations, without spending much time on the other parts of the toolbox.


The other parts exist in one form or another in non-OOP languages too. Heck, polymorphism is part of type theory, and traits exist in OCaml and Haskell... But arguing about what is and isn't OOP isn't that productive, as no one will agree on any definition. That's why you'll get gut responses about lasagna code where the layering glue is more complex and of bigger proportions than the algorithm itself...


Which is why I find the whole OOP vs FP vs ECS vs ADT discussion a complete waste of time; better to embrace all these ideas as part of the multi-paradigm toolbox that most mainstream languages actually are.


To a point I agree, but formal verification is a thing and wouldn’t handle a normal mainstream language as we know it.


Alan Kay and Rich Hickey discuss the merits of data vs data with interpreters:

https://news.ycombinator.com/item?id=11945722


Wow, Alan Kay is being really difficult in that discussion.


Is it me or does anyone else confuse the term “Data Oriented Programming” with “Data Oriented Design” [1]?

[1]: https://youtu.be/rX0ItVEVjHc


You're not the only one. That's what first came to my mind when I saw this.


>>> Model entities with generic data structures

Which breeds data oriented "anti-patterns" when i/o performance becomes the bottleneck. Focus on hardware. It's almost like you need to work backwards to build scalable algorithms for modern data loads ;)

Scalable Machine Learning & Graph Mining via Virtual Memory

http://poloclub.gatech.edu/mmap/


I've always liked the idea of "table oriented programming" where more detailed schema info is used to do most of the CRUD and UI work. In my experiments, the tricky part is exceptions to the patterns. You always need to be able to tweak things imperatively (via code). But the attributes can still do roughly 90% of the job.

My latest approach to getting enough tweakability is what I tentatively call "fractal rendering events" or "staged rendering". When rendering HTML or SQL, you need event "hooks" for the different stages:

- Level 1 events may override/alter field attributes.

- Level 2 events may override/alter the HTML (or SQL) generated for the field, based on the Level 1 values.

- Level 3 events may override/alter the HTML of page sections (or entire SQL clauses).

- Level 4 is overriding/altering the entire page (or the final SQL statement).

In other words, the schema provides drafts, which can then be adjusted along the way through event hooks; the granularity of what's tweaked goes up with each stage. A sketch follows below.
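A rough JavaScript sketch of these stages, with all names invented for illustration; each level receives the draft from the previous one and may override it:

    const hooks = {
      field:     attrs          => ({ ...attrs, label: attrs.label.toUpperCase() }), // level 1
      fieldHtml: (attrs, html)  => html,                                             // level 2
      section:   (name, html)   => html,                                             // level 3
      page:      html           => html + "<!-- audited -->",                        // level 4
    };

    function render(schema) {
      let page = "";
      for (const section of schema.sections) {
        let sectionHtml = "";
        for (let field of section.fields) {
          field = hooks.field(field);                     // level 1: tweak attributes
          const draft = `<label>${field.label}</label>`;  // draft generated from schema
          sectionHtml += hooks.fieldHtml(field, draft);   // level 2: tweak field HTML
        }
        page += hooks.section(section.name, sectionHtml); // level 3: tweak section HTML
      }
      return hooks.page(page);                            // level 4: tweak whole page
    }

    render({ sections: [ { name: "main", fields: [ { label: "Name" } ] } ] });
    // => "<label>NAME</label><!-- audited -->"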

But managing that many potential events needs something more powerful than a file-based system. It may be better to manage such source code in an RDBMS, so you can search, sort, and group by different factors at different times rather than hard-wiring in one viewpoint the way file systems do.

Current IDEs are not ready for this, though I do believe it's the future. File trees are too limiting.

Consider this: it's common for a non-coding analyst to want to change a field label, page title, max field length, or "required" status. If they could do it in the schema info (data dictionary), then they don't have to involve the coders. Whether the data dictionary is referenced directly or generates scaffolded code is a stack-specific or shop-specific choice. Minor things like this shouldn't involve a lot of effort.


You know, sometimes I wish programming languages had a better distinction between "immutable data pieces" and "stateful agents". Sometimes it's really nice to have a simple struct that you (or anyone else) can write many functions to act upon. Sometimes it's really nice to have an opaque "object" whose methods are described in some public interface, but which encapsulates loads of state and other objects inside it so you don't have to think about all of that.


C# will be getting records in .NET 5, which would be analogous to what you refer to as 'immutable data pieces', in addition to the normal objects the language has always had. https://devblogs.microsoft.com/dotnet/announcing-net-5-0-rc-...


This is exactly the reason C++ was created. It has structs and classes. Technically structs and classes differ only in their default member visibility, but conventionally a struct represents some kind of plain data object while a class represents something with encapsulation and state.


This is probably an imperfect solution, but isn't this kind of what Actors provide? Syntax and concurrency aside, passing a message is equivalent to calling a method.

The biggest annoyance I can foresee is discoverability. Object-oriented classes make it pretty clear what a given object's interface is (just look at its public methods). This is not true of actors in the languages I've used (not many), but in theory it should be a fairly trivial question of syntax.


The "stateful agents" is what OOP is all about. The objects are interpreters. I referenced a discussion elsewhere in this thread between Alan Kay and Rich Hickey where Kay talks about the value of interpreters. Rich Hickey talks about data/values coming from a seismometer or an IoT device but what he misses there I think is that it's not all just "data". There are also stateful agents. The seismometer (or the earthquake) is a stateful agent and the data is messages that are sent by it. If we had to model a seismometer or iot device (e.g. a smoke alarm) it would be best to do so using an object that encapsulates it's state and manages itself. It only communicates to the outside world with messages/data (in this case a sound when the temperature exceeds some internal state). I can replace my smoke alarm with another and I don't need to understand anything about its internal data or how it interprets it.

But for dealing with the historical data from an IoT device it may make sense to use a data oriented/functional approach. That time series data is not stateful, it's an immutable history of something stateful. Functions/transformations work best there usually.


Right. And the problem is, quite often those "data/values" turn out to be objects too, for no good reason. Take the whole Active Record approach, for example, re-implemented in Java in the most straightforward way possible: you have a class with data fields (well, with getters/setters), but it also has "Save()/Load()" methods on it. Ugh.


So like an Erlang map and a GenServer?


This reminds me a bit of some of the ideas that Eric Normand presents in his book, Grokking Simplicity.

Which I'd highly recommend. It's aimed at a less experienced audience, so, as someone who's mid-career, I admit I did skim some sections. But, all-in-all, I enjoyed reading his take on how things should be done.


I know that there are some common topics with Eric's books. Do you think Eric focuses as much as I do on data?


No, he's much more focused on what he calls actions and calculations.

Perhaps overly so. I'd have liked a bit more focus on data. That could be a by-product of his Clojurist roots. Data-oriented programming is so integral to Clojure's culture that I'm not sure Clojurists even realize they're doing it half the time.


> Data-oriented programming is so integral to Clojure's culture that I'm not sure Clojurists even realize they're doing it half the time.

I totally agree with you. I am a Clojurist, and my hope is that my DOP book will spread the "light" to non-Clojurists.


Clojure-inspired concepts? An upvote from me, despite the hide-out in other languages :)


To principle #2:

The author acknowledges this as being incompatible with static typing, but I'm not so sure. Is it that, or is it that it's incompatible with contemporary static languages?

In FP, we already have a concept of type-safe heterogeneous lists and maps, and even some clever implementations in languages like Java[1]. The ergonomics are often less-than-stellar, but I'm pretty sure that's something a new language could fix with some syntactic sugar.

There is also the data frame abstraction (like, you see in Pandas), which is typically implemented on top of dynamic typing, but major implementations often rely on static typing behind the scenes to achieve efficiency. There are also projects like Frameless[2], which implements a statically typed interface over a dynamically typed dataframe package.[3] I'm guessing, again, that careful language design could get us something similar, but with better ergonomics.

And I'd be happy with that. I've been pulling away from static languages lately, and a big part of that is that I really like how some of the dynamic languages let me model my data as data. It's enough of a complexity saver to feel like a net win, even at the cost of some performance and static verification.

[1] For example: https://github.com/palatable/lambda#hlist

[2] https://github.com/typelevel/frameless

[3] Which is itself, notably, implemented in a static language. Spark also has a statically typed version of the API, but its usage is not recommended for several reasons, one of which is performance. That's something that all us static typing fans should really stop to think about for a bit.


I think a TypeScript style structural type system could work.
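A small sketch of how that might look, reusing the example from upthread (plain TypeScript, nothing DOP-specific):

    type Author = { firstName: string; lastName: string; books: number };

    // Structural typing: any value with at least these two fields is accepted.
    function fullName(data: { firstName: string; lastName: string }): string {
      return `${data.firstName} ${data.lastName}`;
    }

    const author: Author = { firstName: "Isaac", lastName: "Asimov", books: 500 };
    fullName(author); // OK: Author has the required shape

    // fullName({ firstName: "Elmond", not: "really" }); // compile-time error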


Be careful because there are perf issues if you are using parametric polymorphism. Monomorphic functions are preferable but the real problems occur once you get past the inline cache's maximum number of 'shapes'. This obviously applies to regular JS as well.


Perhaps I misunderstand, but wouldn't that only apply if you attach functions to objects? I suppose you wouldn't do that if you follow data oriented programming principles.


Nope. It's about the different types passed into a function parameter. For optimization, JS engines look at the differing shapes of JS objects, where a shape is determined by the order and type of each member.

So:

    { a: 6, b: 7 }
    { b: 7, a: 6 }
    { a: "six", b: "seven" }

Each of these has a different shape: the first two differ in member order, the third in member types.
This is done so members can be looked up quickly by offset. Functions then have an inline cache that stores the shapes the function has seen. If the function is called monomorphically, it will only ever see one shape and hit the fastest path. If it is called with up to three shapes (in V8), it will be pretty quick. Once past three, the cache falls through to a global table and is dog slow.

This matters if you are using structural typing, as you are still creating different shapes to be passed into the same function(s).


I see. Thanks for explaining. That's very good to know!


None of these really apply in this case. In data-oriented programming, the data would not be stored in heterogeneous lists; each element would be stored in a separate, homogeneous array, and you would tie these together using the same entity ID.
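A minimal sketch of that layout in JavaScript (names invented): each component is a homogeneous array, and the entity ID is just the shared index.

    const N = 1024;
    const position = { x: new Float32Array(N), y: new Float32Array(N) };
    const velocity = { x: new Float32Array(N), y: new Float32Array(N) };

    function integrate(count, dt) {
      // One contiguous, cache-friendly sweep per component pair.
      for (let id = 0; id < count; id++) {
        position.x[id] += velocity.x[id] * dt;
        position.y[id] += velocity.y[id] * dt;
      }
    }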


That is not quite what the article is talking about.

There's a bit of a terminology collision here. What the article is talking about is not Data-Oriented Design, the practice of organizing your data for efficient processing. It's proposing a separate (but not incompatible) concept of organizing your code for easier maintainability. For example, the sample code (which appears to be JavaScript) is not doing ECS at all. It's almost exclusively doing the data modeling by creating one heterogeneous map per entity.


Even with this type of design, it's possible to implement it in a typesafe way. I have seen clever ECS systems accomplish this.


Maybe but ECS does in some sense circumvent types.


In some sense. You could build a type system around "has a <foo>" relationships instead of "is a <foo>" and get something pretty cohesive out of it that closely aligns with ECS.


What's the solution to referring to other "objects" in this setup? For example, an Author has a list of Books that they wrote. I see several possible ways to do this, but they all seem to have downsides.

1) The author map has a "books" entry that is a list of maps containing the book data. But multiple authors might have written the same book, so does the book data get copied there?

2) The author contains a "books" entry that is a list of book IDs, like foreign keys in a database (see the sketch below). But what about immutability? If you want to get the book addressed by an ID, you need to look it up somewhere, and are you guaranteed the object you get from the lookup is still the same one as it was earlier?

3) There is an "Author -> Book" map somewhere to which you can pass an author and it gives you the list of books they wrote. Not sure about this one.
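To sketch option 2 in JavaScript (names invented): IDs act as foreign keys, and lookups are resolved against a specific immutable snapshot, so the immutability question answers itself; you always resolve against the snapshot you hold.

    const db = Object.freeze({
      books:   { b1: { id: "b1", title: "Foundation" } },
      authors: { a1: { id: "a1", name: "Isaac Asimov", bookIds: ["b1"] } },
    });

    const booksOf = (db, author) => author.bookIds.map(id => db.books[id]);

    console.log(booksOf(db, db.authors.a1)); // [ { id: "b1", title: "Foundation" } ]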


It is going to be the theme of Chapter 3 of my DOP book. Stay tuned.


The article sounds like an advertisement for Elm. Immutability and value semantics are enforced by the language, and the type system is rich enough to allow the use of generic data structures while being statically typed. As a functional language, Elm has closures, allowing one to violate the separation of data and code and get non-literal data types, but this is strongly discouraged by the documentation, the runtime (aka the Elm architecture) and the tools.


We certainly use some of these principles in OrgPad.com, and some of them inspired even the user experience in a fundamental way. E.g. a bullet-point list in a linear medium, such as a text or a slide in a presentation, is like a star in a graph, where all children have the same weight. The thing is, when people see it like that graphically, they sometimes get ideas they wouldn't have had if they had stared at a long text. Sometimes they figure out that the points are actually not equally weighted, or that there isn't such a clear boundary, and they connect some of these children together, either by a link or by selecting the same colour to group them.

Btw. we program everything in Clojure + ClojureScript, so immutability and the other points are like preaching to the choir.

Not related: I thought Manning would not publish the book; at least that is the last information I saw a few days ago. I had thought about buying it.


My original book with Manning, about Clojure, was abandoned. Now I am writing a book about Data Oriented Programming.


This talk kind of alludes to the data driven stuff at the end: https://youtu.be/vK1DazRK_a0?t=774

It's a shame the code examples are just the FP and OOP solutions, not how it could look in a data-oriented way.


In his refactoring of the JS game, I was with him until about 80% of the way through, but in the change he describes from about 50:50 - 51:00 he's actually changing the meaning of the game. Instead of printing after each turn, he runs all the turns and then prints them all out at once. This is a very different behaviour, and I think it glosses over / cops out of the difficulties of applying FP in side-effectful codebases.

I mean, in this case he's transformed the program from running in a constant amount of memory (you only need as much memory as it takes to run a turn) to having to store the entire history in memory. So this would be really bad if there were many turns, or if the program were going to run indefinitely, responding to user input, etc.


...or simulating the world every 16 ms as is typical in a "real" game.

You can bridge that gap by taking periodic snapshots of the reduced state, which is a useful pattern in distributed game development where you sometimes have to back-track and re-simulate when input arrives over the network.
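A sketch of that pattern in JavaScript, assuming some pure `applyTurn(state, turn)` transition function exists: fold over the inputs, but only emit a snapshot every N steps, so memory stays bounded and re-simulation can restart from the latest snapshot.

    function simulate(initial, turns, snapshotEvery, onSnapshot) {
      let state = initial;
      turns.forEach((turn, i) => {
        state = applyTurn(state, turn); // pure transition, assumed given
        if ((i + 1) % snapshotEvery === 0) onSnapshot(state, i + 1);
      });
      return state;
    }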


In the context of C# this becomes a real pain.

The issue is arrays/lists.

In C#, an int[] array or a List<int> is a reference type.

So even if you put an int[] in a struct, the int[] will NOT be copied when you assign one instance of the struct to another: you get a shared reference. This is annoying, since you cannot do a proper deep copy with the language itself; you are forced into a reference copy instead of a deep copy.

In C++ this is easy, because the types differentiate between pointer and non-pointer explicitly, plus you can overload the assignment operator.



So... is Data Oriented something new?

Storing fields in a map leads me to believe this is not Data-Oriented Design (DOD). And I completely reject this idea (fields in maps). The "flexibility" there is hardly useful, and could be achieved with defined shapes (types) in modern statically typed languages without all the downsides.

"Separate code from data" is a big core belief I share with this article, but the rest doesn't seem like a good / novel / important idea.


This is not Data-Oriented Design, far from it. I find it confusing that the term used here is "Data Oriented Programming".



> Data never changes, but we have the possibility to create a new version of the data.

Well, it depends on what you mean by data. To avoid ambiguity it is better to talk about data values and data objects, which have different properties. This can be formalized as follows [1]:

- data values are modelled via mathematical tuples; tuples are immutable

- data objects are modelled via mathematical functions (one field is a function from this object's reference to the field value); functions are supposed to be mutable

(In reality, of course, we meet quite different situations; for example, structs are mutable and objects can be immutable.)

[1] Concept-oriented model: Modeling and processing data using functions https://www.researchgate.net/publication/337336089_Concept-o...


What do you mean by "functions are supposed to be mutable"?

Perhaps you are just pointing out that the output of the function (and therefore the value of the field...?) will change as the input changes?

If mathematical tuples are immutable, then surely mathematical functions are immutable as well ;)


Here is one possible implementation of the concept-oriented model of data for data processing. It heavily relies on functions and operations with functions and is an alternative to purely set-oriented approaches like map-reduce or join-groupby (sql):

https://github.com/prostodata/prosto - Functions matter!


A function is a mapping between two sets (of values). This mapping between values is mutable, although the values are not.


Functions are a mapping between a domain and a codomain; the mapping absolutely isn't mutable, because the definition of the function is the relationship between the domains.

If I have a function:

    int Add1(int x) => x + 1
I would expect the domain and codomain to be immutable; I would also expect x + 1 not to turn into x / 2 randomly.


> the mapping absolutely isn’t mutable

Assume f: X -> Y. We can map x_1 to y_1: f(x_1) = y_1. And then change this same function by mapping x_1 to y_2: f(x_1) = y_2. Thus we can easily modify functions. Moreover, we do it constantly when we modify object fields in OOP. It is probably easier to comprehend if the function is represented as a table which we modify.

In contrast, we cannot modify data values (mathematical tuples). Say, x = 42 + 1 means that a new value 43 is created rather than the existing value 42 being modified.

> I would expect the domain and codomain to be immutable;

No. Domains, codomains and any set can well be modified by adding or removing tuples. What is immutable are values (in the sets).


> Assume f: X -> Y. We can now map x_1 to y_1 f(x_1)=y_1. And then change this same function by mapping x_1 to y_2: f(x_1)=y_2

They would be different functions, the first being the identity function: x => x, the second being: x => x + 1

> Thus we can easily modify functions. Moreover, we do it constantly when we modify object fields in OOP

This isn't the case. A field with a different value in it just means the object is a different value. If the object is passed to a static function, then the domain is the full set of possible values that the object can hold (this is known as a product-type, you multiply the total possible values of each of its component parts to find out the size of the domain).

If it's passed to a method then there's an additional implicit argument: `this`, which is the same as a static function with an additional argument that takes the object. The function is the same.

Global (or even free variables) should also be considered part of the domain: i.e. it's akin to implicit arguments that are being passed to the function.

> No. Domains, codomains and any set can well be modified by adding or removing tuples.

This also isn't the case. If a function is defined that takes an integer and returns a boolean value: Int → Bool then the domain is the set of integers, the co-domain is True and False. You can't pass a tuple to a function that takes an Int and therefore dynamically increase the size of the domain. Even in dynamic languages the codomain is effectively `top`, the type that holds all values, and therefore the domain is all values and the codomain is all values, which makes them immutable still.

Now maybe I am misunderstanding you, but this is how all of the mainstream statically and dynamically typed languages work. Perhaps there's some edge-case language that I'm missing here that allows types to be extended, which would be interesting in its own right.


Can you expand upon this? Perhaps the difference between "re-mapping" the function:

    f(x_1)=y_2
and "re-mapping" the value:

    x=42+2
How is the former different than the latter? And by what mechanism is the former achieved? I understand what you are saying, but how does one simply "change this same function"? Redefine it?

To be clear, I'm not suggesting you are incorrect. I just don't fully understand what you are getting at.


Functions might be isomorphic to one another; that doesn't make the function itself mutable.


In the third example, is the function isProlific missing a parameter?


Yes. It should be

```
function isProlific(data) {
  return data.books > 100;
}
```


I was hoping to find many lower level topics like caches, TLBs, locality of reference etc. but it seems the data oriented-ness implied by the author is different.


What's the general opinion on using maps instead of structs/simple objects as the data containers, in languages that allow either?


If you know the shape of the data you're dealing with ahead of time, you might as well use structs and reap the benefits of type safety and improved performance.


I think for dynamically typed languages like Clojure you lose nothing. For typed languages that don't support parametric polymorphism, you trade type safety for flexibility and code simplicity. For typed languages that do support parametric polymorphism, I don't see the advantage.


Very interesting as an introduction. I think these principles should be easy to follow using something like Rust.


The examples about code reuse from the first article demonstrate the power of row polymorphism. But in a statically typed language that requires a rather advanced type system, one that allows declaring, explicitly or implicitly, that code works for any struct containing the given fields. In C++ one can use templates, but that trivially leads to unmaintainable code.


Wouldn’t this be the exact opposite of Domain-driven design and modeling behavior?

Isn’t the behavior of a system more critical than its data elements?


Reminds me of the quote:

> Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious. -- Fred Brooks, The Mythical Man-Month (1975)

I don't think they're inherently contradictory. You can have plain data objects representing the domain, and functions that act on these objects representing the behaviors/actions in the domain. You could include these functions as part of the "class" for these objects, and have them return new instances of the class to maintain immutability.
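A small TypeScript sketch of that suggestion (illustrative names): the behavior lives on the class, but methods return new instances instead of mutating.

    class Account {
      constructor(readonly id: string, readonly balance: number) {}

      deposit(amount: number): Account {
        return new Account(this.id, this.balance + amount); // new instance, no mutation
      }
    }

    const a = new Account("a1", 100);
    const b = a.deposit(50);
    console.log(a.balance, b.balance); // 100 150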


That evades my point. To a business person, the things they do are more important than data structures. If you try to talk to a subject matter expert, they're going to talk about how their portion of a business works, not list all of the data within their system. The only caveat is if you're talking to a person that's focused on reporting and analytics...then they will be more data focused. But businesses react to analytics, they don't run on it. Operationally, there are a set of events that happen logically from start to finish. You could do all of the data governance you want and if you miss those events and their sequence and impact, you wouldn't know what business you're in.

Data accuracy is critical. Data storage is critical. Analytics and warehousing are critical. But none of these things reflect the nature of a business, which are the behaviors of its domain and sub-domains.


this is anti-OOP


Yeah, it's great


Simple read, good points. This is very useful in event sourcing, or in general for sensitive data where mutability might become an issue (e.g. financial transactions). Also, I think this maps to value objects perfectly.

I have implemented ES in a few projects and am now writing an ES library, since I think it can be done much more simply than my previous implementations, which felt too verbose; I will take some pointers from this into account.


Could you share an example of how to apply the principles of DO in the context of event sourcing?


An entity component system (ECS), as used by many games, would probably fall under DO.

Though I would argue that #5 isn't part of DO, especially given that e.g. `{ "a" : "b" }` is not necessarily a literal but potentially an expression (depending on arbitrary language-definition details). On the other hand, e.g. `vec![ "a" ]` in Rust is by definition not a literal, but with respect to the idea behind principle #5 it is as good as `[ "a" ]` in JavaScript.

EDIT: Lastly, in some contexts something like a builder pattern can be the better way to get "non-verbose creation" and "data is explorable in any context".


I think DOD of games is only tangentially related to this post.

https://en.wikipedia.org/wiki/Data-oriented_design


The "DOD of games" is just a "special case" of more generic Data Orientated design/programming.

E.g. ECS is a direct consequence of seperating data from code.

Furthermore for it to work well you normally also want #2, #3 and #4.

Sure you can build a ECS without #2,#3 and #4 but it makes it more complex.

Lastly in a ECS you split up components into many parts each having their own data and you normally want the idea behind #5 to apply to each of the parts.

EDIT: Well ok, weather #2 makes any sense at all depends on the language you use. And using a language where #2 makes no sense can be a as reasonable choice. I only would apply #2 IF it makes sense for you language of choice.


What do you mean when you write that `{ "a" : "b" }` is not necessary a literal but potentially an expression?


Depending on how a language defines its syntax.

Often literals are only things like `"string"`, `0` and so on.

But things like `[ 1, 2 ]` would be an array expression, where each "entry" is, syntax-wise, an expression (and any literal is an expression itself).

I don't know if JavaScript specifically defines object literals or object expressions; in the end it depends on what you define a literal as.

Lastly, depending on the language, something like `[ 1, 2 ]` might literally de-sugar to something like the following pseudo-code: `var tmp = Array.new(capacity=2); tmp.push(1); tmp.push(2)`.

So a better formulation would be: "data should be creatable without explicitly doing any function calls, variable assignments or similar. Creation must not depend on implicitly captured data".


I totally agree with "data should be creatable without explicitly doing any function calls, variable assignments or similar."

Could you explain what you mean by "Creation must not depend on implicitly captured data"?


Let's say you have a language which has some form of macro system or similar, to make it easier to create new things that look like literal-like expressions.

E.g. instead of `a = [1,2]` you have `a = vec![1,2]`, which de-sugars to `a = Vec::with_capacity(2); a.push(1); a.push(2);`.

Now custom data structures can define their own macros like that, e.g. `skip_list![1,2]`.

Which would be all fine. But what if `bad_skip_list![1,2]` accesses a thread-local variable (or other implicitly provided data) and adds that, too?

Now `bad_skip_list![1,2]` might not be equal to `bad_skip_list![1,2]` defined somewhere else, which is against the idea behind rule #5.

You should be able to copy-paste the literal-like creation of data to any place (e.g. a unit test) and get the same result.

EDIT: If I remember correctly, you could override parts of `Array.prototype` and array construction in JavaScript to break #5 for things like `[1,2,3]`, but I'm not too sure about that anymore.


It ain't pretty, but if it works it works.


Great post.


Cheekily resubmitted, I see! Not that I mind. I think it's a great idea that deserves sharing.

https://news.ycombinator.com/item?id=24682380#24685657


How did it open a second HN thread?


The URLs differ. This one includes "?essence".


We should rename object-oriented programming to bureaucrat-oriented programming. I always think that the reasoning which led to the development of the aberration that is OO is the same reasoning that creates bureaucratic nonsense: the desire to make people replaceable through bureaucracy, so that the programmer as a human being can be removed from the picture, plus all the other bureaucratic-thinking nonsense that leaked into the design of the language. It's funny how ridiculous we are, pretending something all the time.



