Let's Take a Trivial Problem and Make it Hard

sethg · on May 4, 2009

The Haskell Data.Array.Diff module provides arrays that use mutation behind the scenes, so that updates are O(1), but provide a pure functional interface. So if "c" is the 256-element array of byte counts, and "i" is the index of the element you want to increment, then you can do this:

let c' = c // [(i, c!i + 1)]

To quote the documentation: "When the // operator is applied to a diff array, its contents are physically updated in place. The old array silently changes its representation without changing the visible behavior: it stores a link to the new current array along with the difference to be applied to get the old contents."

For more information: http://www.haskell.org/ghc/dist/current/docs/libraries/array...

I don't know if something similar exists for Erlang, but if not, I assume it wouldn't be terribly hard to implement.

old-gregg · on May 4, 2009

Can you implement your own Data.Array.Diff in Haskell or it's just a standardized fallback to C? I'm on chapter 6 of 'Real World Haskell' and I honestly don't know yet.

sethg · on May 4, 2009

It appears, from skimming the source code, that they use "unsafePerformIO", which is Haskell's "I want to trick the compiler into thinking that this imperative code is pure" function. (It's not part of standard Haskell but I think all implementations provide it.)

You can use unsafePerformIO in your own code if you really need it--it's basically a giant loophole in the type system, so if you don't know what you're doing, you can screw yourself, but if you really need that power, it's there.

More here: http://www.haskell.org/haskellwiki/IO_inside#Dark_side_of_IO...

silentbicycle · on May 4, 2009

That's a problem for which imperative solution has a very direct fit. If you want to do everything in a purely-functional manner, some stuff is just going to be awkward or hard to do efficiently, period. There are other cases where the functional solution is ideal, and trying to solve it imperatively becomes clumsy and bug-ridden. A sufficiently large project will probably have at least a few of each (or a small implementation of half of Prolog, etc.).

This is why using a multiparadigm language (like OCaml or the various Lisps) or an FFI to work in a few complementary languages is often a practical approach. Working solely in a pure-(one thing) language often involves over-committing to a specific trade-off.

Edit: Or, yes, the imperative sub-language embedded in the state monad.

eru · on May 4, 2009

Or Haskell and the StateMonad, if you need imperativeness. (Though I do not advocate it for this problem.)

However, functional solutions to this specific problem are not cumbersome at all. See for example this snippet of literate Haskell. It assumes that you have already put your binary data in a list of chars and return a Map (similary to a Python dict) of frequencies:

  > import qualified Data.Map as M
  > freqMap :: [Char] -> M.Map Char Int
  > freqMap list = foldr op M.empty list
  >     where op :: Char -> M.Map Char Int -> M.Map Char Int
  >           op c freq_map = M.insertWith (+) c 1 freq_map

silentbicycle · on May 4, 2009

Could that be adapted to work for a stream, rather than a complete list in memory? That could become a constraint on the problem under reasonable circumstances. If it was a list of chars, it makes sense to just accumulate counts over the list (which is what you're doing, I think).

I'm not great at Haskell (GHC > 6.6 won't install on my main computer... http://hackage.haskell.org/trac/ghc/ticket/1346 ), and while I understand the type system etc. from using OCaml, in that, I would just do this imperatively and give it a functional interface.

I'm not just being difficult, I'm curious how you'd go about it. (Edit: Cool, thanks.)

mbrubeck · on May 4, 2009

Since Haskell lists are lazy, they aren't necessarily in memory all at once. For example, the standard "getContents" function reads a file and returns a list of characters. But since it returns a lazily-evaluated list, you can read arbitrarily large files and they will be streamed efficiently as you consume the list.

This is the same laziness that lets Haskell operate on infinite lists like [1..]

eru · on May 4, 2009

Yes, I guess you are right. I have been bitten a few times by questions of memory before, so I was cautious in my answer.

There's still a lot for me to learn about how lazyness impacts memory requirements. (Some times Haskell was lazier than I was expecting and kept thunks in memory that were bigger than their eventual result.)

silentbicycle · on May 4, 2009

While it probably wasn't clear, that's what I was wondering about when I asked about using streams. I find using lazy evaluation by default confusing sometimes, so I think about it terms of using streams or iterators. (Or are they called generators? The kind of function that, when called, returns either the next value or an end-of-stream sentinel.)

eru · on May 4, 2009

Python calls them generators. Haskell lists display somewhat similar behaviour.

Although I still have to learn a lot, lazyness by default brings more benefits than it costs IHMO.

eru · on May 4, 2009

Thanks for your question. There's nothing that prevents the general technique from working on a stream. I guess every sensible implementation of streams supports folds.

henning · on May 4, 2009

If you want to keep everything pure and also reasonably efficient, I think you need a sophisticated type system and a good compiler to back you up, e.g. Haskell and GHC.

silentbicycle · on May 4, 2009

Dude, you don't need a sophisticated type system to count the occurrences of the numbers 0 to 255.

Talk about, "Let's Take a Trivial Problem and Make it Hard".

henning · on May 4, 2009

For this trivial contrived situation, yes.

For the general problem of doing things purely where you would otherwise use destructive updates, you need a type system that lets the compiler safely and correctly figure things out about your program. Sheesh.

silentbicycle · on May 4, 2009

I agree that sometimes it helps to have the type system in non-trivial situations, but in OCaml, you could do this imperatively, keep the state change isolated to a specific function (or module), and still give it a pure functional interface. The type system would do the same checks, even. (It sounds like that's what happens with Haskell's Data.Array.Diff.)

At the same time, many programs in the real world get by without sophisticated type systems. They're a great tool, but they're also probably vastly more useful in some problem domains than others. Same with OOP, constraints and backtracking, relational databases, etc.

(That said, I would love to use an ML-like language as small & clean as Lua. Inferred, garbage-collected, recursive tagged-union types rule.)

sethg · on May 4, 2009

Ironically, I suspect that a better example of a problem that's easier to solve with mutation than with a pure functional interface would be graph reduction, which is a technique for evaluating purely functional languages.

http://en.wikipedia.org/wiki/Graph_reduction

eru · on May 4, 2009

I guess at least it will be far easier with mutations than in a strict pure functional language.

tophat02 · on May 4, 2009

I think the question posed by the author is misfocused. The right question, in my mind, is:

Will there ever be "one paradigm to rule them all"?

My guess is no. This type of simple, trivial problem screams for a structural, imperative approach. Other problems call clearly for mutable objects, others for declarative DSLs, and yet others for a pure functional approach.

Functional purists will tell you that mutable state is ALWAYS bad. OOP purists will tell you that straight-structured programming is ALWAYS bad, and structural purists will tell you that hand-optimized but difficult to understand assembly code is ALWAYS bad.

They're all wrong, of course. Pick the right tool for the right job. Now, it's true that not all programmers are skilled enough to do that, but I consider that to be a project management/source control problem, not a language problem.

sketerpot · on May 4, 2009

> Functional purists will tell you that mutable state is ALWAYS bad.

Really? The functional programming people I know tend to say that unsegregated mutable state is bad. That's why Haskell has state monads: so you can build a wall around the parts of the code which have mutable state, and present a purely functional facade to the rest of the code.

tetha · on May 4, 2009

In fact, one can argue that the one paradigm (or, the one philosophy) to rule them all is known. It is the UNIX way of doing things: have a lot of simple, little programs reading a well-defined input on on an input pipe, producing a well-defined output on the output pipe. That way, each problem can be implemented in a language fitting the problem (read this: A language which can model the input data and the output format in the most natural way) and you will end up with a lot of simple programs, all well-readable and all very beautiful.

Of course, this has a first (minor) problem: You need to know multiple languages. However, I consider the "problem" of many people not knowing multiple languages a chicken-and-egg-problem. If you just know a single programing language and never learn a second one, you will not learn how to learn programming languages. If it becomes very natural to know many languages, learning a new language becomes very, very simple (I for myself found it to be a very good way to get a good grasp on the runtime model of the language.Once you know the runtime model of a language, actually programming it is easy).

However, the worse problem is: Grasping and actually designing things in the UNIX-way is hard. In fact, hard is not the right term. It is _different_ from the way such design was taught to me. I for myself recently massively restructured a (simple) code generator into many smaller programs, communicating via pipes. At first, this felt very akward, because.. it was different? However, now that I am implementing and using this, the new version does have massive benefits. (In fact, webservices appear to be the unix-philosophy in different clothes). So I guess it is just a problem with peoples minds.

silentbicycle · on May 4, 2009

It's a good technique for trying to keep complexity from getting out of hand. Unix pipes seem similar in many ways to a kind of message-passing concurrency (like in Erlang), before it was cool. :) The OS itself keeps tabs on the processes, handles buffering, etc.

The Unix style seems to fit together relatively poorly with complex type systems, though. Some programs (compilers, in particular) need very little interaction with the outside world, and tend to require complex internal data structures. Again, no one paradigm fits everything.

derefr · on May 4, 2009

A slightly different viewpoint: a local array of 256 elements is, semantically speaking, no different than 256 separate variables. Assuming that we had 256 registers laying around to keep them in, tail-call optimizing a 256-variable recursion would just leave those values in their registers between loops. And, since the stack is just a place to hold registers we aren't paying direct attention to, we shouldn't treat the 256 variables any differently because they're on the stack instead. Thus, mutability is actually the correct tail-call optimization in this case.

mononcqc · on May 4, 2009

With Erlang, a similar effect could be achieved with the process dictionary:

  > erlang:put(1,1).
  undefined
  > erlang:get(1).
  1
  > erlang:put(1,erlang:get(1)+1).
  1
  > erlang:get(1).
  2
  > erlang:put(2,3).
  undefined
  > erlang:get().
  [{1,2},{2,3}]

It's limited to a single process and thus doesn't break the 'no-shared-memory' model. It also lets you access and modify data without copying it, so the closest implementation of such a function would need to be defined that way and would only return the final list.

Of course, this breaks the idea of 'purely functional' which is what the author found problematic.

silentbicycle · on May 4, 2009

While it's just an implementation detail, some languages (off the top of my head: Chicken Scheme and Lua) can't use that many variables in their tail-call implementations.

old-gregg · on May 4, 2009

Totals could be switched from a tuple to a tree, which might or might not be better than the setelement code, but there's no way it's in the same ballpark as the C version.

Unless I am mistaken about his suggestion, O(n log(n)) isn't quite "not in the same ballpark" as O(n), and if you replace a suggested tree with a data structure with O(1) lookup then we're precisely in the same ballpark.

calambrac · on May 4, 2009

You mean, for instance, with a destructively updateable array? A novel concept, if only he had thought of that.

old-gregg · on May 4, 2009

Medium complexity for a hash table is O(1) and when your keys are single bytes from [0..255], a good stdlib implementation shouldn't be far off from that. In other words I just found his "not in the same ballpark" remark to be an exaggeration, although I understand where he's coming from: I am learning Haskell at the moment and oftentimes I ask myself how practical the pure functional style can get.

And what's up with this passive-aggressive attitude?

calambrac · on May 4, 2009

I didn't think it was all that passive. It's frustrating to read a comment that's nitpicking the article by making the exact same point as the author, but using big-O notation as if that were the issue. Hint: it's not. The point is that whatever purely functional data structure is being used, either the whole thing is being copied on every single recursive call or some tricky behind-the-scenes magic is going on to make destructive-looking actions not actually destructive. Either way, you're nuts if you think its approaching C speed.

yason · on May 4, 2009

Bah, just take a http://c2.com/cgi/wiki?SufficientlySmartCompiler and have any unnecessary rounds of copying removed while still retaining purely function style in the source code. Right, this wasn't so hard?