
I wrote a workflow processing system (http://github.com/madhadron/bein) that's still running around the bioinformatics community in southern Switzerland, and came to the conclusion that something like make isn't actually what you want. Unfortunately, what you want varies with the task at hand. The relevant parameters are:

- The complexity of your analysis.
- How fixed your pipeline is over time.
- The size of a data set.
- How many data sets you are running the analysis on.
- How long the analysis takes to run.

If you are only doing one or two tasks, then you barely need a management tool, though if your data is huge, you probably want memoization of those steps. If your pipeline changes continuously, as it does for a scientist mucking around with new data, then you need executions of code to be objects in their own right, just like code.

Make-like systems are ideal when:

- Your analysis consists of tens of steps.
- You have only a couple of data sets that you're running a given analysis on.
- The analysis takes minutes to hours, so you need memoization.
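That memoization is, at bottom, just make's timestamp rule: skip a step whose output is newer than all of its inputs. A minimal sketch in Python (the function names are mine, not from make or any tool discussed here):

```python
import os

def is_stale(output, inputs):
    """make's rule: a step must re-run if its output is missing
    or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def run_step(output, inputs, command):
    """Memoized execution: only invoke the command when the output is stale."""
    if is_stale(output, inputs):
        command()
```

When a single step takes hours, that one mtime comparison is the difference between an overnight re-run and a no-op.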

Another Swiss project, openBIS, is ideal for big analyses that are very fixed, but will be run on large numbers of data sets. It's very regimented and provides lots of tools for curating data inputs and outputs. The system I wrote was meant for day-to-day analysis where the analysis would change with every run, was only being run on a few data sets, and the analysis took minutes to hours to run. Having written it and had a few years to think about it, there are things I would do very differently today (notably, make executions much more first class than they are, starting with an omniscient debugger integrated with memoization, which is effectively an execution browser).

So bravo for this project for making a tool that fits their needs beautifully. More people need to do this. Tools to handle the logistics of data analysis are not one size fits all, and the habits we have inherited are often not what we really want.



Heh, all the bioinformaticians come out of the woodwork :-)

Here's yet-another-project for bioinformatics workflows that I've been involved in. This one based on Groovy:

http://bpipe.org

I agree with your sentiments about the nature of pipelines vs build system a la make. Many many people start down the path of putting the classic DAG dependency analysis as the foundation of their needs when in fact, this isn't so much of a problem in real situations, and is even somewhat counterproductive because it forces you to declare a lot of things in a static way that actually aren't static at all. I've found tools like this completely break down when your data starts determining your workflow (eg: if the file is bigger than X I will break it in n parts and run them in parallel, otherwise I will continue on and do it using a different command entirely in memory).
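To make the data-determines-workflow case concrete, here is a rough Python sketch of the branch described above; the threshold and stage names are invented for illustration. A static DAG would have to declare these edges before the input size is known, which is exactly what it cannot do:

```python
# Hypothetical stage planner: the number and kind of downstream steps
# depend on the input's size, which is only known at runtime, so the
# corresponding graph edges cannot be declared statically.
SPLIT_THRESHOLD = 100_000_000  # bytes; an invented cutoff

def plan_stage(input_size, n_parts=4):
    """Return the sub-steps this particular input actually requires."""
    if input_size > SPLIT_THRESHOLD:
        # Large file: break it into parts and process them in parallel.
        return [f"process_part_{i}" for i in range(n_parts)]
    # Small file: one entirely different, in-memory command.
    return ["process_in_memory"]
```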

In my experience the problems in big data analysis are more about the complexity of managing the process, achieving as much parallelization with as little effort and craziness as possible (don't see any mention of that in Drake), documenting what actually happened when something ran so you can figure it out later, and most of all, flexibility in modifying it since it changes every day of the week.

One mistake that Drake appears to make (again, from my quick skim), is interweaving the declaration of the "stages" of the pipeline (what they do) and the dependencies between them (the order they run in). This makes your pipeline stages less reusable and the pipeline harder to maintain. Bpipe completely separates these things out, which is something I like about it.


Thanks for your feedback. We do mention parallelization in the design doc, it's just not implemented yet. It's quite easy to add though. We have a lot of features spec'ed out, but not implemented.

I would appreciate it if you elaborated on separating step definitions from dependency definitions. In my mind, they are the same thing. If you mean that steps might not be connected by an input-output relationship but still have dependencies, Drake fully supports that via tags. If you mean that steps might be connected through input-output files but not depend upon each other, frankly, I don't see how that's possible. And if you mean some other syntax which more clearly separates the two, Drake supports methods, which achieve exactly that. If you mean something else, I would love to see an example.

Thanks!


> I would appreciate it if you elaborated on separating step definitions from dependency definitions

As I said, I only very quickly skimmed since I'm busy, I might have overlooked information, and apologies in that case. But take the example from the front page:

    evergreens.csv <- contracts.csv
      grep Evergreen $INPUT > $OUTPUT
So now suppose a new requirement comes along - Evergreen is also called "Neverbrown" sometimes. It's decided the best way is to convert all references at input so nothing else gets confused downstream. So I need an extra step, now

    renamed.csv <- contracts.csv:
        sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

    evergreens.csv <- renamed.csv:
        grep Evergreen $INPUT > $OUTPUT
Adding this step forced me to modify the declaration of the original command, even though what I added had nothing to do with that command. With Bpipe, for example, you say

    extract_evergreens = { 
      exec "grep Evergreen $input > $output" 
    }

    fix_names = { 
      exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
    }
Then you define your pipeline order separately -

    run { fix_names + extract_evergreens }
If I get contracts from a different source that don't need the renaming, I can still run my old version and I'm not changing the definition of anything:

    run { extract_evergreens }
Hope this explains what I mean, and again apologies if this is all clearly explained in your docs and I just jumped to conclusions from the simple examples!


I see. Thank you very much. I think this is very cool. I can see several problems with this approach, though, and I would greatly appreciate your comments on them. After all, I don't know Bpipe.

The fundamental question is why you have to repeat the filename at all, and I did give it some thought.

1. What your example does is allow you to allocate dependencies based on position. It's pretty cool. This seems to be easily reproducible in Drake if we add a special symbol that would just mean "a temporary file" for the output, and "the last temporary output" for the input (by the way, you don't need colons):

    _ <- contracts.csv
        sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

    evergreens.csv <- _
        grep Evergreen $INPUT > $OUTPUT
or even:

    <<- contracts.csv
        sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

    evergreens.csv <<-
        grep Evergreen $INPUT > $OUTPUT
2. One of the problems, as you can see, is that it only works if you don't care about the filenames, i.e. you use a temporary file. Similarly, your Bpipe expression:

    run { fix_names + extract_evergreens }
doesn't care about filenames either. How do you add one there? What if you need this file for debugging purposes, or if it's an input to some further step down the road? In this case, you'd have to do what you want to avoid doing (i.e. modify the original step).

3. I'm even more concerned with multiple inputs and multiple outputs. As long as your workflow is simple, you can get away with a + b. But when it's more complicated, you would have to do something like:

    run { (((fix_names + extract_evergreens) * and_some_otheroutput) + some_other_step) * some_other_output }
(I used * as an operator that puts two outputs together to create an input with two files for the next command. Mathematically, + is better for that and * is for what + is used in your examples. :))

As you can see, it gets unreadable so fast that you'd want to use some sort of identifiers to specify dependencies, and you would end up with a scheme pretty much equivalent to filenames. The fact that some file might be temporary is a related but separate problem.

4. Now, even worse, I'm not quite sure how this syntax could accommodate multiple outputs. If fix_names creates several outputs, and extract_evergreens uses only one, you can't get around it without some weird syntax and specifying a numeric position. It also gets out of hand pretty quickly, and you're back to using some sort of identifiers, be it filenames or not.

5. Speaking of identifiers, you can use variables in Drake instead of filenames, so you can abstract filenames away. But it seems to me there's a more fundamental problem in play.

6. If you're concerned with coupling implementation and input and output names, Drake has methods for this:

    fix_names()
        sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

    extract_evergreens()
        grep Evergreen $INPUT > $OUTPUT

    renamed.csv <- contracts.csv [method:fix_names]
    evergreens.csv <- renamed.csv [method:extract_evergreens]
    
or even, as discussed above:

    <<- contracts.csv [method:fix_names]
    evergreens.csv <<- [method:extract_evergreens]
To summarize, I think your example is cool, but it seems to be practical only for rather simple workflows. And I can also see how Drake could easily be extended to support such syntactic sugar. For more complicated dependencies, though, I don't really see a better approach.

I would love to hear your further thoughts on the matter, and whether you'd like to see something similar to what I proposed in Drake. Or something else.

Artem.


Sorry for the late reply - I was really busy yesterday and didn't have time to do it justice.

> One of the problems, as you can see, is that it only works if you don't care about the filenames

This is a really insightful point - it touches on one of the ways Bpipe differs philosophically from other tools. Bpipe absolutely says you don't want to manage the file names. It's not that you don't care about them; rather, it takes the position that naming the files is a problem it should help you with, not a problem you should be helping it with. It enforces a systematic naming convention for files, so that every file is named automatically according to the pipeline stages it passed through. So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'. It does sometimes give you names that aren't correct by default, but it gives you easy ways to "hint" at how to produce the right name. E.g., if we want the output to end with ".txt" we write:

    fix_names = {
        exec "sed 's/Neverbrown/Evergreen/g' $input > $output.txt"
    }
Similarly, if there are a lot of inputs and you need the one ending with ".txt", you write "$input.txt"; if you want the second input ending with ".txt", you write "$input2.txt", and so on. Part of this stems from the huge number of files that you can end up dealing with. When you start having hundreds or thousands of outputs, naming them quickly goes from being something you want to do to a chore that drives you completely crazy and that you want a tool to help you with. Bpipe's names definitively tell you all the processing that was done on a file, which is extremely helpful for auditability as well.
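The convention described above is easy to state precisely: the stage identifier is spliced into the filename just before the extension. A rough sketch of the rule as described (my reconstruction, not Bpipe's actual code):

```python
import os

def stage_output_name(input_name, stage):
    """Insert the stage identifier before the extension,
    e.g. input.csv -> input.fix_names.csv."""
    base, ext = os.path.splitext(input_name)
    return f"{base}.{stage}{ext}"
```

Chaining stages simply keeps appending, so feeding "input.fix_names.csv" through an "extract_evergreens" stage yields "input.fix_names.extract_evergreens.csv".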

> I'm even more concerned with multiple inputs and multiple outputs

As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want. The commands you write imply what files you need, and Bpipe searches backwards through the pipeline to find the most recent output files that satisfy those needs. Multiple outputs are similar ...

    fix_names = {
        exec "sed 's/Neverbrown/Evergreen/g' $input > $output1.txt 2> $output2.txt"
    }
If you need to reach further back in the pipeline to find inputs there are more advanced ways to do it, but this works for 80% of your cases (the whole idea of a pipeline is that each stage usually processes the outputs from the previous one - so this is what Bpipe is optimized to give you by default).

> I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar.

It depends what you mean by "simple". I use it for fairly complicated things - 20-30 stages joined together with 3 or 4 levels of nested parallelism. It seems to work OK. I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.

Thanks for the great discussion!


Another problem with BPipe's approach is that if you change a method's name, you invalidate the existing files. This can be a problem during development, when re-running steps is expensive.


I wouldn't argue with that. One could also say it's a good thing though ... if you're modifying the method enough that you need to rename it, those original files might not be valid any more, so it's a good thing if the tool wants to recreate them.


I think that would be a stretch to say - in my opinion, it's not a good thing, for the sole reason that it doesn't let you opt out of it. And it's not one of those cases where you would want to enforce discipline, because renaming a method has nothing to do with its contents.


Thank you very much for your response.

Actually, I don't think there are any philosophical differences, and I'll try to make my case.

> Bpipe absolutely says you don't want to manage the file names.

I think this is too strong a statement as I try to show below.

> So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'.

fix_names is the identifier in this case. There's really not much of a difference between using identifiers to come up with filenames and using filenames to come up with identifiers. If anything, I think filenames are preferable, because the user doesn't have to be aware of the scheme the tool uses to convert identifiers to filenames. The fact that identifiers are just a little bit shorter (e.g. don't have a .txt extension or something) does not outweigh the inconvenience of having to figure out where the files are. The problem with this approach is that figuring out where the files are requires knowledge of the tool's inner workings, which can only be acquired from reading the code or documentation.

Another problem with these naming conventions is that if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them? Or is the only way to handle it to copy and paste the code and create another rule?

It seems like an insufficiently clear separation between the code and the filenames can be a source of problems... Please correct me if I'm wrong.

When I compare:

    _ <- contracts.csv
        sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

    evergreens.csv <- _
        grep Evergreen $INPUT > $OUTPUT
with

    contracts:
         sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT.csv

    evergreens:
         grep Evergreen $INPUT.csv > $OUTPUT.csv

    contracts + evergreens
I strongly prefer the first option, because there are fewer implicit things going on, and the code is separated more clearly from the file naming. Besides, it's even shorter.

> Similarly if there are a lot of inputs and you need the one ending with ".txt" you will write "$input.txt", if you want the second input ending with ".txt" you will write "$input2.txt", and so on.

This can work for very simple workflows with maybe several cases of multiple inputs and outputs, but it's unmanageable when complexity grows.

Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step. You can't use numbers to resolve that. You will end up coming up with some sort of semantic identifiers, which will almost completely replace BPipe's naming convention. And what's worse, they will be hard-coded in your step's commands, which means you'll have to edit the code if you want to change the filenames, or re-use this step's implementation somewhere else.

> When you start having hundreds or thousands of outputs naming them quickly goes from being something you want to do to a chore that drives you completely crazy and you want a tool to help you with.

I'm not sure I agree here. Here's how I see it:

Instead of naming hundreds of files, you have to name hundreds of methods (commands). Yes, you don't have to repeat the filenames to create dependencies, but you have to repeat the method names (in "contracts + evergreens"), and in a way which quickly breaches the boundaries of readability.

This doesn't work for complicated workflows, and for simple ones, I would prefer positional linking rather than coming up with names, as in the example I provided above.

There's nothing that prevents Drake from coming up with filenames from more abstract identifiers. We could come up with some syntax where you'd just give an identifier (say, "~contracts"), and we'll take care of the file location and name, just like BPipe does. The major difference is not this. The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph.

I think I provided at least a few strong arguments why BPipe is wrong on this one. I would really love to hear your further thoughts.

> As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want.

I'm sorry, I understood neither this nor the example you provided. Could you please elaborate? In the example, you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that generated them?

> I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.

I appreciate your opinion. But the way I see it is this:

1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.

2) And for simple cases, it all comes down to syntactic sugar.

I understand it's hard to argue in the abstract, so I'll tell you what. Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake. I might need to invent some Drake features on the fly, but that's a good thing. This is what these discussions are for. I'll try to show you that there's no philosophical difference, and that Drake has a more flexible approach overall. I am looking forward to this challenge, because your opinion is important to me.

Thank you!

Artem.


Hey, just want to say thanks for the great discussion again. I'm a bit humbled at the length & depth of thought you're putting into it.

> The problem with this approach is that figuring out where the files are requires knowledge of the tool's inner workings, which can only be acquired from reading the code or documentation

I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result. I execute

    ls -lt *.csv
And I see my result at the top. There's really not a huge inconvenience in trying to find the output. Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case. I suspect we're using these tools in very different contexts and that's why we feel differently about this. It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?) You can specify the output file exactly with Bpipe, it's just not something you generally want to do. There's nothing wrong with either one - right tool for the job always wins!

> if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them

It just keeps appending the identifiers:

   run { fix_names + fix_names + fix_names }
will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed three times. One problem is you do end up with huge file names - by the time it gets through 10 stages it's not uncommon to have gigantic 200-character file names. But after getting used to that I actually like the explicitness of it.

> Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step

Absolutely - you can get situations like this. We're sort of into the 20% of cases that need more advanced syntax (eventually we'll explore all of Bpipe's functions this way :-) ). But basically Bpipe gives you a query language that lets you "glob" the results of the pipeline output tree (not the files in the directory) to find input files. So to get files from specific stages you could write:

    from(".xls", ".fix_names.csv", ".extract_evergreens.csv") {
        exec "combine_stuff.py $input.xls $input1.csv $input2.csv"
    }
It doesn't solve everything, but I guess the idea is: make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible"). And when you really get in trouble, it's actually Groovy code, so you can write any programmatic logic you like to find and figure out the inputs if you really need to.

> Instead of naming hundreds of files, you have to name hundreds of methods (commands)

Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.

> The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph

Again, a really insightful comment, but I'd take it further (and this goes back to my very first comment). Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution. We don't actually know the graph until the pipeline has finished. An individual pipeline stage can use if/then logic at runtime to decide whether to use a certain input or a different input, and that will change the dependency graph.

You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it. By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility. So that's a tradeoff Bpipe makes (and there are downsides; it's just that in the contexts where Bpipe shines, the tradeoff is worth it).

> In the example you provided you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that generated them

I think the "from" example above probably illustrates it. The simplest method is positional, but it doesn't have to be, you can filter with glob style matching to get inputs as well so if you need to pick out one then you just do so.

> 1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.

I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool. I guess I would say that pipeline tools live at a level of abstraction where they aren't meant to get that complicated.

> 2) And for simple cases, it all comes down to syntactic sugar.

I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.

> Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake

I wouldn't mind doing that - I'll need to look around and find an example I can share that would make sense (what I do is very domain specific - unless you have familiarity with bioinformatics it will probably be very hard to understand). I'll pm you when I manage to do this, but it may take me a little while (apologies).

Thanks as always for the interesting discussion. I think this is a fascinating space, not least because there have been so many attempts at it - I would say there are probably dozens of tools like this going back over 20 years or so - and it seems like nobody has ever nailed it. Bpipe has problems, but so does every tool I've ever tried (I'm probably up to my 8th one or so now!).


...continued from part1. read part1 first!...

> It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible").

My contention is that while BPipe makes simple things easy and hard things possible, Drake makes both easy and possible. I think I've made some points in that regard, and given you examples of Drake code that is just as easy to write as the corresponding BPipe code, without compromising on functionality. But to really conclusively prove this, I'm looking forward to more BPipe examples. So far, I haven't seen anything that is simpler (or even shorter) in Bpipe.

> Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.

When I first read it, I thought this was a great point and you were onto something. But as I thought about it more, I realized that it only seems this way.

Here's the thing: if you have 15 stages but hundreds of files, it can mean only two things:

1) The vast majority of those files are leaf files, that is - they are either inputs (with pre-determined names) or outputs, whose names you don't really care about (surprisingly). Drake can generate filenames for leaf output files with ease, as they don't affect the dependency graph.

2) The vast majority of those files are not leaves, but it means that the steps either:

2a) pass dozens of inputs and outputs to each other, and you have to either give them identifiers (as described above, Drake can do that too) or use positions (unmanageable).

2b) even worse, have a big and complicated dependency graph with many more than 14 edges, in which case your syntax of { a + b + c } will almost certainly be inadequate to describe such a complex thing (15 vertices and several dozen edges).

So, any way you look at it, Drake can do the same thing in the same way or better. Am I missing something?

> Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution.

I don't understand this. I'm afraid it doesn't work this way. You can't have the graph be a runtime product of the execution (i.e. available only after the execution), because that cripples your ability to do partial evaluation of targets. That is, you have to have the dependency graph before you can even answer the question "is target A up-to-date?". If you need to run the workflow to arrive at that answer, there's no guarantee how much time it will take. I also believe it unnecessarily blurs the distinction between the commands and the workflow. If your code needs to care about its dependencies, it can't be used out of context. So, maybe an example?
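To make the partial-evaluation point concrete: with an explicit graph, "is target A up-to-date?" is answerable by pure traversal, without running anything. A minimal Python sketch, assuming a dict from each target to its direct dependencies and a freshness predicate (both names are mine, not Drake's):

```python
def needs_rebuild(target, deps, fresh):
    """deps: dict mapping each target to its direct dependencies (the DAG).
    fresh(t): True if t exists and is newer than its direct dependencies.
    A target must be rebuilt if anything beneath it in the graph must be,
    or if it is stale itself -- decided by traversal alone, with no step
    ever executed."""
    if any(needs_rebuild(d, deps, fresh) for d in deps.get(target, [])):
        return True
    return not fresh(target)
```

With deps = {"C": ["B"], "B": ["A"]}, asking about C touches only the graph and filesystem metadata, never the commands themselves.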

But if all you need to do is re-run everything every time, then it means you're really doing something trivial, and it also raises the question of why we need a tool like BPipe in the first place.

> An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph.

I don't see how it could work this way. Could you please give me an example along with the explanation of how BPipe will handle it on the control level?

> You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it.

I'm confused, I think nothing could be further from the truth. The dependency graph specifies what steps depend on what steps. If you don't know it, you don't even know how to start evaluating the workflow, because you don't know which step to build first. I don't understand this statement at all. Could you please elaborate or give me an example?

> By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility.

I need to see an example of this.

> I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool.

I don't think having 3 inputs is a very complicated case. And neither is having any dependency graph which is not a linear step1, step2, step3. My point is as soon as you get any of those, BPipe starts to slowly evolve into Drake, with some very weird syntax and inconsistencies (like having "implicit" dependencies in steps' implementations but having to also specify some or all of the dependencies in the "run" statement).

It's possible that I'm misunderstanding BPipe. Maybe some more examples would fix this.

> I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.

I don't really see them. And you can't just disagree, you have to provide arguments. :) I understand you can see it differently, but it seems like so far, there could be a Drake workflow for every BPipe example, which uses the same ideas and is equally easy to write (but not necessarily the reverse). This means it all comes down to syntax, no?

Again, I might be misunderstanding BPipe.

I think it's really, really hard to argue abstract concepts. I would very much appreciate some examples. It doesn't even have to be your favorite workflow. Just give me anything. Write something and ask "how would you put this in Drake?". I think my response would make it clear whether the differences are syntactic or philosophical. We've already established that there are some things BPipe cannot do as well as Drake can. I'd like to see whether the reverse is true. Because in that case we could really identify philosophical differences; but if it's the opposite - i.e. Drake can do everything BPipe can with the same ease - then it's not a question of philosophy any more but of design.

I'm not trying to attack BPipe. I just want to make the best tool possible, and if we make compromises, I want to make sure they are informed. We must consciously choose some things not to be as easy or possible in Drake for some other greater good. So far, I can't identify any of those things.

Show me. :)

Artem.

P.S. You don't have to give a real-world example. I think that would actually unnecessarily restrain and slow you down. Just demonstrate a basic concept, a feature; name your steps A, B, C - I don't care what they do. Only if it's something extremely exotic might I ask if there's a real-world use-case for it, but I think I can come up with use-cases for pretty much anything. :)

P.P.S. Please include what you do to run the workflow in your examples. I suspect I might have misconceptions about what the "run" statement does and how Bpipe resolves dependencies.

P.P.P.S. I appreciate the dialog as well. Especially since BPipe is your 8th tool. I would like Drake to be your 9th, and better than anything you used before, including Bpipe.


I'm sorry I don't have time to answer in full. I'm just going to respond to this one point because I think it's pretty fundamental and perhaps explaining it will clear up other things!

    > The dependency graph specifies what
    > steps depend on what steps. If you don't
    > know it, you don't even know how to
    > start evaluating the workflow, because
    > you don't know which step to build
    > first. I don't understand this statement
    > at all. Could you please elaborate or
    > give me an example?
I can see this is really, really hard to grok if you're basing everything on the idea of a DAG, and so many tools are based on it that it's very natural to think you couldn't do it any other way. Think of it as imperative vs declarative if you like. In Bpipe the user declares the pipeline order explicitly (as you've seen) - so that's the first part of the answer to your question. Bpipe knows which part to execute first because the user said so explicitly. But this isn't used for figuring out dependencies - dependencies arise as actual commands are executed. Back to our famous example:

    fix_names = {
      exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
    }

    extract_evergreen = ...

    run { fix_names + extract_evergreen }
We run it like this:

    bpipe run pipeline.groovy input.csv
If you run it once, Bpipe builds input.fix_names.csv. If you run it twice, Bpipe is clever enough not to build input.fix_names.csv again! How is that if it doesn't know about the dependency graph?! Well, it does it "just in time". It executes the "fix_names" pipeline stage (or "method") and that calls the "exec" command. The "exec" command sees that all the inputs referenced ($input variables) are older than the outputs referenced ($output variables). So it knows it doesn't have to rebuild those outputs, and skips executing the command.

So what about transitive dependencies? If C depends on B which depends on A, (so dependencies are A => B => C) what happens if you delete file B? Technically you don't need to build C because it's still newer than A, but Bpipe can't see it any more. Well, Bpipe knows this too because it keeps a detailed manifest of all the files created. So when the call to create B is executed it can see that although B was deleted, it did exist and in its last known state was newer than its input files, so there's no need to rebuild it, as long as downstream dependencies are OK.
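The "just in time" skip logic described above can be sketched roughly like this (a hypothetical illustration in Python, not Bpipe's actual code; the manifest format and function names are invented):

```python
import os
import time

def up_to_date(inputs, output, manifest):
    """Decide 'just in time' whether an output must be rebuilt.

    `manifest` maps output paths to the mtime they had when last built,
    so a deleted intermediate can still be recognized as up to date.
    """
    newest_input = max(os.path.getmtime(p) for p in inputs)
    if os.path.exists(output):
        return os.path.getmtime(output) >= newest_input
    # Output is missing, but if the manifest says it once existed and
    # was newer than all inputs, we can still skip rebuilding it.
    recorded = manifest.get(output)
    return recorded is not None and recorded >= newest_input

def run_stage(cmd, inputs, output, manifest):
    """Execute one pipeline stage, skipping it if its output is fresh."""
    if up_to_date(inputs, output, manifest):
        print("skip:", output)
        return
    print("run:", cmd)
    # ... actually execute cmd here ...
    manifest[output] = time.time()  # record the (re)build in the manifest
```

The key point the sketch tries to capture: no graph is consulted up front; each stage checks timestamps (and the manifest, for deleted files) only at the moment it is reached.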

So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it. This is one of those things that conventional tools solve which isn't actually that important (in my uses) but which occasionally is very annoying - I actually want to control the order of things sometimes. I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies. Usually it's pretty obvious what order things should run in, and there are other externalities that influence how I like to do it ("I know this part uses a lot of i/o so try to do it in parallel with another bit that's mainly using CPU", or "Let's run this part last because it will be after hours and the other jobs will have finished"). Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.


> I'm sorry I don't have time to answer in full.

We're not getting anywhere. Just give me goddamn examples! :) Please! Examples!

> I can see this is really really hard to grok if you're basing everything on the idea of a DAG, and so many tools are that it's very natural to think you couldn't do it any other way.

There is no other way. BPipe is based on the idea of a DAG. You just don't see it.

> In Bpipe the user declares the pipeline order explicitly.

And this is a big mistake. The reason is simple - explicit order is very hard to manage once you have multiple inputs and outputs, and as a consequence, complicated (instead of linear) dependency relationships.

What you don't seem to realize is that by "declaring the pipeline order explicitly" you create a dependency graph. It's a part of your workflow definition. Your workflow contains the full definition of the dependency graph. Even if it didn't, you would still use it. There is no other way.

This is what I meant when I said - you create your dependency graph in "run". And this is a bad idea.

> dependencies arise as actual commands are executed.

What does it mean exactly? That the first command will somehow tell Bpipe what to run next? If not, then I don't understand this statement at all.

> How is that if it doesn't know about the dependency graph?! Well, it does it "just in time".

It does not matter if you calculate the dependency graph before you run the first command, or as you run the commands. It makes absolutely no difference. The only difference is whether it is computable or not. If you say it's not computable until run-time, please elaborate on that.

> So in this way Bpipe handles dependencies for you.

So far I see that this is very standard and doesn't differ in any way from what Drake or any other tool does. The only thing that differs, and I am repeating myself, is how you define your dependency graph - through input and outputs, or in "run". So far it seems that "run" is quite unfortunate. But please give me examples.

> So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it.

This is a meaningless statement. Drake also executes steps in the order you tell it. The only difference is how you tell it. In Drake, you tell it through specifying a list of steps each step depends on individually (once again, it doesn't matter that filenames are used for that - Drake also supports tags, or it could be some other identifiers). In Bpipe, you tell it in "run", collectively and sequentially. Drake's way supports the whole variety of graphs, while Bpipe's way - only a very limited subset. And for this limited subset, Drake can give you (I think) a syntax just as good if not better than Bpipe's. If you don't quite understand what I'm talking about, give me an example, and I will demonstrate.

> I actually want to control the order of things sometimes.

This is fine, the only question is how. You say Bpipe's way is convenient. I say give me an example and I'll show you that Drake's way is not any less convenient. I'm sorry to keep repeating myself, I thought I stressed the importance of examples quite a bit in my previous email and I want to stress it again. Examples, please!

> I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies.

This statement is self-contradictory. You don't seem to realize that by telling it "do this first, then that" you are defining dependencies. It's fine, and it's OK, and it can be convenient, but you can't say regardless of them.

Again - give me examples! Our conversation is becoming useless without examples.

You did not, but I'll just grab whatever you threw my way:

    fix_names = {
      exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
    }

    extract_evergreen = ...

    run { fix_names + extract_evergreen }

    $ bpipe run pipeline.groovy input.csv
Drake can support this perfectly:

    _ <- $[in]
      exec "sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT"

    $[out] <- _
      ........

    $ drake -v out=pipeline.groovy,in=input.csv
Isn't that much nicer? What disadvantages do you see?

Tell me what it is that you would like to do with this script, and I'll tell you a better way to do it in Drake. Is it multiple versions of run that you want to have? Easy. Are you concerned about inserting a step in the middle? Trivial. Tell me why Drake's code is worse, and I'll listen. So far it seems like it's better because it's shorter and more flexible at the same time.

> Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.

What exactly are you losing?

I am sorry if I sound irritated. I am. I've just been begging for examples, and you keep talking in the abstract, and it would be fine, but you're making a lot of mistakes. So, instead of looking at concrete things that would make my point apparent to you (or the opposite, prove that I'm wrong), I keep pointing to flaws in your reasoning, which, frankly, is irrelevant. One picture is worth a thousand words.

I really want your feedback. But please give me examples.


> There is no other way. BPipe is based on the idea of a DAG. You just don't see it.

So if you think Bpipe uses a DAG, then I wonder how you would think it deals with:

  run { fix_names + fix_names + fix_names }
In terms of the pipeline stages that run this is cyclic, so it cannot be a DAG. On the other hand the files created do usually form a DAG dependency relationship, but even there, in the most general case, it's not at all impossible in an imperative pipeline to read a file in and write the same file out again in modified form (or more likely, to modify it in place), so the file depends on itself - another non-DAG relationship.

I'm sure you'll object to this in a purist sense, and tell me it is a horribly broken idea, but as a practising bioinformatician, when I have a 10TB file and modifying it in place will save me hours and huge amounts of space, I'm much more interested in getting my job done than being pure about things.

I think you're right that we're at diminishing returns here, and I'm sorry I've frustrated you. We're trying to bite off more than we can chew in a forum like this.

I wish you all the best with Drake and I'll definitely check it out down the track (when it supports parallelism, since that's too important to me right now). For now, though, I don't intend to read / respond to any more replies in this thread.


This is not a cyclic dependency graph!!! This is syntax for copying vertices, nothing else. It creates a DAG of three vertices and two edges, but uses only one step definition to do so. It automatically replicates the step definition as needed. It would be extremely easy to reproduce in Drake:

  fix_names()
     ...

  _ <- $[in]  [method:fix_names]
  _ <- _      [method:fix_names]
  $[out] <- _ [method:fix_names]
Is there any difference between Bpipe's version and Drake's version that I am failing to see?
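To make the "copying vertices" point concrete, here's a toy sketch (hypothetical Python, not Drake or Bpipe code): reusing one stage definition three times in sequence produces a linear chain of distinct vertices, not a cycle.

```python
def expand(pipeline):
    """Turn a sequential list of stage names into DAG edges
    between distinct step instances. Reusing a name just copies
    the vertex; no cycle is ever created."""
    instances = ["%s#%d" % (name, i) for i, name in enumerate(pipeline)]
    edges = list(zip(instances, instances[1:]))
    return instances, edges

# One definition, used three times: three vertices, two edges - a DAG.
nodes, edges = expand(["fix_names", "fix_names", "fix_names"])
```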

> I'm much more interested in getting my job done than being pure about things.

It's funny coming from someone I have been BEGGING for examples, only to get abstract philosophical reasoning in return.

I repeat. Give me an example. So far you haven't given me one example of what Bpipe can do that Drake couldn't do in the same way or better, and yet you continue claiming philosophical differences.

If we concentrate on examples and discuss how they would work, whether there are differences, and what these differences are, I guarantee you, we'll make progress. But then again, I'm repeating myself.

Artem.


part1

>> The problem with this approach is because figuring out where the files are requires knowledge of the tool inner workings, that can only be acquired from reading the code or documentation

> I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result.

It's a good point and, I guess, I didn't mean it's a major issue. Just something which is, I believe, less than an ideal design, because it spreads (de-centralizes) information. For example, if you want to do something with the files outside of your workflow in some shell script, this shell script would contain a filename which a reader would have no idea how you came up with. Again, it's not something to obsess over, just an observation.

> There's really not a huge inconvenience in trying to find the output.

Even if so (highly doubtful in the case of, as you say, 200-character filenames), this reasoning only applies to interactive sessions.

> Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case.

It's only true if you have to type less. I'm trying to make a case that you don't have to sacrifice clarity to achieve the same result. I'm trying to show you can win without losing.

> I suspect we're using these tools in very different contexts and that's why we feel differently about this.

That might be true, but we were also trying to come up with a universal tool. That is, we are willing to make sacrifices if not making them means severely limiting the scope of usage. But again, I am trying to show you don't even have to make sacrifices.

> It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?)

Sometimes, yes; sometimes only for debugging; sometimes only for convenience. But more importantly, I'm arguing that using filenames is just a better way to build the dependency graph, regardless of whether you write them yourself or use some identifiers that result in automatic filename generation. Remember I said Drake could easily do that? The core issue here is not filenames. It's which is the better (easier to read, less to type, easier to understand) way to define the dependency graph.

> You can specify the output file exactly with Bpipe, it's just not something you generally want to do.

Again, it's not the point. If you start specifying filenames exactly with Bpipe (I'm assuming you mean in commands themselves), you would just end up with a very strange beast: you'd essentially have to define the dependency graph twice, once indirectly, and once directly. Or at least define different dependencies in different ways. It seems like this would just be a total mess. But I'm trying to show that even if you don't want to care about filenames, Drake's approach is better.

> There's nothing wrong with either one - right tool for the job always wins!

My feeling so far was that it's not like a comparison of C and Python, but rather like a comparison of C and C++. There's absolutely nothing that you can't do in C++ better or at least as well as in C. Of course, I might be wrong, and that's why I would love to see an example workflow which I would then put in Drake and we'll be able to objectively compare.

> It just keeps appending the identifiers: will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed 3 times.

First, I don't want to process the file 3 times - I didn't mean call the same method 3 times, I meant use the same code in different parts of the workflow. For example, you have a method to convert data from CSV to JSON, and you use it a dozen times all over the workflow.

Secondly, I think this is pretty bad. The way you described it, it makes filenames situational - i.e. depending on what part of the workflow they're in. Removing one fix_names from the chain could invalidate other fix_names's inputs and outputs, or worse - not invalidate the timestamps, but make such a huge mess, the user won't even know what hit him. Editing the workflow should not require such careful consideration for the tool's inner workings. And if you can afford to re-run the whole thing every time you add or delete the step, you're working on something very, very simple.

> One problem is you do end up with huge file names - by the time it gets though 10 stages it's not uncommon to have gigantic 200 character file names.

I apologize I didn't even realize the filenames carry all their creation history - I thought it was only the case with repeating names. I don't want to be harsh, but I think it's beyond bad. It means any change to the workflow can invalidate everything. This makes BPipe unusable for anything even remotely expensive. Please correct me if I'm wrong.

> Absolutely - you can get situations like this.

This is actually pretty common.

    from(".xls", ".fix_names.csv", ".extract_evergreens.csv") {
      exec "combine_stuff.py $input.xls $input1.csv $input2.csv"
    }

I'm sorry, I tried but I didn't understand this code. Could you please elaborate? What do you mean "glob"? The way I see it, you may glob all you want, but there are just two ways to resolve this: use positional numbers or use some sort of identifiers. If you use positional numbers, it becomes unmanageable. And if you use identifiers, we're back where we started. It doesn't matter if they're filenames or not; what matters is that once you start using identifiers, you can generate the dependency graph yourself, from identifiers. In other words, you've arrived at Drake's model.

...continued in part2...


Thank you very much. We're really looking forward to other people using this tool.

You raise some interesting points (for example, frequently changing code), which we ran into as well. Our current approach to it is not as fundamental: it basically includes the ability to force a rebuild of any target and everything downstream of it, and you can also add your binaries as a step's dependency.

I'm sure as we and other people use the tool, we'll have better ideas. For example, Drake could automatically sense that the step's definition has changed and offer to rebuild or dismiss.

Other points you raised are also definitely worth thinking about.


I'm also a developer of a workflow processing system, though not open-source, and fairly specific to our company. A few more things that are desirable if you have a lot of data or need to do processing that takes a lot of time are the ability to run stages in parallel, and also to distribute the computation over a cluster of machines.


Great points.

Drake supports the ability to run stages in parallel (at least in theory) - it's been spec'd out (https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...), just not implemented yet. But of course, once you have the entire dependency graph, it's easy to know what can be run in parallel and what cannot.
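The point about deriving parallelism from the full graph can be sketched with a standard topological "wave" scheduler (a generic Kahn's-algorithm sketch, not Drake's actual implementation):

```python
from collections import defaultdict

def parallel_waves(edges, nodes):
    """Group steps into 'waves': every step in a wave has all of its
    dependencies satisfied by earlier waves, so each wave can run
    entirely in parallel."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for a, b in edges:            # edge (a, b): a must run before b
        indegree[b] += 1
        children[a].append(b)
    waves = []
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        waves.append(ready)       # everything here is runnable right now
        nxt = []
        for n in ready:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        ready = nxt
    return waves
```

For a diamond-shaped graph A => {B, C} => D, this yields the waves [A], [B, C], [D] - i.e. B and C are the steps that can safely run in parallel.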

As for distributing computations, our approach is that it lies outside of Drake's scope. Drake doesn't know what's going on in steps. But you can always implement a step that would use distributed computation, for example, by submitting a Hadoop job, or in any other way. The only requirement Drake has is for the step to be synchronous, i.e. do not return before all the computation is complete. But even that can be changed for some cases.



