Hey, thanks for trying out our tool! First of all, --version shouldn't try to ru...

danpalmer · on Jan 25, 2013

Thanks for the detailed and well explained reply.

I have limited experience with Clojure, but it does seem to be a good match for this sort of task due to its structure. However, the JVM seems to be a real drawback to me. Perhaps with something like Scheme or Lisp you might get a similar program structure, and be able to compile to faster binaries?

The REPL is a solution, but as many developers are using tools like make with many other tools in the shell, running a REPL like that would prevent them from using other things efficiently. Ultimately I think the overhead time needs to be removed.

If it takes far longer than something like make, that's not necessarily an issue. The key point is making it fast from the user's perspective. As long as it runs in a fraction of a second, I can't see much of a difference between 0.1s and 0.0001s, so I don't think that sort of difference really matters. It's when it gets over 1s that it becomes an issue.
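To put a number on that threshold, here is a minimal sketch in Python that times how long a command takes to start and exit, then checks it against the 1s mark. The function names and the choice of taking the best of three runs are assumptions for illustration. The command being timed is just the Python interpreter as a stand-in; swap in whatever tool you actually care about.

```python
import subprocess
import sys
import time


def startup_seconds(cmd, runs=3):
    """Return the best wall-clock time (in seconds) over `runs` launches of cmd."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output; we only care how long launch-to-exit takes.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        best = min(best, time.perf_counter() - start)
    return best


def feels_slow(seconds, threshold=1.0):
    """Apply the rule of thumb above: anything over ~1s is noticeable."""
    return seconds > threshold


if __name__ == "__main__":
    # Stand-in command (assumption): a Python process that does nothing.
    cmd = [sys.executable, "-c", "pass"]
    elapsed = startup_seconds(cmd)
    print(f"{elapsed:.3f}s, feels slow: {feels_slow(elapsed)}")
```

Taking the best of several runs filters out one-off noise such as a cold disk cache, which is closer to what a user experiences on repeated invocations.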

Running something like Nailgun in the background may be a good solution; I don't have any experience with it. But if it requires starting a daemon in the background, that could get in the way of using the tool in a normal way.

I don't really know what the best solution to this problem is. I'm not sure Clojure is the best tool for the job.

aboytsov · on Jan 25, 2013

I can certainly see your point about using Drake in an automated environment where this delay would still matter, but running a daemon is not practical. I think you have a lot of good arguments against the JVM. There were some moments when I thought it might not have been the best choice as well - for example, the Java world is notoriously poor at dealing with child processes.

So, I agree, but there are several arguments that it's not that bad after all:

- Drake is fundamentally an interactive tool. If you run it as a part of an automated process, all its flexibility is not quite needed. You could have Drake print a list of all shell commands it would execute, and save it to get your automated script.

- Most data workflows Drake is good for are quite expensive. Minutes, sometimes hours. Definitely much more than 5 seconds. The reason is simple - if your workflow takes so little time, you're really not gaining much by using a complicated tool like Drake, instead of just putting it all in a linear shell script, and simply re-running everything every time you need it.

- Maybe we'll find a good solution like Nailgun or Drip.

- Maybe someone will make a Java-code compiler that would create a stand-alone executable out of a JAR.

- Maybe Sun will eliminate JVM startup overhead. Or somebody will release a 3rd party JVM without it.

- Maybe we'll have a compiled version of Clojure one day.

- Other maybes. :)
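The point about expensive workflows is easy to check with back-of-the-envelope arithmetic. This is a rough sketch in Python: the 5-second startup figure comes from the discussion above, while the function name and the example workflow durations are assumptions picked for illustration.

```python
def overhead_share(startup_s, work_s):
    """Fraction of total wall-clock time spent on startup rather than work."""
    return startup_s / (startup_s + work_s)


STARTUP_S = 5  # startup delay discussed in this thread

# Hypothetical workflow durations: a few seconds, a few minutes, two hours.
for label, work_s in [("5 seconds", 5), ("5 minutes", 300), ("2 hours", 7200)]:
    share = overhead_share(STARTUP_S, work_s)
    print(f"{label:>10} of work: startup is {share:.2%} of the run")
```

For a workflow that only does a few seconds of work the startup cost dominates, while for one that runs for hours it is lost in the noise.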

We certainly would support any effort to port Drake to Lisp, C++, Ruby, Python or any language you desire. Porting it to Common Lisp might not be that much easier than porting it to Ruby. We might not consider it ourselves, since the effort would be quite substantial.

Does it sound reasonable to you?

wink · on Jan 25, 2013

I would say if a startup overhead time of < 10 seconds bothers you, you're not working with "data". Of course sed and grep have less overhead, but I wouldn't even think of trying out a new tool for files/datasets larger than, say, a gigabyte. (Rough guess, I know you can use grep and sed in under 10 seconds for larger files, the point is about perspective and complexity.)

Clojure is sadly a really bad choice for fire-and-forget cli scripts, but "large scale data processing" doesn't fit this criterion for me.

danpalmer · on Jan 26, 2013

I'm mostly going to use this for parsing XML into some other formats and getting it into SQLite databases, I think. The reason I would like to use Drake over 'raw' Python scripts is that it supports a lot of the mundane stuff that goes around the actual processing of the data, and I want to automate the processes.

I typically deal with sub-100MB XML documents, so processing them takes very little time, but having the quick iteration of changing the format and re-outputting is a key part of the development cycle for me, and I think very useful when you are experimenting with new data and seeing how it could be used. Doing quick transforms is awesome.
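For context, the kind of quick transform described here might look something like the following Python sketch. Everything specific in it is an assumption for illustration: the `catalog`/`item` XML layout, the `items` table, and the `load_items` name are all hypothetical, and a real document would need its own schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for a real XML file.
SAMPLE_XML = """<catalog>
  <item id="1"><name>alpha</name><price>1.50</price></item>
  <item id="2"><name>beta</name><price>2.25</price></item>
</catalog>"""


def load_items(xml_text, conn):
    """Parse <item> elements and upsert them into an SQLite table."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    rows = [
        (int(item.get("id")), item.findtext("name"), float(item.findtext("price")))
        for item in root.iter("item")
    ]
    # INSERT OR REPLACE keeps re-runs idempotent while iterating on the format.
    conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
count = load_items(SAMPLE_XML, conn)
print(count, conn.execute("SELECT id, name, price FROM items ORDER BY id").fetchall())
```

A script like this runs in well under a second on small inputs, which is why a multi-second tool startup is so noticeable in this kind of edit-and-rerun loop.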

aboytsov · on Jan 26, 2013

Drip now works with Drake! Yes, it's still less than ideal if you're calling Drake hundreds of times from an automated script which you need to run quickly, but for interactive development, it should work just fine:

https://github.com/Factual/drake/wiki/Faster-startup:-Drake-...

aboytsov · on Jan 25, 2013

It's a good point, and I agree it might not be the top priority, but I also understand the frustration. I, too, find the 5s startup time rather irritating, especially when I make errors in the workflow file or don't specify targets correctly. So, we are in search of ideas on how to fix it.

abraininavat · on Jan 25, 2013

Did you consider ClojureScript and V8? Are there downsides to ClojureScript that would lead you to use Clojure instead?

aboytsov · on Jan 25, 2013

To be honest with you, no, we didn't seriously consider it. Maybe we should have. I do not know if ClojureScript would be able to work with all the dependencies we have (for example, Hadoop client library to talk to HDFS). But it's a good point nevertheless. I'll mention it in https://github.com/Factual/drake/issues/1.

Thanks!

danpalmer · on Jan 26, 2013

I didn't realise originally that Drake integrated with HDFS. That's a really awesome feature, and I can see why the JVM made sense in development because of existing HDFS libraries.

abraininavat · on Jan 26, 2013

Thanks for the response! I ask because I have an idea for a CLI program, and I want to write it in Clojure, but I'm worried about the startup time of the JVM. As I understand it, this issue is mitigated in Drake by the fact that a typical job will crunch lots of data and therefore take lots of time. That's not the case for my program; it needs to be quick.

aboytsov · on Jan 26, 2013

Yes, startup times are a pain. As of this morning, Drake now works with Drip, which is a nifty tool to bring down startup times. It spins up "backup" JVMs, so the next time you run the command, a JVM is ready. It works great for interactive environments where at least several seconds pass between runs, but won't do much if you need to run Drake several times per second from an automated script.
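The general "backup process" idea can be illustrated with a small Python sketch. This is not Drip's actual implementation, just the pattern as described above: keep one already-started process waiting, hand it the next job, and immediately start a replacement. The `WarmSpare` class and the toy worker command (a Python process that upper-cases its input) are assumptions for illustration; Drip does this with JVMs.

```python
import subprocess
import sys


class WarmSpare:
    """Keep one already-started worker process waiting for the next job."""

    def __init__(self, worker_cmd):
        self.worker_cmd = worker_cmd
        self.spare = self._spawn()  # pay the startup cost ahead of time

    def _spawn(self):
        # The worker starts now and then blocks reading stdin until a job arrives.
        return subprocess.Popen(self.worker_cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)

    def run(self, job):
        # Use the warm process and start its replacement right away.
        worker, self.spare = self.spare, self._spawn()
        out, _ = worker.communicate(job)
        return out

    def close(self):
        # Closing stdin lets the idle spare finish and exit cleanly.
        self.spare.communicate("")


# Toy worker (assumption): reads all of stdin, prints it upper-cased.
WORKER = [sys.executable, "-c",
          "import sys; print(sys.stdin.read().upper(), end='')"]

pool = WarmSpare(WORKER)
first = pool.run("hello")
second = pool.run("again")
pool.close()
print(first, second)
```

The sketch also shows the limitation mentioned above: if jobs arrive faster than a replacement can finish starting, the next job still waits on startup, so the trick mostly helps when there is idle time between runs.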

Another option is Nailgun, but it has its limitations, too.

None of this is ideal. If you want to write a very simple CLI program, keep this in mind. You may want to stay away from the JVM.