
Hey, thanks for trying out our tool!

First of all, --version shouldn't try to run any targets. This seems like a bug. Thanks.

Yes, you guessed correctly - this is the JVM startup time. I just hate the JVM for that. We experimented with Nailgun and Drip to eliminate it. Nailgun is problematic because it uses a shared JVM for all runs, which can get quite hairy. In the long run, Nailgun is almost certainly not the answer, since it assumes that things we have no control over (e.g. the Clojure runtime) don't do destructive teardown. Drip is a bit more promising, but we didn't succeed in running Drake under it (simpler things worked fine, though).

So, we're still looking into it, and we're looking for other ideas, too.

In the meantime, you could run Drake under REPL:

(-main "...")

The only problem is that Drake calls System/exit, but we can add a flag ("--repl") that would prevent it from doing so, so you'd stay in the REPL.
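
A minimal sketch of how such a flag might work, assuming every exit goes through one helper function. The names `*in-repl*`, `exit`, `run` and `run-in-repl` are made up for illustration and are not Drake's actual code:

```clojure
;; Hypothetical switch; a real "--repl" flag would set this from the command line.
(def ^:dynamic *in-repl* false)

(defn exit
  "Terminates the JVM, unless *in-repl* is true, in which case it throws instead."
  [code]
  (if *in-repl*
    (throw (ex-info "exit suppressed in REPL" {:exit-code code}))
    (System/exit code)))

(defn run
  "Stand-in for a -main function: does its work, then calls exit."
  [& args]
  ;; ... the real work would happen here ...
  (exit 0))

(defn run-in-repl
  "Calls run with exits suppressed and returns the exit code instead."
  [& args]
  (binding [*in-repl* true]
    (try
      (apply run args)
      (catch clojure.lang.ExceptionInfo e
        ;; Return the code if this was our own exit signal; otherwise rethrow.
        (or (:exit-code (ex-data e)) (throw e))))))
```

With this in place, `(run-in-repl "...")` returns the exit code and leaves the REPL session alive. Note that `binding` is thread-local, so an exit called from a thread that does not inherit the binding would still kill the JVM.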

Thoughts?

P.S. JVM is unfortunate but Clojure is a fantastic language for something like Drake.



Thanks for the detailed and well explained reply.

I have limited experience with Clojure, but it does seem to be a good match for this sort of task due to its structure. However, the JVM seems to be a real drawback to me. Perhaps with something like Scheme or Lisp you might get a similar program structure, and be able to compile to faster binaries?

The REPL is a solution, but as many developers are using tools like make with many other tools in the shell, running a REPL like that would prevent them from using other things efficiently. Ultimately I think the overhead time needs to be removed.

If it takes far longer than something like make, that's not necessarily an issue. The key point is making it fast from the user's perspective. As long as it runs in a fraction of a second, I can't see much of a difference between 0.1s and 0.0001s, so I don't think that sort of difference really matters. It's when it gets over 1s that it becomes an issue.

Running something like Nailgun in the background may be a good solution, though I don't have any experience with it. But if it requires starting a daemon in the background, that could get in the way of using the tool in a normal way.

I don't really know what the best solution to this problem is. I'm not sure Clojure is the best tool for the job.


I can certainly see your point about using Drake in an automated environment where this delay would still matter, but running a daemon is not practical. I think you have a lot of good arguments against the JVM. There were some moments when I thought it might not have been the best choice as well - for example, the Java world is notoriously poor at dealing with child processes.

So, I agree, but there are several arguments that it's not that bad after all:

- Drake is fundamentally an interactive tool. If you run it as a part of an automated process, all its flexibility is not quite needed. You could have Drake print a list of all shell commands it would execute, and save it to get your automated script.

- Most data workflows Drake is good for are quite expensive. Minutes, sometimes hours. Definitely much more than 5 seconds. The reason is simple - if your workflow takes so little time, you're really not gaining much by using a complicated tool like Drake, instead of just putting it all in a linear shell script, and simply re-running everything every time you need it.

- Maybe we'll find a good solution like Nailgun or Drip.

- Maybe someone will make a Java-code compiler that would create a stand-alone executable out of a JAR.

- Maybe Sun will eliminate JVM startup overhead. Or somebody will release a 3rd party JVM without it.

- Maybe we'll have a compiled version of Clojure one day.

- Other maybes. :)

We would certainly support any effort to port Drake to Lisp, C++, Ruby, Python or any language you desire. Porting it to Common Lisp might not be much easier than porting it to Ruby. We might not consider it ourselves, since the effort would be quite substantial.

Does it sound reasonable to you?


I would say if a startup overhead of < 10 seconds bothers you, you're not working with "data". Of course sed and grep have less overhead, but I wouldn't even think of trying out a new tool for files/datasets larger than, say, a gigabyte. (Rough guess, I know you can use grep and sed in under 10 seconds for larger files, the point is about perspective and complexity.)

Clojure is sadly a really bad choice for fire-and-forget cli scripts, but "large scale data processing" doesn't fit this criterion for me.


I'm mostly going to use this for parsing XML into some other formats and getting it into SQLite databases, I think. The reason I would like to use Drake over 'raw' Python scripts is that it supports a lot of the mundane stuff that goes around the actual processing of the data, and I want to automate the processes.

I typically deal with sub-100MB XML documents, so processing them takes very little time, but having the quick iteration of changing the format and re-outputting is a key part of the development cycle for me, and I think very useful when you are experimenting with new data and seeing how it could be used. Doing quick transforms is awesome.


Drip now works with Drake! Yes, it's still less than ideal if you're calling Drake hundreds of times from an automated script which you need to run quickly, but for interactive development, it should work just fine:

https://github.com/Factual/drake/wiki/Faster-startup:-Drake-...


It's a good point, and I agree it might not be the top priority, but I also understand the frustration. I, too, find the 5s startup time rather irritating, especially when I make errors in the workflow file or don't specify targets correctly. So, we are in search of ideas on how to fix it.


Did you consider ClojureScript and V8? Are there downsides to ClojureScript that would lead you to use Clojure instead?


To be honest with you, no, we didn't seriously consider it. Maybe we should have. I do not know if ClojureScript would be able to work with all the dependencies we have (for example, Hadoop client library to talk to HDFS). But it's a good point nevertheless. I'll mention it in https://github.com/Factual/drake/issues/1.

Thanks!


I didn't realise originally that Drake integrated with HDFS. That's a really awesome feature, and I can see why the JVM made sense in development because of existing HDFS libraries.


Thanks for the response! I ask because I have an idea for a CLI program, and I want to write it in Clojure, but I'm worried about the startup time of the JVM. As I understand it, this issue is mitigated in Drake by the fact that a typical job will crunch lots of data and therefore take lots of time. That's not the case for my program; it needs to be quick.


Yes, startup times are a pain. As of this morning, Drake works with Drip, which is a nifty tool to bring down startup times. It spins up "backup" JVMs, so the next time you run the command, a JVM is ready. It works great for interactive environments where at least several seconds pass between runs, but won't do much if you need to run Drake several times per second from an automated script.

Another option is Nailgun, but it has its limitations, too.

None of this is ideal. If you want to write a very simple CLI program, keep this in mind. You may want to stay away from the JVM.



