
Just replying to the top bit, since we're exchanging multiple comment chains. Hadoop, Spark, etc. are built on top of a job scheduler, so if you have multiple competing jobs, it will allocate tasks to resources in a fairly reasonable way. If there is too much contention at some point in time, your jobs will all still run, and portions that can't be scheduled immediately will be delayed while they wait for free nodes. The processing model also generally leads to independence between jobs. You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution. And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.
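The "delayed, not broken" behavior can be sketched with a toy scheduler. This is only an illustration of the idea, not how YARN's actual capacity/fair schedulers work, and all the job names and sizes are made up:

```python
import heapq

def run_jobs(jobs, total_slots):
    """Toy FIFO scheduler. Each job is (name, slots_needed, duration).
    A job that can't get its slots immediately is delayed, not killed:
    it waits until running jobs finish and free up capacity."""
    free = total_slots
    running = []        # min-heap of (finish_time, slots_held)
    clock = 0           # earliest time the next job may start
    finish = {}
    for name, slots, duration in jobs:
        while free < slots:             # wait for capacity instead of failing
            t, s = heapq.heappop(running)
            clock = max(clock, t)
            free += s
        free -= slots
        heapq.heappush(running, (clock + duration, slots))
        finish[name] = clock + duration
    return finish

# With 4 slots, "c" (needing 3) must wait for both earlier jobs to finish:
print(run_jobs([("a", 2, 10), ("b", 2, 5), ("c", 3, 1)], total_slots=4))
# -> {'a': 10, 'b': 5, 'c': 11}
```

Without this wait-for-capacity loop, the equivalent of "c" in a chained-bash setup would simply fail when it couldn't allocate its memory or slots.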

edit: if you're running on a cloud, you also may be able to autoscale to deal with spiky usage patterns.



> You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution.

I'd venture to say that "for free" is a bit of a stretch, other than under the "already paid for" definition of "free". Being built on top of a job scheduler means one always has to exert the effort/overhead of dealing with a scheduler, even for the simplest case.

On a single machine, you can decide if you want to add the complexity of something like LSF or not. (I would call a custom single-node solution a strawman in the face of pre-Hadoop schedulers. Batch processing isn't exactly new).

> And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.

I'm pretty sure I'm still missing something. Since my goal is to gain a better understanding, I'll put forward where I believe the disconnect is (but definitely, anyone, correct me if my assumptions are off):

You're coming from a distributed model, where the assumption is that "resources" are divisible and allocatable in such a way that, for any given job, there's an effective maximum number of nodes it can use (e.g. network-i/o-bound job that would waste any additional CPUs). Any leftover nodes need other jobs allocated to them to reach full utilization.

I, however, am considering a single-server model that assumes there won't be any obvious bottlenecks or limiting resources, preferring instead to run a single job at a time, as fast as possible. This would completely obviate the need for any complex scheduling and avoid accidental resource exhaustion (which, I suppose, is really only disk or memory).


hah, you're right, I definitely am looking at it with distributed system goggles on. With just one machine, yes, it will probably make sense to fully process one job before moving on to another (although there are some cases where this may not be true, such as if you're limited by having to wait for external APIs; but this isn't a strong argument because calling external APIs probably isn't something you want to be doing in Hadoop anyway).

If your workload scales, though, you will eventually have to spread out onto multiple machines. At that point, even if you process each job on a single machine, you either need a system to coordinate job allocation across machines, or you have to accept some slack if you only schedule jobs on individual machines. In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them, but my point is just that the more steps we take towards the complexity of a distributed computing model, the more it starts to seem reasonable to just say "okay, let's just use hadoop/spark/etc. and get the extra benefits they offer as well (such as smooth scaling without having to think about hardware), even if it's more expensive than a setup with individual nodes optimized for our purpose."

It's now really easy to spin up a cluster, and with small scale, costs are not that big a deal. With large scale, your system is distributed in some way anyway.
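The "accept some slack" tradeoff is easy to see in a sketch: if each whole job must fit on one machine, free capacity fragments across nodes, and a job can wait even though the cluster as a whole has room. The machine names, sizes, and greedy policy below are all made up for illustration:

```python
def place_jobs(jobs, machines):
    """Greedy per-machine placement: each job (name, size) goes whole onto
    the machine with the most remaining capacity; if none fits, it waits."""
    free = dict(machines)                    # machine -> remaining capacity
    placed, waiting = {}, []
    for name, size in sorted(jobs, key=lambda j: -j[1]):  # biggest first
        best = max(free, key=free.get)
        if free[best] >= size:
            free[best] -= size
            placed[name] = best
        else:
            waiting.append(name)
    return placed, waiting, free

# Two 8-unit machines with 5 units free in total, yet the 4-unit job waits:
placed, waiting, free = place_jobs(
    [("a", 6), ("b", 5), ("c", 4)], [("m1", 8), ("m2", 8)])
print(waiting, free)    # -> ['c'] {'m1': 2, 'm2': 3}
```

A scheduler that can split tasks across machines (the YARN-style model) would run "c" on the leftover capacity; a whole-job-per-machine model has to eat that slack or build its own coordinator.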

I don't think it's an obvious decision, and you're absolutely right that people are usually too quick to jump to distributed systems when they really don't need them. But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary. I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, that's like a pain."


Sorry, I forgot to respond to the last part of your message:

> I don't think it's an obvious decision

Certainly, which is why I even bother with discussions like this, in the hopes of making the decision clearer (if not obvious) to me in the future.

A response to the comment I footnoted in my other comment pointed out how oversimplified my understanding of distributed databases was. I knew it was an oversimplification, just not in which way.

There's plenty of computer science research from the 70s and 80s covering these topics, but it's both tough to translate into practical considerations and woefully out of date (e.g. it doesn't account for SSDs or cheap commodity hardware).

> But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary.

Well, philosophically, I would disagree with such an assertion on the grounds of premature optimization, absent the "strictly".

I would advocate for switching from scaling "up" (aka "vertically", larger single machines) to scaling "out" (aka "horizontally" or distributed) around the point of cost parity, not at the point it is no longer possible to scale up a single machine (unless that point can reasonably be expected to occur first, I suppose).
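As a toy illustration of the cost-parity idea: compare the cheapest single machine that can handle a given load against a fleet of small nodes. Every price, capacity, and the 30% coordination overhead below is a made-up assumption, not a measured number:

```python
import math

def cheaper_option(load, single_tiers, node_capacity, node_cost, overhead=1.3):
    """Compare scaling up vs. out at a given load.
    single_tiers: (capacity, monthly_cost) pairs for ever-larger single
    machines. Scaling out pays an `overhead` factor of extra work for
    coordination/replication. Returns ('up'|'out', monthly_cost)."""
    up = min((cost for cap, cost in single_tiers if cap >= load),
             default=math.inf)          # infinite if no machine is big enough
    out = math.ceil(load * overhead / node_capacity) * node_cost
    return ("up", up) if up <= out else ("out", out)

tiers = [(10, 1000), (20, 2500), (40, 7000)]    # hypothetical price list
print(cheaper_option(8,  tiers, node_capacity=10, node_cost=1200))  # ('up', 1000)
print(cheaper_option(35, tiers, node_capacity=10, node_cost=1200))  # ('out', 6000)
```

The crossover point moves with the overhead factor and the shape of the single-machine price curve, which is exactly why this is a judgment call rather than a fixed rule, and why "no bigger machine exists" is only the forced upper bound.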

> I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, that's like a pain."

That would account for any overestimation of how difficult it is to work with hardware, or of how complex Hadoop is to set up, administer, or use. Those are just initial conditions, yet they may well unduly influence decision making that has far longer-lasting consequences.

However, I'd like to think I'm not often guilty of the latter overestimation when discussing solutions (even when advocating single-server), as I tend to assume it can at least become easy enough for anyone out there, so long as the technology is popular enough (like Hadoop) or traditional/mature enough (like the tools in the original comment, or PBS) that plenty of documentation and/or experts exist.

My background also includes having seen, first-hand, over decades, various attempts at distributed processing and databases in practice, with varying degrees of success. This has included early "universal" filesystems like AFS, "sharding" MySQL to give it "web scale" performance [1], Gluster and its ilk, some NoSQLs, and of course Hadoop.

If anything, I'd say that, with most popular, new technologies, especially ones predicated on performance or scale, "it's a pain" is not the knee-jerk skeptical reaction my experience has ingrained in me. Rather, it's more like "sure, it's easy now, but you'll pay." TANSTAAFL.

[1] This worked well enough, but it had a high up-front engineering cost and a high ongoing operating cost: a large number of medium-small servers, plus larger-than-otherwise-needed app servers to do DB operations that could no longer be done inside the database, because each node had incomplete data. Due to effort overestimation, it was unthinkable to move from a VPS to a colo so as to get a medium-large single DB server with enough attached storage to break the "web scale" I/O bottleneck for years to come.


> calling external APIs probably isn't something you want to be doing in Hadoop anyway

And yet I've seen it done. At least they weren't truly external, just external to Hadoop.

> In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them

There's those goggles again :) Batch schedulers have existed for decades (e.g. PBS started in '91).

> (such as smooth scaling without having to think about hardware),

This neatly embodies what I believe is the primary fallacy in most of the decision making, including the fact that it's often parenthetical (i.e. a throwaway assumption).

Does anyone really value the "smoothness" of scaling? I'd expect the important feature to be, instead, that the slope of the curve doesn't decrease too fast.

The notion that Hadoop somehow frees one from having to think about hardware flies in the face of hardware sizing advice from Cloudera, Hortonworks, and others, which discusses sizing node resources based on workload expectations (mostly I/O and RAM) and heterogeneous clusters. It does, however, explain my observation, in the wild, of clusters built out of nodes that seem undersized in all respects for any workload.

> It's now really easy to spin up a cluster,

It's really easy if it's already there? That borders on the tautological. Or are you talking about an additional cluster in an organization where there already is one?

> and with small scale, costs are not that big a deal.

That's just too broad a generalization, just as "cost is a big deal" would be. Cost is always a factor, just not always the biggest one.

Small scale is often (though not always) associated with limited funds. Doubling, or even tacking on 50% to, the cost could be catastrophic to a seed startup or a scientific researcher.

> With large scale, your system is distributed in some way anyway.

This strikes me as little more than a version of the slippery slope fallacy. Even some web or app servers behind a load balancer could be considered distributed [1], but that doesn't make them a good candidate for anything that's actually considered a distributed framework.

It also hand-waves away the problem that, even if costs weren't a big deal at small scale, they don't somehow magically become less of an issue at large scale. Paying a 50% "tax" on $40k is one thing. At $400k, you could have hired someone to think about hardware for a year, instead.

[1] I just recently pointed out, slightly tongue-in-cheek, an architectural similarity, one layer down, between app server and database: https://news.ycombinator.com/item?id=17521817



