Consul 1.0 Released (hashicorp.com)
206 points by schmichael on Oct 16, 2017 | 62 comments


I use Consul at work, and I do like it, but I have to ask a kind of obvious question: why would I use Consul over something that's more industry-standard like Zookeeper?

I'm not asking this passive-aggressively; I mostly want to know what features it adds over a vanilla Zookeeper setup.


Zookeeper and Consul have very different scopes and therefore architectures. Consul servers are very much like Zookeeper: an odd number of nodes running a consensus protocol to provide a consistent API to a key/value store.

However, Consul builds tons more on top of the servers. Consul has a first class notion of services and health checks. It has agents that run on every node to register services, perform health checks, and provide service discovery to local apps via HTTP or DNS.

For every Consul feature there is probably a similar library or tool to do the same thing with Zookeeper. Consul just chose to focus on the service discovery problem and address it as a first class feature of the project.

Full disclosure: I work for HashiCorp, but not on Consul.


I don't know much about Zookeeper, having not used it personally, but I do manage a large Consul installation as part of my Ops role for a large cloud-based SaaS provider. Not knowing anything about differences in functionality, one thing that stands out to me is platform differences. If you're in a Java shop and have all the tools and expertise for managing Java applications, then Zookeeper will fit right in. If you _aren't_ a Java shop, then that's an additional layer of complexity you will be forced to manage. For the most part, we aren't a Java shop at my job, so Consul's native binary is very appealing because it's simple and straightforward.

Consider that the Consul binary distribution is a zip file containing a single file. Compare that to Zookeeper's tarball which contains 1538 files. That maybe isn't a fair point on which to make a comparison, but I think it's an important signal about their respective approaches. Various people might have different reactions to those numbers, so I'm not trying to pass judgment here. Just giving an idea of one way to think about differences.


What additional layer of complexity are you talking about?

Installing a JVM?


I found this article [0] a couple weeks ago and found it informative. Basically Consul has service discovery built in, while Zookeeper et al. expect you to build service discovery on top of the primitives they provide (assuming you need it).

[0]: https://www.consul.io/intro/vs/zookeeper.html


This article is biased because it was written by Consul's author.

I'm a big fan of Zookeeper, and I agree with the parent that it is not only the industry standard but also very robust and battle-tested.

Having said that, this is not what Zookeeper was meant for. ZK is a coordination service: if you have a distributed service that you need to coordinate and you don't want any single point of failure, ZK is the choice.

ZK can be used for service discovery, but that's not its proper use; you could equally set up a single database node or a MongoDB cluster to accomplish essentially the same thing.

So a specialized service is better, but I also don't like how Consul solves service discovery. If you think about it, SD doesn't require strong consistency. It is not a big deal if not all nodes learn about each other at the same time; arguably that's actually preferred.

By removing strong consistency (the only reason why ZK is even compared to Consul) we could have a better SD that's also more scalable.


> If you think about it, SD doesn't require strong consistency, it is not a big deal if all nodes won't learn about each other at the same time, arguably that's actually preferred.

Can you go into more detail about this? Each computer has its own Consul agent that monitors the health of services running on it. You also get configuration from the agent on the individual computer. All of the agents communicate peer to peer and with the server, don’t they? Isn’t that eventually consistent? I’m just starting to use Consul for configuration and haven’t started working on the service discovery part yet.


Yes, but that's only to detect whether a node is up or down; the health checks are still strongly consistent and go through the servers.

Because of this design, they don't get the benefits of either of them.

Side note: KV is actually strongly consistent, which is what is desired. The thing is that since KV comes with Consul, it encourages people to use it.

One just needs to keep in mind that Raft and Paxos expect all nodes to be updated on every change (e.g. if you have a 5-server cluster, all 5 nodes need to "sign on it" before it is accepted). This is counter-intuitive, since adding nodes to a cluster actually makes it slower (the goal is correctness and resilience, not performance). Basically, you just need to remember not to use KV for write-heavy operations. Serving configuration files is generally fine.


In both Paxos and Raft you only need a quorum to sign on it. In the case of Raft/Consul it's the majority, which means you need 3 of 5 nodes to sign on it.
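To make the majority arithmetic concrete, here is a rough sketch in Python. The function names are made up for illustration and are not part of Consul or any Raft library; it only encodes the "more than half" rule described above.

```python
def quorum_size(servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return servers - quorum_size(servers)


for n in (3, 5, 7):
    print(f"{n} servers: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

For 5 servers this gives a quorum of 3, matching the "3 of 5" figure.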

Also, key value store reads are usually done locally, from the agent, which goes over gossip and is eventually consistent.

Health checks are also eventually consistent. They are done by the local agent, and through gossip are sent to the masters group.


I'm aware that Paxos and Raft require an odd number of nodes. My point was that the more nodes you have, the slower the cluster is, because increasing the number of nodes doesn't increase performance but resilience. With 3 nodes you can afford to lose at most 1 node, with 5 -> 2, with 7 -> 3, etc.

> Also, key value store reads are usually done locally, from the agent, which goes over gossip and is eventually consistent.

That's not what the documentation says. The KV uses Raft. You can relax some guarantees for reading by having clients use consistency=stale, but you're still communicating with the server, except in stale you only communicate with one instead of all of them.

> Health checks are also eventually consistent. They are done by the local agent, and through gossip are sent to the masters group.

That's also not what the documentation says. Gossip is used for adding/removing nodes, and the overall health of the node (whether the host is reachable) is also determined through gossip, but the actual health checks (whether a service is running) are done through the consensus protocol.

I still want to emphasize that using a consensus protocol for SD is not a good idea. Let's say you're running in AWS and follow the best practice of spreading your servers over multiple AZs. Let's say you use us-east-1, which has 6 AZs.

AWS starts having network issues (like a few months ago), and 2 AZs can't communicate with the rest. The machines are still running; they just can't communicate. It just so happens that in those 2 AZs you run 2 out of your 3 Consul servers. Even though all the machines are healthy and running, your Consul cluster is down, even for the remaining 4 AZs that don't have any problems.


You should probably have a Consul server in each availability zone if you want to protect against this type of failure.

Consul uses one Raft cluster per "datacenter", so you don't have to worry about write performance seriously degrading as you scale; the cluster only needs to grow up to the number of availability zones you'd like to operate in per region. Plus, you shouldn't be doing write-heavy KV operations with Consul anyway.

Just because you are using strong consistency doesn't mean you can't get reasonably high availability in practice.
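A toy model of the AZ scenario discussed in this subthread, as a minimal Python sketch. The AZ names and the `has_quorum` helper are hypothetical, and it assumes every server in an affected AZ is unreachable from all the others (including from the other affected AZ).

```python
def has_quorum(server_azs, failed_azs):
    """True if the servers outside the failed AZs still form a strict majority."""
    total = len(server_azs)
    reachable = sum(1 for az in server_azs if az not in failed_azs)
    return reachable >= total // 2 + 1


failed = {"az-a", "az-b"}  # the two AZs that lose connectivity

# 3 servers, 2 of which happen to sit in the affected AZs
three_servers = ["az-a", "az-b", "az-c"]

# one server in each of 6 AZs, as suggested in the reply
one_per_az = ["az-a", "az-b", "az-c", "az-d", "az-e", "az-f"]

print(has_quorum(three_servers, failed))  # False: 1 of 3 reachable
print(has_quorum(one_per_az, failed))     # True: 4 of 6 reachable
```

This says nothing about write latency or real network behaviour; it only shows how server placement changes whether a majority survives the same outage.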


> That's not what the documentation says. The KV is using raft, you can relax some guarantees for reading by having clients use consistency=stale, but you're still communicating with the server, except in stale you only communicate with one instead all of them.

How do you come to the conclusion that you're communicating with the server? When I'm doing any type of communication with Consul via the API, I'm always communicating with 127.0.0.1/....


Local agents forward to your central consul servers.
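For illustration, a small Python sketch of how a client might build a KV read URL against its local agent. It assumes the agent listens on the default HTTP port 8500 and that the read consistency mode is chosen with the `stale` or `consistent` query parameter of the `/v1/kv/` endpoint; the `kv_url` helper and the key name are made up. The request still goes to 127.0.0.1 either way, and nothing here performs the actual HTTP call.

```python
from urllib.parse import quote


def kv_url(key, mode="default", agent="http://127.0.0.1:8500"):
    """Build the URL for a KV read through the local agent.

    mode is one of "default", "consistent" or "stale".
    """
    if mode not in ("default", "consistent", "stale"):
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{agent}/v1/kv/{quote(key)}"
    if mode != "default":
        url += f"?{mode}"
    return url


print(kv_url("config/app/db"))
print(kv_url("config/app/db", mode="stale"))
```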



Although it's not part of Zookeeper core, there is an Apache Curator module[1] for doing service discovery on Zookeeper.

[1] https://curator.apache.org/curator-x-discovery/


From the standpoint of CAP theorem, Consul is AP and ZK is CP. AP and CP distributed systems have their use cases, so it's important to think them through up-front. For example, I personally consider service discovery to favor AP.

I arrived at that personal assertion by building a SD setup using our conventions + Curator + ZK. In disaster scenarios we forced on the system during testing, I was able to get ZK to basically lock up because, really, that's what it's supposed to do, guarantee consistency--if it can't guarantee consistency, then it tells you about it. Consul for us fit the bill better.

Note: This was some time ago and I do not claim to be a ZK expert; it's quite possible the root issue was me. YMMV, etc etc.


Through most of my career, ZK has caused me so much grief.

So I welcome competition in this space.


If ZK caused a lot of grief then you must be doing something horribly wrong with it.

From my experience ZK is one of the tools that's most ops friendly (i.e. you set it and forget it), right next to Postgres and Solr.


> right next to Postgres and Solr.

Never used Solr, but even as someone who loves Pg over every other RDBMS out there, I'd hardly call it "ops friendly". Its Bring-Your-Own-Tooling mentality turns a lot of operations people off when they realize features like replication, log archiving, etc. require bolt-on tools to actually do anything useful.

ZK is set and forget though, outside of occasional tweaking of JVM parameters based on growth.


I'm not sure what you mean (I suspect that perhaps that was the case before Postgres 9?). My current company has existed since the days when MySQL ruled the IT world, so it is heavily MySQL-based. About a year ago it was agreed to introduce PostgreSQL as well.

We currently have 9.6 with barman for streaming backups and repmgr to control replication. Compared to any of our MySQL setups, it is a breeze to manage. My team owns both, and the bad thing is that there has been only one on-call issue so far[1] (bad, because if a system doesn't break often, no one learns about it).

PostgreSQL + barman + repmgr is much nicer than anything else we have for MySQL.

A few things: barman offers streaming backup, meaning it is not just once a day but continuous. On top of that, it includes a single command that can verify it is operating correctly (it has enough backups, the streaming operation works, etc.).

I also like that everything is robust. In MySQL, if replication breaks, I need to manually restart it; if it gets too far behind (too many changes, or it is not fixed fast enough), the WAL logs will be erased and I would need to set it up from scratch again[2].

Postgres is smart enough to keep track of all WAL logs and only deletes them once all standby nodes have received them. If the connection is broken (for example, by a network partition), it will reconnect and resume replication. It self-heals whenever it can. It is a breeze to manage.

Another thing that cost MySQL a lot of points with me happened at my previous workplace. A MySQL host suddenly started throwing errors to clients. It turned out that a few tables had gotten corrupted. The fix was easy: just shutting down and running the repair tool. What was puzzling is that there was no obvious reason (the disk had plenty of free space and no corruption, and the machine had not been abruptly terminated, with ~100 days of uptime). I never found out why the data got corrupted. Also, I have never experienced things like that with Postgres.

[1] The issue was that the barman host had a bad disk, so that triggered an alert, and after the disk replacement things continued to work.

[2] There is also one crazy system that MySQL allowed to happen. Essentially, in master-standby replication someone had the genius idea to do some processing on the data in the master and write it to the standby. Over time that data also became important, so the standby node actually needed to be backed up instead of the master. This is "great" because if replication breaks, whoever is on call will have a fun time doing some acrobatics to restore the whole thing. Postgres correctly doesn't allow writes to a standby during replication.


> PostgreSQL + barman + repmgr is much nicer than anything else we have for MySQL.

I use them all. As someone who is very familiar with PostgreSQL and Linux systems in general, I have no complaints, but unless you know what you're doing it's painful to get all of these set up; they're not intuitive and not integrated.

This is what I mean by "Bring-Your-Own-Tooling": the core support for replication, WAL archiving, etc. is all built into PostgreSQL, but there's no included tooling to manage it. It's up to you to pick between barman, wal-e, whatever to handle backups. It's up to you to choose between repmgr or pglookout. It's up to you to choose between pgBouncer or pgpool. Batteries not included, and nothing is opinionated, have fun.

I'm not comparing PostgreSQL against MySQL either; I'm comparing it against Oracle or SQL Server. There, this functionality is baked in, and as a result DBAs are a dime a dozen (for the most part), because their job mostly revolves around managing performance and general maintenance instead of wiring a bunch of extra tools together to get it into a production-ready state.

I love PostgreSQL as a developer, and as someone who knows what he's doing I don't mind it from an operations role either - but I don't look forward to the day when we finally fill our open DevOps position and I have to train them on monitoring and managing it.


Just curious about the MySQL problems you hit.

MySQL replicas self-recover via MASTER_CONNECT_RETRY and MASTER_RETRY_COUNT. e.g. set to 1, 8640000 respectively to make a replica retry every one second for up to 100 days before finally giving up.
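A quick sanity check of those two numbers, as a Python sketch. The helper name is made up, and it counts only the wait between retries, not the time each connection attempt itself takes.

```python
from datetime import timedelta


def retry_window(interval_seconds, retry_count):
    """Total time spent waiting between retries before the replica gives up."""
    return timedelta(seconds=interval_seconds * retry_count)


# MASTER_CONNECT_RETRY = 1 (seconds), MASTER_RETRY_COUNT = 8640000
window = retry_window(1, 8_640_000)
print(window.days)  # 100
```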

At least this is true for network errors. If replication breaks due to data inconsistency, it should stay broken and a human should look into it, IMHO.

Corrupted tables and repairs suspiciously sound like MyISAM, and I realize this ancient engine still lurks at some places. I haven't done a "repair" for over 14 years -- the last time I used MyISAM in production. Using MyISAM today is analogous to using the ext2 file system today.

In any case this just clarifies a couple comments on MySQL and has nothing to say on MySQL vs. PostgreSQL.


It's a lot easier to take care of than zookeeper for one thing. Also Consul is getting pretty close to industry standard these days for newer businesses.


I can confirm this. Consul is much easier to run in production without issues. Zookeeper is rather picky around disk responsiveness etc.

And the Consul API is rather nice.


For me it's the ease of use. One simple Go binary, all the features you need (yet manages to feel lightweight), well documented and rock solid.


Consul is like ZK but specifically tailored to maintaining the state of infrastructure and application services. It removes a lot of the work required to do this on top of ZK.


I use it because of the built in service discovery and the built in integration with Hashicorp’s other tools that I use - Nomad and Vault.


Zookeeper is industry standard? I guess that explains why it's such a mess.


Lack of gossip in zookeeper makes changes in cluster topology much more painful.


ZK 3.5 brings this.


Big thanks to the Consul team! We have been using it for a year with basically 0 maintenance.

* Decent REST API

* K/V store

* Service discovery

* Single Go Binary

* Geographical failover

* Load balancing

A real Swiss Army knife.

Haven't tried different things (e.g. ZooKeeper) but don't feel very compelled to, since everything just works.


HCL: why invent another config/markup language? There's TOML, YAML, and a dozen others.


It's really quite a nice config language -- I much prefer HCL to any of the alternatives. Has comments (unlike JSON), is properly nestable (unlike TOML), is extremely human readable and not too verbose, has a built-in pretty printer, etc.

I wasn't ever really a huge YAML fan. It's way too ambiguous and sensitive to whitespace. One erroneous space is way more likely to mess up your YAML file than your HCL file.

It's not perfect, but I actually would like to see MORE systems using HCL, not fewer.


HCL is not used in the REST API though. So we have to integrate our apps using both HCL and JSON. It's also schemaless, which means we have to guess the API using various "samples".


Which REST API?



How is TOML not properly nestable?


Because in TOML you just specify additional keys after the first one, leading to redundancy and representing a structure that's actually nested as flat.

In TOML you can indent but it's not required:

    [[fruit]]
      name = "apple"
    
    [[fruit.physical]]
      color = "red"
      shape = "round"

In HCL it's all just one (properly nested) object and you can `fmt` it:

    fruit {
        name = "apple"
        physical {
            color = "red"
            shape = "round"
        }
    }

It seems arbitrary but really matters once you're nesting a few levels deep, which tends to happen in anything complex such as an orchestration system.


Such a good question that there's a section of the README dedicated to it! https://github.com/hashicorp/hcl#why


FWIW, many OpenBSD tools use a configuration language that's not too far from HCL. I'm much more confused by the industry using the difficult to parse (for both humans and computers) YAML for everything.


I think HCL strikes a balance between being too generic (YAML, JSON) and too flexible (Chef, Puppet code) and seems to be better suited for infrastructure automation.


Yes, or a full-featured programming language. Even though I don't really know or like Ruby, I still like that Vagrant uses it for config since it is at least possible to do complex things if you need to.


It makes it impossible for the system to reason about the outputs though (due to the halting problem and a lot of other practical issues). Therefore a Vagrant setup just "does its best" to execute the given prescription to set a machine up, whereas something like Terraform will actually intelligently figure out if work needs to be done.


I don't understand why you have been downvoted. Many of these tools are very imperative and make assumptions about the starting state before they replay their updates. For a fully declarative solution, NixOS is the best I have seen.


This is a strange assertion--you run the DSL to generate, at runtime, a data model that can then be checked. Hell, I wrote a Terraform competitor that was certainly no worse at planning than Terraform is. (Still not great, because the problem is a hard one, but.)

And doing so gives you such niceties as if-statements and loops (I mean, have you seen the straight nonsense people have to pull to make a simple 'if' in HCL?) as well as a seamless way to introduce external data sources. This is why Chef remains the only part of my workflow that has not changed over the last five years (and why first Puppet and, eventually, Ansible will fall by the wayside): because, as a programming language and set of libraries, I can bring it with me by extending it at the user level without having to extend the runtime to do something. Patching the runtime means I need CM for my CM; doing it at the user level means it goes with me and plays with the base tools.


> Hell, I wrote a Terraform competitor that was certainly no worse at planning than Terraform is.

So why is Terraform so popular instead of your tool? :)

Both approaches have tradeoffs, and what you've mentioned are clearly advantages of the full-PL config style. The problem with a full-blown programming language, though, is that the flexibility makes it more difficult to reason about effects without actually running the code. If you're just using it to generate a declarative data model that then gets applied, you're still doing fully declarative management, just with extra steps.

It's not _necessarily_ a good thing to have something like if conditionals in a provisioning system either - the goal with these systems is usually to reduce uncertainty about outcomes, not multiply the # of branches you need to reason about. Ultimately I agree Terraform has issues with restrictions of HCL, but I don't think I'd say choosing a programming language instead of a purely declarative style is a no-brainer.


In favor of Ansible, I'd argue that extending it is trivial.

I tend to write simple Jinja2 filters when I need something done quickly that's not included, and it's as simple as making a specifically named directory (filter_plugins) and a python class (FilterModule). You can write your own modules pretty simply as well but there's a bit more boilerplate.


Considering that Vagrant is their most famous tool so far, they should have had a good reason to move to HCL then right? (And they do, as linked in a sibling comment.)


It's mostly JSON. We use this and Terraform and yes, I'd prefer go scripts and a literal library of go code.


How many people are actively working on it and maintaining it? From my experiences with Vagrant and Packer, I get the feeling that there is not much development effort (by Hashicorp employees) allocated to the free Hashicorp products so I hesitate to even consider using Consul.

Vagrant and Packer feel like early alphas and while there is little competition, I do not feel comfortable using them. Because of this bad experience in the past, I have not even tried out Consul and Vault and am unlikely to do so in the future.


I don't know the answer to your question but while Vagrant and Packer both have been pretty bad experiences for me, Consul, Nomad, Vault and Terraform are very different products and all pretty actively worked on. I am using them all at the moment and am very happy.

Consul seems like a very mature product and development has maybe slowed down, but I think this is mostly because it is mature and looks like HashiCorp does not feel the need for feature bloat.


I'm not sure if exact employee numbers are public, but I can assure you a lot of hiring has taken place over the last year and all of these products have dedicated engineering teams as well as supporting staff (customer success, sales, management, etc).

(I'm a HashiCorp employee on the Nomad team.)


Consul seems to get some love when looking at the releases, but I am really getting this impression when looking at Nomad. Seems like it's been 0.6 forever.


Consul is a dream for most Operations Engineers. At work we have used it in production for almost 3 years with basically no maintenance except dropping a new binary for the old one when we update the versions. We use it for service discovery, feature flags, consul lock for deployments and use the KV as a config store. It is also our backend for Vault and stores all the encrypted data. Thank you HashiCorp and the Consul team for an amazing achievement.


We used it heavily, but are switching away since it's not that stable inside kubernetes. Also some features are unnecessary if you have k8s already. Like you have a k/v store and you have service discovery already.

I wasn't part of that discussion though. Thus I'm curious. Anybody got a different point of view on Consul in k8s?


Did you try 0.9.3 under k8s? We did some work specifically targeting k8s by enabling pods with servers to get restarted with new IPs, even all of the servers at once. That required Consul to be configured with -raft-protocol=3 to benefit in 0.9.3. That's the default now in 1.0, so both of those versions should be much more stable under k8s.

I'd be curious if you ran into other kinds of issues on k8s and if you tried 0.9.3 with the required raft protocol version set before you gave up.



This is only tangentially related but is anyone using Spring Cloud + Consul? Is the integration of the two in a good spot now?


We are a Java shop and use Consul (as a Docker container) for our service discovery. Spring Cloud Consul plays well with Consul. The REST APIs are what make Consul stand out. We have an in-house monitoring service that queries the Consul APIs for service removals and healthy service counts and sends an alert if there are no more healthy services, so that our Docker can spin one up just in time.

Another feature that we use is dynamic reloading of KV configuration. All your yml configurations are in Consul and can be updated and propagated to all of your services on the fly.

Now the problem we face is that the Spring Cloud Consul project is maintained by very few people, so if there is an issue, we have to work around it rather than wait for a fix.


For Service Discovery, why should I use Consul over Eureka?


Consul's documentation has a comparison at https://www.consul.io/intro/vs/eureka.html.


Congrats.



