How do people in production handle the possibility that their service might miss a webhook notification? If you miss a notification you'll end up with stale data and you won't know it.
Slack retries for a while but will then just give up. Another webhook provider I've looked at says nothing at all about this sort of thing. How do folks deal with this in production systems?
Seems to me like the best way to address this issue is to use the webhook as a hint that you need to run some other process that guarantees you've got all updates.
When I was at IFTTT (a few years ago, so it's definitely changed since then) we tried not to rely on the content of the webhooks and just used them as a hint as you describe to fetch new data. Not every API made this easy though.
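That hint-plus-reconcile pattern can be sketched roughly like this. This is a generic sketch, not IFTTT's actual code: `list_changed_since`, `upsert`, and the `updated_at` cursor field are all placeholder names for whatever the third-party API and your datastore provide.

```python
def reconcile(list_changed_since, upsert, last_sync):
    """Periodic safety net: pull everything changed since the last
    successful sync, so a dropped webhook is eventually repaired."""
    newest = last_sync
    for record in list_changed_since(last_sync):
        upsert(record)                       # must be idempotent
        newest = max(newest, record["updated_at"])
    return newest                            # persist as the new cursor

def on_webhook(event, trigger_reconcile):
    """Treat the webhook body as untrusted and possibly stale: use it
    only as a hint to run the reconciler sooner than scheduled."""
    trigger_reconcile()
```

Because `reconcile` also runs on a timer, a missed webhook only delays convergence instead of silently losing data.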
If receiving a webhook is critical, you should make your receiver do as little as possible to place the event into a resilient queueing system and then process them separately. That won't save you from bad DNS, TLS, etc. configs but it should help reduce the possibility that you DoS yourself with a flood of webhook events.
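A minimal receiver along those lines might look like this. It's a sketch: the shared secret and signature scheme are assumptions (many providers sign payloads with an HMAC), and the `deque` stands in for a real resilient queue like Redis or SQS.

```python
import hashlib
import hmac
import json
from collections import deque

QUEUE = deque()                      # stand-in for Redis/SQS in this sketch
SECRET = b"shared-webhook-secret"    # hypothetical provider-shared secret

def receive(body: bytes, signature: str) -> int:
    """Do the absolute minimum in the request path: verify the
    signature, enqueue the raw payload, and return immediately.
    Heavy processing happens in a separate worker, so a flood of
    webhook events can't DoS the receiver itself."""
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401                   # reject forged requests outright
    QUEUE.append(json.loads(body))   # durable queue in real life
    return 200                       # ack now; process later
```

The worker that drains `QUEUE` can then retry, rate-limit, and dead-letter at its own pace.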
I would prefer to implement the sending of webhooks in bulk: if the consumer falls behind, they receive up to 100-1000 webhooks per request (depending on the size and complexity of each individual webhook: ids only is 1000, complex documents 100). This drastically cuts down on the number of concurrent requests to a single client when load is high, or when the consumer was down for a period of time.
Unfortunately, developers writing code to receive batch requests are often... inadequate, to say the least. They'll write basic looping code without any error/exception handling; so if the 3rd item in a bulk request of 100 items causes a server-side error for them, they throw a 500 Internal Server Error or similar and fail to continue processing items 4 through 100. You simply cannot batch webhooks as a producer, unless you treat a single failure from the client to process a batch as a cue to drop to "batches" of size 1 until those single requests start succeeding again, at which point you return to bulk. Rinse and repeat.
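That drop-to-one-then-restore policy is simple enough to state as code. This is a hypothetical producer-side sketch of the strategy described above, not any particular provider's implementation:

```python
def next_batch_size(current: int, last_failed: bool, full: int = 100) -> int:
    """Producer-side batch sizing: send full batches while the consumer
    is healthy; on any failure, drop to batches of one so a single
    poison event can't block the other 99; restore full batches once
    single deliveries succeed again."""
    if last_failed:
        return 1          # isolate the failing event
    if current == 1:
        return full       # consumer recovered; go back to bulk
    return current        # steady state: keep batching
```

The producer calls this after every delivery attempt to decide how many events to pack into the next request.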
Honestly, being the producer sending webhooks to consumers which are written by random developers is a nightmare. You have to understand that your customers will not write proper code to accept your webhook requests, even if each request is for a single webhook. You also must understand that your customers will not look to blame themselves for shitty code. You can retry 1,000 times over a 48 hour period, and if their code still fails to process the webhook, it will be YOUR fault, not theirs.
Truthfully, it is horrible to be on the sending end of webhooks to random developers/customers.
I don't understand: if it's such a nightmare, why don't you (the producer) create the code/libraries to consume those webhooks, at least for the two most common platforms (e.g. PHP, Java)?
You can set up something where it will alert you if there are too many failures in a certain time period. That isn't offered by Stripe but you can build it.
If you mean in the case of "catastrophic failure", there is none.
If there is a "catastrophic failure" (machine gets shut off for a week, data center blown up, whatever), there are probably bigger issues or we probably would already know.
Stripe has an "events" API that can be polled to receive the same content that you would have received via Webhook [1].
(Disclaimer: I work there.)
If you missed some Webhooks due to an application failure, it's possible to page through it and look for omissions. I've spoken to at least one person integrating who had this sort of setup running as a regular process to protect against the possibility of dropped Webhooks. This usually works pretty well, but does start to break down at very large scale where events are being created faster than you can page back.
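A gap-detection job of the kind described can be sketched like this. `list_events` is a hypothetical pager over an events feed (newest first, in the style of Stripe's events API); the real API client, pagination cursors, and event shapes will differ.

```python
def find_missing(list_events, have_event_ids, max_pages=10):
    """Walk an events feed page by page and report event ids that were
    never processed locally. Run as a scheduled job, this protects
    against silently dropped webhooks."""
    missing = []
    for page_no, page in enumerate(list_events()):
        if page_no >= max_pages:
            break                      # at very large scale you can't page forever
        for event in page:
            if event["id"] not in have_event_ids:
                missing.append(event["id"])
    return missing
```

Anything in `missing` gets re-fetched and replayed through the normal webhook handler.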
The possibility of dropped events is a major disadvantage of Webhooks in my mind -- if you consider other alternatives for streaming APIs like a Kafka/Kinesis-like stream (over HTTP) that's simply iterated through periodically with a cursor, you avoid this sort of degenerate case completely, and also get nice things like a vastly reduced number of total HTTP requests, and guaranteed event ordering.
(But to be clear, Webhooks are overall pretty good.)
I never even thought of using it that way. I just use the events API to check that an incoming event is a valid Stripe event. (It's probably easier/better to set up the ELB to only accept traffic from certain addresses.)
We[1] had a similar problem with clients reporting to us about lost callbacks[2] (our term for webhook). To solve it, we have built two options.
- Get a notification email every time the callback fails. The email contains the same information the callback was supposed to deliver
- Retries. We retry for up to 24 hrs at 5-minute intervals, or until the callback call succeeds. We created a sub-resource called `calls` (/callbacks/[id]/calls) that keeps the status of each call we made. If it succeeds, the status changes to "SUCCESS"; if it fails, it remains "FAILED". If the receiver system is still down after 24 hrs and the call never succeeds, the developer can make a call to GET /callbacks/[id]/calls?status=FAILED and receive all the failed calls. They can process the content and do a PUT /callbacks/[id]/calls?id=ID1&id=ID2&id=ID3... with body `{ "status": "SUCCESS" }` to mark them as "SUCCESS".
The calls are saved for up to 7 days, so the dev has enough time to fix their server issues and get back all the lost callback calls. This solved much of the client issues.
* An added benefit of this came for the devs who could not accept an inbound POST from us into their network due to firewall restrictions. The firewall restriction defeated the purpose of live callbacks, but with the `status` option, they simply checked for new (`FAILED`) notifications once every 2 hrs or so and marked the ones processed with `SUCCESS`. This way, they only look for `FAILED` calls and process when they have one. Else, nothing to do.
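The pull-based recovery flow against a `calls` sub-resource like the one described above could look roughly like this. Everything here is illustrative: the callback id `42`, the `get`/`put` adapters, and the exact payload shape are assumptions, not the provider's actual SDK.

```python
from urllib.parse import urlencode

def recover_failed_calls(get, put, process):
    """Fetch callback calls stuck in FAILED, process their payloads,
    then mark them SUCCESS in bulk via the sub-resource's PUT."""
    failed = get("/callbacks/42/calls?status=FAILED")   # callback id 42 is illustrative
    done = []
    for call in failed:
        process(call["payload"])        # replay the lost notification locally
        done.append(call["id"])
    if done:
        qs = urlencode([("id", i) for i in done])       # id=ID1&id=ID2&...
        put(f"/callbacks/42/calls?{qs}", {"status": "SUCCESS"})
    return done
```

Run on a schedule, this doubles as the firewall-friendly polling mode mentioned above.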
Previous devs were doing expensive things whenever we received webhooks. This meant we DoS'd ourselves every time a sizable amount of webhooks came our way.
We set up a tiny server on Heroku that received the webhooks and put them on a queue. A worker with a configurable concurrency level later forwards the events from the queue.
Dropped from four-digit weekly counts of 502s and 504s to virtually none.
The good APIs do, but it's still at a loss to both sides.
a) The producer of the events has to store them in semi-permanent storage. I've been there and done that - failed webhooks result in a table of tens of millions of rows, even if each event is only retained for 48 hours. It's astounding how many events fail to process. And I've been through extensive verification that there is truly no problem on our side - it's always the client who is wrong. Emails back and forth for weeks with the client screaming "it's your fault!" - only to finally receive an "oops, we found the problem on our end... sorry".
b) Frankly, if the consumer of the events fails on a single webhook more than 5 times in a 24 hour period, that event is a permanent loss. The reason it fails consistently is because that specific event is a permanent failure to process on the consumer's side. It is probably throwing a 500 Internal Server Error or similar - every single time. 0.001% of webhook consumers actually have emergency alerts when webhooks fail on their end, so the job will continue to throw a silent/unlogged/unnoticed/ignored error no matter how many times you retry. These are the same type of developers who will never poll your "failure queue", because they don't even understand that their consumer endpoint throws 500 Internal Server Errors on 10% of your requests. You're trying to provide a service to developers that live in a fantasy world where errors and exceptions never happen on their end.
It's a simple fact that developers who consume webhook requests are a disgrace. Chances are that if a request fails two times, it will never succeed. And yet the best APIs will try hundreds/thousands of times over a 24 hour period - simply to prove to that client that it is their fault that they are not processing webhooks properly. There is only so much a webhook producer can do. There is no magic we can do if the consumer is copy/pasting PHP snippets from Google or Stackoverflow.
Story time. The most memorable situation I can remember is a client who was experiencing 100% webhook consumer failure for more than three weeks. The emails from their team - and subsequent phone calls from their CTO - were absolutely stunning; it got to the point that we were hounding our own business people to drop them as a client, the verbal abuse was that bad. Turns out they had a bunch of PHP developers who were, for some reason, writing their consumer webhook endpoint in C for the first time. They were trying to parse the custom "id" field - a field they themselves had sent us as a JSON string - as an integer. They sent us a string, and their code choked trying to reinterpret it as an integer. It hurts to even think about that case.
tldr; Fuck webhook consumers. Incompetent developers who don't know how to handle errors that are 100% their fault.
Funny aside: the most amusing cases come from PHP and .NET developers who expose their internal server errors in production. When you can copy/paste the response they gave you on a webhook because they are calling an undefined function or method... pure bliss.
You could also help customers who apparently have trouble properly connecting to your APIs by giving better error returns (got type A, expected type B), providing client libraries or giving more extensive support (for a price).
Blaming the customer is easy, providing a way for even those "incompetent developers" to interface with you in a way that is easy to understand and debug for all parties is hard.
The truly great developers find a better way than only retrying webhooks and prepare a client library that the customer can just plug in to their code :-)
I like what Shopify does here - because your app is tied to a partner account, they can email you saying "this payload has failed 20 times in succession". If it fails too many times then the webhook is uninstalled.
Not to be snarky - but it's a distributed system. There's no way to guarantee you've got all updates! At a certain combination of latency and volume polling becomes impossible so webhooks (or something analogous) are all you've got :)
At a certain combination of latency and volume polling becomes impossible so webhooks (or something analogous) are all you've got :)
Isn't it the opposite? At a certain volume, when each polling request always returns results, polling becomes more efficient than "interrupts". It's only at low volumes that webhooks are more efficient, since polling would have to issue a lot of requests with no response if low latency is required.
Assuming here you mean something like a classic REST-alike "/events" endpoint which returns a bunch of stuff that's changed since the last time you requested it.
In that case, as the number of events grows, the HTTP transaction overhead goes to zero with polling, yeah.
But now you have a bunch of extra things which will impact your latency:
- The third-party service will do more work preparing the payload, meaning that the earliest event on the list no longer hits the wire right away
- related: someone might be holding a lock on event 63 of 100. Now other events have to wait for it before they can hit the wire
- In your application code, you may have to read the entire request before you can validate it or do anything with it (at least, this goes for APIs which speak JSON)
- You probably have to commit your transaction for the previous page of events before you can start your next request. Otherwise, whichever side of the network is keeping tabs on your current pointer in the list, that pointer may end up in the wrong place. Oops!
- If more events happen during the time it takes you to request a page than will fit on a page, then you're really stuck.
- An error anywhere in the super-http-transaction (network, user code...) now means that an entire page of updates has been delayed rather than just one.
It's possible to remove the sequential-ness constraint from our hypothetical "/events" but not without introducing other fun new problems.
Yeah, I feel the best way is just for providers to give an RSS feed as the primary way of listing events and then notify with PubSubHubbub directly. Big advantage: everything already exists and is standard.
I think the "securing webhooks" section is missing some critical tips that we've learned in production.
1) Resolve the DNS of the webhook URL, and compare all returned addresses from that resolution against an IP blacklist, which includes all RFC1918 addresses, EC2 instance metadata, and any other concerning addresses.
2) Even though it seems like you'd want to, do NOT blindly return an unexpected response to the person configuring the webhook. Report that there was an error, what the status code was, etc., but returning the response body means you basically just gave someone curl with a starting point on your network (see 1 as well).
3) Find ways to perform other validations of those webhooks. Are the URLs garbage? Are they against someone else's system? Create validation workflows that require initial pushes to the URL with a validation token to be entered back into your system, like validating an email address by clicking a link.
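Point 1 can be sketched as follows. This is a simplified example, assuming the provider-side check happens at configuration time; note that a production version also has to re-resolve at connect time (or pin the resolved address), since DNS answers can change between validation and delivery.

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_webhook_url(url: str) -> bool:
    """Resolve the hostname and reject any address that is private
    (RFC1918), loopback, link-local (including 169.254.169.254, the
    EC2 instance metadata endpoint), or otherwise non-global.
    Resolving matters because an attacker can point a public A record
    at 127.0.0.1."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False                     # unresolvable: treat as unsafe
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if not addr.is_global:           # covers private, loopback, link-local, reserved
            return False                 # ANY bad answer disqualifies the URL
    return True
```

Checking every returned address (not just the first) matters, since an attacker can mix safe and unsafe records under one name.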
We wrestled with #1 (and therefore #2) for a long time. Amazing how careful you have to be. EC2 instance metadata is a place where a lot of services have their pants down unknowingly.
Our eventual solution? AWS Lambda. We built a simple function that receives a payload with the HTTP request to be made and the Lambda function makes the request. It serves as a sandboxed micro-proxy for all of our untrusted external HTTP calls. We give that Lambda function permission to do nothing within the AWS account. We even went so far as to place the Lambda in a dedicated AWS account to further isolate it, which prevents an admin accidentally placing the Lambda within a sensitive VPC, for example.
We still examine endpoint URLs to ensure they don't belong to the internal network, but I sleep much better knowing that if something slips through, the Lambda function is isolated from our internal resources and there's not too much interesting to see / probe / find.
Point 1 can be a little more tricky than it seems. At first you'll think: I'll just use a regex to match known local addresses to protect against evil callback URLs like http://127.0.0.1/status.
You'll realize though you have to actually resolve hostnames, because users can just create an A record of foo.bar.com that points to 127.0.0.1.
Point 3 is spot on. I think it would be a good strategy to avoid expiry dates on subscriptions. Producers could decide whether or not to keep sending data by monitoring responses on the consumer's target URL.
Another thing worth mentioning is that some services batch notifications (e.g. Facebook Messenger) so that they can send more data in a single POST request.
Ehh. I disagree with both Default Permit and Enumerating Badness being universally wrong--I think they have their place. If I run a club, do I background check and whitelist every customer? Or do I just blacklist the troublemakers? The problems cited in the article were reasonable decisions at the time, but years later grew into headaches when the use-cases changed.
Does their no-Default-Permit policy apply to network egress? Do I have to approve each and every application that wants to connect to the Internet? I think leaving port 80 open because it was whitelisted is why so many things tunnel through port 80 instead of using other protocols and ports. Now how do you filter and whitelist traffic?
His example of antivirus products using Enumerating Badness is a market failing more than anything else. I'm not sure I see the alternative for a naive user. Call a specialist to investigate their use-cases and "open the system" to accommodate? Any time you want to update your tool or workflow or try something new have that specialist come out and reevaluate your system?
I understand what you're saying here. But the baseline sanity set is pretty fixed. Localhost, RFC1918, IPv6 link local, etc. I'm not advocating folks blacklist every bad actor on the internet - that obviously cannot work - but there's some simple things you can do to prevent a malicious user from configuring webhooks that attack your internal services.
There are cases where IP blacklists are pretty much the only option you have. For example, in the case of webhooks, what would you whitelist? You cannot whitelist anything that user provides without manual approval (which can be huge overhead).
Pretty much the only alternative I can think of is to query whois databases of RIRs, but you would need blacklisting there as well since they do include private IP spaces as well (ex. you would need to blacklist netname IETF-RESERVED-ADDRESS-BLOCK).
Similar problem exists with route advertisements from transit providers. They are not going to provide you a list of routes they advertise to you (since they don't get those from their customers usually), so your only option is to blacklist bogons yourself (unless you want to manually approve every single prefix out there as needed).
Yes, whitelisting makes much more sense. Github has an API that you can ask about which IPs are in their network - compare the webhook sender against that list and you're dandy. This should become a standard in webhook APIs.
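For the receiving side, such an allowlist check is straightforward. GitHub does publish its hook source ranges via its REST `/meta` endpoint; in this sketch the CIDR list is assumed to have been fetched and cached separately and is just passed in:

```python
import ipaddress

def sender_allowed(remote_ip: str, allowed_cidrs) -> bool:
    """Check an inbound webhook's source IP against a provider-published
    CIDR allowlist, rejecting requests from anywhere else."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

Note that the published ranges change over time, so the cached list needs periodic refreshing; signature verification is still worth doing on top of this.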
Yes, but that's what you build webhooks for: to let people consume your content. I think DX is critical today and big companies can afford to do the heavy lifting. As I mentioned in the Subscription Expiration paragraph, I totally get Microsoft's reason to put a 72 hr expiry date on subscriptions, but it adds some friction on the consumer side.
I routinely have to integrate with random 3rd party systems, some with no or broken webhooks, some with no API at all.
It turns out for my customers (this may not always be the case) eventual consistency is more important than timeliness.
What I do now every time I need to sync data from a third party is I always implement some sort of pull first with idempotent logic on my side. It's easier, and it allows me to just re-run things if something fails (e.g. network error, unexpected data in production, etc).
Only when that works reliably, and only if required by the customer, do I implement a webhook - and I usually throw away most of the message and just wake up my polling worker that is otherwise polling relatively slowly.
Long polling works brilliantly (where your API call blocks until there are some results or until timeout occurs - then you loop and call again).
Long polling gives you the best of both worlds - easy programming model with instant alerting rather than the delay of normal polling.
The only downside really is the need for a more or less permanently open connection per client. As long as the server does not use a naive "thread per connection" model this can scale up to many hundreds of thousands of clients or more.
The good thing about long polling is that if the connection breaks, the keep-alive will time out and you'll know you're not getting updates. Assuming there's some keep-alive feature.
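A generic long-poll client loop looks something like this. It's a sketch: `fetch(timeout)` is an assumed adapter that blocks server-side until events arrive or the timeout lapses, returning a (possibly empty) list, and raising on connection failure.

```python
import time

def long_poll(fetch, handle, timeout=30, backoff=5, max_loops=None):
    """Loop forever (or max_loops times, for testability): block on the
    server, process whatever arrived, immediately re-request. A dead
    connection surfaces as an exception or timeout, so silence is
    detectable - unlike a missed webhook."""
    loops = 0
    while max_loops is None or loops < max_loops:
        loops += 1
        try:
            events = fetch(timeout)
        except OSError:
            time.sleep(backoff)       # connection broke: back off, then reconnect
            continue
        for event in events:
            handle(event)             # an empty list just means "no news yet"
```

The empty-response case is the built-in keep-alive: if you stop getting even those, you know the link is down.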
That's what I did, too. Poll, fetch, remember, and retry errors, and if possible implement a sliding window as a poor man's cursor using dateFrom / dateTo if available.
Sort of disagree with the send-everything-in-the-payload approach. It opens your system up to all sorts of weird edge case bugs like receiving hooks out of order which could mean stale data is considered fresh. It also means you have to care a lot more about verifying the authenticity of the request.
Agree. It's better to use webhooks as a pure signal that something has changed, and then in the case of an update or insert, have the client pull whatever they want using the normal API.
Otherwise, you end up in a descending vortex of madness trying to specify some protocol whereby the client can specify in advance which properties they care about.
Webhook payloads need to be logically monotonic[0]; this probably means either:
- having a lamport-clock timestamp for each payload so you can entirely discard older payloads in favour of new ones
- a well defined / consistent "merge" function over the subset of the payload you care about (e.g. maybe you know a customer's state can never go back from "registered" to "guest")
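The first option (discard-if-older) is a few lines. This is an illustrative sketch; the field names `version` and `data` are assumptions, standing in for whatever counter or timestamp the producer attaches to each payload:

```python
def apply_payload(state: dict, payload: dict) -> dict:
    """Make webhook application order-independent: each payload carries
    a monotonically increasing version (a lamport-style counter), and
    stale deliveries are discarded instead of clobbering newer state."""
    current = state.get("version", -1)
    if payload["version"] <= current:
        return state                  # out-of-order or duplicate: ignore
    return {"version": payload["version"], "data": payload["data"]}
```

With this, out-of-order delivery and retried duplicates both become harmless no-ops.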
Sometimes you want to ignore webhooks based on the payload (or put them into a different queue). It's faster to do that if you get the payload up front.
Also, in your documentation, please show what the webhook events will look like since developers actually want to write code and not guess at what we will get.
The implication was meant to be that the information under `data/object` is simply a full representation of another API resource of the type on which the event occurred, and that you can look elsewhere in the documentation to see exactly what each type will look like (you can see a subscription embedded in the sample response for example).
Fair enough that we could rewrite this to be more explicit about that though! We'll see what we can do to make that section more clear.
You don't have to guess anything. Stripe lets you send test webhooks to the endpoint you specify[1]. You can set up something like ngrok[2] on your localhost to examine the headers and bodies, then write your code to parse them accordingly.
I also learned about services that will set up test webhooks without having to go about setting up a server, etc.
I think I might use that, but I still think that docs should at least explain what will get sent. Maybe that is a bit too verbose in Stripe's case though.
This is going to sound bizarre, but why do webhooks and not just an AMQP queue? I get that receiving HTTP POSTs is easier, but it just seems better to setup a publisher/subscriber relationship. That way, if a subscriber goes down, they can always catch up. And publishers can allow messages to sit in the queue with a TTL and max_size. It seems like a win-win for everyone.
It's not AMQP (sadly) but something I've done previously is to have the actual webhook endpoint be as dumb as possible, doing nothing but accepting the payload (maybe with some very high level validation that the request was expected) and pushing it into a real queueing system.
This means you can handle all sorts of failure modes, not just the backend going down, but also bugs in the consumer that would otherwise result in losing the request. I've not tried it, but I imagine this is a pretty good use case for AWS Lambda, as it's a small bit of glue code.
I've used the same basic concept for accepting payment-received notifications (from a http redirect based payment gateway):
Read the transaction ID from the request body, and store it (with a date/time) in a table. A periodic process later checks them, and uses the payment service's API to validate the payment is valid and take appropriate action.
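The periodic verification step of that store-then-verify pattern might look like this. All four callables are hypothetical adapters around your table and the payment provider's API; the status strings are illustrative.

```python
def verify_pending(pending_ids, lookup, mark_paid, mark_invalid):
    """For each transaction id captured from the redirect/webhook, ask
    the payment provider's API for the authoritative status rather than
    trusting the inbound request itself."""
    for txn_id in list(pending_ids):
        status = lookup(txn_id)           # authoritative answer from the API
        if status == "paid":
            mark_paid(txn_id)
        elif status == "invalid":
            mark_invalid(txn_id)
        # anything else (e.g. still "pending") is retried on the next run
```

The key property: the inbound notification only ever records an id; money-moving decisions are made solely from the provider's API response.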
One difference is in "dead connection detection". How do you know that your AMQP connection is down? At some level you're polling, whether that be TCP keepalive, application keepalive or something else.
If you're doing polling, you're actually back at the same pre-webhook place - polling their server on some timescale which is a compromise between latency and load.
Yes, a TCP keepalive is generally cheaper than an HTTP long poll request, but only by a constant factor.
It's a whole lot simpler to do secure cross-organization HTTP requests than it is to figure out how to have multiple AMQP subscribers from untrusted companies.
Though this is an approach worth considering for a bunch of local services, at least a subset of them, I don't see it working with 3rd party APIs. Consider something like Stripe — it is probably far more straightforward to invoke some HTTP endpoints, than to set up a huge infrastructure with millions persistent client connections.
Agreed. I believe that AMQP connections are best suited for wide and articulated local environments. Moreover REST HTTP endpoints are today's esperanto :)
Because anyone can throw another route onto port 443 if you already host a website. A non-HTTP protocol running on a dedicated port, despite often being a superior solution, requires extra effort to set up, if the hosting environment even provides that option.
Webhooks only make sense if you don't care a single bit about missing updates. If you do care, the model is deeply flawed.
A pull model (polling, long-polling, SSE, etc) is strictly superior for synchronisation. You just can't "miss" updates, can restart from the beginning again and reinterpret past events in a different light, the client goes at its own pace, etc.
BitBucket is a very good example of webhook integrations done right. Reliable, logged, and well documented. I learned from their UI when I implemented my own version.
True. Also having a sample request at the very beginning is useful so you don't have to find a way to trigger the event by clicking around on the product UI.
I'd like to follow up on the statement that the OpenAPI tools do not support webhooks. This is slated to change in an upcoming version of the OpenAPI specification. Check out https://github.com/OAI/OpenAPI-Specification/pull/763 for details. As soon as this is released, it will only be a matter of time before Swagger and the rest support webhooks.
To be fully client-based (serverless) you need a middle-man for webhooks. Websockets are a better alternative for stand-alone web clients. There are also "push notifications" via service workers, but they are vendor dependent.
For the receiver: Everything that can run a dynamic website can run a webhook receiver, opening an arbitrary socket connection isn't possible in all environments (e.g. software running on shared hosting or PaaS). You'd also need to define and implement a protocol on top of said socket, whereas more or less every web developer knows what to do with HTTP POST with a JSON payload.
And for the sender, keeping many concurrent connections open can be quite a challenge. Sending Webhooks also takes resources, but at least you can easily distribute it over many machines/processes if necessary.
Slack offers both. Their realtime API is especially nice because you connect to it rather than needing to deploy public facing web services to receive events from them.
On the other hand, if you are not connected, messages could be lost unless you build in syncing capability, whereas with web hooks, Slack will handle the retries for you.
Not quite. "Push technology" is sort of an all-encompassing term for server-to-client updates whereas webhooks pertain specifically to HTTP callbacks. An example of a webhook would be GitHub making a POST request to some URL (set by the user) whenever new commits are made to a repo. Push technology might take the form of webhooks, long polling, WebSockets etc.
Webhooks are traditional HTTP requests so I don't believe HTTP/2 changes anything. The ability to differentiate notifications depends on the service / API you're integrating with.
In the past when I used to support webhooks what I did was very simple:
* Receive the HTTP POST submission to my hook end-point.
* Save this data in a queue.
* Return to the hook-caller "200 OK - $ID".
This was better than trying to initiate a long-running job as a result of the hook, and meant that I could trigger "fake webhooks" just by adding data to the queue manually.
I'm sure there are other approaches, but this is a flexible one that also gave the benefit of being simple. (For the queue I just used Redis.)
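The worker side of that queue can be sketched like this. A `deque` stands in for the Redis queue mentioned above; the dead-letter queue and retry count are added assumptions, not part of the original setup:

```python
import json
from collections import deque

def worker_drain(queue, process, dead_letter, max_attempts=3):
    """Drain the webhook queue, retrying each item a few times; a
    poison event goes to a dead-letter queue for manual inspection
    instead of blocking everything behind it."""
    while queue:
        raw = queue.popleft()
        event = json.loads(raw)
        for attempt in range(max_attempts):
            try:
                process(event)
                break                         # handled: move to the next item
            except Exception:
                if attempt == max_attempts - 1:
                    dead_letter.append(raw)   # park it; keep draining the rest
```

Injecting a "fake webhook" is then just a `queue.append(...)`, exactly as described above.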