Tell Me More, Nginx (2017) (honeycomb.io)
218 points by kiyanwang on May 19, 2019 | hide | past | favorite | 22 comments


The company that I recently joined uses honeycomb.io, versus ELK at my previous job. Maybe I haven't played with it enough, or maybe we have it badly configured, but I find it a huge step backwards. As far as I know, there's no full-text search, and the types of reports and aggregation you can do are extremely primitive. The UI is also barely usable: huge horizontal scrolling, no way to click on a specific value to apply it as a sub-filter. The list goes on and on. I wish devops teams would focus on these fundamentals (proper centralized logging) before playing with all the "cool" and trendy toys.

As for logging, our `log_format` directive looks like:

  log_format stats '$time_iso8601\t$host\t$request_method\t$request_uri\t$request_length\t$status\t$upstream_cache_status\t$bytes_sent\t$request_time\t$upstream_response_time\t$upstream_http_x_route\t$client_id';
This captures most of the generic data you'd associate with a request (time, host, method, URI, length) as well as the generic response data you'd care about (status, length, time to reply). Note that `$request_time` measures the full time nginx spends serving the request, from the first bytes read from the client to the last bytes sent, vs `$upstream_response_time`, which covers only the upstream's response time.

If you're using nginx as a cache, $upstream_cache_status tells you the cache hit/miss status.

A list of all variables is available at http://nginx.org/en/docs/varindex.html

All of our services can set an "x-route" header, which helps canonicalize URIs into something more meaningful. It's up to the service to decide what to do, but /v1/users/ID could be called "users:show". A more complex route might use a different name based on arguments. For example, depending on the parameters, we expect some hits to /v1/reports to be very fast and some to be very slow, so we'll set the name to "reports:list:fast" or "reports:list:slow" to get more accurate statistics.
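As an illustration of what those route names buy you: with the tab-separated `log_format` above, field 9 is `$request_time` and field 11 is the x-route, so mean request time per route falls out of a one-liner. (The sample log lines here are made up.)

```shell
# Two hypothetical hits to users:show and one slow report, in the
# 12-field tab-separated format above (only fields 9 and 11 matter here).
printf '2019-05-19T10:00:00+00:00\texample.com\tGET\t/v1/users/42\t512\t200\tMISS\t1024\t0.120\t0.110\tusers:show\tclient-a\n2019-05-19T10:00:01+00:00\texample.com\tGET\t/v1/users/7\t512\t200\tHIT\t1024\t0.080\t0.070\tusers:show\tclient-a\n2019-05-19T10:00:02+00:00\texample.com\tGET\t/v1/reports\t512\t200\tMISS\t4096\t1.500\t1.480\treports:list:slow\tclient-b\n' > access.log

# Mean $request_time per x-route
awk -F'\t' '{ n[$11]++; t[$11] += $9 }
            END { for (r in n) printf "%s %.3f\n", r, t[r]/n[r] }' access.log | sort
# prints:
#   reports:list:slow 1.500
#   users:show 0.100
```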

Finally, we use OpenResty to do some initial global authentication (along with some other stuff). This is where $client_id comes from. All requests go through something like:

  location / {
      set $client_id '';
      set $upstream '';
      access_by_lua_block { require('execute')() }
      proxy_pass http://$upstream;
  }

A piece of that execute code may set $client_id, which then gets logged.


> As far as I know, there's no full text search and the types of reports and aggregation you can do are extremely primitive.

My understanding is that it's helpful to think of honeycomb.io as a replacement for Scuba -- which, to my understanding, is Facebook's tool for doing fast math and aggregations on metric data even if those metrics come from things that look like logs.

There are some "visualize" things you can do in the Kibana UI which really stretch ELK's capabilities (graphing p99 latency per handler for every one-second bucket of a sample minute in a high-throughput service that writes all its hits to log files, for example). Where Kibana and its backends start to stutter, or when certain kinds of graphs are just impossible, you might begin to be well-served by Scuba-style tools.

Not having worked at Facebook, I have only heard about this secondhand. After a certain scale -- when you have many, many requests per second and handling a request involves a forest of other services doing work in series or parallel -- you start to feel like you can no longer rely on raw logs and start to lean on tools that do things like distributed tracing or aggregations; if your tools work well, you may eventually even feel a disdain for those raw logs and feel like you're _better_ off using the more advanced tools.

I feel a bit left behind sometimes because I live pragmatically in both worlds: I'm excited to get results rapidly from newer visibility stacks, but honestly I'm also still happy to log into a system and read its logs -- just in case we ever need to do that.


> I find it a huge step backwards. As far as I know, there's no full text search and the types of reports and aggregation you can do are extremely primitive.

Oof. Honeycomb is built for fast, realtime analytics: starting with a high-level question in your mind ("why did our throughput drop by 50%?") and rapidly iterating on a hypothesis (examples in [0]). ELK can... be used for that, but it's optimized for another use case (as you said, full-text search and generating static reports).

Being able to flip from a funny-looking graph directly into "raw data" mode is intended to be a bonus in Honeycomb, not the primary way you interact with your data.

While we believe that fulltext search has its place, beyond a certain point (most production systems, these days), sifting through log lines is a brute-force method of answering questions about your systems — especially if you're not sure what the proverbial needle you're searching for looks like. [1]

(But mherdeg's answer is great, go back and read theirs while you're here :))

0: https://www.honeycomb.io/blog/troubleshooting-in-honeycomb-c...

1: https://www.honeycomb.io/blog/the-true-cost-of-search-first-...


Huh. I don't think of Honeycomb and ELK as a "versus" relationship. Don't they solve different problems?


ELK is for a very different problem domain.


Tacking a custom header onto the response is a nice trick!

Could be particularly useful for GraphQL apps where every response is a big opaque answer from `/api/graphql`: Add the GraphQL operation name to the response header.

This will also help in Chrome DevTools, where you can show custom headers as columns in the Network tab; otherwise you just see tons of POST reqs to `/api/graphql`: https://umaar.com/dev-tips/115-manage-network-response-heade...


As the last piece of most architectures to see every write request (caches will see every request, including reads), web servers are the logical place to capture system-wide data. (Caches, if configurable, may be a slightly better option.) I haven't done a lot of nginx config, but I appreciated the pointers in this post. On the other hand, it was a bit short, with no pointer to the next installment.

By the way, if you haven't read the blog of Honeycomb's founder, it's full of interesting, provocative opinions:

https://charity.wtf/


Ok, but if privacy-sensitive data enters your logs, you'd better have a way to clean it out.


FWIW, past a certain scale, it makes a lot more sense to export dedicated telemetry out of your app into a TSDB.

The reason is that it's thousands of times cheaper to +1 a counter or record a gauge and flush it every once in a while than to serialize per-request metrics into HTTP headers, log them, and re-parse them for analytics later.


Totally, if all you care about is just that counter/gauge -- but more and more often, you need that counter/gauge captured for a particular segment of your traffic (e.g. some app-level identifier), and TSDBs tend to struggle[0] as the number of possible segments explodes.

If all you care about is overall latency, awesome! Use a TSDB. Once you care about latency per endpoint/user agent/customer ID/client platform (or combination thereof), you need the flexibility associated with structured log data, stored in something meant for fast analytical querying.

0: https://www.honeycomb.io/blog/the-problem-with-pre-aggregate...


You can do both!

Use a basic sampling approach (recommended by Honeycomb and others) to sample e.g. one percent or less of your “boring” transactions, 10% of your suspect transactions, and all of your errors. By suspect I mean too slow by some cutoff, generated more work than expected, etc. Those numbers are arbitrary and could change depending on the environment. For CDN logs of static assets, 0.1% or less might be sufficient because there is hopefully not much going on there, but you’d like to know if there were.

You then (importantly) store the sampling rate along with the event itself to reconstruct a usable comparison between the different types of events. For example, one event for each of the previous sampling rates I gave would be three total events stored but represent 111 transactions.
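A sketch of that reconstruction, using the numbers above: if each stored event carries its sample rate (a rate of N means "this event stands for N transactions"), then summing the rates recovers the estimated total.

```shell
# Three stored events with sample rates 100 (1% of boring traffic),
# 10 (10% of suspect traffic), and 1 (all errors). Each event
# represents sample_rate transactions, so the sum is the estimate.
printf '100\n10\n1\n' |
awk '{ total += $1 } END { print total }'   # prints 111
```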


Nginx is an amazing workhorse and can help out with logging/monitoring/metrics.

Some tools I like to use to improve the experience:

* Fluentular [1] - A tool for matching Fluentd log parsing to your customized nginx log format

* Nginx-lua-prometheus [2] - Library for exposing request data to prometheus

[1] https://fluentular.herokuapp.com/

[2] https://github.com/knyar/nginx-lua-prometheus


Relatedly, Apache has %D to put request times in logs, and mod_log_config if you want to log header values. Logging header values has some security/privacy implications, though.


I'm surprised more people don't use the logs in web servers. Out of the box they are already on and running with little overhead. They usually tell me everything I want to know about usage.

But I don't do things like custom headers:

> a header value generalizing the shape of the endpoint path
> (e.g. /products/foobarbaz → /products/:prod)

Instead I analyze the logs with Bash (grep, cut, awk, sort, uniq). In fact I think Apache's logs are one of the main places where I learned these tools.
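For example (a made-up log in Apache's common format, where the request path is field 7 when you split a line on spaces), counting the most-hit paths is the classic pipeline:

```shell
# Hypothetical access-log lines; $7 after splitting on spaces is the path.
printf '%s\n' \
  '127.0.0.1 - - [19/May/2019:10:00:00 +0000] "GET /a HTTP/1.1" 200 512' \
  '127.0.0.1 - - [19/May/2019:10:00:01 +0000] "GET /a HTTP/1.1" 200 512' \
  '127.0.0.1 - - [19/May/2019:10:00:02 +0000] "GET /b HTTP/1.1" 404 512' |
awk '{ print $7 }' | sort | uniq -c | sort -rn
```

This prints each path with its hit count, most-requested first (/a twice, /b once here).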


Something I worked on but never released was a series of posts/small book that was about actually using Apache (one could write the same for Nginx).

There are (metaphorical) $20 bills lying all over the ground in small, medium, and even large applications' server settings.

Or at least that's my experience: usually someone drops in a conf file from a tutorial somewhere and they're off to the races. Maybe a few things get tweaked if there's a problem. Fortunately a lot of distros have somewhat better defaults now, but there are still often easy wins to be found in performance, logging, or security.


Sorry for the nitpick, but it’s a pet peeve of mine that people think the entire suite of Unix command line tools is part of “bash”.


Thank you, I will try to avoid it. In fact I only recently started calling them all "bash", seeking a short name instead of "Unix command-line tools." But you're right, it's inaccurate.


The tools you listed can be referred to as coreutils: https://en.wikipedia.org/wiki/GNU_Core_Utilities


Envoy proxy knocks this out of the park. One can even emit detailed log data as protos over gRPC.


This is useful to know. I wish I'd known it five years sooner.


I am curious to know, why?


It would have been useful in some monitoring efforts I was part of back then.



