
Hi Paul!

I have a question about the new tagging feature in InfluxDB 0.9 - hopefully you can clear up my confusion.

My understanding is that tags are great, as they are indexed and hence quick to search.

However, they are not suitable for situations where you have very high cardinality (> 100,000). Assuming this isn't the case, where else would you use fields over tags?

For example - apart from the indexing and cardinality issue - what are the pros/cons of

   "tags": {
       "host": "example.com",
       "direction": "bytesin"
   },
   "fields": {
       "value": 50
   }
and

   "tags": {
       "host": "example.com",
       "direction": "bytesout"
   },
   "fields": {
       "value": 60
   }
versus

   "tags": {
       "host": "example.com",
       "direction": "bytesin"
   },
   "fields": {
       "bytesin": 50,
       "bytesout": 60
   }
Or say you have a logline, and you're parsing various attributes out of it (meaning all the values are quite tightly associated with each other) - would you split them into separate series, each with their own set of tags, or would you store them in a single series with multiple fields?


In your example I'd probably have the direction in the measurement name. e.g.

{ "name": "network_bytesin", "tags": {"host": "example.com"}, "fields": {"value": 50} }
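
To make the suggestion concrete, here is a minimal sketch of splitting a pair of counters into one point per measurement, with the direction moved into the measurement name. The helper name `split_by_direction` and the `network_` prefix handling are illustrative assumptions, not part of any InfluxDB client.

```python
def split_by_direction(host, counters):
    """Build one point per counter, e.g. bytesin -> network_bytesin.

    Hypothetical helper: the point layout mirrors the JSON shown in this
    thread (name / tags / fields), nothing more.
    """
    return [
        {
            "name": f"network_{direction}",   # direction lives in the measurement name
            "tags": {"host": host},           # only the host remains as a tag
            "fields": {"value": value},       # a single field per point
        }
        for direction, value in counters.items()
    ]


points = split_by_direction("example.com", {"bytesin": 50, "bytesout": 60})
for point in points:
    print(point)
```

Each resulting point can then be queried on its own, which fits the case where the two counters are not always read together.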

For fields, you generally only want to put two pieces of data together in a single point (thus in two fields) if you're always going to be querying them together, either by pulling the values out or by filtering on them in a WHERE clause.

Unrelated to this, we've created a line protocol to write data in and it's a much more compact way to show a point:

network_bytesin,host=example.com value=50

Details here: https://github.com/influxdb/influxdb/pull/2696
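
As a rough illustration of how compact the line format is, the sketch below renders a JSON-style point as a single line. It is a simplified assumption based only on the example in this thread: the function name `to_line` is hypothetical, it handles numeric fields only, and it does no escaping, so it assumes names and tag values contain no commas, spaces, or equals signs. The pull request linked above is the authoritative description.

```python
def to_line(point):
    """Render {"name", "tags", "fields"} as 'name,tag=val field=val'.

    Simplified sketch: numeric fields only, no escaping, no timestamp.
    """
    # Tags follow the measurement name, each introduced by a comma.
    tags = "".join(
        f",{key}={value}" for key, value in sorted(point.get("tags", {}).items())
    )
    # Fields come after a single space, separated from each other by commas.
    fields = ",".join(f"{key}={value}" for key, value in point["fields"].items())
    return f"{point['name']}{tags} {fields}"


point = {
    "name": "network_bytesin",
    "tags": {"host": "example.com"},
    "fields": {"value": 50},
}
line = to_line(point)
print(line)  # network_bytesin,host=example.com value=50
```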


Aha, fair enough - so if you only sometimes query the values together, you should separate them out into discrete series =).

For a logline, the sort of metrics we'd get would be things like query duration, database, client ID etc. You'd often query them together, but you'd also query them separately, so I guess that also makes sense as multiple separate series.

Interesting, that line protocol looks cool and seems like it'd be more efficient on the wire.

Will the drivers (e.g. Python, Go) be updated to take advantage of this new endpoint?

Finally, I see there's also a ticket open for binary protocols:

https://github.com/influxdb/influxdb/issues/139

Do you see that as being still potentially useful?


The Go client has already been updated. We'll be updating the Ruby and front end JS client. Hopefully the community will jump on updating the other clients once we release 0.9.0. It's a super simple protocol so I don't imagine it'll take much work.

The binary protocol is probably much less useful now. HTTP + GZip of the line protocol will already saturate what our storage engine can do at this point. In fact, I'm going to close that out right now...
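
For a sense of why gzip over a text protocol goes a long way, here is a small sketch that compresses a batch of line-protocol points with Python's standard `gzip` module. The batch contents are made up for illustration, and the `Content-Encoding: gzip` header is an assumption about how a client would signal compression over HTTP, not something confirmed in this thread.

```python
import gzip

# A made-up batch of 1000 points; real data would vary more and compress less.
lines = [f"network_bytesin,host=example.com value={i}" for i in range(1000)]
payload = "\n".join(lines).encode("utf-8")

# Compress the whole batch in one go.
compressed = gzip.compress(payload)

# Assumed header a client might send alongside the compressed body.
headers = {"Content-Encoding": "gzip"}

print(len(payload), "bytes raw")
print(len(compressed), "bytes gzipped")
```

Because the measurement name and tags repeat on every line, the text compresses well, which narrows the gap a binary encoding would otherwise close.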



