Description
Due to a rather significant spike in API hits, our analytics database has suddenly grown by quite a bit for this month. This is negatively impacting the performance of the analytics queries and graphs in the admin tool. I've increased our hardware sizes to get things generally working, but things can be quite poky the first time an admin performs queries until the caches spin up.
So one issue is that we perhaps just need to look into tuning our ElasticSearch queries or the ElasticSearch database itself. In particular, the "Filter Logs" view could possibly be optimized, since it's really performing several different requests in parallel (one date histogram for the chart over time, more aggregations for the "top X" IPs/users, and another for the most recent raw logs for the table). So that's probably worth a look, but I'm not sure there's much we can do differently with those queries if we want all that information on one page (a rough sketch of what those pieces look like is below).
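As a reference point, here's a minimal sketch of the three pieces the "Filter Logs" view needs, expressed as a single Elasticsearch request via the Python client. The index pattern and field names (`api-umbrella-logs-*`, `request_at`, `request_ip`, `user_id`) are assumptions for illustration, not necessarily our actual schema:

```python
# Sketch of the three things the "Filter Logs" view asks for, folded into one
# request body: a date histogram for the chart, "top X" terms aggregations for
# IPs/users, and the most recent raw log entries for the table.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

body = {
    "query": {"range": {"request_at": {"gte": "now-7d"}}},
    "size": 50,                                    # most recent raw logs for the table
    "sort": [{"request_at": {"order": "desc"}}],
    "aggs": {
        "hits_over_time": {                        # date histogram for the chart
            "date_histogram": {"field": "request_at", "interval": "hour"}
        },
        "top_ips": {                               # "top X" IPs
            "terms": {"field": "request_ip", "size": 10}
        },
        "top_users": {                             # "top X" users
            "terms": {"field": "user_id", "size": 10}
        },
    },
}

response = es.search(index="api-umbrella-logs-*", body=body)
```

Even combined into one request like this, every aggregation still has to scan the full matching date range, which is probably why the first uncached query from an admin is so poky.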
The more general issue is that I think we need to take a longer look at how we handle analytics data, since with our current approach this problem will only get worse with more usage, and I'd like to have more of a plan other than throwing more hardware at it when things break (or if more hardware is the plan, that's fine, but I'd at least like to have that planned a bit better in terms of when we need it, costs, etc). So a few random notes on that front:
- We currently log every request into ElasticSearch and then perform aggregations on the fly depending on the queries. We're pretty much doing what the ELK (ElasticSearch/Logstash/Kibana) stack does, but slightly customized for our API use-case. We also store logs indefinitely (so for many years, rather than the more common ELK use-case where it's just a couple weeks or a month of raw log files before you prune). While we've had a spike in volume, it still doesn't seem like anything too unreasonable, so it would be interesting to know what hardware other people typically use for ELK installations of our size to see whether we're doing anything silly.
- We could dramatically speed things up by pre-aggregating and binning our analytics, but my struggle with that approach has always been that we don't always know how we want to slice and dice the requests in the future, so storing all the raw logs has come in quite useful. And even in some cases where we have a good sense of how to pre-aggregate the requests (like total hits per hour/day), that's complicated by the fact that agency admins should only be able to see hits for their APIs. Right now, it's simpler to query all that on the fly, since it makes the permissions easy and fluid, but if needed, we could probably devise a system to pre-aggregate common things (see the first sketch after these notes).
- Interestingly, this latest spike is mainly caused by one user making a ton of over-rate-limit requests. On the one hand, our rate limits are being effective, which is good, but on the other hand, it means we're logging a ton of over-limit error messages. This is an interesting problem because I definitely found logging all these over-rate-limit errors helpful at first (since it helped pinpoint where all this additional traffic was coming from), but at this point logging every one of those continued errors seems like a lot of noise (one idea for sampling them is sketched after these notes).
- As an alternative, I've been intrigued by InfluxDB and whether it might be better suited to the task than ElasticSearch (since this use-case seems like exactly what they're targeting), but it's still quite new (a rough sketch of what our data might look like in it is after these notes). I'd also be interested in how Postgres or Cassandra performs in comparison with similar volume (there's also some interesting stuff happening in Postgres with columnar store plugins that might be better suited for this, but most of those seem like commercial options at this point).
- As another alternative, there's always the option of external providers for analytics, like Google Analytics or Keen.io. If someone else wants to figure out these more difficult problems, I'm all for that; we'd just need to determine whether it would meet our requirements, the cost implications, and whether we're okay with offloading a portion of our stack like that.
- I thought we had a GitHub issue discussing Google Analytics already floating out there, but I couldn't find it. In any case, now that Universal Analytics is out, I think storing API metrics there is more feasible, but there are still things to consider (rate limits on their end, pricing, and the fact that we couldn't store api key/user information). A sketch of what sending a hit might look like is at the end of these notes.
- Something like Keen.io might be a bit more flexible, but that would require more research, budget, etc. We'd be above their "Enterprise" pricing plan this month and into "Custom" plan territory, but I did notice a little blurb on their pricing page about helping out with open source and open data, so that might be an interesting conversation to have with them if we wanted to pursue this.
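On the pre-aggregation point above, here's a minimal sketch of the kind of hourly roll-up I have in mind, assuming hypothetical log fields (`request_at`, `api_backend_id`). Keeping the roll-up keyed by the API backend is what would let us keep restricting agency admins to hits for their own APIs at query time:

```python
# Minimal sketch of an hourly roll-up keyed by (hour, api_backend_id). The
# field names and backend IDs here are illustrative assumptions, not an
# existing schema.
from collections import Counter
from datetime import datetime

def rollup_hourly(raw_logs):
    """Count hits per (hour, api_backend_id) bucket from raw log dicts."""
    buckets = Counter()
    for log in raw_logs:
        hour = log["request_at"].replace(minute=0, second=0, microsecond=0)
        buckets[(hour, log["api_backend_id"])] += 1
    return buckets

raw_logs = [
    {"request_at": datetime(2014, 10, 6, 14, 23), "api_backend_id": "weather-api"},
    {"request_at": datetime(2014, 10, 6, 14, 41), "api_backend_id": "weather-api"},
    {"request_at": datetime(2014, 10, 6, 15, 2), "api_backend_id": "geocoder-api"},
]

for (hour, backend), hits in sorted(rollup_hourly(raw_logs).items()):
    print(hour.isoformat(), backend, hits)
```

The trade-off is still the one mentioned above: anything not captured in the roll-up keys can't be sliced retroactively without the raw logs.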
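On the over-rate-limit noise, one possible approach (purely a sketch, with made-up thresholds) would be to log the first handful of over-limit errors per API key in full and then only a small sample of the rest, so we could still pinpoint where the traffic is coming from without storing every single over-limit error:

```python
# Log the first N over-limit errors per API key in full, then only a random
# sample of the rest. Threshold and sample rate are arbitrary placeholders.
import random
from collections import defaultdict

FULL_LOG_THRESHOLD = 10   # always log the first N over-limit hits per key
SAMPLE_RATE = 0.01        # afterwards, keep roughly 1% of them

over_limit_counts = defaultdict(int)

def should_log_over_limit(api_key):
    over_limit_counts[api_key] += 1
    if over_limit_counts[api_key] <= FULL_LOG_THRESHOLD:
        return True
    return random.random() < SAMPLE_RATE

# Example: 10 full entries plus roughly 1% of the remaining 9,990 get logged.
logged = sum(should_log_over_limit("noisy-key") for _ in range(10000))
print(logged)
```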
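For comparison purposes on the InfluxDB idea, this is roughly what one request could look like written as an InfluxDB point via their Python client. The measurement, tag, and field names are made up for illustration:

```python
# One request log entry expressed as an InfluxDB point: tags for the
# dimensions we'd group/filter by, fields for the measured values.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="api_analytics")

point = {
    "measurement": "requests",
    "tags": {
        "api_backend_id": "weather-api",
        "response_status": "429",
    },
    "time": "2014-10-06T14:23:00Z",
    "fields": {
        "response_time": 87,   # milliseconds
        "hits": 1,
    },
}

client.write_points([point])
```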
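And for the Universal Analytics option, hits would presumably go through the Measurement Protocol, something along these lines (placeholder property ID, and note there's deliberately no API key/user information in the payload, per the limitation mentioned above):

```python
# Sketch of reporting one API hit to Universal Analytics via the Measurement
# Protocol. The property ID is a placeholder and the event fields are
# illustrative choices, not a settled scheme.
import uuid
import requests

payload = {
    "v": 1,                      # Measurement Protocol version
    "tid": "UA-XXXXXXXX-1",      # placeholder property ID
    "cid": str(uuid.uuid4()),    # anonymous client ID, not an API key
    "t": "event",
    "ec": "api",                 # event category
    "ea": "request",             # event action
    "el": "/weather/forecast",   # event label: the endpoint hit
}

requests.post("https://www.google-analytics.com/collect", data=payload)
```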