
The mystery of the 502 Kibana errors

We upgraded Kibana from version 6.8 to 7.9, and started having issues with access via our Apache proxy. Our proxy terminates SSL sessions and uses x509 client certificates for authentication, authorisation and accounting, so it’s a right pain.

Before the upgrade, background traffic was about 720 requests an hour: one permanent monitoring machine with a page watching Comrex state. Ye olde zgrep " 200 " ssl.pensieve-kibaccess.log.3.gz|sed -e 's/[^:]://' -e 's/:.. .*//'|cut -b -3|uniq -c shows plenty of HTTP 200s in the log. grep for " 50. " and there are no errors at all. That was before the upgrade.

720 200s an hour, that’s one every five seconds. We don’t have many users.

After the upgrade though, we started seeing issues: it would load, sometimes, but then we’d get errors. These could be seen in the Apache proxy access and error logs, but not in the Kibana log.

The numbers were high: while there were still 1200-1800 200 codes an hour, 50x errors were running at 400-900 an hour.

So why was it breaking? Looking at the traffic in Wireshark pointed a big red stick at the problem: the Kibana server was resetting TCP connections.

502 actually, but this is cute

Kibana uses NodeJS to host itself, and NodeJS is resetting the connection. It really shouldn’t do that; there’s a reason TCP RST shows up red in Wireshark by default.

I could see these RSTs in real time on the Kibana server with sudo tcpdump -i any port 5601 | grep Flags..R

Some DDGing eventually turned up the following blog, which sounded very familiar: https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/

While running an Express server (using NodeJS 8+) behind an ALB under moderate load (at least ~15 requests per minute, although more apparent under higher loads), you can start to observe occasional 500-range errors on the ALB.

Yup: I was sending about 20 requests a minute, I’m using a proxy which isn’t getting anything back from the server other than resets, and it’s intermittent (about 25% of the time). Adam had done similar debugging and seen the same TCP resets.

The reason this can occur is because the NodeJS http server is destroying its TCP socket instead of gracefully closing it (waiting for an acknowledgement of the connection closure while still accepting data from the socket).
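The two timeouts involved here live on Node’s http.Server. As a quick sanity check, and purely as a sketch rather than anything Kibana-specific, you can dump the defaults on any box with Node installed:

const http = require('http');

// Throwaway server purely to inspect the defaults; the values vary by Node version.
const server = http.createServer((req, res) => res.end('ok'));

console.log('keepAliveTimeout:', server.keepAliveTimeout); // 5000 ms by default
console.log('headersTimeout:', server.headersTimeout);     // 40 or 60 seconds depending on the Node version

As I read the regression Adam links, headersTimeout isn’t reset properly between requests on a keep-alive connection, so once the connection has been open longer than headersTimeout the next request gets the socket destroyed rather than a graceful close.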

So to test this hypothesis, disabling keepalives from Apache seemed a reasonable step. The following line in the VirtualHost does the job: a quick apache2 reload later, put some dots in the tcpdump window, and see if the errors stop. And they did, instantly.

SetEnv proxy-nokeepalive 1

So there we have it: moving to a recent Kibana/NodeJS broke keepalives, disabling them means our users are happy they can use Kibana to see what our Comrexes are up to, and we can move on to other more important things, like an OBE with massive jitter in its RTP emissions.

All fixed

OK, that’s working and everyone’s happy, but a new TCP connection for each request feels a bit 1990s-ish. Now, the 90s were clearly the best decade in history (Spice Girls, Independence Day, and the Golden Age of Star Trek), but as our proxy has a connection-tracking firewall between it and the Kibana server, it does feel like a needless waste of resources. I don’t want to break the service too much though, so I create a new virtual host on the proxy to experiment with, without making everyone else grumble, and check that the error still occurs.

Oh yes, Kibana really doesn’t like the Bee Gees.

In Kibana, installed via Ubuntu package management, grep -r keepAliveTimeout /usr/share/kibana turns up the most likely location. /usr/share/kibana/src/core/server/http/http_tools.js configures server.listener.keepAliveTimeout, setting it (via some layers) to the value of server.keepaliveTimeout in kibana.yml. That’s not set in my install, but it looks like it should default to two minutes (120000 ms).

Since the likely cause is the NodeJS headersTimeout regression mentioned in Adam Crowder’s blog, adding one line there to bump headersTimeout above keepAliveTimeout might solve everything.
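I won’t pretend this is the literal patch, but the shape of the workaround, sketched here against a plain NodeJS http server with Kibana’s two-minute default assumed, is simply to make headersTimeout larger than keepAliveTimeout:

const http = require('http');

// Standalone demo server, nothing to do with Kibana's own startup code
const server = http.createServer((req, res) => res.end('ok'));

// Roughly what kibana.yml's server.keepaliveTimeout becomes on server.listener
server.keepAliveTimeout = 120000;

// The workaround: headersTimeout must exceed keepAliveTimeout, otherwise the
// regression destroys long-lived keep-alive connections instead of closing them cleanly
server.headersTimeout = 120000 + 1000;

server.listen(5601);

In http_tools.js the equivalent would be one extra line setting server.listener.headersTimeout just after keepAliveTimeout is configured.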

I can do more hotfixing in prod on my laptop, sitting in my pajamas, before my first cup of Earl Grey, than you can do in a year in the field. What Q really meant.

Restart the Kibana service and check it still works. Hammering refresh on the "-keepalive" proxy reliably caused the problem before, even on a nice static resource like /ui/favicons/apple-touch-icon.png, and post-edit it no longer does, so that’s a good sign. Flip the main proxy back to using keepalives, monitor for RSTs, and we’re good to go.
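Incidentally, for hammering it without leaning on F5, something like the following does the same job over a single keep-alive connection. It’s only a sketch: the hostname is a placeholder for the test vhost, and the x509 client-certificate options our proxy insists on are left out.

const https = require('https');

// One socket, kept alive, so every request reuses the same connection
const agent = new https.Agent({ keepAlive: true, maxSockets: 1 });

function hit(remaining) {
  if (remaining === 0) return;
  https.get({
    host: 'kibana-test.example.org',            // placeholder for the test vhost
    path: '/ui/favicons/apple-touch-icon.png',
    agent,
    // cert, key and ca options for the client certificate would go here
  }, (res) => {
    if (res.statusCode !== 200) console.log('status', res.statusCode);
    res.resume();                               // drain the body so the socket gets reused
    setTimeout(() => hit(remaining - 1), 250);
  }).on('error', (err) => {
    console.log('error', err.code);             // ECONNRESET is the smoking gun
    setTimeout(() => hit(remaining - 1), 250);
  });
}

hit(400);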

The issue seems to be tracked in Kibana at https://github.com/elastic/kibana/issues/44062; maybe it only affects Apache, and all the cool kids use nginx and AWS ELB at webscale levels.