A few weeks ago at work, I started to notice that occasionally I’d open a new tab or website and it wouldn’t load (or would load broken); my browser would auto-retry, or I’d hit F5, and the problem would go away. It didn’t happen too often, so I ignored it for a couple of days, figuring it was a transient network error or something wrong with my own machine.
After a few days, other users started reporting it, and I realized it wasn’t just me. It also started to affect our build pipelines, causing intermittent failures with network-related errors. As any developer can tell you, intermittent CI failures are very frustrating. We reported the issue to our IT team, who could see no issues on our switches or firewalls, so they escalated to our ISP. A couple of days later, our ISP responded that it wasn’t them, and they couldn’t help us.
By this point, the intermittent build failures had become more frustrating, and random processes failing throughout the day were causing headaches. After a lot of back and forth, we considered moving our build server off-premises to the cloud, but that wouldn’t solve the issues devs were experiencing on their workstations. We also considered working from home and hoping the issue sorted itself out. Unhappy with all of the proposed options, and frustrated that nobody seemed to be owning the problem, I decided to do some debugging.
Just like with development, my first step was to try to reliably reproduce the issue. If you can’t reliably reproduce a problem, it’s hard to test any changes or eliminate possible faults. I ran cURL in a while loop on the command line, trying various URLs. For some sites I got a 50% failure rate, for some 0%, and for some a very low (<10%) rate. Strange. I tried different page sizes to see if it might be related to packet sizes or certain byte strings getting caught by a firewall. I tried HTTP and HTTPS (and other protocols, but only HTTP/HTTPS were affected). I used sites I had control of to experiment. Eventually I determined that the issue primarily affected sites behind a CDN (Cloudflare and Fastly reliably had a 50% failure rate).
```
~ ↪ curl "http://capacitor.ionicframework.com"
~ ↪ curl "http://capacitor.ionicframework.com"
curl: (52) Empty reply from server
~ ↪ curl "http://capacitor.ionicframework.com"
curl: (52) Empty reply from server
~ ↪ curl "http://capacitor.ionicframework.com"
curl: (52) Empty reply from server
~ ↪ curl "http://capacitor.ionicframework.com"
curl: (52) Empty reply from server
~ ↪ curl "http://capacitor.ionicframework.com"
```
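Scripted up, the reproduction loop looked something like this. The `probe` helper, the URL, and the request count here are illustrative, not the exact script I ran:

```shell
# Sketch of the reproduction loop: run the same request n times
# and report the failure rate. probe() is a hypothetical helper.
probe() {
  url="$1"; n="$2"; fails=0; i=0
  while [ "$i" -lt "$n" ]; do
    # -s: silent, -o /dev/null: discard the body,
    # --max-time: don't hang forever on a dead connection
    curl -s -o /dev/null --max-time 10 "$url" || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$n requests failed for $url"
}

# Example (hits the network):
#   probe "http://capacitor.ionicframework.com" 20
```

Any non-zero cURL exit code counts as a failure, which captures both the empty replies above and outright connection errors.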
At this point, I was satisfied that I could reproduce the issue. Now I wanted to understand the problem further, so I broke out Wireshark. Wireshark is a great network diagnostic tool with a reputation for being hard to use and scary, but for simple use cases it’s not that difficult. I installed it on my Windows workstation and started listening on my Ethernet adapter. I ran a few cURL requests so I had one failure and one success to compare, then stopped the capture. I filtered to the HTTP protocol and dumped the protocol stream for both requests to compare. For the failed request, we issued a GET request and received no response, just a graceful TCP connection close. The successful request was identical, except we got back a proper response. Strange. At this point, my bet was on a firewall somewhere, because switches would not consistently drop the same packets every time.
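The same triage works from the command line with tshark, Wireshark’s CLI, if you’d rather avoid the GUI. A sketch, assuming the capture was saved to a file (the filename and helper name are illustrative; the flags are standard tshark options):

```shell
# Show only HTTP traffic from a saved capture:
#   tshark -r capture.pcapng -Y http

# Dump the full conversation for one TCP stream (here, stream index 0),
# which is what "Follow TCP Stream" does in the Wireshark GUI.
follow_stream() {
  tshark -r "$1" -q -z "follow,tcp,ascii,$2"
}

# Example:
#   follow_stream capture.pcapng 0
```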
I wanted to eliminate or confirm it was our IT infrastructure, so I worked with our IT team to connect my laptop in front of the firewall. That way, only one of our switches was between me and the modem. I re-ran the request, and I got new information. The issue still occurred, confirming it was an issue upstream of us, but our firewall had been masking important information. The server wasn’t gracefully closing the connection, it was sending a TCP RST packet to terminate the connection. This almost certainly indicated a firewall issue.
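If you want to pick those resets out of a capture yourself, a tshark display filter on the RST flag does it. The wrapper function and filename below are illustrative; `tcp.flags.reset == 1` is a standard Wireshark/tshark display filter:

```shell
# List every packet in the capture with the TCP RST flag set.
# show_resets() is a hypothetical helper.
show_resets() {
  tshark -r "$1" -Y 'tcp.flags.reset == 1'
}

# Example:
#   show_resets capture.pcapng
```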
If you refer to the stream index column, you’ll see these are from two different TCP streams (i.e., separate requests). Stream index 0 is a failed request, where we send the “Client Hello” message to the server and immediately receive an RST packet in response. In the second request (stream index 1), we send the “Client Hello” message and get a “Server Hello” back, as expected for a normal SSL connection. The requests are identical, performed 4 seconds apart, to the same server/website. One worked, one didn’t. At this point I felt it was unlikely to be related to the upstream servers, as we experienced the problem across two very large CDNs (Fastly and Cloudflare), which have separate infrastructure (as far as I’m aware). We also couldn’t reproduce the issue using another local ISP, even when we pinned the IP address to try and hit the same edge node. It seemed certain that it was an issue with our ISP, but they kept denying it.
One thing that made the issue particularly difficult to debug was that we couldn’t reproduce it without a CDN in the middle, which meant we couldn’t see the raw packets arriving at our own servers to determine whether they were malformed, or whether some packets never arrived at all.
At this point, I reached out to both Fastly and Cloudflare to see if either could shed light on the issue. I included as much info as possible: packet captures, source and destination IP addresses, destination URLs, and so on. They asked me to test again with a couple of different URLs, and then very quickly identified the issue. Packets in the same stream were arriving from different IP addresses and were being blocked by their firewall. They suggested it was probably a routing issue with our ISP. I sent their findings along to our ISP, and the issue was wrapped up the next day.
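Their diagnosis — packets within a single TCP stream arriving from different source addresses — is something you can look for in a capture yourself. A sketch (the helper name and filename are mine; `tcp.stream` and `ip.src` are standard tshark field names):

```shell
# Print the unique (stream index, source IP) pairs in a capture.
# A normal stream shows exactly two sources (client and server);
# a third address inside one stream is the kind of anomaly the CDNs saw.
stream_sources() {
  tshark -r "$1" -T fields -e tcp.stream -e ip.src | sort -u
}

# Example:
#   stream_sources capture.pcapng
```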
Overall, it took two weeks and a lot of Wireshark captures, but eventually I was able to get our ISP to realize it was their fault and fix it.