Point of failure
Started: Wednesday, August 4, 2004 17:03
Finished: Wednesday, August 4, 2004 17:21
This is Bitscape, reporting live from the bizarre scene. By the time anyone reads this, the bizarreness will likely have been solved. Otherwise, getting it out for the world to read could be tricky.
It all began a little less than an hour ago. I was happily working away, actually being semi-productive for a change, when suddenly, all my ssh consoles stopped responding. I tried to ping Hydrogen. No response. A dead server, apparently. After a few seconds of this, I picked up the phone and called Scott. He was having the same problem.
We tried pinging other servers in the cabinet. They too were not responding. Firewall issue?
Scott called the guy who runs our router, and he suggested that it might be a recently discovered Cisco bug, and that rebooting the firewall might temporarily fix it until a patch could be applied. (Is it just me, or does Cisco seem a lot like the Microsoft of the firewall router world?)
Reportedly, the firewall was being rebooted, and should be working again in a few minutes.
I was already a little suspicious, and when nothing came up after 5 minutes, I started poking around again. A traceroute from here showed something odd. My packets appeared to be making several hops, but weren't getting anywhere near our server, or cabinet, or data center. Could it be an ISP problem on this end? I was still able to access all my usual websites.
Finally, I decided that this was all just a bit too suspicious, and tried to ssh over to Ziyal, and ping Hydrogen from there. It came back with a response, no problem. So it had to be a Comcast problem!
Well, not so fast.
I noticed that one of the IP addresses that appeared in a traceroute from Ziyal was identical to one that appeared from here. The only difference was that when the traceroute was performed from Argo, it was the last IP on the list. Ziyal's traceroute continued a few more hops to our dear server cabinet.
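For the curious, that comparison can be sketched in a few lines of Python. This is only a rough illustration of the reasoning, not something I actually ran: the function name find_stall_point is made up, the hop lists are typed in by hand rather than parsed from real traceroute output, and the addresses are placeholders from the reserved documentation ranges, not the real ones from that day.

```python
def find_stall_point(failing_hops, working_hops):
    """Find where a failing trace dies relative to a working one.

    Returns (last_hop, remaining): the last address that answered in the
    failing trace, and the hops the working trace went on to reach after
    that same address. Returns None if nothing answered in the failing
    trace or the two traces share no such hop.
    """
    # Ignore hops that timed out (traceroute prints them as "*").
    responding = [hop for hop in failing_hops if hop != "*"]
    if not responding:
        return None
    last_hop = responding[-1]
    if last_hop not in working_hops:
        return None
    idx = working_hops.index(last_hop)
    return last_hop, working_hops[idx + 1:]


# Hypothetical hop lists (placeholder addresses, not the real ones).
from_argo = ["10.0.0.1", "192.0.2.1", "198.51.100.7", "*", "*"]
from_ziyal = ["203.0.113.1", "198.51.100.7", "198.51.100.20", "198.51.100.30"]

result = find_stall_point(from_argo, from_ziyal)
if result is not None:
    last_hop, remaining = result
    print(f"Failing trace stops at {last_hop}; "
          f"working trace continues {len(remaining)} more hop(s).")
```

If the shared address is the last responder on one side and merely a waypoint on the other, the trouble is probably at or just past that hop, at least for traffic arriving from the failing direction.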
I called Scott again and informed him of the problem. He suggested that I use dnsstuff.com to see who owns the IP. The answer came back: "AT&T Worldnet Services." Lovely.
Whose tech support does one call in such an instance? Clearly, the packets were making it out of the Comcast network, but they were stopping short before arriving at our favorite data center. Scott decided to call the data center. (Their tech support would likely be much more knowledgeable and responsive than Comcast's, and the problem did appear to be a bit nearer to their end of the network.)
In any case, the problem has now been fixed. I can ping and ssh to Hydrogen just fine again. But that was damn spooky. Every now and then, weird things happen. Today was just such a day.