It’s a day that the average system administrator doesn’t look forward to… server problems. Today was one of those for me. It started out as a typical Monday morning when suddenly my cell phone vibrated and the email alerts on my screen started going off… it was our monitoring system letting me know about a sudden increase in Oracle connections on our application servers. Taking a look, I saw they had gone up from 1200 connections per machine to 2400 in a matter of minutes. I alerted some people, and the proper course of action was to restart Apache. All the while, the people who are supposed to be watching for this had no idea. I restarted Apache on the application servers, and at that point no DB connections were being made at all. I made contact with the DB admins in Iowa, and suddenly they started receiving pages. Apparently the Oracle listener had failed. A quick restart and we were online… or so we thought.
Fast forward 2 hours and bam, same problem: Oracle connections on the rise (this time almost hitting 3000 on each machine). I immediately alerted the DB admins (who again didn’t receive pages from their monitoring system), and this time they found that one of the CPUs on the machine had died and it was spitting out memory errors. Ten minutes of downtime and we were up… for good.
Lesson learned… a good monitoring system can prevent a lot of stress. I was able to alert all the necessary people before it exploded in my face. It’s embarrassing to hear that your website is down from a customer or even from someone in the office. It feels nice to say “Already know about it and working on it” and to let customer support know about the issue so they don’t look like dumbos. At a previous job this was never the case. I felt like the DB admins did today… as in, nobody knew something had broken until a customer reported it.
Below… a nice overview of what happened. I highly recommend clicking on it so you can see for yourself.
