On Sunday 12th November at 22:30, one of the four proxy servers that ResNet users rely on to browse the web crashed because its log file grew too large. The obvious question is "how did we let the log file get so big?". The log file can reach 2GB before the server fails, which under normal circumstances is more than enough, especially as we rotate the logs daily. However, one user's machine was making about 100 requests per second (over 8 million per day) to this server, and the resulting log growth caused the crash. The user in question has been disconnected from ResNet until we can find out what software was causing the problem.
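To put those figures in context, here is a rough back-of-the-envelope sketch. The request rate and the 2GB limit come from the incident above; the average log entry size of 200 bytes is purely an assumption for illustration, as is treating 2GB as 2**31 bytes, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
LOG_LIMIT_BYTES = 2**31  # assumption: the "2GB" limit taken as 2**31 bytes


def requests_per_day(rate_per_second):
    """Number of requests made in 24 hours at a steady rate."""
    return rate_per_second * SECONDS_PER_DAY


def hours_until_limit(rate_per_second, bytes_per_entry, limit_bytes=LOG_LIMIT_BYTES):
    """Hours for one client alone to fill an empty log, one entry per request."""
    bytes_per_second = rate_per_second * bytes_per_entry
    return limit_bytes / bytes_per_second / 3600


print(requests_per_day(100))  # 8640000, i.e. "over 8 million per day"

# bytes_per_entry=200 is a guess, not a measured value from our logs.
print(round(hours_until_limit(100, 200), 1))
```

This counts only the one misbehaving machine. Everyone else's normal traffic is written to the same log, so the real log would reach the limit sooner than this estimate suggests.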
Most of you probably did not even notice the failure, as all of the server's traffic was automatically moved onto one of the other three proxy servers, and the failed server was back up by 9am on Monday. One good thing to come out of this is that we have changed the way we load balance our proxy servers when one of them fails. Instead of a single proxy taking on the whole load of the failed one, doubling its own load, the traffic is now spread evenly across all remaining servers, so each one's load increases by only one third.
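The difference between the two failover behaviours can be sketched with a little arithmetic. This is only an illustration and not our actual load balancer configuration: the proxy names and the load figure of 300 units per server are hypothetical, and it assumes all four proxies normally carry equal load.

```python
def failover_to_one(loads, failed):
    """Old behaviour: a single surviving proxy absorbs the failed proxy's load."""
    survivors = {name: load for name, load in loads.items() if name != failed}
    target = next(iter(survivors))  # the first surviving proxy takes everything
    survivors[target] += loads[failed]
    return survivors


def spread_evenly(loads, failed):
    """New behaviour: the failed proxy's load is shared by all survivors."""
    survivors = {name: load for name, load in loads.items() if name != failed}
    share = loads[failed] / len(survivors)
    return {name: load + share for name, load in survivors.items()}


# Hypothetical names and loads: four proxies carrying 300 units each.
loads = {"proxy1": 300, "proxy2": 300, "proxy3": 300, "proxy4": 300}

print(failover_to_one(loads, "proxy1"))
# {'proxy2': 600, 'proxy3': 300, 'proxy4': 300}  (one server doubled)

print(spread_evenly(loads, "proxy1"))
# {'proxy2': 400.0, 'proxy3': 400.0, 'proxy4': 400.0}  (each up by one third)
```

With four equally loaded servers, the worst-hit machine goes from twice its normal load under the old scheme to one and a third times its normal load under the new one.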
Another result of the 8 million connection attempts in one day was that the log analysis server crashed a day later: it ran out of disk space because of the much larger logs that were copied to it.
It never rains but it pours! This is being fixed by adding a larger disk. The original disk was only 18GB; it will soon be a whopping 36GB!