Apologies for the serious downtime that's been experienced today and yesterday. It seems that a bug in the kernel, caused by our compiler and perhaps the options supplied to it, gradually consumed more and more memory until the machine ran out and fell over. As a result, things would have got slower and slower, with processes taking longer to finish, until the box collapsed: the busier it got, the slower it got, which meant it had even more waiting to do…
This is why a number of mails to us bounced: the server began refusing connections sometime yesterday.
There have been a number of changes to help prevent this problem in future. First, the number of allowed web processes has been quartered, so that in the event of a slowdown the server will refuse new requests until the backlog is cleared, rather than queuing up far more than it can handle. Also, the kernel has been upgraded to a newer version, one compiled directly by RedHat, so our compiler issue no longer applies.
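For the curious, the idea behind capping web processes can be sketched in a few lines of Python. This is only an illustration of the general technique (refuse work when full instead of queuing it), not the server's actual configuration; the names `BoundedAdmission`, `handle` and `max_workers` are made up for the example.

```python
import threading


class BoundedAdmission:
    """Hypothetical gate: admit at most max_workers requests at once."""

    def __init__(self, max_workers):
        # One semaphore slot per allowed concurrent worker.
        self._slots = threading.BoundedSemaphore(max_workers)

    def try_admit(self):
        # Non-blocking acquire: returns False straight away when every
        # slot is taken, instead of queuing the caller.
        return self._slots.acquire(blocking=False)

    def release(self):
        # Hand the slot back once the request has finished.
        self._slots.release()


def handle(gate, request, work):
    """Run work(request) if a slot is free, otherwise refuse immediately."""
    if not gate.try_admit():
        return "refused"
    try:
        return work(request)
    finally:
        gate.release()
```

A lower cap means some visitors are turned away during a busy spell, but the requests that are accepted still finish in reasonable time, so the backlog can drain rather than grow.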
A number of crucial pieces of software have been ‘stripped’ (which, for those not in the know, basically means made a lot smaller by removing redundant information), so the server should be able to run more processes before getting upset.
As a result of these emergency changes, the service has been somewhat unpredictable today, for which we apologise. We hope we’ve got the gremlins now!
Apologies to Psyche for the worry that the mail bouncing will have caused her!