Just to let you all know, today's outage was unrelated to Sunday's downtime, although today's failure was exacerbated by the extreme slowness of rebuilding the fulltext indexes after a failure.
So, what do we learn from this experience to endeavour to ensure it doesn’t happen again?
The server monitoring software has been upgraded to SHUTDOWN the database subsystem if disk space starts to get low. This is important because today's failure happened when the machine ran out of disk space due to excessive logging by the web server; as a result, the database could not properly update itself, which corrupted the indexes.
So, IF the machine gets too low on disk space, the database will die nicely, rather than horribly as it did this morning.
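The watchdog idea above is simple enough to sketch. This is a minimal illustration only, assuming a hypothetical 5% free-space threshold and leaving the actual database shutdown command as a stub; the real monitoring software's mechanism isn't described here:

```python
import shutil

# Hypothetical threshold: act when free space drops below 5% of the disk.
LOW_SPACE_FRACTION = 0.05

def disk_is_low(path="/", threshold=LOW_SPACE_FRACTION):
    """Return True when the free-space fraction at `path` falls below `threshold`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < threshold

def watchdog_action(path="/"):
    # The real monitor would issue a clean database SHUTDOWN here;
    # this sketch just reports what the watchdog would decide.
    if disk_is_low(path):
        return "SHUTDOWN database cleanly"
    return "OK"
```

The point is ordering: a clean shutdown triggered *before* the disk fills means the database never writes a partial update, so there are no corrupted indexes to rebuild afterwards.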
2nd, the error logging for the web server has been changed to reduce the likelihood of it causing future outages.
3rd, a hot backup site is being configured so that, should an extended database outage occur on this server, a second server can be brought online within a few minutes rather than hours.
Apologies again for the failure, and to all those ‘addicts’ who have needlessly suffered as a result.