Unplanned outage



administrator
31-08-2011, 06:43 AM
Just an update on the website being down yesterday: HostAway had a problem.
Thankfully Ashleigh always explains why. Thanks, Ash.
Please note their time zone is Perth, WA.

UNPLANNED OUTAGE
Dear Dean Bartorelli,


As you are probably aware, this afternoon HostAway experienced a prolonged downtime of Linux website hosting.




THE PROBLEM


A file system corruption on our storage server occurred at 12:10pm.
A file system check produced multiple errors, and unfortunately the block of data which held our main directory structure was corrupted, meaning not all website files could be found or related to their original website.
Databases were not affected, only website files.
The corrupted file system was replicated onto our secondary server, rendering this method of failover ineffective.




HOW WE DEALT WITH IT


Our monitoring system immediately notified us of the file corruption, at which point the file system check was activated.
We posted a note on our Twitter feed to notify customers of the issues. The magnitude of the corruption was not yet evident, so downtime was expected to be minimal and handled using standard procedures.
Once the extent of the damage was realised we began restoring the data using custom scripts designed to sift through the lost data to rebuild the file tree.
A temporary outage notice was placed on all affected sites, advising visitors that the website was unavailable and to try again soon.
We held off on restoring from backup files, treating them as a last resort, to avoid the possibility of customers losing new data. In some cases backup data was used as a temporary fix.
A new storage server (that was already being built) was fast-tracked and set up with a backup file system as a precaution in case the file system rebuild was unsuccessful.
By 3pm the majority of websites were restored. A limited number of sites required more extensive reconstruction using backups, and we continued to restore these on a site-by-site basis.
We disabled FTP so that customers could not upload new files that could possibly be overwritten while we restored data.
The servers remained under a heavy load until 4:05pm at which point we were able to stabilise them.
All staff remained on site until well into the evening answering customer requests.




HOW WE WILL AVOID THIS IN THE FUTURE


New storage servers with updated hardware and software will replace the current machines. This was already scheduled for next month, but has now been fast-tracked.
We have isolated several software issues that contributed to this outage, and determined they have been corrected in the versions scheduled for use on the new servers.
A new notification system will be put in place, so that customers can be notified by email or SMS when a service is affected.

If you would like more information please contact us at support@hostaway.net.au or give me a call on (08) 9249 3646. Please check your website is functioning correctly and notify HostAway if you experience any issues.

I sincerely apologise for any inconvenience this may have caused you or your clients.


Kind regards,

Ashley Hadassin
Director

NLALM
31-08-2011, 06:32 PM
Thank god for that, I thought there was something wrong with my computer.