Follow Up: Post Incident Report for Sydney DC
We have received additional information from our vendor regarding the incident on June 6, 2012.
The root cause of the power failure has been investigated by electrical contractors and has now been identified as the result of a small rodent creating a dead short across 2 of the 3 incoming phase conductors for the A2 UPS unit.
The damage caused by the rodent severed cables from the input filter module and caused a secondary short on the output terminals. This created a fault current in excess of 34,000 amps to the entire UPS group, which registered an output overload fault. When the UPS group went into bypass, it simultaneously tripped the protection circuit breakers, shutting down the entire UPS group even though UPS A1 and A3 were not affected.
The vendor has advised that the equipment reacted as expected for this type of fault.
With the UPS group in this state, the on-site generators were unable to supply power, which caused the interruption.
Mains power was restored as quickly as possible once electrical contractors had verified the safety of the staff, building and equipment, and had isolated the A2 UPS unit from the group. This was done relatively quickly because electrical contractors were already on-site for UPS maintenance scheduled from 7:00PM through to 1:00AM the following morning.
What happens next?
The faulty A2 UPS unit will undergo extensive investigation and testing before repairs are carried out to ensure that there are no tertiary issues before scheduling a planned maintenance window to repair the unit. Once repairs have been completed, the A2 UPS unit will be soak-tested in isolation before being re-introduced to the UPS group. A site load test will then be conducted to ensure the UPS group and generator systems are all back to specification.
Customers with affected websites today
Customers formerly hosted on SYDS11-N2S will be eligible for an SLA credit, which will be paid at the full 100% of your monthly plan fee, irrespective of who was at fault. Our CEO will be writing an email to all affected customers over the weekend once migrations to the new node have been completed.
We extend our most sincere apologies to customers on this node who have experienced extended downtime as a result of this emergency migration. The fact that the migration is taking so long is a clear indication of how important it is to move customers to a new server as quickly as possible to avoid any potential data loss.
More than 85% of the services on this node have been successfully migrated to new hardware and will already be back online. We ask all customers on this node to check their data as soon as possible and to advise our Customer Care team if any data is missing, so that the team can move it across for you.
We suspect that this server was damaged by a power surge, which also damaged components in other nodes; those components were replaced without further incident.
We will also be contacting customers on our Business cPanel hosting nodes to extend an SLA credit to them as a result of this incident.
Some people are saying that you should have had dual power supplies?
It’s always easier to cast an opinion, however educated it may or may not be, when you’re on the other side of the fence (or hiding behind a keyboard as the case may be). For many years we have purchased quality server hardware that best fits the needs of our intended customer base. Our services are marketed towards the low cost segment of the web hosting market, and in order to maintain our price points we need to keep both our OpEx and CapEx as low as possible, which in turn allows us to keep a healthy profit margin and continue to expand the company.
We’ve achieved this with our domain names by becoming an accredited registrar of all domain names we offer, with SSL certificates by forging a strong premier partnership with Trustwave, and with SMS by connecting directly to the Telstra Wholesale network.
It is important to note that our core routers, switches and our new Dell Enterprise Cloud hardware (used by some of our Business customers) all use dual redundant power supplies.
If we were to service all of our web hosting customers from hardware that had dual redundant power supplies, an additional $1,200 per server would be added to our CapEx cost base (approximately $900 more to upgrade the chassis to one which supports dual power supplies and approximately $300 for the second power supply), while our data centre costs, which form part of our OpEx, would rise by approximately 75%.
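For readers who want to see how those figures add up, here is a rough sketch of the arithmetic. The per-server amounts and the 75% figure are the approximate numbers quoted above; the fleet size of 100 servers and the 1,000 baseline data centre cost are purely hypothetical values chosen for illustration and are not VentraIP figures.

```python
# Approximate per-server figures quoted in this post (dollars).
CHASSIS_UPGRADE = 900   # chassis that supports dual power supplies
SECOND_PSU = 300        # the second power supply itself
DC_COST_INCREASE = 0.75 # approximate rise in data centre costs


def extra_capex(servers: int) -> int:
    """Additional CapEx for fitting `servers` machines with dual power supplies."""
    return servers * (CHASSIS_UPGRADE + SECOND_PSU)


def dc_cost_after_increase(current_cost: float) -> float:
    """Data centre cost after the approximate 75% rise."""
    return current_cost * (1 + DC_COST_INCREASE)


if __name__ == "__main__":
    print(extra_capex(1))    # one server
    print(extra_capex(100))  # hypothetical fleet of 100 servers
    print(dc_cost_after_increase(1000.0))  # hypothetical baseline cost
```

The sketch only covers the hardware and data centre components described above; it does not attempt to reproduce the overall pricing impact.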
As you can imagine, this would equate to a substantial cost increase that would require us to raise our prices by approximately 150% in order to cover the additional expenditure, or attempt to cram more customers onto one server and risk degrading the quality of service over a prolonged period of time. In offering this level of redundancy, we would then find ourselves in another market segment which we believe is already serviced well by other web hosting companies, such as Micron21 and Anchor, who invest their money in purpose-built, high availability cloud systems… but they also charge the appropriate price for their services.
We have always maintained that we are not a one-stop-shop for all web hosting customers. We understand and appreciate that some customers absolutely require dual redundant everything and 100% guaranteed uptime, but at the same time some commentators need to understand that there is a large segment of customers who neither want nor require this level of service, and for those customers we believe our services are an ideal choice.
And as we have always offered, customers are free to trial our services for 30 days and if they are unsatisfied with us they can request a full refund of their hosting fees – no questions asked.
VentraIP has always shown its appreciation and respect to its loyal customer base, and will continue to provide the services that have made it a “Top 10” domain name and hosting company.