Data Center Outages
A September 2013 Ponemon Institute study revealed that unplanned data center outages are a significant threat to business and revenue. Of those surveyed, 83% knew the cause of their outage. The most frequent causes were UPS battery failure, human error and exceeding UPS capacity. (Failure of the UPS equipment itself was a less common cause.) Cyber attacks were also an issue for some, cited in 34% of power failures. Most of these outages can be prevented.
As complex as data centers are, power equipment is the most important hardware in the daily operations of any data center. Without properly functioning power equipment, the data center and every dedicated server inside it become completely inoperable. It is therefore essential that all power equipment receive proper maintenance and routine checkups. Backup power also needs to be adequate and in full working order, since it must take over should the main power equipment suffer a catastrophic failure.
Data center outages are a major revenue drain.
Data center technicians should be on the lookout for potential power issues with any equipment, whether it is a dedicated server on a rack, a switch, or a battery-powered backup. They should have the know-how to address the issue themselves or to forward their concern to more specialized personnel, even if that means an outside third-party service. Onsite techs have the most hands-on experience, so they should be familiar with the equipment they deal with every day.
Catching a problem early goes a long way toward preventing something more serious. It lets techs schedule maintenance windows at the clients' convenience instead of being forced into resolving issues with long downtimes, which can run even longer if the replacement part is not in stock. On-site techs can spot issues in several ways: visually, audibly and even by smell.
Scheduled Maintenance is Vital
Manufacturers often provide regular service schedules and observation guidelines to prolong the life of the equipment and prevent a premature power failure. The checks may be daily, weekly or monthly, depending on how the hardware is used and how much strain it is under. Some data centers even perform infrared scans to detect heat issues. All dedicated server fans, PSUs, switches, routers and similar equipment should be inspected for dust and dirt at least monthly. The inspection should also cover cabling, racks, floors and walls, and confirm that there is still plenty of space for proper airflow. A log should record what was done, when it was done and what, if anything, needs further inspection or review. Adhering to a maintenance schedule is a must.
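The log described above can be as simple as a list of dated records. The following is only a minimal sketch in Python, not a prescribed tool: the `MaintenanceEntry` class, its field names, the sample entries and the 30-day default interval are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MaintenanceEntry:
    """One record in a hypothetical maintenance log."""
    equipment: str
    performed_on: date
    work_done: str
    follow_up: str = ""       # empty when nothing needs further review
    interval_days: int = 30   # assumed monthly inspection cycle

    def next_due(self) -> date:
        """Date by which the next inspection should happen."""
        return self.performed_on + timedelta(days=self.interval_days)


def needs_attention(log, today):
    """Return entries that are overdue or carry an open follow-up note."""
    return [e for e in log if e.follow_up or e.next_due() < today]


# Illustrative entries only; names and dates are made up.
log = [
    MaintenanceEntry("rack-12 PSU fans", date(2024, 1, 5), "dust removed"),
    MaintenanceEntry("UPS-A batteries", date(2024, 2, 20), "visual check",
                     follow_up="one cell swollen, schedule replacement"),
    MaintenanceEntry("core switch", date(2024, 2, 25), "airflow check"),
]

for entry in needs_attention(log, today=date(2024, 3, 1)):
    print(entry.equipment, "->", entry.follow_up or "inspection overdue")
```

Keeping the follow-up note next to the date makes it easy to see both what is overdue and what was flagged for review during the last visit.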
Noisy fans are one good indicator that things are not running as smoothly as they should be. High airflow temperature readings from within a server chassis can be a serious warning sign.
Equipment that runs at 100% load around the clock puts far more stress on power equipment and shortens the life of that hardware. Being familiar with equipment load values helps substantially in prolonging the life of the equipment and, in many cases, of any servers plugged into the same “power chain”. Using power equipment incorrectly can cause problems from the start; the instruction manual is provided for a reason. Lastly, failing equipment that cannot be completely fixed should in most cases be replaced. Otherwise, repairs only prolong the inevitable.
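As a rough illustration of keeping an eye on load values, the sketch below compares a measured load with a unit's rated capacity. The function names and the 80% warning threshold are assumptions chosen for the example; the correct limits are the ones in the manufacturer's documentation for the specific equipment.

```python
def load_percent(load_watts, rated_watts):
    """Return the load as a percentage of rated capacity."""
    if rated_watts <= 0:
        raise ValueError("rated capacity must be positive")
    return 100.0 * load_watts / rated_watts


def classify_load(load_watts, rated_watts, warn_at=80.0):
    """Label a reading; warn_at is an assumed threshold, not a vendor figure."""
    pct = load_percent(load_watts, rated_watts)
    if pct > 100.0:
        return "over capacity"
    if pct >= warn_at:
        return "high"
    return "ok"


# Hypothetical readings for a unit rated at 5000 W.
for reading in (2000, 4500, 5500):
    print(reading, "W:", classify_load(reading, 5000))
```

A check like this could run against whatever readings the power equipment already reports, flagging units that sit near or above capacity before they fail.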