Critical ICT loads within data centres rely on UPSs to protect them from power issues and ensure an uninterrupted power supply at all times. To live up to this expectation, UPSs must not only be reliable and resilient, but also backed by a rigorous and well-executed planned maintenance programme that includes on-site or fast access to a critical spares inventory and a rapid, high-quality emergency response service.
However, optimising a maintenance programme for a particular data centre calls for consideration of many issues: the failure scenarios that must be allowed for, and the capacity of on-site staff either to complement external technicians’ maintenance efforts or to create maintenance problems through operator error. The UPS topology is also a significant factor.
The maintenance programme’s ability to handle these issues depends on how well resourced it is, in terms of technician skills and availability, ready availability of suitable and correct-revision components, and documentation and management.
Routine maintenance
Data centres are continuously populated with facilities staff who oversee the mechanical and electrical services, so there are opportunities for those staff to carry out some, but probably not all, of the routine planned maintenance tasks if (and only if) the power system topology incorporates sufficient ‘concurrent maintenance’ capability. Those tasks include downloading the event logs, set-point monitoring, filter changes, visual inspection of all connections, general cleaning and battery cleaning/torque setting. If the concurrent maintenance capability depends on manual switching of complete systems (including the UPS, for example), then appropriate system training and familiarisation exercises must be carried out regularly.
Non-routine maintenance, and failure response
UPSs also require further maintenance tasks, but not so regularly or routinely. These include:
- Battery load bank and cell impedance measurements (annual, rising to twice-yearly)
- DC capacitor changes
- AC capacitor changes (five- to ten-year intervals, depending on the manufacturer; for KUP, it is five years)
However, these non-routine tasks – and all UPS failure interventions – are very infrequent.
Emergency interventions (actual failure of one or more UPS functions) are so infrequent that they are virtually impossible to cover properly, especially with on-site 24/7 multi-shift staff.
The level of documentation and constant training required does not make commercial sense, especially considering natural wastage and turnover in on-site personnel. By contrast, a UPS OEM’s maintenance technicians are highly experienced in fault-finding and repairing UPS failures, as they do this daily across a large installed base of customer UPSs. Their services need to be available under a contracted service level agreement.
Many published reports attribute as much as 70% of all data centre load-loss incidents to human error, which leaves at most 30% for every other cause, the UPS included. In practice, the UPS share could be as low as 5% of total power system failures.
Working with the UPS OEM’s support team can bring further benefits instrumental to securing UPS safety and reliability. These include management of factory upgrades, remote system monitoring, and spare parts provisioning.
Data centre operators are advised to set up a service agreement that covers situations ranging from preventative maintenance to time-guaranteed emergency responses. The agreement can bring together all the elements and resources needed, and package them to suit the facility’s particular needs and priorities.
An effective service plan is summarised here. It should comprise annual scheduled preventative maintenance (PM) visits for both the UPS and its batteries, as well as facilities for emergency call-outs on demand. Trained engineers and technicians should be available 24/7, and based close enough to ensure arrival on site within contractually agreed response times. These personnel should be backed by immediate access to a comprehensive local spare parts inventory, and by more in-depth technical support if required.
The plan should also allow operators to pre-empt UPS problems as far as possible through remote battery monitoring and impedance testing, generator monitoring, and UPS monitoring with monthly trend reporting and 24/7 alarm notifications.
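As an illustration of what trend reporting on battery impedance might involve, the sketch below compares each cell's latest reading with a baseline and flags cells whose impedance has risen by at least a chosen percentage. A rise relative to baseline is commonly treated as a sign of cell deterioration, but the function names, the cell IDs, and above all the threshold are assumptions made for this example; the actual alarm limit should come from the battery manufacturer or the service provider.

```python
def impedance_rise_pct(baseline_mohm: float, latest_mohm: float) -> float:
    """Percentage rise of the latest impedance reading over the baseline."""
    return (latest_mohm - baseline_mohm) / baseline_mohm * 100.0


def flag_cells(
    baseline: dict[str, float],
    latest: dict[str, float],
    threshold_pct: float,  # hypothetical alarm limit, not a recommended value
) -> list[str]:
    """Return the IDs of cells whose impedance rose by at least threshold_pct."""
    return sorted(
        cell
        for cell, base in baseline.items()
        if cell in latest
        and impedance_rise_pct(base, latest[cell]) >= threshold_pct
    )


# Illustrative readings in milliohms; the values are made up for the example.
baseline = {"A1": 4.0, "A2": 4.0, "A3": 5.0}
latest = {"A1": 4.2, "A2": 5.2, "A3": 6.5}
flagged = flag_cells(baseline, latest, threshold_pct=25.0)
```

Cells missing from the latest set of readings are skipped rather than flagged; a real monitoring service would probably report those separately.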
The service plans must be robust and well managed, both to ensure their efficacy and to maintain accurate budgetary control. Tasks include maintaining accurate monthly service records, and replacement planning with time and budget considerations. Fulfilling recommended part replacement cycles, once agreed, is important.
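Replacement planning of this kind can be sketched in a few lines. The example below assumes each part is recorded with an installation date and an agreed replacement interval in years, and lists the parts that fall due within a planning horizon. The `Part` record, the function names, and the sample dates and intervals are hypothetical; the agreed intervals themselves come from the manufacturer and the service contract.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Part:
    name: str
    installed: date
    interval_years: int  # agreed replacement cycle (assumed input)


def next_replacement(part: Part) -> date:
    """Date on which the part reaches its agreed replacement interval."""
    due_year = part.installed.year + part.interval_years
    try:
        return part.installed.replace(year=due_year)
    except ValueError:
        # Installed on 29 February and the due year is not a leap year.
        return part.installed.replace(year=due_year, day=28)


def due_within(parts: list[Part], today: date, horizon_days: int) -> list[Part]:
    """Parts due for replacement within the horizon, including overdue ones."""
    return [p for p in parts if (next_replacement(p) - today).days <= horizon_days]


# Illustrative inventory; names, dates and intervals are made up.
inventory = [
    Part("AC capacitors", date(2020, 3, 1), 5),
    Part("DC capacitors", date(2022, 1, 1), 7),
]
upcoming = due_within(inventory, today=date(2024, 9, 1), horizon_days=365)
```

Running the check against each year's budget horizon gives an early list of parts to order and cost, which is the point of agreeing the replacement cycles in advance.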
The ideal number of scheduled preventative maintenance visits per year depends on the power system topology; single-phase installations can be safely supported with a single annual PM visit, while three-phase systems warrant two annual visits.
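The rule of thumb above is simple enough to encode directly. The sketch below returns the number of PM visits per year for a given number of phases and spreads those visits evenly across the calendar. The function names and the even spacing are assumptions made for illustration; a real schedule would be agreed with the service provider.

```python
def pm_visits_per_year(phases: int) -> int:
    """Scheduled PM visits per year, following the rule of thumb in the text."""
    if phases == 1:
        return 1
    if phases == 3:
        return 2
    raise ValueError("expected 1 (single-phase) or 3 (three-phase)")


def visit_months(phases: int, first_month: int = 1) -> list[int]:
    """Spread the visits evenly over the year, starting at first_month (1-12)."""
    visits = pm_visits_per_year(phases)
    step = 12 // visits  # months between consecutive visits
    return [(first_month - 1 + i * step) % 12 + 1 for i in range(visits)]


single_phase_plan = visit_months(1)                # one visit, in January
three_phase_plan = visit_months(3, first_month=3)  # visits six months apart
```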
Conclusion
Maintaining a UPS at a high level of availability, and responding fast on the rare occasions when a problem does arise, calls for readily available, highly skilled technicians, backed by appropriate spare parts inventory. This level of backup can best be achieved by leveraging the resources of the original UPS supplier, complementing these with support from the facility’s on-site operational staff where it makes sense, and managing the overall strategy with a well-tailored maintenance contract.