Measuring Reliability For Uninterruptible Power Supplies and Power Protection Plans

Uninterruptible power supplies (UPS) are built to protect critical loads, so their reliability needs to be measured in a way that lets customers compare different manufacturers and UPS models. Because the purpose of a UPS is to shield the loads it protects from vulnerability, its reliability should not be guessed at.

Mean Time Between Failure

MTBF, or Mean Time Between Failure, is one such measure: an indicator of the reliability of an uninterruptible power supply. It is the average operational time between powering up and system shutdown due to failure (not a power failure in this sense, but a failure of the UPS system itself). It is expressed in hours.

Average failure rate is another measure of reliability. This is the total number of failures in a given time period. The failure rate over the lifetime of any UPS system is therefore inversely proportional to its MTBF.
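The relationship between the two measures can be sketched in a few lines of Python. This is a minimal illustration, not a manufacturer's method: the function names and the fleet figures (200 units, 4 failures in a year) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mtbf_hours(total_operating_hours: float, failures: int) -> float:
    """Average operating time between failures, in hours."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_operating_hours / failures


def failure_rate_per_hour(mtbf: float) -> float:
    """Average failure rate, taken as the reciprocal of MTBF."""
    return 1.0 / mtbf


# Hypothetical figures: 200 UPS units, each running for a full year,
# with 4 UPS failures recorded across the whole population.
HOURS_PER_YEAR = 8760
operating_hours = 200 * HOURS_PER_YEAR

mtbf = mtbf_hours(operating_hours, failures=4)
rate = failure_rate_per_hour(mtbf)

print(f"MTBF: {mtbf:,.0f} hours")
print(f"Failure rate: {rate * HOURS_PER_YEAR:.4f} failures per unit per year")
```

Doubling the MTBF in this sketch halves the failure rate, which is the inverse relationship described above.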

Uninterruptible power supplies are no different to any other electronic equipment in that the rate at which they fail is not constant. There are three distinct periods associated with UPS failure, often represented by a bathtub curve diagram: a) infant mortality failures, b) random failures and c) wear out failures.

Infant Mortality UPS Failures

Infant mortality failures are failures early in the life of the uninterruptible power supply. IT-sized uninterruptible power supplies can suffer what is termed ‘dead-on-arrival’. This could be due to a component manufacturing defect or transportation damage. A sudden shock or jolt in transportation may weaken a soldered joint, for example. Whilst UPS manufacturers strive to reduce these incidents as much as possible through stringent quality checks and testing processes, they do happen. Various processes can be applied to minimise the chances of this happening. UPS rated at 10kVA and above, for example, can be run for short burn-in periods (up to 48 hours) at high ambient temperature to reduce the potential for such failures.

Random UPS Failures

Random failures happen less often. During the normal working life of a UPS, the rate of these is low and fairly constant.

Wear Out Failures

Wear out failures at the end of an uninterruptible power supply’s working life are more common (and this is where the curve is steeper). Here, battery problems account for 98 percent of UPS wear out failures. Internal cabling insulation also becomes brittle and breaks down, particularly where the uninterruptible power supply has been subjected to high ambient temperatures over long periods. There are other consumable items that should be part of a regular monitoring regime, such as fans and capacitors, which will also eventually wear out with use.
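The three periods described above can be pictured with a simple numerical sketch of a bathtub curve. The model below is an assumption made purely for illustration: it adds a falling Weibull hazard (infant mortality), a constant hazard (random failures) and a rising Weibull hazard (wear out). None of the shape, scale or rate values come from real UPS data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate (failures per year) at age t, valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Illustrative failure rate at a given age; every parameter is hypothetical."""
    infant = weibull_hazard(t_years, shape=0.5, scale=50.0)    # falls with age
    random_failures = 0.02                                     # low and constant
    wear_out = weibull_hazard(t_years, shape=5.0, scale=12.0)  # rises with age
    return infant + random_failures + wear_out


# Print the modelled failure rate at a few ages to show the bathtub shape.
for age in (0.1, 1, 5, 10, 12):
    print(f"age {age:>4} years: {bathtub_hazard(age):.3f} failures per year")
```

With these made-up parameters the rate is high in the first weeks, settles to a low level in mid-life and climbs steeply as the unit approaches the end of its working life.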

Just because a manufacturer shows you some favourable MTBF stats does not necessarily mean that their products are the most reliable. Like most things, these can be massaged to look better than they actually are. The important question to ask is: what was the basis for their calculation? There are two primary approaches:

1) A record of the total number of failures for a particular UPS size over a given time period.

Commonly adopted by UPS manufacturers, this is a valuable approach if the field population is large and the time period long enough (more than the typical life expectancy of a UPS, which is five to ten years).

2) A system value calculated from the known MTBF values of components and assemblies.

This approach is more complex and relies on following standardised calculation formats.
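As a rough illustration of the second approach, the sketch below combines component MTBF values into a system figure. It assumes the simplest possible case: every component has a constant failure rate, there is no redundancy, and the failure of any one component shuts the UPS down (a series system). The component names and values are hypothetical, and standardised calculation methods are considerably more detailed than this.

```python
def system_mtbf_hours(component_mtbfs: dict[str, float]) -> float:
    """System MTBF for a series system with constant failure rates.

    The component failure rates (1 / MTBF) are summed, and the system
    MTBF is the reciprocal of that total.
    """
    total_failure_rate = sum(1.0 / m for m in component_mtbfs.values())
    return 1.0 / total_failure_rate


# Hypothetical component MTBF values in hours, for illustration only.
components = {
    "rectifier": 500_000.0,
    "inverter": 400_000.0,
    "static switch": 1_000_000.0,
    "control board": 800_000.0,
}

system = system_mtbf_hours(components)
print(f"System MTBF: {system:,.0f} hours")
```

Under these assumptions the system MTBF is always lower than that of its weakest component, which is one reason a headline component figure says little about the UPS as a whole.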

Mean Time to Repair

Mean Time to Repair (Mean Time to Restore) is the time taken to return an uninterruptible power supply to normal operation from shutdown.

Online UPS are designed to fail safely to mains; therefore, the MTBF of the mains power supply is also an important consideration, along with the mean time to repair (or average repair time).

As it is highly unlikely that a service engineer will be onsite at the very moment a UPS fails, MTTR also needs to include a travel time element. This in turn assumes the service engineer is carrying the parts needed to fix the problem in a single visit, which is sometimes not the case. Uninterruptible power supply manufacturers may only provide a figure based on the actual repair time. Although this may be a satisfactory comparison tool, it is not a true representation of reliability. A degree of scepticism is sometimes necessary when comparing marketing data from some manufacturers.
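The effect of leaving travel time out of MTTR can be shown with a short sketch. One common way to combine MTBF and MTTR is steady-state availability, MTBF / (MTBF + MTTR); that formula is not part of the discussion above and is brought in here only to make the comparison concrete. All of the hours used are hypothetical.

```python
def mttr_hours(repair: float, travel: float = 0.0, parts_wait: float = 0.0) -> float:
    """Time to restore normal operation: travel, waiting for parts, and repair."""
    return repair + travel + parts_wait


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: the fraction of time the system is operational."""
    return mtbf / (mtbf + mttr)


MTBF_HOURS = 200_000.0  # hypothetical UPS MTBF

# Figure based on hands-on repair time only.
quoted = mttr_hours(repair=2.0)

# Same fault, but including engineer travel and a wait for a missing part.
realistic = mttr_hours(repair=2.0, travel=4.0, parts_wait=24.0)

print(f"Repair-only MTTR: {quoted:.0f} h, availability {availability(MTBF_HOURS, quoted):.6f}")
print(f"Realistic MTTR:   {realistic:.0f} h, availability {availability(MTBF_HOURS, realistic):.6f}")
```

The repair-only figure always produces the more flattering result, so it is worth asking which elements a quoted MTTR actually includes.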

Testing Business Continuity Plans – Do You Do Enough?

Many more companies have come to realize that the development and implementation of a Business Continuity Program is now a good business practice. The existence of this program gives the executives, the staff, the Board of Directors and shareholders a feeling of confidence in the effective and quick recovery of the business operations in the event of a disaster.

Every year the plan gets that auditor’s tick mark, and people point to the report and say that they are covered should a disaster occur. And every year the plan gets put back into the binder and put back on the shelf, only to be dusted off next year.

So could you really recover using your plan documentation?

Do you know what is in your plan? Has your plan been updated with all of the technology and business changes that have occurred this year? Has it been tested? An untested plan is no better than having no plan at all. If the plan has not been kept up to date, then it is best left in the binder during a disaster, because it will hinder your recovery rather than help it.

Testing can be passive for business plans and crisis management plans, and active for technology recovery plans. You need to implement a comprehensive testing program for all of your company’s recovery plans. Each test should have objectives, the results should be documented, and any discrepancies between expected and actual results should be addressed.

No finger pointing is allowed. Take an objective look at why the plan did not work as expected. Was a critical update missed? Were the objectives set too high for the level of experience of your test team and testing program? There is no sense in trying to recover your complete company during the first or second exercise. You need to take small, gradual steps to develop your team’s confidence in their capability to properly execute the instructions in the plan. Go too fast and they will become disheartened; go too slow and they will become bored.

Once you have implemented your testing program and your team has gained confidence in their capability, you can start to set harder-to-reach goals.

Along with the testing program, you also need to implement a proper maintenance program for the plan documents. Only once you have put these programs in place can you, and the people who rely on your company, be sure that the company could be recovered quickly and fully after a disaster event.