Measuring Availability
The need for availability is governed by business objectives. The primary goals of measuring it are:
- To provide an availability baseline (and maintain it);
- To help identify where to improve the systems;
- To monitor and control improvement projects.
It is important to recognize that demanding availability targets can be difficult to achieve, since time is needed to recover from outages. The length of recovery time depends on the following factors:
- Complexity of the system: The more complicated the system, the longer it takes to restart. Hence, outages that require a system shutdown and restart can dramatically affect your ability to meet a challenging availability target. For example, applications running on a large server can take up to an hour just to restart after a normal shutdown, and longer still if the system was terminated abnormally and data files must be recovered.
- Severity of the problem: Usually, the more severe the problem, the more time is needed to resolve it fully, including restoring lost data or redoing lost work.
- Availability of support personnel: Suppose the outage occurs after office hours. A support person who is called in could easily take an hour or two just to arrive and start diagnosing the problem. You must allow for this possibility.
- Other factors: Many other factors can prevent the immediate resolution of an outage. Sometimes an application has an extended outage simply because the system cannot be taken offline for repair while other applications are still running. Other cases may involve a lack of replacement hardware from the system supplier, or even a lack of support staff.
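To see how quickly recovery time consumes an availability target, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names, the 365-day year, and the 180-minute outage (two hours for support to arrive plus one hour to restart, taken from the examples above) are all assumptions made for the sake of the example.

```python
# Illustrative sketch: how much downtime does an availability target allow?
# Function names and the 365-day year are assumptions for this example.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(target_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of outage allowed in the period at the given availability target."""
    return period_minutes * (1 - target_percent / 100)


def outages_allowed(target_percent, recovery_minutes, period_minutes=MINUTES_PER_YEAR):
    """Whole number of outages of a given length that fit inside the budget."""
    budget = downtime_budget_minutes(target_percent, period_minutes)
    return int(budget // recovery_minutes)


if __name__ == "__main__":
    # Hypothetical after-hours outage: 120 min for support to arrive + 60 min restart.
    recovery = 120 + 60
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% target: {budget:.1f} min/year budget, "
              f"{outages_allowed(target, recovery)} outage(s) of {recovery} min fit")
```

Under these assumptions, a 99.9% annual target leaves roughly 8.76 hours of downtime, so a single three-hour outage uses about a third of the year's budget, and a 99.99% target cannot absorb even one such outage.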
Availability Metrics
- Mean Time to Repair (MTTR)
- Impacted User Minutes (IUM)
- Defects per Million (DPM)
- Mean Time Between Failures (MTBF)
- Performance (e.g. latency, drops)
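As a rough illustration of how the first four metrics might be computed from an outage log, here is a short Python sketch. Definitions vary between organizations, so the formulas below are common conventions rather than the only ones: MTTR as average outage duration, MTBF as uptime divided by the number of failures, IUM as outage minutes multiplied by affected users, and DPM as failed operations per million attempted. The `Outage` record and all function names are hypothetical, and performance metrics such as latency are not covered.

```python
# Illustrative sketch of the availability metrics listed above.
# The Outage record and function names are hypothetical; formulas follow
# common conventions and may differ from a given organization's definitions.
from dataclasses import dataclass


@dataclass
class Outage:
    duration_minutes: float  # time from failure until service was restored
    users_affected: int      # users who could not work during the outage


def total_downtime(outages):
    return sum(o.duration_minutes for o in outages)


def mttr(outages):
    """Mean Time to Repair: average outage duration (needs at least one outage)."""
    return total_downtime(outages) / len(outages)


def mtbf(outages, period_minutes):
    """Mean Time Between Failures: uptime in the period divided by failure count."""
    return (period_minutes - total_downtime(outages)) / len(outages)


def availability(outages, period_minutes):
    """Fraction of the period during which the service was up."""
    return (period_minutes - total_downtime(outages)) / period_minutes


def impacted_user_minutes(outages):
    """IUM: each outage's duration weighted by the number of users it affected."""
    return sum(o.duration_minutes * o.users_affected for o in outages)


def defects_per_million(failed_operations, total_operations):
    """DPM: failed operations scaled to a base of one million attempts."""
    return failed_operations * 1_000_000 / total_operations


if __name__ == "__main__":
    period = 30 * 24 * 60  # a 30-day month, in minutes
    log = [Outage(60, 500), Outage(180, 200)]  # made-up sample data
    print("MTTR (min):", mttr(log))
    print("MTBF (min):", mtbf(log, period))
    print("Availability:", round(availability(log, period), 5))
    print("IUM:", impacted_user_minutes(log))
    print("DPM:", defects_per_million(25, 1_000_000))
```

With this convention, availability also equals MTBF / (MTBF + MTTR), which is a useful cross-check when the metrics come from different reports.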