Reliability analysis of imperfect coverage systems with a shared warm standby

Redundancy techniques are commonly applied to satisfy the reliability requirements of fault-tolerant systems. Warm standby, a compromise between hot standby and cold standby in terms of power consumption and recovery time, has attracted wide attention over the past several decades. However, the existing reliability analysis methods for warm standby systems with imperfect coverage have difficulty handling some cases, such as non-exponential time-to-failure distributions for the system components and systems with shared standbys. In this paper, a new approach based on the step function and the impulse function is proposed to overcome the limitations of the existing approaches. The reliability of a system including shared standbys is derived considering two kinds of imperfect fault coverage models, Element Level Coverage (ELC) and Fault Level Coverage (FLC). The proposed approach is applicable to any type of time-to-failure distribution for system components subject to imperfect fault coverage. A case study is presented to illustrate the applications and advantages.


Introduction
Standby redundancy is an important concept in improving the reliability of systems [1]. It is universally accepted that fault-tolerant and safety-critical systems cannot achieve their intended reliability without employing standby [2]. Typically, standby techniques are classified into hot standby sparing, cold standby sparing and warm standby sparing [3]. Cold standby sparing is used when energy consumption is of concern, while hot standby is applied when recovery time is of vital importance. Warm standby is a compromise between cold standby and hot standby in terms of power consumption and recovery time [4][5][6]. Warm spares have different failure rates or failure distributions before and after they are used to replace a faulty component, which may fail randomly. Therefore, the failure behavior of a warm standby component is time-dependent [7]. Because of this complex failure behavior, warm standby systems have attracted extensive attention [8][9][10]. Existing approaches for modeling and analyzing the reliability of warm standby systems are mostly Markov-based methods, simulation-based methods and combinatorial methods. The Markov-based methods suffer from the well-known state space explosion problem [11] and are typically limited to exponential time-to-failure distributions. The simulation-based methods, for instance Monte Carlo simulation, usually involve long computation times, especially when highly accurate results are required [12]. A combinatorial approach was proposed by Liu et al. [13], which enumerates all minimal cut sets or sequences and then applies the inclusion-exclusion formula to assess the reliability of a warm standby system. However, the inclusion-exclusion expansion makes the complexity of the method exponential. Another combinatorial approach, based on the sequential binary decision diagram (SBDD), has recently been proposed to evaluate standby systems [14][15].
In redundant standby systems, automatic recovery mechanisms are generally designed so that the system continues to function correctly even in the presence of faults or errors. If a component fails without being detected or isolated because the system recovery mechanism itself fails, it may produce an incorrect result that leads to an overall system failure [16]. This behavior is known as imperfect fault coverage (IPC) [17][18][19]. When IPC is considered, system reliability cannot be improved without limit by adding standby components [20]. There are two conceptual models for system reliability subject to IPC: a particular coverage value can be associated with each redundant element in the system, which is Element Level Coverage (ELC), or fault coverage can be modeled as a function of the number of faults that the system has experienced, which is Fault Level Coverage (FLC) [21]. The simple and efficient algorithm (SEA) is a well-known approach for incorporating IPC into combinatorial models for system reliability analysis [22][23]. However, it is not applicable to dynamic systems with time or sequential dependence.
Reliability analysis of warm standby sparing (WSP) systems subject to IPC is a difficult and challenging task because of the sequence-dependent failure behavior of the components. Some researchers have studied the reliability of WSP systems subject to FLC [24][25][26][27]. However, their work is restricted to cases where the failure time of each system component follows an exponential distribution. When several components share a warm standby, the system reliability analysis becomes more complex.
In this work, a new approach is proposed for the reliability analysis of systems with a shared warm standby subject to IPC. The reliability of a system including a shared standby is assessed considering ELC and FLC. The approach places no restriction on the time-to-failure distributions of the system components.
The rest of this paper is organized as follows. Section 2 presents the fundamentals of warm standby systems and IPC. Section 3 presents the proposed approach, which introduces the Heaviside step function and the impulse function to describe the dynamic failure behavior of warm standby systems subject to IPC. An algorithm is proposed to derive the failure probability density function and evaluate the reliability of a system with a shared warm spare. Case studies are provided in Section 4 to illustrate the approach. Section 5 summarizes the paper.

Warm standby system
Warm standby sparing is a fault-tolerant design technique widely used to improve system reliability. Warm spares have different failure rates or failure distributions before and after they are used to replace a faulty component. In practice, several important components are often equipped with one shared warm standby because standbys are expensive. For example, consider a series system composed of components A and B. To improve the system reliability at acceptable cost, a warm standby S is shared by A and B; the fault tree is shown in Figure 1. Once A or B fails, S is used to replace the faulty component. However, the warm standby sometimes fails to switch into the working state because of failures of the monitoring, locating and switching devices. This is the fault coverage problem, which is detailed below.

Imperfect fault coverage (IPC)
The fault coverage problem has attracted extensive attention in fault-tolerant systems. In general, fault coverage is divided into perfect fault coverage (PFC) and imperfect fault coverage (IPC). PFC means that whether the system can continue to work depends only on the remaining structure of the system, not on the faulty component, while IPC behavior is caused by imperfect coverage mechanisms [28]. The general IPC model is shown in Figure 2.

(1) The Heaviside step function
The Heaviside step function is defined as

$$u(t-\tau) = \begin{cases} 1, & t > \tau \\ 0, & t < \tau \end{cases} \tag{1}$$

where $t$ and $\tau$ are nonnegative real numbers. The discontinuity of the Heaviside step function occurs at $t = \tau$. In the warm spare structure, the Heaviside step function $u(t-\tau)$ can be used to denote that a component fails at time $t$ after the time instant $\tau$.

(2) The impulse function
The impulse function is used to express a variable taking on a specific value and is defined as [29]

$$\delta(t-\tau) = \begin{cases} +\infty, & t = \tau \\ 0, & t \neq \tau \end{cases}, \qquad \int_{-\infty}^{+\infty} \delta(t-\tau)\,dt = 1 \tag{2}$$

Because all the variables in a warm standby system belong to the set of nonnegative real numbers, the integral in (2) can be transformed as

$$\int_{0}^{+\infty} \delta(t-\tau)\,dt = 1 \tag{3}$$

Besides, an important property of the impulse function, the sifting property, will be used extensively in this paper:

$$\int_{0}^{+\infty} f(t)\,\delta(t-\tau)\,dt = f(\tau) \tag{4}$$
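As a numerical illustration, the following Python sketch implements the step function and approximates the impulse function by a narrow Gaussian, then checks the sifting behaviour $\int_0^{+\infty} f(t)\,\delta(t-\tau)\,dt = f(\tau)$, which is assumed here to be the property the text refers to. The function names, the Gaussian width `eps`, the value chosen for the step at $t = \tau$ and the test function are illustrative assumptions, not part of the paper.

```python
import math

def u(t, tau):
    """Heaviside step u(t - tau): 1 after tau, 0 before (0 at t == tau by assumption)."""
    return 1.0 if t > tau else 0.0

def delta_approx(t, tau, eps=1e-2):
    """Narrow Gaussian standing in for the impulse delta(t - tau)."""
    z = (t - tau) / eps
    return math.exp(-0.5 * z * z) / (eps * math.sqrt(2.0 * math.pi))

def sift(f, tau, upper=10.0, n=200_000):
    """Midpoint-rule estimate of the integral of f(t) * delta(t - tau) over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += f(t) * delta_approx(t, tau)
    return total * h

# Hypothetical smooth test function; the integral should be close to f(tau).
f = lambda t: math.exp(-0.5 * t)
approx = sift(f, tau=2.0)
exact = f(2.0)
print(approx, exact)
```

The Gaussian is only a stand-in: shrinking `eps` (with a correspondingly finer grid) brings the estimate closer to $f(\tau)$.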

The conditional PDF of warm standby
As shown in Figure 1, components A and B have time-to-failure probability density functions (PDFs) $f_A(a)$ and $f_B(b)$, respectively, and two failure rates describe warm standby S, one in the spare state and one in the working state. It is often assumed that the spare-state failure rate is the working-state failure rate scaled by a dormancy coefficient.
Therefore, the conditional failure rate of S given the failure times of A and B, $\lambda_{S|A,B}(s \mid a, b)$, can be expressed through formulas (5) and (6). Using formula (6), the conditional PDF $f_{S|A,B}(s \mid a, b)$ is obtained as formula (7). We assume that components A, B and S do not fail at the same time. Utilizing the property of the Heaviside step function and formula (7), the conditional PDF of S can be simplified, where $R_S(t)$ denotes the reliability of warm standby S at time $t$.
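To make the idea of a conditional PDF with two failure rates concrete, here is a minimal Python sketch for the special case of constant rates. It assumes that S ages at rate `alpha * lam` while dormant, is activated at `tau = min(a, b)`, and ages at rate `lam` afterwards; these modelling choices, the function names and all numerical values are assumptions for illustration and are not taken from the paper's formulas.

```python
import math

def standby_pdf(s, tau, lam, alpha):
    """Conditional failure PDF of spare S at time s, given activation at time tau.

    Assumes constant rates: alpha * lam while dormant (s < tau), lam once active.
    """
    if s < tau:
        return alpha * lam * math.exp(-alpha * lam * s)
    # Survive the dormant period up to tau, then fail at the full rate.
    return lam * math.exp(-alpha * lam * tau) * math.exp(-lam * (s - tau))

def midpoint(f, lo, hi, n=100_000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

lam, alpha = 1e-3, 0.5   # hypothetical working rate and dormancy coefficient
a, b = 400.0, 650.0      # hypothetical failure times of A and B
tau = min(a, b)          # S is assumed to be activated at the first primary failure

pdf = lambda s: standby_pdf(s, tau, lam, alpha)
# Integrate each branch separately so the jump at tau does not degrade accuracy.
total = midpoint(pdf, 0.0, tau) + midpoint(pdf, tau, 30_000.0)
print(total)
```

The density jumps by a factor of `1 / alpha` at `tau`, which is the time-dependent behavior the step function is used to encode; the total probability should still be very close to 1.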

Reliability evaluation subject to IPC
In the conditional PDF of the output X of the shared standby structure, the first term denotes that the state of X is equal to the state of A in three cases: S fails to replace the faulty A; S succeeds in replacing the faulty B, and then A fails; and S fails first, and then A fails. The second term states that the state of X is equal to the state of B, which also includes three cases analogous to those of the first term. The last term indicates that when S succeeds in replacing the faulty A or B and S then fails, the state of X is equal to the state of S.
Note that $f_{X|A,B,S}(x \mid a, b, s)$ is a proper distribution function, which can be proved by (1) and (10):

$$\int_{0}^{+\infty} f_{X|A,B,S}(x \mid a, b, s)\,dx = 1$$

Therefore, the joint failure probability density function of A, B, S and X is

$$f_{ABSX}(a, b, s, x) = f_A(a)\,f_B(b)\,f_{S|A,B}(s \mid a, b)\,f_{X|A,B,S}(x \mid a, b, s)$$

Here the failures of A and B are considered to be independent of each other. Substituting formulas (9) and (10) gives formula (13), which can then be simplified. Integrating $f_{ABSX}(a, b, s, x)$ with respect to $a$, $b$ and $s$ from 0 to $+\infty$ gives the marginal failure PDF of X:

$$f_X(x) = \int_{0}^{+\infty}\!\int_{0}^{+\infty}\!\int_{0}^{+\infty} f_{ABSX}(a, b, s, x)\,da\,db\,ds \tag{15}$$

The $f_X(x)$ of formula (15) can be written as the sum of 8 terms. The first and second terms are evaluated using the properties of the Heaviside step function and the impulse function, and the other terms of $f_X(x)$ are computed with a similar approach. Summing the terms gives $f_X(x)$, and the reliability of the system at time $t$ follows as $1 - \int_{0}^{t} f_X(x)\,dx$.
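The case analysis behind the FLC model can be cross-checked by simulation. The Python sketch below is a Monte Carlo estimate, not the analytical method proposed in the paper: it samples the failure time of X by following the cases described in the text (S fails before it is needed, the fault is not covered, or S takes over and the structure lasts until S or the surviving primary component fails). Exponential distributions, equal rates for A, B and S, the rate values and all function names are assumptions made for illustration; `closed_form` is a hand-derived check valid only under those assumptions.

```python
import math
import random

def sample_x(rng, lam_a, lam_b, lam_s, alpha, c):
    """Draw one failure time of the shared-standby structure X (FLC model)."""
    a = rng.expovariate(lam_a)
    b = rng.expovariate(lam_b)
    first, second = min(a, b), max(a, b)
    # S ages at the reduced rate alpha * lam_s while it is a spare.
    if rng.expovariate(alpha * lam_s) < first:
        return first                      # S failed first; the first primary failure ends X
    if rng.random() >= c:
        return first                      # fault not covered: S fails to replace the component
    s = first + rng.expovariate(lam_s)    # S takes over and ages at the full rate
    return min(s, second)                 # X ends when S or the surviving primary fails

def reliability_x(t, c, lam=2e-4, alpha=0.5, n=100_000, seed=1):
    """Monte Carlo estimate of P(X > t) with equal rates lam for A, B and S."""
    rng = random.Random(seed)
    alive = sum(sample_x(rng, lam, lam, lam, alpha, c) > t for _ in range(n))
    return alive / n

def closed_form(t, c, lam=2e-4, alpha=0.5):
    """Exact P(X > t) for equal constant rates, used only as a cross-check."""
    return math.exp(-2 * lam * t) * (1 + c * (2 / alpha) * (1 - math.exp(-alpha * lam * t)))

for c in (1.0, 0.85, 0.0):
    print(c, reliability_x(800.0, c), closed_form(800.0, c))
```

With `c = 0` the standby never takes over, so X behaves like the plain series system of A and B; with `c = 1` the sketch reduces to perfect coverage.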

ELC model
In the ELC model, every element has a particular coverage value, which is the probability of being covered when it fails. Here, we assume that the coverage probability of A is $c_1$ and that of B is $c_2$. In this case, the conditional PDF of output X and, using an approach similar to that for the FLC model, the failure PDF of X can be obtained, which in turn gives the reliability of the system.

Case study
Taking a system including a shared warm standby element as an example, the dynamic fault tree of the system is shown in Figure 3. The system consists of elements A, B, S, C, D and E, and the dormancy coefficient $\alpha$ of warm standby element S is 0.5. The system reliability is evaluated by the proposed method for two cases: time-to-failure of the system components following exponential distributions and following Weibull distributions.

Exponential distributions
When the time-to-failure of the system components follows exponential distributions, the component parameters are presented in Table 1.
Table 1. Parameter values of system components.
Treating the shared warm standby structure as a new component X, the system reliability $R(t)$ can be obtained. Considering the fault coverage of the warm standby structure, the PFC, FLC and ELC models are discussed in this paper. The PFC model corresponds to a coverage parameter of 1. We suppose that the system coverage parameter equals 0.85 in the FLC model, while the coverage parameters of element A and element B equal 0.95 and 0.9, respectively, in the ELC model.
Using MATLAB, the system reliability curve over time is obtained, as shown in Figure 4. At the mission time $t = 800$ h, the system reliability in the PFC, FLC and ELC models is 0.800, 0.756 and 0.777, respectively. For the mission time $t = 800$ h, partial derivatives of the system reliability function are taken with respect to the failure rate of each element; the results are shown in Table 2.
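The comparison of coverage models and the sensitivity calculation can be sketched in Python for the A-B-S sub-structure alone. The sketch assumes constant failure rates, evaluates the reliability of X by numerical quadrature over the time of the first primary failure, and estimates partial derivatives by central finite differences. The rate values are hypothetical (the parameter table is not reproduced here), the remaining elements C, D and E are not modelled, and the function names are illustrative, so the numbers will not match those reported for the full system.

```python
import math

def reliability_x(t, lam_a, lam_b, lam_s, alpha, c_a, c_b, n=20_000):
    """P(X > t) for the A-B-S structure with constant rates.

    c_a / c_b are the coverage probabilities applied when A / B fails first.
    """
    no_failure = math.exp(-(lam_a + lam_b) * t)
    h = t / n
    takeover = 0.0
    for i in range(n):
        tau = (i + 0.5) * h  # time of the first primary failure
        # A, B and the dormant S have all survived up to tau.
        alive = math.exp(-(lam_a + lam_b + alpha * lam_s) * tau)
        # After a covered failure, S and the other primary must last until t.
        a_first = lam_a * c_a * math.exp(-(lam_s + lam_b) * (t - tau))
        b_first = lam_b * c_b * math.exp(-(lam_s + lam_a) * (t - tau))
        takeover += alive * (a_first + b_first) * h
    return no_failure + takeover

T, ALPHA = 800.0, 0.5
rates = {"lam_a": 2e-4, "lam_b": 3e-4, "lam_s": 2e-4}  # hypothetical values

models = {"PFC": (1.0, 1.0), "FLC": (0.85, 0.85), "ELC": (0.95, 0.9)}
results = {name: reliability_x(T, alpha=ALPHA, c_a=ca, c_b=cb, **rates)
           for name, (ca, cb) in models.items()}

def sensitivity(name, c_a, c_b, rel_step=1e-4):
    """Central-difference estimate of dR/d(rate) at the mission time T."""
    up, down = dict(rates), dict(rates)
    up[name] *= 1 + rel_step
    down[name] *= 1 - rel_step
    diff = (reliability_x(T, alpha=ALPHA, c_a=c_a, c_b=c_b, **up)
            - reliability_x(T, alpha=ALPHA, c_a=c_a, c_b=c_b, **down))
    return diff / (2 * rates[name] * rel_step)

print(results)
print({name: sensitivity(name, 0.85, 0.85) for name in rates})
```

Because the takeover term scales with the coverage values, the sketch reproduces the qualitative ordering PFC > ELC > FLC for the coverage parameters quoted in the text.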

Table 2. Partial derivatives with respect to failure rate of each element
From Table 2, decreasing the failure rates of elements A, B and C is more useful for improving the system reliability than decreasing those of the other elements. The importance of each element to the system reliability also depends on the fault coverage model.

Weibull distributions
When the time-to-failure of the system components follows Weibull distributions, the component parameters are shown in Table 3. Using an approach similar to the case of exponential distributions, the system reliability with elements obeying Weibull distributions can be obtained, as shown in Figure 5. At the mission time $t = 800$ h, the changes of system reliability with the coverage parameters in the FLC model and the ELC model are shown in Figure 6 and Figure 7. Treating the system reliability as a function of the scale parameters of the elements, the change of mission reliability with the scale parameter of each component is presented in Figure 8.
From Figure 8, the system reliability changes more noticeably with the scale parameters of A, B and S than with those of the other components. Therefore, the key to enhancing the reliability of the system is improving components A, B and S. Besides, increasing the coverage parameter is important for the safe operation of the warm standby system.
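The kind of scale-parameter sweep described above can be sketched in Python for a deliberately simplified setting. The example uses the two-parameter Weibull reliability function $R(t) = \exp(-(t/\eta)^{\beta})$ and a hypothetical two-component series arrangement without any standby, so it only illustrates how mission reliability at a fixed time responds to one scale parameter. The component names, parameter values and function names are assumptions and do not reproduce the paper's system or its parameter table.

```python
import math

def weibull_reliability(t, scale, shape):
    """Two-parameter Weibull reliability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def series_reliability(t, params):
    """Reliability of independent components in series; params holds (scale, shape) pairs."""
    r = 1.0
    for scale, shape in params:
        r *= weibull_reliability(t, scale, shape)
    return r

T = 800.0  # mission time in hours
# Hypothetical (scale, shape) values for two components.
base = {"A": (3000.0, 1.5), "B": (2500.0, 1.2)}

def sweep(name, scales):
    """Mission reliability as the scale parameter of one component varies."""
    curve = []
    for eta in scales:
        params = dict(base)
        params[name] = (eta, base[name][1])  # change the scale, keep the shape
        curve.append((eta, series_reliability(T, params.values())))
    return curve

for eta, r in sweep("A", [1000.0, 2000.0, 3000.0, 4000.0]):
    print(eta, round(r, 4))
```

A larger scale parameter stretches the lifetime distribution, so the mission reliability in this sketch increases monotonically along the sweep.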

Conclusion
This paper proposes a new approach to the reliability analysis of systems with a shared warm standby element. The step function and the impulse function are introduced to describe the failure characteristics of warm standbys. Two kinds of imperfect fault coverage models, Element Level Coverage (ELC) and Fault Level Coverage (FLC), are considered in the system reliability evaluation. The proposed approach places no limitation on the type of time-to-failure distribution of the system components. For a given mission time, the changes of the warm standby system reliability with the time-to-failure parameters of each element and with the coverage parameters in the different fault coverage models are obtained.

Figure 2. General structure of imperfect coverage model. When a fault occurs, there are three possible outcomes, corresponding to transient restoration R, permanent coverage C and single-point failure S. Exit R represents a failure that is transient and causes no change to the state, while Exit C denotes a failure that is permanent and causes only the outage of the offending component. Exit S means the occurrence of a fault that causes the total system to fail. The probabilities $r_0$, $c_0$ and $s_0$ represent the occurrence probabilities of the R exit, C exit and S exit, respectively, with $r_0 + c_0 + s_0 = 1$. In this paper, the probability $r_0$ is assumed to be zero, which means that transient restoration is not considered as a fault.

Conditional PDF of output X in the FLC model:

$$\begin{aligned} f_{X|A,B,S}(x \mid a, b, s) ={} & [(1-c)\,u(b-a)\,u(s-a) + c\,u(a-b)\,u(s-a) + u(a-s)\,u(b-a)]\,\delta(x-a) \\ & + [(1-c)\,u(a-b)\,u(s-b) + c\,u(b-a)\,u(s-b) + u(b-s)\,u(a-b)]\,\delta(x-b) \\ & + [c\,u(s-a)\,u(b-s) + c\,u(s-b)\,u(a-s)]\,\delta(x-s) \end{aligned}$$

Figure 3. Dynamic fault tree of the example system.

Figure 4. System reliability curve changes over time.

Figure 5. System reliability curve changes over time in the case of Weibull distributions.

Figure 6. System reliability changes over the coverage parameter in FLC model.

Figure 7. System reliability changes over the coverage parameters in ELC model.
Reliability evaluation is important for reliability analysis. Considering the IPC of warm standby systems, the FLC model and the ELC model are used to analyze system reliability. In the FLC model, a coverage parameter $c$ represents the probability of success in detecting and isolating the faulty component. In this case, the conditional PDF of output X is denoted $f_{X|A,B,S}(x \mid a, b, s)$; its first term carries the factor $(1 - c)$, the probability that the fault is not covered.

Table 3. Parameter values of system components.