What is System Hang and How to Handle it

Yian Zhu1, Yue Li2, Jingling Xue2, Tian Tan3, Jialong Shi1, Yang Shen3, Chunyan Ma3
1 School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China
2 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
3 School of Software and Microelectronics, Northwestern Polytechnical University, Xi’an, P.R. China
{zhuya,machunyan}@nwpu.edu.cn {yueli,jingling}@cse.unsw.edu.au {silverbullettt,jialong.tea,yangfields}@gmail.com

Abstract
Almost every computer user has encountered an unresponsive system failure or system hang, which leaves the user no choice but to power off the computer. In this paper, the causes of such failures are analyzed in detail and one empirical hypothesis for detecting system hang is proposed. This hypothesis exploits a small set of system performance metrics provided by the OS itself, thereby avoiding modifying the OS kernel and introducing additional cost (e.g., hardware modules). Under this hypothesis, we propose SHFH, a self-healing framework to handle system hang, which can be deployed on OS dynamically. One unique feature of SHFH is that its “light-heavy” detection strategy is designed to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature is that its diagnosis-based recovery strategy offers a better granularity to recover from system hang. Our experimental results show that SHFH can cover 95.34% of system hang scenarios, with a false positive rate of 0.58% and 0.6% performance overhead, validating the effectiveness of our empirical hypothesis.

Keywords-System Hang, Operating System, Self-Healing Framework, Fault Detection and Recovery

1. Introduction
Almost every computer user has encountered such a scenario in which all windows displayed on a computer monitor become static and the whole computer system ceases to respond to user input.
Sometimes even the mouse cursor does not move. “Unresponsiveness”, “freeze” and “hang” have been used to describe such a phenomenon, with “hang” being the most popular [1]–[4], [6], [7], [9], [12]. Note that the failure of a single program to respond to user input is regarded as application hang, which is not the focus of this paper. Unlike other failures (e.g., invalid opcode and general protection fault) whose causes can be detected directly by hardware [13], system hang usually cannot be detected by hardware or even perceived by the operating system (OS) (except for some severe cases detected only partially by watchdog mechanisms provided by some modern OSes). This leaves the user no choice but to power the system off. As a result, the OS fails to provide continuous services, and the user may lose valuable data. Worse still, if the computer system is deployed in mission-critical applications, e.g., nuclear reactors, system hang may lead to devastating consequences.

By observing existing studies dealing with system hang, we draw two conclusions. First, most studies, although effective in certain cases, address only certain system hang scenarios [1]–[5]. One main explanation for this is that it is difficult to analyze the causes of system hang, and accordingly, each study focuses on its own assumptions about the causes. As a result, it is necessary to study the causes of system hang more comprehensively. Second, most methodologies for detecting system hang need additional assistance, provided by new hardware modules [7], modified OS kernels [1], [5], or monitor breakpoints inserted dynamically into code regions of interest [4]. Can we rely on the existing services provided by the OS to detect system hang effectively? An attempt made in [2] does this by monitoring only I/O throughput, but it fails if a hang occurs within OS code not related to I/O.
The work of [8] is developed on the assumption that statistical models of processes, for such metrics as CPU and memory utilization, may reveal the slowness of the system (similar to system hang). However, since the causal relationship between the statistical models of processes and the slowness of the system has not been validated, the effectiveness of this assumption remains unclear. As a result, whether existing OS services can be utilized to detect system hang becomes an attractive question, since an affirmative answer implies that no additional cost will be incurred.

The main contributions of this paper are as follows. We give a new characterization of system hang based on the two popular views about it (as described in Section 2.1). In addition, the causes of system hang are analyzed in detail from two aspects: indefinite wait for system resources (resources not released or released slowly) and infinite loop under interrupt and
preemption constraints. Accordingly, we present six types of faults responsible for system hang.

We propose a self-healing framework to handle system hang automatically and refer to it as SHFH, which can be deployed on an OS (currently implemented on Linux) dynamically. One unique feature is that a “light-heavy” detection strategy is adopted to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature lies in its diagnosis-based recovery strategy, which is designed to provide a better granularity for system hang recovery.

We have selected UnixBench [22] as our benchmark suite, and injected six types of faults into UnixBench to cause system hang among 9 bench workloads representing at least 95% of kernel usage [26]. By analyzing a total of 68 performance metrics (e.g., context switches per second and number of runnable tasks), all provided by the OS itself, over 1080 experiments under normal and anomalous workloads, and after further experimental validation using both UnixBench and LTP (Linux Test Project) [21], we find that 9 common performance metrics are sufficient as the basis to detect most system hang problems without requiring any additional assistance (e.g., new hardware modules or kernel modification).

The rest of this paper is organized as follows. Section 2 describes what system hang is and what causes it. Section 3 discusses whether empirical system performance metrics can be utilized to detect system hang. According to the hypothesis presented in Section 3, SHFH is proposed and described in detail in Section 4. Section 5 evaluates SHFH and accordingly validates the effectiveness of the hypothesis made in Section 3. Section 6 discusses the related work and Section 7 concludes the paper.

2. System Hang and Causes
There is no standard definition of system hang.
In Section 2.1, we give a new characterization of system hang as our analysis foundation according to the two existing views about it. The causes of system hang are analyzed in detail in Section 2.2.

2.1. What is System Hang
There are two popular views. Studies [1], [3], [5], [7] describe system hang as a state in which the OS does not relinquish the processor and does not schedule any process, i.e., the system is totally hung and does not allow other tasks to execute or respond to any user input. On the other hand, studies [2], [4], [8], [9], [11] consider that the system enters a hang state when the OS gets partially or completely stalled and does not respond to user-space applications. We prefer the second view because it covers a broader scope of hang scenarios, which accords with our daily human-computer interaction experience. Based on it, a new characterization of system hang is given below.

System hang is a fuzzy concept which depends on the criteria of the observer: the system gets partially or completely stalled, and most services become unresponsive, or respond to user inputs with an obvious latency (an unacceptable length of time according to the observer).

2.2. Causes of System Hang
Tasks need to run effectively to provide services. In other words, if tasks cannot run, or run without doing useful work, users become aware that services are unavailable (unresponsive). Accordingly, whatever causes tasks to be unable to run (i.e., tasks waiting for resources that will never be released) or to do useless work (i.e., tasks falling into an infinite loop) contributes to system hang. Note that a task that falls into an infinite loop can still be interrupted or preempted by other tasks. Besides, some system hangs recover automatically after a period of time because the resources held by other tasks are released slowly.
In this situation, if users have no patience to wait for a long time (until resources are released), system hang is considered to have happened.

Consequently, we analyze the causes of system hang from two aspects: infinite loop under interrupt and preemption constraints, and indefinite wait for system resources (resources not released or released slowly). Accordingly, six types of faults are distinguished as shown in Figure 1.

System Hang
  Infinite loop
    Interrupt disabled: F1
    Interrupt enabled
      Preemption enabled (loop in kernel): F2
      Preemption disabled: F3
  Indefinite wait
    Resources not released
      Deadlock (except spinlock): F4
    Resources released slowly
      Sleeping while holding locks: F5
      Abnormal resource consumption: F6
      Holding resources too long during correct operations
Figure 1. Categories of system hang causes (F: Fault in abbreviation)

2.2.1. Infinite Loop
When interrupts are disabled (F1), even a clock interrupt cannot be serviced. As a result, if the running task does not relinquish the CPU on its own, i.e., falls into an infinite loop, other tasks have no chance to be executed. In the case with interrupts enabled but preemption disabled (F3), the CPU can respond to interrupts; however, even tasks with higher priority cannot be executed, thus making some services provided by the ready tasks unavailable. Even when both interrupts and preemption are enabled, a task that falls into an infinite loop in the kernel (F2) (certain OSes, e.g., Linux since version 2.6, support a kernel preemption mechanism) still cannot be preempted unless all the locks held by the task are released, or the task is blocked or explicitly calls the schedule function; however, an infinite loop in the kernel offers little chance of satisfying these conditions, thus giving the OS little opportunity to schedule other tasks. Generally, infinite loops can be explained by two scenarios: (1) an interrupt (preemption) enable operation cannot be executed due to an infinite loop formed earlier, and (2) an interrupt (preemption) disabled/enabled pair falls inside an infinite loop. Faults related to spinlocks, e.g., double spinlocks, are also categorized into F1 (the first scenario) due to their mechanism of busy waiting for locks after interrupts are disabled. Even in a multi-core computer, the stall of only one core can cause the freeze of the whole system for certain reasons, e.g., the synchronization mechanism between different cores. Indeed, this phenomenon does occur frequently in our experiments.

2.2.2. Indefinite Wait
Awaiting resources (e.g., signals, semaphores, I/O, interrupts or memory space) indefinitely can be explained as waiting for the requested resources either infinitely or for a long time (depending on the patience of users). The deadlock described in F4 does not include the circumstance triggered by spinlocks, even though double spinlocks (which belong to F1) are also a kind of deadlock. If tasks or a piece of kernel code that have several interactions with other tasks are trapped by deadlock, system hang may occur due to the sudden loss of key internal services. In general, sudden disappearance of resources (e.g., peripheral devices, pipes) also belongs to F4. The OS provides no mechanism to ensure that a task holding a spinlock will not fall into a sleep state. As a result, F5 may cause system hang because tasks that wait for the spinlock to be released have to run on the CPU in a busy-waiting way, thus providing no chance to schedule other tasks. F6 is usually relevant to anomalous memory consumption, since not enough memory space can be immediately provided to newly forked tasks or to those swapped in again. The classical malicious program “fork bomb” (fork infinitely) also belongs to F6.
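As an illustration only (it is not part of SHFH), the fault categories of Figure 1 can be summarized as a small lookup. The sketch below assumes Python; the names `Fault`, `CAUSE`, `classify_infinite_loop` and `classify_deadlock` are hypothetical and introduced here purely to restate the classification rules given in Sections 2.2.1 and 2.2.2.

```python
from enum import Enum


class Fault(Enum):
    """Fault types F1-F6 as named in Figure 1."""
    F1 = "infinite loop, interrupts disabled"
    F2 = "infinite loop in kernel, interrupts and preemption enabled"
    F3 = "infinite loop, interrupts enabled, preemption disabled"
    F4 = "deadlock (except spinlock)"
    F5 = "sleeping while holding locks"
    F6 = "abnormal resource consumption"


# Top-level cause of each fault (hypothetical table, following Figure 1).
CAUSE = {
    Fault.F1: "infinite loop",
    Fault.F2: "infinite loop",
    Fault.F3: "infinite loop",
    Fault.F4: "indefinite wait",
    Fault.F5: "indefinite wait",
    Fault.F6: "indefinite wait",
}


def classify_infinite_loop(interrupts_enabled: bool, preemption_enabled: bool) -> Fault:
    """Map the interrupt/preemption state of a looping kernel task to F1-F3."""
    if not interrupts_enabled:
        return Fault.F1  # not even a clock interrupt is serviced
    if not preemption_enabled:
        return Fault.F3  # interrupts are handled, but no task can preempt
    return Fault.F2      # preemptible in principle, yet the loop rarely allows it


def classify_deadlock(involves_spinlock: bool) -> Fault:
    """Spinlock deadlocks busy-wait with interrupts disabled, so they count as F1."""
    return Fault.F1 if involves_spinlock else Fault.F4


if __name__ == "__main__":
    fault = classify_infinite_loop(interrupts_enabled=True, preemption_enabled=False)
    print(fault.name, "-", CAUSE[fault])
```

The two helper functions cover only the cases for which the text states an explicit rule; F5 and F6 are listed in the table but are not derived from any input, since the paper describes them by example rather than by a decision rule.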
Holding resources for a long time during correct operations, e.g., copying many files simultaneously to peripheral devices, may cause temporary system hang. However, this situation is not considered a cause of system hang, since it is a correct operation and varies with system configuration. Note that although F5 and F6 may release resources after a while (e.g., the task holding the spinlock is woken up and executes an unlock operation), they are considered causes of system hang because they occur due to inappropriate operations.

3. Empirical Detection Metrics
The difficulty in handling system hang lies in how to detect it, since the OS offers no mechanism to inform itself when it enters a hang state. Most studies (as described in Section 1) detect system hang through additional assistance (e.g., hardware modules or kernel modification). This section investigates whether exploiting the services provided by the OS itself can help detect system hang. In Section 3.1, we first introduce a hypothesis about empirical metrics used for system hang detection. According to this hypothesis, the research questions about detection metrics are proposed in Section 3.2. In Section 3.3, we conduct experiments to determine which metrics should be selected to detect system hang. Finally, in Section 3.4, we discuss how to use the selected performance metrics to detect system hang.

3.1. Hypothesis of Detection Metrics
We choose system performance metrics (e.g., context switches per second and number of runnable tasks) as the targets of detection because they are provided by most OSes and reflect overall performance when the system slows down. Our detection metrics are hypothesized as follows:

Hypothesis: Combined with a theoretical analysis, a subset of system performance metrics can be regarded as a sufficient basis to determine whether system hang occurs.

3.2. Research Questions
Since system performance metrics are uncontrollable, it is impossible to build a mapping from performance metrics to a hang state. As a result, the other direction, i.e., observing the values of performance metrics when the system enters a hang state, can be attempted to help understand which metrics may indicate system hang. Note that, in this situation, the influenced performance metrics are necessary rather than sufficient to detect system hang. As a result, whether the selected metrics are also sufficient needs to be validated (empirically in Section 5). According to the hypothesis (Section 3.1) and the analysis above, we seek to answer the following research questions:

RQ1 Among hundreds of system performance metrics provided by the OS, which ones should be selected?
RQ2 How to determine system hang with the system performance metrics?

Sections 3.3 and 3.4 answer the two research
questions respectively.

3.3. Which Performance Metrics to Select
In this section, we investigate experimentally which metrics to select to detect system hang by observing if a metric changes abnormally under hang scenario. First, we describe our experimental setup. Then, we use an example to show how these experiments work. Finally, the system performance metrics which have potential to detect system hang are selected according to our experimental results.

3.3.1. Experiment Setup
The six types of faults (see Section 2) that cause system hang are considered as the injected faults, which are implemented as errant kernel modules and loaded dynamically under different workloads. Accordingly, the activation rate of injected faults to cause system hang is 100%. We select 68 system performance metrics (e.g., number of tasks currently blocked and percentage of time spent by soft interrupt requests) as the observation targets. To observe the general variations of performance metrics under sufficient workloads, 9 programs (context1, dhry, fstime, hanoi, shell8, pipe, spawn, syscall, and execl) in the benchmark suite (UnixBench 5.1.2) are selected, which could represent at least 95% of kernel usage [26]. Experiments are performed on two computers. One with Intel Core i5 650, 3.20GHz CPU (seen as 4 CPUs by OS) and 4GB RAM, and the other one with Intel Pentium 4, 3.20GHz CPU (seen as 2 CPUs by OS) and 512MB RAM. We consider a Linux kernel (version 2.6.32) as our experimental operating system. To guarantee the generality of the experimental results, each type of injected faults is loaded and executed under each selected UnixBench workload 10 times in each computer. Consequently, the total number of experiments conducted is 6 × 9 × 10 × 2 = 1080.

3.3.2. An Example
We choose F5 and inject it in the pipe workload of UnixBench running on the computer with Intel Core i5 650, 3.20GHz CPU and 4GB RAM. Although experienced programmers avoid using semaphores after a spinlock to make an unlock operation executed quickly, they may ignore whether the called functions after a spinlock have operations on semaphores or sleep. As a result, tasks which wait for the spinlock to be released (the task holding the spinlock falls asleep due to the down() operation on semaphore or explicitly sleep operation, F5) have to run on CPU in a busy waiting way, leaving no chance for other tasks to run. We inject the sleeping kernel module with a spinlock A at the 23rd second, and inject the kernel modules which acquire A at the 39th, 51st and 59th seconds consecutively. As shown in Figure 2-(a), 2-(b) and 2-(c), metric sys (percentage of time spent by system call and exception) reaches and holds 100%, and the value of metric usr (percentage of time spent by application) is still zero after injecting the respective kernel module. Finally, CPU0 cannot execute the user program any more when the system enters a hang state (see Figure 2-(d)). In addition, after the 59th second, the number of context switches per second (cs) (as shown in Figure 2-(e)) is small since the other three CPUs are occupied by the injected kernel codes. Although some metrics vary obviously after the injection of the faults, e.g., the number of runnable tasks under the pipe workload (Figure 2-(f)), they may not be selected as detection metrics, since the value of influenced metrics may be normal in other workloads (e.g., the number of runnable tasks for the shell8 workload as shown in Figure 2-(f)).

After injecting F5 into the pipe workload 10 times and finishing the experiments of F5 in other 8 workloads of UnixBench, the general detection metrics selected for F5 are usr, sys per CPU, and cs.

3.3.3. Experimental Conclusion
Similar to the methodology adopted by the above example, other experiments are implemented, and the experimental results are given in Table 1. Metric iowait represents the percentage of time spent by I/O wait. run means the number of tasks in the running state and blk records the number of tasks currently blocked. Metric pswpout means the number of pages swapped out per second and memfree records the unused space of memory. util means the percentage of CPU time during which I/O requests were issued to the device. The 9 system performance metrics in Table 1 are considered as the metrics to detect system hang. F1, F2 and F3 have the same detection metrics since they all consume CPU inappropriately. F4 makes the tasks sleep to wait for the services provided by the tasks which are trapped in deadlock, thus it has no influence on the CPU metrics. Because F5 makes the tasks run on CPUs in a way of busy waiting, its metrics are similar to the ones related to infinite loops. As for F6, since it has relevance to consumption of large resources, its detection metrics should be related to memory and I/O.

Table 1. Performance metrics used to detect system hang
[Columns: metrics grouped as CPU, Process, Memory, Disk I/O; rows: faults F1, F2, F3, F5, F6. Cell entries were lost in extraction.]

3.4. How to Determine System Hang
The values of several monitored metrics of system under the normal execution are quite different from those of a hang system. During normal execution, each value of a monitored metric has an acceptable range. The system is considered healthy when each monitored metric is among its acceptable range. By
questions respectively.

3.3. Which Performance Metrics to Select

In this section, we experimentally investigate which metrics to select for detecting system hang by observing whether a metric changes abnormally under a hang scenario. First, we describe our experimental setup. Then, we use an example to show how these experiments work. Finally, we select the system performance metrics that have the potential to detect system hang according to our experimental results.

3.3.1. Experiment Setup

The six types of faults (see Section 2) that cause system hang are used as the injected faults; they are implemented as errant kernel modules and loaded dynamically under different workloads. Accordingly, the activation rate of the injected faults to cause system hang is 100%. We select 68 system performance metrics (e.g., the number of tasks currently blocked and the percentage of time spent on soft interrupt requests) as the observation targets. To observe the general variations of performance metrics under sufficiently varied workloads, we select 9 programs (context1, dhry, fstime, hanoi, shell8, pipe, spawn, syscall, and execl) from the benchmark suite UnixBench 5.1.2, which could represent at least 95% of kernel usage [26].

Experiments are performed on two computers: one with an Intel Core i5 650 3.20GHz CPU (seen as 4 CPUs by the OS) and 4GB RAM, and the other with an Intel Pentium 4 3.20GHz CPU (seen as 2 CPUs by the OS) and 512MB RAM. We use a Linux kernel (version 2.6.32) as our experimental operating system. To guarantee the generality of the experimental results, each type of injected fault is loaded and executed 10 times under each selected UnixBench workload on each computer. Consequently, the total number of experiments conducted is 6 × 9 × 10 × 2 = 1080.

3.3.2. An Example

We choose F5 and inject it into the pipe workload of UnixBench running on the computer with the Intel Core i5 650 3.20GHz CPU and 4GB RAM.
Although experienced programmers avoid using semaphores after acquiring a spinlock so that the unlock operation is executed quickly, they may overlook whether the functions called while the spinlock is held operate on semaphores or sleep. As a result, when the task holding the spinlock falls asleep due to a down() operation on a semaphore or an explicit sleep operation (F5), the tasks waiting for the spinlock to be released have to busy-wait on the CPU, leaving no chance for other tasks to run.

We inject the sleeping kernel module, which holds a spinlock A, at the 23rd second, and inject the kernel modules that acquire A at the 39th, 51st and 59th seconds in turn. As shown in Figures 2-(a), 2-(b) and 2-(c), the metric sys (percentage of time spent on system calls and exceptions) reaches and holds 100%, and the metric usr (percentage of time spent by applications) stays at zero, after the respective kernel module is injected. Finally, CPU0 cannot execute the user program any more when the system enters a hang state (see Figure 2-(d)). In addition, after the 59th second, the number of context switches per second (cs), shown in Figure 2-(e), is small since the other three CPUs are occupied by the injected kernel code.

Some metrics vary noticeably after the injection of the faults, e.g., the number of runnable tasks under the pipe workload (Figure 2-(f)). However, they cannot be selected as detection metrics, since their values may remain normal under other workloads (e.g., the number of runnable tasks for the shell8 workload, also shown in Figure 2-(f)). After injecting F5 into the pipe workload 10 times and completing the F5 experiments in the other 8 UnixBench workloads, the general detection metrics selected for F5 are usr and sys per CPU, and cs.

3.3.3. Experimental Conclusion

Following the same methodology as in the above example, we carry out the other experiments; the experimental results are given in Table 1.
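As an illustration of the scale of this campaign, the runs described in Section 3.3.1 can be enumerated as a simple cross product. This is only a minimal sketch: the fault labels F1-F6 and the workload names come from the text, while the machine labels and the function name `experiment_matrix` are hypothetical shorthand introduced here.

```python
from itertools import product

# Fault types F1-F6 and the nine UnixBench workloads are named in the text.
FAULTS = [f"F{i}" for i in range(1, 7)]
WORKLOADS = ["context1", "dhry", "fstime", "hanoi", "shell8",
             "pipe", "spawn", "syscall", "execl"]
# Hypothetical labels for the two test computers (not names used by the authors).
MACHINES = ["core-i5-650", "pentium-4"]
REPETITIONS = 10


def experiment_matrix():
    """Return every (fault, workload, machine, repetition) combination."""
    return list(product(FAULTS, WORKLOADS, MACHINES,
                        range(1, REPETITIONS + 1)))


runs = experiment_matrix()
print(len(runs))  # 6 * 9 * 2 * 10 = 1080
```

Each tuple corresponds to one load of an errant kernel module under one workload on one machine, which is how the total of 1080 experiments arises.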
Metric iowait represents the percentage of time spent waiting for I/O. run is the number of tasks in the running state and blk is the number of tasks currently blocked. pswpout is the number of pages swapped out per second and memfree is the amount of unused memory. util is the percentage of CPU time during which I/O requests were issued to the device. The 9 system performance metrics in Table 1 are considered as the metrics to detect system hang. F1, F2 and F3 have the same detection metrics since they all consume CPU inappropriately. F4 makes tasks sleep while waiting for services provided by tasks that are trapped in deadlock; thus it has no influence on the CPU metrics. Because F5 makes tasks busy-wait on CPUs, its metrics are similar to those related to infinite loops. As for F6, since it involves the consumption of large amounts of resources, its detection metrics should be related to memory and I/O.

Table 1. Performance metrics used to detect system hang
Fault \ Metrics: CPU, Process, Memory, Disk I/O
sys usr iowait run blk cs pswpout memfree util
F1 √ √ √
F2 √ √
F3 √ √ √
F4 √ √
F5 √ √ √
F6 √ √ √ √ √ √

3.4. How to Determine System Hang

The values of several monitored system metrics under normal execution are quite different from those of a hung system. During normal execution, the value of each monitored metric has an acceptable range. The system is considered healthy when each monitored metric is within its acceptable range. By
Figure 2. Performance metrics records with F5 in pipe workload of UnixBench. Panels: (a) CPU2; (b) CPU3; (c) CPU1; (d) CPU0; (e) number of context switches per second; (f) number of current runnable tasks.
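The per-CPU usr and sys percentages and the context-switch rate plotted in Figure 2 are the kind of values that Linux exposes through /proc/stat. The text does not state which tool was used to collect them, so the following is only a minimal sketch under that assumption: it derives the three metrics from two /proc/stat-style snapshots. The function names and the sample snapshots are hypothetical.

```python
def parse_proc_stat(text):
    """Parse /proc/stat-style text into per-CPU jiffy counters and ctxt."""
    cpus, ctxt = {}, 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        # Keep only per-CPU lines (cpu0, cpu1, ...), not the aggregate "cpu".
        if name.startswith("cpu") and name[3:].isdigit():
            values = [int(v) for v in fields[1:]]
            # Sum user..steal only; guest time is already counted in user.
            cpus[name] = {"usr": values[0], "sys": values[2],
                          "total": sum(values[:8])}
        elif name == "ctxt":
            ctxt = int(fields[1])
    return cpus, ctxt


def rates(before, after, interval_s):
    """Return per-CPU usr/sys percentages and context switches per second."""
    cpus0, ctxt0 = parse_proc_stat(before)
    cpus1, ctxt1 = parse_proc_stat(after)
    per_cpu = {}
    for name, now in cpus1.items():
        prev = cpus0[name]
        total = now["total"] - prev["total"]
        if total <= 0:
            continue  # no elapsed jiffies for this CPU; skip it
        per_cpu[name] = {
            "usr": 100 * (now["usr"] - prev["usr"]) / total,
            "sys": 100 * (now["sys"] - prev["sys"]) / total,
        }
    return per_cpu, (ctxt1 - ctxt0) / interval_s


# Hypothetical snapshots taken one second apart (not measured data).
BEFORE = "cpu0 100 0 100 800 0 0 0 0\ncpu1 100 0 100 800 0 0 0 0\nctxt 1000\n"
AFTER = "cpu0 100 0 200 800 0 0 0 0\ncpu1 150 0 150 800 0 0 0 0\nctxt 1050\n"

per_cpu, cs = rates(BEFORE, AFTER, 1.0)
print(per_cpu["cpu0"], cs)
```

In the hypothetical snapshots, cpu0 spends the whole interval in system mode with no user time, which is the pattern the F5 example reports for the CPUs occupied by the injected kernel modules. On a real system, the two snapshots would be read from /proc/stat at the start and end of the sampling interval.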