Vol. 8 No. 1, CAAI Transactions on Intelligent Systems, Feb. 2013
DOI: 10.3969/j.issn.1673-4785.201209059
Network Publishing Address: http://www.cnki.net/kcms/detail/23.1538.TP.20130205.1834.001.html

Immune based computer virus detection approaches

TAN Ying1,2, ZHANG Pengtao1,2
(1. Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; 2. Key Laboratory of Machine Perception, Ministry of Education, Peking University, Beijing 100871, China)

Abstract: The computer virus is considered one of the most serious threats to the security of computer systems worldwide. The rapid development of evasion techniques used by viruses has made signature based computer virus detection techniques ineffective. Many novel computer virus detection approaches have been proposed to cope with this ineffectiveness; they fall mainly into three categories: static, dynamic and heuristic techniques. Owing to the natural similarities between the biological immune system (BIS) and the computer security system (CSS), the artificial immune system (AIS) was developed as a new prototype in the anti-virus research community. The immune mechanisms in the BIS provide opportunities to construct computer virus detection models that are robust and adaptive, with the ability to detect unseen viruses. In this paper, a variety of classic computer virus detection approaches are introduced and reviewed against the background of computer virus history. Next, a variety of immune based computer virus detection approaches are discussed in detail. Promising experimental results suggest that immune based computer virus detection approaches are able to detect new variants and unseen viruses at lower false positive rates, which has paved a new way for anti-virus research.
Keywords: computer virus detection; artificial immune system; immune algorithms; hierarchical model; negative selection algorithm with penalty factor
CLC Number: TP309.5  Document Code: A  Article ID: 1673-4785(2013)01-0080-15

Received Date: 2012-09-27. Network Publishing Date: 2013-02-05.
Foundation Item: National Natural Science Foundation of China (No. 61170057, 60875080).
Corresponding Author: TAN Ying. E-mail: ytan@pku.edu.cn

Due to the rapid development of computer technology and the Internet, the computer has become a part of daily life in the 21st century. Meanwhile, computer security systems are attracting more and more attention. Computer viruses, new variants and unseen viruses in particular, have been one of the most dreadful threats to computers worldwide. Today viruses are becoming more complex, with faster propagation and a stronger capacity for latency, destruction and infection. At present a virus is able to spread all over the world in a matter of minutes and cause huge economic losses. Protecting computers from these various types of viruses has become priority number one.

Currently, several companies produce various anti-virus products, most of which are based on signatures. These products are usually able to detect known viruses effectively with low false positive rates and low overheads. Unfortunately, the same products fail to detect new variants and unseen viruses. With metamorphic and polymorphic techniques, even a layman can easily develop new variants of known viruses using virus automatons. For example, more than 580 variants of Agobot, which makes use of polymorphism to evade detection and disassembly, have been observed since its initial release. Thus, traditional signature based computer virus detection approaches are no longer suitable for the new environments, and dynamic and heuristic techniques have started to emerge.

Dynamic techniques, such as virtual machines,
keep watch over the execution of every program at run-time and stop the program once it tries to harm the system. Most of these techniques monitor the behaviors of a program by analyzing the application programming interface (API) call sequences generated at runtime. Because of the huge overheads of monitoring API calls, it is practically impossible to deploy the dynamic techniques on personal computers at this time.

Data mining approaches, among the most popular heuristics, try to mine frequent patterns or association rules and use classic classifiers to detect viruses. These approaches have led to some success. However, data mining approaches lose the semantic information of the code and cannot easily recognize unseen viruses in the long term.

The computer virus is named after the biological virus because of the similarities between them, such as parasitism, propagation and infection. The biological immune system (BIS) has protected organisms from antigens for a long time, successfully solving the problem of detecting unseen antigens. Inspired by the BIS, applying immune mechanisms to detect computer viruses has developed into a new anti-virus field in the past few years, attracting many researchers. Forrest et al. applied immune theory to computer anomaly detection for the first time in 1994[3]. Since then, many researchers have proposed various kinds of virus detection approaches and achieved some success. Some of them are mainly derived from ARTIS[4-6].

As time goes on, more and more immune mechanisms become clear, which lays a good foundation for the development of the AIS. On this basis, many immune based computer virus detection approaches have been proposed, in which more and more immune mechanisms are involved. The AIS continues to simulate the BIS ever more closely, and the immune based computer virus detection approaches have paved a new way for anti-virus research.

The authors of this paper have done some related work in the anti-virus field and achieved some success[7]. They have tried to make full use of the relationships among different features in a virus sample by constructing an immune based hierarchical model[7]. On the basis of the traditional negative selection algorithm (NSA), a novel negative selection algorithm with penalty factor (NSAPF) was proposed to overcome the drawback of the traditional NSA in defining the harmfulness of "self" and "nonself". It focuses on the danger of the code and greatly improves the effectiveness of the NSAPF based virus detection model.

The rest of this paper is organized as follows. In Section 1, the background knowledge of computer viruses is introduced. Section 2 presents a variety of classic computer virus detection approaches. In Section 3, the artificial immune system and immune based computer virus detection approaches are briefly described. Our previous work and the conclusions are presented in Sections 4 and 5, respectively.

1 Computer virus

1.1 Definition and features

In a narrow sense, a computer virus is a program that can infect other programs by modifying them to include a possibly evolved copy of itself. In a broad sense, a computer virus denotes all malicious code, that is, any program designed to harm or secretly access a computer system without the owner's informed consent, such as viruses in the narrow sense, worms, backdoors and Trojans. As computer viruses have developed, the lines between the different types of viruses have become blurred. In this paper, all programs that are not authorized by users and perform harmful operations in the background are referred to as viruses.

The features of computer viruses are listed below.

1) Infectivity: Infectivity is the fundamental and essential feature of the computer virus in the narrow sense, and it is the foundation for detecting a virus. When a virus intrudes into a computer system, it starts to scan for programs and computers on the Internet that can be infected. Next, through self-duplication, it spreads to the other programs and computers.

2) Destruction: According to the extent of destruction, viruses are divided into "benign" viruses and malignant viruses. "Benign" viruses, such as GENP and W-BOOT, merely occupy system resources, while malignant viruses usually have clear purposes. They can destroy data, delete files, and even format diskettes.

3) Concealment: Computer viruses often attach
themselves to benign programs and start up with the host programs. They perform harmful operations in the background, hidden from users.

4) Latency: After intruding into a computer system, viruses usually hide themselves from users instead of attacking the system immediately. This feature gives the viruses longer lives. They spread themselves and infect other programs in this period.

5) Trigger: Most viruses have one or more trigger conditions. When these conditions are satisfied, the viruses begin to destroy the system.

Other features of viruses include illegality, expressiveness, and unpredictability.

1.2 Development phases of the viruses

Viruses have evolved along with computer technology all the time. The development of viruses has gone through roughly the following phases.

1) DOS boot phase: Fig.1 and Fig.2 illustrate the boot procedures of DOS without and with a boot sector virus, respectively. Before the computer system obtains control, the virus starts up, modifies the interrupt vector and copies itself to infect the diskette. These are the original infection procedures of viruses. Moreover, similar infection procedures can be found in today's viruses.

2) DOS executable phase: In this phase, viruses exist in a computer system in the form of executable files. They take control of the system when users run applications infected by the viruses. Most viruses now are executable files.

3) Virus generator phase: Virus generators, also called virus automatons, can generate new variants of known viruses with different signatures. Metamorphic techniques, including instruction reordering, code expansion, code shrinking and garbage code insertion, are used here to defeat virus scanners that are based on virus signatures.

4) Macro virus phase: Before the emergence of macro viruses, all viruses infected only executable files, as this was almost the only way for a virus to obtain the right of execution. When users run the host of a virus, the virus starts up and controls the system. Infecting data files could not help a virus to run itself. The emergence of macro viruses changed this situation, and their targets are data files, mainly Microsoft Office files.

5) Virus techniques merging with hacker techniques: Nowadays the merging of virus techniques and hacker techniques has become a trend. It gives viruses much stronger concealment and latency and much faster propagation than ever before.

Fig.1 Normal boot procedure of DOS (flowchart steps: System power on; Enter ROM-BIOS; Read boot sector to 0:7C00H; System reset; Read in COMMAND.COM; Complete the disk bootstrap)

Fig.2 Boot procedure of DOS with boot sector virus (flowchart steps: System power on; Enter ROM-BIOS; Read boot sector to 0:7C00H; System reset; Read in COMMAND.COM; Complete the disk bootstrap. Virus branch: Read the boot sector virus; Run the virus; Modify interrupt vector; Copy the virus and infect the disk)

2 Classic virus detection approaches

The computer virus has become a major threat to the security of computers and the Internet worldwide. A wide range of host-based anti-virus solutions have been proposed by many researchers and companies. These anti-virus techniques can be broadly classified into three categories: static techniques, dynamic techniques and heuristics.

The fight between viruses and anti-virus techniques is fiercer now than ever before. Viruses disguise themselves by using various kinds of evasion techniques, such as metamorphic and polymorphic techniques, packers and encryption. To cope with the new situation, the anti-virus techniques unpack the suspicious programs, decrypt them
and try to be robust to those evasion techniques. Nevertheless, viruses evolve anti-unpacking and anti-decryption measures and go on to defeat the anti-virus techniques again. The fight will never stop, and virus techniques will always be ahead of anti-virus techniques. What we can do is to increase the difficulty of intrusion, decrease the losses caused by viruses and react to them as soon as possible.

2.1 Static techniques

Static techniques usually work on the bit strings, assembly code, and application programming interface (API) calls of a program without running the program. One of the most famous static techniques is the signature based virus detection technique.

The signature based virus detection technique is the mainstream anti-virus approach, and most commercial anti-virus products are based on it. A signature is usually a bit string that is extracted from a virus sample and identifies the virus uniquely. The signature based anti-virus products are referred to as scanners in this paper.

In order to extract a signature from a virus, anti-virus experts first disassemble the virus into assembly code. Then they analyze it at the semantic level to figure out the mechanisms and workflow of the virus. Finally, a signature is extracted to characterize the virus uniquely.

This technique is able to detect known viruses very quickly with low false positive rates and high true positive rates. It is one of the simplest approaches, with minimal overheads. Nevertheless, since the signature of a new virus can only be extracted by experts after the virus has broken out, it takes a long time before the new virus can be detected effectively, and the losses already caused by the virus cannot be recovered. Furthermore, with the development of virus techniques, many evasion techniques help viruses evade signature based scanners, such as metamorphic and polymorphic techniques, packers, and encryption. The signature based anti-virus techniques are easily defeated by these techniques. For example, simple program entry point modifications consisting of two extra jump instructions effectively defeat most signature based scanners.

To conclude, the signature based anti-virus techniques are vulnerable to unseen viruses and to the evasion techniques of viruses. As a result, a variety of dynamic and heuristic anti-virus approaches have been developed to cope with these situations.

2.2 Dynamic techniques

Computer viruses often show some special behaviors when they harm computer systems, for example, write operations on executable files, dangerous operations (e.g., formatting a diskette), and switching between a virus and its host. These behaviors give us an opportunity to recognize viruses. Based on this idea, the dynamic techniques keep watch over the execution of every program at run-time and observe the behaviors of the program. They stop the program once it tries to harm the computer system. The dynamic techniques usually utilize the operating system's API sequences, system calls and other kinds of behavior characteristics to identify the purpose of a program[14].

There are two main types of dynamic techniques: the behavior monitoring approach and the virtual machine approach.

Based on the assumption that viruses have some special behaviors that identify them and never appear in benign programs, behavior monitors keep watch over every behavior of a program and aim to prevent the destruction caused by dangerous operations.

This approach is considered able to detect known viruses, new variants and unseen viruses, but it is very dangerous because the viruses run on a real computer. If a behavior monitor fails to kill a virus, the virus takes control of the computer. Moreover, the overheads introduced by a behavior monitor are too large for personal computers. The false positive rate of this approach is inevitably high, and the approach cannot recognize the type and name of a virus, so it cannot eliminate the virus from a computer. Furthermore, it is very hard to implement a relatively perfect behavior monitor.

In order to separate the running program from the real computer, the virtual machine approach creates a virtual machine (VM) and runs the programs in the VM. The execution environment of a program here is the VM, which is software, instead of the physical machine. Hence the computer is safe even when the VM is crashed by a virus. It is very easy to collect all the information while a program is running in a VM. If the VM captures any dangerous operation, it alerts the user. When it confirms that the running program is a virus, it kills the virus.

The virtual machine approach is very safe and can recognize almost all viruses, including encrypted and packed viruses. The VM approach has now become one of the most promising virus detection approaches. However, the VM brings considerable overheads to the host computer. How to implement a relatively perfect VM remains an open research question. In addition, the VM only simulates a part of the computer's functions, which gives anti-VM techniques opportunities to evade the VM approach.

Anti-VM techniques have been used in many viruses recently. For example, inserting some special instructions into a virus may crash a VM. Entry point obscuring is also used by viruses to evade the VM approach.

Refs. [15-20] proposed new virus detection models based on dynamic techniques. Although these models have shown promising results, they can produce high false positive rates, an issue which has yet to be resolved[21].

2.3 Heuristics

Schultz et al., who pioneered the application of machine learning and data mining techniques to the anti-virus field, proposed a data mining framework to detect unseen viruses effectively and automatically[22]. Three approaches are taken to feature extraction. The first makes use of Bin-Utils[23] of GNU to extract resource information from a program. In the second approach, string sequences are extracted by using the GNU strings program. The third approach, called hex dump[24], transforms binary files into byte sequences. However, DLL and function names are too unstable to detect viruses. This work lays a good foundation for the application of machine learning and data mining techniques in the anti-virus field.

Kolter et al. proposed a technique to detect viruses in the wild based on the relevant N-Grams selected by using the information gain (IG)[25]. They clearly showed that using techniques from machine learning and data mining to detect viruses is feasible. The N-Gram is a concept from text categorization, where it means N consecutive words or phrases. In the anti-virus field, an N-Gram is usually defined as a binary string of length N bytes. The experimental results revealed that boosted decision trees outperformed the other classifiers, with an area under the receiver operating characteristic curve (AUC) of 0.996. Later they extended this technique to classify viruses according to the functions of their payloads[26].

A new feature selection criterion, class-wise document frequency (CDF), was proposed by Reddy et al. and applied to N-Gram selection[27]. Their experimental results suggested that the CDF outperformed the IG in the feature selection process. They conjectured that the reason might be that most of the relevant N-Grams selected by the IG came from benign programs. However, since the CDF tries to select the features with the highest frequencies in a specific class, it is biased toward the information of that class. As a result, it cannot select the discriminating features for the class effectively.

Stolfo et al. made use of N-Grams to identify file types and later to detect stealthy viruses. Their experimental results showed that the method was able to detect embedded viruses. However, this method was not a general virus detection method.

Sulaiman et al. proposed a static analysis framework for detecting variants of viruses, called the disassembled code analyzer for virus (DCAM)3. Unlike traditional static code analysis, which usually works on the binary string of a program, the authors extracted virus features from disassembled code. The programs which got through three steps of matching were considered benign; otherwise the DCAM classified them as viruses. The experimental results suggested that the DCAM worked very well and could prevent breakouts of previously identified viruses.

Henchiri and Japkowicz adopted a data mining approach to extract frequent patterns (FPs) to detect viruses(3). They filtered the FPs twice and tried to obtain general FPs based on the intra-family support and inter-family support. Several classifiers were involved in this work, such as the J48 decision tree and naive