2019 26th Asia-Pacific Software Engineering Conference (APSEC)
2640-0715/19/$31.00 ©2019 IEEE. DOI 10.1109/APSEC48747.2019.00018
Adaptive Random Testing for XSS Vulnerability

Chengcheng Lv¹, Long Zhang²,³, Fanping Zeng¹, and Jian Zhang²,³
¹ School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
² State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
³ University of Chinese Academy of Sciences, Beijing, China
Email: lvcc@mail.ustc.edu.cn, zlong@ios.ac.cn, billzeng@ustc.edu.cn, zj@ios.ac.cn

Abstract: XSS is one of the common vulnerabilities in web applications. Many black-box testing tools collect a large number of payloads and traverse them to find a payload that can be successfully injected, but they are not very efficient. Previous research has paid little attention to improving the efficiency of black-box testing for XSS vulnerabilities. To improve the efficiency of testing, we develop an XSS testing tool. It collects 6128 payloads and uses a headless browser to detect XSS vulnerabilities. With the adaptive random testing method, the tool can discover XSS vulnerabilities quickly. We conduct an experiment using 3 widely adopted open source vulnerable benchmarks and 2 actual websites to evaluate the adaptive random testing method. The experimental results indicate that the adaptive random testing method improves on the fuzzing method by more than 27.1% in terms of the number of attempts needed before a successful injection.

Keywords: XSS Vulnerability, Adaptive Random Testing, Fuzzing

I. INTRODUCTION

Cross-site scripting (XSS) is a well-known security vulnerability in web applications. In an XSS attack, the attacker usually disguises malicious content (a malicious script) as benign text in order to deceive a vulnerable web application. When the web application runs, the victim treats the malicious text as legitimate code of the application, and the victim's browser inadvertently executes the malicious content [1].
In the report released by the Open Web Application Security Project (OWASP) [2] in 2017, XSS is listed as one of the top 10 web vulnerabilities.

For example, the statement echo "<b>".$userName."</b>"; is a piece of PHP code whose function is to display the user's name on the page. But when the user's name is "<script>alert('This is an XSS')</script>", the browser will execute the user's name as page code and display "This is an XSS" in a window. Here, "<script>alert('This is an XSS')</script>" is called an XSS payload.

XSS vulnerabilities can be divided into the following three types [1]:
• Reflected XSS attack: The reflected XSS attack is currently the most basic type of web vulnerability attack. It is also known as a Type-I or non-persistent XSS attack. When the victim clicks on a link containing malicious text (most commonly in HTTP query parameters), the server script parses the malicious text into malicious code (i.e., it is reflected back), and the victim's browser executes it.
• Stored XSS attack: The stored XSS vulnerability is a variant of the cross-site scripting flaw, also known as a Type-II or persistent XSS vulnerability, which attackers can exploit to attack web applications. An attacker can embed malicious code into a vulnerable server through an application such as a forum and store it there permanently. When a victim visits such an infected site, the malicious code is delivered to the victim as part of the web page.
• DOM based XSS attack: The DOM based XSS attack is a new sub-class of reflected XSS attacks, also known as a Type-0 XSS attack. In DOM based XSS attacks, malicious data does not touch the web server. Instead, it is reflected entirely by JavaScript code on the client side.

There are many black-box testing tools for detecting XSS vulnerabilities. They do not know the internals of the web application and apply fuzzing techniques to its HTTP requests [3].
The approaches that can detect XSS vulnerabilities are mainly divided into the dynamic approach and the static approach [4]. The static approach detects XSS vulnerabilities by analyzing the response data. Its detection speed is fast, but the false alarm rate is high and the alarms need to be confirmed manually. Therefore, the dynamic approach may be a better choice. It determines whether user input is being parsed as code based on the behavior of the program at runtime. The dynamic approach can detect XSS vulnerabilities more accurately, but it consumes more time and resources. At the same time, a website may have many different URLs at risk of XSS vulnerabilities. Therefore, it is difficult for tools to run a large number of test cases.

In this paper, we propose a dynamic detection tool that uses the method of adaptive random testing (ART) [5] to detect XSS vulnerabilities in web applications. We found that invalid payloads fail to be injected either because some keywords in the payloads are filtered or converted, or because the payloads do not fit the context, so the browser cannot execute the malicious code. We have observed that effective payloads tend to cluster together. Moreover, invalid payloads and effective payloads usually share some identical keywords, so some mutation of an invalid payload may result in a successful injection. Therefore, after a payload fails to be injected, we can measure the distance between the failed payload and the other payloads, and then select the next payload that is most likely to be injected successfully, in order to find the vulnerability more quickly.

The main contributions of this paper are as follows.
• We convert each payload into a collection of words based on the defined rules and calculate the distance between two payloads.
• We find that in XSS testing, the distribution of effective payloads is uneven. Based on this observation, we develop an XSS testing tool. It collects 6128 payloads and uses a headless browser to detect XSS vulnerabilities. The tool can discover XSS vulnerabilities quickly with the ART method.
• We conduct an experiment using 3 widely adopted open source vulnerable benchmarks and 2 actual websites to evaluate the ART method. The experimental results indicate that there is a 27.1% improvement over the fuzzing method.

The rest of this paper is organized as follows. Section II describes our approach. Section III describes the implementation of the testing tool. Section IV evaluates our approach through experiments. Section V introduces related work. Section VI summarizes the paper.

II. APPROACH

Black-box testing is a common way to mitigate the threat of XSS vulnerabilities in web applications. Fuzzing is a popular and effective black-box testing method for detecting such vulnerabilities [6]. A security expert or an attacker often has a prepared collection of attack payloads in hand, and traverses these payloads to find an effective one that can be successfully injected into the web application. These tasks are very simple and easy to automate, but the existing tools are inefficient [7].

To improve testing efficiency in this sense, adaptive random testing (ART) [8], [9] has been proposed. Based on the observation that failure-causing inputs tend to cluster together, ART tries to spread randomly generated test cases evenly in order to improve the fault-detection capability [8], [9].

We observe that the distribution of effective payloads in the state space is often uneven and that they tend to cluster together.
So we can try to improve efficiency with the method of adaptive random testing. We found that invalid payloads and valid payloads usually share some identical keywords, and that some mutation of an invalid payload may result in a successful injection. So when a payload cannot be successfully injected, we can select a payload at an appropriate distance from the invalid payload. This is equivalent to mutating the invalid payload. The main problems are how to measure the distance between payloads and how to select the next payload. We explain each of them below.

A. Distance Measure

Distance measure is divided into word tokenizing and distance calculation.

1) Word tokenizing: In order to measure the distance between payloads more accurately, we need to tokenize the XSS payloads to identify sensitive strings and tags of the HTML or JavaScript language [10]. So we define the following rules and use the Natural Language ToolKit [11] to process the XSS payloads.
• The contents of single and double quotes, such as 'XSS'
• <> tag, such as '<script'
• Parameter name, such as 'href='
• Function body, such as 'alert('
• Http/https link
• Common words composed of alphanumeric characters, such as 'javascript'
• Special encoding format, such as '\u003c'
• Special keywords, such as '\\'

2) Distance calculation: Having defined the rules for XSS payloads, we convert each payload into a collection of words. We use the ratio of keywords shared by two payloads to measure their similarity, and hence the distance between them. In this section, we use the Jaccard distance [12], [13] to calculate the distance between two payloads. Suppose we are given two payloads P_i and P_j, which have word collections W_i and W_j. The Jaccard distance measures the dissimilarity between two samples as the proportion of elements that differ between the two sets.
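The tokenizing rules and the set-based distance described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper uses the Natural Language ToolKit, whereas this sketch substitutes a regular expression from the standard library, and the pattern, the function names `tokenize` and `jaccard_distance`, and the two sample payloads are all assumptions made for illustration. The special-keyword rule is omitted because the text gives only a single example of it.

```python
import re

# Hypothetical approximation of the tokenizing rules (regex stand-in for NLTK).
# Alternatives are tried in order, so more specific rules come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://[^\s"'<>]+          # http/https link
    | '[^']*' | "[^"]*"            # contents of single or double quotes
    | </?[A-Za-z][A-Za-z0-9]*      # start of a tag, e.g. <script
    | \\u[0-9A-Fa-f]{4}            # special encoding, e.g. \u003c
    | [A-Za-z_][A-Za-z0-9_]*=      # parameter name, e.g. href=
    | [A-Za-z_][A-Za-z0-9_.]*\(    # function call head, e.g. alert(
    | [A-Za-z0-9]+                 # common alphanumeric word
    """,
    re.VERBOSE,
)


def tokenize(payload):
    """Convert a payload string into its collection (set) of words."""
    return set(TOKEN_PATTERN.findall(payload))


def jaccard_distance(w_i, w_j):
    """Proportion of elements in the union that the two sets do not share."""
    union = w_i | w_j
    if not union:
        return 0.0  # two empty collections are treated as identical
    return (len(union) - len(w_i & w_j)) / len(union)


p1 = "<script>alert('XSS')</script>"
p2 = "<img src=x onerror=alert('XSS')>"
# The payloads share 2 of 8 distinct tokens, so the distance is 6/8.
print(jaccard_distance(tokenize(p1), tokenize(p2)))  # 0.75
```

Under these assumed rules the two payloads share only the tokens `alert(` and `'XSS'`, which is the kind of keyword overlap the distance is meant to capture.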
The formula is:

$$\mathrm{Distance}(P_i, P_j) = \mathrm{Jaccard}(W_i, W_j) = \frac{|W_i \cup W_j| - |W_i \cap W_j|}{|W_i \cup W_j|}$$

B. Payloads selection

The distribution of effective payloads in the state space is often uneven, and effective payloads tend to cluster together. We therefore propose an XSS payload selection algorithm, XSSART, which improves efficiency with the method of adaptive random testing. We select a payload randomly as the first test case. When the payload cannot be successfully injected, we increase the priority of the payloads whose distance from this invalid payload lies in the interval [d_l, d_r] (the specific values of d_l and d_r are determined in Section IV). Then the highest-priority payload is selected as the next test case. The specific process of XSSART is shown in Algorithm 1.

The input of the algorithm is the payload collection P_c and the appropriate distance interval [d_l, d_r]. The output of the algorithm is True or False, indicating whether an XSS vulnerability exists in the system. The Rank value of a payload indicates its selection priority. First, the Rank values of the payloads are all set to 0, and Candidate is set to all the payloads in P_c (lines 1 and 2 of Algorithm 1). Then a payload P_selected is randomly selected from Candidate and removed from P_c (lines 4 and 5 of Algorithm 1). If P_selected can be injected successfully, the algorithm returns True (lines 6 and 7 of Algorithm 1). If not, the algorithm sets Candidate to ∅ and Max_Rank to 0 (line 9 of Algorithm 1). For each payload P_tmp in P_c, if the distance between P_tmp and P_selected is within the interval [d_l, d_r], the Rank value of P_tmp is increased by 1 (lines 11 and 12 of Algorithm 1). If the Rank value of P_tmp is greater than Max_Rank, the algorithm updates Max_Rank and sets Candidate to ∅ (lines 14, 15 and 16 of Algorithm 1). If the Rank value of P_tmp is equal to Max_Rank, P_tmp is added to Candidate (lines 18 and 19 of Algorithm 1). After processing all the payloads in P_c, if Candidate is not the empty set, the algorithm goes back to line 4 through the while loop and executes the next payload; otherwise, the algorithm returns False.

Algorithm 1 Payloads selection
Input: Payloads collection: P_c
Input: Distance interval: d_l, d_r
Output: True/False
 1: Rank[P_0, P_1, ..., P_n] = {0}
 2: Candidate = P_c
 3: while Candidate ≠ ∅ do
 4:     randomly select P_selected from P_c
 5:     P_c ← P_c − {P_selected}
 6:     if P_selected can be injected successfully then
 7:         return True
 8:     else
 9:         Candidate ← ∅, Max_Rank = 0
10:         for all P_tmp ∈ P_c do
11:             if d_l ≤ Distance(P_selected, P_tmp) ≤ d_r then
12:                 Rank[P_tmp] += 1
13:             end if
14:             if Rank[P_tmp] > Max_Rank then
15:                 Candidate ← ∅
16:                 Max_Rank = Rank[P_tmp]
17:             end if
18:             if Rank[P_tmp] = Max_Rank then
19:                 Candidate ← Candidate ∪ {P_tmp}
20:             end if
21:         end for
22:     end if
23: end while
24: return False

III. IMPLEMENTATION

We develop a prototype tool that implements the approach described in Section II in Python 3.5.4. We use a PhantomJS [14] based browser to determine whether the injection was successful based on the behavior of the program. If the content of a payload is executed as page code, we consider that there is an XSS vulnerability at this site. The tool supports the detection of Reflected XSS vulnerabilities and Stored XSS vulnerabilities.

The overview of the tool is shown in Figure 1. The user needs to provide the URL, cookies and other information for testing.
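The selection procedure of Algorithm 1, which the tool applies on every iteration, can be sketched in Python as follows. This is a minimal illustration rather than the prototype's code: the names `xssart`, `distance`, `is_injected` and `rng` are hypothetical, the injection check is passed in as a callback instead of driving a headless browser, and the sketch draws the next payload from Candidate, following the prose description of the algorithm.

```python
import random


def xssart(payloads, distance, is_injected, d_l, d_r, rng=random):
    """Return True if some payload is injected successfully, else False.

    payloads    -- collection of distinct payload strings (P_c)
    distance    -- callable giving the distance between two payloads
    is_injected -- callable returning True when a payload is injected
    d_l, d_r    -- bounds of the distance interval
    rng         -- object with a choice() method (assumed hook for testing)
    """
    remaining = list(payloads)            # P_c
    rank = {p: 0 for p in remaining}      # Rank values, all set to 0
    candidate = list(remaining)           # Candidate = P_c
    while candidate:
        selected = rng.choice(candidate)  # random pick among top-ranked
        remaining.remove(selected)
        if is_injected(selected):
            return True
        candidate, max_rank = [], 0
        for tmp in remaining:
            # Raise the priority of payloads at an appropriate distance
            # from the payload that has just failed.
            if d_l <= distance(selected, tmp) <= d_r:
                rank[tmp] += 1
            if rank[tmp] > max_rank:
                candidate, max_rank = [], rank[tmp]
            if rank[tmp] == max_rank:
                candidate.append(tmp)
    return False
```

Because ranks accumulate across failed attempts, a payload that lies in the interval around several failed payloads is preferred over one that lies in the interval around only one of them.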
Each time, the tool selects a payload with the ART method, and the headless browser PhantomJS uses the selected payload and the information provided by the user to construct an HTTP request and send it to the target server. After receiving the response, the tool runs the response code to determine whether the XSS injection was successful based on the behavior of the browser. If it was not, the tool selects the next payload to execute; otherwise, it outputs the effective payload and stops running.

Fig. 1. Overview of the tool

At the same time, we collect high-quality payloads from the following open source tools and websites.
• Xenotix¹: an open source tool with 1630 payloads.
• XSSfork²: an open source tool with 71 payloads.
• Burp Suite³: a tool with 96 payloads.
• Fuzzdb⁴: a collection with 243 payloads.
• foospidy⁵: a collection with various payloads.
We keep only one copy of any payload that appears multiple times, which gives a total of 6,128 payloads.

IV. EXPERIMENTAL EVALUATION

We test the efficiency of the ART method on open source vulnerable benchmarks and actual website applications. We select the open source benchmarks Web for Pentester⁶, Damn Vulnerable Web Application (DVWA)⁷ and WAVSEP⁸. They are all well-known penetration test walkthrough environments. Web for Pentester has 7 reflected websites, which are recorded as WFP_1, WFP_2, WFP_3, WFP_4, WFP_5, WFP_6, and WFP_7. DVWA has a reflected injection point and a stored injection point, each with four different levels of security. No payload can be injected at the level 'Impossible', so we have 6 websites, which are recorded as DVWA_R_1, DVWA_R_2, DVWA_R_3, DVWA_S_1, DVWA_S_2 and DVWA_S_3. WAVSEP has 73 different websites.
Considering the number of websites in the first two systems, we select the first 7 websites, recorded as WAVSEP 1, WAVSEP 2, WAVSEP 3, WAVSEP 4, WAVSEP 5, WAVSEP 6 and 1https://github.com/ajinabraham/OWASP-Xenotix-XSS-ExploitFramework 2https://github.com/bsmali4/XSSfork 3http://portswigger.net/burp 4https://github.com/fuzzdb-project/fuzzdb 5https://codeload.github.com/foospidy/payloads/zip/master 6https://pentesterlab.com/exercises/web for pentester 7http://www.dvwa.co.uk/ 8https://github.com/sectooladdict/wavsep 65
WAVSEP 7.We chose a course selection system as the Distribution of effective payloads actual website for the test.The site has two different XSS injection points.If exploited by an attacker,it can be Effective detrimental to the teachers and students who use the site.We record these two injection points as School_I and School_2. We tested these sites with the payloads.The results are shown in Table I. We can see that the number of valid payloads is different 02 in different websites.Some are dense (such as WFP_1 and DVWA_R_1)and some are sparse (such as WFP_7 and Average WAVSEP 2).How the effective payloads are distributed on these different sites and how efficient the ART method is will be discussed below. 0.d 02 04 06 0.8 10 TABLE I Fig.2.Distribution of effective payloads. PAYLOADS INJECTION RESULTS B.XSSART Vs Fuzzing Site Total Payloads Valid Payloads Ratio School_I 6128 0.096 From the previous section we know that for invalid pay- School 2 6128 0.042 loads,as the distance increases,the proportion of payloads WFP I 6128 1256 0.205 injected successfully increases first and then decreases.So WFP_2 6128 908 0.148 WFP 3 6128 814 0.133 we can select the next one based on the average distance WFP_4 6128 450 0.073 between invalid payloads and effective payloads.We use WFP_5 6128 359 0.059 the 7 websites of Web for Pentester as a "training set"to WFP 6 6128 131 0.021 WFP_7 6128 63 find a appropriate distance,to test the efficiency of the ART 0.00 DVWA R I 6128 1444 0236 method on other different benchmarks and actual websites. DVWA_R_2 6128 1082 0.177 We continue to use the relative distance,and the average DVWA R3 6128 531 0.087 DVWA S_I 6128 I505 0.246 distance between invalid payloads and effective payloads in DVWA_S_2 6128 1055 0.172 the 7 websites of Web for Pentester is 0.391.We increase DVWA_S_3 WAVSEP I 6128 0087 the priority of the 1/4 payloads each time,so we can set the 6128 0.242 distance interval to [0.265,0.5151. 
WAVSEP 2 6128 16 0.003 WAVSEP_3 6128 583 0.095 WAVSEP 4 6128 65 0.011 TABLE II WAVSEP 5 6128 174 0.028 XSSART Vs FUZZING WAVSEP_6 6128 272 0.044 WAVSEP 7 6128 261 0.043 Site Fuzzing XSSART Ratio Average 6128 628 0.02 School 1 10.24 6.7 34.60 School 2 23.83 11.88 50.5% WFP_1 4.86 4.27 12.1% WFP 2 6.76 5.68 16.0% WFP 3 7.4 5.94 9.7% WFP4 1372 9.03 34.2 A.Distribution of effective payloads WFP 5 17.09 15.35 10.2% WFP 6 46.48 34.5 25.8% We count the proportion of payloads injected successfully WFP_7 99.79 90.54 9.3% at different distances between invalid payloads and effective DVWA RI 425 372 12.5% payloads.The result is shown in Figure 2.Here,we use DVWA_R_2 5.63 4.71 16.3% DVWA R 3 11.4 8.4 26.3% the relative distance.For a target payload,we sort the other DVWA S I 408 3.67 10.0 payloads according to the Jaccard coefficient with the target DVWA S_2 5.69 5.01 12.0% payload,and divide the order by the value of the total DVWA S 3 11.61 8.45 27.2% WAVSEP_I 4.1 3.69 10.0% number of payloads as the distance value,so that the distance WAVSEP_2 358.84 173.56 51.60 from the target payload is evenly distributed. WAVSEP 3 10.69 6.88 35.6% WAVSEP 4 92.73 45.73 50.7% The line in Figure 2 represents the average proportion of WAVSEP S 34R2 20.67 40.6% effective payloads.We can see that,for effective payloads, WAVSEP 6 22.2 12.1i 45.5% the closer the payloads are,the more successfully they can be WAVSEP 7 23.48 12.65 46.1% injected.But for invalid payloads,as the distance increases, Average 37.26 22.41 27.1% the proportion of payloads injected successfully increases first and then decreases.Therefore,we can say that effective In the XSS detection,it is significant to find the first payloads cluster together and selecting a payload with an effective payload.We stop testing once that we find a appropriate distance from the invalid payload can increase payload which can be injected successfully,and record the the probability of successful injection. 66
WAVSEP 7. We chose a course selection system as the actual website for the test. The site has two different XSS injection points. If exploited by an attacker, they could harm the teachers and students who use the site. We record these two injection points as School 1 and School 2.

We tested these sites with the payloads. The results are shown in Table I. The number of valid payloads differs from site to site. On some sites the valid payloads are dense (such as WFP 1 and DVWA R 1) and on others they are sparse (such as WFP 7 and WAVSEP 2). How the effective payloads are distributed on these sites and how efficient the ART method is are discussed below.

TABLE I
PAYLOADS INJECTION RESULTS

Site        Total Payloads  Valid Payloads  Ratio
School 1    6128            587             0.096
School 2    6128            256             0.042
WFP 1       6128            1256            0.205
WFP 2       6128            908             0.148
WFP 3       6128            814             0.133
WFP 4       6128            450             0.073
WFP 5       6128            359             0.059
WFP 6       6128            131             0.021
WFP 7       6128            63              0.010
DVWA R 1    6128            1444            0.236
DVWA R 2    6128            1082            0.177
DVWA R 3    6128            531             0.087
DVWA S 1    6128            1505            0.246
DVWA S 2    6128            1055            0.172
DVWA S 3    6128            535             0.087
WAVSEP 1    6128            1484            0.242
WAVSEP 2    6128            16              0.003
WAVSEP 3    6128            583             0.095
WAVSEP 4    6128            65              0.011
WAVSEP 5    6128            174             0.028
WAVSEP 6    6128            272             0.044
WAVSEP 7    6128            261             0.043
Average     6128            628             0.102

A. Distribution of effective payloads

We count the proportion of successfully injected payloads at different distances from invalid payloads and from effective payloads. The result is shown in Figure 2. Here, we use the relative distance. For a target payload, we sort the other payloads by their Jaccard coefficient with the target payload and take each payload's rank divided by the total number of payloads as its distance, so that the distances from the target payload are evenly distributed between 0 and 1.

The line in Figure 2 represents the average proportion of effective payloads. We can see that, for effective payloads, the closer the payloads are, the more successfully they can be injected.
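As an illustration of the relative distance used for Figure 2, the following Python sketch ranks payloads by Jaccard coefficient and converts the ranks into distances. The tokenization into character bigrams and the helper names (`tokens`, `jaccard`, `relative_distances`) are assumptions made for this example; they are not taken from the XSSART implementation.

```python
from typing import Dict, List, Set


def tokens(payload: str) -> Set[str]:
    """Hypothetical tokenization: the set of character bigrams of a payload."""
    return {payload[i:i + 2] for i in range(len(payload) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A & B| / |A | B| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    # Two payloads with no tokens at all are treated as identical.
    return len(ta & tb) / len(union) if union else 1.0


def relative_distances(target: str, payloads: List[str]) -> Dict[str, float]:
    """Rank the other payloads by decreasing similarity to `target` and
    return rank / (total number of payloads) as the relative distance."""
    others = [p for p in payloads if p != target]
    # sorted() is stable, so payloads with equal similarity keep input order.
    ranked = sorted(others, key=lambda p: jaccard(target, p), reverse=True)
    total = len(payloads)
    return {p: (rank + 1) / total for rank, p in enumerate(ranked)}


if __name__ == "__main__":
    sample = [
        "<script>alert(1)</script>",
        "<script>alert(2)</script>",
        "<img src=x onerror=alert(1)>",
        "<svg onload=alert(1)>",
    ]
    for payload, dist in relative_distances(sample[0], sample).items():
        print(f"{dist:.2f}  {payload}")
```

Because the distance is a normalized rank rather than a raw similarity value, every target payload sees the same evenly spaced set of distances, which makes the curves of different sites comparable.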
For invalid payloads, in contrast, the proportion of successfully injected payloads first increases and then decreases as the distance increases. Therefore, we can say that effective payloads cluster together, and that selecting a payload at an appropriate distance from an invalid payload can increase the probability of successful injection.

Fig. 2. Distribution of effective payloads.

B. XSSART Vs Fuzzing

From the previous section we know that, for invalid payloads, the proportion of successfully injected payloads first increases and then decreases as the distance increases. So we can select the next payload based on the average distance between invalid payloads and effective payloads. We use the 7 websites of Web for Pentester as a "training set" to find an appropriate distance, and then test the efficiency of the ART method on the other benchmarks and the actual websites. We continue to use the relative distance; the average distance between invalid payloads and effective payloads in the 7 websites of Web for Pentester is 0.391. We increase the priority of 1/4 of the payloads each time, so we set the distance interval to [0.265, 0.515].

TABLE II
XSSART VS FUZZING

Site        Fuzzing   XSSART   Ratio
School 1    10.24     6.7      34.6%
School 2    23.83     11.88    50.5%
WFP 1       4.86      4.27     12.1%
WFP 2       6.76      5.68     16.0%
WFP 3       7.4       5.94     19.7%
WFP 4       13.72     9.03     34.2%
WFP 5       17.09     15.35    10.2%
WFP 6       46.48     34.5     25.8%
WFP 7       99.79     90.54    9.3%
DVWA R 1    4.25      3.72     12.5%
DVWA R 2    5.63      4.71     16.3%
DVWA R 3    11.4      8.4      26.3%
DVWA S 1    4.08      3.67     10.0%
DVWA S 2    5.69      5.01     12.0%
DVWA S 3    11.61     8.45     27.2%
WAVSEP 1    4.1       3.69     10.0%
WAVSEP 2    358.84    173.56   51.6%
WAVSEP 3    10.69     6.88     35.6%
WAVSEP 4    92.73     45.73    50.7%
WAVSEP 5    34.82     20.67    40.6%
WAVSEP 6    22.2      12.11    45.5%
WAVSEP 7    23.48     12.65    46.1%
Average     37.26     22.41    27.1%

In XSS detection, it is important to find the first effective payload. We stop testing as soon as we find a payload that can be injected successfully, and record the
number of payload executions to evaluate the ART method. This metric is called the F-measure, a commonly used metric defined as the expected number of test cases needed to detect the first failure [8], [9]. We use the F-measure to compare XSSART with the Fuzzing method. Here, the Fuzzing method for XSS detection means selecting an unexecuted payload for testing each time until the vulnerability is discovered. To avoid sample bias, we tested each website 1000 times and took the average as the final result. Finally, we calculate the ratio (Fuzzing - ART) / Fuzzing * 100% to evaluate how much XSSART improves efficiency. The results are shown in Table II.

As we can see in the table, the ART method is on average more efficient than the Fuzzing method on all 22 test sites. The average improvement is 27.1%, and the highest improvement is 51.6%. Moreover, the sparser the effective payloads are, the more efficient ART is.

Fig. 3. ART Vs Fuzzing

Fig. 4. ART Vs Fuzzing

Fig. 5. ART Vs Fuzzing

To compare ART and Fuzzing in more detail, we use box-whisker plots to represent the data of these websites, as shown in Figure 3, Figure 4 and Figure 5. We divide the F-measure of each website by the maximum over the two methods, which normalizes the F-measure to between 0 and 1. The plots show the median, maximum, and minimum values of the F-measure. Among the 22 XSS vulnerabilities, the minimum values of the F-measure of XSSART and Fuzzing differ little, which means that the best case (the smallest number of payloads that must be tried) of the two methods is very close. For the maximum value of the F-measure, however, XSSART is significantly smaller than Fuzzing. For 21 of the vulnerabilities, the maximum value of XSSART is less than that of Fuzzing, and for the remaining vulnerability (DVWA S 1) the maximum values of the two methods are close. This indicates that the worst case of XSSART (the largest number of payloads that must be tried) is much better than that of Fuzzing.
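The F-measure and the improvement ratio described above can be sketched in Python as follows. This is a minimal sketch, assuming that Fuzzing tries the payloads in a uniformly random order without repetition; under that assumption the expected number of attempts until the first of K valid payloads among N is (N + 1) / (K + 1). The function names are hypothetical and the code is not the authors' tool.

```python
import random


def fuzzing_f_measure(total: int, valid: int) -> float:
    """Expected attempts until the first valid payload when `total` payloads
    are tried in uniformly random order without repetition (assumption)."""
    return (total + 1) / (valid + 1)


def simulate_fuzzing(total: int, valid: int, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, averaged over `runs` trials."""
    rng = random.Random(seed)
    attempts = 0
    for _ in range(runs):
        # 1-based positions of the valid payloads in a random test order;
        # the smallest one is the number of attempts in this trial.
        positions = rng.sample(range(1, total + 1), valid)
        attempts += min(positions)
    return attempts / runs


def improvement(fuzzing: float, art: float) -> float:
    """Ratio (Fuzzing - ART) / Fuzzing * 100, in percent."""
    return (fuzzing - art) / fuzzing * 100.0


if __name__ == "__main__":
    # School 1 in Table I: 587 valid payloads out of 6128.
    print(round(fuzzing_f_measure(6128, 587), 2))
    print(round(simulate_fuzzing(6128, 587, runs=1000), 2))
    # School 1 in Table II: Fuzzing 10.24, XSSART 6.7.
    print(round(improvement(10.24, 6.7), 1))
```

For School 1 the closed form gives about 10.42 and for WAVSEP 2 about 360.5, close to the measured Fuzzing values of 10.24 and 358.84 in Table II. Note that the Average row of Table II is the mean of each column, so its ratio of 27.1% is the mean of the per-site ratios rather than the ratio of the two average F-measures.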
Therefore, we can say that the ART method can detect XSS vulnerabilities more effectively than the Fuzzing method.

V. RELATED WORK

Researchers in academia and industry have proposed many approaches to detect XSS attacks. We summarize the main work in the field related to this paper.