Taint Inference for Cross-Site Scripting in Context of URL Rewriting and HTML Sanitization
ETRI Journal. 2016. Apr, 38(2): 376-386
Copyright © 2016, Electronics and Telecommunications Research Institute (ETRI)
  • Received : June 24, 2015
  • Accepted : December 09, 2015
  • Published : April 01, 2016
Jinkun Pan
Xiaoguang Mao
Weishi Li

Currently, web applications are gaining in prevalence. In a web application, an input may not be appropriately validated, making the application susceptible to cross-site scripting (XSS), which poses serious security problems for Internet users and for the websites to which the trusted web pages belong. Taint inference is a type of information flow analysis that is useful in detecting XSS on the client side. However, two practical issues have yet to be handled properly by existing techniques. One is URL rewriting, which transforms a standard URL into a cleaner and more manageable form. The other is HTML sanitization, which filters an input against blacklists or whitelists of HTML tags or attributes. In this paper, we draw an analogy between the taint inference problem and the molecule sequence alignment problem in bioinformatics, and adapt two techniques from the latter to solve the two issues above. In particular, in our method, URL rewriting is addressed using local sequence alignment, and HTML sanitization is modeled by introducing a removal gap penalty. Empirical results demonstrate the effectiveness and efficiency of our method.
I. Introduction
Nowadays, accessing web applications has become a daily routine for many people, whether checking email, conducting bank transactions, or visiting social networking websites. All kinds of information systems for governments, businesses, and individuals are now built as web applications. Unfortunately, many web applications are exposed to various security vulnerabilities. Among them, cross-site scripting (XSS) has emerged as one of the most serious threats on the web. XSS is listed second in the OWASP top 10 security risks [1] and fourth in the CWE/SANS top 25 most dangerous software errors [2]. The security problems caused by XSS are severe. By injecting malicious scripts into trusted web content, an attacker can gain access to a user's browser; steal a user's cookies; hijack a user's sessions; transfer confidential data; cause denial of service; and forge web requests and responses, as well as perform many other types of malicious activities.
Although a single XSS vulnerability is easy to fix, fixing all XSS vulnerabilities in a large web application is a truly challenging task, one that many application programmers cannot fully accomplish. Instead of fixing them all, detecting and preventing them when they occur is a more feasible way to deal with them. To prevent XSS, we should first detect whether an attacker is able to exert control over a piece of web content, and if so, we should then further detect precisely which parts of this content the attacker can inject into. Taint inference has been proposed to solve this problem. Such a technique is practically useful on the client side because it works in a black-box manner (that is, it compares the input of the user with the response of the server); thus, it needs no source code and is independent of the underlying server technology. However, some practical issues remain to be solved. For example, more and more websites use URL rewriting to overcome the shortcomings of the standard URL, namely that it exposes the underlying technology of the website and is neither descriptive nor friendly to users and search engines alike. URL rewriting impedes existing taint inference techniques in locating and extracting a user input from a URL, thus reducing the precision of the resulting inference. Moreover, many server applications adopt HTML sanitizers to filter potentially dangerous tags and attributes in an effort to protect against XSS attacks. This also causes problems when trying to match the input of a user with the response of the server. To overcome these two problems, we propose a new taint inference technique inspired by molecule sequence alignment in bioinformatics.
Through local sequence alignment, a tainted input can be located and inferred automatically in the context of URL rewriting, and the imprecision of taint inference caused by HTML sanitization can be mitigated by introducing a removal gap penalty. We evaluate our technique using 18 vulnerabilities in five open-source projects, each with 108 malicious vectors. Experimental results show that both the inference rate and the inference precision are improved markedly and that the runtime overhead is negligible.
In the remainder of this paper, we first introduce background techniques in Section II. Then, we describe the motivation for our study in Section III and propose our approach in Section IV. Experimental evaluations are reported in Section V. Finally, we discuss the related work in Section VI and conclude in Section VII.
II. Background
- 1. Cross-Site Scripting
XSS denotes a kind of code injection attack on a web application. Because HTML documents have a flat, serial structure mixing control statements, formatting, and actual content, attackers can, in the absence of proper input validation and sanitization, inject malicious scripts into the content returned by the vulnerable application. As such injected content is delivered from a trusted server, the malicious scripts act under the permissions granted to the vulnerable application.
XSS can be classified into three different types: reflected, stored, and DOM-based XSS. A reflected XSS vulnerability is the most common type. These vulnerabilities show up when data provided by a web client is used immediately by server-side scripts to parse and display a page of results. A stored XSS vulnerability occurs when malicious data provided by an attacker is injected into a vulnerable application’s storage. This results in every user that accesses the poisoned web page receiving the injected script without the need for any further action on behalf of the attacker. A DOM-based XSS is a special variant of the reflected XSS, where logic errors in legitimate JavaScript and careless usage of client-side data result in XSS conditions. In a DOM-based XSS, malicious data need not touch a web server; rather, it can be reflected by the JavaScript code, fully on the client side.
Regardless of the different types of XSS, the corresponding taint inference algorithms are similar. The differences lie in the contexts and contents to be inferred. To simplify illustration, we only consider reflected XSS in this paper.
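As a minimal illustration of the reflected case, the sketch below builds a server response directly from a request parameter; the function names and payload are hypothetical, and Python's `html.escape` merely stands in for whatever sanitization a real application would apply:

```python
import html

def render_greeting(name: str, sanitize: bool = False) -> str:
    """Build the HTML response a (hypothetical) server might return."""
    if sanitize:
        name = html.escape(name)  # neutralizes the tag characters < > &
    return f"<html><body>Hello, {name}!</body></html>"

payload = "<script>alert(1)</script>"
vulnerable = render_greeting(payload)            # payload reflected verbatim
fixed = render_greeting(payload, sanitize=True)  # payload HTML-escaped
```

In the `vulnerable` response the payload arrives at the browser as live markup, satisfying the injection condition; in the `fixed` response it is rendered as inert text.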
- 2. Taint Inference
XSS attacks occur under the following two conditions:
  • Data from an untrusted source is injected into dynamic content that is to be sent to a web user.
  • The injected content is able to perform malicious activities that are anticipated by neither the developer nor the administrator.
To detect the occurrence of an XSS attack, it is necessary to check whether these two conditions have been met. In our research, we focus on the first condition and try to solve the problems of whether data from an untrusted source has been injected and which parts of the response delivered by the server are derived from the injected data.
Fine-grained taint tracking [3]–[7] has been proposed as an effective technique for tackling such problems. However, it suffers from several drawbacks, such as heavy instrumentation, high overhead, language dependency, and a requirement for source code; these drawbacks make it difficult to adopt in production systems.
To overcome these drawbacks, a new taint inference technique has been proposed, which infers taints in a black-box manner by observing and comparing user input requests and server output responses. Generally speaking, requests to web applications use the HTTP protocol, with standardized ways of encoding parameters. Web applications receive the request parameter values, apply simple sanitization or normalization operations, and then use the values to retrieve data, or even generate content containing these values in the response to the user. As a result, data flows can be identified by comparing input parameter values against all possible substrings of outgoing responses. Because client-side defenses do not (and need not) access the source code, taint inference is preferred over taint tracking. An example is shown in Fig. 1. The server code is vulnerable due to a lack of proper input processing. The solid arrows represent a taint flow from the URL request to the HTML response through the server code. However, the server code is not available on the client side. Taint inference helps us infer the taint flow between the URL request and the HTML response, shown by the dashed arrow, which discloses the XSS vulnerability concealed in the server code.
Fig. 1. Example of taint inference in detecting an XSS vulnerability.
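The comparison described above can be sketched in its simplest form: extract the parameter values of a standard URL and flag any value that reappears verbatim in the response. The URL and response below are hypothetical, and real inference engines use fuzzier matching, as discussed next:

```python
from urllib.parse import urlparse, parse_qs

def infer_taint(url: str, response: str) -> dict:
    """Naive taint inference: flag every request parameter whose value
    reappears verbatim as a substring of the server's response."""
    params = parse_qs(urlparse(url).query)
    tainted = {}
    for name, values in params.items():
        for value in values:
            if value in response:
                tainted.setdefault(name, []).append(value)
    return tainted

url = "http://example.com/search?q=<script>alert(1)</script>&page=7"
response = "<html>Results for <script>alert(1)</script></html>"
tainted = infer_taint(url, response)  # only q's value is reflected
```

Note that exact substring matching fails as soon as the server alters the value even slightly, which is exactly why the approximate techniques below were developed.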
Among existing techniques, Internet Explorer (IE) [8] uses regular expressions to infer taints. From inputs, regular expressions covering possibly malicious injections are created using heuristics; these expressions are then compiled and matched against the HTML output. The taint inference algorithm of XSSAuditor [9] uses straight string matching between inputs and outputs, accounting for magic quotes and the normalization of Unicode characters. NoXSS [10] adopts a longest common subsequence algorithm, which allows parts of a substring to be present in an input parameter while missing from a response. XSSFilt [11] relies on an approximate, rather than exact, string match, enabling it to identify taint in the presence of the simple sanitization or normalization operations used by a web application. These techniques have been proven useful in inferring taints that may cause XSS. Nevertheless, some practical issues still need to be investigated further, as discussed in the next section.
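The longest-common-subsequence idea underlying a tool such as NoXSS can be sketched with the standard dynamic-programming recurrence; the scoring and threshold choices here are illustrative, not any tool's actual implementation:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def similarity(param: str, window: str) -> float:
    """LCS length normalized by the parameter length: 1.0 means every
    character of the parameter survives, possibly with gaps in between."""
    return lcs_length(param, window) / len(param) if param else 0.0

# A response window in which magic quotes stripped the leading double quote:
ratio = similarity('"><script>', '><script>')  # high despite the lost quote
```

Unlike exact matching, the LCS ratio stays high when the server drops individual characters, so a threshold on it tolerates simple sanitization.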
III. Motivation
- 1. URL Rewriting
URL rewriting aims to improve the appearance of a given URL. It adds a layer of abstraction between the files used to generate a web page and the URL that is presented to the outside world. Most web servers and web frameworks support URL rewriting, either directly or through extension modules.
Normally, a standard URL looks something like the following:

http://www.example.com/products.php?category=5&id=123

URLs of this kind are prevalent in dynamically generated web pages. However, they have many problems:
  • It exposes the underlying technology, which gives potential hackers clues as to what they should send along with the query string to perform a front-door attack on the site.
  • If the language that the website is based on is changed (to PHP, for instance), all old URLs will stop working.
  • The URL is littered with awkward punctuation, such as question marks and ampersands.
  • Many search engines will not index a site in depth if it contains links to such dynamic pages.
Luckily, using rewriting, we can clean up this URL into something far more manageable, such as the following:

http://www.example.com/products/5/123

This URL is more logical, readable, and memorable, and it will be picked up by search engines. The faux directories are short and descriptive. In addition, it looks more permanent.
Nevertheless, there can be drawbacks as well. A URL is the most prevalent input source of XSS. Existing taint inference techniques rely on the standard parameter encoding of URLs to locate and extract user inputs. In such encodings, parameters are located after a question mark and separated by ampersands, and each parameter has a name and a value joined by an equals sign. It is easy to parse a standard URL to extract parameter values as the user input to be inferred. However, in the context of URL rewriting, it is hard to extract parameters on the client side, since we do not know the rewriting rules of the server; the only thing we can determine is that parameters may exist in the URL beyond the domain part. Without precise information about the input parameters, the effectiveness of existing taint inference techniques drops dramatically. This motivates us to propose a practical method that infers taints without relying on exact URL parameter locations.
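Local sequence alignment sidesteps this problem because it finds the best-matching region of the URL without needing parameter delimiters at all. Below is a minimal Smith–Waterman sketch; the scoring parameters and the example URL are illustrative and do not reproduce the paper's exact scheme (in particular, the removal gap penalty is omitted here):

```python
def smith_waterman(s: str, t: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1):
    """Best local-alignment score of s against t, and where the match ends
    in t (Smith-Waterman dynamic programming, score matrix H)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_end = 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Local alignment: scores never drop below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_end = H[i][j], j
    return best, best_end

# Locate a suspicious fragment inside a rewritten URL that has no
# question marks, ampersands, or equals signs to anchor the search:
url = "http://example.com/blog/<script>/view"
score, end = smith_waterman("<script>", url)
```

Because the alignment is local, the fragment is found wherever the rewriting rules happened to place it, and small server-side edits only lower the score instead of defeating the match outright.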
- 2. HTML Sanitization
HTML sanitization is the process of examining an HTML document and producing a new HTML document that preserves only those tags that are deemed safe. HTML sanitization can be used to protect against XSS attacks by sanitizing any HTML code submitted by a user. Basic tags for changing fonts are often allowed, such as <b>, <i>, <u>, <em>, and <strong>, while more advanced tags such as