I’m currently working on an application that takes content from various web resources, munges the content, stores it in a database, and on demand generates interactive web pages, which includes the ability to annotate content in a web editor. Things were humming along great for weeks until we got a stream of data which made the browser burp with a JavaScript syntax error.
Problem was, when I examined the automatically generated JavaScript, it looked perfectly good to my eyes.
So, I reduced the problem down to a very trivial case.
What would you suppose the following code block does in a browser?
<HTML>
<BODY>
start
<SCRIPT>
alert( "</SCRIPT>" );
</SCRIPT>
finish
</BODY>
</HTML>
To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.
However, in all browsers (IE 7, Firefox, Opera, and Safari) on all platforms (XP/Vista/OS X) it didn’t. The close tag inside the quoted literal terminated the scripting block, and the leftover closing punctuation, " );, was printed on the page.
Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.
So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.
I looked at the JavaScript documentation online, and the only things they suggest escaping are the single and double quotes, as well as the backslash which does the escaping. (Note we’re using forward slashes, which require no escapes in a JavaScript string.)
I even got the 5th Edition of JavaScript: The Definitive Guide from O’Reilly, and on page 27, which lists the comprehensive escape sequences, there is nothing magical about the forward slash, nor this magic string.
In fact, if you start playing with other strings, you get these results:
<SCRIPT> …works
<A/B> …works
</STRONG> …works
<\/SCRIPT> …displays </SCRIPT>, and while I suppose you can escape a forward slash, there should be no need to. Ever. See prior example.
</SCRIPT> …breaks
</SCRIPTX> …works (note the extra character, an X)
With JavaScript, what’s in quotes is supposed to be flat, literal, uninterpreted, meaningless text.
It was after this I turned to ask for help from several security and web experts.
Security Concerns
Why security experts?
The primary concern is obviously cross-site scripting. We’re taking untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, add a few comment characters, and then reopen a new <SCRIPT> block, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.
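To make the concern concrete, here is a minimal sketch of that attack against a generator that pastes untrusted text straight into a string literal. The names naiveEmbed and scriptBodies, the variable note, and the payload are all made up for illustration, and scriptBodies is only a rough stand-in for how an HTML parser carves a page into script blocks.

```javascript
// Naive embedding: drops the untrusted text straight into a string literal.
function naiveEmbed(untrusted) {
    return '<SCRIPT>\nvar note = "' + untrusted + '";\n</SCRIPT>';
}

// Rough stand-in for the HTML parser: lists the script bodies it would see.
function scriptBodies(html) {
    var re = /<script>([\s\S]*?)<\/script[\t\n\f\r \/>]/gi;
    var bodies = [];
    var m;
    while ((m = re.exec(html)) !== null) {
        bodies.push(m[1]);
    }
    return bodies;
}

// Close the block, reopen a new one, and comment out the trailing quote.
var payload = '</SCRIPT><SCRIPT>alert(document.cookie)//';
var bodies = scriptBodies(naiveEmbed(payload));

console.log(bodies.length); // 2 blocks instead of the intended 1
console.log(bodies[1]);     // the attacker's code, now a script of its own
```

The first block dies with a syntax error, but the second one is well-formed JavaScript that runs with the full trust of the page.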
The Explanation
The explanation came from Phil Wherry.
As he puts it, the <SCRIPT> tag is content-agnostic. Which means the HTML Parser doesn’t know we’re in the middle of a JavaScript string.
What the HTML parser saw was this:
<HTML>
<BODY>
start
<SCRIPT>alert( "</SCRIPT>
" );
</SCRIPT>
finish
</BODY>
</HTML>
And there you have it. Not only is the syntax error obvious now, but the HTML is malformed as well.
The processing of JavaScript doesn’t happen until after the browser has understood which parts are JavaScript. Until it sees that close </SCRIPT> tag, it doesn’t care what’s inside – quoted or not.
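A rough way to picture this is a function that looks for the end of the script block the way the HTML parser does, with no knowledge of JavaScript quoting. This is a simplified sketch, not a real tokenizer: findScriptEnd is a made-up name, and it assumes the block ends at the first </SCRIPT (in any case) that is followed by whitespace, a slash, or a closing angle bracket, ignoring the special handling of <!-- inside scripts.

```javascript
// Returns the offset at which an HTML parser would end the script block,
// or -1 if nothing in the text looks like a closing SCRIPT tag.
function findScriptEnd(scriptBody) {
    var match = /<\/script[\t\n\f\r \/>]/i.exec(scriptBody);
    return match ? match.index : -1;
}

// The strings tried earlier, each placed inside the same alert() call.
var samples = ["<SCRIPT>", "<A/B>", "</STRONG>", "<\\/SCRIPT>", "</SCRIPT>", "</SCRIPTX>"];
for (var i = 0; i < samples.length; i++) {
    var body = 'alert( "' + samples[i] + '" );';
    console.log(samples[i], findScriptEnd(body) === -1 ? "works" : "breaks");
}
```

Only the exact close tag trips the check, which matches what the browsers did. The extra X in </SCRIPTX> is enough to make it a different tag name.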
Turns out, we all have seen this problem in traditional programming languages before. Ever run across hard-to-read code where the indentation conveys a block that doesn’t logically exist? Same thing. In this case instead of curly braces or begin/end pairs, it was the start and end tags of the JavaScript.
Upstream Processing
Remember, this wasn’t hand-rolled JavaScript. It was produced by an upstream piece of code that generated the actual JavaScript block, which is much more complex than the example shown.
It is getting an untrusted string which, to be shoved inside of a JavaScript string, not only has to be sanitized, but also escaped in such a way that the HTML parser cannot accidentally treat the string’s contents as a legal (or illegal!) tag.
To do this we need to build a helper function to scrub data that will directly be emitted as a raw JavaScript string.
- Escape all backslashes, replacing \ with \\, since backslash is the JavaScript escape character. This has to be done first, so as not to escape the other escapes we’re about to add.
- Escape all quotes, replacing ' with \', and " with \". This stops the string from getting terminated.
- Escape all angle brackets, replacing < with \x3C, and > with \x3E. This stops the tags from getting recognized. A bare backslash in front (\<) is not enough, because the HTML parser ignores the backslash and still sees the angle bracket behind it.
// escapeForScript is an illustrative name for the helper.
// The /g flag makes each replace() cover every occurrence, not just the first.
function escapeForScript(str) {
    str = str.replace(/\\/g, "\\\\"); // escape backslashes (must run first)
    str = str.replace(/'/g, "\\'");   // escape single quotes
    str = str.replace(/"/g, "\\\"");  // escape double quotes
    str = str.replace(/</g, "\\x3C"); // escape open angle bracket
    str = str.replace(/>/g, "\\x3E"); // escape close angle bracket
    return str;
}
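An alternative sketch, assuming the upstream generator itself runs somewhere the standard JSON.stringify is available: let it handle quotes, backslashes, and newlines, and only post-process the characters the HTML parser cares about. The helper name toScriptSafeLiteral and the sample payload are made up for illustration. Note that it returns the literal including its surrounding double quotes.

```javascript
// Hypothetical helper: turns any value into a JavaScript string literal
// that contains no angle brackets and no raw line separators.
function toScriptSafeLiteral(value) {
    return JSON.stringify(String(value))
        .replace(/</g, "\\u003C")        // hide tag openers from the HTML parser
        .replace(/>/g, "\\u003E")        // hide tag closers as well
        .replace(/\u2028/g, "\\u2028")   // line separator, unsafe in older engines
        .replace(/\u2029/g, "\\u2029");  // paragraph separator, likewise
}

var untrusted = '</SCRIPT><SCRIPT>alert("gotcha")</SCRIPT>';
var block = "<SCRIPT>\nalert( " + toScriptSafeLiteral(untrusted) + " );\n</SCRIPT>";
console.log(block);
```

The only close tag left in the emitted block is the one the generator wrote on purpose, and the browser’s JavaScript engine turns the \u003C and \u003E escapes back into the original characters when it evaluates the string.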
At this point we should have generated a JavaScript string which never has anything that looks like a tag in it. If the page is served as XHTML, the JavaScript should also be emitted surrounded by a <![CDATA[ … ]]> block, so the XML parser doesn’t get confused over other angle brackets in the script, such as comparison operators. An HTML parser does not recognize CDATA sections inside a script block, so there the escaping alone has to do the work.
From a security perspective, I think this also goes to show that lone JavaScript fragment validation isn’t enough; one has to take it in the full context of the containing HTML parser. Pragmatically speaking, the JavaScript alone was valid, but once inside HTML, became problematic.