I’m currently working on an application that takes content from various web resources, munges the content, stores it in a database, and on demand generates interactive web pages, which includes the ability to annotate content in a web editor. Things were humming along great for weeks until we got a stream of data which made the browser burp with a JavaScript syntax error.
Problem was, when I examined the automatically generated JavaScript, it looked perfectly good to my eyes.
So, I reduced the problem down to a very trivial case.
What would you suppose the following code block does in a browser?
<HTML>
<BODY>
start
<SCRIPT>
alert( "</SCRIPT>" );
</SCRIPT>
finish
</BODY>
</HTML>
To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.
However, in all browsers (IE 7, Firefox, Opera, and Safari) on all platforms (XP/Vista/OS X) it didn’t. The close tag inside the quoted literal terminated the scripting block, printing the closing punctuation.
Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.
So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.
I looked at the JavaScript documentation online, and the only things they suggest escaping are the single and double quotes, as well as the backslash, which does the escaping. (Note we’re using forward slashes, which require no escapes in a JavaScript string.)
I even got the 5th Edition of JavaScript: The Definitive Guide from O’Reilly, and on page 27, which lists the comprehensive escape sequences, there is nothing magical about the forward slash, nor this magic string.
In fact, if you start playing with other strings, you get these results:
<SCRIPT> …works
<A/B> …works
</STRONG> …works
<\/SCRIPT> …displays </SCRIPT>, and while I suppose you can escape a forward slash, there should be no need to. Ever. See prior example.
</SCRIPT> …breaks
</SCRIPTX> …works (note the extra character, an X)
With JavaScript, what’s in quotes is supposed to be flat, literal, uninterpreted, meaningless text.
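As a sanity check on that expectation, here is a small sketch that can be run in a JavaScript engine outside of any HTML page (Node.js, for example), so that no HTML parser ever sees the source. The variable names are mine, purely for illustration. It suggests that, as far as JavaScript alone is concerned, the escaped and unescaped spellings are the same string:

```javascript
// Run outside an HTML page (e.g. in Node.js), so only JavaScript's own
// string rules apply and no HTML parser is involved.
const plain = "</SCRIPT>";
const slashEscaped = "<\/SCRIPT>";     // backslash before the forward slash
const hexEscaped = "\x3C/SCRIPT\x3E";  // angle brackets written as hex escapes

// All three spellings denote the same nine-character string.
const allSame = plain === slashEscaped && plain === hexEscaped;
console.log(allSame); // true
```

So whatever is breaking the page, it is not JavaScript's handling of the string.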
It was after this I turned to ask for help from several security and web experts.
Security Concerns
Why security experts?
The primary concern is obviously cross-site scripting. We’re taking untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, a few comment characters, and shortly reopen a new <SCRIPT> block, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.
The Explanation
The explanation came from Phil Wherry.
As he puts it, the <SCRIPT> tag is content-agnostic. Which means the HTML Parser doesn’t know we’re in the middle of a JavaScript string.
What the HTML parser saw was this:
<HTML>
<BODY>
start
<SCRIPT>alert( "</SCRIPT>
" );
</SCRIPT>
finish
</BODY>
</HTML>
And there you have it, not only is the syntax error obvious now, but the HTML is malformed.
The processing of JavaScript doesn’t happen until after the browser has understood which parts are JavaScript. Until it sees that close </SCRIPT> tag, it doesn’t care what’s inside – quoted or not.
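To make that concrete, here is a deliberately simplified model of the scan, written in JavaScript. It is an illustration only, not any browser's actual algorithm, and the function name splitAtScriptEnd is made up for this sketch. The one assumption it encodes is that the parser looks for "</script" (in any letter case) followed by whitespace, a slash, or a closing angle bracket, and pays no attention to JavaScript quoting:

```javascript
// Simplified, hypothetical model of how an HTML parser locates the end of
// a script block. It knows nothing about JavaScript strings or comments.
function splitAtScriptEnd(afterOpenTag) {
  // "</script" in any case, followed by whitespace, "/" or ">".
  const endTag = /<\/script[\s\/>]/i;
  const match = endTag.exec(afterOpenTag);
  if (match === null) {
    // No end tag found: everything is treated as script text.
    return { script: afterOpenTag, rest: "" };
  }
  return {
    script: afterOpenTag.slice(0, match.index), // what gets handed to JavaScript
    rest: afterOpenTag.slice(match.index),      // what HTML parsing resumes with
  };
}

// Everything that followed the opening <SCRIPT> tag in the trivial test case.
const page = 'alert( "</SCRIPT>" );\n</SCRIPT>\nfinish';
const result = splitAtScriptEnd(page);
console.log(JSON.stringify(result.script)); // "alert( \""
```

The JavaScript engine is handed only the text up to the first match, which is an unterminated string. Under this model </SCRIPTX> and <\/SCRIPT> do not match the pattern, which lines up with the test results listed earlier.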
Turns out, we all have seen this problem in traditional programming languages before. Ever run across hard-to-read code where the indentation conveys a block that doesn’t logically exist? Same thing. In this case instead of curly braces or begin/end pairs, it was the start and end tags of the JavaScript.
Upstream Processing
Remember, this wasn’t hand-rolled JavaScript. It was produced by an upstream piece of code that generated the actual JavaScript block, which is much more complex than the example shown.
It receives an untrusted string which, before it can be shoved inside a JavaScript string, has to be not only sanitized but also escaped in such a way that the HTML parser cannot accidentally treat the string’s contents as a legal (or illegal!) tag.
To do this we need to build a helper function to scrub data that will directly be emitted as a raw JavaScript string.
- Escape all backslashes, replacing \ with \\, since backslash is the JavaScript escape character. This has to be done first so as not to escape other escapes we’re about to add.
- Escape all quotes, replacing ' with \', and " with \"; this stops the string from getting terminated.
- Escape all angle brackets, replacing < with \x3C, and > with \x3E; this stops the tags from getting recognized. (A lone backslash in front of the bracket is not enough: \</SCRIPT> still contains </SCRIPT> as far as the HTML parser is concerned.)
function safeJavaScriptStringLiteral(str) {
str = str.replace(/\\/g, "\\\\"); // escape backslashes first, so later escapes aren't doubled
str = str.replace(/'/g, "\\'"); // escape single quotes
str = str.replace(/"/g, "\\\""); // escape double quotes
str = str.replace(/</g, "\\x3C"); // escape open angle bracket as a hex code
str = str.replace(/>/g, "\\x3E"); // escape close angle bracket as a hex code
return str;
}
At this point we should have generated a JavaScript string which never has anything that looks like a tag in it, and which is perfectly safe to an XML parser. All that’s needed next is to emit the JavaScript surrounded by a <![CDATA[ … ]]> block, so an XML parser doesn’t get confused over embedded angle brackets.
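As a rough end-to-end check of the escaping steps, here is a single-pass sketch. The names escapeForInlineScript and ESCAPES are invented for this example, and it writes angle brackets as the hex escapes \x3C and \x3E so that no literal angle bracket is left in the output at all. It covers only the five characters discussed above; line terminators and other special characters would need handling too.

```javascript
// Hypothetical helper (the name is illustrative): escape a string so it can
// sit inside a quoted JavaScript literal in an inline script block.
const ESCAPES = {
  "\\": "\\\\", // backslash
  "'": "\\'",   // single quote
  '"': '\\"',   // double quote
  "<": "\\x3C", // open angle bracket as a hex escape
  ">": "\\x3E", // close angle bracket as a hex escape
};

function escapeForInlineScript(str) {
  // One pass over the input, so a backslash added by one escape is never
  // escaped a second time.
  return str.replace(/[\\'"<>]/g, (ch) => ESCAPES[ch]);
}

const untrusted = '</SCRIPT><SCRIPT>alert("gotcha")//';
const literal = '"' + escapeForInlineScript(untrusted) + '"';
const scriptBody = "alert( " + literal + " );";

// Nothing in the generated script can look like a tag to the HTML parser.
console.log(scriptBody.includes("<")); // false

// Evaluating the literal gives back the original, untouched text.
const roundTrip = new Function("return " + literal + ";")();
console.log(roundTrip === untrusted); // true
```

The round trip through new Function is there only to demonstrate that the escaping is lossless; it is not something the page itself would do.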
From a security perspective, I think this also goes to show that lone JavaScript fragment validation isn’t enough; one has to take it in the full context of the containing HTML parser. Pragmatically speaking, the JavaScript alone was valid, but once inside HTML, became problematic.
This does not work either, in both Firefox and IE:
<script type="text/javascript">
//<![CDATA[
c = '</script>';
//]]>
</script>
Shouldn’t the HTML parser recognize the CDATA section and leave it alone?
My take on what’s happening, Michiel, is that we humans tend to read things in a nested manner. You see the CDATA, and realize that the angle brackets aren’t going to screw things up. Then you see the script tags, and recognize what’s inside is code. Then you read the code and see it’s a string to print.
This is not how the XML parser sees things. It reads sequentially.
First it sees an open script, then it sees the CDATA which simply says escaping is not necessary. It then sees the close script tag, which by what appears to be coincidence does not need to be escaped, and it closes the script block there, since one is open.
The XML parser has no knowledge that what’s inside is JavaScript, and in fact, think back to the days before JavaScript-capable browsers. From its perspective, it sees some odd characters as the value of that script element.
Only moments later does it switch back into escaping mode and see a second closing tag, which is clearly invalid.
I think the real error is that, when an invalid tag is seen inside what would be the JavaScript block, quirks-mode browsers allow the syntax in the first place.
Think of it in terms of a language that you yourself don’t understand:
OPEN FEE FIE FOE CLOSE FOO CLOSE
All you understand is OPEN and CLOSE. You’re a simple parser. And you see invalid syntax.
You don’t know that FEE is the console object, FIE is the print directive, FOE is the start of literal text, and FOO is the end of literal text. You don’t care, either. You parse documents, you’re not a language. Therefore, you know nothing about when you should or shouldn’t ignore what looks like tags to you.
Thanks for the informative post!
The usage of “\” (back slash) to escape “/” (forward slash) is deprecated by Mozilla. When writing a JavaScript with the “</script>” closing tag embedded in a string literal, it may be best to avoid that construction. A better solution may be to replace the angle brackets with escaped character codes, “\x3C” and “\3E” respectively.
Any better solutions?
This solved my problem too! Thanks!
I have one small addition: in James’s reply I think it should say
\x3C and \x3E
not \3E
Thanks again.
Walt, what you said about the HTML parser not understanding JavaScript string literals makes perfect sense, but your response to Michiel didn’t make as much sense to me.
In Michiel’s case, the XML/HTML parser wouldn’t need to know anything about JavaScript. When the CDATA block is opened, it should read forward until the CDATA block is closed with ]]>, and if the HTML is malformed and the CDATA block is never closed, well then the entire rest of the document is enclosed in the CDATA inside the <script> element and would not be parsed as HTML. Isn’t that the way it should be? The CDATA block isn’t a separate language (like JavaScript) that the HTML parser doesn’t understand; the CDATA block is part of the XML/XHTML specification that the parser SHOULD understand and follow. Just as JavaScript understands that anything enclosed in quotes is a literal, the HTML parser should understand that anything enclosed in <![CDATA[ ]]> is a literal.
I encountered a similar problem and to solve the problem on my end I HTML-encoded the JS string and before I used it it was decoded back.
A simple solution is to replace “</script>” with “</scr”+”ipt>”.
@rlively: an XML parser, sure, but web browsers use HTML parsers to parse HTML. Remember that whole thing about XHTML? It was a bit of a joke: if you scroll down to the bottom of the XHTML 1.0 spec, you’ll see a compatibility section, explaining what you have to do to get a loose HTML parser to accept your XHTML. That forward slash in your BR tags? Just a silent error on every page you serve up as text/html.
Same story with CDATA tags – to an HTML parser, they mean nothing and do nothing.
@cheeming: that’s the perfect solution. When you embed one language in another, escape according to the host language.
rlively: I agree with you about this weird discrepancy. Perhaps it has to do with the content-type the document is served with (and thus the parser being used)? That is, if the document is being treated as HTML, I don’t think the <![CDATA[ means anything.
I’m a bit surprised that you left out what occurs when you link to a script such as that.
Linking to a script file works as you would expect it to, as the HTML parser does not get in the way.
My guess is that you’ve got a program generating the script and that makes things slightly less trivial to switch to a linked script instead, but may be worth your while.
An even simpler solution is to use external script files, instead of relying on inline script blocks.
Alternatively, always escape forward slashes. Much faster than your safeJavaScriptStringLiteral function.
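A rough sketch of that last suggestion, for anyone who wants to try it. The helper name escapeForwardSlashes is made up here, and it assumes quotes and backslashes in the text have already been dealt with; it only addresses the closing-tag problem:

```javascript
// Hypothetical helper: escape every forward slash in text that is about to
// be placed inside a JavaScript string literal. Quotes and backslashes are
// assumed to have been escaped already.
function escapeForwardSlashes(str) {
  return str.replace(/\//g, "\\/");
}

const escaped = escapeForwardSlashes("</script>");
console.log(escaped); // <\/script>

// The HTML parser no longer finds "</script" in the generated text...
console.log(/<\/script/i.test(escaped)); // false

// ...while JavaScript still reads the literal as the original string.
console.log(new Function('return "' + escaped + '";')()); // </script>
```

This must only be applied to the contents of string literals, not to a whole script, since a backslash in front of a division operator or a comment marker would be a syntax error.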