Using </SCRIPT> In A JavaScript Literal

Today I got bitten by a very interesting bug involving the </SCRIPT> tag. If you’re writing code that generates code, you want to know about this.

I’m currently working on an application that takes content from various web resources, munges the content, stores it in a database, and on demand generates interactive web pages, including the ability to annotate content in a web editor. Things were humming along great for weeks until we got a stream of data which made the browser burp with a JavaScript syntax error.

Problem was, when I examined the automatically generated JavaScript, it looked perfectly good to my eyes.

So, I reduced the problem down to a very trivial case.

What would you suppose the following code block does in a browser?

<HTML>
<BODY>
  start
  <SCRIPT>
    alert( "</SCRIPT>" );
  </SCRIPT>
  finish
</BODY>
</HTML>

Try it and see.

To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.

However, in every browser I tried (IE 7, Firefox, Opera, and Safari) on every platform (XP, Vista, OS X), it didn’t. The close tag inside the quoted literal terminated the script block, and the leftover closing punctuation, " );, was printed on the page.

Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.

So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.

I looked at the JavaScript documentation online, and the only things they suggest escaping are the single and double quotes, as well as the backslash, which does the escaping. (Note we’re using forward slashes, which require no escapes in a JavaScript string.)

I even got the 5th Edition of JavaScript: The Definitive Guide from O’Reilly, and on page 27, which lists the comprehensive escape sequences, there is nothing magical about the forward slash, nor this magic string.

In fact, if you start playing with other strings, you get these results:
  <SCRIPT> …works
  <A/B> …works
  </STRONG> …works
  <\/SCRIPT> …displays </SCRIPT>, and while I suppose you can escape a forward slash, there should be no need to. Ever. See prior example.
  </SCRIPT> …breaks
  </SCRIPTX> …works (note the extra character, an X)

With JavaScript, what’s in quotes is supposed to be flat, literal, uninterpreted, meaningless text.

It was after this I turned to ask for help from several security and web experts.

Security Concerns

Why security experts?

The primary concern is obviously cross-site scripting. We’re taking untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, a few comment characters, and shortly reopen a new <SCRIPT> block, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.
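
To make the risk concrete, here is a minimal sketch of a naive generator. The function name naiveInlineScript, the variable names, and the payload are all made up for illustration; none of them come from the real application.

```javascript
// Hypothetical generator: pastes untrusted text straight into a JavaScript
// string literal inside an inline script block, with no escaping at all.
function naiveInlineScript(untrusted) {
  return '<SCRIPT>var note = "' + untrusted + '";</SCRIPT>';
}

// An attacker-controlled value that closes the block and opens a new one.
// The trailing // comments out the leftover quote and semicolon.
const payload = '</SCRIPT><SCRIPT>alert(document.cookie)//';

const page = naiveInlineScript(payload);
console.log(page);
// <SCRIPT>var note = "</SCRIPT><SCRIPT>alert(document.cookie)//";</SCRIPT>
```

The browser sees two script blocks here. The first is a syntax error and is thrown away; the second is the attacker’s code, and it runs.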

The Explanation

The explanation came from Phil Wherry.

As he puts it, the <SCRIPT> tag is content-agnostic. Which means the HTML Parser doesn’t know we’re in the middle of a JavaScript string.

What the HTML parser saw was this:

<HTML>
<BODY>
  start
  <SCRIPT>alert( "</SCRIPT>
  " );
  </SCRIPT>
  finish
</BODY>
</HTML>

And there you have it. Not only is the syntax error obvious now, but the HTML is malformed as well.

The processing of JavaScript doesn’t happen until after the browser has understood which parts are JavaScript. Until it sees that close </SCRIPT> tag, it doesn’t care what’s inside – quoted or not.
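
The parser’s view can be approximated in a few lines. This is only a rough model, assuming the rule that a script block ends at the first </script (in any case) followed by whitespace, a slash, or a closing angle bracket; real parsers have more rules than that. The function name findScriptEnd is made up for this sketch.

```javascript
// Simplified model of how an HTML parser finds the end of a script block:
// it scans the raw text for "</script" (any case) followed by whitespace,
// "/", or ">". It never looks at JavaScript quoting. Real parsers have more
// rules (for example around "<!--"), which this sketch ignores.
function findScriptEnd(text) {
  const match = /<\/script[\s\/>]/i.exec(text);
  return match ? match.index : -1;
}

// Script bodies built from the test strings listed earlier.
const samples = [
  'alert( "<SCRIPT>" );',
  'alert( "</STRONG>" );',
  'alert( "<\\/SCRIPT>" );', // the text contains <\/SCRIPT>
  'alert( "</SCRIPT>" );',
  'alert( "</SCRIPTX>" );',
];

for (const body of samples) {
  const end = findScriptEnd(body);
  console.log(body, end === -1 ? "works" : "breaks at offset " + end);
}
```

Only the </SCRIPT> sample reports a break, which matches what the browsers did. The extra X in </SCRIPTX> and the backslash in <\/SCRIPT> are both enough to stop the match.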

Turns out, we all have seen this problem in traditional programming languages before. Ever run across hard-to-read code where the indentation conveys a block that doesn’t logically exist? Same thing. In this case instead of curly braces or begin/end pairs, it was the start and end tags of the JavaScript.

Upstream Processing

Remember, this wasn’t hand-rolled JavaScript. It was produced by an upstream piece of code that generated the actual JavaScript block, which is much more complex than the example shown.

It is getting an untrusted string which, before being shoved inside a JavaScript string, not only has to be sanitized, but also escaped in such a way that the HTML parser cannot accidentally treat the string’s contents as a legal (or illegal!) tag.

To do this we need to build a helper function to scrub data that will directly be emitted as a raw JavaScript string.

Escape all backslashes, replacing \ with \\, since backslash is the JavaScript escape character. This has to be done first so as not to escape the escapes we’re about to add.
Escape all quotes, replacing ' with \', and " with \" — this stops the string from getting terminated.
Escape all angle brackets, replacing < with \<, and > with \> — this stops the tags from getting recognized.

private String safeJavaScriptStringLiteral(String str) {
  str = str.replace("\\", "\\\\"); // escape single backslashes (must come first)
  str = str.replace("'", "\\'");   // escape single quotes
  str = str.replace("\"", "\\\""); // escape double quotes
  str = str.replace("<", "\\<");   // escape open angle bracket
  str = str.replace(">", "\\>");   // escape close angle bracket
  return str;
}
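
One caveat with \< is that the less-than sign itself is still in the output, so an input such as </SCRIPT > comes out as \</SCRIPT \>, which still contains </SCRIPT followed by a space. Under today’s HTML parsing rules that is enough to end the script block. A variation raised in the comments below is to emit hex escapes instead, so that no angle bracket reaches the HTML parser at all. Here is a minimal sketch of that idea in JavaScript. The name escapeForInlineScript is made up, and the line-break handling is an addition that is not in the Java helper.

```javascript
// Sketch of an escaper for text that will sit inside a JavaScript string
// literal in an inline script block. Assumption: hex escapes (\x3C, \x3E)
// are used instead of \< and \>, as suggested in the comments.
function escapeForInlineScript(str) {
  return str
    .replace(/\\/g, "\\\\")  // backslashes first, so later escapes survive
    .replace(/'/g, "\\'")    // single quotes
    .replace(/"/g, '\\"')    // double quotes
    .replace(/</g, "\\x3C")  // no literal "<" is left for the HTML parser
    .replace(/>/g, "\\x3E")  // no literal ">" either
    .replace(/\r/g, "\\r")   // raw line breaks are illegal in a quoted
    .replace(/\n/g, "\\n");  // string literal, so escape those too
}

const escaped = escapeForInlineScript('</SCRIPT >');
console.log(escaped); // \x3C/SCRIPT \x3E
```

When the JavaScript engine later evaluates the literal, \x3C and \x3E turn back into < and >, so the program sees the original text.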

At this point we should have generated a JavaScript string which never has anything that looks like a tag in it. The escaped angle brackets are still angle brackets, though, so if the page is parsed as XML (XHTML), all that’s needed next is to emit the JavaScript surrounded by a <![CDATA[ … ]]> block so the parser doesn’t get confused over them. A plain HTML parser gives CDATA markers no special meaning inside a script block.

From a security perspective, I think this also goes to show that lone JavaScript fragment validation isn’t enough; one has to take it in the full context of the containing HTML parser. Pragmatically speaking, the JavaScript alone was valid, but once inside HTML, became problematic.

12 thoughts on “Using </SCRIPT> In A JavaScript Literal”

Michiel says:

June 21, 2007 at 11:28 am

This does not work either, in both Firefox and IE:

<script type="text/javascript">
//<![CDATA[
c = '</script>';
//]]>
</script>

Shouldn’t the HTML parser recognize the CDATA section and leave it alone?

Walt Stoneburner says:

June 21, 2007 at 12:24 pm

My take on what’s happening, Michiel, is that we humans tend to read things in a nested manner. You see the CDATA and realize that the angle brackets aren’t going to screw things up. Then you see the script tags, and recognize what’s inside is code. Then you read the code and see it’s a string to print.

This is not how the XML parser sees things. It reads sequentially.

First it sees an open script, then it sees the CDATA, which simply says escaping is not necessary. It then sees the close script tag, which by what appears to be coincidence does not need to be escaped, and it closes the script block there, since one is open.

The XML parser has no knowledge that what’s inside is JavaScript, and in fact, think back to the days before JavaScript-capable browsers. From its perspective, it sees some odd characters as the value of that script element.

Only moments later does it switch back into escaping mode and see a second closing tag, which is clearly invalid.

I think the real error is that quirks-mode browsers allow the syntax in the first place when an invalid tag is seen inside what would be the JavaScript block.

Think of it in terms of a language that you yourself don’t understand:

OPEN FEE FIE FOE CLOSE FOO CLOSE

All you understand is OPEN and CLOSE. You’re a simple parser. And you see invalid syntax.

You don’t know that FEE is the console object, FIE is the print directive, FOE is the start of literal text, and FOO is the end of literal text. You don’t care, either. You parse documents, you’re not a language. Therefore, you know nothing about when you should or shouldn’t ignore what looks like tags to you.

James says:

November 10, 2007 at 2:41 pm

Thanks for the informative post!

The usage of “\” (backslash) to escape “/” (forward slash) is deprecated by Mozilla. When writing JavaScript with the “</script>” closing tag embedded in a string literal, it may be best to avoid that construction. A better solution may be to replace the angle brackets with escaped character codes, “\x3C” and “\3E” respectively.

Any better solutions?

Hamilton says:

March 24, 2008 at 11:02 am

This solved my problem too! Thanks!
I have one small addition: in James’s reply I think it should say

\x3C and \x3E

not \3E

Thanks again.

rlively says:

June 2, 2008 at 1:23 pm

Walt, what you said about the HTML parser not understanding JavaScript string literals makes perfect sense, but your response to Michiel didn’t make as much sense to me.

In Michiel’s case, the XML/HTML parser wouldn’t need to know anything about JavaScript – when the CDATA block is opened, it should read forward until the CDATA block is closed with ]]> – and if the HTML is malformed and the CDATA block is never closed, well then the entire rest of the document is enclosed in the CDATA inside the element and would not be parsed as HTML. Isn’t that the way it should be? The CDATA block isn’t a separate language (like JavaScript) that the HTML parser doesn’t understand; the CDATA block is part of the XML/XHTML specification that the parser SHOULD understand and follow. Just as JavaScript understands that anything enclosed in quotes is a literal, the HTML parser should understand that anything enclosed in <![CDATA[ … ]]> is a literal.

cheeming says:

January 30, 2009 at 12:09 pm

I encountered a similar problem and to solve the problem on my end I HTML-encoded the JS string and before I used it it was decoded back.

Patrick Fisher says:

April 7, 2009 at 6:56 pm

A simple solution is to replace "</script>" with "</scr"+"ipt>".

Douglas says:

March 21, 2010 at 8:37 am

@rlively: an XML parser, sure, but web browsers use HTML parsers to parse HTML. Remember that whole thing about XHTML? It was a bit of a joke: if you scroll down to the bottom of the XHTML 1.0 spec, you’ll see a compatibility section, explaining what you have to do to get a loose HTML parser to accept your XHTML. That forward slash in your BR tags? Just a silent error on every page you serve up as text/html.

Same story with CDATA tags – to an HTML parser, they mean nothing and do nothing.

@cheeming: that’s the perfect solution. When you embed one language in another, escape according to the host language.

Jesse says:

March 21, 2010 at 8:43 am

rlively: I agree with you about this weird discrepancy. Perhaps it has to do with the content-type the document is served with (and thus the parser being used)? That is, if the document is being treated as HTML, I don’t think the <![CDATA[ means anything.

Josh Peters says:

March 21, 2010 at 11:16 am

I’m a bit surprised that you left out what occurs when you link to a script such as that.

Linking to a script file works as you would expect it to, as the HTML parser does not get in the way.

My guess is that you’ve got a program generating the script and that makes things slightly less trivial to switch to a linked script instead, but may be worth your while.

Jordan says:

March 23, 2010 at 4:05 pm

An even simpler solution is to use external script files, instead of relying on inline script blocks.

Alternatively, always escape forward slashes. Much faster than your safeJavaScriptStringLiteral function.

