XML – Walt-O-Matic

Using </SCRIPT> In A JavaScript Literal

I’m currently working on an application that takes content from various web resources, munges it, stores it in a database, and on demand generates interactive web pages, including the ability to annotate content in a web editor. Things were humming along great for weeks, until we got a stream of data that made the browser burp with a JavaScript syntax error.

Problem was, when I examined the automatically generated JavaScript, it looked perfectly good to my eyes.

So, I reduced the problem down to a very trivial case.

What would you suppose the following code block does in a browser?

<HTML>
<BODY>
  start
  <SCRIPT>
    alert( "</SCRIPT>" );
  </SCRIPT>
  finish
</BODY>
</HTML>

Try it and see.

To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.

However, in every browser I tried (IE 7, Firefox, Opera, and Safari) on every platform (XP/Vista/OS X), it didn’t. The close tag inside the quoted literal terminated the script block, and the leftover closing punctuation was printed on the page.

Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.

So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.

I looked at the JavaScript documentation online, and the only things they suggest escaping are the single and double quotes, as well as the backslash which does the escaping. (Note we’re using forward slashes, which require no escapes in a JavaScript string.)

I even got the 5th Edition of JavaScript: The Definitive Guide from O’Reilly, and on page 27, which lists the comprehensive escape sequences, there is nothing magical about the forward slash, nor this magic string.

In fact, if you start playing with other strings, you get these results:
  <SCRIPT> …works
  <A/B> …works
  </STRONG> …works
  <\/SCRIPT> …displays </SCRIPT>, and while I suppose you can escape a forward slash, there should be no need to. Ever. See prior example.
  </SCRIPT> …breaks
  </SCRIPTX> …works (note the extra character, an X)

With JavaScript, what’s in quotes is supposed to be flat, literal, uninterpreted, meaningless text.

It was after this that I turned to several security and web experts for help.

Security Concerns

Why security experts?

The primary concern is obviously cross-site scripting. We’re taking content from untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, then reopen a new <SCRIPT> block and add a few comment characters, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.
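
To make the threat concrete, here is a minimal sketch of the kind of generator that gets bitten. The names naiveScriptBlock and count, and the payload itself, are hypothetical and not taken from the real application; the point is only that plain string concatenation hands the attacker a second script block.

```java
// Hypothetical illustration; none of these names come from the real application.
final class NaiveEmbedDemo {

    // Naive generator: drops untrusted text straight into a JavaScript string literal.
    static String naiveScriptBlock(String untrusted) {
        return "<SCRIPT>var note = \"" + untrusted + "\";</SCRIPT>";
    }

    // Counts occurrences of a marker in the markup, ignoring letter case.
    static int count(String haystack, String marker) {
        int n = 0;
        for (int i = 0; i + marker.length() <= haystack.length(); i++) {
            if (haystack.regionMatches(true, i, marker, 0, marker.length())) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Close the block, reopen a new one, then comment out the leftover quote.
        String payload = "</SCRIPT><SCRIPT>alert(document.cookie)//";
        String html = naiveScriptBlock(payload);
        System.out.println(html);
        System.out.println("script start tags: " + count(html, "<script>"));
        System.out.println("script end tags:   " + count(html, "</script>"));
    }
}
```

One intended block has become two: the first dies with a syntax error, and the second runs whatever the attacker wrote.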

The Explanation

The explanation came from Phil Wherry.

As he puts it, the <SCRIPT> tag is content-agnostic, which means the HTML parser doesn’t know we’re in the middle of a JavaScript string.

What the HTML parser saw was this:

<HTML>
<BODY>
  start
  <SCRIPT>alert( "</SCRIPT>
  " );
  </SCRIPT>
  finish
</BODY>
</HTML>

And there you have it: not only is the syntax error obvious now, but the HTML is malformed too.

The processing of JavaScript doesn’t happen until after the browser has understood which parts are JavaScript. Until it sees that close </SCRIPT> tag, it doesn’t care what’s inside – quoted or not.
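
A rough way to picture this first pass is a scan for the end tag that pays no attention to JavaScript quoting. The sketch below is a simplified model, not a real HTML tokenizer (it ignores comment-like <!-- sequences, for instance), and the class and method names are made up for illustration. It follows the HTML5 tokenizer's rule that </script, in any letter case, ends the block only when followed by whitespace, / or >, which also accounts for </SCRIPTX> being harmless.

```java
// Simplified model of how an HTML parser finds the end of a script block.
// Illustrative names; this is not a complete tokenizer.
final class ScriptBoundaryDemo {

    // Returns the index where the script body ends, or -1 if no end tag is found.
    // Quotes are deliberately not tracked: the HTML parser does not know JavaScript.
    static int scriptEnd(String afterOpenTag) {
        final String marker = "</script";
        for (int i = 0; i + marker.length() < afterOpenTag.length(); i++) {
            if (afterOpenTag.regionMatches(true, i, marker, 0, marker.length())) {
                char next = afterOpenTag.charAt(i + marker.length());
                // Character.isWhitespace is slightly broader than HTML whitespace.
                if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                    return i;
                }
            }
        }
        return -1;
    }

    // Returns the text that would be handed to the JavaScript engine.
    static String scriptBody(String afterOpenTag) {
        int end = scriptEnd(afterOpenTag);
        return end < 0 ? afterOpenTag : afterOpenTag.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(scriptBody("alert( \"</SCRIPT>\" );</SCRIPT>"));
        System.out.println(scriptBody("alert( \"</SCRIPTX>\" );</SCRIPT>"));
        System.out.println(scriptBody("alert( \"<\\/SCRIPT>\" );</SCRIPT>"));
    }
}
```

For the first input the JavaScript engine only ever receives alert( " and chokes on the unterminated string; the other two inputs come through whole.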

Turns out, we all have seen this problem in traditional programming languages before. Ever run across hard-to-read code where the indentation conveys a block that doesn’t logically exist? Same thing. In this case instead of curly braces or begin/end pairs, it was the start and end tags of the JavaScript.

Upstream Processing

Remember, this wasn’t hand-rolled JavaScript. It was produced by an upstream piece of code that generated the actual JavaScript block, which is much more complex than the example shown.

It is getting an untrusted string which, to be shoved inside a JavaScript string, not only has to be sanitized but also escaped in such a way that the HTML parser cannot accidentally treat the string’s contents as a legal (or illegal!) tag.

To do this we need to build a helper function to scrub data that will directly be emitted as a raw JavaScript string.

Escape all backslashes, replacing \ with \\, since backslash is the JavaScript escape character. This has to be done first, so as not to escape the other escapes we’re about to add.
Escape all quotes, replacing ' with \', and " with \", which stops the string from getting terminated.
Escape all angle brackets, replacing < with \x3C, and > with \x3E, which stops the tags from getting recognized. A bare backslash in front of the bracket (\<) is not enough: the HTML parser ignores the backslash, so input such as </SCRIPT > (note the space) would still close the block.

private String safeJavaScriptStringLiteral(String str) {
  str = str.replace("\\", "\\\\"); // escape backslashes first, before adding more
  str = str.replace("'", "\\'");   // escape single quotes
  str = str.replace("\"", "\\\""); // escape double quotes
  str = str.replace("<", "\\x3C"); // hex escape for the open angle bracket
  str = str.replace(">", "\\x3E"); // hex escape for the close angle bracket
  return str;
}
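
To see the effect on the problem strings, here is a small standalone sketch. It restates the escaping steps in a hypothetical class, JsLiteralEscaper, and writes angle brackets as hex escapes (\x3C and \x3E) so that no literal < is left for the HTML parser to find, even in input like </script > where a space follows the tag name. The class name, alertStatement, and the sample strings are assumptions for illustration.

```java
// Standalone restatement for illustration; the class name, alertStatement and
// the sample inputs are assumptions, not part of the original application.
final class JsLiteralEscaper {

    static String safeJavaScriptStringLiteral(String str) {
        str = str.replace("\\", "\\\\"); // backslashes first, so later escapes survive
        str = str.replace("'", "\\'");   // single quotes
        str = str.replace("\"", "\\\""); // double quotes
        str = str.replace("<", "\\x3C"); // open angle bracket as a hex escape
        str = str.replace(">", "\\x3E"); // close angle bracket as a hex escape
        return str;
    }

    // Builds a one-statement script body around the escaped text.
    static String alertStatement(String untrusted) {
        return "alert(\"" + safeJavaScriptStringLiteral(untrusted) + "\");";
    }

    public static void main(String[] args) {
        String[] samples = { "</SCRIPT>", "</script >", "it's \"quoted\"", "C:\\temp" };
        for (String sample : samples) {
            String js = alertStatement(sample);
            // No raw angle bracket should survive for the HTML parser to find.
            System.out.println(js + "   contains '<': " + js.contains("<"));
        }
    }
}
```

The JavaScript engine turns \x3C back into < when it evaluates the literal, so the alert still shows the original text.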

At this point we should have generated a JavaScript string which never has anything in it that the HTML parser can read as a tag. If the page is also going to be read by an XML parser (XHTML, for example), all that’s needed next is to emit the JavaScript surrounded by a <![CDATA[ … ]]> block, so that parser doesn’t get confused over angle brackets elsewhere in the script.
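
A minimal sketch of that last step follows. The class and method names are assumptions, not code from the application. The CDATA markers are placed behind JavaScript line comments, a common convention, so that a browser reading the page as plain HTML (where CDATA is not recognized inside scripts) skips over them.

```java
// Illustrative helper; CdataScriptBlock and wrap are hypothetical names.
final class CdataScriptBlock {

    // Wraps JavaScript source in a script element whose content is a CDATA section.
    // The markers sit inside // comments so the JavaScript engine ignores them.
    static String wrap(String javaScript) {
        if (javaScript.contains("]]>")) {
            // "]]>" would end the CDATA section early, so refuse it outright.
            throw new IllegalArgumentException("script contains ]]>");
        }
        return "<script type=\"text/javascript\">\n"
             + "//<![CDATA[\n"
             + javaScript + "\n"
             + "//]]>\n"
             + "</script>";
    }

    public static void main(String[] args) {
        System.out.println(wrap("if (1 < 2) { alert(\"ok\"); }"));
    }
}
```

Rejecting ]]> is the simplest policy for a sketch; a generator that has to accept such scripts would need to split the sequence instead.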

From a security perspective, I think this also goes to show that lone JavaScript fragment validation isn’t enough; one has to take it in the full context of the containing HTML parser. Pragmatically speaking, the JavaScript alone was valid, but once inside HTML, became problematic.

Great XSLT Tool for OS X

While working on some XML and XSLT stuff, I ran into some strange problems where transformed XML content was making Firefox spin its wheels forever and Safari was having problems rendering XSL variables.

I wasn’t engaged in a browser-war shootout; I just wanted to know that the XSLT was correctly transforming the XML into the desired output. As various tools were slowly slipping from my fingertips, I figured I might just have to go back to the command line.

But then I discovered XSLPalette. It’s a “free, native, XSLT 2.0, XPath 2.0, and XQuery 1.0 debugging palette” for OS X (and it’s a Universal Binary).

All I have to say is that, as a developer, I’m impressed by how easy this tool makes it to try different XSLT engines. It does basically one thing, and does that one thing very, very well. I like that in developer tools.

You give the palette an XML file and an XSLT file, select the engine, and it does the transformation, showing you messages along the way, in addition to the transformed output, a collapsible view, and a browser-like rendered view.

Walt gives XSLPalette a thumbs up!

Category: XML
