Unicode and the Copyright Symbol

Here’s how I solved the copyright character being changed to an invalid Unicode character block on source control commits.

I ran into an interesting problem with the copyright symbol recently: ©

The symbol appears in the comments of a number of source code files that I work on. And I was noticing that using Subversion‘s difference tool was complaining that my editor was producing an invalid character. The © symbol was changing to that familiar rectangle that says the character is invalid.

I’m using two editors, Notepad++ and IntelliJ Ultimate. I trust both to do the right thing.

Using Cygwin, I was able to dump the file and see what was going on.
$ od -ctx1 filename

In the original file, the © symbol was rendered using 0xA9 (ascii 169). This byte was in that magical 128-255 block, and under certain conditions, with the right code page and font loaded, it would appear as ©, though in many others it would just appear as an invalid character block.

However, if I removed the offensive byte and replaced it with the copyright character, a dump showed something more UTF-8 looking, a byte pair for the character: 0xC2 0xA9

Subversion’s tool then rendered the character properly, as well the compilers and editors working well.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.