Using Non-ASCII / Unicode URLs on Your Web Site

by Oliver 24. August 2012 16:22

We’re still working on Marinas.info and were wondering whether we should change any of the behavior we use on Camping.info. There we allow all kinds of Unicode characters in URLs, from those in Eastern European languages such as Polish to the Cyrillic letters of the Russian alphabet, but we encode them properly using HttpServerUtility.UrlPathEncode(). This is the percent-encoding that RFC 3986 on URIs defines in section 2.1. It also means that all links rendered on Camping.info pages are correctly encoded and will work in all browsers, even text-based ones.
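To make the encoding step concrete, here is a minimal sketch in Python rather than ASP.NET. It is not our production code: urllib.parse.quote only stands in for UrlPathEncode() on the assumption that both percent-encode the UTF-8 bytes of non-ASCII characters, and encode_path is a made-up helper name.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of each non-ASCII character.

    Hypothetical helper for illustration; '/' is kept so that the
    path segments stay intact.
    """
    return quote(path, safe="/")


# "\u00f6" is the letter ö; its UTF-8 bytes are C3 B6.
print(encode_path("/\u00f6sterreich/nieder\u00f6sterreich"))
# prints /%C3%B6sterreich/nieder%C3%B6sterreich
```

Plain ASCII characters such as letters, digits and hyphens pass through unchanged, so only the special characters grow into %XX sequences.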

Problems with Internet Explorer

The drawback of encoded URLs is that Internet Explorer will not decode them in the address bar the way the other major browsers do (Firefox, Chrome, Opera; I haven’t checked Safari), so all those Internet Explorer users out there will see something like http://pl.camping.info/polska/%C5%9Bl%C4%85skie/camping-pod-d%C4%99bowcem-20149/po%C5%82o%C5%BCenie, which is, to put it mildly, unreadable. Go ahead and copy that URL into any of the other browsers and they will display it as http://pl.camping.info/polska/śląskie/camping-pod-dębowcem-20149/położenie but behind the scenes they will still use the encoded URL.
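What the friendlier browsers do for display can be approximated in a few lines. This is only a sketch of the idea in Python, not how any browser actually implements it: decode the percent-encoded URL for the address bar, but keep sending the encoded form.

```python
from urllib.parse import quote, unquote

encoded = (
    "http://pl.camping.info/polska/%C5%9Bl%C4%85skie/"
    "camping-pod-d%C4%99bowcem-20149/po%C5%82o%C5%BCenie"
)

# What the address bar would show: percent sequences decoded as UTF-8.
readable = unquote(encoded)
print(readable)

# What goes over the wire: the readable form encoded again.
# ':' and '/' are kept so that the scheme and path separators survive.
sent = quote(readable, safe=":/")
print(sent == encoded)  # prints True
```

The round trip is lossless here, which is why showing the readable form costs those browsers nothing.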

Another problem occurs when you enter a URL that contains non-ASCII characters directly into IE’s address bar. The initial page load will succeed because IE properly encodes the URL. But once you POST to the same page, IE changes its behavior in a way that is, in my eyes, inconsistent, and the request fails with an error.

A deeper look behind the IE scenes

This is what the error looks like in Fiddler:

image

As you can see, IE replaced the Polish special characters in the first three URL segments and encoded only the last part, which is also what’s inside the form’s action attribute:

image

Who would have thought! As you can see in the first screenshot, something similar happens for the URL www.camping.info/österreich/niederösterreich, with the difference that IE replaces the ö with the byte F6 in hex, or 246 in decimal. To make the following screenshots, I saved the whole request (as Fiddler intercepted it) to a file and looked at it in the hex editor HxD:

image

According to the table at http://www.utf8-chartable.de/, F6 (U+00F6) is the Unicode code point for the lowercase letter ö, whose UTF-8 representation is C3 B6, which we can find, e.g., in the Referer header of the same request:

image

So it turns out IE uses 3(!) different encodings to transmit the same letter ö: the single byte F6 (its Unicode code point, which is also its ISO-8859-1 encoding), the percent-encoded version proposed by RFC 3986 (%C3%B6), and the raw UTF-8 bytes C3 B6. Wow! Unfortunately, IIS and our application don’t play well with that.
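The three forms of ö are easy to reproduce. The sketch below is Python, not something we run on the server, and the last part rests on an assumption: that the receiving side tries to read the request path as UTF-8. We have not verified that this is what actually trips up IIS; is_valid_utf8 is a hypothetical helper.

```python
from urllib.parse import quote

letter = "\u00f6"  # the letter ö

code_point = ord(letter)              # 0xF6, i.e. 246 in decimal
single_byte = letter.encode("latin-1")  # b'\xf6', the byte IE put in the path
utf8_bytes = letter.encode("utf-8")     # b'\xc3\xb6', as seen in the referrer
percent_encoded = quote(letter)         # '%C3%B6', as in the form action


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical check: can these raw bytes be read as UTF-8?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"/" + utf8_bytes + b"sterreich"))   # prints True
print(is_valid_utf8(b"/" + single_byte + b"sterreich"))  # prints False
```

A lone F6 byte is never valid UTF-8, so a receiver that expects UTF-8 cannot recover the ö from it without guessing at another encoding.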

Conclusion - Support Encoded URLs anyway

We decided to support those encoded URLs anyway for our new portals, including Marinas.info, so that even the URLs of our pages reflect their content for SEO. Maybe IE 10 will decode those URLs in the address bar and get a grip on handling URLs and form actions uniformly, for its users’ sake!

Happy encoding!

About Oliver

I build web applications using ASP.NET and have a passion for JavaScript. Enjoy MVC 4 and Orchard CMS, and I do TDD whenever I can. I like clean code. Love to spend time with my wife and our children.
