Tuesday, January 02, 2007

URL encoding and XML/HTML encoding

Recently in one of our applications we were passing XML data as a POST parameter to the server, for example xml={xml string here}

We were using XML encoding to replace special characters in the XML. For example, '&' was replaced with '&amp;', '<' was replaced with '&lt;', and so on.

But on the server side, whenever we tried to read the XML string as a request parameter, we got a truncated string. After putting in a sniffer, I found out why this was happening: I was confusing URL encoding with XML encoding.

In a URL query string or a form-encoded POST body, as we know, '&' is the separator between parameters. Replacing it with '&amp;' is XML encoding, not URL encoding, and does not help here. So what if '&' is part of the data, e.g. an address that contains '&'? In this case the URL encoding function should replace '&' with '%26'. Otherwise the server would treat it as a parameter separator and break the string on the first occurrence of '&'.
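
To make the truncation concrete, here is a small sketch in plain Java. The parseForm helper is my own simplified stand-in for what a servlet container does before request.getParameter is called, and the address value "Smith & Sons" is made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDemo {

    // Simplified stand-in for a container's form parsing (an assumption, not
    // real container code): split on '&', then on '=', then URL-decode.
    static Map<String, String> parseForm(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String value = "Smith & Sons"; // hypothetical address value

        // Raw '&' is taken as a separator, so the value is cut short.
        System.out.println("[" + parseForm("address=" + value).get("address") + "]");
        // prints [Smith ]

        // URL-encoded, the '&' travels as %26 and survives the round trip.
        System.out.println("[" + parseForm("address=" + encode(value)).get("address") + "]");
        // prints [Smith & Sons]
    }
}
```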

Now our XML string (passed as a POST parameter) contained the '&' character, because every XML-encoded special character begins with '&'. So the trick in this case was to perform URL encoding of the element values after the XML encoding. Confused??

Ok...consider the following XML element:
Now the above element data contains an & and hence would result in an XML parsing error. Hence I XML encode the data. So it would now look like:

But when this string is passed as an HTTP GET/POST parameter, it would be truncated at the '&'.
Hence we have to URL encode it, replacing & with %26
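
The two steps can be sketched in Java as follows. The element name and value are made up for illustration, xmlEscape is a minimal hand-rolled helper rather than a library call, and for simplicity this sketch URL encodes the whole XML string rather than individual element values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class XmlThenUrlEncode {

    // Minimal XML escaping for element text. '&' must be replaced first,
    // otherwise the '&' of the entities produced below would be escaped again.
    static String xmlEscape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Step 1: XML encode the data (hypothetical element and value).
        String xml = "<address>" + xmlEscape("Smith & Sons") + "</address>";
        System.out.println(xml);
        // prints <address>Smith &amp; Sons</address>

        // Step 2: URL encode the XML before using it as a parameter value.
        String body = "xml=" + urlEncode(xml);
        System.out.println(body);
        // prints xml=%3Caddress%3ESmith+%26amp%3B+Sons%3C%2Faddress%3E
    }
}
```

The order matters: XML encoding first makes the document well formed, and URL encoding second protects the '&' of each entity on the wire.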

So when the string reaches the server, request.getParameter performs the URL decoding, and we can then pass the XML string to Castor or any other XML parser.
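
A rough sketch of the server side, outside a servlet container: URLDecoder.decode stands in for the decoding that request.getParameter does for us, and the JDK's built-in DOM parser stands in for Castor. The encoded string is a made-up example.

```java
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DecodeThenParse {

    // Stand-in for the URL decoding done inside request.getParameter.
    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Parses the XML and returns the text of the root element. The XML
    // parser, not the URL decoder, turns &amp; back into '&'.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical parameter value as it travels on the wire.
        String wire = "%3Caddress%3ESmith+%26amp%3B+Sons%3C%2Faddress%3E";

        String xml = urlDecode(wire);
        System.out.println(xml);           // prints <address>Smith &amp; Sons</address>
        System.out.println(rootText(xml)); // prints Smith & Sons
    }
}
```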

Another good design that avoids all this confusion is to pass the XML in the request body stream and read it on the server side using request.getInputStream().
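
A minimal sketch of that approach: in a real servlet the stream would come from request.getInputStream(), but here a ByteArrayInputStream with a made-up body stands in so the example runs on its own. The JDK's DOM parser is again used in place of Castor.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class BodyStreamDemo {

    // In a servlet, `in` would be request.getInputStream(). The body is plain
    // XML, so no URL decoding step is involved at all.
    static String rootText(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed XML body", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical request body: XML encoded only, not URL encoded.
        byte[] body = "<address>Smith &amp; Sons</address>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(rootText(new ByteArrayInputStream(body)));
        // prints Smith & Sons
    }
}
```

Since the XML is the whole body rather than a parameter value, the '&' in it is never seen by the form parser and only the XML encoding is needed.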

ASP.NET likewise has two separate methods: HtmlEncode and UrlEncode.
HtmlEncode converts angle brackets, quotes, ampersands, etc. to their entity values to prevent the HTML parser from confusing them with markup.
UrlEncode converts spaces to "+" and most non-alphanumeric characters to their hex-encoded values. Again, this is to prevent the URL parser from misinterpreting an embedded ?, & or other special character.