International Character Set in HTML Forms

The default character set (or codepage) for HTML documents and forms is ISO-8859-1 (also known as Latin 1). This default can be changed to another ISO character set in the Server Administration Program (or in the server.ini file). The only requirement is that the server and the clients (browsers) use the same charset.

However, if the HTML page containing the form is to be used internationally, a solution that handles all kinds of characters may be needed. That solution is UTF-8, a byte-oriented encoding of the Unicode character set. Text encoded in UTF-8 can be converted to Unicode and back again without loss.
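The round trip between Unicode text and UTF-8 bytes can be illustrated with a minimal Python sketch. The function name utf8_round_trip is an assumption made for this example and is not part of NetPhantom.

```python
def utf8_round_trip(text):
    """Encode Unicode text to UTF-8 bytes and decode it back again.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode("utf-8")     # Unicode string -> UTF-8 bytes
    decoded = encoded.decode("utf-8")  # UTF-8 bytes -> Unicode string
    return encoded, decoded


if __name__ == "__main__":
    sample = "Malmö, Ωmega, 日本語"
    encoded, decoded = utf8_round_trip(sample)
    # Non-ASCII characters take more than one byte, so the byte count
    # is larger than the character count.
    print(len(sample), "characters,", len(encoded), "bytes")
    print("lossless:", decoded == sample)
```

ASCII characters keep their single-byte form, which is why UTF-8 pages remain readable to software that only expects ASCII.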

The complete solution is to set two attributes in the form tag: enctype="multipart/form-data" and accept-charset="utf-8".

The default content type for forms is application/x-www-form-urlencoded. When the user presses a Submit button, the browser sends the form field names and values back to the server. Each name is joined to its value by the "=" sign, and the resulting pairs are placed in the header field data, separated by the "&" sign.
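As a rough illustration of the urlencoded format, the following Python sketch encodes the same two fields under two different charsets. The helper name encode_form and the field names are assumptions for this example. It shows why the server and the browser must agree on the charset: the same characters produce different percent-escapes.

```python
from urllib.parse import urlencode, parse_qs


def encode_form(fields, charset):
    """Encode form fields as application/x-www-form-urlencoded text.

    Hypothetical helper; the charset decides which bytes are escaped.
    """
    return urlencode(fields, encoding=charset)


if __name__ == "__main__":
    fields = {"name": "Åsa", "city": "Malmö"}

    print(encode_form(fields, "utf-8"))    # name=%C3%85sa&city=Malm%C3%B6
    print(encode_form(fields, "latin-1"))  # name=%C5sa&city=Malm%F6

    # Decoding only gives the original text back when the same charset is used.
    print(parse_qs(encode_form(fields, "utf-8"), encoding="utf-8"))
```

Nothing in the encoded string itself says which charset was used, so a receiver that guesses wrong silently produces garbled characters.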

multipart/form-data is a content type created so that forms can send binary data, such as files. It is defined in RFC 2388 (since superseded by RFC 7578) and builds on the multipart media types described in RFC 2046. NetPhantom does not support file uploads, but this format makes it possible for the browser to send any kind of character data to the server.
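To make the format more concrete, here is a minimal Python sketch that builds a multipart/form-data body by hand. The function name build_multipart and the boundary value are assumptions for this example; a real browser chooses its own boundary and may add further headers to each part.

```python
def build_multipart(fields, boundary, charset="utf-8"):
    """Build a simple multipart/form-data body from text fields.

    Hypothetical helper: each field becomes one part, and the value is
    written as raw bytes in the given charset (no percent-escaping).
    """
    lines = []
    for name, value in fields.items():
        lines.append(("--" + boundary).encode("ascii"))
        lines.append(
            ('Content-Disposition: form-data; name="%s"' % name).encode("ascii")
        )
        lines.append(b"")                    # blank line ends the part headers
        lines.append(value.encode(charset))  # raw field data
    lines.append(("--" + boundary + "--").encode("ascii"))  # closing boundary
    lines.append(b"")
    return b"\r\n".join(lines)


if __name__ == "__main__":
    body = build_multipart({"city": "Malmö"}, boundary="ExampleBoundary123")
    print(body.decode("utf-8"))
```

Because the field data travels as raw bytes between the boundaries, any character the chosen charset can represent passes through unchanged.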

The accept-charset attribute tells the browser which character sets the server can accept. Browser support for this attribute is somewhat limited, but it can still be used. Here it is set to utf-8, which means that the user can enter any character from any character set and the server can identify it.

NetPhantom parses all HTML files in the server and searches for form tags with enctype="multipart/form-data". When such a form tag is found, two hidden fields are dynamically added to the form before it is sent to the client. These fields tell NetPhantom in which character set (charset) the browser sent the field data back. They are handled by NetPhantom itself and are not included in the HeaderField hash table accessed by the CGI.
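The idea of adding hidden fields to matching forms can be sketched in Python as follows. This is not NetPhantom's actual implementation, which is not shown here: the regular expressions, the function name add_charset_fields, the test string value, and the fallback to iso-8859-1 when accept-charset is missing are all assumptions. Only the two hidden field names are taken from the NetPhantom description.

```python
import re

# Assumed patterns; a real parser would handle more HTML variations.
FORM_TAG = re.compile(r"<form\b[^>]*>", re.IGNORECASE)
MULTIPART = re.compile(r"enctype\s*=\s*[\"']?multipart/form-data", re.IGNORECASE)
ACCEPT_CHARSET = re.compile(r"accept-charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)

TEST_STRING = "åäö"  # hypothetical test value, not NetPhantom's real one


def add_charset_fields(html):
    """Append two hidden fields after every multipart form tag."""

    def inject(match):
        tag = match.group(0)
        if not MULTIPART.search(tag):
            return tag  # other forms are left untouched
        found = ACCEPT_CHARSET.search(tag)
        charset = found.group(1) if found else "iso-8859-1"  # assumed fallback
        hidden = (
            '<input type="hidden" name="definition_charset" value="%s">'
            '<input type="hidden" name="definition_charset_teststring" value="%s">'
            % (charset, TEST_STRING)
        )
        return tag + hidden

    return FORM_TAG.sub(inject, html)


if __name__ == "__main__":
    page = (
        '<form method="post" enctype="multipart/form-data" '
        'accept-charset="utf-8"><input name="city"></form>'
    )
    print(add_charset_fields(page))
```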

The first hidden field is definition_charset, which contains the same value as the accept-charset attribute in the form tag; in this case it should be utf-8. The second hidden field is definition_charset_teststring, which is used as a work-around to verify that the browser is actually using the UTF-8 character set.
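A test-string check of this kind could work roughly as in the Python sketch below. The function names, the expected test string, and the fallback to iso-8859-1 are assumptions for illustration and do not describe NetPhantom's real logic. The sketch decodes the returned test string with the claimed charset and trusts that charset only if the result matches the value that was sent out.

```python
EXPECTED_TEST_STRING = "åäö"  # hypothetical; must match what the server sent
FALLBACK_CHARSET = "iso-8859-1"  # assumed fallback, the documented default


def detect_charset(raw_fields):
    """Return the charset that the raw field bytes appear to use.

    raw_fields maps field names to the raw bytes received from the browser.
    """
    claimed = raw_fields.get("definition_charset", b"utf-8").decode("ascii", "replace")
    raw_test = raw_fields.get("definition_charset_teststring", b"")
    try:
        if raw_test.decode(claimed) == EXPECTED_TEST_STRING:
            return claimed
    except (UnicodeDecodeError, LookupError):
        pass  # invalid bytes or unknown charset name: do not trust the claim
    return FALLBACK_CHARSET


def decode_fields(raw_fields):
    """Decode user fields, leaving out the two internal hidden fields."""
    charset = detect_charset(raw_fields)
    return {
        name: value.decode(charset, "replace")
        for name, value in raw_fields.items()
        if not name.startswith("definition_charset")
    }


if __name__ == "__main__":
    honored = {
        "definition_charset": b"utf-8",
        "definition_charset_teststring": "åäö".encode("utf-8"),
        "city": "Malmö".encode("utf-8"),
    }
    ignored = {  # browser claimed utf-8 but actually sent Latin 1 bytes
        "definition_charset": b"utf-8",
        "definition_charset_teststring": "åäö".encode("latin-1"),
        "city": "Malmö".encode("latin-1"),
    }
    print(detect_charset(honored), decode_fields(honored))
    print(detect_charset(ignored), decode_fields(ignored))
```

Filtering out the two internal fields mirrors the statement that they are not passed on in the HeaderField hash table.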

Click on the links below to test the international character set support in forms: