Definition of XML Encoding
XML encoding is the process of converting Unicode characters into their binary (byte) representation. When an XML processor reads a document, it decodes the bytes according to the declared encoding, which is specified through the ‘encoding’ attribute of the XML declaration. Encoding matters because an XML document transferred between platforms is interpreted correctly only if the declared encoding matches the one actually used. According to the XML 1.0 specification, every processor must support both UTF-8 and UTF-16. The XML parser decodes the document and represents its content internally as Unicode.
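To make the definition concrete, here is a minimal sketch in Python (chosen only for illustration; the variable names are arbitrary) that turns one character into bytes under three different encodings.

```python
# The same character produces different bytes under different encodings.
text = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")          # two bytes: C3 A9
utf16_bytes = text.encode("utf-16-be")     # one 16-bit code unit: 00 E9
latin1_bytes = text.encode("iso-8859-1")   # one byte: E9

for name, data in [("UTF-8", utf8_bytes),
                   ("UTF-16BE", utf16_bytes),
                   ("ISO-8859-1", latin1_bytes)]:
    print(f"{name:10} {data.hex(' ').upper()}")

# Decoding with the matching encoding recovers the original character.
round_trip = utf8_bytes.decode("utf-8")
```

This is why a reader of the document has to know which encoding was used: the bytes alone do not say which character they stand for.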
Syntax of XML Encoding
Unicode is a universal character set that covers most of the world's languages, and it defines the encoding forms used to represent those characters as bytes. The encoding is declared in the XML declaration, which must be the first line of the document. The general syntax of the declaration is given below:
<?xml version="1.0" encoding="encoding-name"?>
UTF-8 Syntax
<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
- The declaration itself contains only ASCII characters, which UTF-8 encodes as one byte each.
UTF-16 Syntax
If a document contains Unicode values such as (0XX…), it is treated as UTF-16, which uses 16-bit code units.
<?xml version = "1.0" encoding = "UTF-16" standalone = "no" ?>
Encoding names are not case-sensitive, as they follow ISO and IANA naming conventions.
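As a quick check of the case-insensitivity rule, the sketch below uses Python's standard library (an illustration only, not part of any XML specification) to resolve differently cased names and to parse two documents that differ only in the case of the encoding name.

```python
import codecs
import xml.etree.ElementTree as ET

# Python's codec registry resolves encoding names case-insensitively.
canonical = {codecs.lookup(name).name for name in ("UTF-8", "utf-8", "Utf-8")}
print(canonical)

# Two documents that differ only in the case of the encoding name.
upper = ET.fromstring(b'<?xml version="1.0" encoding="UTF-8"?><a>ok</a>')
lower = ET.fromstring(b'<?xml version="1.0" encoding="utf-8"?><a>ok</a>')
print(upper.text, lower.text)
```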
For the Western European character set (Latin-1), which includes non-English characters, the declaration is as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
XML parsers may also support other encodings, such as US-ASCII, ISO-8859-1 to ISO-8859-10, and the Windows code pages. Some XML declarations with valid encoding names are given below:
<?xml version='1.0' encoding='US-ASCII' standalone='yes'?>
<?xml version='1.0' encoding='ISO-10646-UCS-2'?>
<?xml version='1.0' encoding='ISO-8859-1'?>
<?xml version='1.0' encoding='Shift_JIS'?>
By default, when no encoding is specified and the file has no byte order mark, the XML parser assumes UTF-8.
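The default can be observed with a short Python sketch (Python's built-in ElementTree parser is used here as one example of a parser; other parsers may report errors differently). A document without a declaration parses when its bytes are UTF-8 and is rejected when the same text is saved as ISO-8859-1.

```python
import xml.etree.ElementTree as ET

doc = "<Name>Mópezr Pchödinger</Name>"  # no XML declaration at all

# Saved as UTF-8: the parser's default assumption is correct.
root = ET.fromstring(doc.encode("utf-8"))
print(root.text)

# Saved as ISO-8859-1: the bytes are not valid UTF-8, so parsing fails.
try:
    ET.fromstring(doc.encode("iso-8859-1"))
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```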
How does Encoding Work in XML?
To avoid errors while working with XML, specify the encoding explicitly or save the file as Unicode (UTF-8 or UTF-16). A different character encoding may be declared when the document contains text in a language outside the range of the default encoding, provided the file is actually saved in that encoding. If the declared encoding does not match the actual encoding of the bytes, the parser reports an error or produces garbled text. In some cases the XML processor ignores the encoding attribute of the XML declaration: when the document is delivered over a protocol such as HTTP, the protocol's own headers can specify the character encoding and override the declaration. Some environments provide a function, XMLGetEncoding(), that returns the declared encoding of a document.
Format: XMLGetEncoding(generation, I/O entry)
- generation is the task generation: 0 for the current task, 1 for the parent, and so on.
- I/O entry is the number of the input/output file that holds the XML document.
- It returns a text value: the value of the “encoding” attribute in the XML document.
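The idea of reading the declared encoding, and of letting an HTTP header override it, can be sketched in Python as follows. This is not an implementation of XMLGetEncoding(); the helper names declared_encoding and effective_encoding are hypothetical, the regular expression assumes an ASCII-compatible encoding, and the precedence shown is a simplification.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the data. Assumption: the bytes are ASCII-compatible (not UTF-16).
DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = DECL_RE.match(data)
    return match.group(1).decode("ascii") if match else None

def effective_encoding(data: bytes, content_type=None) -> str:
    """Prefer an HTTP charset parameter, then the declaration, then UTF-8."""
    if content_type:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()  # lower-cased, or None
        if charset:
            return charset
    return declared_encoding(data) or "UTF-8"

sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(declared_encoding(sample))
print(effective_encoding(sample, "text/xml; charset=UTF-8"))
```

Without a Content-Type value the helper falls back to the declaration, and without either it falls back to UTF-8, mirroring the default described earlier.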
Types of Encoding in XML with Example
The XML specification requires every parser to support two encodings:
1. UTF-8
UTF-8 is the default for XML: the detection rules for document types such as XML and DTD state that UTF-8 is used when no character encoding is specified. It is also commonly used with Java, SQL, and XQuery because it is compact for ASCII-heavy text. UTF-8 is a variable-length encoding in which each character takes one to four bytes, carrying 7, 11, 16, or 21 significant bits respectively. Numeric character references in XML always refer to Unicode code points, whatever encoding the file uses. The byte order mark (BOM) for UTF-8 is EF BB BF. UTF-8 is not a universal solution, however: for Chinese and similar scripts it produces larger files than UTF-16, so UTF-16 can be the better choice there.
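The variable-length behaviour and the size trade-off can be seen in a small Python sketch; the sample characters are arbitrary choices, one for each UTF-8 length.

```python
import codecs

samples = {
    "A": "U+0041, ASCII letter",
    "ö": "U+00F6, Latin-1 letter",
    "中": "U+4E2D, CJK ideograph",
    "😀": "U+1F600, outside the Basic Multilingual Plane",
}

# Number of bytes each character needs in UTF-8 and in UTF-16.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch, note in samples.items():
    print(f"{note}: UTF-8 {utf8_lengths[ch]} bytes, "
          f"UTF-16 {utf16_lengths[ch]} bytes")

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom_hex = codecs.BOM_UTF8.hex(" ").upper()
print(bom_hex)
```

The CJK ideograph takes three bytes in UTF-8 but two in UTF-16, which is the reason UTF-16 can give smaller files for Chinese text.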
Example
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<?xml-stylesheet href="clock.css" type="text/css"?>
<Clocks timezone="GMT">
  <timehour>11</timehour>
  <timeminute>50</timeminute>
  <timesecond>40</timesecond>
  <timemeridian>p.m.</timemeridian>
</Clocks>
2. UTF-16
UTF-16 uses 16-bit code units. Most characters take two bytes, while characters outside the Basic Multilingual Plane take four bytes (a surrogate pair), so UTF-16 is not a fixed-width encoding; the significant bits are 16 and 20 respectively. It can be smaller than UTF-8 for some scripts, but it is incompatible with ASCII. UTF-16 comes in two byte orders, little-endian (LE) and big-endian (BE), and the byte order is indicated by the byte order mark. It can cause problems in older languages such as C, where the zero bytes that UTF-16 produces for ASCII characters are treated as string terminators. A parser treats a document as UTF-16 only when the file is actually encoded that way, which it detects from the byte order mark. For national data items (COBOL) parsed from XML documents, UTF-16 is the suggested choice. UTF-16 is used mostly in Java and Windows.
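The byte order, the byte order mark, the zero bytes, and the four-byte case can each be checked with a few lines of Python (an illustrative sketch; the variable names are arbitrary).

```python
import codecs

text = "A"
le = text.encode("utf-16-le")     # little-endian: 41 00
be = text.encode("utf-16-be")     # big-endian:    00 41
with_bom = text.encode("utf-16")  # native byte order, prefixed with a BOM
print(le.hex(" "), "|", be.hex(" "), "|", with_bom.hex(" "))

# The plain "utf-16" codec writes a byte order mark first.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ASCII characters produce zero bytes, which C-style string functions
# would treat as the end of the string.
has_zero_byte = 0 in le

# A character outside the Basic Multilingual Plane needs a surrogate pair.
emoji_units = "😀".encode("utf-16-be")  # D8 3D DE 00
print(starts_with_bom, has_zero_byte, emoji_units.hex(" ").upper())
```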
Example
<?xml version="1.0" encoding="UTF-16"?>
<college>
  <Professor>
    <fullname>Evangeline MAC</fullname>
    <Dept>Science-1</Dept>
  </Professor>
  <!--
  <Professor>
    <fullname>Antony Jay</fullname>
    <Dept>Mathematics</Dept>
  </Professor>
  -->
</college>
When the file is read, the parser decodes its bytes as UTF-16. Note that the file itself must be saved as UTF-16 in the text editor; the declaration alone does not change how the bytes are stored.
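Saving as UTF-16 can be simulated in Python by encoding the document text before parsing it. The sketch below is illustrative only: it uses a shortened copy of the example document and Python's ElementTree as the parser.

```python
import codecs
import xml.etree.ElementTree as ET

xml_text = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<college><Professor><fullname>Evangeline MAC</fullname>"
    "<Dept>Science-1</Dept></Professor></college>"
)

# "Saving as UTF-16": the resulting bytes begin with a byte order mark.
utf16_data = xml_text.encode("utf-16")
has_bom = utf16_data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The declaration and the actual bytes agree, so parsing succeeds.
root = ET.fromstring(utf16_data)
fullname = root.findtext("Professor/fullname")
print(has_bom, fullname)
```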
Let's take another example, this time using ISO-8859-1:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Name>Mópezr Pchödinger</Name>
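With this declaration the parser reads each byte as one ISO-8859-1 character. If the file is actually saved in another encoding, such as UTF-8, the international characters appear as unrelated symbols.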
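The effect of a mismatch can be reproduced with a short Python sketch, assuming ElementTree as the parser: the same declaration is parsed once with matching ISO-8859-1 bytes and once with UTF-8 bytes.

```python
import xml.etree.ElementTree as ET

name = "Mópezr Pchödinger"
decl = '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
doc = decl + "<Name>" + name + "</Name>"

# Bytes match the declaration: the text comes back intact.
matching = ET.fromstring(doc.encode("iso-8859-1")).text

# Bytes are really UTF-8, but the declaration still says ISO-8859-1,
# so each UTF-8 byte is read as a separate Latin-1 character.
mismatched = ET.fromstring(doc.encode("utf-8")).text

print(matching)
print(mismatched)
```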
Now let's look at the same document declared with ASCII encoding:
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<Name>Mópezr Pchödinger</Name>
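ASCII cannot represent “ó” or “ö”, so these characters are not valid in a document declared as ASCII unless they are written as numeric character references. In UTF-8, “ó” is encoded as the two bytes C3 B3 and “ö” as C3 B6. ASCII is a subset of UTF-8: every ASCII file is also a valid UTF-8 file.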
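These points can be verified with a brief Python sketch (illustrative only): plain ASCII encoding fails for the accented characters, UTF-8 gives the two-byte sequences, and numeric character references make the text safe for an ASCII document.

```python
import xml.etree.ElementTree as ET

name = "Mópezr Pchödinger"

# ASCII has no code for "ó" or "ö", so plain encoding fails.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The UTF-8 bytes of the two accented characters.
utf8_hex = {ch: ch.encode("utf-8").hex(" ").upper() for ch in "óö"}

# Numeric character references keep the document pure ASCII.
escaped = name.encode("ascii", errors="xmlcharrefreplace")
ascii_doc = (b'<?xml version="1.0" encoding="US-ASCII"?><Name>'
             + escaped + b"</Name>")
parsed = ET.fromstring(ascii_doc).text

print(ascii_ok, utf8_hex, escaped, parsed)
```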
Here is an example of setting the encoding in XML with C#. It adds an XML declaration that specifies UTF-16.
using System;
using System.IO;
using System.Xml;

public class Program {
    public static void Main() {
        // Load a small document from an in-memory string.
        XmlDocument d = new XmlDocument();
        string xmlSt = "<tv><tvname>Samsung</tvname></tv>";
        d.Load(new StringReader(xmlSt));

        // Create an XML declaration that names UTF-16 as the encoding.
        XmlDeclaration dec = d.CreateXmlDeclaration("1.0", null, null);
        dec.Encoding = "UTF-16";
        dec.Standalone = "yes";

        // Place the declaration before the root element and print the result.
        XmlElement root = d.DocumentElement;
        d.InsertBefore(dec, root);
        Console.WriteLine(d.OuterXml);
    }
}
Conclusion
That covers XML encoding. We have looked at Unicode and the encodings used in XML, and at setting the encoding of an XML document in C#. Because no single legacy character set covers every language, XML and programming languages rely on character encoding schemes to represent text. It is generally best to use UTF-8 everywhere, since it avoids encoding conversions.