XML Documents

An XML document is made up of the following parts.

  • An optional prolog.
  • A document element, usually containing nested elements.
  • Optional comments or processing instructions.

Note: we will review an XML document in the next presentation.

The Prolog

The prolog of an XML document can contain the following items.

  • An XML declaration
  • Processing instructions
  • Comments
  • A Document Type Declaration

The XML Declaration

The XML declaration, if it appears at all, must appear on the very first line of the document with no preceding white space. It looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

This declares that the document is an XML document. The version attribute is required, but the encoding and standalone attributes are not. If the document depends on external markup declarations (for example, in an external DTD) that set defaults for attributes or declare entities, then standalone must be set to no.
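The placement and version rules can be checked with a parser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree; the helper name parses and the beatles test documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Declaration at the very start of the document.
good = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><beatles/>'
# A single space before the declaration.
bad = b' <?xml version="1.0" encoding="UTF-8" standalone="yes"?><beatles/>'
# No declaration at all (it is optional).
bare = b'<beatles/>'
# Declaration without the required version attribute.
missing_version = b'<?xml encoding="UTF-8"?><beatles/>'

for label, doc in [("good", good), ("bad", bad), ("bare", bare),
                   ("missing_version", missing_version)]:
    print(label, parses(doc))
```

Only the first and third documents should be accepted; the other two break the rules described above.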

Processing Instructions

Processing instructions are used to pass parameters to an application. These parameters tell the application how to process the XML document. For example, the following processing instruction tells the application that it should transform the XML document using the XSL stylesheet beatles.xsl.

<?xml-stylesheet href="beatles.xsl" type="text/xsl"?>

As shown above, processing instructions begin with <? and end with ?>.
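To see how an application receives a processing instruction, here is a small sketch using Python's standard-library xml.dom.minidom. The empty beatles element is an assumed stand-in for the rest of the document.

```python
from xml.dom import minidom

# The stylesheet instruction from the text, followed by an assumed
# minimal document element.
source = '<?xml-stylesheet href="beatles.xsl" type="text/xsl"?><beatles/>'
doc = minidom.parseString(source)

# The processing instruction is the first node in the document.
pi = doc.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE

print(is_pi)
print(pi.target)  # the name right after "<?"
print(pi.data)    # everything else, passed through as one string
```

Note that the parser hands over href and type as a single string; interpreting them as parameters is left to the application.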


Comments

Comments can appear throughout an XML document. As in HTML, they begin with <!-- and end with -->.

<!--This is a comment-->

A Document Type Declaration

The Document Type Declaration (or DOCTYPE Declaration) has three roles.

  1. It specifies the name of the document element.
  2. It may point to an external Document Type Definition (DTD).
  3. It may contain an internal DTD.

The DOCTYPE Declaration shown below simply states that the document element of the XML document is beatles.

<!DOCTYPE beatles>

If a DOCTYPE Declaration points to an external DTD, it must either specify that the DTD is on the same system as the XML document itself or that it is in some public location. To do so, it uses the keywords SYSTEM and PUBLIC. It then points to the location of the DTD using a relative Uniform Resource Identifier (URI) or an absolute URI. Here are a couple of examples.


<!--DTD is on the same system as the XML document-->
<!DOCTYPE beatles SYSTEM "dtds/beatles.dtd">


<!--DTD is publicly available-->
<!DOCTYPE beatles PUBLIC "-//Webucator//DTD Beatles 1.0//EN"

As shown in the second declaration above, public identifiers are divided into three parts:

  1. An organization (e.g., Webucator)
  2. A name for the DTD (e.g., Beatles 1.0)
  3. A language (e.g., EN for English)
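As a rough sketch of how this information reaches an application, the Python snippet below reads the DOCTYPE Declaration with the standard-library xml.dom.minidom and then splits the public identifier on its "//" separators. The variable names are assumptions; the leading "-" field conventionally marks the organization as unregistered.

```python
from xml.dom import minidom

# DOCTYPE with a SYSTEM identifier, as in the first example above.
doc = minidom.parseString(
    '<!DOCTYPE beatles SYSTEM "dtds/beatles.dtd"><beatles/>'
)
doctype = doc.doctype
print(doctype.name)      # declared document element
print(doctype.systemId)  # location given after SYSTEM

# The public identifier from the second example, split into fields.
public_id = "-//Webucator//DTD Beatles 1.0//EN"
prefix, organization, description, language = public_id.split("//")
print(organization, "|", description, "|", language)
```

The parser used here only records the identifiers; it does not fetch the DTD.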


Elements

Every XML document must have at least one element, called the document element. The document element usually contains other elements, which contain other elements, and so on. Elements are denoted with tags. Let's look again at Paul.xml.

Code Sample:

<?xml version="1.0"?>

The document element is person. It contains three elements: name, job and gender. Further, the name element contains two elements of its own: firstname and lastname. As you can see, XML elements are denoted with tags, just as in HTML. Elements that are nested within another element are said to be children of that element.
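The parent and child relationships just described can be inspected with a parser. In the sketch below, which uses Python's standard-library xml.etree.ElementTree, the element names come from the description above, while the text values inside the elements are assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Element names follow the description above; the text values are assumed.
PAUL_XML = """<?xml version="1.0"?>
<person>
  <name>
    <firstname>Paul</firstname>
    <lastname>McCartney</lastname>
  </name>
  <job>Singer</job>
  <gender>Male</gender>
</person>"""

person = ET.fromstring(PAUL_XML)

# Iterating over an element yields its direct children in document order.
children = [child.tag for child in person]
name_children = [child.tag for child in person.find("name")]

print(person.tag)
print(children)
print(name_children)
```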

Empty Elements

Not all elements contain other elements or text. For example, in XHTML, there is an img element that is used to display an image. It does not contain any text or elements within it, so it is called an empty element. In XML, empty elements must be closed, but they do not require a separate close tag. Instead, they can be closed with a forward slash at the end of the open tag as shown below.

<img src="images/paul.jpg"/>

The above code is identical in function to the code below.

<img src="images/paul.jpg"></img>
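One way to confirm that the two forms are equivalent is to parse both and compare the results. This is a minimal sketch with Python's standard-library xml.etree.ElementTree; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<img src="images/paul.jpg"/>')
long_form = ET.fromstring('<img src="images/paul.jpg"></img>')

same = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and short_form.text == long_form.text  # neither has any text content
    and len(short_form) == len(long_form) == 0  # neither has children
)
print(same)
```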


Attributes

XML elements can be further defined with attributes, which appear inside the element's open tag as shown below.


<name title="Sir">
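Once parsed, attributes are typically exposed as name and value pairs on the element. A short sketch with Python's standard-library xml.etree.ElementTree follows; the close tag is added here only so that the fragment is a complete document.

```python
import xml.etree.ElementTree as ET

# The open tag from the text, closed so the fragment parses on its own.
name = ET.fromstring('<name title="Sir"></name>')

title = name.get("title")        # value of one attribute
all_attributes = dict(name.attrib)  # every attribute as a dictionary

print(title)
print(all_attributes)
```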


CDATA

Sometimes it is necessary to include sections in an XML document that should not be parsed by the XML parser. These sections might contain content that would confuse the parser because it looks like XML but is not meant to be interpreted as XML. Such content can be nested in CDATA sections, which begin with <![CDATA[ and end with ]]>. The syntax for CDATA sections is shown below.


	<![CDATA[
	This section will not get parsed
	by the XML parser.
	]]>
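The effect of a CDATA section can be seen by parsing one. In this sketch, which uses Python's standard-library xml.etree.ElementTree, the example element and its content are assumptions chosen to look like markup.

```python
import xml.etree.ElementTree as ET

source = "<example><![CDATA[<b>bold?</b> & other markup-like text]]></example>"
example = ET.fromstring(source)

# The content arrives as plain text; no <b> child element is created.
print(example.text)
print(len(example))  # number of child elements
```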

White Space

In XML data, there are only four white space characters.

  1. Tab
  2. Line-feed
  3. Carriage-return
  4. Single space

There are several important rules to remember with regard to white space in XML.

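  1. White space within the content of an element is significant; that is, the XML processor will pass these characters to the application or user agent.
  2. White space in attribute values is normalized; that is, tabs and line breaks are converted to spaces. For attributes that a DTD declares with a type other than CDATA, neighboring spaces are also condensed to a single space.
  3. White space between elements is usually not meaningful to the application, but a non-validating XML processor still passes it along; it is up to the application to ignore it.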
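How white space is reported depends on the parser. The sketch below shows what one non-validating parser, Python's standard-library xml.etree.ElementTree, hands to the application; the element and attribute names are assumptions.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a real tab and a real line feed.
source = '<a note="one\ttwo\nthree">  padded text  <b/>\n</a>'
a = ET.fromstring(source)
b = a.find("b")

note = a.get("note")  # tab and line feed arrive as single spaces
content = a.text      # leading and trailing spaces are kept
after_b = b.tail      # white space after <b/> is still reported

print(repr(note))
print(repr(content))
print(repr(after_b))
```

With this parser, discarding the white space that follows the b element is the application's job.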

xml:space Attribute

The xml:space attribute is a special attribute in XML. It can only take one of two values: default and preserve. This attribute instructs the application how to treat white space within the content of the element. Note that the application is not required to respect this instruction.
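Because honoring xml:space is left to the application, the parser simply reports it like any other attribute. The sketch below uses Python's standard-library xml.etree.ElementTree, which exposes the attribute under the XML namespace URI. The display_text function, the poem element, and the choice to collapse white space in default mode are all assumptions about what an application might do.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def display_text(element: ET.Element) -> str:
    """Hypothetical application policy: keep text as-is when
    xml:space is preserve, otherwise collapse runs of white space."""
    text = element.text or ""
    if element.get(XML_SPACE, "default") == "preserve":
        return text
    return " ".join(text.split())


kept = ET.fromstring('<poem xml:space="preserve">  roses\n    are red  </poem>')
collapsed = ET.fromstring('<poem>  roses\n    are red  </poem>')

print(repr(display_text(kept)))
print(repr(display_text(collapsed)))
```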

XML Syntax Rules

XML has relatively straightforward, but very strict, syntax rules. A document that follows these syntax rules is said to be well-formed.

  1. There must be one and only one document element.
  2. Every open tag must be closed.
  3. If an element is empty, it still must be closed.
    • Poorly-formed: <tag>
    • Well-formed: <tag></tag>
    • Also well-formed: <tag />
  4. Elements must be properly nested.
    • Poorly-formed: <a><b></a></b>
    • Well-formed: <a><b></b></a>
  5. Tag and attribute names are case sensitive.
  6. Attribute values must be enclosed in single or double quotes.
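These rules can be tried out by handing small documents to a parser and seeing which ones it rejects. The following is a minimal sketch with Python's standard-library xml.etree.ElementTree; the helper name is_well_formed and the sample strings beyond those listed above are assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


samples = [
    "<a/><b/>",              # rule 1: two document elements
    "<tag>",                 # rules 2 and 3: never closed
    "<tag></tag>",
    "<tag />",
    "<a><b></a></b>",        # rule 4: improperly nested
    "<a><b></b></a>",
    "<Tag></tag>",           # rule 5: names differ in case
    "<a x=1/>",              # rule 6: unquoted attribute value
    "<a x='1' y=\"2\"/>",    # single or double quotes both work
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print(f"{sample!r}: {'well-formed' if ok else 'not well-formed'}")
```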

Special Characters

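There are five characters that have special meaning in XML markup. The < and & characters must never appear literally in element content or attribute values; the other three need replacing only in certain contexts, such as a quote character inside an attribute value delimited by the same kind of quote. All five can be replaced with predefined entity references as shown in the table below.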

Special Characters
Character    Entity Reference
<            &lt;
>            &gt;
&            &amp;
"            &quot;
'            &apos;
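In practice the replacement is usually done by a library rather than by hand. This sketch uses escape from Python's standard-library xml.sax.saxutils, which handles &, < and > by default; the extra mapping for the two quote characters and the sample sentence are assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'AT&T says 1 < 2 and "3" > \'2\''

# escape() covers &, < and > itself; quotes are added via the mapping.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# A parser turns the entity references back into the original characters.
round_trip = ET.fromstring(f"<t>{escaped}</t>").text
print(round_trip == raw)
```

Inserting the raw string into the element without escaping it first would make the document fail to parse, because of the bare & and <.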