TREX - Tree Regular Expressions for XML
Tutorial

Author:
    James Clark (Thai Open Source Software Center) <jjc@thaiopensource.com>
Date:
    2001-01-20

Copyright © 2001 Thai Open Source Software Center Ltd

Abstract

A TREX pattern specifies a pattern for the structure and content of an XML document. A TREX pattern thus identifies a class of XML documents consisting of those documents that match the pattern. A TREX pattern is itself an XML document.

Table of contents

1 Getting started
2 Choice
3 Attributes
4 Named patterns
5 Interleaving
6 Strings
7 Modularity
  7.1 Including patterns
  7.2 Merging grammars
8 Namespaces
  8.1 Using the ns attribute
  8.2 Qualified names
  8.3 Pattern element namespace
9 Name classes
10 Datatyping
  10.1 Named datatypes
  10.2 Anonymous datatypes
11 Advanced features
  11.1 Nested grammars
  11.2 Concur
12 Non-restrictions
13 Non-features

1 Getting started

Consider a simple XML representation of an email address book:

<addressBook>
  <card>
    <name>John Smith</name>
    <email>js@example.com</email>
  </card>
  <card>
    <name>Fred Bloggs</name>
    <email>fb@example.net</email>
  </card>
</addressBook>

The DTD would be as follows:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>

A TREX pattern for this could be written as follows:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

If the addressBook is required to be non-empty, then we can use oneOrMore instead of zeroOrMore:

<element name="addressBook">
  <oneOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </oneOrMore>
</element>

Now let's change it to allow each card to have an optional note element.

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
      <optional>
	<element name="note">
	  <anyString/>
	</element>
      </optional>
    </element>
  </zeroOrMore>
</element>

Note that the anyString pattern matches any string, including the empty string. Note also that whitespace separating tags is ignored when matching against a pattern.
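
As a small illustration of both points, the following document (an assumed example) should also match the last pattern above, even though the note element is empty and there is no whitespace at all between the tags:

<addressBook><card><name>John Smith</name><email>js@example.com</email><note/></card></addressBook>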

2 Choice

Now suppose we want to allow the name to be broken down into a givenName and a familyName, allowing an addressBook like this:

<addressBook>
  <card>
    <givenName>John</givenName>
    <familyName>Smith</familyName>
    <email>js@example.com</email>
  </card>
  <card>
    <name>Fred Bloggs</name>
    <email>fb@example.net</email>
  </card>
</addressBook>

We can use the following pattern:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <choice>
        <element name="name">
          <anyString/>
        </element>
        <group>
          <element name="givenName">
            <anyString/>
          </element>
          <element name="familyName">
            <anyString/>
          </element>
        </group>
      </choice>
      <element name="email">
        <anyString/>
      </element>
      <optional>
	<element name="note">
	  <anyString/>
	</element>
      </optional>
    </element>
  </zeroOrMore>
</element>

This corresponds to the following DTD:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card ((name | (givenName, familyName)), email, note?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT givenName (#PCDATA)>
<!ELEMENT familyName (#PCDATA)>
<!ELEMENT note (#PCDATA)>
]>
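
To illustrate, a choice pattern is matched by exactly one of its alternatives, so neither of the following cards (hypothetical instances) would match the card pattern in this section. The first mixes both alternatives, and the second supplies only half of the group:

<card>
  <name>John Smith</name>
  <givenName>John</givenName>
  <familyName>Smith</familyName>
  <email>js@example.com</email>
</card>

<card>
  <givenName>John</givenName>
  <email>js@example.com</email>
</card>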

3 Attributes

Suppose we want the card element to have attributes rather than child elements. The DTD might look like this:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card EMPTY>
<!ATTLIST card
  name CDATA #REQUIRED
  email CDATA #REQUIRED>
]>

Just change each element pattern to an attribute pattern:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <attribute name="name">
        <anyString/>
      </attribute>
      <attribute name="email">
        <anyString/>
      </attribute>
    </element>
  </zeroOrMore>
</element>

In XML, the order of attributes is traditionally not significant. TREX follows this tradition. The above pattern would match both

<card name="John Smith" email="js@example.com"/>

and

<card email="js@example.com" name="John Smith"/>

In contrast, the order of elements is significant. The pattern

<element name="card">
  <element name="name">
    <anyString/>
  </element>
  <element name="email">
    <anyString/>
  </element>
</element>

would not match:

<card><email>js@example.com</email><name>John Smith</name></card>

Note that an attribute element by itself indicates a required attribute, just as an element element by itself indicates a required element. To specify an optional attribute, use optional just as with element:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <attribute name="name">
        <anyString/>
      </attribute>
      <attribute name="email">
        <anyString/>
      </attribute>
      <optional>
        <attribute name="note">
          <anyString/>
        </attribute>
      </optional>
    </element>
  </zeroOrMore>
</element>

The group and choice patterns can be applied to attribute elements in the same way they are applied to element patterns. For example, if we wanted to allow either a name attribute or both a givenName and a familyName attribute, we can specify this in the same way that we would if we were using elements:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <choice>
        <attribute name="name">
          <anyString/>
        </attribute>
        <group>
          <attribute name="givenName">
            <anyString/>
          </attribute>
          <attribute name="familyName">
            <anyString/>
          </attribute>
        </group>
      </choice>
      <attribute name="email">
        <anyString/>
      </attribute>
    </element>
  </zeroOrMore>
</element>
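
As a sketch of what this pattern accepts, either of the following cards (illustrative values) should match:

<card name="John Smith" email="js@example.com"/>
<card givenName="John" familyName="Smith" email="js@example.com"/>

whereas a card carrying only one of givenName and familyName would not, because the group requires both:

<card givenName="John" email="js@example.com"/>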

There are no restrictions on how element elements and attribute elements can be combined. For example, the following pattern would allow a choice of elements and attributes independently for both the name and the email part of a card:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <choice>
	<element name="name">
	  <anyString/>
	</element>
	<attribute name="name">
	  <anyString/>
	</attribute>
      </choice>
      <choice>
	<element name="email">
	  <anyString/>
	</element>
	<attribute name="email">
	  <anyString/>
	</attribute>
      </choice>
    </element>
  </zeroOrMore>
</element>

As usual, the relative order of elements is significant, but the relative order of attributes is not. Thus the above would match any of:

<card name="John Smith" email="js@example.com"/>
<card email="js@example.com" name="John Smith"/>
<card email="js@example.com"><name>John Smith</name></card>
<card name="John Smith"><email>js@example.com</email></card>
<card><name>John Smith</name><email>js@example.com</email></card>

However, it would not match

<card><email>js@example.com</email><name>John Smith</name></card>

because the pattern for card requires any email child element to follow any name child element.

There is one difference between attribute and element patterns: <anyString/> is the default for the content of an attribute pattern, whereas an element pattern is not allowed to be empty. For example,

<attribute name="email"/>

is short for

<attribute name="email">
  <anyString/>
</attribute>

It might seem natural that

<element name="x"/>

matched an x element with no attributes and no content. However, this would make the meaning of empty content inconsistent between the element pattern and the attribute pattern, so TREX does not allow the element pattern to be empty. A pattern that matches an element with no attributes and no children must use <empty/> explicitly:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
      <optional>
        <element name="prefersHTML">
          <empty/>
        </element>
      </optional>
    </element>
  </zeroOrMore>
</element>
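
For example, the following card (an illustrative instance) should match the card pattern above. The prefersHTML element has neither attributes nor children; its presence alone carries the information:

<card>
  <name>John Smith</name>
  <email>js@example.com</email>
  <prefersHTML/>
</card>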

4 Named patterns

For a non-trivial TREX pattern, it is often convenient to be able to give names to parts of the pattern. Instead of

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
	<anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

we can write

<grammar>

  <start>
    <element name="addressBook">
      <zeroOrMore>
	<element name="card">
	  <ref name="cardContent"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="cardContent">
    <element name="name">
      <anyString/>
    </element>
    <element name="email">
      <anyString/>
    </element>
  </define>

</grammar>

A grammar element has a single start child element, and zero or more define child elements. The start and define elements contain patterns. These patterns can contain ref elements that refer to patterns defined by any of the define elements in that grammar element. A grammar pattern is matched by matching the pattern contained in the start element.

We can use the grammar element to write patterns in a style similar to DTDs:

<grammar>

  <start>
    <ref name="AddressBook"/>
  </start>

  <define name="AddressBook">
    <element name="addressBook">
      <zeroOrMore>
        <ref name="Card"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="Card">
    <element name="card">
      <ref name="Name"/>
      <ref name="Email"/>
    </element>
  </define>

  <define name="Name">
    <element name="name">
      <anyString/>
    </element>
  </define>

  <define name="Email">
    <element name="email">
      <anyString/>
    </element>
  </define>

</grammar>
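
One possible payoff of this style can be sketched with an assumed Note definition, which is not part of the grammar above. Adding an optional note to each card would mean changing the Card definition and adding one new define, leaving the other definitions untouched:

  <define name="Card">
    <element name="card">
      <ref name="Name"/>
      <ref name="Email"/>
      <optional>
        <ref name="Note"/>
      </optional>
    </element>
  </define>

  <define name="Note">
    <element name="note">
      <anyString/>
    </element>
  </define>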

Recursive references are allowed. For example:

<define name="inline">
  <zeroOrMore>
    <choice>
      <anyString/>
      <element name="bold">
        <ref name="inline"/>
      </element>
      <element name="italic">
        <ref name="inline"/>
      </element>
      <element name="span">
        <optional>
          <attribute name="style"/>
        </optional>
        <ref name="inline"/>
      </element>
    </choice>
  </zeroOrMore>
</define>
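
As an illustration, suppose (hypothetically) that a p element is defined to contain a reference to inline. Then content such as the following should match, with bold, italic and span nested to any depth:

<p>Some <bold>very <italic>deeply</italic> nested</bold> text with a <span style="color: red">styled</span> part.</p>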

However, recursive references must be within an element. Thus, the following is not allowed:

<define name="inline">
  <choice>
    <anyString/>
    <element name="bold">
      <ref name="inline"/>
    </element>
    <element name="italic">
      <ref name="inline"/>
    </element>
    <element name="span">
      <optional>
	<attribute name="style"/>
      </optional>
      <ref name="inline"/>
    </element>
  </choice>
  <optional>
    <ref name="inline"/>
  </optional>
</define>

A start element may also have a name attribute. This is a shorthand for a define with that name together with a start element referencing that definition. For example

<grammar>
  <start name="inline">
    <zeroOrMore>
      <choice>
	<anyString/>
	<element name="bold">
	  <ref name="inline"/>
	</element>
      </choice>
    </zeroOrMore>
  </start>
</grammar>

is short for

<grammar>
  <start>
    <ref name="inline"/>
  </start>
  <define name="inline">
    <zeroOrMore>
      <choice>
	<anyString/>
	<element name="bold">
	  <ref name="inline"/>
	</element>
      </choice>
    </zeroOrMore>
  </define>
</grammar>

If a definition has a combine attribute (described in section 7.2), but there is no earlier definition with the same name, then the combine attribute is ignored.

5 Interleaving

The interleave pattern allows child elements to occur in any order. For example, the following would allow the card element to contain the name and email elements in any order:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <interleave>
	<element name="name">
	  <anyString/>
	</element>
	<element name="email">
	  <anyString/>
	</element>
      </interleave>
    </element>
  </zeroOrMore>
</element>
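
For instance, both of the following cards (illustrative instances) should match the card pattern above:

<card><name>John Smith</name><email>js@example.com</email></card>
<card><email>js@example.com</email><name>John Smith</name></card>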

The pattern is called interleave because of how it works with patterns that match more than one element. Suppose we want to write a pattern for the HTML head element, which requires exactly one title element, at most one base element, and zero or more style, script, link and meta elements. Suppose also that we are writing a grammar pattern that has one definition for each element. Then we could define the pattern for head as follows:

<define name="head">
  <element name="head">
    <interleave>
      <ref name="title"/>
      <optional>
        <ref name="base"/>
      </optional>
      <zeroOrMore>
        <ref name="style"/>
      </zeroOrMore>
      <zeroOrMore>
        <ref name="script"/>
      </zeroOrMore>
      <zeroOrMore>
        <ref name="link"/>
      </zeroOrMore>
      <zeroOrMore>
        <ref name="meta"/>
      </zeroOrMore>
    </interleave>
  </element>
</define>

Suppose we had a head element that contained a meta element, followed by a title element, followed by a meta element. This would match the pattern because it is an interleaving of a sequence of two meta elements, which match the child pattern

      <zeroOrMore>
        <ref name="meta"/>
      </zeroOrMore>

and a sequence of one title element, which matches the child pattern

      <ref name="title"/>

The semantics of the interleave pattern are that a sequence of elements matches an interleave pattern if it is an interleaving of sequences that match the child patterns of the interleave pattern. Note that this is different from the & connector in SGML: A* & B matches the sequence of elements A A B or the sequence of elements B A A but not the sequence of elements A B A.
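
To make the contrast concrete, the SGML-style meaning of A* & B could be approximated in TREX without interleave, as a choice between the two possible orders of the groups. This is only a sketch; A and B are placeholder names, assumed here to be empty elements:

<choice>
  <group>
    <zeroOrMore>
      <element name="A">
        <empty/>
      </element>
    </zeroOrMore>
    <element name="B">
      <empty/>
    </element>
  </group>
  <group>
    <element name="B">
      <empty/>
    </element>
    <zeroOrMore>
      <element name="A">
        <empty/>
      </element>
    </zeroOrMore>
  </group>
</choice>

This matches A A B and B A A but not A B A, whereas an interleave of the same two child patterns would accept all three sequences.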

One special case of interleave is very common: interleaving <anyString/> with a pattern p represents a pattern that matches what p matches but also allows characters to occur as children. The mixed element is a shorthand for this.

<mixed> p </mixed>

is short for

<interleave> <anyString/> p </interleave>
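
For example, a note that allows text with bold elements scattered through it could be sketched as follows (the note and bold names are reused from earlier examples purely for illustration):

<element name="note">
  <mixed>
    <zeroOrMore>
      <element name="bold">
        <anyString/>
      </element>
    </zeroOrMore>
  </mixed>
</element>

This should match content such as

<note>Call <bold>after 6pm</bold> only.</note>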

6 Strings

Whereas the anyString pattern matches any string, the string pattern matches a specific string. This is useful mainly for specifying the value of attributes. For example,

<element name="card">
  <attribute name="name"/>
  <attribute name="email"/>
  <attribute name="prefersHTML">
    <choice>
      <string>true</string>
      <string>false</string>
    </choice>
  </attribute>
</element>

This corresponds to the DTD

<!DOCTYPE card [
<!ELEMENT card EMPTY>
<!ATTLIST card
  name CDATA #REQUIRED
  email CDATA #REQUIRED
  prefersHTML (true|false) #REQUIRED>
]>

Normally, the string pattern will normalize the white-space in both the pattern string and the string being matched by stripping leading and trailing white-space characters, and collapsing sequences of one or more white-space characters to a single space character. This corresponds to the behaviour of an XML parser for an attribute that is declared as other than CDATA. Thus the above pattern will match any of

<card name="John Smith" email="js@example.com" prefersHTML="true"/>
<card name="John Smith" email="js@example.com" prefersHTML=" true "/>

To prevent the string pattern from normalizing white-space, specify a whiteSpace="preserve" attribute on the string pattern. For example, the pattern

<element name="card">
  <attribute name="name"/>
  <attribute name="email"/>
  <attribute name="prefersHTML">
    <choice>
      <string whiteSpace="preserve">true</string>
      <string whiteSpace="preserve">false</string>
    </choice>
  </attribute>
</element>

will not match

<card name="John Smith" email="js@example.com" prefersHTML=" true "/>

The string pattern is not restricted to attribute values. For example, the following is allowed:

<element name="card">
  <element name="name">
    <anyString/>
  </element>
  <element name="email">
    <anyString/>
  </element>
  <element name="prefersHTML">
    <choice>
      <string>true</string>
      <string>false</string>
    </choice>
  </element>
</element>
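
An illustrative instance that should match this pattern:

<card>
  <name>John Smith</name>
  <email>js@example.com</email>
  <prefersHTML>true</prefersHTML>
</card>

Assuming the white-space normalization described above applies to element content just as it does to attribute values, <prefersHTML> true </prefersHTML> would match as well.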

If the children of an element or an attribute match a string pattern, then the complete content of the element or attribute must match that string pattern. It is not permitted to have a pattern which allows part of the content to match a string pattern, and another part to match another pattern. For example, the following pattern is not allowed:

<element name="bad">
  <choice>
    <string>true</string>
    <string>false</string>
  </choice>
  <element name="note">
    <anyString/>
  </element>
</element>

However, this would be fine:

<element name="ok">
  <choice>
    <string>true</string>
    <string>false</string>
  </choice>
  <attribute name="note">
    <anyString/>
  </attribute>
</element>
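
For instance, an element such as the following (an illustrative instance) should match the ok pattern, because the string pattern accounts for the whole of the element's content and the note is carried by an attribute:

<ok note="checked by hand">true</ok>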

Note that this restriction does not apply to the anyString pattern.

7 Modularity

The include element can be used to allow a pattern to be divided amongst multiple files. The include element has a required href attribute that specifies the URL of a file to be included in place of the include element.

7.1 Including patterns

The include element can be used as a pattern. In this case, it will match if the pattern contained in the specified URL matches. Suppose, for example, that you have a TREX pattern that matches HTML inline content, stored in inline.trex:

<grammar>
  <start name="inline">
    <zeroOrMore>
      <choice>
        <anyString/>
        <element name="code">
          <ref name="inline"/>
        </element>
        <element name="em">
          <ref name="inline"/>
        </element>
        <!-- etc -->
      </choice>
    </zeroOrMore>
  </start>
</grammar>

Then we could allow the note element to contain inline HTML markup by using include as follows:

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
      <optional>
	<element name="note">
	  <include href="inline.trex"/>
	</element>
      </optional>
    </element>
  </zeroOrMore>
</element>

For another example, suppose you have two TREX patterns stored in files pattern1.trex and pattern2.trex. Then the following is a pattern that will match anything matched by either of those patterns:

<choice>
  <include href="pattern1.trex"/>
  <include href="pattern2.trex"/>
</choice>

7.2 Merging grammars

The include element is also allowed as a child of a grammar pattern. In this case the specified URL must contain a grammar pattern, and the included grammar will be merged with the including grammar. Normally a duplicate definition is an error. However, if the two definitions are from different files, then the later definition can be combined with the earlier one. The combine attribute specifies how the two should be combined. If there is no combine attribute, the duplicate definition is an error. The simplest value for combine is replace, which says to replace the earlier definition with the later one.

Suppose the file addressBook.trex contains the following grammar pattern:

<grammar>

  <start>
    <element name="addressBook">
      <zeroOrMore>
	<element name="card">
	  <element name="name">
	    <anyString/>
	  </element>
	  <element name="email">
	    <anyString/>
	  </element>
          <ref name="card.local"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="card.local">
    <empty/>
  </define>

</grammar>

Another pattern could customize addressBook.trex as follows:

<grammar>

  <include href="addressBook.trex"/>

  <define name="card.local" combine="replace">
    <optional>
      <element name="note">
	<anyString/>
      </element>
    </optional>
  </define>

</grammar>

This would be equivalent to:

<grammar>

  <start>
    <element name="addressBook">
      <zeroOrMore>
	<element name="card">
	  <element name="name">
	    <anyString/>
	  </element>
	  <element name="email">
	    <anyString/>
	  </element>
          <ref name="card.local"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="card.local">
    <optional>
      <element name="note">
	<anyString/>
      </element>
    </optional>
  </define>

</grammar>

which is equivalent to

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
	<anyString/>
      </element>
      <element name="email">
	<anyString/>
      </element>
      <optional>
	<element name="note">
	  <anyString/>
	</element>
      </optional>
    </element>
  </zeroOrMore>
</element>

The combine attribute can also specify the name of an element to use to combine the earlier pattern and the later pattern. For example we could have written our customization as:

<grammar>

  <include href="addressBook.trex"/>

  <define name="card.local" combine="group">
    <optional>
      <element name="note">
	<anyString/>
      </element>
    </optional>
  </define>

</grammar>

This would be equivalent to:

<grammar>

  <start>
    <element name="addressBook">
      <zeroOrMore>
	<element name="card">
	  <element name="name">
	    <anyString/>
	  </element>
	  <element name="email">
	    <anyString/>
	  </element>
          <ref name="card.local"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="card.local">
    <group>
      <empty/>
      <optional>
        <element name="note">
	  <anyString/>
        </element>
      </optional>
    </group>
  </define>

</grammar>

This has the same meaning as before, since adding an empty pattern to the content of a group pattern does not make any difference to what the group pattern matches. We could also have used combine="choice" here:

<grammar>

  <include href="addressBook.trex"/>

  <define name="card.local" combine="choice">
    <!-- no optional element needed this time -->
    <element name="note">
      <anyString/>
    </element>
  </define>

</grammar>

This would be equivalent to:

<grammar>

  <start>
    <element name="addressBook">
      <zeroOrMore>
	<element name="card">
	  <element name="name">
	    <anyString/>
	  </element>
	  <element name="email">
	    <anyString/>
	  </element>
          <ref name="card.local"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="card.local">
    <choice>
      <empty/>
      <element name="note">
        <anyString/>
      </element>
    </choice>
  </define>

</grammar>

This has the same meaning as before, since an optional pattern is equivalent to a choice between the pattern and empty.

The notAllowed pattern never matches anything. Just as adding empty to a group makes no difference, so adding notAllowed to a choice makes no difference. It is typically used in a definition that is referenced in a choice element to allow an including pattern to specify additional choices. For example, suppose a TREX pattern inline.trex provides a pattern for inline content, which allows bold and italic elements arbitrarily nested:

<grammar>

  <start name="inline">
    <zeroOrMore>
      <choice>
	<anyString/>
	<element name="bold">
	  <ref name="inline"/>
	</element>
	<element name="italic">
	  <ref name="inline"/>
	</element>
        <ref name="local.inline"/>
      </choice>
    </zeroOrMore>
  </start>

  <define name="local.inline">
    <notAllowed/>
  </define>

</grammar>

Another TREX pattern could use inline.trex and add code and em to the set of inline elements as follows:

<grammar>

  <include href="inline.trex"/>

  <start>
    <element name="doc">
      <zeroOrMore>
	<element name="p">
	  <ref name="inline"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="local.inline" combine="replace">
    <choice>
      <element name="code">
	<ref name="inline"/>
      </element>
      <element name="em">
	<ref name="inline"/>
      </element>
    </choice>
  </define>
  
</grammar>
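
With this customization in place, a document such as the following (an illustrative instance) should match, since code and em are now among the choices allowed wherever inline content is allowed:

<doc>
  <p>Use <code>include</code> to pull in <em>another</em> <bold>grammar</bold>.</p>
</doc>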

We could instead have used combine="choice". In this case, inline.trex would need to separate out the choices as a separate definition:

<grammar>

  <start name="inline">
    <zeroOrMore>
      <ref name="inline.class"/>
    </zeroOrMore>
  </start>

  <define name="inline.class">
    <choice>
      <anyString/>
      <element name="bold">
	<ref name="inline"/>
      </element>
      <element name="italic">
	<ref name="inline"/>
      </element>
    </choice>
  </define>

</grammar>

and the customization would add to those choices:

<grammar>

  <include href="inline.trex"/>

  <start>
    <element name="doc">
      <zeroOrMore>
	<element name="p">
	  <ref name="inline"/>
	</element>
      </zeroOrMore>
    </element>
  </start>

  <define name="inline.class" combine="choice">
    <choice>
      <element name="code">
	<ref name="inline"/>
      </element>
      <element name="em">
	<ref name="inline"/>
      </element>
    </choice>
  </define>
  
</grammar>
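
By analogy with the card.local example earlier in this section, the merged definition of inline.class would then be expected to be equivalent to a choice between the earlier definition and the later one. A sketch of that result:

  <define name="inline.class">
    <choice>
      <choice>
        <anyString/>
        <element name="bold">
          <ref name="inline"/>
        </element>
        <element name="italic">
          <ref name="inline"/>
        </element>
      </choice>
      <choice>
        <element name="code">
          <ref name="inline"/>
        </element>
        <element name="em">
          <ref name="inline"/>
        </element>
      </choice>
    </choice>
  </define>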

8 Namespaces

TREX is namespace-aware. Thus, it considers an element or attribute to have both a local name and a namespace URI which together constitute the name of that element or attribute.

8.1 Using the ns attribute

The element pattern uses an ns attribute to specify the namespace URI of the elements that it matches. For example

<element name="foo" ns="http://www.example.com">
  <empty/>
</element>

would match any of

<foo xmlns="http://www.example.com"/>
<e:foo xmlns:e="http://www.example.com"/>
<example:foo xmlns:example="http://www.example.com"/>

but not any of

<foo/>
<e:foo xmlns:e="http://WWW.EXAMPLE.COM"/>
<example:foo xmlns:example="http://www.example.net"/>

A value of an empty string for the ns attribute indicates a null or absent namespace URI (just as with the xmlns attribute). Thus, the pattern

<element name="foo" ns="">
  <empty/>
</element>

matches any of

<foo xmlns=""/>
<foo/>

but not any of

<foo xmlns="http://www.example.com"/>
<e:foo xmlns:e="http://www.example.com"/>

It is tedious and error-prone to specify the ns attribute on every element, so TREX allows it to be defaulted. If an element pattern does not specify an ns attribute, then it defaults to the value of the ns attribute of the nearest ancestor that has an ns attribute, or the empty string if there is no such ancestor. Thus

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

is equivalent to

<element name="addressBook" ns="">
  <zeroOrMore>
    <element name="card" ns="">
      <element name="name" ns="">
        <anyString/>
      </element>
      <element name="email" ns="">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

and

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

is equivalent to

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card" ns="http://www.example.com">
      <element name="name" ns="http://www.example.com">
        <anyString/>
      </element>
      <element name="email" ns="http://www.example.com">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

The attribute pattern also takes an ns attribute. However, there is a difference in how it defaults. This is because the XML Namespaces Recommendation does not apply the default namespace to attributes. If an ns attribute is not specified on the attribute pattern, then it defaults to the empty string. Thus

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card">
      <attribute name="name"/>
      <attribute name="email"/>
    </element>
  </zeroOrMore>
</element>

is equivalent to

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card" ns="http://www.example.com">
      <attribute name="name" ns=""/>
      <attribute name="email" ns=""/>
    </element>
  </zeroOrMore>
</element>

and so will match

<addressBook xmlns="http://www.example.com">
  <card name="John Smith" email="js@example.com"/>
</addressBook>

or

<example:addressBook xmlns:example="http://www.example.com">
  <example:card name="John Smith" email="js@example.com"/>
</example:addressBook>

but not

<example:addressBook xmlns:example="http://www.example.com">
  <example:card example:name="John Smith" example:email="js@example.com"/>
</example:addressBook>

To match this last example, the attribute patterns must specify global="true":

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card">
      <attribute name="name" global="true"/>
      <attribute name="email" global="true"/>
    </element>
  </zeroOrMore>
</element>

This is equivalent to:

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card" ns="http://www.example.com">
      <attribute name="name" ns="http://www.example.com"/>
      <attribute name="email" ns="http://www.example.com"/>
    </element>
  </zeroOrMore>
</element>

Thus, specifying global="true" on an attribute pattern makes the ns attribute default in the same way that it does on an element pattern.

The ns attribute is allowed on any element in a TREX pattern. The global attribute is allowed only on an attribute pattern.
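
Because the default is taken from the nearest ancestor that has an ns attribute, one way to use this is to put a single ns attribute on the outermost element of a pattern. A sketch, assuming a grammar whose elements all belong to the example namespace used above (the Card definition name is chosen for illustration):

<grammar ns="http://www.example.com">

  <start>
    <element name="addressBook">
      <zeroOrMore>
        <ref name="Card"/>
      </zeroOrMore>
    </element>
  </start>

  <define name="Card">
    <element name="card">
      <attribute name="name"/>
      <attribute name="email"/>
    </element>
  </define>

</grammar>

Here both element patterns default to ns="http://www.example.com", while the two attribute patterns, which do not specify global="true", still default to the empty string.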

8.2 Qualified names

When a pattern matches elements and attributes from multiple namespaces, using the ns attribute would require repeating namespace URIs in different places in the pattern. This is error-prone and hard to maintain, so TREX also allows the element and attribute patterns to use a prefix in the value of the name attribute to specify the namespace URI. In this case, the prefix specifies the namespace URI to which that prefix is bound by the namespace declarations in scope on the element or attribute pattern. Thus

<element name="e:addressBook" xmlns:e="http://www.example.com">
  <zeroOrMore>
    <element name="e:card">
      <element name="e:name">
        <anyString/>
      </element>
      <element name="e:email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

is equivalent to

<element name="addressBook" ns="http://www.example.com">
  <zeroOrMore>
    <element name="card" ns="http://www.example.com">
      <element name="name" ns="http://www.example.com">
        <anyString/>
      </element>
      <element name="email" ns="http://www.example.com">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>
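
The benefit shows most clearly with more than one namespace. In the following sketch, the two namespace URIs and the n:note element are assumptions made for illustration; each URI is written once, in a namespace declaration, and referred to by prefix thereafter:

<element name="a:addressBook"
         xmlns:a="http://www.example.com/address"
         xmlns:n="http://www.example.com/note">
  <zeroOrMore>
    <element name="a:card">
      <element name="a:name">
        <anyString/>
      </element>
      <element name="a:email">
        <anyString/>
      </element>
      <optional>
        <element name="n:note">
          <anyString/>
        </element>
      </optional>
    </element>
  </zeroOrMore>
</element>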

If a prefix is specified in the value of the name attribute of an element or attribute pattern, then that prefix determines the namespace URI of the elements or attributes that will be matched by that pattern, regardless of the value of any ns attribute.

Note that the XML default namespace (as specified by the xmlns attribute) is not used in determining the namespace URI of elements and attributes that element and attribute patterns match.
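
As an illustration of this point, consider the following pattern, in which the xmlns attribute places the pattern elements themselves in the TREX namespace (see section 8.3):

<element name="foo" xmlns="http://www.thaiopensource.com/trex">
  <empty/>
</element>

The xmlns declaration has no effect on what is matched. With no ns attribute in scope, this pattern should match a foo element with no namespace URI, not a foo element in the http://www.thaiopensource.com/trex namespace.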

8.3 Pattern element namespace

A TREX pattern can use the namespace URI http://www.thaiopensource.com/trex for the pattern elements. If it uses this namespace URI for the root element, it must use it for all descendant elements. If it does not use a namespace URI for the root element, it must not use one for any descendant elements. Thus, any of

<element name="addressBook" xmlns="http://www.thaiopensource.com/trex">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

<trex:element name="addressBook" xmlns:trex="http://www.thaiopensource.com/trex">
  <trex:zeroOrMore>
    <trex:element name="card">
      <trex:element name="name">
        <trex:anyString/>
      </trex:element>
      <trex:element name="email">
        <trex:anyString/>
      </trex:element>
    </trex:element>
  </trex:zeroOrMore>
</trex:element>

<element name="addressBook" xmlns="">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

<element name="addressBook">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>

is allowed. But

<trex:element name="addressBook" xmlns:trex="http://www.thaiopensource.com/trex">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</trex:element>

is not allowed.

If a TREX element has an attribute or child element with a namespace URI other than the TREX namespace, then that attribute or element is ignored. Thus, you can add annotations to TREX patterns simply by using an attribute or element in a separate namespace:

<element name="addressBook" xmlns="http://www.thaiopensource.com/trex" xmlns:a="http://www.example.com/annotation">
  <zeroOrMore>
    <element name="card">
      <a:documentation>Information about a single email address.</a:documentation>
      <element name="name">
        <anyString/>
      </element>
      <element name="email">
        <anyString/>
      </element>
    </element>
  </zeroOrMore>
</element>
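
Annotation attributes work the same way. In the following sketch, a:status is a hypothetical annotation attribute, not something defined by TREX; because its namespace URI is not the TREX namespace, it is ignored:

<element name="card" a:status="draft"
    xmlns="http://www.thaiopensource.com/trex"
    xmlns:a="http://www.example.com/annotation">
  <element name="name">
    <anyString/>
  </element>
  <element name="email">
    <anyString/>
  </element>
</element>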

9 Name classes

Normally, the name of the element to be matched by an element pattern is specified by a name attribute. An element pattern can instead start with an element specifying a name-class. In this case, the element pattern will only match an element if the name of the element is a member of the name-class. The simplest name-class is anyName, of which every name is a member, regardless of its local name and its namespace URI. For example, the following pattern matches any well-formed XML document:

<grammar>

  <start name="anyElement">
    <element>
      <anyName/>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <anyString/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </start>

</grammar>

The nsName name-class contains any name with the namespace URI specified by the ns attribute, which defaults in the same way as the ns attribute on the element pattern.

The choice name-class matches any name that is a member of any of its child name-classes.
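
As a minimal sketch of these two name-classes, the following pattern matches an element with any local name, provided that its namespace URI is one of the two listed. The URI http://www.example.org is a hypothetical second namespace used only for illustration:

<element>
  <choice>
    <nsName ns="http://www.example.com"/>
    <nsName ns="http://www.example.org"/>
  </choice>
  <anyString/>
</element>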

The not name-class contains any name that is not a member of the child name-class.

For example

<element name="card" ns="http://www.example.com">
  <zeroOrMore>
    <attribute>
      <not>
        <choice>
          <nsName/>
          <nsName ns=""/>
        </choice>
      </not>
    </attribute>
  </zeroOrMore>
  <anyString/>
</element>

would allow the card element to have any number of namespace-qualified attributes provided that they were qualified with a namespace other than that of the card element.
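
For instance, assuming a hypothetical extension namespace http://www.example.org/ext, the pattern above would match

<card xmlns="http://www.example.com" xmlns:x="http://www.example.org/ext" x:id="c1">John Smith</card>

but would match neither of the following, because the id attribute in the first has no namespace URI and the id attribute in the second has the same namespace URI as the card element:

<card xmlns="http://www.example.com" id="c1">John Smith</card>

<e:card xmlns:e="http://www.example.com" e:id="c1">John Smith</e:card>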

Note that an attribute pattern matches a single attribute even if it has a name-class that contains multiple names. To match zero or more attributes, the zeroOrMore element must be used.

The difference name-class contains any name that is a member of the first child name-class, but not a member of any of the following name-classes. The not name-class is, in fact, a shorthand for difference:

<not> name-class </not>

is short for

<difference> <anyName/> name-class </difference>
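
As an illustration of difference with more than one following name-class, this sketch of an attribute pattern matches a single attribute whose name is namespace-qualified and is not in the xml namespace:

<attribute>
  <difference>
    <anyName/>
    <nsName ns=""/>
    <nsName ns="http://www.w3.org/XML/1998/namespace"/>
  </difference>
</attribute>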

The name name-class contains a single name. The content of the name element specifies the name in the same way as the name attribute of the element pattern. The ns attribute specifies the namespace URI in the same way as the element pattern.
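
For example, the following sketch uses name inside a choice name-class to match either a b element or an i element; the element names are hypothetical and chosen only for illustration:

<element>
  <choice>
    <name>b</name>
    <name>i</name>
  </choice>
  <anyString/>
</element>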

Some schema languages have a concept of lax validation, where an element or attribute is validated against a definition only if there is one. We can implement this concept in TREX with name-classes that use difference and name. Suppose, for example, we wanted to allow an element to have any attribute with a qualified name, but we still wanted to ensure that if there was an xml:space attribute, it had the value default or preserve. It wouldn't work to use:

<element name="example">
  <zeroOrMore>
    <attribute>
      <anyName/>
    </attribute>
  </zeroOrMore>
  <optional>
    <attribute name="xml:space">
      <choice>
        <string>default</string>
        <string>preserve</string>
      </choice>
    </attribute>
  </optional>
</element>

because an xml:space attribute with a value other than default or preserve would match

    <attribute>
      <anyName/>
    </attribute>

even though it did not match

    <attribute name="xml:space">
      <choice>
        <string>default</string>
        <string>preserve</string>
      </choice>
    </attribute>

The solution is to use name together with difference:

<element name="example">
  <zeroOrMore>
    <attribute>
      <difference>
        <anyName/>
        <name>xml:space</name>
      </difference>
    </attribute>
  </zeroOrMore>
  <optional>
    <attribute name="xml:space">
      <choice>
        <string>default</string>
        <string>preserve</string>
      </choice>
    </attribute>
  </optional>
</element>
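
With this pattern, and assuming a hypothetical namespace http://www.example.org/ext for the extra attribute, a document such as

<example xml:space="preserve" xmlns:x="http://www.example.org/ext" x:note="anything"/>

matches, whereas

<example xml:space="compact"/>

does not, because xml:space is excluded from the name-class of the first attribute pattern and compact is not one of the values allowed by the second.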

Note that the define element cannot contain a name-class; it can only contain a pattern.

10 Datatyping

TREX does not have any system of datatypes built in. Rather it expects to partner with a datatyping vocabulary, such as Part 2 of the W3C's XML Schema language. TREX implementations may differ in the datatyping vocabularies they support. You must pick a datatyping vocabulary that is supported by the implementation you plan to use.

10.1 Named datatypes

The data pattern matches a string that is a lexical representation of a value of a named datatype. The type attribute of data contains the qualified name of the datatype. For example, if a TREX implementation supported the built-in datatypes of the W3C's XML Schema Language, you could use:

<element name="number" xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <data type="xsd:integer"/>
</element>

As with element and attribute, data can use the ns attribute to explicitly specify the namespace of the datatype, instead of using a prefix within the value of the type attribute.
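
A sketch of the same pattern written with the ns attribute instead of a prefix, under the same assumption about implementation support for the W3C XML Schema datatypes:

<element name="number">
  <data type="integer" ns="http://www.w3.org/2000/10/XMLSchema"/>
</element>

Either form would be expected to match <number>42</number> but not <number>forty-two</number>.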

The prohibition against a string pattern's matching only part of the content of an element also applies to data patterns.

10.2 Anonymous datatypes

Sometimes it is desirable to use a datatype that does not have a name, for example a datatype derived by restricting the allowed values of some other named datatype. TREX supports this by taking advantage of the extensibility provided by XML with namespaces. The anonymous datatype must be represented by an XML element in a different namespace. In addition, the element must have a trex:role="datatype" attribute to signal to TREX that this element plays the role of specifying a datatype. (In the absence of such an attribute, an element in a different namespace would be treated as an annotation which TREX can ignore.) For example, if a TREX implementation supported the xsd:restriction element defined by Part 2 of the W3C's XML Schema Language, you could use:

<trex:element name="age"
    xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    xmlns:trex="http://www.thaiopensource.com/trex">
  <xsd:restriction base="xsd:nonNegativeInteger" trex:role="datatype">
    <xsd:maxInclusive value="150"/>
  </xsd:restriction>
</trex:element>

The trex:role attribute need not use the trex prefix, but it must have a prefix and the prefix must be bound by the in-scope namespace declarations to the namespace URI http://www.thaiopensource.com/trex. If you are using the default namespace for TREX pattern elements, you cannot use a bare role attribute; you must also declare a namespace prefix bound to the same namespace URI as the default namespace. For example

<element name="age"
    xmlns="http://www.thaiopensource.com/trex"
    xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    xmlns:trex="http://www.thaiopensource.com/trex">
  <xsd:restriction base="xsd:nonNegativeInteger" trex:role="datatype">
    <xsd:maxInclusive value="150"/>
  </xsd:restriction>
</element>
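
Since the prefix itself is arbitrary, the same pattern could also be sketched with a different prefix; t is a hypothetical prefix chosen only for illustration:

<element name="age"
    xmlns="http://www.thaiopensource.com/trex"
    xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    xmlns:t="http://www.thaiopensource.com/trex">
  <xsd:restriction base="xsd:nonNegativeInteger" t:role="datatype">
    <xsd:maxInclusive value="150"/>
  </xsd:restriction>
</element>

On an implementation that supports xsd:restriction, either pattern would be expected to match <age>42</age> but not <age>200</age>.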

The prohibition against a string pattern's matching only part of the content of an element also applies to anonymous datatype patterns.

11 Advanced features

11.1 Nested grammars

There is no prohibition against nesting grammar patterns. A ref pattern refers to the definition from the nearest grammar ancestor. However, by putting a parent="true" attribute on ref, it is possible to escape out of the current grammar and reference its parent grammar.

Imagine the problem of writing a pattern for tables. The pattern for tables only cares about the structure of tables; it doesn't care about what goes inside a table cell. First, we create a TREX pattern table.trex as follows:

<grammar>

<define name="cell.content">
  <notAllowed/>
</define>

<start>
  <element name="table">
    <oneOrMore>
      <element name="tr">
        <oneOrMore>
          <element name="td">
            <ref name="cell.content"/>
          </element>
        </oneOrMore>
      </element>
    </oneOrMore>
  </element>
</start>

</grammar>

Patterns that include table.trex must redefine cell.content. By using a nested grammar pattern containing a ref pattern with parent="true", the including pattern can redefine cell.content to be a pattern defined in the including pattern's grammar, thus effectively importing a pattern from the parent grammar into the child grammar:

<grammar>

<start>
  <element name="doc">
    <zeroOrMore>
      <choice>
        <element name="p">
          <ref name="inline"/>
        </element>
        <grammar>
          <include href="table.trex"/>
          <define name="cell.content">
            <ref name="inline" parent="true"/>
          </define>
        </grammar>
      </choice>
    </zeroOrMore>
  </element>
</start>

<define name="inline">
  <zeroOrMore>
    <choice>
      <anyString/>
      <element name="em">
        <ref name="inline"/>
      </element>
    </choice>
  </zeroOrMore>
</define>

</grammar>

Of course, in a trivial case like this, there is no advantage in nesting the grammars: we could simply have included table.trex within the outer grammar element. However, when the included grammar has many definitions, nesting it avoids the possibility of name conflicts between the including grammar and the included grammar.
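
For illustration, a document along the following lines would be expected to match the including pattern, with the td content checked against the inline pattern of the outer grammar:

<doc>
  <p>Some <em>emphasized</em> text.</p>
  <table>
    <tr>
      <td>A cell with <em>inline</em> content.</td>
    </tr>
  </table>
</doc>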

11.2 Concur

The concur pattern matches if all of its subpatterns simultaneously match. Suppose you have two versions of a TREX pattern stored in old.trex and new.trex, and suppose you want to check that a document matches both the old and new patterns. The following pattern will do this:

<concur>
  <include href="old.trex"/>
  <include href="new.trex"/>
</concur>

The concur pattern can be used to implement exclusions. Suppose you have a td element and the content should be anything that matches the Flow pattern, except that a table element is not allowed as a descendant of td. First we write a pattern that matches any content that does not include a table at any depth:

<define name="not-table">
  <zeroOrMore>
    <choice>
      <anyString/>
      <element>
        <not>
          <name>table</name>
        </not>
        <zeroOrMore>
          <attribute>
            <anyName/>
          </attribute>
        </zeroOrMore>
        <ref name="not-table"/>
      </element>
    </choice>
  </zeroOrMore>
</define>

Then we can write the pattern for td as follows:

<element name="td">
  <concur>
    <ref name="Flow"/>
    <ref name="not-table"/>
  </concur>
</element>

The concur pattern is also useful for validation of documents using multiple namespaces. For example, suppose the file example.trex contains a pattern that describes a language for the namespace URI http://www.example.com, and suppose this language allows all elements to have any attribute with a namespace-qualified name:

<grammar ns="http://www.example.com">
  <start>
    <element name="example">
      <ref name="other.atts"/>
    </element>
  </start>

  <define name="other.atts">
    <zeroOrMore>
      <attribute>
        <not>
          <nsName ns=""/>
        </not>
      </attribute>
    </zeroOrMore>
  </define>
</grammar>

Suppose we also have a file xml.trex that checks that a document uses the xml namespace correctly:

<grammar ns="http://www.w3.org/XML/1998/namespace">
  <start name="any">
    <element>
      <not>
        <nsName/>
      </not>
      <optional>
        <attribute name="space" global="true">
          <choice>
            <string>default</string>
            <string>preserve</string>
          </choice>
        </attribute>
      </optional>
      <optional>
        <attribute name="lang" global="true">
          <data type="language" ns="http://www.w3.org/2000/10/XMLSchema"/>
        </attribute>
      </optional>
      <zeroOrMore>
        <attribute>
          <not>
            <nsName/>
          </not>
        </attribute>
      </zeroOrMore>
      <zeroOrMore>
        <choice>
          <anyString/>
          <ref name="any"/>
        </choice>
      </zeroOrMore>
    </element>
  </start>
</grammar>

Then the following pattern would match a document only if it satisfied the requirements of both namespaces:

<concur>
  <include href="example.trex"/>
  <include href="xml.trex"/>
</concur>
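
As an illustration, the following document would be expected to match the combined pattern:

<example xmlns="http://www.example.com" xml:space="preserve"/>

By contrast, the next document matches example.trex, which accepts any namespace-qualified attribute, but not xml.trex, which restricts the value of xml:space, so it does not match the concur pattern (bogus is a hypothetical value chosen only for illustration):

<example xmlns="http://www.example.com" xml:space="bogus"/>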

12 Non-restrictions

TREX does not require patterns to be "deterministic" or "unambiguous".

Suppose we wanted to write the email address book in HTML, but use class attributes to specify the structure:

<element name="html">
  <element name="head">
    <element name="title">
      <anyString/>
    </element>
  </element>
  <element name="body">
    <element name="table">
      <attribute name="class">
        <string>addressBook</string>
      </attribute>
      <oneOrMore>
        <element name="tr">
          <attribute name="class">
            <string>card</string>
          </attribute>
          <element name="td">
            <attribute name="class">
              <string>name</string>
            </attribute>
            <interleave>
              <anyString/>
              <optional>
                <element name="span">
                  <attribute name="class">
                    <string>givenName</string>
                  </attribute>
                  <anyString/>
                </element>
              </optional>
              <optional>
                <element name="span">
                  <attribute name="class">
                    <string>familyName</string>
                  </attribute>
                  <anyString/>
                </element>
              </optional>
            </interleave>
          </element>
          <element name="td">
            <attribute name="class">
              <string>email</string>
            </attribute>
            <anyString/>
          </element>
        </element>
      </oneOrMore>
    </element>
  </element>
</element>

This would match an XML document such as:

<html>
  <head>
    <title>Example Address Book</title>
  </head>
  <body>
    <table class="addressBook">
      <tr class="card">
        <td class="name">
          <span class="givenName">John</span>
          <span class="familyName">Smith</span>
        </td>
        <td class="email">js@example.com</td>
      </tr>
    </table>
  </body>
</html>

but not

<html>
  <head>
    <title>Example Address Book</title>
  </head>
  <body>
    <table class="addressBook">
      <tr class="card">
        <td class="name">
          <span class="givenName">John</span>
          <!-- Note the incorrect class attribute -->
          <span class="givenName">Smith</span>
        </td>
        <td class="email">js@example.com</td>
      </tr>
    </table>
  </body>
</html>

13 Non-features

The role of TREX is simply to specify a class of documents, not to assist in interpretation of the documents belonging to the class. It does not change the infoset of the document. In particular, TREX

Also TREX does not define a way for an XML document to associate itself with a TREX pattern.