The Linux Page

#PCDATA and DTDs

Foggy error reports when dealing with #PCDATA in a DTD.

If, like me, you write DTDs to check your XML files for mistakes, then you have probably run into this problem before.

#PCDATA behaves in a very special way and is heavily restricted, as follows:

  • #PCDATA must appear at the start of its group
  • The group must be repeated zero or more times, so only * works with it (the one exception is a bare (#PCDATA) with no element names, which needs no *)
  • #PCDATA cannot be used within sub-groups (nested parentheses)

Something like this:

<!ELEMENT Z (P | (#PCDATA | A | B | C)* | Q)+>

does not work because the outer group uses + and #PCDATA is within a sub-group.

What you need to do is have this instead:

<!ELEMENT Z (#PCDATA | A | B | C)*>

And if P and Q are still necessary, add them there too:

<!ELEMENT Z (#PCDATA | A | B | C | P | Q)*>
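For what it's worth, the three rules above boil down to a single pattern, which can be sketched as a regular expression. This is only a rough Python sketch and not what xmllint does internally: NAME, MIXED and is_mixed_content_model are names made up for the illustration, and element names are assumed to be plain ASCII.

```python
import re

# Element name, simplified to ASCII for this sketch (an assumption;
# the XML specification allows many more characters).
NAME = r"[A-Za-z_:][A-Za-z0-9_.:-]*"

# Mixed content as the XML 1.0 grammar defines it:
#   '(' '#PCDATA' ('|' Name)* ')*'   or   '(' '#PCDATA' ')'
MIXED = re.compile(
    rf"\(\s*#PCDATA(?:\s*\|\s*{NAME})*\s*\)\*"
    rf"|\(\s*#PCDATA\s*\)"
)


def is_mixed_content_model(model):
    """Return True if `model` is a well-formed mixed-content model."""
    return MIXED.fullmatch(model.strip()) is not None


if __name__ == "__main__":
    for model in (
        "(P | (#PCDATA | A | B | C)* | Q)+",
        "(#PCDATA | A | B | C)*",
        "(#PCDATA | A | B | C | P | Q)*",
    ):
        print(model, "->", is_mixed_content_model(model))
```

Of the three declarations above, only the first one (with the + and the nested group) is rejected by this check.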

If you use a parameter entity and need #PCDATA, it is very likely that you'll have to pull the #PCDATA out of the entity and move it to the element declaration:

<!ENTITY % E "(#PCDATA | A | B | C)*">
<!ELEMENT Z %E;> <!-- this one works fine -->
<!ELEMENT Z (%E; | P | Q)*> <!-- this one fails! -->

Notice that it fails because #PCDATA ends up within a sub-group once the entity is expanded. Since both the entity and the group that uses it are followed by an asterisk, the nesting is logically redundant, but the parser still rejects it because mixed content cannot contain nested groups.

In most cases what you will have to do is move the #PCDATA to the element and remove all parentheses from your entities:

<!ENTITY % E "A | B | C">
<!ELEMENT Z (#PCDATA | %E; | P | Q)*> <!-- that works -->
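To see why one form works and the other fails, it can help to expand the entity by hand and look at what the parser ends up with. Here is a rough Python sketch, assuming ASCII names and a single level of entities. The helpers expand, PE_REF and is_mixed_content_model are hypothetical, not part of any library. The one detail borrowed from the XML specification is that the replacement text of a parameter entity is padded with one space on each side when it is included in the DTD.

```python
import re

# Element/entity name, simplified to ASCII for this sketch (an assumption).
NAME = r"[A-Za-z_:][A-Za-z0-9_.:-]*"

# Mixed content: '(' '#PCDATA' ('|' Name)* ')*'  or  '(' '#PCDATA' ')'
MIXED = re.compile(
    rf"\(\s*#PCDATA(?:\s*\|\s*{NAME})*\s*\)\*"
    rf"|\(\s*#PCDATA\s*\)"
)

# A parameter-entity reference such as %E;
PE_REF = re.compile(rf"%({NAME});")


def expand(model, entities):
    """Replace each %name; with its replacement text, padded with spaces."""
    return PE_REF.sub(lambda m: " " + entities[m.group(1)] + " ", model)


def is_mixed_content_model(model):
    """Return True if `model` is a well-formed mixed-content model."""
    return MIXED.fullmatch(model.strip()) is not None


if __name__ == "__main__":
    # Failing form: the entity carries #PCDATA and its own parentheses.
    bad = expand("(%E; | P | Q)*", {"E": "(#PCDATA | A | B | C)*"})
    print(bad, "->", is_mixed_content_model(bad))

    # Working form: #PCDATA lives in the element, the entity is a bare list.
    good = expand("(#PCDATA | %E; | P | Q)*", {"E": "A | B | C"})
    print(good, "->", is_mixed_content_model(good))
```

In the failing form the expanded text starts with two opening parentheses, so #PCDATA is no longer the first thing in the outer group. In the working form the expansion is one flat list that starts with #PCDATA.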

Sample errors that we get from xmllint when #PCDATA is misused (totally confusing if you ask me!):

Entity: line 1: parser error : xmlParseElementMixedContentDecl : Name expected
 %html-data;
            ^
Entity: line 1:
#PCDATA|a|b|br|div|em|i|img|p|small|strong|u
^
Entity: line 1: parser error : expected '>'
 %html-data;
            ^
Entity: line 1:
#PCDATA|a|b|br|div|em|i|img|p|small|strong|u
^
Entity: line 1: parser error : Content error in the external subset
 %html-data;
            ^
Entity: line 1:
#PCDATA|a|b|br|div|em|i|img|p|small|strong|u
^
Entity: line 1: parser error : ContentDecl : Name or '(' expected
 %html-data;
            ^
Entity: line 1:
(#PCDATA|a|b|br|div|em|i|img|p|small|strong|u)*
 ^
Entity: line 1: parser error : ContentDecl : ',' '|' or ')' expected
 %html-data;
            ^

Source: https://stackoverflow.com/.../why-is-this-not-a-valid-xml-dtd-parameter-entity-and-pcdata