XML/TA library for glib2 and pcre

xmlg 0.7.18
unfinished documents:
NODEMODEL
XPATHDEF
CHECKING
HYPERCSS
TEXTMINING
 XPath and shorthand

The XPath specification created by the W3C is
actually a retrieval language. It defines a set of check
functions that must all match for a node to be selected.
In most cases this will be an element name (i.e. select-from)
with logical operations on it (where-attrib==value).

The XPath documents however show an abbreviated syntax which
is a lot easier to implement. Some implementations may also
call these "basic XPath"s. Here is a little introduction to
this scheme. First of all, no functions exist by themselves,
only the abbreviations exist - a later processor may stack
on top of the basic syntax by handling functions that are
put into [...] parts. The basic xpaths of xmlg never
contain [...] parts other than numeric ones, i.e. [number].

 "A" ... matches all nodes being direct subnodes of the current
     node with a name of "A".
 "*" ... matches all children of the current node.
 A/B ... matches all nodes with name "B" being a child of those
     children of the current node that are named "A".
 */B ... matches the nodes named "B" at the second level under
     the current node.
 //B ... matches all nodes named "B" at any depth under the
     current node.
 //A//B ... any nodes "B" at any depth inside "A" nodes which may
     be found anywhere under the current node (and its tree).

 B[2] .. if there are multiple "B" nodes then only the second
     one will match. Omitting a [number] specification on a
     single-node lookup will return the [1]st node of course.
 B@href .. from multiple nodes "B" only those are selected which
     have an attribute "href".
 //*@href .. any node in the tree at any depth that contains an
    attribute "href". Note that the uppermost node containing such
    an attribute will be found even if lower nodes have such a
    href too. All basic xpaths are non-greedy.
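
The shorthand forms listed above can be tried out with a small
sketch. It is written in Python on top of the standard
xml.etree.ElementTree module and is not the xmlg API: the names
parse_step() and select() are made up for this illustration, and
the non-greedy behaviour mentioned above is not modelled.

```python
import re
import xml.etree.ElementTree as ET

def parse_step(step):
    """Split one step such as 'B[2]' or 'B@href' into (name, attr, index)."""
    index = None
    if step.endswith("]") and "[" in step:
        step, _, number = step[:-1].partition("[")
        index = int(number)
    name, _, attr = step.partition("@")
    return name, (attr or None), index

def select(node, path):
    """Return the nodes below `node` selected by a shorthand path."""
    current = [node]
    # each step is a name with an optional leading "/" or "//"
    for sep, step in re.findall(r"(//?)?([^/]+)", path):
        name, attr, index = parse_step(step)
        selected = []
        for ctx in current:
            if sep == "//":
                # any depth under the context node, in document order
                candidates = [d for d in ctx.iter() if d is not ctx]
            else:
                # direct subnodes only
                candidates = list(ctx)
            hits = [c for c in candidates
                    if (name == "*" or c.tag == name)
                    and (attr is None or attr in c.attrib)]
            if index is not None:
                hits = hits[index - 1:index]   # [number] counts from 1
            for hit in hits:
                if not any(hit is seen for seen in selected):
                    selected.append(hit)
        current = selected
    return current

doc = ET.fromstring(
    '<root><A><B href="x"/><B/></A><B/><C><A><B/></A></C></root>')
second_b = select(doc, "A/B[2]")
with_href = select(doc, "//*@href")
```

Here the [number] check is applied per context node, which is one
possible reading of the B[2] rule; the real library may count
differently.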

 strstr matching

The basic xpath model is expanded in xmlg by matching the names
of nodes with an expression syntax. The most basic xmlglib has
a simple extension of searching with a "strstr()" function
when the first character of a namespec is a single "*". That
is, a select of
   //*A//*U     will also match   <BLAFF><HOO><BLUBB>... nodes.

Of course this fits well within the xpath abbreviations where
a "*"-star matches any node. Strstr-matching with an empty
strstr-spec likewise matches all nodes since every node
name contains the zero-length string.

The same style of strstr-matching is used throughout the
library - in the xpath syntax it also applies to the
matching of attribute-names. The exact-matching style (i.e.
strcmp-style) is the default and the strstr-style is selected
with a leading "*" in the match-expression part.
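
The name rule described in this section fits in a few lines. The
following is only a Python model of that rule, with a made-up
function name name_matches(); the library itself is C and uses
strcmp() and strstr() for the two styles.

```python
def name_matches(spec, name):
    """Model of the name check: a leading "*" selects strstr-style."""
    if spec.startswith("*"):
        # strstr-style: the rest of the spec may occur anywhere in the name
        return spec[1:] in name
    # strcmp-style: the default, the whole name must be equal
    return spec == name

# the steps of //*A//*U checked against the names from the example above
blaff_ok = name_matches("*A", "BLAFF")
blubb_ok = name_matches("*U", "BLUBB")
```

A bare "*" leaves an empty strstr-spec, so name_matches("*", name)
is true for every name, which is the plain xpath meaning of the star.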

 xpath comparison

The most common usage of xpath expressions takes the abbreviated
syntax that is largely similar to a filepath specification. In
xpath each of the selects can be followed by an assertion
check. None of these assertion checks is implemented for the
selection processing here. However, we can compare the formats
that are provided.

For example, the form "//A" is equivalent to an xpath selection
expressed as "//*[name()='A']" - in words it says to select
all nodes with a name equal to 'A'. Likewise we have to note
that the attribute-assertion above actually has to be put into
such an assertion check, i.e. "//*@id" is equivalent to an xpath
selection of "//*[@id]".

The strstr matching above has an equivalent in xpath selection
expressions as well - a query of "//*AA" is equivalent to
"//*[contains(name(),'AA')]". OTOH, I am currently not aware of
any xpath selection being equivalent to "//*@*AA" but you get
the point.
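
The equivalences in this section can be written down as a small
translator from the shorthand to the long xpath form. This is a
Python sketch with made-up names (step_to_xpath, to_xpath); it
covers only the three forms compared above, ignores [number], and
refuses strstr-style attribute names since no equivalent is given
here.

```python
import re

def step_to_xpath(step):
    """Rewrite one shorthand step as '*' plus bracketed assertions."""
    name, _, attr = step.partition("@")
    checks = []
    if name in ("", "*"):
        pass                                   # any name, no check needed
    elif name.startswith("*"):
        checks.append("contains(name(),'%s')" % name[1:])
    else:
        checks.append("name()='%s'" % name)
    if attr.startswith("*"):
        raise NotImplementedError("no xpath form given for @*name")
    if attr:
        checks.append("@" + attr)
    return "*" + "".join("[%s]" % check for check in checks)

def to_xpath(path):
    """Translate a whole shorthand path, keeping the / and // separators."""
    steps = re.findall(r"(//?)?([^/]+)", path)
    return "".join(sep + step_to_xpath(step) for sep, step in steps)

by_name = to_xpath("//A")
by_attr = to_xpath("//*@id")
by_substring = to_xpath("//*AA")
```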

 xpath extensions

It would be nice to implement a number of functions for the
xpath postfix bracket-syntax. Something like "//H1[not(@id)]"
would select all nodes of name H1 that do not yet
have any "id" attribute. Others could be useful as well, such
as count() or at least logical operators to follow up some
functions, i.e. "//*[count() < 5]".

OTOH, I do not intend to implement the prefix axis specifications
at any point - the "/" is said to be equivalent to an axis
specification of "/child::" and "//" is said to be equivalent
to "/descendant-or-self::". The others like "following-sibling"
cannot be expressed. Hopefully most of them can be expressed
in terms of forward-assertions on a postfix bracket syntax.

 pcre matching

Almost all xpath routines have a cousin in the xmlpcre part of
this project - there perl-style regular expressions are
applied, including a handy name1|name2 syntax. Per default
the pcre must match the complete name (i.e. the default is in
fact "^(?:name1|name2)$") which can again be relaxed to match
any subportion by giving a leading "*".

The style of name-matching is defined more in-depth in a later
document. Of course, the PCRE machine will not see a leading
"*" on such a pcre-name match, it just gets different modifier
flags. The `make check` run shows many examples of what xpaths
may look like in xmlglib and its xmlpcre part.
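
The anchoring rule for pcre names can be modelled as follows. The
sketch uses Python's re module as a stand-in for PCRE, and the
function name pcre_name_matches() is made up for this illustration;
it shows the described behaviour, not the xmlpcre implementation.

```python
import re

def pcre_name_matches(spec, name):
    """Model of a pcre name check with the optional leading "*"."""
    anywhere = spec.startswith("*")
    # the regex machine never sees the leading "*"
    pattern = spec[1:] if anywhere else spec
    regex = re.compile(pattern)
    if anywhere:
        # relaxed: the pattern may match any subportion of the name
        return regex.search(name) is not None
    # default: behaves like "^(?:pattern)$" on the complete name
    return regex.fullmatch(name) is not None

whole = pcre_name_matches("name1|name2", "name1")
partial = pcre_name_matches("name1|name2", "name12")
relaxed = pcre_name_matches("*name1|name2", "xname2x")
```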

A warning here: the "|"-syntax of PCRE does somewhat
interfere with the xpath selection syntax - there it
separates entire paths, i.e. "//A/B|//A/C" in xpath vs.
"//A/B|C" in pcre matching. In the xmlpcre matching part
we always make a regular expression match on the name part,
not the path as a whole.
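
To make the difference concrete, the sketch below splits a path
into steps first and only then looks at "|" inside each name,
and from that builds the xpath-style union. It is a Python
illustration with made-up function names and it handles plain
name alternatives only, no grouping or other regex syntax.

```python
import itertools
import re

def pcre_path_steps(path):
    """Split on the path separators only; "|" stays inside the name."""
    return re.findall(r"(//?)?([^/]+)", path)

def as_xpath_union(path):
    """Expand per-name alternatives into a union of entire paths."""
    choices = [[sep + name for name in step.split("|")]
               for sep, step in pcre_path_steps(path)]
    paths = ["".join(combo) for combo in itertools.product(*choices)]
    return "|".join(paths)

steps = pcre_path_steps("//A/B|C")
union = as_xpath_union("//A/B|C")
```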

 not yet implemented

It would be nice to add the "=" syntax throughout the xpath
selectors, i.e. name=text matches nodes with "name" having
a text-content of "text", and more importantly
*@attrib=value would match any node having an attribute
of the given name that furthermore has the given value. It
would then of course be possible to use PCRE expressions to do
the selection-match, not only strcmp-style matching.