XML/TA library for glib2 and pcre

xmlg 0.7.18
unfinished documents:
NODEMODEL
XPATHDEF
CHECKING
HYPERCSS
TEXTMINING
 the XM nodes struct

The XML nodes in XML/TA are a subclass of glib's GNodes. Their
basic header file is therefore called "xml/gnode.h" and the
methods start with "xml_g_node_", reflecting the original naming
scheme of "g_node_" - in fact, most are just inline wrappers
pointing at them.

As a GNode, it has four basic pointers ".next" ".prev" ".parent" and
".children". The glib functions don't quite like a root node with
siblings, but ".prev || ! .parent || .parent.children == .this"
should always hold as an assertion, i.e. ".children" points to
the first node in the sibling chain.
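
The assertion above can be written down as a small check. This is
only a sketch on a plain C stand-in struct: "SketchNode" and
"sketch_link_ok" are made-up names for illustration and are not
part of xml/gnode.h or glib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the four GNode link pointers. */
typedef struct SketchNode SketchNode;
struct SketchNode {
    SketchNode *next, *prev, *parent, *children;
};

/* Returns 1 when "prev || !parent || parent->children == this" holds,
   i.e. a node without a left sibling is its parent's first child. */
static int sketch_link_ok(const SketchNode *n)
{
    assert(n != NULL);
    return n->prev != NULL
        || n->parent == NULL
        || n->parent->children == n;
}
```

A tree walker could call such a check on every node it visits
while debugging link manipulations.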

The first member of "GNode" is a "gpointer data" which has been
renamed to "gchar* name". It holds the XML element name
and should be g_free'd. After the basic GNode layout, four
extra members exist - that's why you should not try to use
g_node_new / g_node_free on xml_GNodes: that will not be
correct due to glib's internal node pool scheme.

The three members "GString* text" and "gsize off" and "gsize end"
form the basis of the TA model. The TA = TextArray is actually
a GString, and the two size-offsets simply tell the start and end
of the text associated with this node. To duplicate a node's text
content you would use g_strndup (with length param!) as follows:
 g_strndup (node->text->str + node->off, node->end - node->off);

Using two gsize offsets has the benefit of easy navigation
inside the tree - every TA position can be easily looked up to
find the containing node by simply asking for off <= pos < end.
Note that a PCRE expression will generally return the spans as
a 2-tuple of offsets into the string it was given - those can
be directly fed to the navigation/insertion rules of the xmlg lib.
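
The lookup by "off <= pos < end" can be sketched as a simple
descent. The struct and the function name "sketch_node_at" are
assumptions made for this example, it is not the xmlg
implementation, and it assumes that sibling spans do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: only the fields needed for the lookup. */
typedef struct SketchNode SketchNode;
struct SketchNode {
    const char *name;
    size_t off, end;              /* span in the text array */
    SketchNode *next, *children;
};

/* Deepest node below (or equal to) root with off <= pos < end,
   or NULL when pos lies outside the root span. */
static SketchNode *sketch_node_at(SketchNode *root, size_t pos)
{
    assert(root != NULL);
    if (pos < root->off || pos >= root->end)
        return NULL;
    SketchNode *hit = root;
    SketchNode *child = root->children;
    while (child != NULL) {
        if (child->off <= pos && pos < child->end) {
            hit = child;              /* descend into this child */
            child = child->children;
        } else {
            child = child->next;      /* try the next sibling */
        }
    }
    return hit;
}
```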

Note that a node might not have a "name", in which case it should
be ignored entirely (invisible nodes), and when "text" has been
set to NULL then this node has no position in the XML tree
either, not even a starting position. Most subroutines should
be ready to check for these "problems" - the basic loader routines
will always inject the reference offsets into the text array.
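
A subroutine that guards against both "problems" before copying
a node's text might look like the following. This is a sketch in
plain C without glib: "SketchText" stands in for the GString and
malloc/memcpy stand in for g_strndup, and all names are invented
for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, not the xmlg types. */
typedef struct { const char *str; size_t len; } SketchText;
typedef struct {
    const char *name;        /* NULL marks an invisible node */
    const SketchText *text;  /* NULL: no position in the tree */
    size_t off, end;
} SketchNode;

/* A node is usable only when it has both a name and a text array. */
static int sketch_node_usable(const SketchNode *n)
{
    return n != NULL && n->name != NULL && n->text != NULL;
}

/* Returns a NUL-terminated copy of the node's text span, or NULL
   for unusable nodes or on allocation failure. Caller frees. */
static char *sketch_node_text_dup(const SketchNode *n)
{
    if (!sketch_node_usable(n))
        return NULL;
    assert(n->off <= n->end && n->end <= n->text->len);
    size_t len = n->end - n->off;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, n->text->str + n->off, len);
    copy[len] = '\0';
    return copy;
}
```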

The only other member is ".attributes". This used to be of type
"GHashTable<str,str>" but it has since been moved into its own
header type. It still exports method functions looking very
much like a str_hash but internally it has been turned into a
GList<struct<str,str>>. It turned out that most nodes have
a very low number of attributes if any, and it is much easier
to walk an attributes collection when it is a list, plus
extracting/inserting is easier - the speedup of lookups in a
str_hash is not needed. Remember that ".attributes" is kept
in alphabetic order by name - you should only use the
str_hash-derived methods from nodeattr.h to handle
this type.
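
To illustrate why a name-sorted list is enough for a handful of
attributes, here is a sketch of set/get on such a list. It is
not the nodeattr.h code: it uses a hand-rolled singly linked
list instead of GList, it assumes "alphabetic" means strcmp
order, and "SketchAttr", "sketch_attr_set" and "sketch_attr_get"
are names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute cell; the name is the first member. */
typedef struct SketchAttr SketchAttr;
struct SketchAttr {
    const char *name;     /* borrowed, not copied */
    const char *value;    /* borrowed, not copied */
    SketchAttr *next;
};

/* Insert or overwrite while keeping the list sorted by name.
   Returns the (possibly new) list head. */
static SketchAttr *sketch_attr_set(SketchAttr *head,
                                   const char *name, const char *value)
{
    assert(name != NULL);
    SketchAttr **link = &head;
    while (*link != NULL && strcmp((*link)->name, name) < 0)
        link = &(*link)->next;
    if (*link != NULL && strcmp((*link)->name, name) == 0) {
        (*link)->value = value;        /* existing name: overwrite */
        return head;
    }
    SketchAttr *cell = malloc(sizeof *cell);
    if (cell == NULL)
        return head;                   /* list left unchanged */
    cell->name = name;
    cell->value = value;
    cell->next = *link;
    *link = cell;
    return head;
}

/* Linear lookup that stops early thanks to the sort order. */
static const char *sketch_attr_get(const SketchAttr *head,
                                   const char *name)
{
    assert(name != NULL);
    for (; head != NULL; head = head->next) {
        int cmp = strcmp(head->name, name);
        if (cmp == 0)
            return head->value;
        if (cmp > 0)
            break;                     /* name cannot come later */
    }
    return NULL;
}
```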

And btw, the <struct<str,str>> is internal. You may cast and
dereference the first member, which is the name of the attribute,
but the others must not be accessed directly
since there is the possibility of turning the attrib-value
into a complex thing that holds not only a string-value but
also some references to other object-types. My favourite would
be a node-reference or secondary textarray-reference, however
that is not yet implemented.

 searching the textarray

The text array model allows referencing large portions of
the text content of an xml file. One could get the node for
a "//filetext" and search the whole filetext at once even
when the //filetext node has subnodes. It is therefore easy
to slurp non-xml sources as a single textblock and then
successively examine them, remembering information in XML
markups.

This style of remembering information is fundamentally
different from a perl-style approach where data is
generally memorized in dedicated internal hashes/lists
(which requires you to define specialized dump routines
to get these into an output format), or by modifying
the text-array itself (which somewhat breaks the current
run of a search-operation).

In xmlg's xmlpcre part it is quite common to get a text
array portion and subject it to a PCRE pattern. The
match will fill an ovector with the offsets that did
match. See the implementation of xml_pcre_match_add9
in pcrenext.c for a visual example.

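 const char* errmsg; int erridx; int ovector[33];
 pcre* regex = pcre_compile (txtRE, 0, &errmsg, &erridx, 0);

 if (regex) /* on failure errmsg/erridx describe the pattern error */
 {
     /* the subject ends at node->end, the search starts at node->off */
     if (0< pcre_exec (regex, 0, node->text->str,
                       node->end, node->off, 0, ovector, 33 ))
     {
         /* ovector now holds start/end offset pairs per group */
         xml_tree_add9 (tree, ovector, names);
     }
     pcre_free (regex);
 }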

The xml_tree_add9 is a convenience built on top of
its basic cousin xml_tree_add2 (tree, offs1, offs2, name)
which will look up the deepest node containing both offsets
offs1 and offs2 and add a new node in there with
this text-span and the given element name. The new node
is returned so that you can further modify it, probably
by injecting attributes. Note however that an add2 can
fail if the text-span would break the well-formedness
of an xml document.
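
The described behaviour of add2 can be approximated in a few
lines. The following is a sketch on a stand-in struct, not the
code from the library: "sketch_deepest" and "sketch_add2" are
hypothetical names, the real xml_tree_add2 signature differs,
and sibling spans are assumed to be ordered and non-overlapping.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct SketchNode SketchNode;
struct SketchNode {
    const char *name;
    size_t off, end;
    SketchNode *parent, *next, *children;
};

/* Deepest node whose span contains both a and b, or NULL. */
static SketchNode *sketch_deepest(SketchNode *n, size_t a, size_t b)
{
    if (n == NULL || a < n->off || b > n->end)
        return NULL;
    for (SketchNode *c = n->children; c != NULL; ) {
        if (c->off <= a && b <= c->end) { n = c; c = c->children; }
        else c = c->next;
    }
    return n;
}

/* Adds a node spanning [a, b) below the deepest containing node.
   Returns NULL when the span would cut through an existing child. */
static SketchNode *sketch_add2(SketchNode *root, size_t a, size_t b,
                               const char *name)
{
    assert(a <= b);
    SketchNode *host = sketch_deepest(root, a, b);
    if (host == NULL)
        return NULL;
    for (SketchNode *c = host->children; c != NULL; c = c->next) {
        int overlaps = c->off < b && a < c->end;
        int enclosed = a <= c->off && c->end <= b;
        if (overlaps && !enclosed)
            return NULL;               /* not well-formed */
    }
    SketchNode *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;
    node->name = name; node->off = a; node->end = b;
    node->parent = host;
    /* move the enclosed children below the new node, keeping order */
    SketchNode **from = &host->children, **to = &node->children;
    while (*from != NULL) {
        SketchNode *c = *from;
        if (a <= c->off && c->end <= b) {
            *from = c->next;
            c->next = NULL;
            c->parent = node;
            *to = c;
            to = &c->next;
        } else {
            from = &c->next;
        }
    }
    /* link the new node in after all children that end before it */
    SketchNode **link = &host->children;
    while (*link != NULL && (*link)->end <= a)
        link = &(*link)->next;
    node->next = *link;
    *link = node;
    return node;
}
```

The pair ovector[0], ovector[1] of a successful PCRE match would
be a natural candidate for the two offsets.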

 grouping nodes

The add2 function may be given a text-span that already
includes markups - and that's the reason that it can
fail: when only one of offs1 and offs2 lies inside such a
sub-node, the new node would break the well-formedness of an
xml tree, which is strictly hierarchic. (Theoretically
a shadow XM-tree can be built atop the same TextArray but
that has not been thoroughly explored yet - it is possible
even now but not natively supported in the library. Those
shadow trees are independent hierarchies, and two XM-trees
do not necessarily need to be mergeable into well-formed xml).

Apart from grouping by offs-positions in the TextArray
it is often necessary to group nodes directly - by simply
giving the left-most and right-most node and expecting
that the new node will take its left "off" from the
leftmost node and its right "end" from the rightmost
node. Visually, you want to turn a text like

  <head>..</head> <val>..</val> <last>..</last>
into
  <group><head>..</head> <val>..</val> <last>..</last></group>

Of course you may simulate the behaviour by simply
taking head.off and last.end and calling add2 with
them, but that operation is somewhat slower and it
does not visually reflect the programmer's intention.
Instead the same nodeadds.c file contains a few helpers:

xml_node_group_outer (left, right, node)
xml_node_group_inner (left, right, node)
xml_node_group_cut (node) 
xml_node_group_outer_new (left, right, name)
xml_node_group_inner_new (left, right, name)
xml_node_group_outer_alias (node, name)
xml_node_group_inner_alias (node, name)
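
What such a grouping helper has to do can be sketched as follows.
This is not the nodeadds.c code and it does not model the
outer/inner distinction: "sketch_group_new" is a hypothetical
function on a stand-in struct that wraps a run of siblings from
"left" to "right" into a new parent.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct SketchNode SketchNode;
struct SketchNode {
    const char *name;
    size_t off, end;
    SketchNode *parent, *next, *children;
};

/* Wraps the siblings left..right into a new node that takes
   "off" from left and "end" from right. Returns NULL when the
   two nodes are not siblings in left-to-right order. */
static SketchNode *sketch_group_new(SketchNode *left, SketchNode *right,
                                    const char *name)
{
    assert(left != NULL && right != NULL);
    if (left->parent == NULL || left->parent != right->parent)
        return NULL;
    SketchNode *c = left;
    while (c != NULL && c != right)   /* right must follow left */
        c = c->next;
    if (c == NULL)
        return NULL;
    /* find the link that currently points at left */
    SketchNode **link = &left->parent->children;
    while (*link != NULL && *link != left)
        link = &(*link)->next;
    if (*link == NULL)
        return NULL;                  /* broken sibling chain */
    SketchNode *group = calloc(1, sizeof *group);
    if (group == NULL)
        return NULL;
    group->name = name;
    group->off = left->off;
    group->end = right->end;
    group->parent = left->parent;
    *link = group;                    /* group replaces the run */
    group->next = right->next;
    right->next = NULL;
    group->children = left;
    for (c = left; c != NULL; c = c->next)
        c->parent = group;
    return group;
}
```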

The last routines document the difference between
outer and inner - a group may degenerate to contain
only a single sub-node which references the
exact same textarray portion. In this case it is
possible to exchange the two markups within the
hierarchy without breaking well-formedness, that is
the following two are "exchange-identical":

  <A><Z>hello</Z></A>
  <Z><A>hello</A></Z>

However, a structural XML text (possibly with a
DTD) expects markups to only nest in a specific
manner - so "symbol" must be lower than "ident".
Note also that the "add2" routine has a cousin
called "adds2". The "add"-routines will hang
the new node into the lowest-possible node that
contains both pos-offsets, while "adds" will
check whether that node is "exchange-identical". If it
is then the new node will be higher than all
its "exchange-identical" sub-nodes.
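
One way to read the "adds" rule is: starting at the deepest
node, climb while the parent chain still has exactly the same
span, and place the new node above the topmost such node. The
sketch below only shows that climb. It is my interpretation on
a stand-in struct, and "sketch_topmost_identical" is a made-up
name, not a function of the library.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchNode SketchNode;
struct SketchNode {
    const char *name;
    size_t off, end;
    SketchNode *parent;
};

/* Starting at the deepest node, returns the topmost ancestor
   (or the node itself) whose span is exactly [a, b). Returns
   NULL when the deepest node does not have that span at all. */
static SketchNode *sketch_topmost_identical(SketchNode *deepest,
                                            size_t a, size_t b)
{
    assert(a <= b);
    SketchNode *top = NULL;
    for (SketchNode *n = deepest;
         n != NULL && n->off == a && n->end == b;
         n = n->parent)
        top = n;
    return top;
}
```

For <Z><A>hello</A></Z> a new span covering "hello" would then
be placed above Z, while the plain "add" variant would place it
inside A.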

 academic note

The presented nodemodel is sometimes known as
"proximal nodes". It separates the data from the
metadata, and each metadata node refers to a
span of corresponding text data. From my own
research, the simple form (without extra use
of separators) must ensure that the text
content can in fact be stripped from the
metadata enclosed in an XML file.