the XM nodes struct

The XML nodes in XML/TA are a subclass of glib's GNodes. Their basic header file is therefore called "xml/gnode.h", and the methods start with "xml_g_node_", reflecting the original naming scheme of "g_node_". In fact, most of them are just inline wrappers pointing at the glib originals. As a GNode, a node has four basic pointers: ".next", ".prev", ".parent" and ".children". The glib functions do not quite like a root node with siblings, but ".prev || !.parent || .parent.children == .this" should always hold as an assertion, i.e. ".children" points to the first node in the sibling chain.

The first member of "GNode" is a "gpointer data", which has been renamed to "gchar* name". It holds the XML element name and should be g_free'd. After the basic GNode layout, four extra members exist. That is why you should not use g_node_new / g_node_free on xml_GNodes: the result will not be correct given glib's internal node pool scheme.

The three members "GString .text", "gsize off" and "gsize end" form the basis of the TA model. The TA (TextArray) is actually a GString, and the two offsets simply mark the start and end of the text associated with this node. To duplicate a node's text content you would use g_strndup (with the length parameter!) as follows:

    g_strndup (node->text->str + node->off, node->end - node->off);

Using two gsize members makes navigation inside the tree easy: for any TA position, the containing node can be found by simply asking for off <= pos < end. Note that a PCRE expression will generally return its spans as 2-tuples of offsets into the string it was given. Those can be fed directly to the navigation/insertion routines of the xmlg lib.

Note that a node might not have a "name", in which case it should be ignored entirely (invisible nodes). When "text" has been set to null, the node has no position in the XML tree either, not even a starting position.
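The span model described above can be sketched in plain C. This is only an illustration under stated assumptions: "ta_node", "ta_node_dup_text" and "ta_node_find" are hypothetical names, the struct is a simplified stand-in rather than the real layout from xml/gnode.h, and plain C types replace GNode, GString and gsize so that the example compiles without glib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the node layout described above.
 * It is NOT the real struct from xml/gnode.h: plain C types replace
 * GNode, GString and gsize, and ".attributes" is left out. */
typedef struct ta_node ta_node;
struct ta_node {
    const char *name;      /* element name; NULL would mark an invisible node */
    ta_node    *next, *prev, *parent, *children;
    const char *text;      /* shared text array; NULL means "no position" */
    size_t      off, end;  /* half-open span [off, end) into text */
};

/* Copy the node's text span into a fresh NUL-terminated string, which the
 * caller must free. Returns NULL when the node has no position. */
static char *ta_node_dup_text(const ta_node *node)
{
    if (node == NULL || node->text == NULL)
        return NULL;
    assert(node->off <= node->end);
    size_t len = node->end - node->off;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, node->text + node->off, len);
    copy[len] = '\0';
    return copy;
}

/* Find the deepest node whose span contains pos (off <= pos < end).
 * Returns NULL when start itself does not contain pos. Nameless nodes
 * are not filtered out in this sketch. */
static const ta_node *ta_node_find(const ta_node *start, size_t pos)
{
    if (start == NULL || start->text == NULL
        || pos < start->off || pos >= start->end)
        return NULL;
    const ta_node *hit = start;
    for (;;) {
        const ta_node *child = hit->children;
        while (child != NULL
               && !(child->text != NULL
                    && child->off <= pos && pos < child->end))
            child = child->next;
        if (child == NULL)
            return hit;   /* no child contains pos: hit is the deepest */
        hit = child;
    }
}
```

The lookup walks down one sibling chain per level, so its cost depends on the depth of the tree and the number of siblings at each level, not on the length of the text.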
Most subroutines should be ready to check for these "problems" (nameless nodes and nodes without text). The basic loader routines will always inject the reference offsets into the text array.

The only other member is ".attributes". This used to be of type "GHashTable<str,str>" but has since been moved into its own header type. It still exports method functions that look very much like a str_hash, but internally it has been turned into a GList<struct<str,str>>. It turned out that most nodes have very few attributes, if any; an attributes collection is much easier to walk when it is a list, and extracting/inserting is easier too. The lookup speedup of a str_hash is not needed. Remember that ".attributes" is kept in alphabetic order by name, so you should only use the str_hash-derived methods from nodeattr.h to handle this type.

By the way, the struct<str,str> is internal. You may cast and dereference the first member, which is the name of the attribute, but direct access to the other members should be treated as illegal: the attribute value may later turn into a complex thing that holds not only a string value but also references to other object types. My favourite would be a node reference or a secondary textarray reference; however, that is not yet implemented.

searching the textarray

The text array model allows you to reference large portions of the text content of an xml file. One could get the node for a "//filetext" and search the whole file text at once, even when the //filetext node has subnodes. It is therefore easy to slurp non-xml sources as a single text block and then examine them successively, recording information in XML markups.
This style of remembering information is fundamentally different from a perl-style approach, where data is generally memorized in dedicated internal hashes/lists (which requires you to define specialized dump routines to get them into an output format), or by modifying the text array itself (which somewhat breaks the current run of a search operation).

In xmlg's xmlpcre part it is quite common to take a text array portion and subject it to a PCRE pattern. The match fills an ovector with the offsets that matched. See the implementation of xml_pcre_match_add9 in pcrenext.c for a worked example.

    const char* errmsg; int erridx; int ovector[33];
    pcre* regex = pcre_compile (txtRE, 0, &errmsg, &erridx, 0);
    /* subject length is node->end, start offset is node->off */
    if (regex && 0 < pcre_exec (regex, 0, node->text->str, node->end,
                                node->off, 0, ovector, 33))
    {
        xml_tree_add9 (tree, ovector, names);
    }
    if (regex) pcre_free (regex);

The xml_tree_add9 is a convenience wrapper on top of its basic cousin xml_tree_add2 (tree, offs1, offs2, name), which looks up the deepest node containing both offsets offs1 and offs2 and adds a new node there with this text span and the given element name. The new node is returned so that you can modify it further, typically by injecting attributes. Note, however, that an add2 can fail if the text span does not fit the well-formedness of an xml document.

grouping nodes

The add2 function may be given a text span that already includes markups. That is the reason it can fail: when only one of offs1 and offs2 lies inside such a sub-node, adding the node would break the well-formedness of an xml tree, which is strictly hierarchic. (Theoretically a shadow XM-tree can be built atop the same TextArray, but that has not been thoroughly explored yet. It is possible even now, but not natively supported in the library. Those shadow trees are independent hierarchies, and two XM-trees do not necessarily need to be mergeable into well-formed xml.)
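The behaviour described for add2 (find the deepest node containing both offsets, refuse spans that cut through an existing sub-node, otherwise wrap the covered sub-nodes) can be sketched as follows. This is not the library's implementation: "span_node", "span_inside" and "span_tree_add2" are hypothetical names, the node is reduced to a name and two offsets, and the sketch assumes that siblings are sorted by offset and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical simplified node: a name plus a half-open span [off, end). */
typedef struct span_node span_node;
struct span_node {
    const char *name;
    size_t      off, end;
    span_node  *parent, *children, *next;  /* children sorted by offset */
};

/* True when child c lies completely within the span a..b. */
static int span_inside(const span_node *c, size_t a, size_t b)
{
    return a <= c->off && c->end <= b;
}

/* Add a node spanning offs1..offs2 below the deepest node containing both
 * offsets. Returns the new node, or NULL when the span is out of range,
 * cuts through an existing child, or memory runs out. */
static span_node *span_tree_add2(span_node *root, size_t offs1, size_t offs2,
                                 const char *name)
{
    assert(root != NULL && name != NULL);
    if (offs1 > offs2 || offs1 < root->off || offs2 > root->end)
        return NULL;

    /* 1. descend to the deepest node that contains both offsets */
    span_node *host = root;
    for (;;) {
        span_node *c = host->children;
        while (c != NULL && !(c->off <= offs1 && offs2 <= c->end))
            c = c->next;
        if (c == NULL)
            break;
        host = c;
    }

    /* 2. refuse spans that would break well-formedness */
    for (span_node *c = host->children; c != NULL; c = c->next) {
        int outside = c->end <= offs1 || c->off >= offs2;
        if (!span_inside(c, offs1, offs2) && !outside)
            return NULL;
    }

    span_node *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;
    node->name = name;
    node->off = offs1;
    node->end = offs2;
    node->parent = host;

    /* 3. skip the children left of the span, then adopt the covered ones */
    span_node **link = &host->children;
    while (*link != NULL && !span_inside(*link, offs1, offs2)
           && (*link)->end <= offs1)
        link = &(*link)->next;
    span_node **tail = &node->children;
    span_node *c = *link;
    while (c != NULL && span_inside(c, offs1, offs2)) {
        span_node *after = c->next;
        c->parent = node;
        c->next = NULL;
        *tail = c;
        tail = &c->next;
        c = after;
    }
    node->next = c;   /* children right of the span stay with the host */
    *link = node;
    return node;
}
```

All checks run before the tree is modified, so a failed call leaves the tree untouched.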
Apart from grouping by offset positions in the TextArray, it is often necessary to group nodes directly: you simply give the leftmost and rightmost node and expect the new node to take its left "off" from the leftmost node and its right "end" from the rightmost node. Visually, you want to turn a text like

    <head>..</head> <val>..</val> <last>..</last>

into

    <group><head>..</head> <val>..</val> <last>..</last></group>

Of course you can simulate this behaviour by taking head.off and last.end and calling add2 with them, but that operation is somewhat slower and it does not visibly reflect the programmer's intention. Instead, the same nodeadds.c file contains a few helpers:

    xml_node_group_outer (left, right, node)
    xml_node_group_inner (left, right, node)
    xml_node_group_cut (node)
    xml_node_group_outer_new (left, right, name)
    xml_node_group_inner_new (left, right, name)
    xml_node_group_outer_alias (node, name)
    xml_node_group_inner_alias (node, name)

The last routines document the difference between outer and inner: a group may degrade to contain only a single sub-node that references the exact same textarray portion. In this case it is possible to exchange the two markups within the hierarchy without breaking well-formedness, that is, the following two are "exchange-identical":

    <A><Z>hello</Z></A>
    <Z><A>hello</A></Z>

However, a structural XML text (possibly with a DTD) expects markups to nest only in a specific manner, so for example "symbol" must be lower than "ident".

Note also that the "add2" routine has a cousin called "adds2". The "add" routines hang the new node into the lowest possible node that contains both offsets, while the "adds" routines check whether that node is "exchange-identical". If it is, the new node will sit higher than all its "exchange-identical" sub-nodes.

academic note

The presented node model is sometimes known as "proximal nodes". It separates the data from the metadata, and each metadata node refers to a span of the corresponding text data.
From my own research, the simple form (without extra use of separators) must ensure that the text content can in fact be stripped from the metadata enclosed in an XML file.
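The separation of text data from metadata can be illustrated with a small sketch that strips the tags from a tiny XML subset and records one text span per element. It is not the xmlg loader: "ta_span" and "ta_strip" are hypothetical names, and the input is assumed to contain only plain <name> and </name> tags, with no attributes, entities, comments or empty-element tags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TA_MAX_DEPTH 32
#define TA_MAX_NAME  16

/* Hypothetical metadata record: element name plus its span in the text. */
typedef struct {
    char   name[TA_MAX_NAME];
    size_t off, end;
} ta_span;

/* Copy the character data of xml into text_out (which needs room for
 * strlen(xml) + 1 bytes) and record one span per element, ordered by
 * opening tag. Returns the number of spans, or -1 on malformed input
 * or when a limit is exceeded. */
static int ta_strip(const char *xml, char *text_out,
                    ta_span *spans, int max_spans)
{
    int open[TA_MAX_DEPTH];   /* stack of indices into spans[] */
    int depth = 0, count = 0;
    size_t len = 0;

    while (*xml != '\0') {
        if (*xml != '<') {    /* plain text goes into the text array */
            text_out[len++] = *xml++;
            continue;
        }
        const char *close = strchr(xml, '>');
        if (close == NULL)
            return -1;
        if (xml[1] == '/') {  /* closing tag: must match the open element */
            size_t n = (size_t)(close - (xml + 2));
            if (depth == 0)
                return -1;
            ta_span *s = &spans[open[--depth]];
            if (strlen(s->name) != n || strncmp(s->name, xml + 2, n) != 0)
                return -1;
            s->end = len;
        } else {              /* opening tag: start a new span here */
            size_t n = (size_t)(close - (xml + 1));
            if (n == 0 || n >= TA_MAX_NAME
                || count >= max_spans || depth >= TA_MAX_DEPTH)
                return -1;
            ta_span *s = &spans[count];
            memcpy(s->name, xml + 1, n);
            s->name[n] = '\0';
            s->off = s->end = len;
            open[depth++] = count++;
        }
        xml = close + 1;
    }
    text_out[len] = '\0';
    return depth == 0 ? count : -1;
}
```

For the input <A>he<Z>ll</Z>o</A> the stripped text is "hello", with A covering offsets 0 to 5 and Z covering offsets 2 to 4.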