XML/TA library for glib2 and pcre

SourceForge!
xmlg 0.7.18


 
Summary
Download

 


 
README
ChangeLog
old tarballs

 


 
xmlg sources
bin examples

 


unfinished documents:

 
NODEMODEL
XPATHDEF
CHECKING
HYPERCSS
TEXTMINING


pfe sources
pfe manpages
pfe docbook


 CHECKING 

Here we explain how things are matched and how it is implemented. 
There are quite a few places in the library that are heavily optimized
for exact matches to speedup the execution for the usual cases.

 substring .vs. complete

All the element-matches recognize a leading star ('*') in the
matchstring expression. The default case happens to be a complete
match other than people would expect from grep-style matchstring
expressions. Therefore a pattern-expression of "name" will only
match the single string "name" and not a string like "xnames".

The leading star however will turn on substring-matches and so
a match-pattern "*name" will match both "name" and "xnames" as
strings. Note that in xml-processing many match-patterns will
ask for alphanumeric names only - here the "name" and "*name"
will actually work the same for both stdlib and pcre patterns
and match the same set of nodes (or single one for non-star mode).

 RE vs. XP

The "RE" stands for "Regular Expression" and "XE" is shorthand
for "XPath Element" where the latter uses strcmp/strstr style
routines to do the actual pattern matching. For many uses cases
this is all sufficient and happens to be a lot faster than
compiling a RE and applying it on a string buffer. A cpu like
i386 even has an ISA code and cpu-native microcode to make them 
as fast as possible.

 absolute XP - a.k.a. stdlib function must match complete pattern

An absolute XP uses strcmp() when two zero-terminated strings are
being used. If one side is a str/len pair then it is faster to
use memcmp() and check make a (!str[len]) on the zero-terminated
side - that's possible since we assume the text array does not
contain embedded zero-chars and the memcmp would break earlier.
Two str/len pairs can not match for differing lengths anyway and
the case degrades to a memcmp() later.

- (z-string A , z-string B) return (!strcmp (A, B))
- (z-string A , str/len  B) return (!memcmp (A, str, len) && !A[len])
- (ptr/cnt  A , str/len  B) return (cnt == len && !memcmp (ptr,str))

 relative XP - a.k.a. stdlib functions match at partial substring

A relative XP uses strstr() when two zer-terminated strings are
being used. For the three other cases exist helper functions - one
is g_strstr_len taken from gstrfunc.h - each of them is implemented
as a while-loop around a memcmp() but it shouldn't be inlined like
we could try with strstr() which is not a cpu-native opcode either.

The relative processing is therefore slower - it scales linear with
the maximum length of strings being searched for a pattern. However
a true RE in the placed could do anything different but then again
handle a state buffer and RE engine interpreter as an additional
overhead which makes it improbable the the search code can be held
completely pre-decoded in the instruction cache of modern cpus as
it can be done for the small while-loop around a memcmp().

 XP spec - a.k.a. choose absolute/relative depending on first '*'

For the case of two zero-terminated strings it is a simple check
that can be inlined - the "xml_g_node_match_" is an example of a
preprocessor routine for that.

A is target, B is pattern:

- (z-string A , z-string B) 
    return (*B != '*' ? 0 == strcmp (A, B) : 0 != strstr (A, B+1))

- (ptr/cnt  A,  z-string B)
    return (*B != '*' ? (!memcmp (B, ptr, cnt) && !B[cnt]) 
                      : g_strstr_len (ptr, cnt, B+1))

- (ptr/cnt  A , str/len  B)
    return  (*str != '*' ? len==cnt && !memcmp (ptr, str, len)
                         : xml_g_strstr (ptr, cnt, str+1, len-1))

- (z-string  A , str/len  B)
   huh? decode as A/strlen(A) to former case.

 relative RE - a.k.a. pcre based grep for matches within subject.

The relative RE is the default for pcre expressions. Just call
pcre_compile (RE, 0, &errmsg, &erridx, 0) - note that the RE must
be a zero-terminated thing. If the RE-string is from a str/len
buffer (e.g. a part of an RE xpath) then it must be implicitly
converted into a zero-terminated string - in the libxmlpcre that
is done with g_strndup followed by the pcre_compile followed by
the g_free on the intermediate z-terminated regex representation.

 absolute RE - a.k.a. pcre match on complete target subject

In traditional perl-speak that would be done with "\Apattern\Z"
but the pcre-lib has a special pcre_compile option PCRE_ANCHORED
to let a pattern only match at the start of a target string. That
makes the pcre_exec call considerably faster actually. Then check
the returned match for being at the end.

___ char* errmsg; int erridx; ovector[33];
pcre compiled = pcre_compile (RE, PCRE_ANCHORED, &errmsg, erridx, 0);
if (0 < pcre_exec (compiled, 0, ptr, cnt,
                   0, PCRE_NOTEMPTY, ovector, 33)) && ovector[1] == cnt
{ pcre_free (compiled); return TRUE; } else
{ pcre_free (compiled); return FALSE; }

You need to turn a zero-terminated target buffer int a ptr/cnt pair
using A/strlen(A) - again both for absolute/relative RE.

 XP spec - a.k.a. choose absolute/relative depending on first '*'

The two RE cases can again be combined - a star up front of a pcre
regex is not valid anyway, so that as soon as we see it we hand the
regex+1 string over to the relative RE match. Since the RE matching
is a two-phase process there are a number of ways to combine the
difference at pcre_compile with that of pcre_exec - one could for
instance recheck (*regex=='*') later or one could memorize the
pcre_compile flag being non-empty and check for it later (i.e. the
zero .vs. PCRE_ANCHORED).

Oh, note that some "*"-relative regex could match an empty part of
the target and usually you do not want that which is why one would
usually prce_exec with PCRE_NOTEMPTY - using it for the absolute
variant is purely optional but it does not hurt either.