|
unfinished documents:
pfe sources pfe manpages pfe docbook |
CHECKING
Here we explain how things are matched and how it is implemented.
There are quite a few places in the library that are heavily optimized
for exact matches to speedup the execution for the usual cases.
substring .vs. complete
All the element-matches recognize a leading star ('*') in the
matchstring expression. The default case happens to be a complete
match other than people would expect from grep-style matchstring
expressions. Therefore a pattern-expression of "name" will only
match the single string "name" and not a string like "xnames".
The leading star however will turn on substring-matches and so
a match-pattern "*name" will match both "name" and "xnames" as
strings. Note that in xml-processing many match-patterns will
ask for alphanumeric names only - here the "name" and "*name"
will actually work the same for both stdlib and pcre patterns
and match the same set of nodes (or single one for non-star mode).
RE vs. XP
The "RE" stands for "Regular Expression" and "XE" is shorthand
for "XPath Element" where the latter uses strcmp/strstr style
routines to do the actual pattern matching. For many uses cases
this is all sufficient and happens to be a lot faster than
compiling a RE and applying it on a string buffer. A cpu like
i386 even has an ISA code and cpu-native microcode to make them
as fast as possible.
absolute XP - a.k.a. stdlib function must match complete pattern
An absolute XP uses strcmp() when two zero-terminated strings are
being used. If one side is a str/len pair then it is faster to
use memcmp() and check make a (!str[len]) on the zero-terminated
side - that's possible since we assume the text array does not
contain embedded zero-chars and the memcmp would break earlier.
Two str/len pairs can not match for differing lengths anyway and
the case degrades to a memcmp() later.
- (z-string A , z-string B) return (!strcmp (A, B))
- (z-string A , str/len B) return (!memcmp (A, str, len) && !A[len])
- (ptr/cnt A , str/len B) return (cnt == len && !memcmp (ptr,str))
relative XP - a.k.a. stdlib functions match at partial substring
A relative XP uses strstr() when two zer-terminated strings are
being used. For the three other cases exist helper functions - one
is g_strstr_len taken from gstrfunc.h - each of them is implemented
as a while-loop around a memcmp() but it shouldn't be inlined like
we could try with strstr() which is not a cpu-native opcode either.
The relative processing is therefore slower - it scales linear with
the maximum length of strings being searched for a pattern. However
a true RE in the placed could do anything different but then again
handle a state buffer and RE engine interpreter as an additional
overhead which makes it improbable the the search code can be held
completely pre-decoded in the instruction cache of modern cpus as
it can be done for the small while-loop around a memcmp().
XP spec - a.k.a. choose absolute/relative depending on first '*'
For the case of two zero-terminated strings it is a simple check
that can be inlined - the "xml_g_node_match_" is an example of a
preprocessor routine for that.
A is target, B is pattern:
- (z-string A , z-string B)
return (*B != '*' ? 0 == strcmp (A, B) : 0 != strstr (A, B+1))
- (ptr/cnt A, z-string B)
return (*B != '*' ? (!memcmp (B, ptr, cnt) && !B[cnt])
: g_strstr_len (ptr, cnt, B+1))
- (ptr/cnt A , str/len B)
return (*str != '*' ? len==cnt && !memcmp (ptr, str, len)
: xml_g_strstr (ptr, cnt, str+1, len-1))
- (z-string A , str/len B)
huh? decode as A/strlen(A) to former case.
relative RE - a.k.a. pcre based grep for matches within subject.
The relative RE is the default for pcre expressions. Just call
pcre_compile (RE, 0, &errmsg, &erridx, 0) - note that the RE must
be a zero-terminated thing. If the RE-string is from a str/len
buffer (e.g. a part of an RE xpath) then it must be implicitly
converted into a zero-terminated string - in the libxmlpcre that
is done with g_strndup followed by the pcre_compile followed by
the g_free on the intermediate z-terminated regex representation.
absolute RE - a.k.a. pcre match on complete target subject
In traditional perl-speak that would be done with "\Apattern\Z"
but the pcre-lib has a special pcre_compile option PCRE_ANCHORED
to let a pattern only match at the start of a target string. That
makes the pcre_exec call considerably faster actually. Then check
the returned match for being at the end.
___ char* errmsg; int erridx; ovector[33];
pcre compiled = pcre_compile (RE, PCRE_ANCHORED, &errmsg, erridx, 0);
if (0 < pcre_exec (compiled, 0, ptr, cnt,
0, PCRE_NOTEMPTY, ovector, 33)) && ovector[1] == cnt
{ pcre_free (compiled); return TRUE; } else
{ pcre_free (compiled); return FALSE; }
You need to turn a zero-terminated target buffer int a ptr/cnt pair
using A/strlen(A) - again both for absolute/relative RE.
XP spec - a.k.a. choose absolute/relative depending on first '*'
The two RE cases can again be combined - a star up front of a pcre
regex is not valid anyway, so that as soon as we see it we hand the
regex+1 string over to the relative RE match. Since the RE matching
is a two-phase process there are a number of ways to combine the
difference at pcre_compile with that of pcre_exec - one could for
instance recheck (*regex=='*') later or one could memorize the
pcre_compile flag being non-empty and check for it later (i.e. the
zero .vs. PCRE_ANCHORED).
Oh, note that some "*"-relative regex could match an empty part of
the target and usually you do not want that which is why one would
usually prce_exec with PCRE_NOTEMPTY - using it for the absolute
variant is purely optional but it does not hurt either.
|