CHECKING

Here we explain how things are matched and how the matching is implemented. Quite a few places in the library are heavily optimized for exact matches, to speed up execution in the usual cases.

substring vs. complete

All the element-matches recognize a leading star ('*') in the matchstring expression. The default case is a complete match, unlike what people would expect from grep-style matchstring expressions. A pattern-expression of "name" will therefore match only the single string "name" and not a string like "xnames". A leading star turns on substring-matches, so the match-pattern "*name" will match both "name" and "xnames".

Note that in xml-processing many match-patterns ask for alphanumeric names only - for those, "name" and "*name" each behave the same whether they are handled as stdlib or as pcre patterns, and they match the same set of nodes (or the single one in non-star mode).

RE vs. XP

The "RE" stands for "Regular Expression" and "XP" is shorthand for "XPath Element", where the latter uses strcmp/strstr style routines to do the actual pattern matching. For many use cases this is entirely sufficient, and it is a lot faster than compiling an RE and applying it to a string buffer. A cpu like the i386 even has native string instructions, implemented in microcode, to make such routines as fast as possible.

absolute XP - a.k.a. stdlib function must match complete pattern

An absolute XP uses strcmp() when both sides are zero-terminated strings. If one side is a str/len pair then it is faster to use memcmp() and to check (!A[len]) on the zero-terminated side - that is possible since we assume the text array does not contain embedded zero-chars, so the memcmp would break off earlier. (Strictly speaking the C standard does not promise that memcmp() stops reading at the first difference, so strncmp() is the conservative choice when A may be shorter than len.) Two str/len pairs can not match for differing lengths anyway, so that case reduces to a length check followed by a memcmp().
- (z-string A , z-string B) return (!strcmp (A, B))
- (z-string A , str/len B) return (!memcmp (A, str, len) && !A[len])
- (ptr/cnt A , str/len B) return (cnt == len && !memcmp (ptr, str, len))

relative XP - a.k.a. stdlib functions match at partial substring

A relative XP uses strstr() when both sides are zero-terminated strings. For the three other cases there are helper functions - one is g_strstr_len taken from gstrfunc.h - each of which is implemented as a while-loop around a memcmp(). Such a helper should not be inlined the way one might try with strstr(), which is not a cpu-native opcode either. The relative processing is therefore slower - it scales linearly with the maximum length of the strings being searched for a pattern. However, a true RE in its place could hardly do anything different, and it would additionally have to handle a state buffer and the RE engine interpreter. That overhead makes it improbable that the search code can be held completely pre-decoded in the instruction cache of modern cpus, as is possible for the small while-loop around a memcmp().

XP spec - a.k.a. choose absolute/relative depending on first '*'

For the case of two zero-terminated strings it is a simple check that can be inlined - the "xml_g_node_match_" is an example of a preprocessor routine for that. A is target, B is pattern:

- (z-string A , z-string B) return (*B != '*' ? 0 == strcmp (A, B) : 0 != strstr (A, B+1))
- (ptr/cnt A , z-string B) return (*B != '*' ? (!memcmp (B, ptr, cnt) && !B[cnt]) : 0 != g_strstr_len (ptr, cnt, B+1))
- (ptr/cnt A , str/len B) return (*str != '*' ? (len == cnt && !memcmp (ptr, str, len)) : 0 != xml_g_strstr (ptr, cnt, str+1, len-1))
- (z-string A , str/len B) has no special routine: decode A as the pair A/strlen(A) and use the former case.

relative RE - a.k.a. pcre based grep for matches within subject

The relative RE is the default for pcre expressions. Just call pcre_compile (RE, 0, &errmsg, &erridx, 0) - note that the RE must be a zero-terminated string. If the RE-string is from a str/len buffer (e.g.
a part of an RE xpath) then it must be converted into a zero-terminated string first - in libxmlpcre that is done with g_strndup, followed by the pcre_compile, followed by a g_free on the intermediate zero-terminated regex representation.

absolute RE - a.k.a. pcre match on complete target subject

In traditional perl-speak that would be done with "\Apattern\Z", but the pcre-lib has a special pcre_compile option PCRE_ANCHORED that lets a pattern match only at the start of a target string. That actually makes the pcre_exec call considerably faster. Then check that the returned match ends at the end of the target.

    const char* errmsg; int erridx; int ovector[33];
    pcre* compiled = pcre_compile (RE, PCRE_ANCHORED, &errmsg, &erridx, 0);
    if (! compiled) return FALSE; /* errmsg and erridx describe the failure */
    /* the start is anchored by the compile flag, ovector[1] is the end offset
       of the match; a return of 0 only means ovector was too small to hold
       all groups, the first pair is still set */
    if (0 <= pcre_exec (compiled, 0, ptr, cnt, 0, PCRE_NOTEMPTY, ovector, 33)
        && ovector[1] == cnt)
    { pcre_free (compiled); return TRUE; }
    else
    { pcre_free (compiled); return FALSE; }

You need to turn a zero-terminated target buffer into a ptr/cnt pair using A/strlen(A) - again for both the absolute and the relative RE.

RE spec - a.k.a. choose absolute/relative depending on first '*'

The two RE cases can again be combined - a star at the front of a pcre regex is not valid anyway, so as soon as we see one we hand the regex+1 string over to the relative RE match. Since RE matching is a two-phase process there are a number of ways to carry the distinction from pcre_compile over to pcre_exec - one could for instance recheck (*regex=='*') later, or one could remember whether the pcre_compile flag was non-empty and check for it later (i.e. zero vs. PCRE_ANCHORED).

Note that some "*"-relative regex could match an empty part of the target, and usually you do not want that, which is why one would usually call pcre_exec with PCRE_NOTEMPTY - using it for the absolute variant is purely optional but does not hurt either.
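To make the XP spec rules from the stdlib sections more concrete (complete match by default, substring match after a leading '*'), here is a minimal sketch in C for the ptr/cnt target with a str/len pattern. The names xp_memstr, xp_match and xp_match_z are hypothetical and not part of the library - xp_memstr merely stands in for helpers like g_strstr_len or xml_g_strstr, written as the small while-loop around a memcmp() described above. It assumes, as the text does, that the buffers contain no embedded zero-chars.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for g_strstr_len / xml_g_strstr: find pat/plen
 * inside buf/blen without relying on zero termination.
 * Returns a pointer to the first hit, or NULL if there is none. */
static const char *
xp_memstr (const char *buf, size_t blen, const char *pat, size_t plen)
{
    if (plen == 0)
        return buf;                    /* an empty pattern matches at once */
    while (blen >= plen)               /* the while-loop around a memcmp() */
    {
        if (!memcmp (buf, pat, plen))
            return buf;
        buf++;
        blen--;
    }
    return NULL;
}

/* Hypothetical XP spec matcher. A = ptr/cnt is the target, B = str/len is
 * the pattern. A leading '*' selects the relative (substring) match,
 * anything else selects the absolute (complete) match. */
static int
xp_match (const char *ptr, size_t cnt, const char *str, size_t len)
{
    assert (ptr != NULL && str != NULL);
    if (len > 0 && *str == '*')
        return xp_memstr (ptr, cnt, str + 1, len - 1) != NULL;
    return cnt == len && !memcmp (ptr, str, len);
}

/* Zero-terminated strings are decoded into ptr/strlen(ptr) pairs first. */
static int
xp_match_z (const char *target, const char *pattern)
{
    return xp_match (target, strlen (target), pattern, strlen (pattern));
}
```

With these definitions xp_match_z ("xnames", "name") yields 0 while xp_match_z ("xnames", "*name") yields 1, which mirrors the "name" vs. "*name" example from the top. Routing every case through one ptr/cnt routine trades the inlined strcmp()/strstr() fast paths for brevity, so it is an illustration rather than the optimized layout the text describes.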