diff options
Diffstat (limited to 'contrib/libg++/librx/DOC')
-rw-r--r-- | contrib/libg++/librx/DOC | 179 |
1 files changed, 179 insertions, 0 deletions
diff --git a/contrib/libg++/librx/DOC b/contrib/libg++/librx/DOC new file mode 100644 index 000000000000..dce6e2e10a4b --- /dev/null +++ b/contrib/libg++/librx/DOC @@ -0,0 +1,179 @@ +Most of the interfaces are covered by the documentation for GNU regex. + +This file will accumulate factoids about other interfaces until +somebody writes a manual. + + +* Don't Pass Registers Gratuitously + +Search and match functions take an optional parameter which is a +pointer to "registers" or "match positions". This parameter points +to a structure which during the match is filled in with the offset locations +of parenthesized subexpressions. + +Unless you specificly need the values that would be stored in that +structure, you should pass NULL for this parameter. Usually Rx will +do less backtracking (and so run much faster) if subexpression +positions are not being measured. + + +* Use syntax_parens + +Sometimes you need to know the positions of *some* parenthesized +subexpressions, but not others. You can still help Rx to avoid +backtracking by telling it specificly which subexpressions you are interested +in. You do this by filling in the rxb.syntax_parens field of +a pattern buffer. + + /* If this is a valid pointer, it tells rx not to store the extents of + * certain subexpressions (those corresponding to non-zero entries). + * Passing 0x1 is the same as passing an array of all ones. Passing 0x0 + * is the same as passing an array of all zeros. + * The array should contain as many entries as their are subexps in the + * regexp. + */ + char * syntax_parens; + + +* RX_SEARCH + +For an example of how to use rx_search, you can look at how +re_search_2 is defined (in rx.c). Basicly you need to define three +functions. These are GET_BURST, FETCH_CHAR, and BACK_REF. They each +operate on `struct rx_string_position' and a closure of your design +passed as void *. + + struct rx_string_position + { + const unsigned char * pos; /* The current pos. */ + const unsigned char * string; /* The current string burst. */ + const unsigned char * end; /* First invalid position >= POS. */ + int offset; /* Integer address of current burst */ + int size; /* Current string's size. */ + int search_direction; /* 1 or -1 */ + int search_end; /* First position to not try. */ + }; + +On entry to GET_BURST, all these fields are set, but POS may be >= +END. In fact, STRING and END might both be 0. + +The function of GET_BURST is to make all the fields valid without +changing the logical position in the string. SEARCH_DIRECTION is a +hint about which way the matcher will move next. It is usually 1, and +is -1 only when fastmapping during a reverse search. SEARCH_END +terminates the burst. + + typedef enum rx_get_burst_return + (*rx_get_burst_fn) (struct rx_string_position * pos, + void * app_closure, + int stop); + + +The closure is whatever you pass to rx_search. STOP is an argument to +rx_search that bounds the search. You should never return a string +position from with SEARCH_END set beyond the position indicated by +STOP. + + + enum rx_get_burst_return + { + rx_get_burst_continuation, + rx_get_burst_error, + rx_get_burst_ok, + rx_get_burst_no_more + }; + +Those are the possible return values of get_burst. Normally, you only +ever care about the last two. An error return indicates something +like trouble reading a file. A continuation return means suspend the +search and resume by retrying GET_BURST if the search is restarted. + +GET_BURST is not quite as trivial as you might hope. If you have a +fragmented string, you really have to keep two adjacent fragments at +all times, even though the GET_BURST interface looks like you only +need one. This is because of operators like `word-boundary' that try +to look at two adjacent characters. Such operators are implemented +with FETCH_CHAR. + + typedef int (*rx_fetch_char_fn) (struct rx_string_position * pos, + int offset, + void * app_closure, + int stop); + +That takes the same closure passed to GET_BURST. It returns the +character at POS or at one past POS according to whether OFFSET is 0 +or 1. + +It is guaranteed that POS + OFFSET is within the string being searched. + + + +The last function compares characters at one position with characters +previously matched by a parenthesized subexpression. + + enum rx_back_check_return + { + rx_back_check_continuation, + rx_back_check_error, + rx_back_check_pass, + rx_back_check_fail + }; + + typedef enum rx_back_check_return + (*rx_back_check_fn) (struct rx_string_position * pos, + int lparen, + int rparen, + unsigned char * translate, + void * app_closure, + int stop); + +LPAREN and RPAREN are the integer indexes of the previously matched +characters. The comparison should translate both characters being +compared by mapping them through TRANSLATE. POS is the point at which +to begin comparing. It should be advanced to the last character +matched during backreferencing. + +* Compilation Stages + +In rx_compile, a string is compiled into a pattern buffer. +Compilation proceeds in these stages: + + 1. Make a syntax tree for the regexp. + 2. Duplicate the syntax tree and make both trees nodes + in a single unifying tree. + 3. In one of the two trees, remove all side effects that + aren't needed to test for the possibility of a match. + Such side effects include the filling in of output registers + for subexpressions that are not backreferenced. + 4. Optimize the unifying tree. + 5. Translate the tree to an NFA. + 6. Analyze and optimize the NFA. + 7. Copy the NFA into a contiguous region of memory. + + +* Cache Size + +During a search or match, the NFA is translated into a "super NFA". +A super NFA can match the patterns of the corresponding NFA in no more +and often fewer steps. + +The catch is that the super NFA may be costly to construct in its entirety; +it may not even fit in memory. So, states of the NFA are constructed +on demand and discarded after a period of non-use. They are kept in a cache +so that time is not wasted constructing existing nodes twice. + +The size of the super state NFA cache is a contributing factor the performance +of Rx. The larger the cache (to a point) the faster Rx can run. The +variable rx_cache_bound is an upper limit on the number of superstates +that can exist in the cache. + +The defaulting setting is 128. GNU sed uses 4096. Neither setting +has much justification although sed's is after a small number of quick +and dirty experiments. The memory consumed by one superstate is +between 4k and 8k. The cache only grows to its bounded size if there +is actual demand for that many states. Sed's setting, for example, +may appear quite high but in practice that much memory is hardly ever +used. The default setting was chosen based on the heuristic that a +megabyte is the upper limit on what a good citizen library can +allocate without special arrangement. + |