diff options
Diffstat (limited to 'contrib/libg++/librx/DOC')
-rw-r--r-- | contrib/libg++/librx/DOC | 179 |
1 files changed, 0 insertions, 179 deletions
diff --git a/contrib/libg++/librx/DOC b/contrib/libg++/librx/DOC deleted file mode 100644 index dce6e2e10a4b..000000000000 --- a/contrib/libg++/librx/DOC +++ /dev/null @@ -1,179 +0,0 @@ -Most of the interfaces are covered by the documentation for GNU regex. - -This file will accumulate factoids about other interfaces until -somebody writes a manual. - - -* Don't Pass Registers Gratuitously - -Search and match functions take an optional parameter which is a -pointer to "registers" or "match positions". This parameter points -to a structure which during the match is filled in with the offset locations -of parenthesized subexpressions. - -Unless you specificly need the values that would be stored in that -structure, you should pass NULL for this parameter. Usually Rx will -do less backtracking (and so run much faster) if subexpression -positions are not being measured. - - -* Use syntax_parens - -Sometimes you need to know the positions of *some* parenthesized -subexpressions, but not others. You can still help Rx to avoid -backtracking by telling it specificly which subexpressions you are interested -in. You do this by filling in the rxb.syntax_parens field of -a pattern buffer. - - /* If this is a valid pointer, it tells rx not to store the extents of - * certain subexpressions (those corresponding to non-zero entries). - * Passing 0x1 is the same as passing an array of all ones. Passing 0x0 - * is the same as passing an array of all zeros. - * The array should contain as many entries as their are subexps in the - * regexp. - */ - char * syntax_parens; - - -* RX_SEARCH - -For an example of how to use rx_search, you can look at how -re_search_2 is defined (in rx.c). Basicly you need to define three -functions. These are GET_BURST, FETCH_CHAR, and BACK_REF. They each -operate on `struct rx_string_position' and a closure of your design -passed as void *. - - struct rx_string_position - { - const unsigned char * pos; /* The current pos. */ - const unsigned char * string; /* The current string burst. */ - const unsigned char * end; /* First invalid position >= POS. */ - int offset; /* Integer address of current burst */ - int size; /* Current string's size. */ - int search_direction; /* 1 or -1 */ - int search_end; /* First position to not try. */ - }; - -On entry to GET_BURST, all these fields are set, but POS may be >= -END. In fact, STRING and END might both be 0. - -The function of GET_BURST is to make all the fields valid without -changing the logical position in the string. SEARCH_DIRECTION is a -hint about which way the matcher will move next. It is usually 1, and -is -1 only when fastmapping during a reverse search. SEARCH_END -terminates the burst. - - typedef enum rx_get_burst_return - (*rx_get_burst_fn) (struct rx_string_position * pos, - void * app_closure, - int stop); - - -The closure is whatever you pass to rx_search. STOP is an argument to -rx_search that bounds the search. You should never return a string -position from with SEARCH_END set beyond the position indicated by -STOP. - - - enum rx_get_burst_return - { - rx_get_burst_continuation, - rx_get_burst_error, - rx_get_burst_ok, - rx_get_burst_no_more - }; - -Those are the possible return values of get_burst. Normally, you only -ever care about the last two. An error return indicates something -like trouble reading a file. A continuation return means suspend the -search and resume by retrying GET_BURST if the search is restarted. - -GET_BURST is not quite as trivial as you might hope. If you have a -fragmented string, you really have to keep two adjacent fragments at -all times, even though the GET_BURST interface looks like you only -need one. This is because of operators like `word-boundary' that try -to look at two adjacent characters. Such operators are implemented -with FETCH_CHAR. - - typedef int (*rx_fetch_char_fn) (struct rx_string_position * pos, - int offset, - void * app_closure, - int stop); - -That takes the same closure passed to GET_BURST. It returns the -character at POS or at one past POS according to whether OFFSET is 0 -or 1. - -It is guaranteed that POS + OFFSET is within the string being searched. - - - -The last function compares characters at one position with characters -previously matched by a parenthesized subexpression. - - enum rx_back_check_return - { - rx_back_check_continuation, - rx_back_check_error, - rx_back_check_pass, - rx_back_check_fail - }; - - typedef enum rx_back_check_return - (*rx_back_check_fn) (struct rx_string_position * pos, - int lparen, - int rparen, - unsigned char * translate, - void * app_closure, - int stop); - -LPAREN and RPAREN are the integer indexes of the previously matched -characters. The comparison should translate both characters being -compared by mapping them through TRANSLATE. POS is the point at which -to begin comparing. It should be advanced to the last character -matched during backreferencing. - -* Compilation Stages - -In rx_compile, a string is compiled into a pattern buffer. -Compilation proceeds in these stages: - - 1. Make a syntax tree for the regexp. - 2. Duplicate the syntax tree and make both trees nodes - in a single unifying tree. - 3. In one of the two trees, remove all side effects that - aren't needed to test for the possibility of a match. - Such side effects include the filling in of output registers - for subexpressions that are not backreferenced. - 4. Optimize the unifying tree. - 5. Translate the tree to an NFA. - 6. Analyze and optimize the NFA. - 7. Copy the NFA into a contiguous region of memory. - - -* Cache Size - -During a search or match, the NFA is translated into a "super NFA". -A super NFA can match the patterns of the corresponding NFA in no more -and often fewer steps. - -The catch is that the super NFA may be costly to construct in its entirety; -it may not even fit in memory. So, states of the NFA are constructed -on demand and discarded after a period of non-use. They are kept in a cache -so that time is not wasted constructing existing nodes twice. - -The size of the super state NFA cache is a contributing factor the performance -of Rx. The larger the cache (to a point) the faster Rx can run. The -variable rx_cache_bound is an upper limit on the number of superstates -that can exist in the cache. - -The defaulting setting is 128. GNU sed uses 4096. Neither setting -has much justification although sed's is after a small number of quick -and dirty experiments. The memory consumed by one superstate is -between 4k and 8k. The cache only grows to its bounded size if there -is actual demand for that many states. Sed's setting, for example, -may appear quite high but in practice that much memory is hardly ever -used. The default setting was chosen based on the heuristic that a -megabyte is the upper limit on what a good citizen library can -allocate without special arrangement. - |