Based on Drupal's search.module version 1.224 {@link http://cvs.drupal.org/viewcvs/drupal/drupal/modules/search/search.module?view=markup}
Copyright: | (C) 2001-3001 Eloy Lafuente (stronk7) {@link http://contiento.com} |
License: | http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later |
File Size: | 406 lines (17 kb) |
Included or required: | 0 times |
Referenced: | 0 times |
Includes or requires: | 0 files |
tokenise_text($text, $stop_words = array(), $overlap_cjk = false, $join_numbers = false) X-Ref |
This function processes the text passed as input, extracting all the tokens and scoring each one based on its number of occurrences and its relation to some well-known HTML tags.
return: array a sorted array of tokens, with the tokens as keys and their scores as values.
param: string $text the text to be tokenised.
param: array $stop_words array of UTF-8 words that can be ignored.
param: boolean $overlap_cjk option to split CJK text into overlapping sequences of characters.
param: boolean $join_numbers option to join sequences of numbers into a single token. |
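The return shape described above (tokens as keys, scores as values, sorted) can be illustrated with a minimal sketch. This is not the library's implementation: `sketch_tokenise_text` is a hypothetical name, the score here is only the occurrence count, and the HTML tag weighting, CJK handling and number joining are left out. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical sketch, not the library's code: count token occurrences,
// skip stop words, and return the tokens sorted by descending score.
function sketch_tokenise_text(string $text, array $stop_words = array()): array {
    // Lowercase, then split on anything that is not a letter or a digit (UTF-8 aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach ($words as $word) {
        if (in_array($word, $stop_words, true)) {
            continue; // Stop words are ignored.
        }
        $scores[$word] = ($scores[$word] ?? 0) + 1;
    }
    arsort($scores); // Highest score first; the token keys are preserved.
    return $scores;
}

$tokens = sketch_tokenise_text('The cat and the other cat', array('the', 'and'));
// $tokens is array('cat' => 2, 'other' => 1)
```

Keeping the tokens as array keys makes a lookup of a single token's score a plain index operation, which matches the return format documented for tokenise_text.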
tokenise_split($text, $stop_words, $overlap_cjk, $join_numbers) X-Ref |
Splits a string into tokens |
tokenise_simplify($text, $overlap_cjk, $join_numbers) X-Ref |
Simplifies a string according to indexing rules. |
tokenise_expand_cjk($matches) X-Ref |
Basic CJK tokeniser. Simply splits a string into consecutive, overlapping sequences of characters (MINIMUM_WORD_SIZE long). |
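The overlapping split described above can be sketched as follows. The names `sketch_expand_cjk` and `SKETCH_MINIMUM_WORD_SIZE`, and the value 2, are assumptions made for illustration. The documented function receives a `$matches` array (the shape of a regular-expression callback argument), whereas this sketch takes a plain string and returns an array to keep the idea visible.

```php
<?php
// Assumed value for illustration; the real MINIMUM_WORD_SIZE may differ.
const SKETCH_MINIMUM_WORD_SIZE = 2;

// Hypothetical sketch: split a string into consecutive, overlapping
// sequences of $size characters.
function sketch_expand_cjk(string $text, int $size = SKETCH_MINIMUM_WORD_SIZE): array {
    // Split into individual UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    if ($count <= $size) {
        return array($text); // Short strings pass through unchanged.
    }
    $tokens = array();
    for ($i = 0; $i + $size <= $count; $i++) {
        // Slide a window of $size characters one position at a time.
        $tokens[] = implode('', array_slice($chars, $i, $size));
    }
    return $tokens;
}

$grams = sketch_expand_cjk('東京都庁');
// $grams is array('東京', '京都', '都庁')
```

Overlapping windows are a common choice for CJK text because words are not separated by spaces, so every adjacent character pair becomes a searchable token.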
tokenise_truncate_word(&$text) X-Ref |
Helper function for array_walk in search_index_split. Truncates one string (token) to MAXIMUM_WORD_SIZE |
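A by-reference helper of this kind could be used with array_walk roughly as shown below. `sketch_truncate_word` and `SKETCH_MAXIMUM_WORD_SIZE` are hypothetical names, and the limit of 10 characters is an assumed value. The sketch also assumes the mbstring extension so that truncation counts characters rather than bytes.

```php
<?php
// Assumed value for illustration; the real MAXIMUM_WORD_SIZE may differ.
const SKETCH_MAXIMUM_WORD_SIZE = 10;

// Hypothetical sketch: truncate one token in place. The argument is taken
// by reference so that array_walk can modify the array elements directly.
function sketch_truncate_word(string &$text): void {
    $text = mb_substr($text, 0, SKETCH_MAXIMUM_WORD_SIZE, 'UTF-8');
}

$words = array('short', 'internationalisation');
array_walk($words, 'sketch_truncate_word');
// $words is array('short', 'internatio')
```

array_walk also passes the element key as a second argument; a user-defined callback that declares only one parameter simply ignores it.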