Search inside Lucene in Action

4.7 : Stemming analysis

starts on page 136 in chapter 4 (Analysis)

... are removed, and also leverages a stemming filter. The PorterStemFilter is shown in figure 4.3, but it isn't used by any built-in analyzer. It stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words: The Porter stemming algorithm (or `Porter ... , reduce to breath. The Porter stemmer is one of many stemming algorithms. See section 8.3.1, page ... by Dr. Porter). KStem is another stemming algorithm that has been adapted to Lucene (search Google...

4.8.2 : Analyzing non-English languages

starts on page 141 under section 4.8 (Language analysis issues) in chapter 4 (Analysis)

... and punctuation are used to separate words, you must adjust stop-word lists and stemming algorithms ... . Both of these employ language-specific stemming and stop-word removal. Also freely available ... discussed in section 4.3.1. The GermanStemFilter stems words based on German-language rules and also provides a mechanism to provide an exclusion set of words that shouldn't be stemmed (which is empty ... ; it also lets you provide a custom set. Finally, the RussianStemFilter stems words using the Snowball...

4.7.2 : Putting it together

starts on page 137 under section 4.7 (Stemming analysis) in chapter 4 (Analysis)

...This custom analyzer uses our custom stop-word removal filter, which is fed from a LowerCaseTokenizer. The results of the stop filter are fed to the Porter stemmer. Listing 4.8 shows the full implementation of this sophisticated analyzer. LowerCaseTokenizer kicks off the analysis process, feeding tokens through our custom stop-word removal filter and finally stemming the words using ... ) and stems words public class PositionalPorterStopAnalyzer extends Analyzer { private Set stopWords...

10.1.2 : Other Nutch features

starts on page 328 under section 10.1 (Nutch: "The NPR of search engines") in chapter 10 (Case studies)

... from an HTML document. Nutch does not use stemming or term aliasing of any kind. Search engines have not historically done much stemming, but it is a question that comes up regularly. The Nutch...

8.3.1 : SnowballAnalyzer

starts on page 283 under section 8.3 (Analyzers, tokenizers, and TokenFilters, oh my) in chapter 8 (Tools and extensions)

... of stemmers for different languages. Stemming was first introduced in section 4.7. Dr. Martin Porter, who also developed the Porter stemming algorithm, created the Snowball algorithm. The Porter ... of stemming algorithms. Through these algorithmic definitions, accurate implementations can ... demonstrates the result of the English stemmer stripping off the trailing ming from stemming ... SnowballAnalyzer("English"); assertAnalyzesTo(analyzer, "stemming algorithms", new String[] {"stem", "algorithm...

4.2.2 : TokenStreams uncensored

starts on page 109 under section 4.2 (Analyzing the analyzer) in chapter 4 (Analysis)

... PorterStemFilter Stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri. StandardFilter Designed to be fed by a StandardTokenizer. Removes dots from...

3.6 : Summary

starts on page 100 in chapter 3 (Adding search to your application)

... of the confusion regarding QueryParser stems from unexpected analysis interactions; chapter 4 goes... [Full sample chapter]

4.8 : Language analysis issues

starts on page 140 in chapter 4 (Analysis)

... process, different languages have different sets of stop words and unique stemming algorithms. Perhaps...

4.10 : Summary

starts on page 147 in chapter 4 (Analysis)

... removal and stemming of words. Removing words decreases your index size but can have a negative impact...

10.5.2 : Orthographic variation

starts on page 354 under section 10.5 (Alias-i: orthographic variation with Lucene) in chapter 10 (Case studies)

... problems are partly ameliorated by standard tokenization, stemming, and stop lists. For instance...

4.7.3 : Hole lot of trouble

starts on page 138 under section 4.7 (Stemming analysis) in chapter 4 (Analysis)

... word removed with remaining words stemmed). PhraseQuery does allow a little looseness, called slop ... removal that leaves holes; you can now see the benefit our analyzer provides, thanks to the stemming...

3.1.1 : Searching for a specific term

starts on page 70 under section 3.1 (Implementing a simple search feature) in chapter 3 (Adding search to your application)

...), convert terms to lowercase, convert terms to base word forms (stemming), or insert additional terms... [Full sample chapter]

4.0 : Analysis

starts on page 102

... common words, reducing words to a root form (stemming), or changing words into the basic form...

8.3 : Analyzers, tokenizers, and TokenFilters, oh my

starts on page 282 in chapter 8 (Tools and extensions)

... use language-specific stemming and custom stop-word lists. The Czech analyzer uses standard ... of these analyzers do quite a bit in the filtering process. If the stemming or tokenization is all you need...

8.6.2 : Tying WordNet synonyms into an analyzer

starts on page 296 under section 8.6 (Synonyms from WordNet) in chapter 8 (Tools and extensions)

... in singular form. Perhaps stemming should be added to our SynonymAnalyzer prior to the SynonymFilter, or maybe the WordNetSynonymEngine should be responsible for stemming words before looking them...

8.2.2 : Luke: the Lucene Index Toolbox

starts on page 271 under section 8.2 (Interacting with an index) in chapter 8 (Tools and extensions)

..., if stop words were removed or tokens were stemmed during the analysis process then the original...

index

starts on page 416

... 292 POI 264 and WordNet 292 paging Porter stemming algorithm 136 misspellings 354 at jGuru 336 Porter ... 109 stemming alternative 359 additional 282 Walls, Craig 361 stemming analyzer 283 ordering 116 web...

preface

starts on page xix

... Software Foundation when Lucene migrated there in 2002. My devotion to Lucene stems from its being a core...