Lucene in Action

foreword

preface

acknowledgments

about this book

1.0 : Meet Lucene

1.1 : Evolution of information organization and access

1.2 : Understanding Lucene

1.2.1 : What Lucene is

1.2.2 : What Lucene can do for you

1.2.3 : History of Lucene

1.2.4 : Who uses Lucene

1.2.5 : Lucene ports: Perl, Python, C++, .NET, Ruby

1.3 : Indexing and searching

1.3.1 : What is indexing, and why is it important?

1.3.2 : What is searching?

1.4 : Lucene in action: a sample application

1.4.1 : Creating an index

Indexer command-line example whitespace issue

1.4.2 : Searching an index

1.5 : Understanding the core indexing classes

1.5.1 : IndexWriter

1.5.2 : Directory

1.5.3 : Analyzer

1.5.4 : Document

1.5.5 : Field

1.6 : Understanding the core searching classes

1.6.1 : IndexSearcher

1.6.2 : Term

1.6.3 : Query

1.6.4 : TermQuery

1.6.5 : Hits

1.7 : Review of alternate search products

1.7.1 : IR libraries

1.7.2 : Indexing and searching applications

1.7.3 : Online resources

1.8 : Summary

2.0 : Indexing

2.1 : Understanding the indexing process

2.1.1 : Conversion to text

Incorrect figure reference

2.1.2 : Analysis

2.1.3 : Index writing

2.2 : Basic index operations

2.2.1 : Adding documents to an index

2.2.2 : Removing Documents from an index

2.2.3 : Undeleting Documents

2.2.4 : Updating Documents in an index

2.3 : Boosting Documents and Fields

Document and Field boost setting

2.4 : Indexing dates

2.5 : Indexing numbers

2.6 : Indexing Fields used for sorting

2.7 : Controlling the indexing process

2.7.1 : Tuning indexing performance

2.7.2 : In-memory indexing: RAMDirectory

Two pseudocode fixes

2.7.3 : Limiting Field sizes: maxFieldLength

2.8 : Optimizing an index

2.9 : Concurrency, thread-safety, and locking issues

2.9.1 : Concurrency rules

2.9.2 : Thread-safety

2.9.3 : Index locking

2.9.4 : Disabling index locking

2.10 : Debugging indexing

2.11 : Summary

3.0 : Adding search to your application

3.1 : Implementing a simple search feature

3.1.1 : Searching for a specific term

3.1.2 : Parsing a user-entered query expression: QueryParser

3.2 : Using IndexSearcher

3.2.1 : Working with Hits

3.2.2 : Paging through Hits

3.2.3 : Reading indexes into memory

3.3 : Understanding Lucene scoring

Scoring formula figure omission

3.3.1 : Lucene, you got a lot of 'splainin' to do!

3.4 : Creating queries programmatically

3.4.1 : Searching by term: TermQuery

3.4.2 : Searching within a range: RangeQuery

3.4.3 : Searching on a string: PrefixQuery

3.4.4 : Combining queries: BooleanQuery

Unintended whitespace

3.4.5 : Searching by phrase: PhraseQuery

3.4.6 : Searching by wildcard: WildcardQuery

3.4.7 : Searching for similar terms: FuzzyQuery

3.5 : Parsing query expressions: QueryParser

3.5.1 : Query.toString

3.5.2 : Boolean operators

3.5.3 : Grouping

3.5.4 : Field selection

3.5.5 : Range searches

3.5.6 : Phrase queries

3.5.7 : Wildcard and prefix queries

3.5.8 : Fuzzy queries

Caveats that apply

3.5.9 : Boosting queries

3.5.10 : To QueryParse or not to QueryParse?

3.6 : Summary

4.0 : Analysis

4.1 : Using analyzers

4.1.1 : Indexing analysis

Of course

4.1.2 : QueryParser analysis

4.1.3 : Parsing versus analysis: when an analyzer isn't appropriate

4.2 : Analyzing the analyzer

4.2.1 : What's in a token?

4.2.2 : TokenStreams uncensored

4.2.3 : Visualizing analyzers

4.2.4 : Filtering order can be important

4.3 : Using the built-in analyzers

4.3.1 : StopAnalyzer

4.3.2 : StandardAnalyzer

4.4 : Dealing with keyword fields

4.4.1 : Alternate keyword analyzer

4.5 : "Sounds like" querying

4.6 : Synonyms, aliases, and words that

4.6.1 : Visualizing token positions

4.7 : Stemming analysis

4.7.1 : Leaving holes

4.7.2 : Putting it together

4.7.3 : Hole lot of trouble

4.8 : Language analysis issues

4.8.1 : Unicode and encodings

4.8.2 : Analyzing non-English languages

4.8.3 : Analyzing Asian languages

4.8.4 : Zaijian

4.9 : Nutch analysis

4.10 : Summary

5.0 : Advanced search techniques

5.1 : Sorting search results

5.1.1 : Using a sort

5.1.2 : Sorting by relevance

5.1.3 : Sorting by index order

5.1.4 : Sorting by a field

5.1.5 : Reversing sort order

5.1.6 : Sorting by multiple fields

5.1.7 : Selecting a sorting field type

5.1.8 : Using a nondefault locale for sorting

5.1.9 : Performance effect of sorting

5.2 : Using PhrasePrefixQuery

5.3 : Querying on multiple fields at once

5.4 : Span queries: Lucene's new hidden gem

5.4.1 : Building block of spanning, SpanTermQuery

5.4.2 : Finding spans at the beginning of a field

5.4.3 : Spans near one another

5.4.4 : Excluding span overlap from matches

5.4.5 : Spanning the globe

5.4.6 : SpanQuery and QueryParser

5.5 : Filtering a search

5.5.1 : Using DateFilter

5.5.2 : Using QueryFilter

5.5.3 : Security filters

5.5.4 : A QueryFilter alternative

5.5.5 : Caching filter results

5.5.6 : Beyond the built-in filters

5.6 : Searching across multiple Lucene indexes

5.6.1 : Using MultiSearcher

5.6.2 : Multithreaded searching using ParallelMultiSearcher

5.7 : Leveraging term vectors

5.7.1 : Books like this

5.7.2 : What category?

5.8 : Summary

6.0 : Extending search

6.1 : Using a custom sort method

Memory leak in custom sort code

6.1.1 : Accessing values used in custom sorting

6.2 : Developing a custom HitCollector

6.2.1 : About BookLinkCollector

6.2.2 : Using BookLinkCollector

6.3 : Extending QueryParser

6.3.1 : Customizing QueryParser's behavior

6.3.2 : Prohibiting fuzzy and wildcard queries

6.3.3 : Handling numeric field-range queries

6.3.4 : Allowing ordered phrase queries

6.4 : Using a custom filter

6.4.1 : Using a filtered query

6.5 : Performance testing

6.5.1 : Testing the speed of a search

6.5.2 : Load testing

6.5.3 : QueryParser again!

6.5.4 : Morals of performance testing

6.6 : Summary

7.0 : Parsing common document formats

7.1 : Handling rich-text documents

7.1.1 : Creating a common DocumentHandler interface

7.2 : Indexing XML

7.2.1 : Parsing and indexing using SAX

SAXXMLHandler attributeMap initialization

7.2.2 : Parsing and indexing using Digester

7.3 : Indexing a PDF document

7.3.1 : Extracting text and indexing using PDFBox

NPE in PDFBoxPDFHandler

7.3.2 : Built-in Lucene support

7.4 : Indexing an HTML document

7.4.1 : Getting the HTML source data

7.4.2 : Using JTidy

7.4.3 : Using NekoHTML

7.5 : Indexing a Microsoft Word document

7.5.1 : Using POI

7.5.2 : Using TextMining.org's API

7.6 : Indexing an RTF document

7.7 : Indexing a plain-text document

7.8 : Creating a document-handling framework

7.8.1 : FileHandler interface

7.8.2 : ExtensionFileHandler

7.8.3 : FileIndexer application

7.8.4 : Using FileIndexer

7.8.5 : FileIndexer drawbacks, and how to extend the framework

7.9 : Other text-extraction tools

7.9.1 : Document-management systems and services

7.10 : Summary

8.0 : Tools and extensions

8.1 : Playing in Lucene's Sandbox

8.2 : Interacting with an index

8.2.1 : lucli: a command-line interface

8.2.2 : Luke: the Lucene Index Toolbox

8.2.3 : LIMO: Lucene Index Monitor

8.3 : Analyzers, tokenizers, and TokenFilters, oh my

8.3.1 : SnowballAnalyzer

8.3.2 : Obtaining the Sandbox analyzers

8.4 : Java Development with Ant and Lucene

8.4.1 : Using the <index> task

8.4.2 : Creating a custom document handler

8.4.3 : Installation

8.5 : JavaScript browser utilities

8.5.1 : JavaScript query construction and validation

8.5.2 : Escaping special characters

8.5.3 : Using JavaScript support

8.6 : Synonyms from WordNet

8.6.1 : Building the synonym index

8.6.2 : Tying WordNet synonyms into an analyzer

8.6.3 : Calling on Lucene

8.7 : Highlighting query terms

8.7.1 : Highlighting with CSS

Incorrect figure reference

8.7.2 : Highlighting Hits

8.8 : Chaining filters

8.9 : Storing an index in Berkeley DB

8.9.1 : Coding to DbDirectory

8.9.2 : Installing DbDirectory

8.10 : Building the Sandbox

8.10.1 : Check it out

8.10.2 : Ant in the Sandbox

8.11 : Summary

9.0 : Lucene ports

9.1 : Ports' relation to Lucene

9.2 : CLucene

9.2.1 : Supported platforms

9.2.2 : API compatibility

9.2.3 : Unicode support

9.2.4 : Performance

9.2.5 : Users

9.3 : dotLucene

9.3.1 : API compatibility

9.3.2 : Index compatibility

9.3.3 : Performance

9.3.4 : Users

9.4 : Plucene

9.4.1 : API compatibility

9.4.2 : Index compatibility

9.4.3 : Performance

9.4.4 : Users

9.5 : Lupy

9.5.1 : API compatibility

9.5.2 : Index compatibility

9.5.3 : Performance

9.5.4 : Users

9.6 : PyLucene

9.6.1 : API compatibility

9.6.2 : Index compatibility

9.6.3 : Performance

9.6.4 : Users

9.7 : Summary

10.0 : Case studies

10.1 : Nutch: "The NPR of search engines"

10.1.1 : More in depth

10.1.2 : Other Nutch features

10.2 : Using Lucene at jGuru

10.2.1 : Topic lexicons and document categorization

10.2.2 : Search database structure

10.2.3 : Index fields

10.2.4 : Indexing and content preparation

10.2.5 : Queries

10.2.6 : JGuruMultiSearcher

10.2.7 : Miscellaneous

10.3 : Using Lucene in SearchBlox

10.3.1 : Why choose Lucene?

10.3.2 : SearchBlox architecture

10.3.3 : Search results

10.3.4 : Language support

10.3.5 : Reporting Engine

10.3.6 : Summary

10.4 : Competitive intelligence with Lucene in XtraMind's XM-InformationMinder™

10.4.1 : The system architecture

10.4.2 : How Lucene has helped us

10.5 : Alias-i: orthographic variation with Lucene

10.5.1 : Alias-i application architecture

10.5.2 : Orthographic variation

10.5.3 : The noisy channel model of spelling correction

10.5.4 : The vector comparison model of spelling variation

10.5.5 : A subword Lucene analyzer

10.5.6 : Accuracy, efficiency, and other applications

10.5.7 : Mixing in context

10.5.8 : References

10.6 : Artful searching at Michaels.com

10.6.1 : Indexing content

10.6.2 : Searching content

Alluded to...

10.6.3 : Search statistics

10.6.4 : Summary

10.7 : I love Lucene: TheServerSide

10.7.1 : Building better search capability

10.7.2 : High-level infrastructure

10.7.3 : Building the index

10.7.4 : Searching the index

10.7.5 : Configuration: one place to rule them all

10.7.6 : Web tier: TheSeeeeeeeeeeeerverSide?

10.7.7 : Summary

10.8 : Conclusion

Appendix A : Installing Lucene

Appendix B : Lucene index format

Appendix C : Resources