foreword
preface
acknowledgments
about this book
1.0 : Meet Lucene
1.1 : Evolution of information organization and access
1.2 : Understanding Lucene
1.2.1 : What Lucene is
1.2.2 : What Lucene can do for you
1.2.3 : History of Lucene
1.2.4 : Who uses Lucene
1.2.5 : Lucene ports: Perl, Python, C++, .NET, Ruby
1.3 : Indexing and searching
1.3.1 : What is indexing, and why is it important?
1.3.2 : What is searching?
1.4 : Lucene in action: a sample application
1.4.2 : Searching an index
1.5 : Understanding the core indexing classes
1.5.1 : IndexWriter
1.5.2 : Directory
1.5.3 : Analyzer
1.5.4 : Document
1.5.5 : Field
1.6 : Understanding the core searching classes
1.6.1 : IndexSearcher
1.6.2 : Term
1.6.3 : Query
1.6.4 : TermQuery
1.6.5 : Hits
1.7 : Review of alternate search products
1.7.1 : IR libraries
1.7.2 : Indexing and searching applications
1.7.3 : Online resources
1.8 : Summary
2.0 : Indexing
2.1 : Understanding the indexing process
2.1.1 : Conversion to text
2.1.2 : Analysis
2.1.3 : Index writing
2.2 : Basic index operations
2.2.1 : Adding documents to an index
2.2.2 : Removing Documents from an index
2.2.3 : Undeleting Documents
2.2.4 : Updating Documents in an index
2.3 : Boosting Documents and Fields
2.4 : Indexing dates
2.5 : Indexing numbers
2.6 : Indexing Fields used for sorting
2.7 : Controlling the indexing process
2.7.1 : Tuning indexing performance
2.7.2 : In-memory indexing: RAMDirectory
2.7.3 : Limiting Field sizes: maxFieldLength
2.8 : Optimizing an index
2.9 : Concurrency, thread-safety, and locking issues
2.9.1 : Concurrency rules
2.9.2 : Thread-safety
2.9.3 : Index locking
2.9.4 : Disabling index locking
2.10 : Debugging indexing
2.11 : Summary
3.0 : Adding search to your application
3.1 : Implementing a simple search feature
3.1.1 : Searching for a specific term
3.1.2 : Parsing a user-entered query expression: QueryParser
3.2 : Using IndexSearcher
3.2.1 : Working with Hits
3.2.2 : Paging through Hits
3.2.3 : Reading indexes into memory
3.3 : Understanding Lucene scoring
3.3.1 : Lucene, you got a lot of 'splainin' to do!
3.4 : Creating queries programmatically
3.4.1 : Searching by term: TermQuery
3.4.2 : Searching within a range: RangeQuery
3.4.3 : Searching on a string: PrefixQuery
3.4.4 : Combining queries: BooleanQuery
3.4.5 : Searching by phrase: PhraseQuery
3.4.6 : Searching by wildcard: WildcardQuery
3.4.7 : Searching for similar terms: FuzzyQuery
3.5 : Parsing query expressions: QueryParser
3.5.1 : Query.toString
3.5.2 : Boolean operators
3.5.3 : Grouping
3.5.4 : Field selection
3.5.5 : Range searches
3.5.6 : Phrase queries
3.5.7 : Wildcard and prefix queries
3.5.8 : Fuzzy queries
3.5.9 : Boosting queries
3.5.10 : To QueryParse or not to QueryParse?
3.6 : Summary
4.0 : Analysis
4.1 : Using analyzers
4.1.1 : Indexing analysis
4.1.2 : QueryParser analysis
4.1.3 : Parsing versus analysis: when an analyzer isn't appropriate
4.2 : Analyzing the analyzer
4.2.1 : What's in a token?
4.2.2 : TokenStreams uncensored
4.2.3 : Visualizing analyzers
4.2.4 : Filtering order can be important
4.3 : Using the built-in analyzers
4.3.1 : StopAnalyzer
4.3.2 : StandardAnalyzer
4.4 : Dealing with keyword fields
4.4.1 : Alternate keyword analyzer
4.5 : "Sounds like" querying
4.6 : Synonyms, aliases, and words that
4.6.1 : Visualizing token positions
4.7 : Stemming analysis
4.7.1 : Leaving holes
4.7.2 : Putting it together
4.7.3 : Hole lot of trouble
4.8 : Language analysis issues
4.8.1 : Unicode and encodings
4.8.2 : Analyzing non-English languages
4.8.3 : Analyzing Asian languages
4.8.4 : Zaijian
4.9 : Nutch analysis
4.10 : Summary
5.0 : Advanced search techniques
5.1 : Sorting search results
5.1.1 : Using a sort
5.1.2 : Sorting by relevance
5.1.3 : Sorting by index order
5.1.4 : Sorting by a field
5.1.5 : Reversing sort order
5.1.6 : Sorting by multiple fields
5.1.7 : Selecting a sorting field type
5.1.8 : Using a nondefault locale for sorting
5.1.9 : Performance effect of sorting
5.2 : Using PhrasePrefixQuery
5.3 : Querying on multiple fields at once
5.4 : Span queries: Lucene's new hidden gem
5.4.1 : Building block of spanning, SpanTermQuery
5.4.2 : Finding spans at the beginning of a field
5.4.3 : Spans near one another
5.4.4 : Excluding span overlap from matches
5.4.5 : Spanning the globe
5.4.6 : SpanQuery and QueryParser
5.5 : Filtering a search
5.5.1 : Using DateFilter
5.5.2 : Using QueryFilter
5.5.3 : Security filters
5.5.4 : A QueryFilter alternative
5.5.5 : Caching filter results
5.5.6 : Beyond the built-in filters
5.6 : Searching across multiple Lucene indexes
5.6.1 : Using MultiSearcher
5.6.2 : Multithreaded searching using ParallelMultiSearcher
5.7 : Leveraging term vectors
5.7.1 : Books like this
5.7.2 : What category?
5.8 : Summary
6.0 : Extending search
6.1 : Using a custom sort method
6.1.1 : Accessing values used in custom sorting
6.2 : Developing a custom HitCollector
6.2.1 : About BookLinkCollector
6.2.2 : Using BookLinkCollector
6.3 : Extending QueryParser
6.3.1 : Customizing QueryParser's behavior
6.3.2 : Prohibiting fuzzy and wildcard queries
6.3.3 : Handling numeric field-range queries
6.3.4 : Allowing ordered phrase queries
6.4 : Using a custom filter
6.4.1 : Using a filtered query
6.5 : Performance testing
6.5.1 : Testing the speed of a search
6.5.2 : Load testing
6.5.3 : QueryParser again!
6.5.4 : Morals of performance testing
6.6 : Summary
7.0 : Parsing common document formats
7.1 : Handling rich-text documents
7.1.1 : Creating a common DocumentHandler interface
7.2 : Indexing XML
7.2.1 : Parsing and indexing using SAX
7.2.2 : Parsing and indexing using Digester
7.3 : Indexing a PDF document
7.3.1 : Extracting text and indexing using PDFBox
7.3.2 : Built-in Lucene support
7.4 : Indexing an HTML document
7.4.1 : Getting the HTML source data
7.4.2 : Using JTidy
7.4.3 : Using NekoHTML
7.5 : Indexing a Microsoft Word document
7.5.1 : Using POI
7.5.2 : Using TextMining.org's API
7.6 : Indexing an RTF document
7.7 : Indexing a plain-text document
7.8 : Creating a document-handling framework
7.8.1 : FileHandler interface
7.8.2 : ExtensionFileHandler
7.8.3 : FileIndexer application
7.8.4 : Using FileIndexer
7.8.5 : FileIndexer drawbacks, and how to extend the framework
7.9 : Other text-extraction tools
7.9.1 : Document-management systems and services
7.10 : Summary
8.0 : Tools and extensions
8.1 : Playing in Lucene's Sandbox
8.2 : Interacting with an index
8.2.1 : lucli: a command-line interface
8.2.2 : Luke: the Lucene Index Toolbox
8.2.3 : LIMO: Lucene Index Monitor
8.3 : Analyzers, tokenizers, and TokenFilters, oh my
8.3.1 : SnowballAnalyzer
8.3.2 : Obtaining the Sandbox analyzers
8.4 : Java Development with Ant and Lucene
8.4.1 : Using the <index> task
8.4.2 : Creating a custom document handler
8.4.3 : Installation
8.5 : JavaScript browser utilities
8.5.1 : JavaScript query construction and validation
8.5.2 : Escaping special characters
8.5.3 : Using JavaScript support
8.6 : Synonyms from WordNet
8.6.1 : Building the synonym index
8.6.2 : Tying WordNet synonyms into an analyzer
8.6.3 : Calling on Lucene
8.7 : Highlighting query terms
8.7.1 : Highlighting with CSS
8.7.2 : Highlighting Hits
8.8 : Chaining filters
8.9 : Storing an index in Berkeley DB
8.9.1 : Coding to DbDirectory
8.9.2 : Installing DbDirectory
8.10 : Building the Sandbox
8.10.1 : Check it out
8.10.2 : Ant in the Sandbox
8.11 : Summary
9.0 : Lucene ports
9.1 : Ports' relation to Lucene
9.2 : CLucene
9.2.1 : Supported platforms
9.2.2 : API compatibility
9.2.3 : Unicode support
9.2.4 : Performance
9.2.5 : Users
9.3 : dotLucene
9.3.1 : API compatibility
9.3.2 : Index compatibility
9.3.3 : Performance
9.3.4 : Users
9.4 : Plucene
9.4.1 : API compatibility
9.4.2 : Index compatibility
9.4.3 : Performance
9.4.4 : Users
9.5 : Lupy
9.5.1 : API compatibility
9.5.2 : Index compatibility
9.5.3 : Performance
9.5.4 : Users
9.6 : PyLucene
9.6.1 : API compatibility
9.6.2 : Index compatibility
9.6.3 : Performance
9.6.4 : Users
9.7 : Summary
10.0 : Case studies
10.1 : Nutch: "The NPR of search engines"
10.1.1 : More in depth
10.1.2 : Other Nutch features
10.2 : Using Lucene at jGuru
10.2.1 : Topic lexicons and document categorization
10.2.2 : Search database structure
10.2.3 : Index fields
10.2.4 : Indexing and content preparation
10.2.5 : Queries
10.2.6 : JGuruMultiSearcher
10.2.7 : Miscellaneous
10.3 : Using Lucene in SearchBlox
10.3.1 : Why choose Lucene?
10.3.2 : SearchBlox architecture
10.3.3 : Search results
10.3.4 : Language support
10.3.5 : Reporting Engine
10.3.6 : Summary
10.4 : Competitive intelligence with Lucene in XtraMind's XM-InformationMinder™
10.4.1 : The system architecture
10.4.2 : How Lucene has helped us
10.5 : Alias-i: orthographic variation with Lucene
10.5.1 : Alias-i application architecture
10.5.2 : Orthographic variation
10.5.3 : The noisy channel model of spelling correction
10.5.4 : The vector comparison model of spelling variation
10.5.5 : A subword Lucene analyzer
10.5.6 : Accuracy, efficiency, and other applications
10.5.7 : Mixing in context
10.5.8 : References
10.6 : Artful searching at Michaels.com
10.6.1 : Indexing content
10.6.2 : Searching content
10.6.3 : Search statistics
10.6.4 : Summary
10.7 : I love Lucene: TheServerSide
10.7.1 : Building better search capability
10.7.2 : High-level infrastructure
10.7.3 : Building the index
10.7.4 : Searching the index
10.7.5 : Configuration: one place to rule them all
10.7.6 : Web tier: TheSeeeeeeeeeeeerverSide?
10.7.7 : Summary
10.8 : Conclusion
Appendix A : Installing Lucene
Appendix B : Lucene index format
Appendix C : Resources