Query parsed to: index fileindex
81 - 100 of 230 results (Page 5 of 12)

4.1 : Using analyzers

starts on page 104 in chapter 4 (Analysis)

...Before we get into the gory details of what lurks inside an analyzer, let's look at how an analyzer is used within Lucene. Analysis occurs at two spots: during indexing and when using QueryParser ... square brackets to make the separations apparent. During indexing, the tokens extracted during analysis are the terms indexed. And, most important, the terms indexed are the terms that are searchable ... of the analysis process visible to the end user. Terms pulled from the original text are indexed and are matched...

4.1.3 : Parsing versus analysis: when an analyzer isn't appropriate

starts on page 107 under section 4.1 (Using analyzers) in chapter 4 (Analysis)

...An important point about analyzers is that they're used internally for fields flagged to be tokenized. Documents such as HTML, Microsoft Word, XML, and others, contain meta-data such as author, title, last modified date, and potentially much more. When you're indexing rich documents, this meta-data should be separated and indexed as separate fields. Analyzers are used to analyze a specific ... options for indexing them; it also discusses parsing various document types in detail....

10.2.6 : JGuruMultiSearcher

starts on page 339 under section 10.2 (Using Lucene at jGuru) in chapter 10 (Case studies)

...Lucene does not have a standard object for searching multiple indexes with different queries. Because jGuru needs to search the foreign database versus its internal search databases with slightly different query terms, I made a subclass of Lucene's MultiSearcher, JGuruMultiSearcher (shown ... and ScoreDoc as well as the Searcher interface. Listing 10.1 Searching multiple indexes ... on each index and merges the results like this JGuruMultiSearcher. protected TopDocs search(Query query...

10.4.2 : How Lucene has helped us

starts on page 350 under section 10.4 (Competitive intelligence with Lucene in XtraMind's XM-InformationMinderTM) in chapter 10 (Case studies)

... was the fast generation of extracts for a given search result hit. We currently index the document ... is that this method is not really efficient when indexing long documents: the index becomes very large...

10.6 : Artful searching at Michaels.com

starts on page 361 in chapter 10 (Case studies)

... the search criteria. Rebuilding the search index involved taking the search facility offline ... the customer's patience. Scalability--The tool must scale well both in terms of the amount of data indexed as well as with the site's load during peak traffic. Robustness--The index must be frequently...

10.8 : Conclusion

starts on page 386 in chapter 10 (Case studies)

... (17: Authors' note: Index optimization is covered in section 2.8.) from Doug's Nutch efforts. The Nutch ... is incorporated yields nifty effects. Indexing hexadecimal RGB values and providing external indexing...

5.6.2 : Multithreaded searching using ParallelMultiSearcher

starts on page 180 under section 5.6 (Searching across multiple Lucene indexes) in chapter 5 (Advanced search techniques)

... on your architecture. Supposedly, if the indexes reside on different physical disks and you're able ... CPU, single physical disk, and multiple indexes, performance with MultiSearcher was slightly better ... . An example, using ParallelMultiSearcher remotely, is shown in listing 5.9. Searching multiple indexes remotely Lucene includes remote index searching capability through Remote Method Invocation (RMI ... remote (and/or local) indexes, and each server could search only a single index. In order...

4.8.4 : Zaijian

starts on page 145 under section 4.8 (Language analysis issues) in chapter 4 (Analysis)

...A major hurdle (unrelated to Lucene) remains when you're dealing with various languages: handling text encoding. The StandardAnalyzer is still the best built-in general-purpose analyzer, even accounting for CJK characters; however, the Sandbox CJKAnalyzer seems better suited for Asian language analysis. When you're indexing documents in multiple languages into a single index, using a per-Document analyzer is appropriate. You may also want to add a field to documents indicating their language...

9.6.2 : Index compatibility

starts on page 323 under section 9.6 (PyLucene) in chapter 9 (Lucene ports)

...Because of the nature of PyLucene ("compiler and SWIG gymnastics"), its indexes are compatible with those of Lucene....

1.3.2 : What is searching?

starts on page 11 under section 1.3 (Indexing and searching) in chapter 1 (Meet Lucene)

...Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities... [Full sample chapter]
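The precision and recall definitions in this excerpt can be sketched as plain set arithmetic. This is a hypothetical helper of my own, not code from the book, operating on sets of retrieved and relevant document ids:

```java
import java.util.*;

// Hypothetical helper (not from the book): computes the precision and
// recall metrics described above from a set of retrieved document ids
// and a set of relevant document ids.
public class SearchMetrics {

    static int overlap(Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.retainAll(b);   // intersection: docs in both sets
        return both.size();
    }

    // precision = |retrieved AND relevant| / |retrieved|
    static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0
            : (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // recall = |retrieved AND relevant| / |relevant|
    static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0
            : (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant  = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        System.out.println("precision = " + precision(retrieved, relevant)); // 2 of 4 retrieved are relevant
        System.out.println("recall    = " + recall(retrieved, relevant));    // 2 of 3 relevant were found
    }
}
```

The example shows the tension the chapter goes on to discuss: returning more documents tends to raise recall while lowering precision.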

2.3 : Boosting Documents and Fields

starts on page 38 in chapter 2 (Indexing)

...Not all Documents and Fields are created equal--or at least you can make sure that's the case by selectively boosting Documents or Fields. Imagine you have to write an application that indexes ... Lucene to consider it more or less important with respect to other Documents in the index. The API ... is a company employee. When we index messages sent by the company's employees, we set their boost ... . Imagine that another requirement for the email-indexing application is to consider the subject Field...

3.1 : Implementing a simple search feature

starts on page 69 in chapter 3 (Adding search to your application)

...Suppose you're tasked with adding search to an application. You've tackled getting the data indexed, but now it's time to expose the full-text searching to the end users. It's hard to imagine ... . Table 3.1 Lucene's primary searching API Class Purpose IndexSearcher Gateway to searching an index ... is returned from IndexSearcher's search method. When you're querying a Lucene index, an ordered ... that will be presented to the user. For large indexes, it wouldn't even be possible to collect all matching... [Full sample chapter]

4.3.1 : StopAnalyzer

starts on page 119 under section 4.3 (Using the built-in analyzers) in chapter 4 (Analysis)

... left by the words removed? Suppose you index "one is not enough". The tokens emitted from StopAnalyzer ... for words removed, so the result is exactly as if you indexed "one enough". If you were to use ... analyzes phrases, and each of these reduces to "one enough" and matches the terms indexed. There is a "hole ... , only the tokens emitted from the analyzer (or indexed as Field.Keyword) are available for searching....
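The stop-word behavior the excerpt describes can be sketched without Lucene. This is a minimal simplification of my own, not Lucene's actual StopAnalyzer: lowercase, split on non-letters, and drop stop words, so "one is not enough" yields the same terms as indexing "one enough":

```java
import java.util.*;

// Minimal sketch (not Lucene's actual StopAnalyzer): lowercase the
// text, split on non-letters, and discard stop words. The stop word
// list here is a small stand-in, not Lucene's default list.
public class SimpleStopAnalyzer {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "not"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (t.length() > 0 && !STOP_WORDS.contains(t)) {
                tokens.add(t);   // removed words leave no placeholder behind
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("one is not enough")); // [one, enough]
    }
}
```

Because nothing marks where "is not" used to be, phrase queries against the index cannot tell "one enough" apart from "one is not enough" -- the "hole" the excerpt refers to.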

4.4 : Dealing with keyword fields

starts on page 121 in chapter 4 (Analysis)

...It's easy to index a keyword using Field.Keyword, which is a single token added to a field that bypasses tokenization and is indexed exactly as is as a single term. It's also straightforward to query ... indexing. There is nothing special about keyword fields once they're indexed; they're just terms. Let's see the issue exposed with a straightforward test case that indexes a document with a keyword ... ); assertEquals(1, hits.length()); } So far, so good--we've indexed...

9.3.2 : Index compatibility

starts on page 318 under section 9.3 (dotLucene) in chapter 9 (Lucene ports)

...dotLucene is compatible with Lucene at the index level. That is to say, an index created by Lucene can be read by dotLucene and vice versa. Of course, as Lucene evolves, indexes between versions of Lucene itself may not be portable, so this compatibility is currently limited to Lucene version 1.4....

10.5.1 : Alias-i application architecture

starts on page 352 under section 10.5 (Alias-i: orthographic variation with Lucene) in chapter 10 (Case studies)

... the documents and index them. Indexing is carried out with the Apache Lucene search engine. The Lucene indexer itself buffers documents in a RAMDirectory, using a separate thread to merge them periodically with an on-disk index. (Figure 10.4: Alias-i Tracker architecture) The next two stages...

10.7.1 : Building better search capability

starts on page 371 under section 10.7 (I love Lucene: TheServerSide) in chapter 10 (Case studies)

... TheServerSide built an infrastructure that allows us to index and search our different content using Lucene. We will chat about our high-level infrastructure, how we index and search, as well as how ... ? We were using ht://Dig and having it crawl our site, building the index as it went along ... me to build an index just like ht://Dig was doing. At the time, LARM was in the Lucene Sandbox...

4.2.1 : What's in a token?

starts on page 108 under section 4.2 (Analyzing the analyzer) in chapter 4 (Analysis)

...A stream of tokens is the fundamental output of the analysis process. During indexing, fields designated for tokenization are processed with the specified analyzer, and each token is written to the index as a term. This distinction between tokens and terms may seem confusing at first. Let's see ... into terms After text is analyzed during indexing, each token is posted to the index as a term ... to the index. Start and end offset as well as token type are discarded--these are only used during...
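The token-versus-term distinction above can be illustrated with a toy analyzer. The Token class here is my own simplification, not Lucene's: a token carries text plus start/end offsets and a type, but posting to the index keeps only the term text:

```java
import java.util.*;

// Illustrative sketch of the token/term distinction: a token carries
// text, offsets, and a type; once posted to the index, only the term
// text survives. This Token class is a simplification, not Lucene's.
public class TokenDemo {
    static class Token {
        final String text; final int start; final int end; final String type;
        Token(String text, int start, int end, String type) {
            this.text = text; this.start = start; this.end = end; this.type = type;
        }
        public String toString() { return text + "(" + start + "," + end + "," + type + ")"; }
    }

    // Toy analyzer: split on single spaces, lowercase, record offsets.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        for (String word : text.split(" ")) {
            tokens.add(new Token(word.toLowerCase(), pos, pos + word.length(), "word"));
            pos += word.length() + 1;   // advance past the word and the space
        }
        return tokens;
    }

    // "Indexing" discards offsets and type, keeping only the term text.
    static List<String> toTerms(List<Token> tokens) {
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) terms.add(t.text);
        return terms;
    }

    public static void main(String[] args) {
        List<Token> tokens = analyze("The quick brown fox");
        System.out.println(tokens);
        System.out.println(toTerms(tokens)); // [the, quick, brown, fox]
    }
}
```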

9.4.2 : Index compatibility

starts on page 320 under section 9.4 (Plucene) in chapter 9 (Lucene ports)

...According to Plucene's author, indexes created by Lucene 1.3 and Plucene 1.19 are compatible: a Java application that uses Lucene 1.3 can read and digest an index created by Plucene 1.19, and vice versa. As with other ports with compatible indexes, indexes between versions of Lucene itself may not be portable as Lucene evolves, so this compatibility is restricted to Lucene version 1.3....

10.5.5 : A subword Lucene analyzer

starts on page 357 under section 10.5 (Alias-i: orthographic variation with Lucene) in chapter 10 (Case studies)

... to terms with the following method: public static Directory index(String[] terms) { Directory ... are TF/IDF weighting of the n-gram vectors, indexing of terms by n-grams, and cosine computation ... names were then indexed using 2-grams, 3-grams, and 4-grams. Then each of the names was used ... to read the strings from a file, index them in memory, optimize the index, and then parse...
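The 2-, 3-, and 4-gram indexing this case study mentions rests on extracting character n-grams from each name. This is a sketch of that extraction step only; the method name is my own, not from the Alias-i code:

```java
import java.util.*;

// Sketch of character n-gram extraction like the 2-/3-/4-grams the
// excerpt mentions; the method name is mine, not from the case study.
public class NGrams {
    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));   // sliding window of width n
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 2)); // [lu, uc, ce, en, ne]
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
    }
}
```

Indexing names by such fragments is what lets orthographic variants (e.g. transliterations that differ by a character or two) still share many terms and score as near matches.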
