Search inside Lucene in Action

Query parsed to: index fileindex

221 - 230 of 230 results (Page 12 of 12)

Appendix B : Lucene index format

starts on page 394

...So far, we have treated the Lucene index more or less as a black box and have concerned ourselves only with its logical view. Although you don't need to understand index structure details in order to use Lucene, you may be curious about the "magic." Lucene's index structure is a case study ... as nature's proof.) In this appendix, we'll look at the logical view of a Lucene index, where ... of Lucene's inverted index. B.1 Logical index view Let's first take a step back and start...

Response to an Amazon review

After all the good reviews and very positive feedback about Lucene in Action that we have received over the last 10 months, we finally came across a not-so-positive review on Amazon. The review can be broken down into the following four main parts:

Lack of import statements
Authors didn't test the code
OOP is not suitable for Lucene code examples and there are no direct Lucene calls
Need for a command-line tool for HTML indexing

As Amazon's site doesn't let us provide feedback and respond to the review there, we thought we would address these issues here and hopefully help the reviewer get more out of our book. Let's address each of the four concerns:

Lack of import statements
The code examples in the book deliberately omit import statements. The list of imports is often long; had we included it with every example, the examples would be much longer and would often span multiple pages, making them harder to follow. The imports would also be repetitive, since most examples import the same or a very similar set of Lucene classes. Including them would have resulted in a thicker, heavier, and thus more expensive book.

So how should one deal with the lack of import statements?

Firstly, all code examples from Lucene in Action are free and available for download, even for those who don't own a copy of the book. The code is packaged with an Ant script that can compile all the code, create all needed indexes, and run the code examples from the book.

Secondly, one can import all the code into any modern Java IDE and easily see which classes come from which packages.

This is also described in the book itself, in the "About the Book" section on page xxvii, in the last sentence of the paragraph titled "Code examples".

Authors didn't test the code
One of the novel and interesting aspects of Lucene in Action is that most of its code examples are written as unit tests. Those examples are, therefore, automatically tested. We used the excellent JUnit unit test framework to build the examples, and we provided the reasoning behind this in the "About the Book" section on page xxvii, in the paragraph titled "Why JUnit?".

OOP is not suitable for Lucene code examples and there are no direct Lucene calls
All the calls to Lucene are direct calls, but presented as unit tests. It sounds like the reviewer is confusing OOP and unit tests.

Need for a command-line tool for HTML indexing
We present just such a tool in Chapter 7, in section 7.4.2. The chapter also includes a whole mini-framework for indexing other file types (XML, Word, PDF, etc.).

7.4.3 : Using NekoHTML

starts on page 245 under section 7.4 (Indexing an HTML document) in chapter 7 (Parsing common document formats)

.../~andyc/neko/doc/index.html. Listing 7.8 shows our DocumentHandler implementation based on NekoHTML...

index

starts on page 416

... Lucene 391 C Almaer, Dion 371 building Sandbox 310 alternative spellings 354 indexing a fileset 284 C++ 10 analysis 103 Antiword 264 CachingWrappingFilter during indexing 105 ANTLR ... C. 26 supported platforms 314 Dutch 282 Berkeley DB, storing Unicode support 316 field types 105 index ... of Lucene 352 indexing 365 injecting synonyms 129, 296 BooleanQuery 85 command-line interface 269 SimpleAnalyzer 108 from QueryParser 72, 87 compound index Snowball 283 n-gram extension 358 creating...

7.2.2 : Parsing and indexing using Digester

starts on page 230 under section 7.2 (Indexing XML) in chapter 7 (Parsing common document formats)

Memory leak in custom sort code

Brian Riddle e-mailed us quite a detailed errata item, and with his permission I'm posting the e-mail in its entirety in order to preserve the details:

Hello,
First, *huge* thanks for your book Lucene in Action. Between it and the Lucene developer and user mailing lists, I have been able to give our site a much better search infrastructure.
In the last phase of rolling out our new search system we discovered a memory leak in listing 6.2, DistanceComparatorSource. I used that code as a base for a modified integer sort. That was in and of itself pretty straightforward. But the problem was that there were no equals and hashCode methods. That means that equals and hashCode are inherited from Object for DistanceScoreDocLookupComparator.
And therein lies the memory leak. Every time a new DistanceComparatorSource was retrieved, it failed to find the cached ScoreDocComparator, so it added a new one to the cache of ScoreDocComparators kept by o.a.l.s.FieldCacheImpl. The fix was to add hashCode and equals methods to our ScoreDocComparator implementation.
The big clue came after using www.yourkit.com's profiler to see what was allocating so much memory and reading the last paragraph on page 199 a couple of times.
"The sorting infrastructure within Lucene caches (based on a key combining the hashcode of the indexReader, the field name, and the custom sort object) ..."
That sentence gave the clue as to what was happening, but it is also a little misleading. Looking at o.a.l.s.FieldCacheImpl, the index reader is used as the key for the internal WeakHashMap of the different Entry(fieldName, ScoreDocComparator) objects that are used in an application.
If implementations of ScoreDocComparator do not implement hashCode and equals correctly, every time they are used they will be added to the internal cache of field/comparators.
This was completely my fault, as I usually add them to every class I write, though not in this case. I hope you can add this to the errata for the current edition (as well as fix the code) and expand on this in the second edition so others won't be bitten by this bug.
Thanks again for the book, you guys *rock*.
PS. We are using lucene-1.4.3.jar / JSDK 1.4.2 & JRE 1.5 on Solaris and Linux
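
The failure mode Brian describes can be reproduced without Lucene. The sketch below is only an illustration under stated assumptions, not Lucene's actual code: `Entry`, `LeakySource`, and `FixedSource` are hypothetical stand-ins for the cache key and the custom sort object, and a plain `HashMap` stands in for the per-reader cache. It shows that a key whose custom sort object inherits `equals` and `hashCode` from `Object` never matches an earlier entry, so the cache grows with every search.

```java
import java.util.HashMap;
import java.util.Map;

class ComparatorCacheDemo {

    // Hypothetical custom sort object WITHOUT equals/hashCode:
    // two instances built from the same coordinates are never equal.
    static final class LeakySource {
        final int x, y;
        LeakySource(int x, int y) { this.x = x; this.y = y; }
    }

    // The same object WITH value-based equals/hashCode.
    static final class FixedSource {
        final int x, y;
        FixedSource(int x, int y) { this.x = x; this.y = y; }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FixedSource)) return false;
            FixedSource other = (FixedSource) o;
            return x == other.x && y == other.y;
        }

        @Override public int hashCode() { return 31 * x + y; }
    }

    // Hypothetical cache key combining a field name and the custom sort object.
    static final class Entry {
        final String field;
        final Object custom;
        Entry(String field, Object custom) { this.field = field; this.custom = custom; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Entry)) return false;
            Entry other = (Entry) o;
            // Delegates to the custom object's equals; identity-based if not overridden.
            return field.equals(other.field) && custom.equals(other.custom);
        }

        @Override public int hashCode() { return field.hashCode() ^ custom.hashCode(); }
    }

    // Simulates repeated searches, each creating a fresh custom sort object
    // for the same origin, and returns how many entries the cache ends up holding.
    static int cacheSizeAfterSearches(int searches, boolean fixed) {
        Map<Entry, Object> cache = new HashMap<>();
        for (int i = 0; i < searches; i++) {
            Object source = fixed ? new FixedSource(10, 10) : new LeakySource(10, 10);
            Entry key = new Entry("location", source);
            if (!cache.containsKey(key)) {
                cache.put(key, new Object()); // placeholder for the cached comparator
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("without equals/hashCode: " + cacheSizeAfterSearches(1000, false));
        System.out.println("with equals/hashCode:    " + cacheSizeAfterSearches(1000, true));
    }
}
```

With the identity-based object the simulated cache holds one entry per search; with value-based `equals` and `hashCode` it holds a single entry no matter how many searches run.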

Appendix A : Installing Lucene

starts on page 388

... a substantial amount of documentation, including Javadocs. The root of the documentation is docs/index ... The command-line Lucene demo consists of two command-line programs: one that indexes a directory tree of files ... adding docs/whoweare.html 9454 total milliseconds This command indexes the entire docs directory tree (339 files in our case) into an index stored in the index subdirectory of the location where you executed the command. NOTE Literally every file in the docs directory tree is indexed, including .gif...

preface

starts on page xix

...From Erik Hatcher I've been intrigued with searching and indexing from the early days ... Index Server, Active Server Pages, and a third COM component for image manipulation. At the time ... from a custom Ant task, <index>, we created that indexes files during the build process using ... .blogscene.org/erik). I run an Ant build process, after creating a blog entry, which indexes new ... , and a Lucene index, allowing for rich queries, even syndication of queries. Compared to other blogging...

Appendix C : Resources

starts on page 409

.../200x/2003/04/26/UTF Green, Dale, "Trail: Internationalization," http://java.sun.com/docs/books/tutorial/i18n/index ... .org/bugzilla/show_bug.cgi?id=26763 JTextCat 0.1, http://www.jedi.be/JTextCat/index.html NGramJ, http ... /out/lsa_explanation.htm "Latent Semantic Indexing (LSI)," http://www.cs.utk.edu/~lsi/ Stata, Raymie, Krishna Bharat, and Farzin Maghoul, "The Term Vector Database: Fast Access to Indexing Terms for Web ... for Dynamic Inverted Index Maintenance," coauthored with J. Pedersen, Proceedings of SIGIR...

about this book

starts on page xxv

...'s primary competition. Without wasting any time, we immediately build simple indexing and searching ... indexing operations. We describe the various field types and techniques for indexing numbers and dates. Tuning the indexing process, optimizing an index, and how to deal with thread-safety ... human-entered query expressions. Chapter 4 delves deep into the heart of Lucene's indexing magic ... 's built-in support for querying multiple indexes, even in parallel and remotely. Chapter 6 goes well...