Lucene Tutorial

By Steven J. Owens

Jakarta Lucene (http://jakarta.apache.org/lucene/) is a high-performance, full-featured, open-source text search engine API, written in Java by Doug Cutting.

Note that Lucene is specifically an API, not an application. This means that all the hard parts have been done, but the easy programming has been left to you. The payoff is that, unlike with a typical search engine application, you spend less time wading through tons of options and instead build a search application that is specifically suited to what you're doing. Lucene is startlingly easy to develop with and use.

I'm going to assume that you're a reasonably competent programmer and that you're comfortable with Java.

Use the Source, Luke

This tutorial is a brief overview; the Lucene distribution comes with four example classes:

FileDocument
IndexFiles
SearchFiles
DeleteFiles

These classes are really a good introduction to how to use Lucene. I wrote this tutorial because I find it easier to follow code if I have a general idea of what's going on, but it was tricky to write because it starts to look like the source code. Lucene really does make it that easy.

Overview

I'm going to try to use emphasis tags any time I introduce a Lucene API class name.

Here's a simple attempt to diagram how the Lucene classes go together:

Index
    Document 1
        Field A (name/value)
        Field B (name/value)
    Document 2
        Field A (name/value)
        Field B (name/value)

At the heart of Lucene is an Index. The index usually lives in a filesystem directory that contains a set of files with a particular structure, but it doesn't absolutely have to be a directory.

You pump data into the Index, then do searches on the Index to get results out. To build the Index, you use an IndexWriter object. To run a search on the Index you use an IndexSearcher object.

The search itself is a Query object, which you pass into IndexSearcher.search(). IndexSearcher.search() returns a Hits object, which contains a Vector of Document objects.

Document objects are stored in the Index, but they have to be put into the Index at some point, and that's your job. You have to select what data to enter in, and convert them into Documents. You read in each data file (or database entry, or whatever), instantiate a Document for it, break down the data into chunks and store the chunks in the Document as Field objects (a name/value pair). When you're done building a Document, you write it to the Index using the IndexWriter.

Queries can be quite complicated, so Lucene includes a tool to help generate Query objects, called a QueryParser. The QueryParser takes a query string, much like what you'd put into an Internet search engine, and generates a Query object.

Note: There's a gotcha that often pops up, so even though it's a lower-level detail, I'm going to mention it here. It's the Analyzer. Lucene indexes text, and part of the first step is cleaning up the text. You use an Analyzer to do this - it drops out punctuation and commonly occurring but meaningless words (the, a, an, etc). Lucene provides a couple of different Analyzers, and you can make your own, but the BIG GOTCHA people keep running into is that you must use the same sort of Analyzer for both indexing and searching. You must feed the same sort of Analyzer to the QueryParser that you originally fed to the IndexWriter.

Moving on... did you notice what's not in the above? Lucene handles the indexing, searching and retrieving, but it doesn't handle:

managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
selecting the data files
parsing the data files
getting the search string from the user
displaying the search results to the user

Those are all your job. There are some helpful tools and some good examples available in the Lucene contrib space, but generally Lucene is focused on doing the indexing and searching, and leaves all of the rest up to you (so you can make exactly the search solution you want).

I'm going to assume that typical uses for Lucene are either command-line driven, or web-driven. The example code I mentioned above is for a command-line driven searchable recipe database. Someday I'm going to build an example of how to make a web-driven Lucene application and add it to this tutorial.

Don't Get Clever

You'll notice, as we get into this, a common theme. You'll notice the same theme if you hang out on the lucene-user list and listen to Doug Cutting answering questions. That theme is: don't get clever. All the cleverness you'll ever need has been put into really, really fast indexing and searching. This isn't to say it's always best to use brute force, but in Lucene, if there's a simple way to do it, that way probably makes the most sense. Remember Knuth: "premature optimization is the root of all evil."

Indexing Or Searching

At the top, you're either pumping data into your search application (indexing) or pulling data out of it (searching).

I'm going to go over these classes in more or less the order you'd encounter them by going through the sample source files. Well, to be exact, I'm going to go through them in the order the data would go through them, in going from an input file to the output of a search request.

If you're not sure you're ready to dive into this depth, take a look at my not-so-nitty-gritty overview.

Indexing In Depth

You index by creating Documents full of Fields (which contain name/value pairs) and pumping them into an IndexWriter, which parses the contents of the Field values into tokens and creates an index.
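
To make that flow a little more concrete, here is a toy sketch in plain Java. It is not Lucene's code and it uses none of Lucene's classes; ToyIndex and its methods are names I made up for illustration. It only shows the general idea: documents are sets of name/value fields, each value gets broken into tokens, and the index maps each token back to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index. NOT Lucene's implementation, just the core idea.
final class ToyIndex {
    // Each document is a set of name/value fields.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // "field:token" -> ids of the documents whose field contains that token.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document and returns its id (its position in the list).
    int addDocument(Map<String, String> fields) {
        int id = documents.size();
        documents.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Crude tokenizing: lower-case, then split on non-word characters.
            for (String token : field.getValue().toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String key = field.getKey() + ":" + token;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the ids of documents whose given field contains the term.
    List<Integer> search(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    // Looks up a field value of a previously added document.
    String get(int id, String field) {
        return documents.get(id).get(field);
    }
}
```

You would add a map with, say, "path" and "contents" entries, then call search("contents", "dinner") and use get(id, "path") to find the original file. Lucene does far more than this (scoring, analyzers, on-disk storage), but the shape is similar.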

Document Objects

Lucene doesn't index files, it indexes Document objects. To index and then search files, you first need to write code that converts your files into Document objects.

A Document object is a collection of Field objects (name/value pairs). So, for each file, instantiate a Document, then populate it with Fields.

This is the first potentially tricky bit, depending on what kind of files you're indexing, how much the data in those files is structured, and how much of that structure you want to preserve. Lucene just handles name/value pairs. Email, for example, is mostly name/value oriented:

to: fred
from: barney
subject: dinner?
body: Let's get together for dinner tonight!

For more complex files, you have to "flatten" that structure out into a set of name/value fields.

By the way, I'm saying "files" here, but the data source could really be anything - chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file.
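
As a rough illustration of that conversion step, here is a small plain-Java sketch that turns the email-style text above into name/value pairs. EmailFlattener is a hypothetical helper of mine, not part of Lucene, and it assumes every field sits on one line in "name: value" form; real email parsing is messier than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: flattens "name: value" lines into a field map.
final class EmailFlattener {
    static Map<String, String> toFields(String message) {
        // LinkedHashMap keeps the fields in the order they appeared.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : message.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a name/value line, skip it
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(name, value);
        }
        return fields;
    }
}
```

Each entry in the resulting map is what you would then turn into one Field on the Document.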

A minimum, as in the standard Lucene examples, would be:

A field containing...	Which you'll use to...
the path to the original document	actually show the user the original document after the search
a modification date	compare against the original file's modification date, to see if the file needs to be reindexed.
the contents of the file	run the search against

Note: This is an example, not a requirement. For example, if you don't have a modification date, don't sweat it, you just have to reindex all of your files every time (and in fact, that's the standard recommended approach for reindexing, under the "don't get clever" rule of thumb).

The All Field

You also ought to really think about glomming all of the Field data together and storing it as some sort of "all" Field. This is the easiest way to set it up so your users can search all Fields at once, if they want. Yes, you could come up with a complex scheme to rewrite your user's query so it searches across all of the known fields, but remember, "don't get clever", keep it simple.
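
Building the "all" value can be as simple as the following plain-Java sketch. The class name, the method name and the field name "all" are my own choices for illustration; Lucene doesn't require any of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: glom every field value together into one extra "all" field.
final class AllField {
    static Map<String, String> withAllField(Map<String, String> fields) {
        StringBuilder all = new StringBuilder();
        // Values are joined in the map's iteration order, separated by spaces.
        for (String value : fields.values()) {
            if (all.length() > 0) {
                all.append(' ');
            }
            all.append(value);
        }
        // Copy the input so the caller's map is left untouched.
        Map<String, String> result = new LinkedHashMap<>(fields);
        result.put("all", all.toString());
        return result;
    }
}
```

You would then index the "all" value as one more tokenized field, and use "all" as the default field when searching.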

Digression: Field Objects

A Field object contains a name (a String) and a value (a String or a Reader), and three booleans that control whether or not the value will be indexed for searches, tokenized prior to indexing, and stored in the index so it can be returned with the search.

Let me explain those three booleans a bit more.

Indexed for searches - sometimes you'll want to have fields available in your Documents that don't really have anything to do with searching. Two examples I can think of off the top of my head are creation dates and file names, so you can compare when the Document was created against the file modification date, and decide if the document needs to be reindexed. Since these fields won't ever make sense to use in an actual search, you can decrease the amount of work Lucene does by marking them as not indexed for searches.
Tokenized prior to indexing - tokenizing refers to taking a piece of text and cleaning it up, and breaking it down into individual pieces (tokens) for the indexer. This is done by the Analyzer. Some fields you may not want to be tokenized, for example a serial number field.
Stored in the index - even if a field is entirely indexed, it doesn't necessarily mean that it'll be easy for Lucene to reconstruct it. Although Lucene is a search index, and not a database, if your fields are reasonably small, you can ask Lucene to store them in the index. With the fields stored in the index, instead of using the Document to locate the original file or data and load it, you can actually pull the data out of the Document. This works best with fairly small fields and documents that you'd need to parse for display anyway.
Some fields contain bulk data and are so large that you don't really want to store them in the index. You can still index their contents by giving the Field a Reader instead of a String. Lucene reads the text from the Reader and indexes it but does not store it, so keep the filename (or some other locator) in a separate stored field and use that to load the data when you display it to the user.

The Field class itself is pretty simple; it pretty much consists of the instance variables of the field, accessor methods for those instance variables, a toString() method, and a normal constructor. The only special part is several convenient static factory methods for manufacturing fields. These factory methods build Fields that are appropriate for several typical uses. I've listed them in order of how often they'd likely be used (in my unqualified opinion):

(Note: Yes, these method names are capitalized; if I had to guess, I'd say it's probably because they're factory methods - they instantiate and return Field objects with particular parameters.)

Factory Method	Tokenized	Indexed	Stored	Use for
Field.Text(String name, String value)	Yes	Yes	Yes	contents you want stored
Field.Text(String name, Reader value)	Yes	Yes	No	contents you don't want stored
Field.Keyword(String name, String value)	No	Yes	Yes	values you don't want broken down
Field.UnIndexed(String name, String value)	No	No	Yes	values you don't want indexed
Field.UnStored(String name, String value)	Yes	Yes	No	values you don't want stored
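
If it helps to see the table as code, here is a toy class that mirrors it. To be clear, ToyField is my own stand-in and not Lucene's Field class; I've also used ordinary lower-case method names. It only shows how each factory method amounts to a fixed combination of the three booleans.

```java
// Toy stand-in for the table above. NOT Lucene's Field class.
final class ToyField {
    final String name;
    final String value;
    final boolean tokenized;
    final boolean indexed;
    final boolean stored;

    private ToyField(String name, String value,
                     boolean tokenized, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    // Contents you want searchable and returned with the hit.
    static ToyField text(String name, String value) {
        return new ToyField(name, value, true, true, true);
    }

    // Values searched as a single unit, such as a serial number.
    static ToyField keyword(String name, String value) {
        return new ToyField(name, value, false, true, true);
    }

    // Values carried along with the document but never searched.
    static ToyField unIndexed(String name, String value) {
        return new ToyField(name, value, false, false, true);
    }

    // Values that are searchable but not kept in the index.
    static ToyField unStored(String name, String value) {
        return new ToyField(name, value, true, true, false);
    }
}
```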

IndexWriter

The IndexWriter's job is to take the input (a Document), feed it through the Analyzer you instantiate it with, and create an index. Using the IndexWriter itself is fairly simple. You instantiate it with parameters for where to put the index files and the Analyzer you want it to use for cleaning up the tokens. Then feed Documents into IndexWriter.addDocument(). The actual index is a set of data files that the IndexWriter creates in a location defined (depending on how you instantiate the IndexWriter) by a Lucene Directory object, a File, or a path string.

Directory Objects

You can also store the index in a Lucene Directory object. A Lucene Directory is an abstraction around the Java filesystem classes. Using a Directory lets the Lucene classes hide what exactly is going on. This in turn lets you do clever behind-the-scenes things like keeping the index in memory for really high performance by using the RAM-based Directory class (Lucene comes with two Directory classes, the file-based FSDirectory and the RAM-based RAMDirectory).

Analyzers and Tokenizers

The analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index.

Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace). The filters do any post-tokenizing work on the tokens (typically dropping out punctuation and commonly occurring words like "the", "an", "a", etc).

Lucene provides an Analyzer abstract class, and three implementations of Analyzer. Glossing over the details:

SimpleAnalyzer	SimpleAnalyzer seems to just use a Tokenizer that converts all of the input to lower case.
StopAnalyzer	StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in English that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words.
StandardAnalyzer	StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (e.g. "T.L.A." becomes "TLA").

These analyzers are designed for English text. There are several analyzers for other languages that have been developed by Lucene users. Check the Lucene Sandbox. If you can't find an analyzer for your language, it's pretty straightforward to implement your own. Use a SimpleAnalyzer for now, to learn how it works.
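
Here is a toy version of the tokenizer-plus-filters idea in plain Java. It is only a sketch of the concept and not how Lucene's analyzers are written; the class name, the tiny stop word list, and the rule of splitting on non-letters are all assumptions I picked to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: one tokenizer step followed by two filter steps.
final class ToyAnalyzer {
    // A deliberately tiny stop word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // Tokenizer: break the text wherever there is a run of non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Filters: lower-case each token, then drop the stop words.
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }
}
```

Feeding it "The quick, brown Fox and a dog." gives back the tokens quick, brown, fox, and, dog.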

Searching In Depth

To actually do the search, you need an IndexSearcher, but we'll get to that in a moment; before you can even think about feeding the IndexSearcher a query, you have to have a Query object. The IndexSearcher does the actual munging through the index, but it only understands Query objects.

Query and QueryParser Objects

You produce the Query object by feeding the user's argument string into QueryParser.parse(), along with a string for the default field to search (if the user doesn't specify which field to search) and an Analyzer. The Analyzer is what QueryParser uses to tokenize the argument string. (Gotcha Warning: remember, again, you have to make sure that you use the same flavor Analyzer for tokenizing the argument string as you used for tokenizing the Index. StopAnalyzer is probably a safe choice for this, since that's the one used in the example code.) QueryParser.parse() returns a Query.
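
The gotcha is easier to believe once you've seen it fail. The following plain-Java sketch fakes it with two made-up "analyzers" (simple functions, nothing to do with Lucene's Analyzer class): text is indexed with one that lower-cases, then queried once with the same one and once with one that preserves case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Sketch of the analyzer mismatch problem, using toy analyzers.
final class AnalyzerMismatch {
    // Toy analyzer that lower-cases and splits on whitespace.
    static final Function<String, List<String>> LOWERCASING =
            text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));

    // Toy analyzer that splits on whitespace but keeps the case.
    static final Function<String, List<String>> CASE_PRESERVING =
            text -> Arrays.asList(text.split("\\s+"));

    // "Indexing": remember the set of terms the analyzer produced.
    static Set<String> index(String text, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(text));
    }

    // "Searching": every analyzed query term must be among the indexed terms.
    static boolean matches(Set<String> indexedTerms, String query,
                           Function<String, List<String>> analyzer) {
        return indexedTerms.containsAll(analyzer.apply(query));
    }
}
```

Index "Dinner Tonight" with LOWERCASING and the stored terms are "dinner" and "tonight". Query "Dinner" through LOWERCASING and it matches; query it through CASE_PRESERVING and the term "Dinner" is simply not in the index, so nothing comes back.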

QueryParser has a static version of parse(), which I guess is there for convenience. You can instantiate a QueryParser with an Analyzer and default field String and keep it around. However, note that QueryParser is not thread-safe, so each thread will need its own QueryParser.

Digression: Thread Safety

Doug Cutting has posted on the topic of thread safety a couple of times. Indexing and searching are not only thread safe, but process safe. What this means is that:

Multiple index searchers can read the Lucene index files at the same time.
An index writer or reader can edit the Lucene index files while searches are ongoing.
Multiple index writers or readers can try to edit the Lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock).

However, the query parser is not thread safe, so each thread using the index should have its own query parser.

The index writer, however, is thread safe, so you can update the index while people are searching it. You then have to make sure that the threads with open index searchers close them and open new ones, to get the newly updated data.
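
One common Java way to give each thread its own copy of a non-thread-safe object is a ThreadLocal. Here's a minimal sketch under that assumption. ToyParser is a hypothetical stand-in with some mutable state, not Lucene's QueryParser; in a real application you would create your query parser in its place.

```java
import java.util.Locale;

// Sketch: one parser instance per thread, via ThreadLocal.
final class PerThreadParser {
    // Hypothetical stand-in for a parser that is not thread safe.
    static final class ToyParser {
        // Mutable scratch state is what makes sharing across threads unsafe.
        private final StringBuilder buffer = new StringBuilder();

        String parse(String query) {
            buffer.setLength(0);
            buffer.append(query.trim().toLowerCase(Locale.ROOT));
            return buffer.toString();
        }
    }

    // Each thread lazily gets its own ToyParser on first use.
    private static final ThreadLocal<ToyParser> PARSER =
            ThreadLocal.withInitial(ToyParser::new);

    static ToyParser parser() {
        return PARSER.get();
    }
}
```

Calling PerThreadParser.parser() repeatedly on one thread returns the same instance, while another thread gets a different one, so no two threads ever share parser state.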

IndexSearchers

To get an IndexSearcher you simply instantiate an IndexSearcher with a single argument that tells Lucene where to find an existing index. The argument is either of these two:

a string containing a path to the index directory,
a Lucene Directory object (see the section about Directory objects under "Indexing In Depth", above)

Digression: IndexReaders

(You can safely skip this section, as it's just me meandering through the Lucene source code; not a whole lot of practical value here yet).

There's actually a third option for instantiating an IndexSearcher; you can instantiate it with any class that is a concrete subclass of the abstract class IndexReader.

This makes more sense if you take a peek at the code for IndexSearcher. The other two constructors just turn your file path or Directory object into an IndexReader by calling the static method IndexReader.open(). Just for kicks, let's do a little more digging and see that IndexReader.open() takes either a String file path or a java File object and uses them to instantiate a Lucene Directory object, then calls open(Directory).

NOTE: I have to admit, I'm a little confused at this point, since the API docs say IndexReader is abstract (which means it can't be instantiated). Presumably that means IndexReader.open(), a static factory method, instantiates an appropriate concrete subclass of IndexReader and returns it. However, the API docs don't show any concrete subclasses of IndexReader. Since I'm too lazy at the moment to look through the source... oh, all right, I'm not too lazy to look through the source. Hm. It appears the API docs are out of date, the com/Lucene/index directory appears to contain a SegmentReader, which IndexReader.open() uses.

Multiple Indexes

If you're searching a single index, you use an IndexSearcher with a single index. If you need to search across multiple indexes, you instantiate one IndexSearcher per index, create an array, stick the IndexSearcher instances in the array, and instantiate a MultiSearcher with the array as an argument.

Doing The Search

To actually do the search, you take the argument string the user enters, pass it to a QueryParser and get back a parsed Query object. Remember (third time's the charm) to use the right kind of Analyzer when you instantiate the QueryParser: use the same sort of Analyzer that you used when you built the index, because the QueryParser will use it to tokenize the argument string.

Then you feed the parsed Query to the IndexSearcher.search(). The return is a Hits object, which is a collection of Document objects for documents that matched the search parameters. The Hits object also includes a score for each Document, indicating how well it matched.

Hits

IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector, containing a ranked list of Lucene Document objects. These are the same Document objects you fed into the IndexWriter, but specifically the ones that matched your search. Now you need to format the hits for a display, or manufacture HREFs pointing to the original documents, or whatever you were basically planning to do with the search results.
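
To give a feel for what a ranked list of hits looks like, here is a toy sketch in plain Java. The scoring here is just "how many times does the term occur", which is my simplification and not Lucene's actual scoring; ToyHits and Hit are made-up names as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy ranked results: score = number of occurrences of the term.
final class ToyHits {
    static final class Hit {
        final String path;
        final int score;

        Hit(String path, int score) {
            this.path = path;
            this.score = score;
        }
    }

    // Takes a map of path -> contents and returns matches, best first.
    static List<Hit> rank(Map<String, String> contentsByPath, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentsByPath.entrySet()) {
            int count = 0;
            for (String token : entry.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.equals(wanted)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(entry.getKey(), count));
            }
        }
        // Highest score first; ties are broken alphabetically by path.
        hits.sort((x, y) -> x.score != y.score
                ? Integer.compare(y.score, x.score)
                : x.path.compareTo(y.path));
        return hits;
    }
}
```

Your display code would then walk the list in order, using each hit's path to build a link or load the original document.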

What's Not Mentioned Here

There are classes in the Lucene project that didn't get mentioned here, or only got mentioned in passing. After all, the point of a tutorial is as much what NOT to tell you (yet) as what to tell you. Otherwise I'd just say Use The Source, Luke.

I highly recommend sitting down with this tutorial and following through the source of the demo classes first. Then, go back and do it again, only this time when the demo class does something with a Lucene class, go look at the source of the Lucene class and see what it's doing. Not only is this a good way to learn about Lucene, it's an excellent way to learn more about programming.

Someday To Come

Next we'll go through this process again, and actually build an example program to index some files and then do searches against that index.

After that, we'll actually build a basic web search engine, using servlets and JSP. We've already seen that Lucene is a piece of cake to use, and the servlet/JSP stuff isn't much harder (unless you want to make it harder, which of course is possible to do). This will also introduce the whole question of multithreading Lucene. Fortunately, Lucene makes this really, really easy, because most - or all - of the key Lucene classes are thread-safe.