lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: How to start with Lucene 4.6.1
Date Tue, 09 Jul 2013 00:26:45 GMT
Well ... at a high level, this is what you should do:


   1. Integrate with Apache Tika for parsing the .DOC files (and maybe
   other office files you have)
   2. Tika extracts the contents of the document, as well as some metadata
   3. Create a Lucene Document object to which you add Fields:
      1. TextField for e.g. the "content" field
      2. StringField for e.g. the path to the document on the file system
      3. NumericDocValuesField for e.g. the document's modification date
      4. Perhaps another StringField for the document's type (Word,
      PowerPoint)
   4. Index these documents with IndexWriter
   5. Search using IndexSearcher
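
To make steps 3 to 5 concrete, here is a rough sketch, assuming Lucene
4.3.x with lucene-core and lucene-analyzers-common on the classpath. The
class name, field names, sample paths and sample text are made up for
illustration. The text is hardcoded where the Tika output would normally go
(for example the String returned by Tika's parseToString), and an in-memory
RAMDirectory stands in for an on-disk index:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical class name; not part of Lucene.
public class LuceneSketch {

    // Step 3: build one Lucene Document from text that was already extracted.
    static Document toDocument(String path, String content, String type, long modified) {
        Document doc = new Document();
        // Analyzed full text: searchable, but not stored in the index.
        doc.add(new TextField("content", content, Field.Store.NO));
        // Indexed as a single token and stored, so it can be shown in results.
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new StringField("type", type, Field.Store.YES));
        // Per-document numeric value, e.g. modification time in milliseconds.
        doc.add(new NumericDocValuesField("modified", modified));
        return doc;
    }

    // Steps 4 and 5: index two sample documents, then return the stored
    // paths of the documents whose "content" field contains the given term.
    static List<String> indexAndSearch(String term) throws IOException {
        Directory dir = new RAMDirectory(); // assumption: in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
        IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_43, analyzer));
        writer.addDocument(
                toDocument("/docs/a.doc", "Getting started with Lucene", "Word", 1373328000000L));
        writer.addDocument(
                toDocument("/docs/b.doc", "Xin chào Việt Nam", "Word", 1373328000000L));
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            // StandardAnalyzer lowercases tokens, so the term must be lowercase too.
            TopDocs hits = searcher.search(new TermQuery(new Term("content", term)), 10);
            List<String> paths = new ArrayList<String>();
            for (ScoreDoc sd : hits.scoreDocs) {
                paths.add(searcher.doc(sd.doc).get("path"));
            }
            return paths;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexAndSearch("lucene"));
    }
}
```

For a real index you would probably swap RAMDirectory for FSDirectory and
build the query with a QueryParser instead of a raw TermQuery, so that user
input goes through the same analyzer as the indexed text.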

I'm sure there are a lot of Lucene tutorials around, for example:
http://www.lucenetutorial.com/lucene-in-5-minutes.html. It covers pretty
much what I've mentioned above.

From there, you can expand to add search results highlighting (summaries /
snippets) using e.g. PostingsHighlighter, faceted search using Lucene
facets, spelling correction and more.

Also, are you aware of Solr, a search engine developed on top of Lucene?
It takes care of all that for you, and has some pretty good tutorials and
documentation.
If you're not aiming to do something very challenging with these documents,
I think Solr can help you set up search very quickly, without writing any
code.

Shai


On Tue, Jul 9, 2013 at 2:44 AM, Vinh Dang <dqvinh87@gmail.com> wrote:

> Sorry for my typo,
>
> I mean Lucene 4.3.1,
>
> Thank Beale from US for that :)
>
> ---
> Best Regards
> Vinh Dang
> dqvinh87@gmail.com
>
>
>
>
> On Jul 8, 2013, at 9:46 PM, Vinh Dang <dqvinh87@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I am very new in Lucene, so please forgive me if my question is quite
> > stupid.
> >
> > I spent a whole day to google how to start with Lucene 4.6.1, but
> > failed. I found some clear tutorials, but they were written for too old
> > Lucene versions (almost 2).
> >
> > My tasks are:
> > I have a folder which contains multiple .DOC files, with Unicode
> > characters (actually, they are Vietnamese characters).
> > I want to index this folder with Lucene (4.6.1 is the best, but another
> > versions is OK).
> >
> > Could you give a point to start?
> >
> > Thank you very much,
> >
> > ---
> > Best Regards
> > Vinh Dang
> > dqvinh87@gmail.com
> >
> >
> >
> >
>
>
