lucene-java-user mailing list archives

From "jim shirreffs" <>
Subject Indexing MSword Documents
Date Fri, 08 Jun 2007 17:23:27 GMT
I am trying to index msword documents. I've got things working but I do not 
think I am doing things properly.

To index msword docs I use an extractor to extract the text. Then I write 
the text to a .txt file and index that using an HTMLDocument object. Seems 
to me that since I have the text I should be able to just do a

        Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at it, it seems the field "content" requires a Reader. 
So I write a temporary file to satisfy that requirement.
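
If a Reader really is all that is required, one possibility is to wrap the extracted string in a java.io.StringReader instead of going through a file. Below is a minimal sketch using only the standard library; the class name StringReaderSketch, the helper readAll, and the sample text are made up for illustration, and the Lucene call itself is left out because I have not confirmed which signature applies.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StringReaderSketch {

    // Drains a Reader into a String; a stand-in for whatever code consumes the Reader.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by extractor.getText().
        String content = "text extracted from a Word document";
        // Wrap the in-memory string; no temporary .txt file is written.
        Reader reader = new StringReader(content);
        System.out.println(readAll(reader));
    }
}
```

Whether the resulting Reader can be handed straight to the "content" field is an assumption that would need checking against the Lucene version in use.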

What I would like to have is an MSWORDDocument class that would take the 
extracted text as an argument to the constructor and create a Document object 
that I could get.

If anyone has any ideas, please let me know.

Here is my code segment. Notice the msword hack,

// make a document
try {
   if (ftype.startsWith("text")) {
      doc = HTMLDocument.Document(f);
   } else if (ftype.equals("application/pdf")) {
      doc = LucenePDFDocument.getDocument(f);
   } else if (ftype.equals("application/msword")) {
      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
      WordExtractor extractor = new WordExtractor(fin);
      String content = extractor.getText();
      if (debug) System.out.println(content);
      // msword hack: write the text to a temp file so HTMLDocument can read it
      String tempFileName = f.getAbsolutePath() + ".txt";
      BufferedWriter bw = new BufferedWriter(
            new FileWriter(tempFileName /* second argument cut off in the archive */));
      bw.write(content);
      bw.close();   // flush the buffered text before the file is indexed
      File df = new File(tempFileName);
      doc = HTMLDocument.Document(df);
   } else if (ftype.equals("binary")) {
      return null;
   } else {
      if (debug) System.out.println("Unknown file type not ascii or pdf.");
      doc = HTMLDocument.Document(f);
   }
} catch (java.lang.InterruptedException ie) {
   throw ie;
} catch (java.io.IOException ioe) {
   throw ioe;
}

Thanks in advance
