lucene-solr-user mailing list archives

From Yavar Husain <yavarhus...@gmail.com>
Subject Pattern for extracting text from a rich document and an associated metadata file
Date Wed, 04 Mar 2015 10:04:50 GMT
What is the best pattern to index the following kind of data:

HarryPotter.PDF
HarryPotter.txt

Avengers.Docx
Avengers.txt

For each of the files above, the metadata lives in a text file that has the
same name as the rich document (as can be seen above).
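
To make the naming convention concrete, here is a minimal sketch that pairs
each rich document with its metadata file by base name. The function name
pair_documents and the set of extensions are assumptions for illustration
only, not part of any Solr or Tika API.

```python
from pathlib import Path

# Extensions treated as rich documents; an assumption for this sketch.
RICH_EXTENSIONS = {".pdf", ".docx"}


def pair_documents(directory):
    """Return (rich_document, metadata_file) pairs that share a base name."""
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        # Compare case-insensitively so HarryPotter.PDF and Avengers.Docx match.
        if path.suffix.lower() not in RICH_EXTENSIONS:
            continue
        metadata = path.with_suffix(".txt")
        # Skip rich documents that have no metadata file next to them.
        if metadata.is_file():
            pairs.append((path, metadata))
    return pairs
```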

(1) The brute-force method I can think of is to extract the text from the
rich document, extract the metadata from the associated txt file, combine
them into a single XML document, and send that to Solr for indexing.
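
A rough sketch of the combining step in option (1), assuming the text has
already been extracted from the rich document and that the metadata file
holds one "key: value" pair per line (the actual layout of the txt files is
not specified here). The field names id and content, and the helper names
parse_metadata and build_add_xml, are hypothetical and would have to match
the Solr schema in use.

```python
import xml.etree.ElementTree as ET


def parse_metadata(text):
    """Parse 'key: value' lines into a dict; blank or malformed lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def build_add_xml(doc_id, body_text, metadata):
    """Build a Solr XML update message (<add><doc>...</doc></add>) as a string."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" and "content" are assumed field names; metadata keys become fields as-is.
    for name, value in [("id", doc_id), ("content", body_text), *metadata.items()]:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on serialization
    return ET.tostring(add, encoding="unicode")
```

The resulting string could then be POSTed to the core's update handler with
a Content-Type of text/xml.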

(2) Another option is to use SolrJ to programmatically read the PDF and the
txt file and send both to Solr. In that case, is it possible to send the PDF
directly to Solr without having to extract the text first in my SolrJ
program?
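
One possibility along the lines of option (2), sketched under assumptions:
Solr's extracting request handler (Solr Cell) accepts literal.<fieldname>
request parameters, so the metadata could travel as parameters while the raw
PDF is sent as the request body and Tika extracts the text on the server.
The sketch below only builds the request URL. The handler path
/update/extract, the core URL, and the helper name build_extract_url are
assumptions that depend on the local solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, metadata):
    """Build an extract-handler URL that carries metadata as literal.* params."""
    # literal.id sets the document's unique key; "id" is an assumed field name.
    params = [("literal.id", doc_id)]
    # Each metadata key becomes a literal.<fieldname> parameter.
    params += [("literal." + name, value) for name, value in metadata.items()]
    params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)
```

The rich document itself would be POSTed unmodified to this URL. From SolrJ,
the same idea would presumably map to a ContentStreamUpdateRequest carrying
the file plus the literal.* parameters.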

Is there something better that I can do quickly? I know that if I had only
rich documents, I would use the Tika-Solr integration (request handlers) to
do the job.

Any help would be appreciated.

Thanks,
Yavar
