lucene-solr-user mailing list archives

From <>
Subject PDF search functionality using Solr
Date Tue, 06 Jan 2015 17:30:36 GMT
Hello Solr users and developers,
Could you please advise on the following:

1.       What should I do to index PDF content column-wise, that is, field by field?

2.       Do I need to extract the contents using some combination of Analyzer, Tokenizer and
Filter and then add them to the index? How can I test the results from the command prompt? I do
not know which specific Analyzer, Tokenizer and Filter to select for this purpose.

3.       How can I verify that the needed column info is extracted from the PDF and indexed?

4.       For example, how can I verify that the ticket number is extracted into the
Ticket_number tag and indexed?

5.       Is it OK to post 4 GB worth of PDF content to be imported and indexed by Solr? I think
I saw some posts complaining about limits on how much can be posted.

6.       What will enable Solr to search across all of the PDFs for different words such as
"Runtime", "Error", "XXXX", so that the results provide a link to the matching PDF?

My PDFs contain nothing but Jira ticket system data.
Each PDF has info on
Ticket Number:
And so on:
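
As an illustration of the column-wise extraction asked about above, here is a minimal sketch. It assumes the PDF text has already been extracted to plain text by some other tool, and that each field appears on its own line as "Label: value". Only the "Ticket Number" label and the Ticket_number field name come from this post; the function name, the mapping, and the sample value are hypothetical.

```python
import re

# Hypothetical mapping from labels found in the PDF text to index field names.
# Only "Ticket Number" -> "Ticket_number" comes from the post; extend as needed.
FIELD_LABELS = {"Ticket Number": "Ticket_number"}


def extract_fields(text, labels=FIELD_LABELS):
    """Return {field_name: value} for each 'Label: value' line found in text."""
    fields = {}
    for label, field in labels.items():
        # Label at the start of a line, a colon, then the value up to end of line.
        # [ \t]* is used instead of \s* so an empty value never swallows the next line.
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            fields[field] = match.group(1)
    return fields


# Made-up sample text standing in for the extracted contents of one PDF.
sample = "Ticket Number: ABC-123\nSummary: Runtime Error\n"
print(extract_fields(sample))
```

Printing the resulting dictionary for a few PDFs is one rough way to check, before anything is sent to Solr, that the ticket number is being picked up.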

1.       I imported a PDF document into Solr; it does the necessary searching, and I can test
some of it using the browse client interface provided.

2.       I have 80 GB worth of PDFs.

3.       The total number of PDFs is about 200.

4.       Many of the PDFs are 4 GB in size.

5.       How do you suggest I import such large PDFs? What tools can you suggest to first
extract the PDF contents into some XML format, so that the XML can later be posted to Solr for indexing?

Your early response is much appreciated.


