lucene-solr-user mailing list archives

From Peter Bleackley <bleackl...@zooey.co.uk>
Subject Using Solr Cell to index the internal structure of a PDF
Date Thu, 10 Oct 2013 09:57:47 GMT
I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can 
get Solr to ingest the entire document as one long string, stored in the 
index as "content". However, I want to index the structure within the documents.

I know that the ExtractingRequestHandler uses Apache Tika to convert the 
documents to XHTML. I've used the Tika GUI to look at the XHTML 
representation, and I can see that each page is represented as a <div> 
element, and that structure within pages is represented by <p> elements. 
How do I configure Solr to index documents at this level of granularity?
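
One possible direction, offered as a rough sketch rather than a confirmed Solr feature: obtain the XHTML outside the normal indexing path (for example from the Tika GUI, or from the ExtractingRequestHandler with extractOnly=true) and split it client-side into one record per paragraph before indexing. The code below assumes each page is a <div class="page"> containing <p> elements; the function name split_pages and the field names parent_id, page and paragraph are hypothetical and would need matching fields in the schema.

```python
import xml.etree.ElementTree as ET

# Tika's XHTML output uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_pages(xhtml, doc_id):
    """Return one dict per non-empty <p>, tagged with page and paragraph number.

    Assumption: every page is a <div class="page"> whose <p> children
    hold the text. Field names are illustrative, not a Solr convention.
    """
    root = ET.fromstring(xhtml)
    records = []
    pages = [d for d in root.iter(XHTML_NS + "div") if d.get("class") == "page"]
    for page_no, page in enumerate(pages, start=1):
        para_no = 0
        for p in page.iter(XHTML_NS + "p"):
            # Collapse line breaks and runs of whitespace inside the paragraph.
            text = " ".join("".join(p.itertext()).split())
            if not text:
                continue  # skip empty paragraphs
            para_no += 1
            records.append({
                "id": f"{doc_id}_p{page_no}_{para_no}",
                "parent_id": doc_id,
                "page": page_no,
                "paragraph": para_no,
                "content": text,
            })
    return records


# Hand-written sample standing in for real Tika output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First paragraph.</p><p>Second
paragraph.</p></div>
<div class="page"><p/><p>Only paragraph on page two.</p></div>
</body></html>"""

if __name__ == "__main__":
    for rec in split_pages(SAMPLE, "doc1"):
        print(rec)
```

Each resulting dict could then be sent to Solr as its own document, so that a search hit identifies the page and paragraph rather than the whole PDF. Whether that granularity suits the use case, and how the records relate back to the parent document, is a design choice this sketch does not settle.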

Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd
