lucene-solr-user mailing list archives

From Furkan KAMACI
Subject Re: Using Solr Cell to index the internal structure of a PDF
Date Thu, 10 Oct 2013 11:07:51 GMT
You can have a look here:

2013/10/10 Peter Bleackley

> I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can
> get Solr to ingest the entire document as one long string, stored in the
> index as "content". However, I want to index structure within the documents.
> I know that the ExtractingRequestHandler uses Apache Tika to convert the
> documents to XHTML. I've used the Tika GUI to look at the XHTML
> representation, and I can see that each page is represented as a <div>
> element, and that structure within pages is represented by <p> elements.
> How do I configure Solr to index documents at this level of granularity?
> Dr Peter J Bleackley
> Computational Linguistics Contractor
> Playful Technology Ltd
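
One way to approach the granularity question above, sketched outside of Solr Cell itself: take the XHTML that Tika produces, split it on the page <div> and <p> elements, and build one small document per paragraph. This is only a sketch under assumptions, not a description of how ExtractingRequestHandler is configured. It assumes the XHTML is well-formed, that page divisions carry class="page", and that the schema has fields matching the hypothetical names used here (id, parent_id, page, paragraph, content). The helper name paragraphs_to_docs and the sample input are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML elements (assumed to be declared on the root element).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def paragraphs_to_docs(xhtml, doc_id):
    """Build one dict per non-empty <p>, tagged with page and paragraph numbers.

    Field names are hypothetical; they would need to match the Solr schema.
    """
    root = ET.fromstring(xhtml)
    docs = []
    # Assumption: each page is a <div class="page">.
    pages = [d for d in root.iter(XHTML_NS + "div") if d.get("class") == "page"]
    for page_no, page in enumerate(pages, start=1):
        para_no = 0
        for p in page.iter(XHTML_NS + "p"):
            # Join nested text nodes and collapse runs of whitespace.
            text = " ".join("".join(p.itertext()).split())
            if not text:
                continue  # skip empty paragraphs
            para_no += 1
            docs.append({
                "id": f"{doc_id}_p{page_no}_{para_no}",
                "parent_id": doc_id,
                "page": page_no,
                "paragraph": para_no,
                "content": text,
            })
    return docs


# Made-up sample standing in for Tika's XHTML output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="page"><p>First paragraph.</p><p>Second
paragraph.</p></div>
<div class="page"><p/><p>Page two text.</p></div>
</body></html>"""

docs = paragraphs_to_docs(SAMPLE, "report")
for d in docs:
    print(d)
```

Each resulting dict could then be sent to Solr as a separate document through the regular update handler, so that a search hit identifies the page and paragraph rather than the whole PDF.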
