lucene-solr-user mailing list archives

From Alexandre Rafalovitch
Subject Re: Can Apache Solr Handle TeraByte Large Data
Date Mon, 03 Aug 2015 18:59:37 GMT
Just to reconfirm, are you indexing file content? Because if you are,
you need to be aware that most PDFs do not extract well, as they do
not preserve text flow.

If you are indexing PDF files, I would run a sample through Tika
directly (that's what Solr uses under the covers anyway) and see what
the output looks like.

Apart from that, either SolrJ or DIH would work. If this is for a
production system, I'd use SolrJ with client-side Tika parsing. But
you could use DIH for a quick test run.
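
For the SolrJ route with filename-only indexing, the client-side step is
mostly string splitting. Below is a minimal sketch in plain Java, assuming
the extension should be dropped and every underscore-separated part should
become its own field. The class name FilenameFields and the field names
"id" and "part_N" are hypothetical placeholders, not anything from your
schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilenameFields {

    // Strip the extension (if any) and split the base name on underscores.
    static List<String> split(String filename) {
        int dot = filename.lastIndexOf('.');
        String base = dot > 0 ? filename.substring(0, dot) : filename;
        return Arrays.asList(base.split("_"));
    }

    // Build a field map: the full filename as "id", then part_0, part_1, ...
    // These field names are assumptions; use whatever the schema defines.
    static Map<String, String> toFields(String filename) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", filename);
        List<String> parts = split(filename);
        for (int i = 0; i < parts.size(); i++) {
            fields.put("part_" + i, parts.get(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toFields("ARIA_SSN10_0007_LOCATION_0000129.pdf"));
    }
}
```

With SolrJ, each entry of that map would typically be copied into a
SolrInputDocument via addField and the documents sent in batches. That
part is left out here so the sketch runs without any Solr dependency.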


On 3 August 2015 at 13:56, Mugeesh Husain wrote:
> Hi Alexandre,
> I have 40 million files stored in a file system.
> The filenames look like ARIA_SSN10_0007_LOCATION_0000129.pdf
> 1.) I have to split each filename on underscores and index the
> resulting values into Solr.
> 2.) I do not need to index the file contents (text).
> You told me "The answer is Yes", but I didn't understand in what sense you meant Yes.
> Thanks
