lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context
Date Mon, 19 Jun 2017 11:40:22 GMT
> There is no standard across different types of docs as to what meta-data field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing.

On Tika, we _try_ to normalize fields according to various standards, the most predominant
is Dublin core, so that "author" in one format and "creator" in another will both be mapped
to "dc:creator".  That said:

1) there are plenty of areas where we could do a better job of normalizing.  Please let us
know how to improve!
2) no matter how well we normalize, there are some metadata items that are specific to various
file formats...I strongly recommend running Tika against a representative batch of documents
and deciding which fields you need for your application.

Finally, if there's a chance you want metadata from embedded documents/attachments, checkout
the RecursiveParserWrapper.  Under legacy Tika, if you have a bunch of images in a zip file,
you'd never get the lat/longs...or you'd never get "dc:creator" from an MSWord file sent as
an attachment in an MSG file.

Finally, and I mean it this time, I heartily second Erik's point about SolrJ and the need
to keep your file processing outside of Solr's JVM, VM and M!




-----Original Message-----
From: Erik Hatcher [mailto:erik.hatcher@gmail.com] 
Sent: Monday, June 19, 2017 6:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with
context

Ziyuan -

You may be interested in the example/files that ships with Solr too.  It’s got schema and
config and even UI for file indexing and searching.   Check it out README.txt under example/files
in your Solr install.

	Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyuang@gmail.com> wrote:
> 
> Hi Erick,
> 
> thanks very much for the explanations! Clarification for question 2: 
> more specifically I cannot see the field content in the returned JSON, 
> with the the same definitions as in the post 
> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-t
> ext-inside-documents-indexed-with-solr-plus-tika/>
> :
> 
> <field name="content" type="text_general" indexed="false" 
> stored="true"/> <field name="text" type="text_general" multiValued="true" indexed="true"
> stored="false"/>
> <copyField source="content" dest="text"/>
> 
> Is it so that Tika does not fill these two fields automatically and I 
> have to write some client code to fill them?
> 
> Best regards,
> Ziyuan
> 
> 
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson 
> <erickerickson@gmail.com>
> wrote:
> 
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a 
>> copyField directive copying (perhaps) many different fields to the 
>> "text" field. That permits simple searches against a single field 
>> rather than, say, using edismax to search across multiple separate 
>> fields.
>> 
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> There are ways to map meta-data fields from the doc into specific 
>> fields matching your schema. Be a little careful here. There is no 
>> standard across different types of docs as to what meta-data field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing. Here's a 
>> link to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can 
>> easily hack out the DB bits.
>> 
>> BTW, once you get more familiar with processing, I strongly recommend 
>> you do the document processing on the client, the reasons are 
>> outlined in that article.
>> 
>> bq: even I define the fields as he said I cannot see them in the 
>> search results as keys in JSON are the fields set as stored="true"? 
>> They must be to be returned in requests (skipping the docValues 
>> discussion here).
>> 
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot 
>> highlight or view. Fields you highlight must have stored=true BTW.
>> 
>> Whether or not you can highlight "Trevor Hastie" depends an a lot of 
>> things, most particularly whether that text is ever actually in a 
>> field in your index. Just because there's no guarantee that the name 
>> of the file is indexed in a searchable/highlightable way.
>> 
>> And the query q=id:Trevor Hastie won't do what you think. It'll be 
>> parsed as id:Trevor _text_:Hastie _text_ is the default field, look 
>> for a "df" parameter in your request handler in solrconfig.xml 
>> (usually "/select" or "/query").
>> 
>> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyuang@gmail.com> wrote:
>>> Hi,
>>> 
>>> I am new to Solr and I need to implement a full-text search of some 
>>> PDF files. The indexing part works out of the box by using bin/post. 
>>> I can
>> see
>>> search results in the admin UI given some queries, though without 
>>> the matched texts and the context.
>>> 
>>> Now I am reading this post
>>> <http://www.codewrecks.com/blog/index.php/2013/05/27/
>> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> for the highlighting part. It is for an older version of Solr when
>> managed
>>> schema was not available. Before fully understand what it is doing I 
>>> have some questions:
>>> 
>>> 1. He defined two fields:
>>> 
>>> <field name="content" type="text_general" indexed="false" stored="true"
>>> multiValued="false"/>
>>> <field name="text" type="text_general" indexed="true" stored="false"
>>> multiValued="true"/>
>>> 
>>> But why are there two fields needed? Can I define a field
>>> 
>>> <field name="content" type="text_general" indexed="true" stored="true"
>>> multiValued="true"/>
>>> 
>>> to capture the full text?
>>> 
>>> 2. How are the fields filled? I don't see relevant information in 
>>> TikaEntityProcessor's documentation 
>>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/
>>> org/
>> apache/solr/handler/dataimport/TikaEntityProcessor.html#
>> fields.inherited.from.class.org.apache.solr.handler.
>> dataimport.EntityProcessorBase>.
>>> The current text extractor should already be Tika (I can see
>>> 
>>> "x_parsed_by":
>>> ["org.apache.tika.parser.DefaultParser","org.apache.
>> tika.parser.pdf.PDFParser"]
>>> 
>>> in the returned JSON of some query). But even I define the fields as 
>>> he said I cannot see them in the search results as keys in JSON.
>>> 
>>> 3. The _text_ field seems a concatenation of other fields, does it
>> contain
>>> the full text? Though it does not seem to be accessible by default.
>>> 
>>> To be brief, using The Elements of Statistical Learning 
>>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/
>> ESLII_print10.pdf>
>>> as an example, how to highlight the relevant texts for the query "SVM"?
>> And
>>> if changing the file name into "The Elements of Statistical Learning 
>>> - Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" 
>>> for the query "id:Trevor Hastie"?
>>> 
>>> Thank you.
>>> 
>>> Best regards,
>>> Ziyuan
>> 

Mime
View raw message