lucene-solr-user mailing list archives

From ZiYuan <ziyu...@gmail.com>
Subject Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context
Date Sat, 17 Jun 2017 22:04:24 GMT
Hi,

I am new to Solr and I need to implement a full-text search of some PDF
files. The indexing part works out of the box by using bin/post. I can see
search results in the admin UI given some queries, though without the
matched texts and the context.

Now I am reading this post
<http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
for the highlighting part. It was written for an older version of Solr, when
the managed schema was not available. Before I fully understand what it is
doing, I have some questions:

1. He defined two fields:

<field name="content" type="text_general" indexed="false" stored="true"
multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false"
multiValued="true"/>

But why are two fields needed? Can I define a single field

<field name="content" type="text_general" indexed="true" stored="true"
multiValued="true"/>

to capture the full text?

2. How are the fields filled? I don't see relevant information in
TikaEntityProcessor's documentation
<https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
The current text extractor should already be Tika (I can see

"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]

in the JSON returned by some queries). But even if I define the fields as he
said, I cannot see them as keys in the JSON search results.

3. The _text_ field seems to be a concatenation of other fields. Does it
contain the full text? It does not seem to be accessible by default, though.

To be concrete, using The Elements of Statistical Learning
<http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
as an example, how do I highlight the relevant text for the query "SVM"? And
if I rename the file to "The Elements of Statistical Learning -
Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the
query "id:Trevor Hastie"?
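
For reference, the kind of request I imagine sending is sketched below in
Python. It is only a sketch under assumptions: the core name "mycore" and the
helper build_highlight_url are made up for illustration, and it assumes the
full text ends up in a stored field called "content". The parameters hl,
hl.fl, hl.snippets and hl.fragsize are Solr's standard highlighting
parameters; whether they are enough for my setup is exactly what I am unsure
about.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, core, query, field, snippets=3, fragsize=150):
    """Build a Solr /select URL that asks for highlighted snippets.

    The core name and field name are supplied by the caller; nothing here
    checks that they exist in the schema.
    """
    params = {
        "q": query,               # the user query, e.g. "SVM"
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to highlight (assumed to be stored)
        "hl.snippets": snippets,  # maximum number of snippets per field
        "hl.fragsize": fragsize,  # approximate snippet size in characters
        "wt": "json",             # ask for a JSON response
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local instance and core name.
url = build_highlight_url("http://localhost:8983/solr", "mycore", "SVM", "content")
print(url)
```

If this is on the right track, I would expect the snippets to come back in a
separate "highlighting" section of the response rather than inside the
documents themselves.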

Thank you.

Best regards,
Ziyuan
