lucene-solr-user mailing list archives

From ZiYuan <>
Subject Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context
Date Sat, 17 Jun 2017 22:04:24 GMT

I am new to Solr and I need to implement a full-text search of some PDF
files. The indexing part works out of the box by using bin/post. I can see
search results in the admin UI given some queries, though without the
matched texts and the context.

Now I am reading this post
for the highlighting part. It is written for an older version of Solr, from
before the managed schema was available. Before I fully understand what it
is doing, I have some questions:

1. He defined two fields:

<field name="content" type="text_general" indexed="false" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false"/>

But why are two fields needed? Can I define a single field

<field name="content" type="text_general" indexed="true" stored="true"/>

to capture the full text?

2. How are the fields filled? I don't see relevant information in
TikaEntityProcessor's documentation.
The current text extractor should already be Tika (I can see


in the returned JSON of some query). But even if I define the fields as he
said, I cannot see them as keys in the JSON search results.

3. The _text_ field seems to be a concatenation of other fields. Does it
contain the full text? It does not seem to be accessible by default, though.
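
To make question 1 concrete, here is a minimal sketch of how the single
indexed-and-stored field could be declared through the Schema API when the
managed schema is in use. The helper name add_field_payload and the
collection name in the comment are hypothetical; only the "add-field"
command and its name/type/indexed/stored properties come from Solr itself.
Whether this one field is enough for highlighting is exactly what is being
asked, so treat it as an illustration rather than a confirmed setup.

```python
import json


def add_field_payload(name, field_type="text_general", indexed=True, stored=True):
    """Build the JSON body for a Schema API 'add-field' command.

    indexed=True makes the field searchable; stored=True keeps the original
    text so that it can be returned (and used for highlighting).
    """
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
        }
    }


# The body would be POSTed with Content-Type: application/json to
# http://localhost:8983/solr/<collection>/schema (collection name assumed).
payload = add_field_payload("content")
print(json.dumps(payload, indent=2))
```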

To be brief, using The Elements of Statistical Learning
as an example, how can I highlight the relevant text for the query "SVM"? And
if I rename the file to "The Elements of Statistical Learning -
Trevor Hastie.pdf" and post it, how can I highlight "Trevor Hastie" for the
query "id:Trevor Hastie"?
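
For reference, the kind of request in question would look roughly like the
sketch below, which builds a /select URL with Solr's standard highlighting
parameters (hl, hl.fl, hl.snippets, hl.fragsize). The function name
highlight_query_url, the collection name "gettingstarted", and the
assumption that the full text lives in a stored field called "content" are
all hypothetical; the snippet only constructs the URL and does not contact
a server.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, collection, query, field, snippets=3, fragsize=150):
    """Return a /select URL that asks Solr to highlight matches in `field`."""
    params = {
        "q": query,               # the search query, e.g. content:SVM
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field(s) to generate snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet size in characters
        "wt": "json",             # response format
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Assumed local Solr instance and collection name.
url = highlight_query_url(
    "http://localhost:8983/solr", "gettingstarted", "content:SVM", "content"
)
print(url)
```

If highlighting works, the snippets would be expected in a separate
"highlighting" section of the JSON response, keyed by document id, rather
than inside the documents themselves.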

Thank you.

Best regards,
