lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context
Date Sun, 18 Jun 2017 17:07:16 GMT
1> Yes, you can use your single definition. The author identifies the
"text" field as a catch-all. Somewhere in the schema there'll be a
copyField directive copying (perhaps) many different fields to the
"text" field. That permits simple searches against a single field
rather than, say, using edismax to search across multiple separate
fields.
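
A catch-all arrangement of that kind might look roughly like the sketch
below. The "text" definition is the one quoted in the question; the
"content", "title" and "author" source fields are assumptions for
illustration only, not taken from the article's actual schema.

```xml
<!-- Catch-all field: searchable, but not stored, so it cannot be
     returned or highlighted. -->
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>

<!-- Hypothetical source fields copied into the catch-all at index time. -->
<copyField source="content" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="author" dest="text"/>
```

With something like this in place, a plain query against "text" matches
terms from any of the copied fields.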

2> The link you referenced is for Data Import Handler, which is much
different than just posting files to Solr. See
There are ways to map meta-data fields from the doc into specific
fields matching your schema. Be a little careful here. There is no
standard across different types of docs as to what meta-data field is
included. PDF might have a "last_edited" field. Word might have a
"last_modified" field where the two mean the same thing. Here's a link
to a SolrJ program that'll dump all the fields: You can easily
hack out the DB bits.
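
As a rough sketch of that kind of mapping, assuming the files go through
the extracting request handler (Solr Cell) at /update/extract, an fmap
default can rename an extracted meta-data field to a schema field. The
names "last_edited" and "last_modified" are just the illustrative ones
from above; check what your documents actually emit before relying on
them.

```xml
<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Lower-case extracted field names and turn punctuation into "_". -->
    <str name="lowernames">true</str>
    <!-- Illustrative: store a PDF's "last_edited" value in the same
         schema field used for Word's "last_modified". -->
    <str name="fmap.last_edited">last_modified</str>
  </lst>
</requestHandler>
```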

BTW, once you get more familiar with processing, I strongly recommend
you do the document processing on the client; the reasons are outlined
in that article.

bq: even I define the fields as he said I cannot see them in the
search results as keys in JSON
Are the fields set as stored="true"? They must be stored to be returned
in responses (skipping the docValues discussion here).

3> Yes, the text field is a concatenation of all the other ones.
Because it has stored=false, you can only search it; you cannot
highlight or view it. Fields you highlight must have stored=true BTW.
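
For example, the single field proposed in the question is both
searchable and stored, so it could be returned and highlighted. This is
only a sketch: it assumes the extracted body text actually ends up in a
field named "content", which depends on how your extraction is mapped.

```xml
<!-- indexed="true" makes the field searchable; stored="true" keeps the
     original text so it can be returned and highlighted. -->
<field name="content" type="text_general" indexed="true" stored="true"
       multiValued="true"/>
```

A request would then name that field for highlighting, e.g. with
hl=true and hl.fl=content.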

Whether or not you can highlight "Trevor Hastie" depends on a lot of
things, most particularly whether that text is ever actually in a
field in your index. There's no guarantee that the name of the file is
indexed in a searchable/highlightable way.

And the query q=id:Trevor Hastie won't do what you think. It'll be parsed as
id:Trevor _text_:Hastie
Only "Trevor" is bound to the id field; "Hastie" falls through to the
default field. _text_ is the default field; look for a "df" parameter
in your request handler in solrconfig.xml (usually "/select" or "/query").
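
The setting to look for is along these lines. This is a sketch of a
typical "/select" handler, not a copy of your solrconfig.xml, so the
surrounding defaults will differ.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Default field for query terms that carry no explicit
         field prefix. -->
    <str name="df">_text_</str>
  </lst>
</requestHandler>
```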

On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <> wrote:
> Hi,
> I am new to Solr and I need to implement a full-text search of some PDF
> files. The indexing part works out of the box by using bin/post. I can see
> search results in the admin UI given some queries, though without the
> matched texts and the context.
> Now I am reading this post
> <>
> for the highlighting part. It is for an older version of Solr when managed
> schema was not available. Before fully understand what it is doing I have
> some questions:
> 1. He defined two fields:
> <field name="content" type="text_general" indexed="false" stored="true"
> multiValued="false"/>
> <field name="text" type="text_general" indexed="true" stored="false"
> multiValued="true"/>
> But why are there two fields needed? Can I define a field
> <field name="content" type="text_general" indexed="true" stored="true"
> multiValued="true"/>
> to capture the full text?
> 2. How are the fields filled? I don't see relevant information in
> TikaEntityProcessor's documentation
> <>.
> The current text extractor should already be Tika (I can see
> "x_parsed_by":
> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
> in the returned JSON of some query). But even I define the fields as he
> said I cannot see them in the search results as keys in JSON.
> 3. The _text_ field seems a concatenation of other fields, does it contain
> the full text? Though it does not seem to be accessible by default.
> To be brief, using The Elements of Statistical Learning
> <>
> as an example, how to highlight the relevant texts for the query "SVM"? And
> if changing the file name into "The Elements of Statistical Learning -
> Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
> query "id:Trevor Hastie"?
> Thank you.
> Best regards,
> Ziyuan
