lucene-solr-user mailing list archives

From Betsey Benagh <betsey.ben...@stresearch.com>
Subject Re: Question about indexing PDFs
Date Fri, 26 Aug 2016 12:48:27 GMT
Erick,

I’m not sure of anything.  I’m new to Solr and find the documentation
extremely confusing.  I’ve searched the web and found tutorials/advice,
but they generally refer to older versions of Solr, and refer to
methods/settings/whatever that no longer exist. That’s why I’m asking for
help here.

I looked at the list of fields in the schema browser, and ‘content’ is not
there.  If that is not enough to ‘assume’ that the content is not being
indexed, then please enlighten me as to what is.

I inserted the docs in batches by posting them, following the ‘Quick
Start’ tutorial.  It seemed like a safe assumption that the tutorial on
the Solr site would be correct and produce desirable results.

What I really want to do is index the XML versions of the documents which
have been run through another system, but I cannot for the life of me
figure out how to do that.  I’ve tried, but the documentation about XML
makes no sense to me.  I thought indexing the PDF versions would be easier
and more straightforward, but perhaps that is not the case.

Thanks,

betsey

On 8/25/16, 5:39 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:

>That is always a dangerous assumption. Are you sure
>you're searching on the proper field? Are you sure it's indexed? Are
>you sure it's....
>
>The schema browser I indicated above will give you some
>idea what's actually in the field. You can not only see the
>fields Solr (actually Lucene) sees in your index, but you can
>also see what some of the terms are.
>
>Adding &debug=query and looking at the parsed query
>will show you what fields are being searched against. The
>most common causes of what you're describing are:
>
>> not searching against the field you think you are. This
>is very easy to do without knowing it.
>
>> not actually having indexed="true" set in your schema
>
>> not committing after inserting the doc
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <betsey.benagh@stresearch.com> wrote:
>
>> It looks like the metadata of the PDFs was indexed, but not the content
>> (which is what I was interested in).  Searches on terms I know exist in
>> the content come up empty.
>>
>> On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.benagh@stresearch.com> wrote:
>>
>> >Right, that’s where I looked.  No ‘content’.  Which is what confused me.
>> >
>> >
>> >On 8/25/16, 1:56 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:
>> >
>> >>When you say "I don't see it in the schema for that collection" are you
>> >>talking schema.xml? managed_schema? Or actual documents in the index? Often
>> >>these are defined by dynamic fields and the like in the schema files.
>> >>
>> >>Take a look at the admin UI > schema browser > drop down and you'll see all
>> >>the actual fields in your index...
>> >>
>> >>Best,
>> >>Erick
>> >>
>> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh <betsey.benagh@stresearch.com> wrote:
>> >>
>> >>> Following the instructions in the quick start guide, I imported a bunch of
>> >>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>> >>> documentation, there should be a 'content' field indexing, well, the
>> >>> content, but I don't see it in the schema for that collection.  Is there
>> >>> something obvious I might have missed?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >
>>
>>
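
The three checks Erick lists in the thread above (look at the parsed query, look at the fields actually in the index, make sure a commit happened) can be sketched as plain HTTP requests. This is only a rough illustration, not something taken from the thread: the host, port and collection name are assumptions, and the code just builds the URLs rather than contacting a server. The paths used (/select with debug=query, /admin/luke, and /update with commit=true) are standard Solr request handlers, but whether they are enabled depends on your solrconfig.xml.

```python
from urllib.parse import urlencode

# Assumptions (not from the thread): a local Solr on the default port
# and a hypothetical collection name.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "mycollection"


def select_debug_url(term, field=None):
    """Search URL that also asks Solr to return the parsed query.

    If `field` is given, the term is searched in that field explicitly;
    otherwise the handler's default field(s) are used.
    """
    q = f"{field}:{term}" if field else term
    params = urlencode({"q": q, "debug": "query", "wt": "json"})
    return f"{SOLR_BASE}/{COLLECTION}/select?{params}"


def luke_fields_url(num_terms=5):
    """URL for the Luke request handler, which reports the fields that
    actually exist in the index (the data behind the schema browser)."""
    params = urlencode({"numTerms": num_terms, "wt": "json"})
    return f"{SOLR_BASE}/{COLLECTION}/admin/luke?{params}"


def commit_url():
    """URL that asks the update handler to commit pending documents."""
    params = urlencode({"commit": "true"})
    return f"{SOLR_BASE}/{COLLECTION}/update?{params}"


if __name__ == "__main__":
    # Print the URLs; paste them into a browser or fetch them with
    # urllib.request.urlopen() once Solr is running.
    print(select_debug_url("budget"))
    print(select_debug_url("budget", field="content"))
    print(luke_fields_url())
    print(commit_url())
```

In the debug output, the parsed query shows which field the term was really searched against, which is the quickest way to catch the first cause in Erick's list.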
