lucene-solr-user mailing list archives

From Srinivasa Meenavalli <Smeenav...@zensar.com>
Subject RE: Question about indexing PDFs
Date Fri, 26 Aug 2016 07:39:36 GMT
Hi Betsey,

I ran some of the Apache Tika DataImportHandler examples in Solr 5.5. The content/text field was
not stored by default.
I can see the PDF contents in the returned documents once stored="true" is set on the field.

solr start -e dih

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

/solr/tika/select?q=*%3A*&wt=json&indent=true

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="${solr.install.dir}/example/exampledocs/solr-word.pdf" format="text">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>

Regards
Srinivas Meenavalli

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, August 26, 2016 3:09 AM
To: solr-user
Subject: Re: Question about indexing PDFs

That is always a dangerous assumption. Are you sure you're searching on the proper field?
Are you sure it's indexed? Are you sure it's....

The schema browser I indicated above will give you some idea what's actually in the field.
You can not only see the fields Solr (actually Lucene) sees in your index, but you can also
see what some of the terms are.
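
The same check can also be scripted. The following is a minimal sketch, assuming a core named "tika" (as in the select URL earlier in this thread) and a Solr instance on localhost:8983. The host, the port, and the helper name luke_url are assumptions for illustration. It builds a request to Solr's Luke request handler, which reports the top terms actually indexed for a field.

```python
from urllib.parse import urlencode


def luke_url(base_url, core, field, num_terms=10):
    """Build a Luke request handler URL listing the top terms of one field.

    luke_url is a hypothetical helper; base_url and core are assumptions
    about the local setup and must be adjusted to match your install.
    """
    params = urlencode({"fl": field, "numTerms": num_terms, "wt": "json"})
    return f"{base_url}/{core}/admin/luke?{params}"


# Assumed local instance and core name; open the printed URL in a browser
# or fetch it with any HTTP client.
print(luke_url("http://localhost:8983/solr", "tika", "text"))
```

If the field you expect to hold the PDF body is missing from the response, or has no terms, the content was not indexed into that field.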

Adding &debug=query and looking at the parsed query will show you what fields are being
searched against. The most common causes of what you're describing are:

> not searching against the field you think you are. This
is very easy to do without knowing it.

> not actually having indexed="true" set in your schema

> not committing after inserting the doc
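
The first and third causes above can be checked with two plain HTTP requests. This is a rough sketch, not a definitive recipe: the host, port, core name "tika", the example query text:solr, and both helper names are assumptions. It builds a select URL with debug=query, so the response includes the parsed query, and an update URL that issues an explicit commit.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, core, query):
    """Select URL whose response includes the parsed query (debug=query).

    Hypothetical helper; base_url and core depend on your setup.
    """
    params = urlencode(
        {"q": query, "debug": "query", "wt": "json", "indent": "true"}
    )
    return f"{base_url}/{core}/select?{params}"


def commit_url(base_url, core):
    """Update URL that asks Solr to commit pending documents."""
    return f"{base_url}/{core}/update?{urlencode({'commit': 'true'})}"


base = "http://localhost:8983/solr"  # assumed local instance
print(debug_query_url(base, "tika", "text:solr"))
print(commit_url(base, "tika"))
```

In the debug section of the select response, the parsed query shows which field each term was actually searched against.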

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <betsey.benagh@stresearch.com> wrote:

> It looks like the metadata of the PDFs was indexed, but not the
> content (which is what I was interested in).  Searches on terms I know
> exist in the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.benagh@stresearch.com> wrote:
>
> >Right, that's where I looked.  No 'content'.  Which is what confused me.
> >
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:
> >
> >>When you say "I don't see it in the schema for that collection" are
> >>you talking schema.xml? managed_schema? Or actual documents in the index?
> >>Often these are defined by dynamic fields and the like in the schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll
> >>see all the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh <betsey.benagh@stresearch.com> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a
> >>>bunch of  PDF documents into my Solr 6.0 instance.  As far as I can
> >>>tell from the  documentation, there should be a 'content' field
> >>>indexing, well, the  content, but I don't see it in the schema for
> >>>that collection.  Is there  something obvious I might have missed?
> >>>
> >>> Thanks!
> >>>
> >>>
> >
>
>