lucene-solr-user mailing list archives

From Michael Dockery <dockeryjava...@yahoo.com>
Subject Re: select query does not find indexed pdf document
Date Wed, 14 Sep 2011 02:46:07 GMT
Thank you for your informative reply.

I would like to start simple by combining both filename and content 
  into the same default search field
   ...which my default schema xml calls  "text"
...
<defaultSearchField>text</defaultSearchField>
...

also:
-case and accent insensitive
-no splits on numb3rs
-no highlights 
-text processing same for index and search

however I would like
-ngrams, preferably (partial/prefix word/token search)


what schema mods would be needed?

also what curl syntax to submit/index a pdf (with filename and content combined into the default
search field)?



________________________________
From: Bob Sandiford <bob.sandiford@sirsidynix.com>
To: Michael Dockery <dockeryjavaman@yahoo.com>
Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Monday, September 12, 2011 1:38 PM
Subject: RE: select query does not find indexed pdf document

Hi, Michael.

Well, the stock answer is, 'it depends'

For example - would you want to be able to search filename without searching file contents,
or would you always search both of them together?  If both, then copy both the file name
and the parsed file content from the pdf into a single search field, and you can set that
up as the default search field.
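
As a rough illustration of the single-field approach, a schema.xml fragment along these lines could work. This is only a sketch: the field names "filename" and "content" are made up for the example, and it assumes the schema already defines a field type named "text". Only the default search field name "text" comes from this thread.

```xml
<!-- schema.xml fragment (sketch). "filename" and "content" are hypothetical
     field names; the field type "text" is assumed to exist in your schema. -->
<field name="filename" type="string" indexed="true" stored="true"/>
<field name="content"  type="text"   indexed="true" stored="false" multiValued="true"/>

<!-- Catch-all field that receives copies of both source fields -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="filename" dest="text"/>
<copyField source="content"  dest="text"/>

<!-- Queries with no field prefix (e.g. q=vpn) search this field -->
<defaultSearchField>text</defaultSearchField>
```

The catch-all field is multiValued because it receives values from more than one source field. Searching filename separately stays possible with q=filename:... as long as that field is indexed.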

Or - what kind of processing / normalizing do you want on this data?  Case insensitive? 
Accent insensitive?  If a 'word' contains camel case (e.g. TheVeryIdea), do you want that
split on the case changes?  (but then watch out for things like "iPad")  If a 'word' contains
numbers, do you want them left together, or separated?  Do you want stemming (where searching
for 'stemming' would also find 'stem', 'stemmed', that sort of thing?)  Is this always English,
or are other languages involved?  Do you want the text processing to be the same for
indexing vs searching?  Do you want to be able to find hits based on the first few characters
of a term?  (ngrams)
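
To make a few of these choices concrete, here is one possible analysis chain, offered as a sketch rather than a recommendation. It lowercases, folds accents, does not split on digits or case changes, does no stemming, and indexes leading-edge n-grams so that the first few characters of a term can match. The type name "text_prefix" and the gram sizes are assumptions chosen for the example.

```xml
<!-- schema.xml fragment (sketch). The name "text_prefix" and the gram sizes
     are illustrative assumptions, not values from this thread. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "numb3rs" and "iPad" stay whole -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Accent insensitive: fold accented characters to ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Index prefixes of each token (e.g. "vp", "vpn") for prefix matching -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization as at index time, but no n-grams on the query side -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the index and query analyzers differ on purpose here: the n-gram filter usually runs at index time only, otherwise a query term would itself be broken into prefixes and match far too much. The whitespace tokenizer also leaves punctuation attached to tokens, which may or may not be acceptable for your data.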

Do you want to be able to highlight text segments where the search terms were found?

You will probably want to read up on the various tokenizers and filters that are available.  Do
some prototyping and see how it looks.

Here's a starting point: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Basically, there is no 'one size fits all' here.  Part of the power of Solr / Lucene is its
configurability to achieve the results your business case calls for.  Part of the drawback
of Solr / Lucene - especially for new folks - is its configurability to achieve the results
your business case calls for. :)

Anyone got anything else to suggest for Michael?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com

From: Michael Dockery [mailto:dockeryjavaman@yahoo.com]
Sent: Monday, September 12, 2011 1:18 PM
To: Bob Sandiford
Subject: Re: select query does not find indexed pdf document

thank you.  that worked.

Any tips for   very   very  basic setup of the schema xml?
   ....or is the default basic enough?

I basically only want to search on
        filename   and    file contents


From: Bob Sandiford <bob.sandiford@sirsidynix.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Michael Dockery <dockeryjavaman@yahoo.com>
Sent: Monday, September 12, 2011 10:04 AM
Subject: RE: select query does not find indexed pdf document

Um - looks like you specified your id value as "pdfy", which is reflected in the results from
the "*:*" query, but your id query is searching for "vpn", hence no matches...

What does this query yield?

http://www/SearchApp/select/?q=id:pdfy

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com

> -----Original Message-----
> From: Michael Dockery [mailto:dockeryjavaman@yahoo.com]
> Sent: Monday, September 12, 2011 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: select query does not find indexed pdf document
>
> http://www/SearchApp/select/?q=id:vpn
>
> yields this:
>   <?xml version="1.0" encoding="UTF-8" ?>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">15</int>
>       <lst name="params">
>         <str name="q">id:vpn</str>
>       </lst>
>     </lst>
>     <result name="response" numFound="0" start="0"/>
>   </response>
>
>
> *****************************************
>
>  http://www/SearchApp/select/?q=*:*
>
> yields this:
>
>   <?xml version="1.0" encoding="UTF-8" ?>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">16</int>
>       <lst name="params">
>         <str name="q">*.*</str>
>       </lst>
>     </lst>
>     <result name="response" numFound="1" start="0">
>       <doc>
>         <str name="author">doc</str>
>         <arr name="content_type">
>           <str>application/pdf</str>
>         </arr>
>         <str name="id">pdfy</str>
>         <date name="last_modified">2011-05-20T02:08:48Z</date>
>         <arr name="title">
>           <str>dmvpndeploy.pdf</str>
>         </arr>
>       </doc>
>     </result>
>   </response>
>
>
> From: Jan Høydahl <jan.asf@cominvent.com>
> To: solr-user@lucene.apache.org; Michael Dockery <dockeryjavaman@yahoo.com>
> Sent: Monday, September 12, 2011 4:59 AM
> Subject: Re: select query does not find indexed pdf document
>
> Hi,
>
> What do you get from a query http://www/SearchApp/select/?q=*:* or
> http://www/SearchApp/select/?q=id:vpn ?
> You may not have mapped the fields correctly to your schema?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 12. sep. 2011, at 02:12, Michael Dockery wrote:
>
> > I am new to solr.
> >
> > I tried to upload a pdf file via curl to my solr webapp (on tomcat)
> >
> > curl "http://www/SearchApp/update/extract?stream.file=c:\dmvpn.pdf&stream.contentType=application/pdf&literal.id=pdfy&commit=true"
> >
> >
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <response>
> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">860</int></lst>
> > </response>
> >
> >
> > but
> >
> > http://www/SearchApp/select/?q=vpn
> >
> >
> > does not find the document
> >
> >
> > <response>
> > <lst name="responseHeader">
> > <int name="status">0</int>
> > <int name="QTime">0</int>
> > <lst name="params">
> > <str name="q">vpn</str>
> > </lst>
> > </lst>
> > <result name="response" numFound="0" start="0"/>
> > </response>
> >
> >
> > help is appreciated.
> >
> > =================================================
> > fyi
> > I point my test webapp to the index/solr home via mod meta-data/context.xml
> > <Context crossContext="true" >
> >    <Environment name="solr/home" type="java.lang.String"
> >  value="c:/solr_home" override="true" />
> >
> > and I had to copy all these jars to my webapp lib dir: (to avoid the classnotfound)
> > Solr_download\contrib\extraction\lib
> >  ...in the future i plan to put them in the tomcat/lib dir.
> >
> >
> > Also, I have not modified conf\solrconfig.xml or schema.xml.