lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene search in attachments
Date Tue, 10 Feb 2015 09:59:25 GMT
Hi,

> -----Original Message-----
> From: sreedevi s [mailto:sreedevi.payikkad@gmail.com]
> Sent: Tuesday, February 10, 2015 10:46 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene search in attachments
> 
> Hi Uwe,
> Thank you for the update. I will remove the limit in Tika and check.
> So, my understanding is that Lucene currently has no restriction on the
> number of terms per field, but when a term is greater than 2^15 bytes it is
> silently ignored at indexing time; a message is logged to infoStream if
> enabled, but no error is thrown.

Yes. There is only a limit on a single term *after* text analysis. But keep in mind that some
Analyzers, like StandardAnalyzer, have other limits way below that one. On the other hand, if
you index your documents as "StringField" or with KeywordAnalyzer, no tokenization is done at
all; in that case the whole field is indexed as a single term, but that's not useful for
searching in full text anyway. So use a suitable analyzer!
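
To illustrate the point above, here is a minimal plain-Java sketch (no Lucene dependency) of why a large attachment indexed as one untokenized term runs into the per-term limit, while the same text split into tokens does not. The class and helper names are hypothetical, the whitespace split is only a crude stand-in for a real analyzer, and the 32766-byte constant is assumed to match IndexWriter.MAX_TERM_LENGTH (just under 2^15).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TermLengthSketch {
    // Assumed to mirror Lucene's IndexWriter.MAX_TERM_LENGTH (UTF-8 bytes per term).
    static final int MAX_TERM_BYTES = 32766;

    // Hypothetical helper: size of a term once encoded as UTF-8.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    // Stand-in for StringField / KeywordAnalyzer: the whole value is one term.
    static List<String> asSingleTerm(String fieldValue) {
        List<String> terms = new ArrayList<>();
        terms.add(fieldValue);
        return terms;
    }

    // Crude stand-in for a tokenizing analyzer: split on whitespace.
    static List<String> tokenize(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String t : fieldValue.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Number of terms that exceed the per-term byte limit.
    static long countOversized(List<String> terms) {
        return terms.stream().filter(t -> utf8Length(t) > MAX_TERM_BYTES).count();
    }

    // Builds a long sample text of the form "word0 word1 word2 ...".
    static String sampleText(int words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append("word").append(i).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = sampleText(10000);
        System.out.println("oversized as single term: " + countOversized(asSingleTerm(text)));
        System.out.println("oversized when tokenized: " + countOversized(tokenize(text)));
    }
}
```

In a real index the tokenization would come from the Analyzer configured on the IndexWriter; the sketch only shows that the limit applies to each term, not to the field as a whole.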

> Is that right?

Yes!

Uwe

> Best Regards,
> Sreedevi S
> 
> On Tue, Feb 10, 2015 at 2:45 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> > Hi,
> >
> > There is no restriction to 10000 characters inside Lucene and there
> > never was one. In earlier Lucene versions (long time ago) there was an
> > implicit restriction to 10,000 TERMS (not characters). This is no longer
> > the case.
> > If you still want this, you have to wrap your Analyzer:
> > http://goo.gl/SRf45A
> >
> > If you have a limitation to 10,000 characters somewhere, it might be
> > your TIKA text extraction.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: sreedevi s [mailto:sreedevi.payikkad@gmail.com]
> > > Sent: Tuesday, February 10, 2015 9:53 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Lucene search in attachments
> > >
> > > Thank you David. Yes, it has a restriction of characters to 10000.
> > > But for large files, what could be done in that case?
> > >
> > > Best Regards,
> > > Sreedevi S
> > >
> > > On Tue, Feb 10, 2015 at 2:04 PM, David Pilato <david@pilato.fr> wrote:
> > >
> > > > > If you don’t index content, you won’t be able to search for it I guess.
> > > > > That said, Tika can have this extracted characters limit. See
> > > > > indexedChars below:
> > > > >
> > > > > tika().parseToString(new BytesStreamInput(content, false), metadata, indexedChars);
> > > >
> > > > [1]
> > > > https://github.com/elasticsearch/elasticsearch-mapper-attachments/
> > > > blob
> > > >
> > >
> /master/src/main/java/org/elasticsearch/index/mapper/attachment/Atta
> > > ch
> > > > mentMapper.java#L456
> > > > <
> > > > https://github.com/elasticsearch/elasticsearch-mapper-attachments/
> > > > blob
> > > >
> > >
> /master/src/main/java/org/elasticsearch/index/mapper/attachment/Atta
> > > ch
> > > > mentMapper.java#L456
> > > > >
> > > >
> > > > --
> > > > David Pilato | Technical Advocate | Elasticsearch.com @dadoonet
> > > > <https://twitter.com/dadoonet> | @elasticsearchfr <
> > > > https://twitter.com/elasticsearchfr> | @scrutmydocs <
> > > > https://twitter.com/scrutmydocs>
> > > >
> > > >
> > > >
> > > > > > On 10 Feb 2015, at 09:24, sreedevi s <sreedevi.payikkad@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > >    What is the best method to search in attachments in Lucene?
> > > > > > I am new to Lucene and I am using version 4.10.2. By making use
> > > > > > of Tika, I know I can convert files to text and then index it as
> > > > > > another field. But for large files that will not be the ideal
> > > > > > solution. I believe the maximum characters per field is 10,000.
> > > > > > So, what would be the ideal method to search attachments then?
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Sreedevi S
> > > >
> > > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >



