lucene-java-user mailing list archives

From sreedevi s <sreedevi.payik...@gmail.com>
Subject Re: Lucene search in attachments
Date Tue, 10 Feb 2015 09:45:50 GMT
Hi Uwe,
Thank you for the update. I will remove the limit in Tika and check.
So my understanding is that Lucene currently has no restriction on the
number of terms per field, but when a term is longer than 2^15 bytes it is
silently ignored at indexing time: a message is logged to infoStream if
enabled, but no error is thrown.
Is that right?
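
For reference, the limit discussed above is counted in UTF-8 bytes, not in Java characters, so a term with non-ASCII text reaches it sooner than its character count suggests. The following is only a minimal sketch in plain Java with no Lucene dependency. The class name TermLengthCheck and its methods are hypothetical, and the constant 32766 (2^15 - 2) is assumed to match IndexWriter.MAX_TERM_LENGTH in Lucene 4.x. How Lucene reacts to an over-long term (skipping it or rejecting the document) depends on the version and is not modelled here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: checks a single term against
// an assumed per-term byte limit before it is handed to the indexer.
class TermLengthCheck {

    // Assumption: mirrors IndexWriter.MAX_TERM_LENGTH in Lucene 4.x (2^15 - 2).
    static final int MAX_TERM_BYTES = (1 << 15) - 2;

    // Length of the term once encoded as UTF-8, which is what the limit counts.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the encoded term is longer than the assumed limit.
    static boolean exceedsLimit(String term) {
        return utf8Length(term) > MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        // 20,000 copies of U+00E9, each of which takes two bytes in UTF-8.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append('\u00e9');
        }
        String term = sb.toString();
        System.out.println(term.length() + " chars, "
                + utf8Length(term) + " bytes, exceeds limit: "
                + exceedsLimit(term));
    }
}
```

A check like this only matters for fields indexed as a single token, or for analyzers that can emit very long tokens. Ordinary tokenized text extracted by Tika rarely produces a single term anywhere near this size.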



Best Regards,
Sreedevi S

On Tue, Feb 10, 2015 at 2:45 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> There is no 10,000-character restriction inside Lucene and there never
> was one. In earlier Lucene versions (a long time ago) there was an implicit
> restriction to 10,000 TERMS (not characters). This is no longer the case.
> If you still want this, you have to wrap your Analyzer:
> http://goo.gl/SRf45A
>
> If you have a limit of 10,000 characters somewhere, it might be your
> Tika text extraction.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: sreedevi s [mailto:sreedevi.payikkad@gmail.com]
> > Sent: Tuesday, February 10, 2015 9:53 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene search in attachments
> >
> > Thank you, David. Yes, it has a limit of 10,000 characters.
> > But what can be done for large files in that case?
> >
> > Best Regards,
> > Sreedevi S
> >
> > On Tue, Feb 10, 2015 at 2:04 PM, David Pilato <david@pilato.fr> wrote:
> >
> > > If you don’t index content, you won’t be able to search for it, I guess.
> > > That said, Tika can impose a limit on extracted characters. See
> > > indexedChars below:
> > >
> > > tika().parseToString(new BytesStreamInput(content, false), metadata, indexedChars);
> > >
> > > [1] https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/src/main/java/org/elasticsearch/index/mapper/attachment/AttachmentMapper.java#L456
> > >
> > > --
> > > David Pilato | Technical Advocate | Elasticsearch.com @dadoonet
> > > <https://twitter.com/dadoonet> | @elasticsearchfr <
> > > https://twitter.com/elasticsearchfr> | @scrutmydocs <
> > > https://twitter.com/scrutmydocs>
> > >
> > >
> > >
> > > > On 10 Feb 2015, at 09:24, sreedevi s <sreedevi.payikkad@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >    What is the best method to search in attachments in Lucene? I am
> > > > new to Lucene and I am using version 4.10.2. By making use of Tika,
> > > > I know I can convert files to text and then index the text as another
> > > > field. But for large files that will not be the ideal solution. I
> > > > believe the maximum number of characters per field is 10,000. So what
> > > > would be the ideal method to search attachments then?
> > > >
> > > >
> > > > Best Regards,
> > > > Sreedevi S
> > >
> > >
