lucene-java-user mailing list archives

From David Pilato <da...@pilato.fr>
Subject Re: Lucene search in attachments
Date Tue, 10 Feb 2015 08:56:13 GMT
I don’t understand.
If you don’t raise this limit to a higher value (or set it to -1 to remove it), not all of
the text will be extracted, so only a subset of it will be indexed.
Non-indexed parts of the text won’t be searchable.
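
To illustrate the effect of the limit, here is a minimal sketch in plain Java. It uses neither Tika nor Lucene: `IndexedCharsDemo`, `extractText`, `index` and `isSearchable` are hypothetical names made up for this example, and the "index" is just a set of whitespace-separated terms. It only assumes that a negative limit means "no limit", as with the -1 mentioned above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class IndexedCharsDemo {

    // Hypothetical stand-in for a text extractor with a character limit.
    // A negative limit means "no limit" (assumption mirroring the -1 above).
    public static String extractText(String content, int indexedChars) {
        if (indexedChars < 0 || indexedChars >= content.length()) {
            return content;
        }
        return content.substring(0, indexedChars);
    }

    // Toy "index": the set of lowercased, whitespace-separated terms.
    public static Set<String> index(String text) {
        Set<String> terms = new HashSet<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // A term is searchable only if it survived extraction and was indexed.
    public static boolean isSearchable(String content, int indexedChars, String term) {
        return index(extractText(content, indexedChars))
                .contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String content = "alpha beta gamma delta";
        System.out.println(isSearchable(content, 10, "alpha")); // true: within the first 10 chars
        System.out.println(isSearchable(content, 10, "delta")); // false: cut off by the limit
        System.out.println(isSearchable(content, -1, "delta")); // true: no limit applied
    }
}
```

With a limit of 10, only "alpha beta" is extracted, so "delta" never reaches the index and cannot be found. With -1 the whole text is indexed.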

Did I misunderstand your question?

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr <https://twitter.com/elasticsearchfr>
| @scrutmydocs <https://twitter.com/scrutmydocs>



> On 10 Feb 2015, at 09:52, sreedevi s <sreedevi.payikkad@gmail.com> wrote:
> 
> Thank you, David. Yes, it has a limit of 10000 characters.
> But for large files, what can be done in that case?
> 
> Best Regards,
> Sreedevi S
> 
> On Tue, Feb 10, 2015 at 2:04 PM, David Pilato <david@pilato.fr> wrote:
> 
>> If you don’t index the content, you won’t be able to search for it, I guess.
>> That said, Tika can apply a limit on the number of extracted characters.
>> See indexedChars below:
>> 
>> tika().parseToString(new BytesStreamInput(content, false), metadata,
>> indexedChars);
>> 
>> [1]
>> https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/src/main/java/org/elasticsearch/index/mapper/attachment/AttachmentMapper.java#L456
>> 
>>> On 10 Feb 2015, at 09:24, sreedevi s <sreedevi.payikkad@gmail.com> wrote:
>>> 
>>> Hi,
>>>   What is the best way to search in attachments in Lucene? I am new
>>> to Lucene and I am using version 4.10.2. Using Tika, I know I
>>> can convert files to text and then index that text as another field.
>>> But for large files that will not be the ideal solution. I believe the
>>> maximum number of characters per field is 10,000. So what would be the
>>> ideal method to search attachments then?
>>> 
>>> 
>>> Best Regards,
>>> Sreedevi S
>> 
>> 

