lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Query fields with data of certain length
Date Thu, 04 Jan 2018 08:21:04 GMT
Hi Edwin,
I don’t know enough about East Asian languages to say what number you should expect when
you ask for the string length. You could try some of the regex Unicode settings and see whether
you get what you need: try setting the Unicode flag with (?U), or try using regex groups and ranges.
If you provide an example string and its expected length, maybe we could suggest a regex.
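
To make the character-versus-byte question concrete, here is a minimal sketch in plain Python (not Solr). It measures one string three ways and then applies the same .{255,} pattern to it. The sample subject and the helper names are made up for illustration, and Python's re module only stands in for Solr's regexp query here, on the assumption (the guess quoted further down) that the regex counts each character as 1.

```python
import re


def length_report(text: str) -> dict:
    """Measure one string three ways (hypothetical helper for illustration)."""
    return {
        # Unicode code points, which is what Python's regex engine counts
        "code_points": len(text),
        # UTF-16 code units, which is what Java's String.length() reports
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 bytes, roughly what a byte-oriented tool would report
        "utf8_bytes": len(text.encode("utf-8")),
    }


def matches_min_length(text: str, minimum: int) -> bool:
    """Return True if the whole string has at least `minimum` characters."""
    # DOTALL lets "." match newlines as well, so only the count matters.
    return re.fullmatch(".{%d,}" % minimum, text, re.DOTALL) is not None


# Hypothetical subject: 200 Chinese characters, 3 bytes each in UTF-8.
sample = "主题" * 100
print(length_report(sample))
print(matches_min_length(sample, 255))
```

Under these assumptions the sample is 600 bytes long in UTF-8 but only 200 characters long, so a pattern that requires 255 characters does not match it even though a byte count would put it well above 255.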

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Hi Emir,
> 
> So this would likely be different from what the operating system counts, as
> the operating system may count each Chinese character as 3 to 4 bytes.
> That is probably why I could not find any records with subject:/.{255,}.*/
> 
> Are there other tools that we can use to query the length of data that is
> already indexed and is not in standard English? (E.g.
> Chinese, Japanese, etc.)
> 
> Regards,
> Edwin
> 
> On 3 January 2018 at 23:51, Emir Arnautović <emir.arnautovic@sematext.com>
> wrote:
> 
>> Hi Edwin,
>> I do not know, but my guess would be that each character is counted as 1
>> in regex, regardless of how many bytes it takes in the encoding used.
>> 
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> 
>>> Thanks for the reply.
>>> 
>>> I am doing the search on existing data that has already been indexed, and
>>> it is likely to be a one time thing.
>>> 
>>> This  subject:/.{255,}.*/  works for English characters. However, there are
>>> Chinese characters in some of the records. Their length seems to be more than
>>> 255, but they do not show up in the results.
>>> 
>>> Do you know how the length of text in Chinese and other languages is
>>> determined?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafalov@gmail.com>
>>> wrote:
>>> 
>>>> Do that during indexing as Emir suggested. Specifically, use an
>>>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
>>>> processors:
>>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>>>> 
>>>> Regards,
>>>>  Alex.
>>>> 
>>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> I would like to check whether it is possible to query for a field whose data
>>>>> is longer than a certain length.
>>>>> 
>>>>> For example, I want to find records where the field subject has more than 255
>>>>> bytes. Is that possible?
>>>>> 
>>>>> I am currently using Solr 6.5.1.
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>> 
>> 
>> 
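
The index-time suggestion quoted above (clone the field, then replace the copy with its length) can be roughly emulated on the client side. The following is only a sketch in plain Python, not the Solr processors themselves: the function name and the subject_length and subject_bytes field names are hypothetical. As far as I know, FieldLengthUpdateProcessorFactory stores the length of the Java CharSequence, which is a count of UTF-16 code units rather than bytes, so the sketch records a UTF-8 byte count as a separate value.

```python
def add_length_fields(doc: dict, source: str = "subject") -> dict:
    """Return a copy of `doc` with length fields derived from `source`.

    Hypothetical helper: mimics cloning a field and replacing the clone
    with its length before the document is sent for indexing.
    """
    enriched = dict(doc)
    value = doc.get(source)
    if not isinstance(value, str):
        # Nothing to measure; leave the document unchanged.
        return enriched
    # UTF-16 code units, comparable to Java's CharSequence.length()
    enriched[source + "_length"] = len(value.encode("utf-16-le")) // 2
    # UTF-8 byte count, closer to the "more than 255 bytes" question
    enriched[source + "_bytes"] = len(value.encode("utf-8"))
    return enriched


# Hypothetical document with a 200-character Chinese subject.
doc = add_length_fields({"id": "1", "subject": "主题" * 100})
print(doc["subject_length"], doc["subject_bytes"])
```

If a value like subject_bytes were indexed as an integer field, a range query such as subject_bytes:[256 TO *] should find the long records. For data that is already indexed, this would presumably mean re-indexing the documents once.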

