lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Query fields with data of certain length
Date Thu, 01 Feb 2018 04:42:40 GMT
Hi,

Have you managed to work out a regex for this Chinese string: 预支款管理及账务处理办法?

Regards,
Edwin
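
For reference, both lengths mentioned in this thread for the example string can be reproduced outside Solr. This is a minimal Python sketch, assuming the text is encoded as UTF-8; the helper name utf8_length is illustrative only and is not part of Solr.

```python
def utf8_length(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


subject = "预支款管理及账务处理办法"

char_count = len(subject)          # number of Unicode code points
byte_count = utf8_length(subject)  # number of UTF-8 bytes

print(char_count, byte_count)  # prints "12 36"
```

Each character in this string takes 3 bytes in UTF-8, which accounts for the gap between 12 and 36.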


On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Emir,
>
> An example of the string in Chinese is 预支款管理及账务处理办法
>
> It has 12 characters, but the expected length is 36 bytes.
>
> Regards,
> Edwin
>
>
> On 4 January 2018 at 16:21, Emir Arnautović <emir.arnautovic@sematext.com>
> wrote:
>
>> Hi Edwin,
>> I don’t know enough about East Asian languages to say what number you
>> expect when you ask for the string length. You could try some of the
>> regex Unicode settings and see whether they give you what you need: set
>> the Unicode flag with (?U), or use regex groups and ranges. If you provide
>> an example string and its expected length, we may be able to provide a regex.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > wrote:
>> >
>> > Hi Emir,
>> >
>> > So this would likely differ from what the operating system counts, as
>> > the operating system may treat each Chinese character as 3 to 4 bytes.
>> > That is probably why I could not find any record with
>> > subject:/.{255,}.*/
>> >
>> > Are there other tools we can use to query the length of data that is
>> > already indexed and is not in standard English (e.g. Chinese,
>> > Japanese, etc.)?
>> >
>> > Regards,
>> > Edwin
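
For a one-time check like this, one option is to measure the byte length outside Solr. The following is a rough Python sketch, assuming the subject field is stored and the documents have already been exported as dictionaries; the function name over_byte_limit and the sample documents are hypothetical, and no Solr API is used here.

```python
from typing import Iterable, Iterator


def over_byte_limit(
    docs: Iterable[dict], field: str = "subject", limit: int = 255
) -> Iterator[dict]:
    """Yield documents whose `field` value exceeds `limit` bytes in UTF-8."""
    for doc in docs:
        value = doc.get(field) or ""  # treat a missing field as empty
        if len(value.encode("utf-8")) > limit:
            yield doc


# Hypothetical sample data standing in for exported documents.
docs = [
    {"id": "1", "subject": "a" * 255},   # 255 bytes, not over the limit
    {"id": "2", "subject": "a" * 256},   # 256 bytes
    {"id": "3", "subject": "法" * 86},   # 86 characters, 258 bytes
]

long_ids = [doc["id"] for doc in over_byte_limit(docs)]
print(long_ids)  # prints "['2', '3']"
```

Document 3 has only 86 characters, so a character-based pattern such as .{255,} would not match it even though it exceeds 255 bytes.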
>> >
>> > On 3 January 2018 at 23:51, Emir Arnautović <emir.arnautovic@sematext.com>
>> > wrote:
>> >
>> >> Hi Edwin,
>> >> I do not know, but my guess is that the regex counts each character as 1,
>> >> regardless of how many bytes it takes in the encoding used.
>> >>
>> >> Regards,
>> >> Emir
>> >> --
>> >> Monitoring - Log Management - Alerting - Anomaly Detection
>> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
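
The per-character counting described in this guess can be illustrated with Python's re module. Note the assumption: Python's regex engine is not the one Lucene uses, so this only demonstrates the general idea that a quantifier such as {n,} counts characters rather than bytes.

```python
import re

subject = "预支款管理及账务处理办法"

# The quantifier counts characters: exactly 12 of them match.
matches_12_chars = re.fullmatch(r".{12}", subject) is not None

# Asking for 36 or more characters fails, although the string is 36 bytes in UTF-8.
matches_36_plus = re.fullmatch(r".{36,}", subject) is not None

print(matches_12_chars, matches_36_plus)  # prints "True False"
```

If a field held only 3-byte characters, exceeding 255 bytes would correspond to at least 86 characters (86 × 3 = 258), but for mixed English and Chinese text no single character count maps cleanly to a byte limit.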
>> >>
>> >>
>> >>
>> >>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>> wrote:
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> I am searching existing data that has already been indexed, and it
>> >>> is likely to be a one-time task.
>> >>>
>> >>> The query subject:/.{255,}.*/ works for English characters. However,
>> >>> some of the records contain Chinese characters. Their length seems to
>> >>> be more than 255, but they do not show up in the results.
>> >>>
>> >>> Do you know how the length is determined for Chinese characters and
>> >>> other languages?
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>>
>> >>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafalov@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Do that during indexing as Emir suggested. Specifically, use an
>> >>>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
>> >>>> processors:
>> >>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>> >>>>
>> >>>> Regards,
>> >>>>  Alex.
>> >>>>
>> >>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> >>>> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> I would like to check whether it is possible to query for a field
>> >>>>> whose data exceeds a certain length.
>> >>>>>
>> >>>>> For example, I want to find records whose subject field holds more
>> >>>>> than 255 bytes. Is it possible?
>> >>>>>
>> >>>>> I am currently using Solr 6.5.1.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Edwin
>> >>>>
>> >>
>> >>
>>
>>
>
