nifi-users mailing list archives

From Sven Davison <svendavi...@gmail.com>
Subject Re: RegEx not catching all tags
Date Wed, 01 Jun 2016 10:23:38 GMT
Thanks. I did some more reading, and NiFi's documentation says it only returns the
first match. HOWEVER... the JSON object returned already had an element of tags!

$.entities.hashtags.*.text or... something like that. I got it working late last night!
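
For anyone reading later, here is a rough Python sketch of what that JSONPath selects. It assumes a tweet payload where entities.hashtags is a list of objects that each carry a "text" field; the sample tweet and the helper name hashtag_texts are made up for illustration and are not part of NiFi.

```python
import json

# Hypothetical, trimmed-down tweet payload; real tweets carry many more fields.
tweet_json = """
{
  "text": "Trying out #NiFi with #ApacheKafka today",
  "entities": {
    "hashtags": [
      {"text": "NiFi", "indices": [11, 16]},
      {"text": "ApacheKafka", "indices": [22, 34]}
    ]
  }
}
"""


def hashtag_texts(tweet):
    """Plain-Python equivalent of the JSONPath $.entities.hashtags.*.text."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    # Collect the "text" field of every hashtag object, skipping any without one.
    return [tag["text"] for tag in hashtags if "text" in tag]


tweet = json.loads(tweet_json)
print(hashtag_texts(tweet))
```

Reading the already-parsed hashtags avoids running a regex over the tweet text at all.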



-Sven Davison 
(sent from my iPhone)

> On May 31, 2016, at 10:47 PM, Andy LoPresto <alopresto@apache.org> wrote:
> 
> Hi Sven,
> 
> Are you using an ExtractText processor [1] here? If so, you can extract multiple capture
> groups which will be stored in flowfile attributes such as “regexattr.1”, “regexattr.2”,
> etc. when assigned to the regular expression name “regexattr”.
> 
> Try the regular expression I’ve provided here [2] (explanation available on the site).
> This captures a literal ‘#’, any “word” character one or more times until a word boundary,
> and does this “globally”, aka does not stop searching after the first result. I didn’t
> check exhaustively if hashtags can contain special characters like ‘-’, etc. but that
> should be well-documented by Twitter.
> 
> /(#[\w]+\b)/g
> 
> [1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExtractText/index.html
> [2] https://regex101.com/r/gV3mO5/1
> 
>  
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> 
>> On May 31, 2016, at 3:32 PM, Sven Davison <svendavison@gmail.com> wrote:
>> 
>> 
>> http://prntscr.com/basrzy
>> 
>> the above is a screenshot showing a hashtags var only containing the first instance
>> of a hashtag. i want to get a list of ALL hashtags from twitter.text not just the first one.
>> i'm fairly sure my RegEx is wrong... here's what i have.
>> 
>> (#{1}[a-zA-Z0-9_]*)
>> 
>> i'm using https://regex101.com/ to simulate traffic and tests.. but i can't get it
>> to recognize more than the first instance of the regex.
> 
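
The first-match versus global distinction discussed in this thread can be sketched outside NiFi with Python's re module. This is only an illustration under assumptions: the sample text is invented, and Python's search/findall stand in for the non-global and global modes on regex101; it says nothing about how ExtractText itself evaluates patterns.

```python
import re

# Invented sample text for illustration.
text = "Loving #NiFi and #data_flow at #ApacheCon2016!"

# The original pattern from the thread, and the suggested replacement
# (without the /.../g delimiters, which are not Python syntax).
original_pattern = re.compile(r"(#{1}[a-zA-Z0-9_]*)")
suggested_pattern = re.compile(r"(#[\w]+\b)")

# search() stops at the first match, like a regex run without the global flag.
first_only = original_pattern.search(text).group(1)

# findall() keeps scanning, like the /g flag; with one capture group it
# returns a list of the captured strings.
all_original = original_pattern.findall(text)
all_suggested = suggested_pattern.findall(text)

# The patterns differ on a bare '#': '*' accepts zero trailing characters,
# while '+' requires at least one word character.
bare_original = original_pattern.findall("just a # sign")
bare_suggested = suggested_pattern.findall("just a # sign")

print(first_only)
print(all_original)
print(all_suggested)
print(bare_original, bare_suggested)
```

On this sample both patterns find the same three hashtags once the search is global, which suggests the original expression was not wrong so much as being evaluated for a single match.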
