nifi-users mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Extracting text using RegEx
Date Sat, 11 Jul 2015 15:58:41 GMT
The 0.2.0 release is in the active voting stages.  Here is what is on
that release being voted upon:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12332286

If all goes well with the vote the release should be up in 4-5 days.

Thanks
Joe

On Tue, Jun 23, 2015 at 12:44 PM, Chase Cunningham <chase@thecynja.com> wrote:
> sure can you tell me more about the next release? any other new stuff that
> will be upgraded?
>
>
> On 6/22/15 3:41 PM, Mark Payne wrote:
>>
>> Thanks for the clarification.
>>
>> The ExecuteStreamCommand processor that I was suggesting expects that the
>> data can be streamed directly to the script that it is running. The next
>> version of NiFi (0.2.0-incubating) provides the ability to avoid streaming
>> data to Standard In. This change is available today if you are building
>> from the codebase. If you are just downloading the newest build, it is
>> likely a couple of weeks away from being delivered.
>>
>> With that change, you can use PutFile -> ExecuteStreamCommand so that you
>> write the file to disk, and then
>> use ExecuteStreamCommand to call the script that parses the data. You can
>> then use the ${filename} as one
>> of the parameters to the script in order to tell it which file to run
>> against. From there, you can use GetFile to pick up
>> the result, if you want to bring it back into your NiFi flow, or you can
>> process it however makes sense outside
>> of NiFi.
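
As a rough illustration of the script side of that PutFile -> ExecuteStreamCommand arrangement, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names extract_ips and parse_file are hypothetical, and pulling out IPv4 addresses is just one example of the kind of parsing described elsewhere in this thread.

```python
import re

# Hypothetical pattern: dotted-quad candidates; the octet range is checked separately.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text):
    """Return the unique, valid IPv4 addresses in text, in order of first appearance."""
    found = []
    for candidate in IPV4.findall(text):
        octets = candidate.split(".")
        if all(int(octet) <= 255 for octet in octets) and candidate not in found:
            found.append(candidate)
    return found


def parse_file(path):
    """Read the file at path (the argument the flow would pass in) and parse it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return extract_ips(handle.read())
```

A wrapper would typically take the path from its first command-line argument (sys.argv[1]) and print or store whatever parse_file returns.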
>>
>> Until that change is available, it may be a little more difficult, as the
>> processor wants to stream the content of
>> the FlowFile directly to the script.
>>
>> A possible workaround in the meantime would be to use PutFile ->
>> ReplaceText -> ExecuteStreamCommand and
>> configure ReplaceText to replace the regex ".*" with an empty value. In
>> that case, it won't stream any data
>> to the script, and you can just invoke the script using the filename as a
>> parameter.
>>
>> Does this help at all?
>>
>> Thanks
>> -Mark
>>
>>
>>
>>
>> ----------------------------------------
>>>
>>> Date: Mon, 22 Jun 2015 15:28:17 -0500
>>> From: chase@thecynja.com
>>> To: users@nifi.incubator.apache.org
>>> Subject: Re: Extracting text using RegEx
>>>
>>> 1. nifi does http stuff to get text files
>>> 2. files are put in directory in .txt format
>>> 3. script runs to parse through files, each data point of value is parsed
>>> 4. parsed data is written to files associated with data points inside
>>> 5. data is sent to data repo for future indexing and use
>>>
>>>
>>>
>>> On 6/22/15 3:22 PM, Mark Payne wrote:
>>>>
>>>> Chase,
>>>>
>>>> I want to understand the use case better before I try to offer any
>>>> advice.
>>>>
>>>> So you want to write the FlowFiles to a directory, and then run an
>>>> external script to process those files, correct?
>>>> Then, once the script has run, what does it do with the result? Does it
>>>> write it to a file, write to standard out,
>>>> interact directly with the database, etc?
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> ----------------------------------------
>>>>>
>>>>> Date: Mon, 22 Jun 2015 15:06:47 -0500
>>>>> From: chase@thecynja.com
>>>>> To: users@nifi.incubator.apache.org
>>>>> Subject: Re: Extracting text using RegEx
>>>>>
>>>>> so i have nifi pulling in data in .txt format from about 30 different
>>>>> sites....that data gets dumped to a directory called feedfiles...then i
>>>>> have a script that will parse out the ip's, exe's, domains, etc..so that
>>>>> the parsed stuff can be allocated to a database for indexing...
>>>>>
>>>>> having trouble automating this activity from the nifi standpoint...help
>>>>> is appreciated.
>>>>>
>>>>> On 6/22/15 2:55 PM, Mark Payne wrote:
>>>>>>
>>>>>> Chase,
>>>>>>
>>>>>> You could certainly use the ExecuteStreamCommand processor to
>>>>>> accomplish that.
>>>>>>
>>>>>> You can see the usage guide/documentation for that processor at [1].
>>>>>> Give that a look and
>>>>>> let me know if it meets your needs or not.
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>> [1]
>>>>>> http://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html
>>>>>>
>>>>>>
>>>>>> ----------------------------------------
>>>>>>>
>>>>>>> Date: Mon, 22 Jun 2015 14:21:00 -0500
>>>>>>> From: chase@thecynja.com
>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>> Subject: Re: Extracting text using RegEx
>>>>>>>
>>>>>>> how can one run a script within NIFI to accomplish parsing?
>>>>>>>
>>>>>>> On 6/22/15 12:41 PM, Mark Payne wrote:
>>>>>>>>
>>>>>>>> Srujan,
>>>>>>>>
>>>>>>>> My guess is that the issue you are seeing is due to GetHTTP
>>>>>>>> caching the ETag/LastModified value. When the processor receives
>>>>>>>> the response for an HTTP GET request, it writes the ETag to
>>>>>>>> conf/.httpCache-<processor id>.
>>>>>>>>
>>>>>>>> It does this so that even after a restart of NiFi, we don't keep
>>>>>>>> pulling the same content. If the content changes at any point, it
>>>>>>>> will pull the new version of the content, though.
>>>>>>>>
>>>>>>>> You could trigger it to pull data either by copying and pasting the
>>>>>>>> GetHTTP Processor and letting the new processor pull the data, or
>>>>>>>> you could delete that file from the conf/ directory and restart.
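
The caching behaviour described above can be sketched in a few lines of Python. This is only a model of the idea, not NiFi's code: the JSON cache format, the function names, and the way the remote ETag is passed in as a plain argument are all assumptions for illustration.

```python
import json
import os
import tempfile


def load_cached_etag(cache_path):
    """Return the ETag stored in cache_path, or None if no cache file exists."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, encoding="utf-8") as handle:
        return json.load(handle).get("etag")


def fetch_if_changed(cache_path, remote_etag, remote_body):
    """Return remote_body only when remote_etag differs from the cached one."""
    if load_cached_etag(cache_path) == remote_etag:
        return None  # unchanged since the last pull, so nothing is emitted
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump({"etag": remote_etag}, handle)
    return remote_body


# Hypothetical cache file standing in for conf/.httpCache-<processor id>.
cache = os.path.join(tempfile.mkdtemp(), "httpCache-demo")
first = fetch_if_changed(cache, '"v1"', "feed contents")   # new content is pulled
second = fetch_if_changed(cache, '"v1"', "feed contents")  # same ETag, skipped
os.remove(cache)                                           # like deleting the cache file
third = fetch_if_changed(cache, '"v1"', "feed contents")   # pulled again
```

In this model, removing the cache file is what makes the third call return the content again, which mirrors the suggestion to delete the file and restart.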
>>>>>>>>
>>>>>>>> If this doesn't give you what you need, please feel free to let me
>>>>>>>> know!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Mark
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>>>
>>>>>>>>> From: srujan.kotikela@firehost.com
>>>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>>> Date: Mon, 22 Jun 2015 15:11:18 +0000
>>>>>>>>>
>>>>>>>>> Mark,
>>>>>>>>>
>>>>>>>>> How can I rerun the processors after changing some of the
>>>>>>>>> attributes? For example, when I change the Regex pattern and start
>>>>>>>>> the processors, nothing happens.
>>>>>>>>>
>>>>>>>>> Srujan Kotikela
>>>>>>>>> FireHost - SECURE CLOUD HOSTING
>>>>>>>>> North America | Europe | Asia Pacific
>>>>>>>>>
>>>>>>>>> ComputerWorld: 100 Best Places to Work in IT See Current
>>>>>>>>> Opportunities
>>>>>>>>>
>>>>>>>>> This email and any files transmitted with it are confidential and
>>>>>>>>> intended solely for the use of the individual(s) to whom they are
>>>>>>>>> addressed. Do not disseminate, distribute or copy this e-mail
>>>>>>>>> without explicit permission to do so. Thank you.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mark Payne [mailto:markap14@hotmail.com]
>>>>>>>>> Sent: Thursday, June 18, 2015 1:22 PM
>>>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>>>
>>>>>>>>> Srujan,
>>>>>>>>>
>>>>>>>>> When you pull the file via GetHTTP, it assigns a filename to the
>>>>>>>>> file. You can easily change the filename by using an UpdateAttribute
>>>>>>>>> Processor. Just add a new property with the name "filename" and
>>>>>>>>> whatever value you would like. Then, you can write both to the same
>>>>>>>>> directory.
>>>>>>>>>
>>>>>>>>> With ExtractText, it will route the FlowFile to 'matched' or
>>>>>>>>> 'unmatched' depending on whether or not any regex that you provided
>>>>>>>>> matches. However, if the regex has a capturing group, the text that
>>>>>>>>> is extracted will be just what is captured by that group. For
>>>>>>>>> example, if your regex is ".*good-(bye).*" then it will route any
>>>>>>>>> FlowFile containing "good-bye" to 'matched' but will extract only
>>>>>>>>> the text "bye" because that is what is in the capturing group.
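
The capturing-group behaviour in that example can be tried outside NiFi. The sketch below uses Python's re module rather than the Java regex engine NiFi runs on, and the route function is a hypothetical stand-in for the processor's matched/unmatched decision, so treat it as an approximation.

```python
import re

# The regex from the example above; group 1 is the capturing group.
PATTERN = re.compile(r".*good-(bye).*")


def route(content):
    """Return (relationship, extracted_text) for a piece of content."""
    match = PATTERN.search(content)
    if match is None:
        return ("unmatched", None)
    # Only the text inside the capturing group is extracted.
    return ("matched", match.group(1))


matched = route("well, good-bye then")
unmatched = route("hello again")
```

Here matched holds the relationship name together with just the captured text, while unmatched carries no extracted value at all.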
>>>>>>>>>
>>>>>>>>> Once you have extracted the text, though, it is added to a FlowFile
>>>>>>>>> attribute, not the content. So you will want to use a ReplaceText to
>>>>>>>>> replace the content of the FlowFile before you use PutFile.
>>>>>>>>>
>>>>>>>>> Does this make sense? If not, please let me know where I can help
>>>>>>>>> clarify, and I'll be happy to do so!
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -Mark
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>>
>>>>>>>>>> From: srujan.kotikela@firehost.com
>>>>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>>>> Date: Thu, 18 Jun 2015 18:08:58 +0000
>>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>>
>>>>>>>>>> I am trying to extract some text from a remote file/feed,
>>>>>>>>>> downloaded via HTTP. The flow I am contemplating is like this:
>>>>>>>>>>
>>>>>>>>>> GetHTTP ====> ExtractText == (matched) ==> PutFile
>>>>>>>>>>                    ||
>>>>>>>>>>                (unmatched)
>>>>>>>>>>                    ||
>>>>>>>>>>                    V
>>>>>>>>>>                 PutFile
>>>>>>>>>>
>>>>>>>>>> I am able to create this flow just fine. However, I have the
>>>>>>>>>> following issues:
>>>>>>>>>>
>>>>>>>>>> 1. I noticed that the 'file' configured for the GetHTTP processor
>>>>>>>>>> goes into the 'directory' configured in the 'PutFile' processor.
>>>>>>>>>> This is leading me to save the matched file and unmatched file in
>>>>>>>>>> separate directories. Is there a way to have those 2 files in the
>>>>>>>>>> same directory?
>>>>>>>>>>
>>>>>>>>>> 2. I don't seem to get the RegEx working. The ExtractText
>>>>>>>>>> processor either matches all input or no input. Are there any
>>>>>>>>>> particular guidelines on how to write regex for NiFi?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Srujan Kotikela
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mark Payne [mailto:markap14@hotmail.com]
>>>>>>>>>> Sent: Tuesday, June 16, 2015 7:11 PM
>>>>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>>>>> Subject: RE: Extracting text using RegEx
>>>>>>>>>>
>>>>>>>>>> Srujan,
>>>>>>>>>>
>>>>>>>>>> I'm not sure how familiar you are with NiFi, so just a very quick
>>>>>>>>>> note about terminology to make sure you understand what I'm
>>>>>>>>>> describing. A FlowFile is the basic data record in NiFi. It
>>>>>>>>>> consists of two parts:
>>>>>>>>>> - FlowFile Attributes (Key/Value Pairs that are strings)
>>>>>>>>>> - FlowFile Content (arbitrary stream of bytes)
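
For readers who think in code, the two-part structure above can be modelled roughly as follows. This is an illustrative Python sketch only; the class and field names are assumptions and do not reflect NiFi's actual Java API.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Illustrative model: string key/value attributes plus a byte payload."""
    attributes: dict = field(default_factory=dict)  # e.g. {"filename": "feed.txt"}
    content: bytes = b""                            # arbitrary stream of bytes


example = FlowFile(attributes={"filename": "feed.txt"}, content=b"raw feed bytes")
```

The point of the split is that processors can read or change the attributes without touching the content, and vice versa.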
>>>>>>>>>>
>>>>>>>>>> I think the flow that you would want would look like this:
>>>>>>>>>>
>>>>>>>>>> GetHTTP -> ExtractText -> ReplaceText -> PutFile
>>>>>>>>>>
>>>>>>>>>> ExtractText will then evaluate the regex against the content
>>>>>>>>>> pulled from the HTTP service and put the result in a FlowFile
>>>>>>>>>> Attribute. So let's say you add a property named "desired.text"
>>>>>>>>>> with a value "<body>(.*)</body>". This will create an Attribute
>>>>>>>>>> named "desired.text" and the value of that attribute will be
>>>>>>>>>> whatever is found between the <body> and </body> tags.
>>>>>>>>>>
>>>>>>>>>> We will then use ReplaceText with the following configuration:
>>>>>>>>>> Regular Expression: .+
>>>>>>>>>> Replacement Value: ${desired.text}
>>>>>>>>>> All other properties: defaults.
>>>>>>>>>>
>>>>>>>>>> So what this is doing is replacing the content of the FlowFile
>>>>>>>>>> with the value of the "desired.text" attribute.
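
The extract-then-replace idea can be mimicked in plain Python to see what ends up in the content. The function names below are hypothetical, and the use of re.DOTALL (so that "." also spans line breaks) is an assumption made for this sketch; NiFi evaluates Java regular expressions and its defaults may differ.

```python
import re


def extract_text(content, attributes, name, pattern):
    """Copy capture group 1 of pattern into a new attribute called name."""
    match = re.search(pattern, content, flags=re.DOTALL)
    updated = dict(attributes)
    if match:
        updated[name] = match.group(1)
    return updated


def replace_text(content, attributes, pattern, attribute_name):
    """Replace what pattern matches in content with the named attribute's value."""
    value = attributes.get(attribute_name, "")
    # A function replacement keeps any backslashes in the value literal.
    return re.sub(pattern, lambda _match: value, content, count=1, flags=re.DOTALL)


page = "<html><body>Hello\nworld</body></html>"
attrs = extract_text(page, {}, "desired.text", r"<body>(.*)</body>")
new_content = replace_text(page, attrs, r".+", "desired.text")
```

After these two steps new_content holds only the text that was between the body tags, which is what PutFile would then write out.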
>>>>>>>>>>
>>>>>>>>>> PutFile then writes the file to disk.
>>>>>>>>>>
>>>>>>>>>> Hope this helps! If this doesn't work out for you for some reason,
>>>>>>>>>> or if you've got more questions (or if I misunderstood what you're
>>>>>>>>>> wanting to do), please don't hesitate to shoot back and let me
>>>>>>>>>> know!
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -Mark
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>>>
>>>>>>>>>>> From: srujan.kotikela@firehost.com
>>>>>>>>>>> To: users@nifi.incubator.apache.org
>>>>>>>>>>> CC: aldrin.piri@onyara.com
>>>>>>>>>>> Subject: Extracting text using RegEx
>>>>>>>>>>> Date: Tue, 16 Jun 2015 17:56:38 +0000
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am trying to download a file (using GetHTTP) from a website and
>>>>>>>>>>> extract text from it matching a RegEx pattern (using
>>>>>>>>>>> ExtractText).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am able to download the file using GetHTTP and save it via
>>>>>>>>>>> PutFile. I understand that the ExtractText processor works only
>>>>>>>>>>> with a FlowFile. So I tried generating a flow file from GetHTTP
>>>>>>>>>>> and PutFile (separately), but it doesn't seem to work.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can anyone give me pointers (examples?) on what processors to use
>>>>>>>>>>> to extract text from a file pulled down by GetHTTP and write the
>>>>>>>>>>> matched text to a separate file?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Srujan Kotikela
>>>>>>>>>>>
>>>>>>> --
>>>>>>> Dr. Chase C Cunningham
>>>>>>> CTRC (SW) USN Ret.
>>>>>>> The Cynja LLC Proprietary Business and Technical Information
>>>>>>> CONFIDENTIAL TREATMENT REQUIRED
>>>>>>>
>>>>>
>>>
>>
>
>
>
