nifi-users mailing list archives

From: Bryan Bende <>
Subject: Re: Fetch Contents of HDFS Directory as a Part of a Larger Flow
Date: Thu, 03 May 2018 16:18:07 GMT
The two-step idea makes sense...

If you did want to go with the OS call, you would probably want to use
ExecuteStreamCommand rather than ExecuteProcess, since it does accept an
incoming connection.
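
A rough sketch of what that listing call could look like (untested, Python
just for illustration; assumes the hadoop client is on the PATH of the NiFi
node, and the directory is a placeholder):

    # Shell out to the Hadoop client and pull the full paths out of the
    # listing output. Skips the "Found N items" header row.
    import subprocess

    def list_hdfs_paths(directory):
        out = subprocess.check_output(["hadoop", "fs", "-ls", directory])
        paths = []
        for line in out.decode("utf-8").splitlines():
            fields = line.split()
            if len(fields) >= 8:  # data rows have 8 columns, path is last
                paths.append(fields[-1])
        return paths

    print(list_hdfs_paths("/tmp/extract_dir"))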

On Thu, May 3, 2018 at 12:06 PM, Shawn Weeks <> wrote:
> I'm thinking about ways to do the operation in two steps, where the first
> request starts the process of generating the data and returns a UUID, and
> the second request can check on the status and download the file. Still have
> to work out how to collect the output from the Hive table, so I'll look at
> the REST calls. Not sure of a good way to make an OS call, as ExecuteProcess
> doesn't support inputs either.
> Thanks
> Shawn
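
That two-step shape could be as simple as an in-memory registry keyed by the
UUID (a sketch only, not NiFi code; a real version would need persistence,
cleanup, and the actual async kickoff):

    # First request registers a job and returns a UUID; the second request
    # polls that UUID for status and, once finished, the file to download.
    import uuid

    jobs = {}  # job id -> {"status": ..., "path": ...}

    def start_extract(query):
        job_id = str(uuid.uuid4())
        jobs[job_id] = {"status": "RUNNING", "path": None}
        # ... kick off the CTAS for `query` asynchronously here ...
        return job_id

    def check_status(job_id):
        return jobs.get(job_id, {"status": "UNKNOWN", "path": None})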
> ________________________________
> From: Bryan Bende <>
> Sent: Thursday, May 3, 2018 10:51:03 AM
> To:
> Subject: Re: Fetch Contents of HDFS Directory as a Part of a Larger Flow
> Another option would be if the Hadoop client was installed on the NiFi
> node then you could use one of the script processors to make a call to
> "hadoop fs -ls ...".
> If the response is so large that it requires the heavy lifting of writing
> out temp tables to HDFS, then fetching those files into NiFi, and most
> likely merging them into a single response flow file, is that really
> expected to happen in the context of a single web request/response?
> On Thu, May 3, 2018 at 11:45 AM, Pierre Villard <> wrote:
>> Hi Shawn,
>> If you know the path of the files to retrieve in HDFS, you could use the
>> FetchHDFS processor.
>> If you need to retrieve all the files within the directory created by Hive,
>> I guess you could list the existing files by calling the WebHDFS REST API
>> and then use the FetchHDFS processor.
>> Not sure that's the best solution to your requirement though.
>> Pierre
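
For reference, the WebHDFS listing Pierre mentions looks roughly like this
(stdlib-only sketch; the namenode host/port and directory are placeholders):

    # List a directory over the WebHDFS REST API and build the full path of
    # each entry; each resulting path could then feed FetchHDFS.
    import json
    import urllib.request

    def list_directory(namenode, directory):
        url = "http://%s/webhdfs/v1%s?op=LISTSTATUS" % (namenode, directory)
        with urllib.request.urlopen(url) as resp:
            statuses = json.load(resp)["FileStatuses"]["FileStatus"]
        return [directory.rstrip("/") + "/" + s["pathSuffix"] for s in statuses]

    for path in list_directory("namenode.example.com:50070", "/tmp/extract_dir"):
        print(path)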
>> 2018-05-03 17:35 GMT+02:00 Shawn Weeks <>:
>>> I'm building a REST service with the HTTP Request and Response processors
>>> to support data extracts from Hive. Since some of the extracts can be
>>> quite large, using the SelectHiveQL processor isn't a performant option,
>>> and instead I'm trying to use on-demand Hive temporary tables to do the
>>> heavy lifting via CTAS (Create Table As Select). Since GetHDFS doesn't
>>> support an incoming connection, I'm trying to figure out another way to
>>> fetch the files Hive creates and return them as a download in the web
>>> service. Has anyone else worked out a good solution for fetching the
>>> contents of a directory from HDFS as part of a larger flow?
>>> Thanks
>>> Shawn
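
The CTAS step Shawn describes would be along these lines (hedged sketch;
PyHive is just one way to issue the statement, and the hostname, table, and
query are placeholders):

    # Materialize the query result as files in HDFS via CTAS; the files
    # backing the new table are what the flow would then fetch and stream
    # back as the download.
    from pyhive import hive

    conn = hive.connect(host="hiveserver2.example.com", port=10000)
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMPORARY TABLE extract_result "
        "STORED AS TEXTFILE "
        "AS SELECT * FROM source_table"
    )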
