nifi-users mailing list archives

From Matt Burgess <mattyb...@apache.org>
Subject Re: Simple CSV to Parquet without Hadoop
Date Wed, 15 Aug 2018 19:43:29 GMT
I don't think you have to install Hadoop on Windows to get it to work,
just winutils.exe, placed wherever the Hadoop client looks for it (that
location should be configurable via an environment variable or system
property).

There are pre-built binaries [1] for various versions of Hadoop. Even
though you'll be writing to a local file system, you'll want to match
the version of winutils.exe to the version of Hadoop your NiFi bundles
(usually 2.7.3 for slightly older NiFi versions, or 3.0.0 for the
latest version(s), I think) for best results.
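
For anyone hitting this later: as I understand it, the Hadoop client resolves winutils.exe as %HADOOP_HOME%\bin\winutils.exe (or via the hadoop.home.dir JVM system property), so a minimal setup might look like this. Paths here are examples, not requirements:

```
rem Assumed layout: C:\hadoop\bin\winutils.exe, with the binary taken
rem from [1] and matched to your Hadoop version.
set HADOOP_HOME=C:\hadoop

rem Alternatively, set the system property for NiFi's JVM in
rem conf\bootstrap.conf (the "N" in java.arg.N is a placeholder index):
rem   java.arg.N=-Dhadoop.home.dir=C:\hadoop
```

Restart NiFi after setting either of these so the JVM picks it up.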

Regards,
Matt

[1] https://github.com/steveloughran/winutils

On Wed, Aug 15, 2018 at 3:23 PM scott <tcots8888@gmail.com> wrote:
>
> Just tested in my CentOS VM; it worked like a charm without Hadoop. I'll open a Jira bug on PutParquet, since it doesn't seem to run on Windows.
> Still not sure what I can do. Converting our production Windows NiFi install to Docker would be a major effort.
> Has anyone heard of a Parquet writer tool I can download and call from NiFi?
>
> On Wed, Aug 15, 2018 at 12:01 PM, Mike Thomsen <mikerthomsen@gmail.com> wrote:
>>
>> > Mike, that's a good tip. I'll test that, but unfortunately, I've already committed to Windows.
>>
>> You can run both Docker and the standard NiFi docker image on Windows.
>>
>> On Wed, Aug 15, 2018 at 2:52 PM scott <tcots8888@gmail.com> wrote:
>>>
>>> Mike, that's a good tip. I'll test that, but unfortunately, I've already committed to Windows.
>>> What about a script? Is there some tool you know of that can just be called by NiFi to convert an input CSV file to a Parquet file?
>>>
>>> On Wed, Aug 15, 2018 at 8:32 AM, Mike Thomsen <mikerthomsen@gmail.com> wrote:
>>>>
>>>> Scott,
>>>>
>>>> You can also try Docker on Windows. Something like this should work:
>>>>
>>>> docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest
>>>>
>>>> I don't have Windows either, but Docker seems to work fine for my colleagues that have to use it on Windows. That should bridge C:\nifi_temp and /opt/data_output between host and container and remap localhost:8080 to the container on 8080 so you don't have to mess with a Hadoop client just to try out some Parquet stuff.
>>>>
>>>> Mike
>>>>
>>>> On Wed, Aug 15, 2018 at 11:20 AM scott <tcots8888@gmail.com> wrote:
>>>>>
>>>>> Thanks Bryan. I'll give the Hadoop client a try.
>>>>>
>>>>> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <bbende@gmail.com> wrote:
>>>>>>
>>>>>> I think there is a good chance that installing the Hadoop client would
>>>>>> solve the issue, but I can't say for sure since I don't have a Windows
>>>>>> machine to test.
>>>>>>
>>>>>> The processor depends on the Apache Parquet Java client library which
>>>>>> depends on Apache Hadoop client [1], and the Hadoop client has a
>>>>>> limitation on Windows where it requires something additional.
>>>>>>
>>>>>> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 15, 2018 at 10:16 AM, scott <tcots8888@gmail.com> wrote:
>>>>>> > If I install a Hadoop client on my NiFi host, would I be able to get past this error?
>>>>>> > I don't understand why this processor depends on Hadoop. Other projects like Drill and Spark don't have such a dependency to be able to write Parquet files.
>>>>>> >
>>>>>> > On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella
>>>>>> > <gardellajuanpablo@gmail.com> wrote:
>>>>>> >>
>>>>>> >> It's a warning. You can ignore that.
>>>>>> >>
>>>>>> >> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <bbende@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>> Scott,
>>>>>> >>>
>>>>>> >>> Sorry, I did not realize the Hadoop client would be looking for this winutils.exe when running on Windows.
>>>>>> >>>
>>>>>> >>> On Linux and macOS you don't need anything external installed outside of NiFi, so I wasn't expecting this.
>>>>>> >>>
>>>>>> >>> Not sure if there is any other good option here regarding Parquet.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>>
>>>>>> >>> Bryan
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Tue, Aug 14, 2018 at 5:31 PM, scott <tcots8888@gmail.com> wrote:
>>>>>> >>> > Hi Bryan,
>>>>>> >>> > I'm fine if I have to trick the API, but don't I still need Hadoop installed somewhere? After creating the core-site.xml as you described, I get the following errors:
>>>>>> >>> >
>>>>>> >>> > Failed to locate the winutils binary in the hadoop binary path
>>>>>> >>> > IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
>>>>>> >>> > Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>> >>> > Failed to write due to java.io.IOException: No FileSystem for scheme
>>>>>> >>> >
>>>>>> >>> > BTW, I'm using NiFi version 1.5
>>>>>> >>> >
>>>>>> >>> > Thanks,
>>>>>> >>> > Scott
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <bbende@gmail.com> wrote:
>>>>>> >>> >>
>>>>>> >>> >> Scott,
>>>>>> >>> >>
>>>>>> >>> >> Unfortunately, the Parquet API itself is tied to the Hadoop FileSystem object, which is why NiFi can't read and write Parquet directly to flow files (i.e. it doesn't provide a way to read/write to/from Java input and output streams).
>>>>>> >>> >>
>>>>>> >>> >> The best you can do is trick the Hadoop API into using the local file system by creating a core-site.xml with the following:
>>>>>> >>> >>
>>>>>> >>> >> <configuration>
>>>>>> >>> >>     <property>
>>>>>> >>> >>         <name>fs.defaultFS</name>
>>>>>> >>> >>         <value>file:///</value>
>>>>>> >>> >>     </property>
>>>>>> >>> >> </configuration>
>>>>>> >>> >>
>>>>>> >>> >> That will make PutParquet or FetchParquet work with your local file system.
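
For completeness: the processor picks that file up through its Hadoop Configuration Resources property, so the settings end up looking something like this (property names as in NiFi 1.x; the paths are examples only):

```
Hadoop Configuration Resources : C:\nifi\conf\core-site.xml
Directory                      : C:\nifi_output
```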
>>>>>> >>> >>
>>>>>> >>> >> Thanks,
>>>>>> >>> >>
>>>>>> >>> >> Bryan
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >> On Tue, Aug 14, 2018 at 3:22 PM, scott <tcots8888@gmail.com> wrote:
>>>>>> >>> >> > Hello NiFi community,
>>>>>> >>> >> > Is there a simple way to read CSV files and write them out as Parquet files without Hadoop? I run NiFi on Windows and don't have access to a Hadoop environment. I'm trying to write the output of my ETL in a compressed and still query-able format. Is there something I should be using instead of Parquet?
>>>>>> >>> >> >
>>>>>> >>> >> > Thanks for your time,
>>>>>> >>> >> > Scott
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>
>
