nifi-users mailing list archives

From Adam Williams <aaronfwilli...@outlook.com>
Subject RE: CSV to Mongo
Date Tue, 22 Sep 2015 14:43:57 GMT
Thank you Bryan & everyone.  I will check out that template, looks perfect for me!

Date: Tue, 22 Sep 2015 09:36:27 -0400
Subject: Re: CSV to Mongo
From: joe.witt@gmail.com
To: users@nifi.apache.org

There aren't any plans.  But it's an awesome idea and would make a great JIRA.
Thanks

Joe
On Sep 22, 2015 9:31 AM, "Jonathan Lyons" <jonathan@jaroop.com> wrote:
Speaking of CSV to JSON conversion, is there any interest in implementing schema inference
in general, and specifically schema inference for CSV files? This is something that was added
to spark-csv recently (https://github.com/databricks/spark-csv/pull/93). Any thoughts?
On Tue, Sep 22, 2015 at 9:16 AM, Bryan Bende <bbende@gmail.com> wrote:
Andrew,
If you are interested in the ExtractText+ReplaceText approach, I posted an example template
that shows how to convert a line from a CSV file to a JSON document [1].
The first part of the flow is just for testing and generates a flow file with the content
set to "a,b,c,d". ExtractText then pulls those values into attributes (csv.1, csv.2, csv.3,
csv.4), and ReplaceText uses them to build a JSON document.
-Bryan
[1] https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates  (CsvToJson)
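
As a minimal sketch of that configuration (illustrative names and values; the exact
properties in the posted template may differ, and ReplaceText's property names vary a bit
across NiFi versions):

    ExtractText -- add a user-defined property whose value is a regex;
    each capture group becomes a numbered attribute:
      csv = (.+),(.+),(.+),(.+)        ->  csv.1, csv.2, csv.3, csv.4

    ReplaceText -- build the JSON document with expression language:
      Regular Expression:  (?s:^.*$)
      Replacement Value:   {"field1":"${csv.1}","field2":"${csv.2}","field3":"${csv.3}","field4":"${csv.4}"}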

On Mon, Sep 21, 2015 at 4:40 PM, Bryan Bende <bbende@gmail.com> wrote:
Yup, Joe beat me to it, but I was going to suggest those options...
In the second case, you would probably use SplitText to get each line of the CSV as a FlowFile,
then ExtractText to pull out every value of the line into attributes, then ReplaceText would
construct a JSON document using expression language to access the attributes from ExtractText.
On Mon, Sep 21, 2015 at 4:33 PM, Joe Witt <joe.witt@gmail.com> wrote:
Adam, Bryan,

Could do the CSV to Avro processor and then follow it with the Avro to
JSON processor.  Alternatively, could use ExtractText to pull the
fields as attributes and then use ReplaceText to produce a JSON
output.

Thanks
Joe
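
A minimal sketch of that first route (assuming the Kite-based ConvertCSVToAvro processor,
which needs an Avro record schema, plus ConvertAvroToJSON; the schema here is illustrative):

    GetFile
      --> ConvertCSVToAvro    (Record schema, e.g.:
                               {"type":"record","name":"row","fields":[
                                 {"name":"field1","type":"string"},
                                 {"name":"field2","type":"string"}]})
      --> ConvertAvroToJSON
      --> PutMongo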



On Mon, Sep 21, 2015 at 4:21 PM, Adam Williams <aaronfwilliams@outlook.com> wrote:

> Bryan,
>
> Thanks for the feedback.  I stripped the ExtractText and tried routing all
> unmatched traffic to Mongo as well, hence the CSV import problems.  Off the
> top of my head I do not think MongoDB allows CSV inserts through the Java
> client; we've always had to work with the JSON/document model for it.  For a
> CSV format, it would have to be similar to this idea:
> https://github.com/AdoptOpenJDK/javacountdown/blob/master/src/main/java/org/adoptopenjdk/javacountdown/ImportGeoData.java
>
> So looking at the other processors in NiFi, is there a way then to move from
> a CSV format to JSON before putting to Mongo?
>
> ________________________________
> Date: Mon, 21 Sep 2015 16:09:10 -0400
> Subject: Re: CSV to Mongo
> From: bbende@gmail.com
> To: users@nifi.apache.org

>
> Adam,
>
> I was able to import the full template, thanks. A couple of things...
>
> The ExtractText processor works by adding user-defined properties (the +
> icon in the top-right of the properties window) where the property name is a
> destination attribute and the value is a regular expression.
> Right now there aren't any regular expressions defined, so that processor
> will always route the file to 'unmatched'. Generally you would probably want
> to route the matched files to the next processor, and then auto-terminate
> the unmatched relationship (assuming you want to filter out non-matches).
>
> Do you know if MongoDB supports inserting a CSV file through their Java
> client? Do you have similar code that already does this in Storm?
>
> I am honestly not that familiar with MongoDB, but the PutMongo processor
> takes the incoming data and calls:
> Document doc = Document.parse(new String(content, charset));
>
> Looking at that Document.parse() method, it looks like it expects a JSON
> document, so I just want to make sure that we expect CSV insertions to work
> here.
> In researching this, it looks like Mongo has a bulk import utility that
> handles CSV [1], but it is a command-line utility.
>
> -Bryan
>
> [1] http://docs.mongodb.org/manual/reference/program/mongoimport/
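
To illustrate Bryan's point, a minimal standalone sketch against the MongoDB Java driver's
org.bson.Document (the exception type below is from the 3.x driver):

    import org.bson.Document;

    public class ParseCheck {
        public static void main(String[] args) {
            // Valid JSON parses into a Document, as PutMongo expects:
            Document ok = Document.parse("{\"a\": 1, \"b\": \"two\"}");
            System.out.println(ok.get("a")); // prints 1

            // A raw CSV line is not a JSON document, so parse() throws
            // org.bson.json.JsonParseException before any insert happens:
            Document bad = Document.parse("a,b,c,d");
        }
    }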

>
> On Mon, Sep 21, 2015 at 3:19 PM, Adam Williams <aaronfwilliams@outlook.com>
> wrote:

>
> Sorry about that, this should work.  Attached the template and the below
> error:
>
> 2015-09-21 14:36:02,821 ERROR [Timer-Driven Process Thread-10]
> o.a.nifi.processors.mongodb.PutMongo
> PutMongo[id=480877a4-f349-4ef7-9538-8e3e3e108e06] Failed to insert
> StandardFlowFileRecord[uuid=bbd7048f-d5a1-4db4-b938-da64b67e810e,claim=org.apache.nifi.controller.repository.claim.StandardContentClaim@8893ae38,offset=0,name=GDELT.MASTERREDUCEDV2.TXT,size=6581409407]
> into MongoDB due to java.lang.NegativeArraySizeException:
> java.lang.NegativeArraySizeException
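
A plausible cause, as a sketch (this assumes PutMongo buffers the entire flow file content
into a single byte array, which is not confirmed here): the file is 6,581,409,407 bytes,
larger than any Java array can hold, so casting that length down to an int overflows to a
negative value:

    public class Overflow {
        public static void main(String[] args) {
            long size = 6581409407L;              // flow file size from the log above
            System.out.println((int) size);       // prints -2008525185
            byte[] buffer = new byte[(int) size]; // throws NegativeArraySizeException
        }
    }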

>
> ________________________________
> Date: Mon, 21 Sep 2015 15:12:43 -0400
> Subject: Re: CSV to Mongo
> From: bbende@gmail.com
> To: users@nifi.apache.org
>
> Adam,
>
> I imported the template and it looks like it only captured the PutMongo
> processor. Can you try deselecting everything on the graph and creating the
> template again so we can take a look at the rest of the flow? Or if you have
> other stuff on your graph, select all of the processors you described so
> they all get captured.
>
> Also, can you provide any of the stack trace for the exception you are
> seeing? The log is in NIFI_HOME/logs/nifi-app.log
>
> Thanks,
>
> Bryan
>
> On Mon, Sep 21, 2015 at 3:03 PM, Bryan Bende <bbende@gmail.com> wrote:
>
> Adam,
>
> Thanks for attaching the template, we will take a look and see what is going
> on.
>
> Thanks,
>
> Bryan
>
> On Mon, Sep 21, 2015 at 2:50 PM, Adam Williams <aaronfwilliams@outlook.com>
> wrote:
>
> Hey Joe,
>
> Sure thing.  I attached the template, I'm just taking the GDELT data set for
> the GetFile processor, which works.  The error I get is a negative array.
>
>> Date: Mon, 21 Sep 2015 14:24:50 -0400
>> Subject: Re: CSV to Mongo
>> From: joe.witt@gmail.com
>> To: users@nifi.apache.org
>>
>> Adam,
>>
>> Regarding moving from Storm to NiFi, I'd say they make better teammates
>> than competitors. The use case outlined above should be quite easy
>> for NiFi, but there are analytic/processing functions Storm is probably
>> a better answer for. We're happy to help explore that with you as you
>> progress.
>>
>> If you ever run into an ArrayIndexOutOfBoundsException, then it will
>> always be 100% a coding error. Would you mind sending your
>> flow.xml.gz over or making a template of the flow (assuming it
>> contains nothing sensitive)? If at all possible, sample data which
>> exposes the issue would be ideal. As an alternative, can you go ahead
>> and send us the resulting stack trace/error that comes out?
>>
>> We'll get this addressed.
>>
>> Thanks
>> Joe
>>
>> On Mon, Sep 21, 2015 at 2:17 PM, Adam Williams
>> <aaronfwilliams@outlook.com> wrote:
>> > Hello,
>> >
>> > I'm moving from Storm to NiFi and trying to do a simple test with
>> > getting a large CSV file dumped into MongoDB. The CSV file has a header
>> > with column names and it is structured; my only problem is dumping it
>> > into MongoDB. At a high level, do the following processor steps look
>> > correct? All I want is to just pull the whole CSV file over to MongoDB
>> > without a regex or anything fancy (yet). I eventually always seem to hit
>> > trouble with array index problems with the PutMongo processor:
>> >
>> > GetFile --> ExtractText --> RouteOnAttribute (not a null line) -->
>> > PutMongo
>> >
>> > Does that seem to be the right way to do this in NiFi?
>> >
>> > Thank you,
>> > Adam