nifi-users mailing list archives

From Bryan Bende <bbe...@gmail.com>
Subject Re: Bulk inserting into HBase with NiFi
Date Tue, 06 Jun 2017 23:40:52 GMT
Mike,

With the record-oriented processors that have come out recently, a
good solution would be to implement a PutHBaseRecord processor that would
have a Record Reader configured. This way the processor could read in a
large CSV without having to convert it to individual JSON documents.

One thing to consider is how many records/puts to send in a single call to
HBase. Assuming multi-GB CSV files, you'll want to send portions at a time
to avoid having the whole content in memory (some kind of record batch size
property), but then you also have to deal with what happens when things
fail halfway through. If the puts are idempotent then it may be fine to
route the whole flow file to failure and try again even if some data was
already inserted.
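
To make the batching idea concrete, here is a rough sketch in plain Java.
It is not NiFi or HBase code: Row, streamInBatches and the sink callback
are hypothetical names standing in for a record, the processor's read loop
and the call that would send a list of puts to HBase. It also assumes the
first CSV column is the row key and splits naively on commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class BatchedCsvPut {

    /** Hypothetical stand-in for one HBase put: a row key plus the remaining CSV fields. */
    public static final class Row {
        public final String rowKey;
        public final List<String> values;

        Row(String rowKey, List<String> values) {
            this.rowKey = rowKey;
            this.values = values;
        }
    }

    /**
     * Reads CSV lines one at a time and hands them to the sink in groups of at
     * most batchSize, so only one batch is held in memory at any moment.
     * Returns the number of rows read. If the sink throws, the exception
     * propagates and the caller decides whether to retry the whole input.
     */
    public static long streamInBatches(BufferedReader reader, int batchSize,
                                       Consumer<List<Row>> sink) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Row> batch = new ArrayList<>(batchSize);
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Naive split: assumes no quoted commas in the data.
            String[] fields = line.split(",", -1);
            List<String> values = new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            batch.add(new Row(fields[0], values));
            total++;
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // pass a copy, then reuse the buffer
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
        }
        return total;
    }
}
```

In a real processor the sink would be whatever sends the puts to HBase, and
a failure thrown from it would be the point where the flow file gets routed
to failure. Because earlier batches have already been written by then, a
retry is only safe if the puts are idempotent.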

Feel free to create a JIRA for hbase record processors, or I can do it
later.

Hope that helps.

-Bryan


On Tue, Jun 6, 2017 at 7:21 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:

> We have a very large body of CSV files (well over 1TB) that need to be
> imported into HBase. For a single 20GB segment, we are looking at having to
> push easily 100M flowfiles into HBase and most of the JSON files generated
> are rather small (like 20-250 bytes).
>
> It's going very slowly, and I assume that is because we're taxing the disk
> very heavily because of the content and provenance repositories coming into
> play. So I'm wondering if anyone has a suggestion on a good NiFiesque way
> of solving this. Right now, I'm considering two options:
>
> 1. Looking for a way to inject the HBase controller service into an
> ExecuteScript processor so I can handle the data in large chunks (splitting
> text and generating a List<Put> inside the processor myself and doing one
> huge Put)
>
> 2. Creating a library that lets me generate HFiles from within an
> ExecuteScript processor.
>
> What I really need is something fast within NiFi that would let me
> generate huge blocks of updates for HBase and push them out. Any ideas?
>
> Thanks,
>
> Mike
>
