hbase-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file system (or Hbase)?
Date Tue, 04 Jan 2011 01:05:54 GMT
Right, I should have realized that you guys would be using a good parallel file system.  In
that case an M/R job will be great for moving the data -- as long as you don't overload the
network.  And if the data is going to end up in HBase, you may just write a map job
to insert directly into HBase, either through standard inserts or bulk loads (which should
be 10x faster).

If you have time to play, it might be interesting to see how HBase runs over your current
file system.  Just use a file://<path to shared directory> URL instead of the hdfs://
URL in the hbase-site.xml file.  I would stress test that first with both HBase and M/R jobs
just to make sure that it behaves well, but there would be many people who would love to hear
about your experience here.  I for one would like to know if I can run HBase over Lustre, since
we love Lustre around here.
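
A sketch of what that change might look like, in case it helps. hbase.rootdir is the property that normally holds the hdfs:// URL; the path below is only a placeholder for wherever the shared file system is mounted, and I have not tested this over Lustre:

```xml
<!-- hbase-site.xml fragment; the path is a placeholder, not a real mount point -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///path/to/shared/directory/hbase</value>
  </property>
</configuration>
```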


-----Original Message-----
From: Kevin Fox [mailto:Kevin.Fox@pnl.gov] 
Sent: Monday, January 03, 2011 2:58 PM
To: Buttler, David
Cc: user@hbase.apache.org; 'general@hadoop.apache.org'; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the Hadoop HDFS file
system (or Hbase)?

On Mon, 2011-01-03 at 12:20 -0800, Buttler, David wrote:
> Hi Ron,
> Loading into HDFS and HBase are two different issues.  
> HDFS: if you have a large number of files to load from your NFS file system into HDFS,
> it is not clear that parallelizing the load will help.

It's not NFS. It's a parallel file system.

> You have two sources of bottlenecks: the NFS file system and the HDFS file system.
> In your parallel example, you will likely saturate your NFS file system first.

Unlikely in this case. We're in the unusual position of our archive
cluster being faster than our Hadoop cluster.

> If they are actually local files, then loading them via M/R is a non-starter as you
> have no control over which machine will get a map task.

If the same files are "local" on each node, does it matter? Shouldn't
the map jobs all be scheduled in such a way as to spread out the load?


> Unless all of the machines have files in the same directory and you are just going
> to look in that directory to upload.  Then, it sounds like more of a job for a parallel shell
> command and less of a map/reduce command.
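
For what it's worth, the "parallel shell command" route can be sketched in a few lines of Python. This is only a sketch under assumptions: it takes the one-path-per-line listing file from the original question, shells out to "hadoop fs -copyFromLocal" for each entry, and runs a fixed number of copies at once from a single node. The function names, the worker count, and the injectable runner are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_listing(text):
    """Return the non-empty, stripped paths from a one-path-per-line listing."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_copy_command(src, dest_dir):
    """Build the argv for copying one local file into an HDFS directory."""
    return ["hadoop", "fs", "-copyFromLocal", src, dest_dir]


def upload_all(paths, dest_dir, workers=4, runner=subprocess.call):
    """Copy every path, running at most `workers` copies at a time.

    `runner` receives one argv list and returns an exit code; it defaults to
    subprocess.call and can be swapped out for a dry run.
    Returns a dict mapping each source path to its exit code.
    """
    commands = [build_copy_command(p, dest_dir) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, commands))
    return dict(zip(paths, codes))
```

Passing a runner that only records the commands gives a dry run, which is a cheap way to check the listing before moving hundreds of GB. The worker count is the knob to turn if the network or either file system starts to saturate.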
> HBase: So far my strategy has been to get the files into HDFS first, and then write a
> Map job to load them into HBase.  You can try to do this and see if direct inserts into HBase
> are fast enough for your use case.  But, if you are going to be loading TBs/week then you will likely
> want to investigate the bulk load features.  I haven't yet incorporated that into my workflow
> so I can't offer much advice there. Just be sure your cluster is sized appropriately.  E.g.,
> with your compression turned on in HBase, see how much a 1 GB input file expands to inside
> HBase / HDFS.  That should give you a feeling for how much space you will need for your expected
> data load.
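
The sizing check described above comes down to simple arithmetic, sketched here in Python. The assumptions are mine: the stored size is measured before HDFS replication, the replication factor is the HDFS default of 3, and the function names are hypothetical.

```python
def expansion_ratio(input_bytes, stored_bytes):
    """Ratio of the size inside HBase/HDFS to the size of the raw input file."""
    return stored_bytes / input_bytes


def weekly_disk_need(weekly_input_bytes, ratio, replication=3):
    """Raw disk consumed per week: input size, scaled by the measured
    expansion ratio and by the HDFS replication factor."""
    return weekly_input_bytes * ratio * replication
```

With made-up numbers: if a 1 GB test file ends up occupying 1.5 GB, the ratio is 1.5, and a 2 TB/week load would consume about 9 TB of raw disk per week at replication 3.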
> Dave
> -----Original Message-----
> From: Taylor, Ronald C [mailto:ronald.taylor@pnl.gov] 
> Sent: Tuesday, December 28, 2010 2:05 PM
> To: 'user@hbase.apache.org'; 'general@hadoop.apache.org'
> Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
> Subject: What is the fastest way to get a large amount of data into the Hadoop HDFS file
> system (or Hbase)?
> Folks,
> We plan on uploading large amounts of data on a regular basis onto a Hadoop cluster,
> with HBase operating on top of Hadoop. Figure eventually on the order of multiple terabytes
> per week. So - we are concerned about doing the uploads themselves as fast as possible from
> our native Linux file system into HDFS. Figure files will be in, roughly, the 1 to 300 GB
> Off the top of my head, I'm thinking that doing this in parallel using a Java MapReduce
> program would work fastest. So my idea would be to have a file listing all the data files
> (full paths) to be uploaded, one per line, and then use that listing file as input to a MapReduce
> Each Mapper would then upload one of the data files (using "hadoop fs -copyFromLocal
> <source> <dest>") in parallel with all the other Mappers, with the Mappers operating
> on all the nodes of the cluster, spreading out the file upload across the nodes.
> Does that sound like a wise way to approach this? Are there better methods? Anything
> else out there for doing automated upload in parallel? We would very much appreciate advice
> in this area, since we believe upload speed might become a bottleneck.
>   - Ron Taylor
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
