hbase-user mailing list archives

From Håkon Sagehaug <hakon.sageh...@googlemail.com>
Subject Re: Using HBase for storing biology data and querying it
Date Fri, 16 Apr 2010 10:22:09 GMT
Hi

Thanks for the tips. I also thought about Hive, but there are so many new
things in the Hadoop ecosystem. I've started using Hive on my local machine
and am looking forward to testing it on my small Hadoop cluster.

cheers, Håkon

On 15 April 2010 20:38, Tim Robertson <timrobertson100@gmail.com> wrote:

> I think I agree with Jesper... HBase does not seem the best fit to me
> since you are concerned with batch scanning and transformation, rather
> than single record access.
>
> If you chose MapReduce you would do something like this (a rough sketch
> follows the list):
> - provide the filter (column6, greaterThan, 0.1) and pass it around in
> the job config for the Mappers to read during their setup phase
> - map() would apply the filter and pass on only the data meeting the criteria
> - reduce would do nothing, I think
> - you'd write a custom output format which generates the HDF file.
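>
> As a rough, hand-wavy sketch of the mapper side (the class name, config
> keys and the whitespace split are just assumptions of mine; the custom
> HDF output format is left out):
>
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> // Passes through only the lines that meet the user-supplied cutoff.
> // The column index and threshold are read from the job configuration
> // in setup(), so a filter like "column 6 > 0.1" reaches every mapper.
> public class CutoffFilterMapper
>     extends Mapper<LongWritable, Text, NullWritable, Text> {
>
>   private int column;        // 1-based column to test
>   private double threshold;  // keep lines where value > threshold
>
>   @Override
>   protected void setup(Context context) {
>     column = context.getConfiguration().getInt("ld.filter.column", 6);
>     threshold = context.getConfiguration()
>         .getFloat("ld.filter.threshold", 0.1f);
>   }
>
>   @Override
>   protected void map(LongWritable offset, Text line, Context context)
>       throws IOException, InterruptedException {
>     String[] fields = line.toString().trim().split("\\s+");
>     if (fields.length >= column
>         && Double.parseDouble(fields[column - 1]) > threshold) {
>       context.write(NullWritable.get(), line);  // matching lines only
>     }
>   }
> }
>
> Run it with zero reduce tasks (or an identity reducer) and let the custom
> output format write the HDF file.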
>
> However....
>
> This screams to me for Hive.  With Hive, you load the tab-delimited
> (csv-like) file into Hadoop HDFS.  Then you create a table (just like in
> your favourite DB), and then you issue SQL against it, with the results
> going to another file (think SQL on top of a CSV file).  The output
> would then be turned into your HDF file as you already do.  Hive builds
> a query plan from the SQL and launches MapReduce jobs to do the work
> for you.
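>
> Purely as a sketch of the idea (the table name, column names, HDFS path
> and the Hive JDBC details below are my own assumptions; the same two
> statements can just as well be typed into the Hive CLI):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> import java.sql.Statement;
>
> // Defines a Hive table over the raw tab-delimited LD files already in
> // HDFS, then runs the cutoff query against it.
> public class LdDataHiveSketch {
>   public static void main(String[] args) throws Exception {
>     // Assumes a Hive server listening on localhost:10000 and the Hive
>     // JDBC driver on the classpath.
>     Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
>     Connection con = DriverManager.getConnection(
>         "jdbc:hive://localhost:10000/default", "", "");
>     Statement stmt = con.createStatement();
>
>     // External table over the existing files; c1..c9 are placeholder
>     // names, and the delimiter/location must match your data.
>     stmt.executeQuery(
>         "CREATE EXTERNAL TABLE IF NOT EXISTS ld_data ("
>         + "c1 BIGINT, c2 BIGINT, c3 STRING, c4 STRING, c5 STRING, "
>         + "c6 DOUBLE, c7 DOUBLE, c8 DOUBLE, c9 DOUBLE) "
>         + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
>         + "LOCATION '/data/ld_data'");
>
>     // The user's cutoff becomes a plain WHERE clause.  (An INSERT
>     // OVERWRITE DIRECTORY '...' SELECT ... would instead write the
>     // result straight back to HDFS for the HDF conversion.)
>     ResultSet rs = stmt.executeQuery("SELECT * FROM ld_data WHERE c6 > 0.1");
>     while (rs.next()) {
>       System.out.println(rs.getString(1));  // hand these rows to the HDF writer
>     }
>     con.close();
>   }
> }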
>
> I was doing custom MapReduce jobs for the same thing until I discovered
> Hive.  It really is very easy to use, and this 15-minute video will
> explain a lot: http://vimeo.com/3598672
>
> Hope this helps,
> Tim
>
>
>
>
> On Thu, Apr 15, 2010 at 12:27 PM, Jesper Utoft <jesper.utoft@gmail.com>
> wrote:
> > Hey.
> >
> > First off, I have only been playing around with HBase and Hadoop in
> > school, so I have no in-depth knowledge of it.
> >
> > I think you should not use HBase but just store the files in HDFS
> > directly, and then make these HDF files using a map/reduce job in some way.
> >
> > Just my 2 cents.
> >
> > Cheers.
> >
> > 2010/4/15 Håkon Sagehaug <hakon.sagehaug@googlemail.com>
> >
> >> Hi
> >>
> >> Does anyone have any input on my question?
> >>
> >> Håkon
> >>
> >> 2010/4/9 Håkon Sagehaug <hakon.sagehaug@googlemail.com>
> >>
> >> > Hi all,
> >> >
> >> > I work on a project where we need to deal with different types of
> >> > biology data. For the first case, where I'm now investigating whether
> >> > HBase is something we might use, the scenario is like this.
> >> >
> >> > The raw text data is public, so we can download it and store it as
> >> > regular files. The content looks like this (columns 1 through 9):
> >> >
> >> >   1     2    3      4         5      6   7   8    9
> >> >
> >> > 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
> >> > 24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
> >> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> >> > 24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> >> > 24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> >> > 24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >> >
> >> > One file is normally between 1-2 GB (20-30 million lines), and we have
> >> > between 23-60 files. The data is something called LD data, if anyone is
> >> > interested. To store this better we've turned all these files into an
> >> > HDF file, which is a binary format; this can then be handed over to
> >> > applications using LD data in the analysis of biology problems. The
> >> > reason why we're thinking of HBase for storing the raw text files is
> >> > that we want to offer users the ability to trigger the creation of
> >> > these HDF files themselves, based on a cutoff value from one of the two
> >> > last columns in the file as input. Right now we just turn the whole
> >> > file into an HDF, and then the application receiving the file deals
> >> > with the cutoff. So a "query" from a user who needs the lines with a
> >> > value of column 6 > 0.1 gets
> >> >
> >> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
> >> > 24915 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
> >> > 24915 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
> >> > 24915 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
> >> >
> >> > Is this something that sounds reasonable to use HBase for? I guess I
> >> > could also use Hadoop and do a map/reduce job, but I'm not sure how to
> >> > define the map and/or the reduce step for this. Would the best approach
> >> > maybe be to go through the files and map column 4 (the first rs ID),
> >> > which can be looked at as a key, to a list of its lines over the cutoff
> >> > (a rough sketch of such a job follows after the example below)? The map
> >> > for the query above would then be
> >> >
> >> >
> >> > < rs2003280,    {
> >> >     24915 50733 CHB rs4079417 1.0 0.130 0.09 0
> >> >     }
> >> > >
> >> >
> >> >
> >> > <rs2003282,    {
> >> >     24915 59354 CHB rs1500098 1.0 0.157 0.91 0,
> >> >     24915 61880 CHB rs11063263 1.0 0.157 0.91 0,
> >> >     24915 62481 CHB rs10774263 1.0 0.157 0.91 0
> >> >     }
> >> > >
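> >> >
> >> > Just to make my own idea concrete, I imagine something like the sketch
> >> > below (the class names and the hard-coded cutoff are only illustrations):
> >> >
> >> > import java.io.IOException;
> >> > import org.apache.hadoop.io.LongWritable;
> >> > import org.apache.hadoop.io.Text;
> >> > import org.apache.hadoop.mapreduce.Mapper;
> >> > import org.apache.hadoop.mapreduce.Reducer;
> >> >
> >> > // Keys every line that passes the cutoff by its rs ID, so the reducer
> >> > // sees the list of lines per SNP, like in the map written out above.
> >> > public class RsIdGrouping {
> >> >
> >> >   public static class RsIdMapper
> >> >       extends Mapper<LongWritable, Text, Text, Text> {
> >> >     @Override
> >> >     protected void map(LongWritable offset, Text line, Context ctx)
> >> >         throws IOException, InterruptedException {
> >> >       String[] f = line.toString().trim().split("\\s+");
> >> >       if (f.length >= 6 && Double.parseDouble(f[5]) > 0.1) { // column 6
> >> >         ctx.write(new Text(f[3]), line);  // rs ID (column 4) -> line
> >> >       }
> >> >     }
> >> >   }
> >> >
> >> >   public static class RsIdReducer
> >> >       extends Reducer<Text, Text, Text, Text> {
> >> >     @Override
> >> >     protected void reduce(Text rsId, Iterable<Text> lines, Context ctx)
> >> >         throws IOException, InterruptedException {
> >> >       for (Text l : lines) {
> >> >         ctx.write(rsId, l);  // one grouped block of lines per rs ID
> >> >       }
> >> >     }
> >> >   }
> >> > }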
> >> >
> >> > If HBase were to be used, I'm a bit unsure how the data should best be
> >> > structured. One way is to store one row per line in the file, but that
> >> > is maybe not the best. Another might be something like this, for the
> >> > first line in the example above (a small sketch of writing such a row
> >> > follows below):
> >> >
> >> > rs2003280{
> >> >                  col1:24915 = 24915,
> >> >                  col2:31643 = 31643,
> >> >                  col4:rs1500095 = rs1500095,
> >> >                  col4:rs7299571 = rs7299571,
> >> >                  col4:rs4079417 = rs4079417,
> >> >                  value:1=1.0,
> >> >                  value:2=0.0,
> >> >                  value:3=0.02,
> >> >                  value:4=0,
> >> > }
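> >> >
> >> > Only as a sketch of what I mean (the table and column family names
> >> > below are made up, and I'm not sure this is a sensible layout):
> >> >
> >> > import org.apache.hadoop.hbase.HBaseConfiguration;
> >> > import org.apache.hadoop.hbase.client.HTable;
> >> > import org.apache.hadoop.hbase.client.Put;
> >> > import org.apache.hadoop.hbase.util.Bytes;
> >> >
> >> > // Writes the first example line under the rs ID as row key, with one
> >> > // made-up column family per group of fields.
> >> > public class LdDataPut {
> >> >   public static void main(String[] args) throws Exception {
> >> >     HTable table = new HTable(new HBaseConfiguration(), "ld_data");
> >> >     Put put = new Put(Bytes.toBytes("rs2003280"));
> >> >     put.add(Bytes.toBytes("pos"), Bytes.toBytes("1"), Bytes.toBytes("24915"));
> >> >     put.add(Bytes.toBytes("pos"), Bytes.toBytes("2"), Bytes.toBytes("31643"));
> >> >     put.add(Bytes.toBytes("pair"), Bytes.toBytes("rs1500095"),
> >> >         Bytes.toBytes("rs1500095"));
> >> >     put.add(Bytes.toBytes("value"), Bytes.toBytes("1"), Bytes.toBytes("1.0"));
> >> >     put.add(Bytes.toBytes("value"), Bytes.toBytes("2"), Bytes.toBytes("0.0"));
> >> >     put.add(Bytes.toBytes("value"), Bytes.toBytes("3"), Bytes.toBytes("0.02"));
> >> >     put.add(Bytes.toBytes("value"), Bytes.toBytes("4"), Bytes.toBytes("0"));
> >> >     table.put(put);
> >> >     table.flushCommits();  // make sure the write is pushed out
> >> >   }
> >> > }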
> >> >
> >> >
> >> >
> >> > As you can all see, I've got some questions; I'm still in the process
> >> > of grasping the HBase and Hadoop concepts.
> >> >
> >> > cheers, Håkon
> >> >
> >>
> >
>
