hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Tue, 08 Apr 2008 13:42:02 GMT
Hi Stack
     I uploaded the heritrix2-hbase-writer code here:
http://heritrix2-hbase-writer.googlecode.com/files/heritrix2.0-hbase-writer.jar

The jar is 15 MB because it bundles all the libraries needed to build
the writer code.
The actual code is split across 5 files.

Do take a look.

Thanks
-Ankur


-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Friday, March 28, 2008 1:06 PM
To: hbase-user@hadoop.apache.org
Subject: RE: HBase performance tuning

Thanks for posting the code, Stack. One thing that I saw missing in my
code is the use of a writer pool.
I'll incorporate that in my code and make some other changes as well. 
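
For readers unfamiliar with the idea, a writer pool can be sketched
roughly as below. This is only an illustration and not the code from
either writer: the class name WriterPool, its methods, and the generic
type W (standing in for whatever writer object the crawler uses) are
all assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; W stands in for the crawler's writer type. */
final class WriterPool<W> {
    private final BlockingQueue<W> idle;

    /** Creates `size` writers up front using the supplied factory. */
    WriterPool(int size, Supplier<W> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks until a writer is free, then hands it to the caller. */
    W borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a writer", e);
        }
    }

    /** Returns a previously borrowed writer so another thread can reuse it. */
    void giveBack(W writer) {
        idle.offer(writer);
    }

    /** Number of writers currently idle in the pool. */
    int available() {
        return idle.size();
    }
}
```

Each crawler thread would borrow a writer, write its record, and give
the writer back in a finally block, so that threads reuse a bounded set
of writers rather than each creating its own.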

There shouldn't be any issues in contributing the updated code
except for converting the schema to make it column-oriented. At the
moment it's a simple RDBMS schema converted directly to an HBase schema
by substituting each column name with a column family. 

I'll try to reduce it to make it fit the column-oriented design. Feel
free to suggest changes if you like.
The details were given in an earlier post.
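
To make the difference concrete, here is a rough sketch of the two
layouts, using the family:qualifier form of HBase column names. It is
one possible reading of "column-oriented", not the actual schema: the
field names (url, content_type), the family name "meta", and the
class and method names are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: every name in this class is hypothetical. */
final class ColumnNaming {

    /** Direct conversion: each relational column becomes its own family, e.g. "url:". */
    static Map<String, String> familyPerColumn(Map<String, String> row) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            cells.put(e.getKey() + ":", e.getValue());
        }
        return cells;
    }

    /** Grouped layout: one shared family, relational column as qualifier, e.g. "meta:url". */
    static Map<String, String> groupedUnderFamily(String family, Map<String, String> row) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            cells.put(family + ":" + e.getKey(), e.getValue());
        }
        return cells;
    }
}
```

With the grouped layout the number of column families stays small and
new fields can be added as qualifiers without touching the table
definition.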

Thanks
-Ankur

-----Original Message-----
From: stack [mailto:stack@duboce.net]
Sent: Friday, March 28, 2008 11:54 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Goel, Ankur wrote:
>  ...
>
> I'll check and let you know if the code can be contributed.
> Once I get a green light, I'll make some modifications to make it more 
> generic and share with you folks to understand how we can improve it 
> further before posting.
>   

A while back, I had a go at making such a Writer: see
http://www.duboce.net/~stack/hbase-writer.tgz.  It's old, probably won't
work w/ current hbase -- I haven't tried it -- and it's for Heritrix 1.x
generation but shouldn't be hard to update.  When I left it, I was
trying to mavenize it and was to put needed jars -- hadoop, etc. -- up 
on the Archive's build box.   Publishing such a Writer is a little 
awkward given the different licenses.  Having maven pull jars seemed
like one way of working within the constraints imposed by licensing
(Archive is apparently moving toward Apache licensing which should
alleviate at least the above issue).

St.Ack

> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Thursday, March 27, 2008 10:08 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> I have some familiarity with that crawler.
>
> Tell us more about your writer.   Is it proprietary?  If not, can we get
> it into a place where others could use it if wanted?
>
> Thanks,
> St.Ack
>
>
> Goel, Ankur wrote:
>   
>> I am crawling the web indeed, but only the sites that are present in 
>> my seedlist. The crawler used here is Heritrix 2.0 - 
>> http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
>>
>> I developed a Heritrix specific HBase writer that can be integrated 
>> with Heritrix to write the crawled content directly into HBase.
>>
>> -Ankur
>>   
>>     

