hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Fri, 28 Mar 2008 07:35:57 GMT
Thanks for posting the code, stack. One thing that I saw missing in my
code is the use of a writer pool.
I'll incorporate that in my code and make some other changes as well.
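
For readers unfamiliar with the idea, a writer pool can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the code from either writer in this thread: the class name WriterPool, its methods, and the generic writer type W are all hypothetical, and W stands in for whatever object actually writes to HBase.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; W is a placeholder for the real writer type. */
class WriterPool<W> {
    private final BlockingQueue<W> idle;

    /** Creates all writers up front so crawler threads never build their own. */
    WriterPool(int size, Supplier<W> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks until a writer is free, then hands it to the caller. */
    W borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a writer", e);
        }
    }

    /** Returns a borrowed writer; throws if more writers come back than were created. */
    void release(W writer) {
        idle.add(writer);
    }

    /** Number of writers currently idle. */
    int available() {
        return idle.size();
    }
}
```

A crawler thread would call borrow(), write its record, and call release() in a finally block, so the number of open writers stays bounded by the pool size.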

There shouldn't be any issues in contributing the updated code,
except for converting the schema
to make it column-oriented. At the moment it's a simple RDBMS schema
converted directly to an HBase
schema by replacing each column name with a column family.

I'll try to reduce it to make it fit the column-oriented design. Feel
free to suggest changes if you like.
The details have been mentioned in a post before.
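
The two schema shapes discussed above can be sketched as follows. This is an illustration only: the class SchemaMapping, its methods, and the column names used with it are hypothetical (the real schema is in the earlier post), and it assumes HBase column names take the "family:qualifier" form.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper contrasting the two ways of naming HBase columns. */
class SchemaMapping {

    /** Direct conversion: every relational column becomes its own family, e.g. "url:". */
    static Map<String, String> familyPerColumn(List<String> columns) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String column : columns) {
            out.put(column, column + ":");
        }
        return out;
    }

    /** Column-oriented: related columns become qualifiers under a shared family. */
    static Map<String, String> groupedUnderFamilies(Map<String, List<String>> families) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> family : families.entrySet()) {
            for (String column : family.getValue()) {
                out.put(column, family.getKey() + ":" + column);
            }
        }
        return out;
    }
}
```

With made-up columns url and fetch_time, the first method yields "url:" and "fetch_time:" (two families), while grouping both under a family named meta yields "meta:url" and "meta:fetch_time" (one family).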

Thanks
-Ankur

-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Friday, March 28, 2008 11:54 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Goel, Ankur wrote:
>  ...
>
> I'll check and let you know if the code can be contributed.
> Once I get a green light, I'll make some modifications to make it more 
> generic and share with you folks to understand how we can improve it 
> further before posting.
>   

A while back, I had a go at making such a Writer: see
http://www.duboce.net/~stack/hbase-writer.tgz.  It's old and probably won't
work w/ current hbase -- I haven't tried it -- and it's for the Heritrix 1.x
generation, but it shouldn't be hard to update.  When I left it, I was
trying to mavenize it and was going to put the needed jars -- hadoop, etc. --
up on the Archive's build box.  Publishing such a Writer is a little
awkward given the different licenses.  Having maven pull the jars seemed
like one way of working within the constraints imposed by licensing
(Archive is apparently moving toward Apache licensing, which should
alleviate at least the above issue).

St.Ack

> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net] 
> Sent: Thursday, March 27, 2008 10:08 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> I have some familiarity with that crawler.
>
> Tell us more about your writer.  Is it proprietary?  If not, can we get
> it into a place where others could use it if wanted?
>
> Thanks,
> St.Ack
>
>
> Goel, Ankur wrote:
>   
>> I am indeed crawling the web, but only the sites that are present in 
>> my seedlist. The crawler used here is Heritrix 2.0 - 
>> http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
>>
>> I developed a Heritrix-specific HBase writer that can be integrated 
>> with Heritrix to write the crawled content directly into HBase.
>>
>> -Ankur
>>   
>>     

