hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Wed, 09 Apr 2008 14:24:04 GMT
I appreciate it :-)
Thanks for your feedback Stack.

Once I rectify the few things you mentioned, I will announce it
on the Heritrix mailing list too, as I feel that will be more encouraging
for users to look into it.

Another thing I am working on, which you might want to take a look at,
is converting this schema into a column-oriented design: maybe have a
single table instead of separate web_content and seedlist tables, and
modify the fields
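
As a rough illustration of the single-table idea (a sketch only: the
table names web_content and seedlist come from this thread, but every
family, qualifier, and field name below is hypothetical and not the actual
schema), plain Python dicts can stand in for the tables:

```python
# Hypothetical sketch: dicts model HBase rows as {row_key: {column: value}}.
# None of the field names below are taken from the real schema.

# RDBMS-style layout: two tables, one column family per former SQL column.
seedlist = {
    "http://example.com/": {"url:": "http://example.com/", "status:": "crawled"},
}
web_content = {
    "http://example.com/": {"body:": "<html>...</html>", "mime_type:": "text/html"},
}


def merge_to_single_table(seedlist, web_content):
    """Fold both tables into one keyed by URL, turning each former
    column family into a qualifier under one of two families."""
    merged = {}
    for family_name, table in (("seed", seedlist), ("content", web_content)):
        for url, cells in table.items():
            row = merged.setdefault(url, {})
            for old_family, value in cells.items():
                # "status:" becomes "seed:status", "body:" becomes "content:body"
                row[family_name + ":" + old_family.rstrip(":")] = value
    return merged


crawl = merge_to_single_table(seedlist, web_content)
```

The design point is only that related fields end up as qualifiers inside a
few families on one row, rather than one family per field across two tables.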

-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Tuesday, April 08, 2008 10:49 PM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning


I filed a couple of issues against your bundle (smile) up on googlecode.

Here are a few other notes:

Why is the src not checked in?  When I browse to the 'source' tab, there
is nothing there.

You've bundled jars that are LGPL (fastutil, archive-overlay, etc.).  
The archive ones are supposedly going to be relicensed as Apache but
I've not heard that is the case for fastutil.

Do you want to put the writer into the org.archive package rather than
an aol package, or whatever package you use when developing software
outside of aol time?

That's nice that you include a tool to create tables (make a note in the
doc that this exists -- and the doc should include a description of the
schema you're using).

Is it possible to filter outlinks -- i.e. run outlinks through a
Heritrix filter (maybe it's not) -- rather than do it here inside your
writer? Same for canonicalizing?  If not, could you add the hook to call
filters?  (Not important).

Does it work?

Great stuff Ankur (Would suggest announcing on Heritrix list too -- you
might get feedback from there).


Goel, Ankur wrote:
> Hi Stack
>      I uploaded the heritrix2-hbase-writer code here 
> http://heritrix2-hbase-writer.googlecode.com/files/heritrix2.0-hbase-writer.jar
> The jar size is 15 MB as it has all the necessary libraries to build 
> the writer code.
> The actual code is split in 5 files.
> Do take a look.
> Thanks
> -Ankur
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Friday, March 28, 2008 1:06 PM
> To: hbase-user@hadoop.apache.org
> Subject: RE: HBase performance tuning
> Thanks for posting the code, Stack. One thing that I saw missing in my 
> code is the use of a writer pool.
> I'll incorporate that in my code and make some other changes as well. 
> There shouldn't be any issues in contributing the updated code 
> except for converting the schema to make it column oriented. At the 
> moment it's a simple RDBMS schema converted directly to an HBase schema
> by substituting column name with column family.
> I'll try to reduce it to make it fit the column-oriented design. Feel 
> free to suggest changes if you like.
> The details have been mentioned in a post before.
> Thanks
> -Ankur
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Friday, March 28, 2008 11:54 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
> Goel, Ankur wrote:
>>  ...
>> I'll check and let you know if the code can be contributed.
>> Once I get a green light, I'll make some modifications to make it more 
>> generic and share it with you folks to understand how we can improve it 
>> further before posting.
> A while back, I had a go at making such a Writer: see 
> http://www.duboce.net/~stack/hbase-writer.tgz.  It's old, probably 
> won't work w/ current hbase -- I haven't tried it -- and it's for 
> Heritrix 1.x generation but shouldn't be hard to update.  When I left 
> it, I was trying to mavenize it and was to put needed jars -- hadoop,
> etc. -- up on the Archive's build box.   Publishing such a Writer is a little 
> awkward given the different licenses.  Having maven pull jars seemed 
> like one way of working within the constraints imposed by licensing 
> (Archive is apparently moving toward Apache licensing which should 
> alleviate at least the above issue).
> St.Ack
>> Thanks
>> -Ankur
>> -----Original Message-----
>> From: stack [mailto:stack@duboce.net]
>> Sent: Thursday, March 27, 2008 10:08 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: HBase performance tuning
>> I have some familiarity with that crawler.
>> Tell us more about your writer.   Is it proprietary?  If not, can we get
>> it into a place where others could use it if wanted?
>> Thanks,
>> St.Ack
>> Goel, Ankur wrote:
>>> I am crawling the web indeed, but only the sites that are present in
>>> my seedlist. The crawler used here is heritrix 2.0 - 
>>> http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
>>> I developed a Heritrix-specific HBase writer that can be integrated 
>>> with Heritrix to write the crawled content directly into HBase.
>>> -Ankur
