hbase-user mailing list archives

From maxjar10 <jcuz...@gmail.com>
Subject HBase schema for crawling
Date Sun, 05 Jul 2009 00:21:23 GMT

Hi All,

I am developing a schema that will be used for crawling. All of the examples
that I have seen to date use a webcrawl table that looks like the below:

Table: webcrawl
rowkey                details                                   family
com.yahoo.www    lastFetchDate:timestamp          content:somedownloadedpage

I understand wanting to use the rowkey in reverse domain order so that it's
easy to recrawl all of a specific site including its subdomains. However,
it seems inefficient to scan through a large table looking for
"lastFetchDate" where you want to refetch the page.

In my case I'm less concerned with recrawling a particular domain than with
efficiently locating the URLs that are due for a recrawl because I haven't
crawled them in a while. So I'm considering a row key that leads with the
date instead:

rowkey                     family
20090630;com.google.www    contents:somedownloadedgooglepage
20090701;com.yahoo.www     contents:somedownloadedyahoopage

Because rows are sorted by key, a scan from the start of the table would
reach the oldest dates first, so I could quickly find the content that needs
recrawling and be sure the most stale items are recrawled first.
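To make the key layout concrete, here is a minimal sketch of how such a key could be built. The class name CrawlRowKeys and the helpers rowKey and reverseDomain are illustrative names, not part of HBase, and the ";" separator simply follows the table above. It assumes the date is written as yyyyMMdd so that keys sort chronologically as plain strings.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not an HBase class.
final class CrawlRowKeys {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, which sorts
    // lexicographically in chronological order.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // "www.yahoo.com" becomes "com.yahoo.www"
    static String reverseDomain(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    // Date first, then the reversed host, separated by ';' as in the table above.
    static String rowKey(LocalDate fetchDate, String host) {
        return DAY.format(fetchDate) + ";" + reverseDomain(host);
    }

    public static void main(String[] args) {
        // prints 20090701;com.yahoo.www
        System.out.println(rowKey(LocalDate.of(2009, 7, 1), "www.yahoo.com"));
    }
}
```

One consequence of this layout is that all pages sharing a date land next to each other, with hosts grouped in reverse-domain order within each day.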

Now, here's the dilemma I have... When I create a MapReduce job to go
through each row in the above I want to schedule the url to be recrawled
again at some date in the future. For example,

// Simple pseudocode
Map( row, rowResult )
      BatchUpdate update = new BatchUpdate( row.get() );
      update.put( "contents:content", downloadPage( pageUrl ) );
      // ???? No idea how to do this
      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) );

1) Does HBase allow you to update the key for a row? Are HBase row keys immutable?

2) If I can't update a key what's the easiest way to copy a row and assign
it a different key?

3) What are the implications of updating or deleting rows in a table that you
are currently scanning as part of the MapReduce job?

It seems to me that I may want both a map and a reduce phase: the map phase
would record the rows that I fetched, and the reduce phase would then take
those rows, re-add each one under a key built from its nextFetchDate, and
remove the old row.
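The re-add-then-remove step could look roughly like the sketch below. To keep it self-contained, a java.util.TreeMap stands in for the sorted table; this is not the HBase client API, and RescheduleSketch and moveRow are made-up names for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for the idea only; a real job would go through the HBase client.
final class RescheduleSketch {
    // Copies the row stored under oldKey to newKey, then removes oldKey.
    // Returns false if there is no row under oldKey.
    static boolean moveRow(SortedMap<String, Map<String, String>> table,
                           String oldKey, String newKey) {
        Map<String, String> columns = table.get(oldKey);
        if (columns == null) {
            return false;
        }
        if (oldKey.equals(newKey)) {
            return true; // nothing to move
        }
        table.put(newKey, new HashMap<>(columns)); // write the new row first
        table.remove(oldKey);                      // only then drop the old one
        return true;
    }

    public static void main(String[] args) {
        SortedMap<String, Map<String, String>> table = new TreeMap<>();
        Map<String, String> columns = new HashMap<>();
        columns.put("contents:content", "somedownloadedyahoopage");
        table.put("20090701;com.yahoo.www", columns);

        moveRow(table, "20090701;com.yahoo.www", "20090708;com.yahoo.www");
        System.out.println(table.keySet());
    }
}
```

Writing the new row before deleting the old one means that a failure between the two steps leaves a duplicate row rather than a lost one, which seems like the safer failure mode for a crawler.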

I would probably want to do this process in phases (e.g. scan only 5,000
rows at a time) so that if my Mapper died for any reason I could address the
issue and, in the worst case, lose only the work that I had done on 5,000
rows.
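The phased scan might be sketched as below, again with a sorted in-memory map standing in for the table rather than any HBase API. BatchScanSketch, nextBatch, and the checkpoint variable are hypothetical names; the idea is just to remember the last key of each finished batch and resume strictly after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory illustration of batch-at-a-time scanning with a resume point.
final class BatchScanSketch {
    // Returns up to batchSize row keys strictly after resumeAfter,
    // or from the start of the table when resumeAfter is null.
    static List<String> nextBatch(NavigableMap<String, String> table,
                                  String resumeAfter, int batchSize) {
        NavigableMap<String, String> remaining =
                (resumeAfter == null) ? table : table.tailMap(resumeAfter, false);
        List<String> batch = new ArrayList<>();
        for (String key : remaining.keySet()) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(key);
        }
        return batch;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("20090630;com.google.www", "somedownloadedgooglepage");
        table.put("20090701;com.yahoo.www", "somedownloadedyahoopage");
        table.put("20090702;org.apache.www", "somedownloadedapachepage");

        String checkpoint = null; // a real job would persist this somewhere durable
        List<String> batch;
        while (!(batch = nextBatch(table, checkpoint, 2)).isEmpty()) {
            System.out.println("processing " + batch);
            checkpoint = batch.get(batch.size() - 1);
        }
    }
}
```

With a batch size of 5,000, a crash would cost at most the batch in flight, since everything up to the saved checkpoint is already done. The sketch does not modify the table while scanning, so it sidesteps question 3 above.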


View this message in context: http://www.nabble.com/HBase-schema-for-crawling-tp24339168p24339168.html
Sent from the HBase User mailing list archive at Nabble.com.
