hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: HBase schema for crawling
Date Sun, 05 Jul 2009 21:26:23 GMT
On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 <jcuzens@gmail.com> wrote:

> Hi All,
> I am developing a schema that will be used for crawling.

Out of interest, what crawler are you using?

> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above I want to schedule the url to be recrawled
> again at some date in the future. For example,
> // Simple pseudocode
> Map( row, rowResult )
> {
>      BatchUpdate update = new BatchUpdate( row.get() );
>      update.put( "contents:content", downloadPage( pageUrl ) );
>      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); //
> ???? No idea how to do this
> }

So you want to write a new row with a nextFetchDate prefix?
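Something like the following would build that key in plain Java. Note that `reverseDomain` is the helper name from your pseudocode, not an HBase API; this is just one way to sketch it (split on dots and reverse), so that pages from the same domain sort together:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FetchKey {
    // Reverse the domain components so related hosts sort together,
    // e.g. "news.example.com" -> "com.example.news".
    static String reverseDomain(String host) {
        List<String> parts = Arrays.asList(host.split("\\."));
        Collections.reverse(parts);  // in-place reverse of the fixed-size list
        return String.join(".", parts);
    }

    // Put the next-fetch date first in the row key so a scan in key
    // order visits the pages that are due soonest.
    static String fetchKey(String nextFetchDate, String host) {
        return nextFetchDate + ":" + reverseDomain(host);
    }

    public static void main(String[] args) {
        System.out.println(fetchKey("20090801", "news.example.com"));
        // prints 20090801:com.example.news
    }
}
```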

FYI, have you seen

(You might also find http://sourceforge.net/projects/publicsuffix/ useful.)

> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?


If you 'update' a row key, changing it, you will create a new row.

> 2) If I can't update a key what's the easiest way to copy a row and assign
> it a different key?

Get all of the row and then put it all back under the new key. (Billy
Pearson's suggestion would be the way to go, I'd say -- keep a column with a
timestamp in it, or use HBase versions. In TRUNK you can ask for data within
a time range, so running a scanner asking for rows > some timestamp should
be fast.)
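The get-then-put move looks roughly like this. A plain TreeMap stands in for the table here (the real code would be a Get for the old row and a Put under the new key against the client API), but the copy-then-delete pattern is the same:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CopyRow {
    // Stand-in for an HBase table: row key -> (column -> value).  A real
    // client would Get the old row and Put its cells under the new key;
    // the copy-then-delete pattern shown here is identical.
    static final TreeMap<String, Map<String, String>> table = new TreeMap<>();

    // "Rename" a row by reading all of its cells, writing them under the
    // new key, and deleting the original row.
    static void moveRow(String oldKey, String newKey) {
        Map<String, String> cells = table.remove(oldKey);
        if (cells != null) {
            table.put(newKey, new HashMap<>(cells));
        }
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("contents:content", "<html>...</html>");
        table.put("20090704:com.example.www", row);

        moveRow("20090704:com.example.www", "20090801:com.example.www");
        System.out.println(table.keySet());
        // prints [20090801:com.example.www]
    }
}
```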

> 3) What are the implications for updating/deleting from a table that you
> are
> currently scanning as part of the mapReduce job?

Scanners return the state of the row at the time they trip over it.

> It seems to me that I may want to do a map and a reduce and during the map
> phase I would record the rows that I fetched while in the reduce phase I
> would then take those rows, re-add them with the nextFetchDate and then
> remove the old row.

Do you have to remove old data?  You could let it age out, or let it be
removed when the number of versions of a page exceeds the configured
maximum.
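The max-versions behavior amounts to a bounded queue per cell: new versions push the oldest off the end once the cap is hit, so no explicit delete is needed. A toy sketch of that semantics (not the HBase implementation, just an illustration of what the configured maximum buys you):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class VersionedCell {
    private final int maxVersions;
    private final Deque<String> versions = new ArrayDeque<>();

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Newest version goes on the front; once the configured maximum is
    // exceeded, the oldest copy silently falls off the back.
    void put(String value) {
        versions.addFirst(value);
        if (versions.size() > maxVersions) {
            versions.removeLast();
        }
    }

    String latest()  { return versions.peekFirst(); }
    String oldest()  { return versions.peekLast(); }
    int count()      { return versions.size(); }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(3);
        cell.put("crawl-1");
        cell.put("crawl-2");
        cell.put("crawl-3");
        cell.put("crawl-4");  // evicts crawl-1
        System.out.println(cell.count() + " versions, oldest=" + cell.oldest());
        // prints 3 versions, oldest=crawl-2
    }
}
```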

> I would probably want to do this process in phases (e.g. scan only 5,000
> rows at a time) so that if my Mapper died for any particular reason I could
> address the issue and in the worst case only have lost the work that I had
> done on 5,000 rows.

You could keep an already-seen list in another HBase table and just rerun
the whole job if the first job failed.  Check the already-seen list before
crawling a page to see whether you'd crawled it recently or not.
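The already-seen check is just a lookup of the last-crawl time with a recency cutoff. Here a plain map stands in for the second HBase table (the names and the interval are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class SeenFilter {
    // Stand-in for the "already-seen" HBase table: url -> last crawl time
    // in millis.  A real job would Get/Put against a second table instead.
    private final Map<String, Long> lastCrawled = new HashMap<>();
    private final long minIntervalMillis;

    SeenFilter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Return true (and record the crawl) only if the url has not been
    // fetched within the minimum interval; otherwise skip it.  Rerunning
    // a failed job then revisits only the pages it actually missed.
    boolean shouldCrawl(String url, long now) {
        Long last = lastCrawled.get(url);
        if (last != null && now - last < minIntervalMillis) {
            return false;
        }
        lastCrawled.put(url, now);
        return true;
    }

    public static void main(String[] args) {
        SeenFilter filter = new SeenFilter(24L * 60 * 60 * 1000);  // one day
        System.out.println(filter.shouldCrawl("http://example.com/", 0L));
        // prints true
    }
}
```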

