hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: use hbase as distributed crawl's scheduler
Date Fri, 03 Jan 2014 14:04:03 GMT
Yes, sorry ;) Thanks for the correction.

Should have been:
"One table with the URL already crawled (80 millions), one table with the
URL
to crawle (2 billions) and one table with the URLs been processed. I'm not
running any SQL requests against my dataset but I have MR jobs doing many
different things. I have many other tables to help with the work on the
URLs."
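
A minimal sketch of what creating such a three-table layout could look like
with the HBase Java admin API of that era; the table and column-family names
here are illustrative assumptions, not details from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateCrawlTables {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Illustrative names: URLs already crawled, URLs still to crawl,
    // and URLs currently being processed.
    String[] names = {"crawled_urls", "urls_to_crawl", "urls_in_progress"};
    for (String name : names) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(name));
      desc.addFamily(new HColumnDescriptor("u")); // one small column family
      if (!admin.tableExists(name)) {
        admin.createTable(desc);
      }
    }
    admin.close();
  }
}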


2014/1/3 Ted Yu <yuzhihong@gmail.com>

> bq. One URL ...
>
> I guess you mean one table ...
>
> Cheers
>
> On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> wrote:
>
> > Interesting. This is exactly what I'm doing ;)
> >
> > I'm using 3 tables to achieve this.
> >
> > One table with the URL already crawled (80 millions), one URL with the URL
> > to crawle (2 billions) and one URL with the URLs been processed. I'm not
> > running any SQL requests against my dataset but I have MR jobs doing many
> > different things. I have many other tables to help with the work on the
> > URLs.
> >
> > I'm "salting" the keys using the URL hash so I can find them back very
> > quickly. There can be some collisions so I store also the URL itself on
> the
> > key. So very small scans returning 1 or something 2 rows allow me to
> > quickly find a row knowing the URL.
> >
> > I also have secondary index tables storing the CRCs of the pages, so I
> > can identify duplicate pages based on this value.
> >
> > And so on ;) I've been working on that for 2 years now. I might have been
> > able to use Nutch and others, but my goal was to learn and do that with a
> > distributed client on a single dataset...
> >
> > Enjoy.
> >
> > JM
> >
>
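
A hedged sketch of the row-key scheme described in the quoted message: prefix
the key with a hash of the URL, append the URL itself to resolve collisions,
and look a URL up with a short scan bounded to the hash prefix. The 8-byte MD5
prefix, the old-style HTable handle and all names are assumptions for
illustration, not details from the thread:

import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlKeyLookup {

  private static final int PREFIX_LEN = 8; // assumed length of the hash prefix

  // Row key = first 8 bytes of MD5(url) + the URL itself.
  // The hash spreads rows across regions; the appended URL resolves collisions.
  static byte[] rowKey(String url) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
    return Bytes.add(Arrays.copyOf(md5, PREFIX_LEN), Bytes.toBytes(url));
  }

  // Small scan over the rows sharing the hash prefix (normally 1, sometimes 2),
  // keeping the row whose key ends with the exact URL we are looking for.
  static Result findByUrl(HTable table, String url) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
    byte[] prefix = Arrays.copyOf(md5, PREFIX_LEN);
    Scan scan = new Scan();
    scan.setStartRow(prefix);
    scan.setFilter(new PrefixFilter(prefix));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] key = r.getRow();
        String storedUrl = Bytes.toString(Arrays.copyOfRange(key, PREFIX_LEN, key.length));
        if (url.equals(storedUrl)) {
          return r; // exact match
        }
      }
      return null; // URL not present
    } finally {
      scanner.close();
    }
  }
}

Since the full URL is embedded in the key, a plain Get on rowKey(url) would
work as well; the short prefix scan is the lookup described in the message.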

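In the same spirit, a rough sketch of a CRC-based secondary index for
duplicate detection: the checksum is the row key and each URL seen with that
checksum becomes a column qualifier, so one Get answers whether the same
content was already crawled under another URL. The layout and names are again
illustrative assumptions:

import java.util.zip.CRC32;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CrcIndex {

  private static final byte[] CF = Bytes.toBytes("u"); // assumed column family

  static long crcOf(byte[] pageContent) {
    CRC32 crc = new CRC32();
    crc.update(pageContent);
    return crc.getValue();
  }

  // Record the page under its CRC in the secondary index table.
  // Using the URL as the qualifier lets several URLs with the same CRC share one row.
  static void indexPage(HTable crcIndexTable, String url, byte[] pageContent) throws Exception {
    Put put = new Put(Bytes.toBytes(crcOf(pageContent)));
    // put.add() is the pre-1.0 client API; newer versions use addColumn().
    put.add(CF, Bytes.toBytes(url), Bytes.toBytes(System.currentTimeMillis()));
    crcIndexTable.put(put);
  }

  // A page is a likely duplicate if some other URL is already stored under the same CRC.
  static boolean isLikelyDuplicate(HTable crcIndexTable, String url, byte[] pageContent) throws Exception {
    Result existing = crcIndexTable.get(new Get(Bytes.toBytes(crcOf(pageContent))));
    if (existing.isEmpty()) {
      return false; // CRC never seen before
    }
    // Only a real duplicate if the CRC row holds a URL other than this one.
    return !(existing.size() == 1 && existing.containsColumn(CF, Bytes.toBytes(url)));
  }
}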