hbase-user mailing list archives

From Duane Moore <duane.mo...@issinc.com>
Subject Re: idea about web page database
Date Sat, 24 Jul 2010 05:54:09 GMT

You might try looking at Nutch, which, as you may know, was the origin of Hadoop. There is
an active issue in the Nutch JIRA for adding HBase integration: https://issues.apache.org/jira/browse/NUTCH-650

With this change to Nutch, we now have an example usage of HBase which matches very closely
the table design suggested in the Google Bigtable paper.

I downloaded the code for the branch of Nutch integrating with HBase at http://svn.apache.org/repos/asf/nutch/branches/nutchbase/

You can search around in that branch, but the class org.apache.nutch.storage.WebPage seems
to have a basic structure for a “web page” table that may be what you’re looking for.
Nutch is using the Gora framework (http://github.com/enis/gora), which I was not familiar
with, but it appears to handle mapping the persistence/data object class to the underlying
HBase table when HBase is used.
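
On the row key question itself: as I read the Bigtable paper, each URL gets its own row,
and the key is the URL with the hostname components reversed (the paper's example stores
maps.google.com/index.html under com.google.maps/index.html), so pages from one site sort
next to each other. Anchor text is stored in the row of the page being linked to, in a
column named after the referring site. Here is a rough Python sketch of that idea. It is
not taken from Nutch or Gora; the helper name webtable_row_key, the example.org referrer,
and the dict standing in for a table are all made up for illustration.

```python
from urllib.parse import urlsplit


def webtable_row_key(url: str) -> str:
    """Return a row key with the hostname components reversed.

    Hypothetical helper for illustration; not part of Nutch, Gora, or HBase.
    """
    # urlsplit only recognises the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    key = reversed_host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


# One row per page: a.html and b.html get separate, adjacent row keys.
# A plain dict stands in for the table here (an assumption for the sketch).
rows = {
    webtable_row_key("www.cnn.com/a.html"): {
        "contents:": "<html>page a</html>",
        # Anchor text pointing at a.html, keyed by a made-up referring site.
        "anchor:example.org": "link text for a",
    },
    webtable_row_key("www.cnn.com/b.html"): {
        "contents:": "<html>page b</html>",
    },
}

for key in sorted(rows):
    print(key, sorted(rows[key]))
```

With one row per URL, an anchor column is never ambiguous, because it lives in the row
of the page it refers to.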

Best of luck,

From: 罗磊 <luoleicn@gmail.com>
Reply-To: "user@hbase.apache.org" <user@hbase.apache.org>
Date: Fri, 23 Jul 2010 20:27:11 -0700
To: "user@hbase.apache.org" <user@hbase.apache.org>
Subject: idea about web page database


I'm trying to design a database to store web pages for a search engine. Can you
guys give me some good advice on this?

I read the Bigtable paper. Google gives an example of a Webtable, but it confuses me a
little. Google shows how www.cnn.com is stored, but if I have two pages named
www.cnn.com/a.html and www.cnn.com/b.html, I don't know whether or not to store the
two pages in one row.

Google's paper says "In Webtable, we would use URLs as row keys, various aspects of web pages
as column names, and store the contents of the web pages in the contents". It seems Google
would use the domain name as the row key and store a.html and b.html as column names. But
that way the anchor design seems impossible: how can users tell which page, a.html or
b.html, an anchor text refers to?

Luo Lei
