hbase-user mailing list archives

From "Jim R. Wilson" <wilson.ji...@gmail.com>
Subject Re: HBase Sample Schemas
Date Fri, 28 Mar 2008 15:02:09 GMT
I see. I'm trying to figure out whether my use case is valid; any help is
appreciated :)

I was thinking about using a bloom-filtered column for user-based
blacklisting.  So I'd have a table, and I'd concatenate, say, the user_id
with a URL or domain that the user is blacklisting and store that in a
column.  Then, when I want to test whether a user has blacklisted a URL,
I'd concatenate the user_id and domain as before, then check that entry
against the bloom filter.

I guess in this case I'd just make the concatenation the primary key
in a special table just for blacklistification?
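
To make the idea concrete, here is a minimal sketch in plain Java. It is a toy
and not HBase API: the class names (BlacklistSketch, BloomFilter,
BlacklistTable), the "/" separator in the key, and the filter size and hash
count are all assumptions made for illustration. It only shows the general
pattern of building a composite key and letting a bloom filter rule out keys
that are definitely absent before the real store is consulted.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy illustration only: none of these classes are HBase API.
class BlacklistSketch {

    // Composite key: user_id and domain joined by '/' (the separator is an assumption).
    static String rowKey(String userId, String domain) {
        return userId + "/" + domain;
    }

    // Minimal Bloom filter: hashCount bit positions per key via double hashing.
    static final class BloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        BloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        // FNV-1a style hash; the seed is mixed into the offset basis.
        private static int fnv1a(byte[] data, int seed) {
            int h = 0x811C9DC5 ^ seed;
            for (byte b : data) {
                h ^= (b & 0xFF);
                h *= 0x01000193;
            }
            return h;
        }

        // i-th bit position for this key, always in [0, size).
        private int index(byte[] data, int i) {
            int h1 = fnv1a(data, 0);
            int h2 = fnv1a(data, 0x9E3779B9);
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < hashCount; i++) {
                bits.set(index(data, i));
            }
        }

        // false: key was definitely never added; true: key was possibly added.
        boolean mightContain(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(index(data, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    // Stand-in for a table keyed on user_id/domain, with a filter in front of it.
    static final class BlacklistTable {
        private final Set<String> rows = new HashSet<>();
        private final BloomFilter filter;
        int storeLookups = 0; // how often the "real" store was consulted

        BlacklistTable(int filterBits, int hashCount) {
            this.filter = new BloomFilter(filterBits, hashCount);
        }

        void blacklist(String userId, String domain) {
            String key = rowKey(userId, domain);
            rows.add(key);
            filter.add(key);
        }

        boolean isBlacklisted(String userId, String domain) {
            String key = rowKey(userId, domain);
            if (!filter.mightContain(key)) {
                return false; // filter says "definitely absent": skip the store
            }
            storeLookups++;
            return rows.contains(key); // confirm, since the filter can give false positives
        }
    }

    public static void main(String[] args) {
        BlacklistTable table = new BlacklistTable(1 << 16, 3);
        table.blacklist("user42", "spam.example.com");
        System.out.println(table.isBlacklisted("user42", "spam.example.com")); // true
        System.out.println(table.isBlacklisted("user42", "news.example.org")); // false
    }
}
```

Note that the caller only ever asks isBlacklisted(); the filter is consulted
inside the lookup rather than queried on its own, and a positive answer from
the filter is always confirmed against the stored rows.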

Thanks for helping me understand this stuff.  I have a solid grasp of
what bloom filters are at the data-structure level; I'm trying to
understand how they can be used/queried within the context of HBase.
Thanks in advance!

-- Jim



On Fri, Mar 28, 2008 at 9:42 AM, Bryan Duxbury <bryan@rapleaf.com> wrote:
> Bloom filters in HBase, as they are currently designed, aren't a
>  construct that users have to interact with directly. All retrieval
>  operations take advantage of a bloom filter if it is configured.
>
>  -Bryan
>
>
>
>  On Mar 28, 2008, at 6:28 AM, Jim R. Wilson wrote:
>
>  > Thanks Ankur!
>  >
>  > Those are very helpful - finding example schemas has been a really
>  > sore point for me as well in trying to learn all this.
>  >
>  > I was wondering if you had an example that defined a bloom filter for
>  > a column, and an example on how to query a bloom filter once it's set
>  > up (shell example or rest example if possible).
>  >
>  > Thanks again!
>  >
>  > -- Jim R. Wilson (jimbojw)
>  >
>  > On Fri, Mar 28, 2008 at 1:33 AM, Goel, Ankur
>  > <Ankur.Goel@corp.aol.com> wrote:
>  >>
>  >>> ....by adding a column.
>  >>  Sorry, I meant colon ":"
>  >>
>  >>
>  >>  -----Original Message-----
>  >>  From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>  >>  Sent: Friday, March 28, 2008 12:01 PM
>  >>  To: hbase-user@hadoop.apache.org
>  >>
>  >>
>  >> Subject: RE: HBase Sample Schemas
>  >>
>  >>  The tables below are RDBMS tables with column names simply
>  >> converted to
>  >>  column families by adding a column.
>  >>  I'd like to share ideas on how best these tables can be modified (or
>  >>  merged ??) to take advantage of column oriented design.
>  >>
>  >>  -----Original Message-----
>  >>  From: Edward J. Yoon [mailto:edward@udanax.org]
>  >>  Sent: Friday, March 28, 2008 11:48 AM
>  >>  To: hbase-user@hadoop.apache.org
>  >>  Subject: Re: HBase Sample Schemas
>  >>
>  >>  I don't think this is a good example.
>  >>
>  >>  Find the difference between the two physical schemas for the same
>  >>  logical data model: relationship tables in an RDBMS versus a list
>  >>  of column qualifiers in BigTable.
>  >>
>  >>  On Fri, Mar 28, 2008 at 2:28 PM, Goel, Ankur
>  >>  <Ankur.Goel@corp.aol.com> wrote:
>  >>> Hi Bryan,
>  >>>         Here is the sample schema I have (looks closer to RDBMS, I know)
>  >>>
>  >>> TABLE:           seed_list
>  >>>
>  >>> DESCRIPTION: Used to store seed urls (both old and newly discovered).
>  >>>             Initially populated with some seed URLs. The crawl controller
>  >>>             picks up the seeds from this table that have status=0 (Not Visited)
>  >>>             or status=2 (Visited, but ready for re-crawl) and feeds these seeds
>  >>>             in batch to different crawl engines that it knows about.
>  >>>
>  >>> SCHEMA:      Column families below
>  >>>
>  >>>        {"referer_id:", "100"}, // Integer here is Max_Length
>  >>>        {"url:", "1500"},
>  >>>        {"site:", "500"},
>  >>>        {"last_crawl_date:", "1000"},
>  >>>        {"next_crawl_date:", "1000"},
>  >>>        {"create_date:", "100"},
>  >>>        {"status:", "100"},
>  >>>        {"strike:", "100"},
>  >>>        {"language:", "150"},
>  >>>        {"topic:", "500"},
>  >>>        {"depth:", "100000"}
>  >>>
>  >>> Common attributes are [max versions: 1, compression: NONE, in memory: false,
>  >>> block cache enabled: true, max length: 100, bloom filter: none]
>  >>>
>  >>>
>  >>> TABLE:   web_content
>  >>>
>  >>> DESCRIPTION: Used to store information retrieved after crawling a URL.
>  >>>             Each crawl engine provides information about the URL it crawled.
>  >>>             This information is then stored in this table depending upon
>  >>>             the profile settings (what should be stored?)
>  >>> SCHEMA:  Column families below
>  >>>
>  >>>          {"url:", "1500"},
>  >>>          {"site:", "500"},
>  >>>          {"content_type:", "100"},
>  >>>          {"title:", "1000"},
>  >>>          {"content:", Integer.MAX_VALUE + ""},
>  >>>          {"parsed_text:", Integer.MAX_VALUE + ""},
>  >>>          {"crawl_date:", "1000"},
>  >>>          {"last_modified_date:", "100"},
>  >>>          {"http_headers:", "10000"},
>  >>>          {"content_length:", "11"},
>  >>>          {"outlinks_count:", "100000"}
>  >>>
>  >>> Common attributes are [max versions: 1, compression: BLOCK, in memory: false,
>  >>> block cache enabled: true, max length: 100, bloom filter: none]
>  >>>
>  >>> Please feel free to suggest modifications/enhancements for column
>  >>> oriented Design.
>  >>>
>  >>> Thanks
>  >>> -Ankur
>  >>>
>  >>>
>  >>> -----Original Message-----
>  >>> From: Bryan Duxbury [mailto:bryan@rapleaf.com]
>  >>> Sent: Friday, March 28, 2008 10:33 AM
>  >>> To: hbase-user@hadoop.apache.org
>  >>> Subject: HBase Sample Schemas
>  >>>
>  >>> All,
>  >>>
>  >>> One of the more common types of questions we get from people new to
>  >>> HBase is about the differences in schema between HBase and
>  >>> relational databases. So that we can generate some good examples of
>  >>> RDBMS schemas and their counterparts as they might be represented in
>  >>> HBase, could you guys post some small (1-5 entities) schemas that you
>  >>> might be interested in using and a few sentences about how you'd like
>  >>> to consume them? We can then discuss possible options and see how
>  >>> things might look. This will also help Stack, Jim, and myself to
>  >>> notice interesting access patterns we might want to support.
>  >>>
>  >>> Thanks in advance,
>  >>>
>  >>> Bryan
>  >>>
>  >>
>  >>
>  >>
>  >>  --
>  >>  B. Regards,
>  >>  Edward J. Yoon
>  >>
>
>
