hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: Filter omitting columns
Date Wed, 26 Mar 2008 04:58:32 GMT
Looks like I'll have to try using fully qualified column names to see if
that works.
From my experience with HBase, it seems that storing a value under the
bare column family name "col:" actually stores it in the column
"col:[Empty String]". I do not use qualifier names because, as in an
RDBMS, I am treating each column family as a single column.
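
To make the naming point above concrete, here is a minimal Java sketch of how a "family:qualifier" name splits at the first colon. It is not HBase code: the class ColumnNameSketch and its split method are hypothetical helpers, and "status:code" is a made-up qualified name used only for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of HBase: splits "family:qualifier" at the first colon. */
final class ColumnNameSketch {
    private ColumnNameSketch() {}

    /** Returns {family, qualifier}; the qualifier is "" when only "family:" is given. */
    static String[] split(String column) {
        int idx = column.indexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("missing ':' in column name: " + column);
        }
        return new String[] { column.substring(0, idx), column.substring(idx + 1) };
    }

    public static void main(String[] args) {
        // A bare family name yields an empty qualifier.
        System.out.println(Arrays.toString(split("status:")));
        // A fully qualified name yields a non-empty qualifier (assumed name).
        System.out.println(Arrays.toString(split("status:code")));
    }
}
```

Under this reading, "status:" and "status:code" are two different columns in the same family, which is why a bare family name may not match everywhere a qualified name would.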

I am posting my schema below. Do take a glance and see if you can
suggest some changes
to fit the column oriented design.

TABLE:       seed_list
DESCRIPTION: Used to store seed urls (both old and newly discovered)
SCHEMA:      Column families below

        {"referer_id:", "100"}, // Integer here is Max_Length
        {"url:","1500"},
        {"site:","500"},
        {"last_crawl_date:", "1000"},
        {"next_crawl_date:", "1000"},
        {"create_date:","100"},
        {"status:","100"},
        {"strike:", "100"},
        {"language:","150"},
        {"topic:","500"},
        {"depth:","100000"}

Common attributes are [max versions: 1,  compression: NONE, in memory:
false, block cache enabled: true, max length: 100, bloom filter: none]

TABLE:       web_content
DESCRIPTION: Used to store information retrieved after crawling a URL.
SCHEMA:      Column families below

        {"url", "1500"},
        {"site","500"},
        {"content_type","100"},
        {"title", "1000"},
        {"content", Integer.MAX_VALUE + ""},
        {"parsed_text",Integer.MAX_VALUE + ""},
        {"crawl_date", "1000"},
        {"last_modified_date","100"},
        {"http_headers","10000"},
        {"content_length","11"},
        {"outlinks_count","100000"}
     
Common attributes are [max versions: 1,  compression: BLOCK, in memory:
false, block cache enabled: true, max length: 100, bloom filter: none]

I appreciate your help

Thanks and Regards
-Ankur





-----Original Message-----
From: Clint Morgan [mailto:clint.a.m@gmail.com] 
Sent: Wednesday, March 26, 2008 4:58 AM
To: hbase-user@hadoop.apache.org
Subject: Re: Filter omitting columns

That's odd. When I use the RegExpRowFilter and it filters based on a
column's value, the whole row is dropped from the results (as you
expected).

So to answer your question, you should not have to cater to this
manually, but instead let the filter do it.

One thing in your examples that strikes me as odd is to identify columns
solely based on their family name. As Jim pointed out, this works in
scanner construction, but I'm not sure it would work everywhere else
(eg, inside filters). I would try to use fully qualified column names
(family:name)...
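
As a rough illustration of the row-level behaviour described above (a row whose filtered column does not match is dropped entirely, so its other columns never appear), here is an in-memory Java sketch. It does not use the HBase API; the class RowFilterSketch, the method scan, and the qualified names "url:address" and "status:code" are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of a row-level column-value filter. Not the HBase API. */
final class RowFilterSketch {
    private RowFilterSketch() {}

    /**
     * Returns the requested columns of every row whose filterColumn equals requiredValue.
     * Rows that fail the check, or lack the filter column, are skipped as a whole.
     */
    static Map<String, Map<String, String>> scan(
            Map<String, Map<String, String>> table,
            List<String> columns,
            String filterColumn,
            String requiredValue) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            // Row-level decision: a mismatch drops the whole row, not just one column.
            if (!requiredValue.equals(row.getValue().get(filterColumn))) {
                continue;
            }
            // Projection: copy only the requested columns that the row actually has.
            Map<String, String> projected = new LinkedHashMap<>();
            for (String column : columns) {
                String value = row.getValue().get(column);
                if (value != null) {
                    projected.put(column, value);
                }
            }
            result.put(row.getKey(), projected);
        }
        return result;
    }
}
```

Calling scan with the columns "url:address" and "status:code", the filter column "status:code", and the required value "0" would correspond to the SQL-style query quoted below: only rows with a matching status come back, each with both columns.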


On Fri, Mar 21, 2008 at 6:02 AM, Goel, Ankur <Ankur.Goel@corp.aol.com>
wrote:
> Clint,
>       Thanks! The patch works and I am able to get the remaining
> columns successfully.
>  One question though, it seems like the behaviour is different from
> what I expected.
>  I wanted something like
>
>  Select url:, status:, from mytable where status=0;
>
>  It is correctly able to filter out the status values <> 0 but the
> 'url:' column values are untouched !
>
>  Coming from an RDBMS background, I was hoping that the application of
> column filter would yield similar result! (fetching only those 'url:'
> column values for which 'status:' matched)
>
>  In a column oriented database like Hbase, should such scenario be
> catered manually ?
>  For e.g ignoring 'url:' values if 'status:' value got filtered ?
>
>  Is there a tutorial available that shows 2-3 tables in RDBMS style
> and then transforms them into column oriented design ?
>
>  Thanks
>  -Ankur
>
>  -----Original Message-----
>  From: Clint Morgan [mailto:clint.a.m@gmail.com]
>  Sent: Thursday, March 20, 2008 10:11 PM
>  To: hbase-user@hadoop.apache.org
>  Subject: Re: Filter omitting columns
>
>  Thats great, but it won't do what he wants, as he want only rows
> where the value for the status: column equals getBytes(1). Hence the
> filter.
>
>  On Thu, Mar 20, 2008 at 9:30 AM, Jim Kellerman <jim@powerset.com> wrote:
>  > If you want to get all the family members, you can just specify
>  > 'familyname:' as the column. This creates a wild-card scanner which
>  > will do what you want without filters.
>  >
>  >  ---
>  >  Jim Kellerman, Senior Engineer; Powerset
>  >
>  >  > -----Original Message-----
>  >  > From: Clint Morgan [mailto:clint.a.m@gmail.com]
>  >  > Sent: Thursday, March 20, 2008 9:19 AM
>  >  > To: hbase-user@hadoop.apache.org
>  >  > Subject: Re: Filter omitting columns
>  >  >
>  >  > I was having a similar problem as well. Though I've never
>  >  > used just the column families to specify the columns (eg
>  >  > always fully qualified col names like family:col). Maybe you
>  >  > can try my patch and see if it fixes your problem.
>  >  >
>  >  > https://issues.apache.org/jira/browse/HBASE-527
>  >  >
>  >  > Also you can give a null value for the row key regexp if you
>  >  > don't want to use it in RegExpRowFilter.
>  >  >
>  >  > -clint
>  >  >
>  >  > On Thu, Mar 20, 2008 at 7:23 AM, Goel, Ankur
>  >  > <Ankur.Goel@corp.aol.com> wrote:
>  >  > >
>  >  > >  Hi,
>  >  > >    I am trying to obtain a set of rows by obtaining a scanner on
>  >  > > Htable. I also specify the RowFilterCriteria like this.
>  >  > >
>  >  > >  /* Code Start */
>  >  > >  Map<Text, byte[]> columnFilter = new HashMap<Text, byte[]>();
>  >  > >  columnFilter.put(new Text("status:"), getBytes(1));
>  >  > >  RowFilterInterface rowFilter = new RegExpRowFilter(".*", columnFilter);
>  >  > >  HTable myTable = new HTable(conf, new Text("myTable"));
>  >  > >  Text[] columns = {new Text("url:"), new Text("status:")};
>  >  > >  myTable.obtainScanner(columns, HConstants.EMPTY_START_ROW, rowFilter);
>  >  > >  /* Code End */
>  >  > >
>  >  > >  When I scan the table, I only get 'status:' column family and its
>  >  > > values.
>  >  > >  The 'url:' family is not present.
>  >  > >
>  >  > >  In simple SQL the query translates to something like
>  >  > >
>  >  > >  SELECT url, status FROM mytable WHERE status=1;
>  >  > >
>  >  > >  What could be wrong ?
>  >  > >
>  >  > >  I eventually want to do something like this
>  >  > >
>  >  > >  SELECT url, status, date FROM mytable WHERE (status=1 or (status=2
>  >  > > and [today's date] > date));
>  >  > >
>  >  > >  I have used RowFilterSet with RowFilterSet.Operation options to
>  >  > > accomplish this but the omission of columns not on the filter column
>  >  > > list by the filter beats me.
>  >  > >
>  >  > >  Thanks
>  >  > >  -Ankur
>  >  > >