hbase-user mailing list archives

From TuX RaceR <tuxrace...@gmail.com>
Subject Re: ported lucandra: lucene index on HBase
Date Thu, 22 Apr 2010 10:25:42 GMT
Thank you Karthik. I did not know about your project; I have now joined
its mailing list ;)
Since I started this thread here on the HBase list, I will continue here.

Karthik K wrote:
> The HBase RPC is being modified , to append a docid to an already existing
> field/term , to the compressed encoding stored in the family/ col. name, to
> achieve the locality of reference and scale with the number of documents.
Wow, that sounds very interesting ;) Is there an HBase JIRA for this, or
is it only available in your code?

> Once the documents go in the index, for all practical purpose, the
> manipulation is done across numbers , assigned to the user specified id
> space.
> More often than not, the only field that is stored is the "id" , that is
> retrieved after all the computation, that can then be used to index into
> another store to retrieve other details of the search schema. Except for
> limited cases (sorting / faceting etc.) , using the tf-idf representation
> for storing the 'field's in document goes against the format being used and
> is advised to be used sparingly.
Looking at http://wiki.github.com/akkumar/hbasene/hbase-tf-idf-index-formats

I see you have:

    Term Frequency

The TF-IDF (Term Frequency / Inverse Document Frequency) representation is
as follows.

row            fm.termFrequency
               <lucene_int_1>                                  <lucene_int_2>                    <lucene_int_3>
<field/term>   <termPositions_of_field/term_in_lucene_int_1>   <termPositions_in_lucene_int_2>   <termPositions_in_lucene_int_3>
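
As I read that layout, each row is keyed by a field/term, and each column under fm.termFrequency is a Lucene document number whose value is the list of term positions in that document. A minimal in-memory sketch of that reading follows. The names (`index`, `docs_for_term`, `term_frequency`) and the sample data are hypothetical and only illustrate the shape of a row; this is not hbasene code and it does not talk to HBase.

```python
from typing import Dict, List

# Hypothetical model: row key (field/term) -> {lucene doc number -> term positions}
TermRow = Dict[int, List[int]]

index: Dict[str, TermRow] = {
    # Made-up sample row: the term "hbase" in the field "content"
    "content/hbase": {1: [0, 7], 2: [3], 3: [12, 40, 41]},
}


def docs_for_term(idx: Dict[str, TermRow], field_term: str) -> List[int]:
    """Return the sorted doc numbers stored as columns of the row."""
    return sorted(idx.get(field_term, {}))


def term_frequency(idx: Dict[str, TermRow], field_term: str, doc_id: int) -> int:
    """Term frequency is the number of positions stored for that doc."""
    return len(idx.get(field_term, {}).get(doc_id, []))
```

In this model a row gains one column per document containing the term, so a row's width grows with the term's document frequency.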

My question is: what happens if the term is very common? For example, if
you have 1 billion (10^9) documents in your database and a term that
occurs in one document out of every 100 (i.e. a term contained in
10 million = 10^7 documents), then retrieving this row means a huge
network payload. How do you deal with that kind of scenario?
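
To put a rough number on that concern, here is a back-of-the-envelope sketch. The byte sizes are my assumptions, not figures from hbasene: 4 bytes per document number, one 4-byte position per document, no compression, and no HBase key/value overhead. The function names are made up for the illustration.

```python
def matching_docs(total_docs: int, one_in: int) -> int:
    """Number of documents containing the term, if it occurs in 1 of every `one_in`."""
    return total_docs // one_in


def row_payload_bytes(
    total_docs: int,
    one_in: int,
    bytes_per_doc_id: int = 4,      # assumption: 32-bit doc numbers
    positions_per_doc: int = 1,     # assumption: term occurs once per document
    bytes_per_position: int = 4,    # assumption: 32-bit positions
) -> int:
    """Uncompressed size of one term row: one column per matching document."""
    per_doc = bytes_per_doc_id + positions_per_doc * bytes_per_position
    return matching_docs(total_docs, one_in) * per_doc
```

With `total_docs = 10**9` and `one_in = 100` this gives 10^7 matching documents and 80,000,000 bytes, i.e. about 80 MB for a single row under these assumptions. The real figure depends on the encoding and on how many positions each document stores.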

