hbase-user mailing list archives

From Dominik Hübner <cont...@dhuebner.com>
Subject Schema design for tweet analytics
Date Thu, 21 May 2015 12:45:36 GMT
Hey, I am new to HBase and am struggling a bit with how to design my schema.
To dive in, I gathered a dataset from the Twitter sample stream (roughly 40GB by now).

I want to answer the following queries:
- What were the trending hashtags/terms for a certain period of time (e.g. an hour) over
the last X days, with the ability to plot those as a timeline

Row key: <“trending”><TIMESTAMP_OF_THE_DAY>
Value: Set of most popular X tweets

All writes would be close in terms of data locality, but as it's just an aggregate with basically
1 write per hour it should be fine. At lookup time I will be able to scan through the days
and get a timeline of changing trending terms for a range of days.
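
To make the key layout concrete, here is a minimal Python sketch of how such a row key and the scan bounds for a range of days could be built. The 8-byte big-endian timestamp encoding, the truncation to an hour bucket, and the helper names (trending_row_key, scan_bounds) are assumptions for illustration only; pass bucket_seconds=DAY if one row per day is intended.

```python
import struct

PREFIX = b"trending"
HOUR = 3600
DAY = 24 * HOUR

def trending_row_key(epoch_seconds, bucket_seconds=HOUR):
    """Prefix followed by the bucket start as an 8-byte big-endian integer."""
    bucket_start = epoch_seconds - (epoch_seconds % bucket_seconds)
    return PREFIX + struct.pack(">Q", bucket_start)

def scan_bounds(first_day_start, num_days):
    """Start row (inclusive) and stop row (exclusive) covering num_days days."""
    start = trending_row_key(first_day_start)
    stop = trending_row_key(first_day_start + num_days * DAY)
    return start, stop

day0 = 1432166400  # 2015-05-21 00:00:00 UTC
start_row, stop_row = scan_bounds(day0, 3)
```

Because the timestamp is fixed-width and big-endian, the byte order of the keys matches chronological order, so a scan from start_row to stop_row returns the buckets as a timeline.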

- Get a tweet by its identifier

Row key: <TWEET_ID><“tweet”>
Column key: <TWEET_FIELD_NAME>
Value: value of the tweet feature like its text or author

Straightforward, 1 tweet per row for direct lookup

- Number of tweets for all countries
Row key: <“tweets_per_country”>
Column key: <COUNTRY_ID_NAME>
Value: the count

A single row to either get all countries or a particular country from a column.

- Which are the most (top N) similar tweets to a particular tweet
This one might be a bit more tricky. I wrote a MapReduce job to compute the top N most similar
tweets for a particular tweet with a similarity score. How can I map this to an HBase schema?
My guess would be to keep them in a similar schema as the actual tweets

<tweet_id>< 0 >     <= actual tweet
<tweet_id>< LONG_MAX-SIMILARITY_SCORE>     <= most similar tweets in descending order of similarity

But what should I store in those rows? The actual (then duplicated) tweets, or just their ids
with a second lookup later?
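
As a quick check of the ordering in the layout above, here is a minimal Python sketch. It assumes fixed-width 8-byte big-endian encoding for both the tweet id and the suffix, and positive integer-scaled similarity scores; the helper name similarity_row_key and the sample scores are hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1

def similarity_row_key(tweet_id, score=None):
    """<tweet_id><suffix>: suffix 0 for the tweet itself, LONG_MAX - score otherwise."""
    suffix = 0 if score is None else LONG_MAX - score
    return struct.pack(">Q", tweet_id) + struct.pack(">Q", suffix)

# Hypothetical integer-scaled scores for tweets similar to tweet 42
scores = {"a": 950, "b": 700, "c": 990}
keys = {similarity_row_key(42, s): name for name, s in scores.items()}
keys[similarity_row_key(42)] = "tweet itself"

# Sorting the raw bytes mimics the order in which a scan would return the rows
ordered = [keys[k] for k in sorted(keys)]
```

With this encoding the tweet itself sorts first, followed by the similar tweets from highest to lowest score. Note that two similar tweets with the same score would map to the same row key in this sketch, so some tie-breaker (for example, appending the similar tweet's id) would be needed.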

I would really appreciate it if someone could have a look at my ideas about schema/lookup and
tell me if I am doing something wrong here.
