drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dechang Gu <...@maprtech.com>
Subject Re: Aggregation performance
Date Thu, 15 Dec 2016 22:27:03 GMT
Yousuf,
Which version of drill are you running?
Can you share the profile of the query?

Thanks,
Dechang



On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:

> Hello experts
>
> As a POC project, I've built a drill cluster on 5 VMs , each with the
> following specs
>
> 32 GB ram
>
> 1 TB storage
>
> 16 Cores
>
> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
> engine is mongo which has 5 million docs. (Our daily collection is close to
> 2.5 million tweets)
>
> *Problem*
>
> My aggregation queries are slow, not fit for my realtime dashboard. sample
> query as follows took 12.+ secs,
>
>
> SELECT count(*) as cnt, actor_preferred_username from tweets where
> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> cnt desc limit 10;
>
> Pls. note that,
>
> *In Drill*
>
> 1) I've changed the conf/drill-env.sh on each node with following
>
> |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export
> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>
> 2) changed few setting based on the docs <https://drill.apache.org/docs
> /sort-based-and-hash-based-memory-constrained-operators/>
>
> *In Mongo*
>
> data is sharded on 3 servers on shardkey tweet_id which distributes data
> evenly on all shards and created compound index on fields used by the above
> drill query.
>
>
> sample data
>
> {
>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>     "rule_list" : [
>         "A",
>         "B",
>         "C",
>         "D13"
>     ],
>     "actor_friends_count" : 40,
>     "klout_score" : 28,
>     "actor_favorites_count" : 1697,
>     "actor_preferred_username" : "_20_xxxx",
>     "sentiment" : "neu",
>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>     "object_actor_followers_count" : 573,
>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>     "actor_display_name" : "xxxxxx",
>     "retweet_count" : 57,
>     "hashtag_list" : [
>         "c",
>         "d",
>         "E",
>         "f",
>         "VCVC",
>         "XXCXCXC",
>         "RRRR"
>     ],
>     "body" : "some tweet blah blah",
>     "actor_followers_count" : 21,
>     "actor_status_count" : 1756,
>     "verb" : "share",
>     "posted_time" : "2016-08-01T23:47:43.000Z",
>     "object_actor_status_count" : 2796,
>     "lang" : "ar",
>     "object_actor_preferred_username" : "xxxxxxx",
>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>     "gender" : "male",
>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>     "favorites_count" : 0,
>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>     "object_actor_friends_count" : 69,
>     "generator_display_name" : "Twitter Web Client",
>     "object_actor_display_name" : "xxxxxxxxxxxx",
>     "actor_listed_count" : 0
> }
>
>
>
> *Questions*
>
> 1) How to improve aggregation query performance?
>
> 2) Do I also need to do something in mongodb to enhance performance? (I
> mean optimize source)
>
> 3) Does Apache drill capable of handling/aggregating billions of documents
> in real-time?
>
> Your early response is highly appreciated!
>
> Thank you & Kind Regards
>
> Yousuf
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message