drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yousuf <yousufud...@css.org.sa>
Subject Aggregation performance
Date Thu, 15 Dec 2016 11:27:51 GMT
Hello experts

As a POC project, I've built a drill cluster on 5 VMs , each with the 
following specs

32 GB ram

1 TB storage

16 Cores

Zookeeper quorum & apache drill installed on all 5 nodes. My storage 
engine is mongo which has 5 million docs. (Our daily collection is close 
to 2.5 million tweets)

*Problem*

My aggregation queries are slow, not fit for my realtime dashboard. 
sample query as follows took 12.+ secs,


SELECT count(*) as cnt, actor_preferred_username from tweets where 
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time 
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order 
by cnt desc limit 10;

Pls. note that,

*In Drill*

1) I've changed the conf/drill-env.sh on each node with following

|export DRILL_HEAP=${DRILL_HEAP:-"8G”} export 
DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|

2) changed few setting based on the docs 
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>

*In Mongo*

data is sharded on 3 servers on shardkey tweet_id which distributes data 
evenly on all shards and created compound index on fields used by the 
above drill query.


sample data

{
     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
     "rule_list" : [
         "A",
         "B",
         "C",
         "D13"
     ],
     "actor_friends_count" : 40,
     "klout_score" : 28,
     "actor_favorites_count" : 1697,
     "actor_preferred_username" : "_20_xxxx",
     "sentiment" : "neu",
     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
     "object_actor_followers_count" : 573,
     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
     "actor_id" : "id:twitter.com:xxxxxxxxxx",
     "actor_display_name" : "xxxxxx",
     "retweet_count" : 57,
     "hashtag_list" : [
         "c",
         "d",
         "E",
         "f",
         "VCVC",
         "XXCXCXC",
         "RRRR"
     ],
     "body" : "some tweet blah blah",
     "actor_followers_count" : 21,
     "actor_status_count" : 1756,
     "verb" : "share",
     "posted_time" : "2016-08-01T23:47:43.000Z",
     "object_actor_status_count" : 2796,
     "lang" : "ar",
     "object_actor_preferred_username" : "xxxxxxx",
     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
     "gender" : "male",
     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
     "favorites_count" : 0,
     "object_posted_time" : "2016-08-01T22:54:22.000Z",
     "object_actor_friends_count" : 69,
     "generator_display_name" : "Twitter Web Client",
     "object_actor_display_name" : "xxxxxxxxxxxx",
     "actor_listed_count" : 0
}



*Questions*

1) How to improve aggregation query performance?

2) Do I also need to do something in mongodb to enhance performance? (I 
mean optimize source)

3) Does Apache drill capable of handling/aggregating billions of 
documents in real-time?

Your early response is highly appreciated!

Thank you & Kind Regards

Yousuf




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message