drill-user mailing list archives

From Dechang Gu <...@maprtech.com>
Subject Re: Aggregation performance
Date Mon, 19 Dec 2016 19:00:20 GMT
Hi Yousuf,
Thanks for the update and the profile. From the profile, it looks like most
of the time was spent in the following operator:

05-xx-03 UNKNOWN_OPERATOR 0.000s 0.000s 0.000s *1.350s* *4.903s* *7.817s*
0.000s 0.000s 0.000s 280KB 280KB

which is mainly mongoScan. Also, the min (1.35s) and max (7.817s) processing
times here differ widely, which may indicate skew in how the data is
distributed among the nodes. Take a look at the minor fragment stats to find
out how the scanned data is distributed.
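
To put a number on that spread, here is a minimal Python sketch that reads the process-time columns from the operator row above and computes two ratios. The function names are made up for illustration, and what counts as "too skewed" is left to your judgment; this is not a Drill tool.

```python
# Profile row for operator 05-xx-03, copied from the query profile.
ROW = ("05-xx-03 UNKNOWN_OPERATOR 0.000s 0.000s 0.000s "
       "1.350s 4.903s 7.817s 0.000s 0.000s 0.000s 280KB 280KB")


def process_times(row):
    """Return (min, avg, max) process time in seconds from one profile row."""
    fields = row.split()
    # fields[0] = operator id, fields[1] = type,
    # fields[2:5] = setup times, fields[5:8] = process times.
    return tuple(float(f.rstrip("s")) for f in fields[5:8])


def skew_ratios(row):
    """Compare the slowest minor fragment with the fastest and the average."""
    lo, avg, hi = process_times(row)
    return {
        # Guard against rows where the times are all 0.000s.
        "max_over_min": hi / lo if lo else None,
        "max_over_avg": hi / avg if avg else None,
    }


if __name__ == "__main__":
    print(skew_ratios(ROW))
```

A max/min ratio well above 1 only says the minor fragments did uneven amounts of work; the per-fragment record counts in the minor fragment stats are still needed to tell whether the data itself is unevenly spread.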

Another thing to look at is filter pushdown in mongoScan, as shown in the
physical plan. As an example, the physical plan from Kathleen's run on your
sample data is attached. Please compare it with yours (and share yours with
me if you want me to take a look).
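
For reference, when the pushdown takes effect the posted_time range from your WHERE clause should be evaluated by MongoDB rather than by Drill. Below is a minimal Python sketch of the MongoDB filter document that the range corresponds to. The helper name is hypothetical, and the exact way Drill prints the filter in its plan is not shown here.

```python
import json


def range_filter(field, start, end):
    """Build the MongoDB filter equivalent to: field >= start AND field <= end."""
    return {field: {"$gte": start, "$lte": end}}


# Bounds copied from the query; posted_time is stored as a string
# in the sample document, so the bounds are strings too.
flt = range_filter(
    "posted_time",
    "2016-08-01T00.00.00.000Z",
    "2016-08-10T00.00.00.000Z",
)

if __name__ == "__main__":
    print(json.dumps(flt, indent=2))
```

If the scan in your plan carries no filter like this, Drill is reading the whole collection and filtering afterwards, which would be consistent with the scan dominating the profile.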


Thanks,
Dechang

On Sat, Dec 17, 2016 at 11:53 PM, yousuf <yousufuddin@css.org.sa> wrote:

> Hi Dechang,
>
> Thanks for your reply,
>
> Drill version: 1.8.0
>
> Mongo:3.2.11
>
>
> *Here is the query profile. Kindly also check my system options, sent as
> an attachment, for possibly incorrect settings.*
>
> Operator  Type                      Setup min/avg/max     Process min/avg/max   Wait min/avg/max      Peak mem avg/max
> 00-xx-00  SCREEN                    0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  0.001s 0.001s 0.001s  52KB 52KB
> 00-xx-01  PROJECT                   0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  52KB 52KB
> 00-xx-02  SELECTION_VECTOR_REMOVER  0.001s 0.001s 0.001s  0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  52KB 52KB
> 00-xx-03  LIMIT                     0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  308KB 308KB
> 00-xx-04  MERGING_RECEIVER          0.000s 0.000s 0.000s  0.003s 0.003s 0.003s  7.829s 7.829s 7.829s  308KB 308KB
> 01-xx-00  SINGLE_SENDER             0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  0.001s 0.001s 0.002s  52KB 52KB
> 01-xx-01  SELECTION_VECTOR_REMOVER  0.001s 0.001s 0.001s  0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  84KB 84KB
> 01-xx-02  TOP_N_SORT                0.000s 0.000s 0.000s  0.004s 0.016s 0.050s  0.000s 0.000s 0.000s  5MB 18MB
> 01-xx-03  PROJECT                   0.000s 0.000s 0.001s  0.000s 0.001s 0.005s  0.000s 0.000s 0.000s  32KB 32KB
> 01-xx-04  UNORDERED_RECEIVER        0.000s 0.000s 0.000s  0.001s 0.007s 0.026s  7.759s 7.819s 7.841s  - -
> 02-xx-00  HASH_PARTITION_SENDER     0.000s 0.000s 0.000s  0.073s 0.081s 0.090s  0.039s 0.044s 0.051s  373KB 379KB
> 02-xx-01  UNORDERED_RECEIVER        0.000s 0.000s 0.000s  0.001s 0.001s 0.001s  7.748s 7.755s 7.761s  - -
> 03-xx-00  SINGLE_SENDER             0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  0.001s 0.004s 0.008s  5MB 5MB
> 03-xx-01  PROJECT                   0.001s 0.002s 0.005s  0.000s 0.000s 0.001s  0.000s 0.000s 0.000s  5MB 5MB
> 03-xx-02  PROJECT                   0.000s 0.000s 0.002s  0.000s 0.000s 0.000s  0.000s 0.000s 0.000s  4MB 4MB
> 03-xx-03  HASH_AGGREGATE            0.004s 0.009s 0.024s  0.054s 0.065s 0.077s  0.000s 0.000s 0.000s  6MB 6MB
> 03-xx-04  PROJECT                   0.000s 0.001s 0.006s  0.001s 0.002s 0.002s  0.000s 0.000s 0.000s  32KB 32KB
> 03-xx-05  UNORDERED_RECEIVER        0.000s 0.000s 0.000s  0.007s 0.009s 0.018s  7.646s 7.670s 7.695s  - -
> 04-xx-00  HASH_PARTITION_SENDER     0.000s 0.000s 0.000s  0.435s 0.488s 0.659s  0.088s 0.147s 0.216s  2MB 2MB
> 04-xx-01  UNORDERED_RECEIVER        0.000s 0.000s 0.000s  0.021s 0.025s 0.028s  6.014s 6.371s 6.854s  - -
> 05-xx-00  SINGLE_SENDER             0.000s 0.000s 0.000s  0.000s 0.002s 0.004s  0.001s 0.004s 0.050s  116KB 116KB
> 05-xx-01  PROJECT                   0.001s 0.002s 0.010s  0.001s 0.007s 0.012s  0.000s 0.000s 0.000s  116KB 116KB
> 05-xx-02  PROJECT                   0.000s 0.001s 0.004s  0.000s 0.001s 0.004s  0.000s 0.000s 0.000s  100KB 100KB
> 05-xx-03  UNKNOWN_OPERATOR          0.000s 0.000s 0.000s  1.350s 4.903s 7.817s  0.000s 0.000s 0.000s  280KB 280KB
>
> Thanks & regards
>
> On 12/16/2016 01:27 AM, Dechang Gu wrote:
>
> Yousuf,
> Which version of drill are you running?
> Can you share the profile of the query?
>
> Thanks,
> Dechang
>
>
>
> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>
>
> Hello experts
>
> As a POC project, I've built a Drill cluster on 5 VMs, each with the
> following specs:
>
> 32 GB ram
>
> 1 TB storage
>
> 16 Cores
>
> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
> engine is mongo which has 5 million docs. (Our daily collection is close to
> 2.5 million tweets)
>
> *Problem*
>
> My aggregation queries are slow and not fit for my realtime dashboard. The
> sample query below took 12+ secs:
>
>
> SELECT count(*) as cnt, actor_preferred_username from tweets where
> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> cnt desc limit 10;
>
> Pls. note that,
>
> *In Drill*
>
> 1) I've changed the conf/drill-env.sh on each node with following
>
> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>
> 2) changed a few settings based on the docs
> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
>
> *In Mongo*
>
> Data is sharded across 3 servers on the shard key tweet_id, which
> distributes data evenly across all shards. I also created a compound index
> on the fields used by the above Drill query.
>
>
> sample data
>
> {
>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>     "rule_list" : [
>         "A",
>         "B",
>         "C",
>         "D13"
>     ],
>     "actor_friends_count" : 40,
>     "klout_score" : 28,
>     "actor_favorites_count" : 1697,
>     "actor_preferred_username" : "_20_xxxx",
>     "sentiment" : "neu",
>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>     "object_actor_followers_count" : 573,
>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>     "actor_display_name" : "xxxxxx",
>     "retweet_count" : 57,
>     "hashtag_list" : [
>         "c",
>         "d",
>         "E",
>         "f",
>         "VCVC",
>         "XXCXCXC",
>         "RRRR"
>     ],
>     "body" : "some tweet blah blah",
>     "actor_followers_count" : 21,
>     "actor_status_count" : 1756,
>     "verb" : "share",
>     "posted_time" : "2016-08-01T23:47:43.000Z",
>     "object_actor_status_count" : 2796,
>     "lang" : "ar",
>     "object_actor_preferred_username" : "xxxxxxx",
>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>     "gender" : "male",
>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>     "favorites_count" : 0,
>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>     "object_actor_friends_count" : 69,
>     "generator_display_name" : "Twitter Web Client",
>     "object_actor_display_name" : "xxxxxxxxxxxx",
>     "actor_listed_count" : 0
> }
>
>
>
> *Questions*
>
> 1) How to improve aggregation query performance?
>
> 2) Do I also need to do something in mongodb to enhance performance? (I
> mean optimize source)
>
> 3) Is Apache Drill capable of handling/aggregating billions of documents
> in real time?
>
> Your early response is highly appreciated!
>
> Thank you & Kind Regards
>
> Yousuf
>
>
>
>
>
>
>
