drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kathleen Li <...@maprtech.com>
Subject Re: Aggregation performance
Date Thu, 15 Dec 2016 22:55:24 GMT
In my env, first time took about 1.6s, second time only took 0.5s
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by
actor_preferred_username   order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by
actor_preferred_username   order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (0.505 seconds)






I am running 4 vm servers, heap 4GB and direct 8GB. But this query only using one fragment
, see attached drill profile.

As what Dechang suggested  you can check profile and see the time mainly spent on which step.

Kathleen


On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:

>Yousuf,
>Which version of drill are you running?
>Can you share the profile of the query?
>
>Thanks,
>Dechang
>
>
>
>On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>
>> Hello experts
>>
>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>> following specs
>>
>> 32 GB ram
>>
>> 1 TB storage
>>
>> 16 Cores
>>
>> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
>> engine is mongo which has 5 million docs. (Our daily collection is close to
>> 2.5 million tweets)
>>
>> *Problem*
>>
>> My aggregation queries are slow, not fit for my realtime dashboard. sample
>> query as follows took 12.+ secs,
>>
>>
>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>> cnt desc limit 10;
>>
>> Pls. note that,
>>
>> *In Drill*
>>
>> 1) I've changed the conf/drill-env.sh on each node with following
>>
>> |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export
>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>
>> 2) changed few setting based on the docs <https://drill.apache.org/docs
>> /sort-based-and-hash-based-memory-constrained-operators/>
>>
>> *In Mongo*
>>
>> data is sharded on 3 servers on shardkey tweet_id which distributes data
>> evenly on all shards and created compound index on fields used by the above
>> drill query.
>>
>>
>> sample data
>>
>> {
>>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>     "rule_list" : [
>>         "A",
>>         "B",
>>         "C",
>>         "D13"
>>     ],
>>     "actor_friends_count" : 40,
>>     "klout_score" : 28,
>>     "actor_favorites_count" : 1697,
>>     "actor_preferred_username" : "_20_xxxx",
>>     "sentiment" : "neu",
>>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>     "object_actor_followers_count" : 573,
>>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>     "actor_display_name" : "xxxxxx",
>>     "retweet_count" : 57,
>>     "hashtag_list" : [
>>         "c",
>>         "d",
>>         "E",
>>         "f",
>>         "VCVC",
>>         "XXCXCXC",
>>         "RRRR"
>>     ],
>>     "body" : "some tweet blah blah",
>>     "actor_followers_count" : 21,
>>     "actor_status_count" : 1756,
>>     "verb" : "share",
>>     "posted_time" : "2016-08-01T23:47:43.000Z",
>>     "object_actor_status_count" : 2796,
>>     "lang" : "ar",
>>     "object_actor_preferred_username" : "xxxxxxx",
>>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>     "gender" : "male",
>>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>     "favorites_count" : 0,
>>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>     "object_actor_friends_count" : 69,
>>     "generator_display_name" : "Twitter Web Client",
>>     "object_actor_display_name" : "xxxxxxxxxxxx",
>>     "actor_listed_count" : 0
>> }
>>
>>
>>
>> *Questions*
>>
>> 1) How to improve aggregation query performance?
>>
>> 2) Do I also need to do something in mongodb to enhance performance? (I
>> mean optimize source)
>>
>> 3) Does Apache drill capable of handling/aggregating billions of documents
>> in real-time?
>>
>> Your early response is highly appreciated!
>>
>> Thank you & Kind Regards
>>
>> Yousuf
>>
>>
>>
>>

Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message