In my env, first time took about 1.6s, second time only took 0.5s
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by
actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by
actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (0.505 seconds)
I am running 4 vm servers, heap 4GB and direct 8GB. But this query only using one fragment
, see attached drill profile.
As what Dechang suggested you can check profile and see the time mainly spent on which step.
Kathleen
On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>Yousuf,
>Which version of drill are you running?
>Can you share the profile of the query?
>
>Thanks,
>Dechang
>
>
>
>On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>
>> Hello experts
>>
>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>> following specs
>>
>> 32 GB ram
>>
>> 1 TB storage
>>
>> 16 Cores
>>
>> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
>> engine is mongo which has 5 million docs. (Our daily collection is close to
>> 2.5 million tweets)
>>
>> *Problem*
>>
>> My aggregation queries are slow, not fit for my realtime dashboard. sample
>> query as follows took 12.+ secs,
>>
>>
>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>> cnt desc limit 10;
>>
>> Pls. note that,
>>
>> *In Drill*
>>
>> 1) I've changed the conf/drill-env.sh on each node with following
>>
>> |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export
>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>
>> 2) changed few setting based on the docs <https://drill.apache.org/docs
>> /sort-based-and-hash-based-memory-constrained-operators/>
>>
>> *In Mongo*
>>
>> data is sharded on 3 servers on shardkey tweet_id which distributes data
>> evenly on all shards and created compound index on fields used by the above
>> drill query.
>>
>>
>> sample data
>>
>> {
>> "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>> "rule_list" : [
>> "A",
>> "B",
>> "C",
>> "D13"
>> ],
>> "actor_friends_count" : 40,
>> "klout_score" : 28,
>> "actor_favorites_count" : 1697,
>> "actor_preferred_username" : "_20_xxxx",
>> "sentiment" : "neu",
>> "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>> "object_actor_followers_count" : 573,
>> "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>> "actor_id" : "id:twitter.com:xxxxxxxxxx",
>> "actor_display_name" : "xxxxxx",
>> "retweet_count" : 57,
>> "hashtag_list" : [
>> "c",
>> "d",
>> "E",
>> "f",
>> "VCVC",
>> "XXCXCXC",
>> "RRRR"
>> ],
>> "body" : "some tweet blah blah",
>> "actor_followers_count" : 21,
>> "actor_status_count" : 1756,
>> "verb" : "share",
>> "posted_time" : "2016-08-01T23:47:43.000Z",
>> "object_actor_status_count" : 2796,
>> "lang" : "ar",
>> "object_actor_preferred_username" : "xxxxxxx",
>> "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>> "gender" : "male",
>> "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>> "favorites_count" : 0,
>> "object_posted_time" : "2016-08-01T22:54:22.000Z",
>> "object_actor_friends_count" : 69,
>> "generator_display_name" : "Twitter Web Client",
>> "object_actor_display_name" : "xxxxxxxxxxxx",
>> "actor_listed_count" : 0
>> }
>>
>>
>>
>> *Questions*
>>
>> 1) How to improve aggregation query performance?
>>
>> 2) Do I also need to do something in mongodb to enhance performance? (I
>> mean optimize source)
>>
>> 3) Does Apache drill capable of handling/aggregating billions of documents
>> in real-time?
>>
>> Your early response is highly appreciated!
>>
>> Thank you & Kind Regards
>>
>> Yousuf
>>
>>
>>
>>
|