drill-user mailing list archives

From yousuf <yousufud...@css.org.sa>
Subject Re: Aggregation performance
Date Sun, 18 Dec 2016 09:21:19 GMT
Hi Kathleen,

Thanks for responding...

I've noticed that after running alter session set
store.mongo.bson.record.reader = true; the performance improves.
However, other queries are now failing :(.


0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
|  ok   |                 summary                  |
+-------+------------------------------------------+
| true  | store.mongo.bson.record.reader updated.  |
+-------+------------------------------------------+
1 row selected (0.082 seconds)
0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
|  cnt   | actor_preferred_username  |
+--------+---------------------------+
| 10770  | mrnota53                  |
| 6239   | cyberahsokatano           |
| 4609   | abeerlilak                |
| 4562   | DaeshAjel3                |
| 4523   | Aster__Q                  |
| 4275   | DaeshCrimes               |
| 4182   | AfwfwefOfwefho            |
| 3980   | HewarMaftuh2              |
| 3698   | DaeshAjel                 |
| 3661   | sarosh_iq                 |
+--------+---------------------------+
10 rows selected (4.64 seconds)


*Failed queries when store.mongo.bson.record.reader = true:*

0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
VarChar type when you are using a ValueWriter of type UnionListWriter.

Fragment 4:1

[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]

  (java.lang.IllegalArgumentException) You tried to write a VarChar
type when you are using a ValueWriter of type UnionListWriter.
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
    org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():178
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
    org.apache.drill.exec.record.AbstractRecordBatch.next():142
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
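
A possible workaround (just a sketch, untested; it assumes the flatten
query still succeeds with the BSON reader disabled, as it did before
this session change) is to toggle the option around the affected
query:

alter session set store.mongo.bson.record.reader = false;
SELECT hashtag, count(*) as cnt from (select
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
by cnt desc limit 10;
alter session set store.mongo.bson.record.reader = true;

That way the faster BSON reader stays on for the aggregation queries
that benefit from it.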



On 12/16/2016 01:55 AM, Kathleen Li wrote:
> In my env, first time took about 1.6s, second time only took 0.5s
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (1.585 seconds)
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (0.505 seconds)
>
>
>
>
>
>
> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only using one fragment; see the attached Drill profile.
>
> As Dechang suggested, check the profile to see which step the time is mainly spent on.
>
> Kathleen
>
>
> On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>
>> Yousuf,
>> Which version of drill are you running?
>> Can you share the profile of the query?
>>
>> Thanks,
>> Dechang
>>
>>
>>
>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>>
>>> Hello experts
>>>
>>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>>> following specs
>>>
>>> 32 GB ram
>>>
>>> 1 TB storage
>>>
>>> 16 Cores
>>>
>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My
>>> storage engine is MongoDB, which holds 5 million docs. (Our daily
>>> collection is close to 2.5 million tweets.)
>>>
>>> *Problem*
>>>
>>> My aggregation queries are too slow for my real-time dashboard. A
>>> sample query like the following took 12+ secs:
>>>
>>>
>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> cnt desc limit 10;
>>>
>>> Pls. note that,
>>>
>>> *In Drill*
>>>
>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>
>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>
>>> 2) changed a few settings based on the docs <https://drill.apache.org/docs
>>> /sort-based-and-hash-based-memory-constrained-operators/>
>>>
>>> *In Mongo*
>>>
>>> The data is sharded across 3 servers on the shard key tweet_id, which
>>> distributes it evenly over all shards, and I created a compound index
>>> on the fields used by the above Drill query.
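>>>
>>> For reference, the compound index was created along these lines in the
>>> mongo shell (a sketch; the exact field list and order shown here is an
>>> assumption based on the fields the query filters and groups on):
>>>
>>> db.tweets.createIndex({ posted_time: 1, actor_preferred_username: 1 })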
>>>
>>>
>>> sample data
>>>
>>> {
>>>      "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>      "rule_list" : [
>>>          "A",
>>>          "B",
>>>          "C",
>>>          "D13"
>>>      ],
>>>      "actor_friends_count" : 40,
>>>      "klout_score" : 28,
>>>      "actor_favorites_count" : 1697,
>>>      "actor_preferred_username" : "_20_xxxx",
>>>      "sentiment" : "neu",
>>>      "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>      "object_actor_followers_count" : 573,
>>>      "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>      "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>      "actor_display_name" : "xxxxxx",
>>>      "retweet_count" : 57,
>>>      "hashtag_list" : [
>>>          "c",
>>>          "d",
>>>          "E",
>>>          "f",
>>>          "VCVC",
>>>          "XXCXCXC",
>>>          "RRRR"
>>>      ],
>>>      "body" : "some tweet blah blah",
>>>      "actor_followers_count" : 21,
>>>      "actor_status_count" : 1756,
>>>      "verb" : "share",
>>>      "posted_time" : "2016-08-01T23:47:43.000Z",
>>>      "object_actor_status_count" : 2796,
>>>      "lang" : "ar",
>>>      "object_actor_preferred_username" : "xxxxxxx",
>>>      "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>      "gender" : "male",
>>>      "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>      "favorites_count" : 0,
>>>      "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>      "object_actor_friends_count" : 69,
>>>      "generator_display_name" : "Twitter Web Client",
>>>      "object_actor_display_name" : "xxxxxxxxxxxx",
>>>      "actor_listed_count" : 0
>>> }
>>>
>>>
>>>
>>> *Questions*
>>>
>>> 1) How can I improve aggregation query performance?
>>>
>>> 2) Do I also need to do something in MongoDB to enhance performance? (I
>>> mean, optimize the source?)
>>>
>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>> in real time?
>>>
>>> Your early response is highly appreciated!
>>>
>>> Thank you & Kind Regards
>>>
>>> Yousuf
>>>
>>>
>>>
>>>

