drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kathleen Li <...@maprtech.com>
Subject Re: Aggregation performance
Date Mon, 19 Dec 2016 18:31:23 GMT
Hi Yousuf,


Yes in my env, I was set store.mongo.bson.record.reader = true.
With one record you provided, the same query works fine for me,  the error you got is the
schema changes related errors:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test)
group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc  limit 10;
+----------+------+
| hashtag  | cnt  |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |


With that, you might try:  alter session set store.mongo.all_text_mode=true and rerun the
query to see if you can pass the error 


Kathleen







On 12/18/16, 1:21 AM, "yousuf" <yousufuddin@css.org.sa> wrote:

>Hi Kathleen,
>
>Thanks for responding...
>
>I've noticed when  alter session set store.mongo.bson.record.reader = 
>true; the performance is improved. However, the other queries are 
>failing :(.
>
>
>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>+-------+------------------------------------------+
>|  ok   |                 summary                  |
>+-------+------------------------------------------+
>| true  | store.mongo.bson.record.reader updated.  |
>+-------+------------------------------------------+
>1 row selected (0.082 seconds)
>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
>tweets group by actor_preferred_username order by cnt desc limit 10;
>+--------+---------------------------+
>|  cnt   | actor_preferred_username  |
>+--------+---------------------------+
>| 10770  | mrnota53                  |
>| 6239   | cyberahsokatano           |
>| 4609   | abeerlilak                |
>| 4562   | DaeshAjel3                |
>| 4523   | Aster__Q                  |
>| 4275   | DaeshCrimes               |
>| 4182   | AfwfwefOfwefho            |
>| 3980   | HewarMaftuh2              |
>| 3698   | DaeshAjel                 |
>| 3661   | sarosh_iq                 |
>+--------+---------------------------+
>10 rows selected (4.64 seconds)
>
>
>*Failed queries when **store.mongo.bson.record.reader = true;*
>
>*0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select 
>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order 
>by cnt desc  limit 10;
>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
>VarChar type when you are using a ValueWriter of type UnionListWriter.
>
>Fragment 4:1
>
>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>
>   (java.lang.IllegalArgumentException) You tried to write a VarChar 
>type when you are using a ValueWriter of type UnionListWriter.
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>     org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>     org.apache.drill.exec.physical.impl.ScanBatch.next():178
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>     org.apache.drill.exec.record.AbstractRecordBatch.next():142
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():422
>     org.apache.hadoop.security.UserGroupInformation.doAs():1657
>org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>     java.lang.Thread.run():745 (state=,code=0)
>
>*
>
>
>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>> In my env, first time took about 1.6s, second time only took 0.5s
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username   order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (1.585 seconds)
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username
from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z'
and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username   order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (0.505 seconds)
>>
>>
>>
>>
>>
>>
>> I am running 4 vm servers, heap 4GB and direct 8GB. But this query only using one
fragment , see attached drill profile.
>>
>> As what Dechang suggested  you can check profile and see the time mainly spent on
which step.
>>
>> Kathleen
>>
>>
>> On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>>
>>> Yousuf,
>>> Which version of drill are you running?
>>> Can you share the profile of the query?
>>>
>>> Thanks,
>>> Dechang
>>>
>>>
>>>
>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>>>
>>>> Hello experts
>>>>
>>>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>>>> following specs
>>>>
>>>> 32 GB ram
>>>>
>>>> 1 TB storage
>>>>
>>>> 16 Cores
>>>>
>>>> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
>>>> engine is mongo which has 5 million docs. (Our daily collection is close
to
>>>> 2.5 million tweets)
>>>>
>>>> *Problem*
>>>>
>>>> My aggregation queries are slow, not fit for my realtime dashboard. sample
>>>> query as follows took 12.+ secs,
>>>>
>>>>
>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order
by
>>>> cnt desc limit 10;
>>>>
>>>> Pls. note that,
>>>>
>>>> *In Drill*
>>>>
>>>> 1) I've changed the conf/drill-env.sh on each node with following
>>>>
>>>> |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export
>>>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>>>
>>>> 2) changed few setting based on the docs <https://drill.apache.org/docs
>>>> /sort-based-and-hash-based-memory-constrained-operators/>
>>>>
>>>> *In Mongo*
>>>>
>>>> data is sharded on 3 servers on shardkey tweet_id which distributes data
>>>> evenly on all shards and created compound index on fields used by the above
>>>> drill query.
>>>>
>>>>
>>>> sample data
>>>>
>>>> {
>>>>      "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>      "rule_list" : [
>>>>          "A",
>>>>          "B",
>>>>          "C",
>>>>          "D13"
>>>>      ],
>>>>      "actor_friends_count" : 40,
>>>>      "klout_score" : 28,
>>>>      "actor_favorites_count" : 1697,
>>>>      "actor_preferred_username" : "_20_xxxx",
>>>>      "sentiment" : "neu",
>>>>      "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>      "object_actor_followers_count" : 573,
>>>>      "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>      "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>      "actor_display_name" : "xxxxxx",
>>>>      "retweet_count" : 57,
>>>>      "hashtag_list" : [
>>>>          "c",
>>>>          "d",
>>>>          "E",
>>>>          "f",
>>>>          "VCVC",
>>>>          "XXCXCXC",
>>>>          "RRRR"
>>>>      ],
>>>>      "body" : "some tweet blah blah",
>>>>      "actor_followers_count" : 21,
>>>>      "actor_status_count" : 1756,
>>>>      "verb" : "share",
>>>>      "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>      "object_actor_status_count" : 2796,
>>>>      "lang" : "ar",
>>>>      "object_actor_preferred_username" : "xxxxxxx",
>>>>      "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>      "gender" : "male",
>>>>      "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>      "favorites_count" : 0,
>>>>      "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>      "object_actor_friends_count" : 69,
>>>>      "generator_display_name" : "Twitter Web Client",
>>>>      "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>      "actor_listed_count" : 0
>>>> }
>>>>
>>>>
>>>>
>>>> *Questions*
>>>>
>>>> 1) How to improve aggregation query performance?
>>>>
>>>> 2) Do I also need to do something in mongodb to enhance performance? (I
>>>> mean optimize source)
>>>>
>>>> 3) Does Apache drill capable of handling/aggregating billions of documents
>>>> in real-time?
>>>>
>>>> Your early response is highly appreciated!
>>>>
>>>> Thank you & Kind Regards
>>>>
>>>> Yousuf
>>>>
>>>>
>>>>
>>>>
>


Mime
View raw message