drill-user mailing list archives

From Kathleen Li <...@maprtech.com>
Subject Re: Aggregation performance
Date Mon, 19 Dec 2016 19:03:01 GMT
It seems that in Drill 1.8 the parameter store.mongo.all_text_mode is already true by default.

Try ALTER SESSION SET `exec.enable_union_type` = true; if you still get the errors, you might
open a public JIRA with the detailed information.
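
To double-check how these options are currently set in your environment, you can query the
sys.options system table (a generic check; the exact columns vary a bit across Drill versions):

    SELECT name, status, bool_val
    FROM sys.options
    WHERE name IN ('store.mongo.all_text_mode', 'exec.enable_union_type');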


You might also try using CTAS to create Drill tables in Parquet format, then run your queries
against the Drill Parquet tables to see whether performance improves.
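
A minimal sketch of that approach (dfs.tmp and mongo.twitter.`tweets` are placeholders;
substitute your own workspace and storage-plugin paths):

    ALTER SESSION SET `store.format` = 'parquet';
    CREATE TABLE dfs.tmp.`tweets_parquet` AS
    SELECT * FROM mongo.twitter.`tweets`;

Queries would then scan columnar Parquet files instead of reading from MongoDB on every scan.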


Thanks,

Kathleen




On 12/19/16, 10:31 AM, "Kathleen Li" <kli@maprtech.com> wrote:

>Hi Yousuf,
>
>
>Yes, in my environment I had set store.mongo.bson.record.reader = true.
>With the one record you provided, the same query works fine for me; the error you got is a
>schema-change-related error:
>0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
>. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
>. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc  limit 10;
>+----------+------+
>| hashtag  | cnt  |
>+----------+------+
>| RRRR     | 1    |
>| VCVC     | 1    |
>| XXCXCXC  | 1    |
>| c        | 1    |
>| d        | 1    |
>| f        | 1    |
>| E        | 1    |
>+----------+------+
>
>
>With that, you might try: alter session set store.mongo.all_text_mode = true; and rerun
>the query to see if you can get past the error.
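>
>For reference, the exact statement is:
>
>    ALTER SESSION SET `store.mongo.all_text_mode` = true;
>
>Note that with all_text_mode on, every field is read as text, so numeric fields would need
>explicit CASTs in any query that does arithmetic on them (a general property of this option,
>independent of your data).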
>
>
>Kathleen
>
>
>
>
>
>
>
>On 12/18/16, 1:21 AM, "yousuf" <yousufuddin@css.org.sa> wrote:
>
>>Hi Kathleen,
>>
>>Thanks for responding...
>>
>>I've noticed that when I set store.mongo.bson.record.reader =
>>true, performance improves. However, the other queries are
>>failing :(.
>>
>>
>>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>>+-------+------------------------------------------+
>>|  ok   |                 summary                  |
>>+-------+------------------------------------------+
>>| true  | store.mongo.bson.record.reader updated.  |
>>+-------+------------------------------------------+
>>1 row selected (0.082 seconds)
>>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
>>tweets group by actor_preferred_username order by cnt desc limit 10;
>>+--------+---------------------------+
>>|  cnt   | actor_preferred_username  |
>>+--------+---------------------------+
>>| 10770  | mrnota53                  |
>>| 6239   | cyberahsokatano           |
>>| 4609   | abeerlilak                |
>>| 4562   | DaeshAjel3                |
>>| 4523   | Aster__Q                  |
>>| 4275   | DaeshCrimes               |
>>| 4182   | AfwfwefOfwefho            |
>>| 3980   | HewarMaftuh2              |
>>| 3698   | DaeshAjel                 |
>>| 3661   | sarosh_iq                 |
>>+--------+---------------------------+
>>10 rows selected (4.64 seconds)
>>
>>
>>Failed queries when store.mongo.bson.record.reader = true:
>>
>>0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order 
>>by cnt desc  limit 10;
>>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
>>VarChar type when you are using a ValueWriter of type UnionListWriter.
>>
>>Fragment 4:1
>>
>>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>>
>>  (java.lang.IllegalArgumentException) You tried to write a VarChar
>>type when you are using a ValueWriter of type UnionListWriter.
>>    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>>    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>>    org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>>    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>>    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>>    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>>    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>>    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>>    org.apache.drill.exec.physical.impl.ScanBatch.next():178
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>    org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>    org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():142
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>>    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>>    java.security.AccessController.doPrivileged():-2
>>    javax.security.auth.Subject.doAs():422
>>    org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>    org.apache.drill.common.SelfCleaningRunnable.run():38
>>    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>    java.lang.Thread.run():745 (state=,code=0)
>>
>>
>>
>>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>>> In my environment, the first run took about 1.6s; the second took only 0.5s:
>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>> +------+---------------------------+
>>> | cnt  | actor_preferred_username  |
>>> +------+---------------------------+
>>> | 1    | _20_xxxx                  |
>>> +------+---------------------------+
>>> 1 row selected (1.585 seconds)
>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>> +------+---------------------------+
>>> | cnt  | actor_preferred_username  |
>>> +------+---------------------------+
>>> | 1    | _20_xxxx                  |
>>> +------+---------------------------+
>>> 1 row selected (0.505 seconds)
>>>
>>>
>>>
>>>
>>>
>>>
>>> I am running 4 VM servers, heap 4GB and direct memory 8GB. But this query is only using one fragment; see the attached Drill profile.
>>>
>>> As Dechang suggested, you can check the query profile (available in the Drill Web UI, by default on port 8047) to see which step the time is mainly spent on.
>>>
>>> Kathleen
>>>
>>>
>>> On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>>>
>>>> Yousuf,
>>>> Which version of drill are you running?
>>>> Can you share the profile of the query?
>>>>
>>>> Thanks,
>>>> Dechang
>>>>
>>>>
>>>>
>>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>>>>
>>>>> Hello experts
>>>>>
>>>>> As a POC project, I've built a Drill cluster on 5 VMs, each with the
>>>>> following specs:
>>>>>
>>>>> 32 GB ram
>>>>>
>>>>> 1 TB storage
>>>>>
>>>>> 16 Cores
>>>>>
>>>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage
>>>>> engine is MongoDB, which holds 5 million docs. (Our daily collection is close to
>>>>> 2.5 million tweets.)
>>>>>
>>>>> *Problem*
>>>>>
>>>>> My aggregation queries are slow and not fit for my real-time dashboard. The sample
>>>>> query below took 12+ secs:
>>>>>
>>>>>
>>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>>> cnt desc limit 10;
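>>>>>
>>>>> (Side note: to see whether the posted_time filter is pushed down into the Mongo
>>>>> scan or applied by Drill after reading the documents, the plan can be inspected
>>>>> with EXPLAIN PLAN FOR, a standard Drill statement:
>>>>>
>>>>> EXPLAIN PLAN FOR SELECT count(*) as cnt, actor_preferred_username from tweets
>>>>> where posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>>> <= '2016-08-10T00.00.00.000Z' group by actor_preferred_username
>>>>> order by cnt desc limit 10; )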
>>>>>
>>>>> Please note that:
>>>>>
>>>>> *In Drill*
>>>>>
>>>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>>>
>>>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>>>
>>>>> 2) changed a few settings based on the docs
>>>>> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
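>>>>>
>>>>> (For example, one of the options that page covers, shown purely to illustrate the
>>>>> syntax; the 4GB value here is an arbitrary placeholder, not a recommendation:
>>>>>
>>>>> ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 4294967296; )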
>>>>>
>>>>> *In Mongo*
>>>>>
>>>>> Data is sharded across 3 servers on the shard key tweet_id, which distributes data
>>>>> evenly across all shards, and I created a compound index on the fields used by the
>>>>> above Drill query.
>>>>>
>>>>>
>>>>> sample data
>>>>>
>>>>> {
>>>>>      "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>>      "rule_list" : [
>>>>>          "A",
>>>>>          "B",
>>>>>          "C",
>>>>>          "D13"
>>>>>      ],
>>>>>      "actor_friends_count" : 40,
>>>>>      "klout_score" : 28,
>>>>>      "actor_favorites_count" : 1697,
>>>>>      "actor_preferred_username" : "_20_xxxx",
>>>>>      "sentiment" : "neu",
>>>>>      "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>>      "object_actor_followers_count" : 573,
>>>>>      "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>>      "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>>      "actor_display_name" : "xxxxxx",
>>>>>      "retweet_count" : 57,
>>>>>      "hashtag_list" : [
>>>>>          "c",
>>>>>          "d",
>>>>>          "E",
>>>>>          "f",
>>>>>          "VCVC",
>>>>>          "XXCXCXC",
>>>>>          "RRRR"
>>>>>      ],
>>>>>      "body" : "some tweet blah blah",
>>>>>      "actor_followers_count" : 21,
>>>>>      "actor_status_count" : 1756,
>>>>>      "verb" : "share",
>>>>>      "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>>      "object_actor_status_count" : 2796,
>>>>>      "lang" : "ar",
>>>>>      "object_actor_preferred_username" : "xxxxxxx",
>>>>>      "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>>      "gender" : "male",
>>>>>      "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>>      "favorites_count" : 0,
>>>>>      "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>>      "object_actor_friends_count" : 69,
>>>>>      "generator_display_name" : "Twitter Web Client",
>>>>>      "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>>      "actor_listed_count" : 0
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> *Questions*
>>>>>
>>>>> 1) How can I improve aggregation query performance?
>>>>>
>>>>> 2) Do I also need to do something in MongoDB to enhance performance? (I mean,
>>>>> optimize the source.)
>>>>>
>>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>>>> in real-time?
>>>>>
>>>>> Your early response is highly appreciated!
>>>>>
>>>>> Thank you & Kind Regards
>>>>>
>>>>> Yousuf
>>>>>
>>>>>
>>>>>
>>>>>

