drill-user mailing list archives

From yousuf <yousufud...@css.org.sa>
Subject Re: Aggregation performance
Date Tue, 20 Dec 2016 08:36:40 GMT
Hi Kathleen,

A bug has been reported on JIRA: <https://issues.apache.org/jira/browse/DRILL-5139>

Thanks & Regards
Yousuf


On 12/19/2016 10:03 PM, Kathleen Li wrote:
> It seems that in Drill 1.8 the parameter store.mongo.all_text_mode is already true by default.
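>
> You can confirm what your build actually has with something like:
>
> select * from sys.options where name like 'store.mongo%';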
>
> Try ALTER SESSION SET `exec.enable_union_type` = true; if you still get the errors, you might open a public JIRA with the detailed information.
>
>
> You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if performance gets better.
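>
> For example, something along these lines (dfs.tmp and the table/db names below are just placeholders for your environment):
>
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.tmp.`tweets_parquet` AS SELECT * FROM mongo.mydb.`tweets`;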
>
>
> Thanks,
>
> Kathleen
>
>
>
>
> On 12/19/16, 10:31 AM, "Kathleen Li" <kli@maprtech.com> wrote:
>
>> Hi Yousuf,
>>
>>
>> Yes, in my env I had set store.mongo.bson.record.reader = true.
>> With the one record you provided, the same query works fine for me; the error you got is a schema-change-related error:
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
>> . . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
>> . . . . . . . . . . . . . . . . . . . . . . .> by cnt desc  limit 10;
>> +----------+------+
>> | hashtag  | cnt  |
>> +----------+------+
>> | RRRR     | 1    |
>> | VCVC     | 1    |
>> | XXCXCXC  | 1    |
>> | c        | 1    |
>> | d        | 1    |
>> | f        | 1    |
>> | E        | 1    |
>> +----------+------+
>>
>>
>> With that, you might try: alter session set store.mongo.all_text_mode=true and rerun the query to see if you can get past the error.
>>
>>
>> Kathleen
>>
>>
>> On 12/18/16, 1:21 AM, "yousuf" <yousufuddin@css.org.sa> wrote:
>>
>>> Hi Kathleen,
>>>
>>> Thanks for responding...
>>>
>>> I've noticed that when alter session set store.mongo.bson.record.reader = true;
>>> is set, the performance improves. However, other queries are failing :(.
>>>
>>>
>>> 0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>>> +-------+------------------------------------------+
>>> |  ok   |                 summary                  |
>>> +-------+------------------------------------------+
>>> | true  | store.mongo.bson.record.reader updated.  |
>>> +-------+------------------------------------------+
>>> 1 row selected (0.082 seconds)
>>> 0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
>>> tweets group by actor_preferred_username order by cnt desc limit 10;
>>> +--------+---------------------------+
>>> |  cnt   | actor_preferred_username  |
>>> +--------+---------------------------+
>>> | 10770  | mrnota53                  |
>>> | 6239   | cyberahsokatano           |
>>> | 4609   | abeerlilak                |
>>> | 4562   | DaeshAjel3                |
>>> | 4523   | Aster__Q                  |
>>> | 4275   | DaeshCrimes               |
>>> | 4182   | AfwfwefOfwefho            |
>>> | 3980   | HewarMaftuh2              |
>>> | 3698   | DaeshAjel                 |
>>> | 3661   | sarosh_iq                 |
>>> +--------+---------------------------+
>>> 10 rows selected (4.64 seconds)
>>>
>>>
>>> *Failed queries when store.mongo.bson.record.reader = true:*
>>>
>>> 0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>>> flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
>>> by cnt desc  limit 10;
>>> Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
>>> VarChar type when you are using a ValueWriter of type UnionListWriter.
>>>
>>> Fragment 4:1
>>>
>>> [Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>>>
>>>    (java.lang.IllegalArgumentException) You tried to write a VarChar
>>> type when you are using a ValueWriter of type UnionListWriter.
>>> org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>>> org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>>> org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>>>      org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>>> org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>>>      org.apache.drill.exec.physical.impl.ScanBatch.next():178
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():142
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>>      org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>>      org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>>      org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>>>      java.security.AccessController.doPrivileged():-2
>>>      javax.security.auth.Subject.doAs():422
>>>      org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>>      org.apache.drill.common.SelfCleaningRunnable.run():38
>>>      java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>>      java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>>      java.lang.Thread.run():745 (state=,code=0)
>>>
>>>
>>>
>>> On 12/16/2016 01:55 AM, Kathleen Li wrote:
>>>> In my env, the first run took about 1.6s, the second only 0.5s:
>>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>>> +------+---------------------------+
>>>> | cnt  | actor_preferred_username  |
>>>> +------+---------------------------+
>>>> | 1    | _20_xxxx                  |
>>>> +------+---------------------------+
>>>> 1 row selected (1.585 seconds)
>>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>>> +------+---------------------------+
>>>> | cnt  | actor_preferred_username  |
>>>> +------+---------------------------+
>>>> | 1    | _20_xxxx                  |
>>>> +------+---------------------------+
>>>> 1 row selected (0.505 seconds)
>>>>
>>>>
>>>> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only using one fragment; see the attached drill profile.
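>>>>
>>>> If the profile keeps showing a single fragment, one knob to experiment with (just a sketch, I have not verified the effect on a Mongo-backed scan) is lowering the row-count threshold at which Drill parallelizes:
>>>>
>>>> alter session set `planner.slice_target` = 10000;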
>>>>
>>>> As Dechang suggested, you can check the profile to see which step most of the time is spent on.
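>>>>
>>>> The text plan can also show whether the posted_time filter is pushed into the Mongo scan or applied inside Drill, e.g.:
>>>>
>>>> explain plan for SELECT count(*) as cnt, actor_preferred_username from tweets
>>>> where posted_time >= '2016-08-01T00.00.00.000Z' and posted_time <= '2016-08-10T00.00.00.000Z'
>>>> group by actor_preferred_username order by cnt desc limit 10;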
>>>>
>>>> Kathleen
>>>>
>>>>
>>>> On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>>>>
>>>>> Yousuf,
>>>>> Which version of Drill are you running?
>>>>> Can you share the profile of the query?
>>>>>
>>>>> Thanks,
>>>>> Dechang
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>>>>>
>>>>>> Hello experts
>>>>>>
>>>>>> As a POC project, I've built a Drill cluster on 5 VMs, each with the following specs:
>>>>>>
>>>>>> 32 GB RAM
>>>>>>
>>>>>> 1 TB storage
>>>>>>
>>>>>> 16 Cores
>>>>>>
>>>>>> A ZooKeeper quorum & Apache Drill are installed on all 5 nodes. My storage
>>>>>> engine is MongoDB, which has 5 million docs. (Our daily collection is close to
>>>>>> 2.5 million tweets.)
>>>>>>
>>>>>> *Problem*
>>>>>>
>>>>>> My aggregation queries are slow, not fit for my realtime dashboard. A sample
>>>>>> query follows that took 12+ secs:
>>>>>>
>>>>>>
>>>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>>>>>> cnt desc limit 10;
>>>>>>
>>>>>> Please note that:
>>>>>>
>>>>>> *In Drill*
>>>>>>
>>>>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>>>>
>>>>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>>>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>>>>
>>>>>> 2) Changed a few settings based on the docs
>>>>>> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
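>>>>>>
>>>>>> (for example, settings of this kind; the value below is illustrative, not the exact one I used)
>>>>>>
>>>>>> alter system set `planner.memory.max_query_memory_per_node` = 4294967296;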
>>>>>>
>>>>>> *In Mongo*
>>>>>>
>>>>>> Data is sharded on 3 servers on the shard key tweet_id, which distributes data
>>>>>> evenly across all shards, and a compound index was created on the fields used
>>>>>> by the above Drill query.
>>>>>>
>>>>>>
>>>>>> Sample data:
>>>>>>
>>>>>> {
>>>>>>       "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>>>       "rule_list" : [
>>>>>>           "A",
>>>>>>           "B",
>>>>>>           "C",
>>>>>>           "D13"
>>>>>>       ],
>>>>>>       "actor_friends_count" : 40,
>>>>>>       "klout_score" : 28,
>>>>>>       "actor_favorites_count" : 1697,
>>>>>>       "actor_preferred_username" : "_20_xxxx",
>>>>>>       "sentiment" : "neu",
>>>>>>       "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>>>       "object_actor_followers_count" : 573,
>>>>>>       "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>>>       "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>>>       "actor_display_name" : "xxxxxx",
>>>>>>       "retweet_count" : 57,
>>>>>>       "hashtag_list" : [
>>>>>>           "c",
>>>>>>           "d",
>>>>>>           "E",
>>>>>>           "f",
>>>>>>           "VCVC",
>>>>>>           "XXCXCXC",
>>>>>>           "RRRR"
>>>>>>       ],
>>>>>>       "body" : "some tweet blah blah",
>>>>>>       "actor_followers_count" : 21,
>>>>>>       "actor_status_count" : 1756,
>>>>>>       "verb" : "share",
>>>>>>       "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>>>       "object_actor_status_count" : 2796,
>>>>>>       "lang" : "ar",
>>>>>>       "object_actor_preferred_username" : "xxxxxxx",
>>>>>>       "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>>>       "gender" : "male",
>>>>>>       "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>>>       "favorites_count" : 0,
>>>>>>       "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>>>       "object_actor_friends_count" : 69,
>>>>>>       "generator_display_name" : "Twitter Web Client",
>>>>>>       "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>>>       "actor_listed_count" : 0
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Questions*
>>>>>>
>>>>>> 1) How to improve aggregation query performance?
>>>>>>
>>>>>> 2) Do I also need to do something in MongoDB to enhance performance? (I mean,
>>>>>> optimize the source?)
>>>>>>
>>>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents in
>>>>>> real time?
>>>>>>
>>>>>> Your early response is highly appreciated!
>>>>>>
>>>>>> Thank you & Kind Regards
>>>>>>
>>>>>> Yousuf
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>

