Hi Yousuf,
Yes, in my env I had set store.mongo.bson.record.reader = true.
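(To double-check what is set, the sys.options system table shows current
values; for example:)

select * from sys.options where name like 'store.mongo%';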
With the one record you provided, the same query works fine for me; the error you got
looks like a schema-change-related error:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
+----------+------+
| hashtag | cnt |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |
+----------+------+
With that, you might try: alter session set store.mongo.all_text_mode=true and rerun the
query to see if you can get past the error.
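For example (a sketch; this is just your failing query rerun after toggling
the option):

0: jdbc:drill:> alter session set store.mongo.all_text_mode = true;
0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
. . . . . . .> flatten(hashtag_list) as hashtag from hashtags) group by
. . . . . . .> hashtag order by cnt desc limit 10;

Note that all_text_mode reads everything from Mongo as VarChar, which usually
avoids mixed-type/schema-change failures like the UnionListWriter one below,
at the cost of needing casts for numeric comparisons.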
Kathleen
On 12/18/16, 1:21 AM, "yousuf" <yousufuddin@css.org.sa> wrote:
>Hi Kathleen,
>
>Thanks for responding...
>
>I've noticed that with alter session set store.mongo.bson.record.reader =
>true; the performance improves. However, other queries are
>failing :(.
>
>
>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>+-------+------------------------------------------+
>| ok    | summary                                  |
>+-------+------------------------------------------+
>| true  | store.mongo.bson.record.reader updated.  |
>+-------+------------------------------------------+
>1 row selected (0.082 seconds)
>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
>tweets group by actor_preferred_username order by cnt desc limit 10;
>+--------+---------------------------+
>| cnt    | actor_preferred_username  |
>+--------+---------------------------+
>| 10770  | mrnota53                  |
>| 6239   | cyberahsokatano           |
>| 4609   | abeerlilak                |
>| 4562   | DaeshAjel3                |
>| 4523   | Aster__Q                  |
>| 4275   | DaeshCrimes               |
>| 4182   | AfwfwefOfwefho            |
>| 3980   | HewarMaftuh2              |
>| 3698   | DaeshAjel                 |
>| 3661   | sarosh_iq                 |
>+--------+---------------------------+
>10 rows selected (4.64 seconds)
>
>
>Failed queries when store.mongo.bson.record.reader = true:
>
>0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
>by cnt desc limit 10;
>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
>VarChar type when you are using a ValueWriter of type UnionListWriter.
>
>Fragment 4:1
>
>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>
> (java.lang.IllegalArgumentException) You tried to write a VarChar
>type when you are using a ValueWriter of type UnionListWriter.
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
> org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
> org.apache.drill.exec.physical.impl.ScanBatch.next():178
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
> org.apache.drill.exec.record.AbstractRecordBatch.next():142
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
>org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745 (state=,code=0)
>
>
>
>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>> In my env, the first run took about 1.6s, the second only 0.5s:
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt | actor_preferred_username |
>> +------+---------------------------+
>> | 1 | _20_xxxx |
>> +------+---------------------------+
>> 1 row selected (1.585 seconds)
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt | actor_preferred_username |
>> +------+---------------------------+
>> | 1 | _20_xxxx |
>> +------+---------------------------+
>> 1 row selected (0.505 seconds)
>>
>>
>>
>>
>>
>>
>> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only
>> using one fragment; see the attached drill profile.
>>
>> As Dechang suggested, you can check the profile and see which step the time
>> is mainly spent on.
>>
>> Kathleen
>>
>>
>> On 12/15/16, 2:27 PM, "Dechang Gu" <dgu@maprtech.com> wrote:
>>
>>> Yousuf,
>>> Which version of Drill are you running?
>>> Can you share the profile of the query?
>>>
>>> Thanks,
>>> Dechang
>>>
>>>
>>>
>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufuddin@css.org.sa> wrote:
>>>
>>>> Hello experts
>>>>
>>>> As a POC project, I've built a Drill cluster on 5 VMs, each with the
>>>> following specs:
>>>>
>>>> 32 GB ram
>>>>
>>>> 1 TB storage
>>>>
>>>> 16 Cores
>>>>
>>>> A ZooKeeper quorum & Apache Drill are installed on all 5 nodes. My storage
>>>> engine is MongoDB, which has 5 million docs. (Our daily collection is close
>>>> to 2.5 million tweets.)
>>>>
>>>> *Problem*
>>>>
>>>> My aggregation queries are slow, not fit for my real-time dashboard. A
>>>> sample query like the following took 12+ secs:
>>>>
>>>>
>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>>>> cnt desc limit 10;
>>>>
>>>> Please note that:
>>>>
>>>> *In Drill*
>>>>
>>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>>
>>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>>
>>>> 2) changed a few settings based on the docs
>>>> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
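>>>>
>>>> (Illustrative only; one example of the kind of option that page covers,
>>>> with an assumed value:)
>>>>
>>>> alter system set planner.memory.max_query_memory_per_node = 4294967296;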
>>>>
>>>> *In Mongo*
>>>>
>>>> Data is sharded across 3 servers on the shard key tweet_id, which distributes
>>>> data evenly across all shards, and I created a compound index on the fields
>>>> used by the above drill query.
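>>>>
>>>> (For reference, an illustrative shape of that compound index; the exact
>>>> field order here is an assumption based on the query above:)
>>>>
>>>> db.tweets.createIndex({ posted_time: 1, actor_preferred_username: 1 })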
>>>>
>>>>
>>>> Sample data:
>>>>
>>>> {
>>>> "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>> "rule_list" : [
>>>> "A",
>>>> "B",
>>>> "C",
>>>> "D13"
>>>> ],
>>>> "actor_friends_count" : 40,
>>>> "klout_score" : 28,
>>>> "actor_favorites_count" : 1697,
>>>> "actor_preferred_username" : "_20_xxxx",
>>>> "sentiment" : "neu",
>>>> "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>> "object_actor_followers_count" : 573,
>>>> "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>> "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>> "actor_display_name" : "xxxxxx",
>>>> "retweet_count" : 57,
>>>> "hashtag_list" : [
>>>> "c",
>>>> "d",
>>>> "E",
>>>> "f",
>>>> "VCVC",
>>>> "XXCXCXC",
>>>> "RRRR"
>>>> ],
>>>> "body" : "some tweet blah blah",
>>>> "actor_followers_count" : 21,
>>>> "actor_status_count" : 1756,
>>>> "verb" : "share",
>>>> "posted_time" : "2016-08-01T23:47:43.000Z",
>>>> "object_actor_status_count" : 2796,
>>>> "lang" : "ar",
>>>> "object_actor_preferred_username" : "xxxxxxx",
>>>> "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>> "gender" : "male",
>>>> "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>> "favorites_count" : 0,
>>>> "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>> "object_actor_friends_count" : 69,
>>>> "generator_display_name" : "Twitter Web Client",
>>>> "object_actor_display_name" : "xxxxxxxxxxxx",
>>>> "actor_listed_count" : 0
>>>> }
>>>>
>>>>
>>>>
>>>> *Questions*
>>>>
>>>> 1) How can I improve aggregation query performance?
>>>>
>>>> 2) Do I also need to do something in MongoDB to enhance performance?
>>>> (I mean, optimize the source?)
>>>>
>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>>> in real time?
>>>>
>>>> Your early response is highly appreciated!
>>>>
>>>> Thank you & Kind Regards
>>>>
>>>> Yousuf
>>>>
>>>>
>>>>
>>>>
>