Subject: Re: Aggregation performance
From: yousuf
To: user@drill.apache.org
Date: Sun, 18 Dec 2016 12:21:19 +0300
Hi Kathleen,

Thanks for responding... I've noticed that when I set store.mongo.bson.record.reader = true the performance is improved. However, the other queries are failing :(.

0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
|  ok   |                 summary                  |
+-------+------------------------------------------+
| true  | store.mongo.bson.record.reader updated.  |
+-------+------------------------------------------+
1 row selected (0.082 seconds)

0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
|  cnt   | actor_preferred_username  |
+--------+---------------------------+
| 10770  | mrnota53                  |
| 6239   | cyberahsokatano           |
| 4609   | abeerlilak                |
| 4562   | DaeshAjel3                |
| 4523   | Aster__Q                  |
| 4275   | DaeshCrimes               |
| 4182   | AfwfwefOfwefho            |
| 3980   | HewarMaftuh2              |
| 3698   | DaeshAjel                 |
| 3661   | sarosh_iq                 |
+--------+---------------------------+
10 rows selected (4.64 seconds)

Failed queries when store.mongo.bson.record.reader = true:

0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select flatten(hashtag_list) as hashtag from hashtags) group by hashtag order by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.

Fragment 4:1

[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]

(java.lang.IllegalArgumentException) You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():178
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
org.apache.drill.exec.record.AbstractRecordBatch.next():142
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745
(state=,code=0)

On 12/16/2016 01:55 AM, Kathleen Li wrote:
> In my env, the first time took about 1.6s, the second time only took 0.5s:
>
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (1.585 seconds)
>
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (0.505 seconds)
>
> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only using one fragment; see the attached drill profile.
>
> As Dechang suggested, you can check the profile and see which step the time is mainly spent on.
>
> Kathleen
>
> On 12/15/16, 2:27 PM, "Dechang Gu" wrote:
>
>> Yousuf,
>> Which version of Drill are you running?
>> Can you share the profile of the query?
>>
>> Thanks,
>> Dechang
>>
>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf wrote:
>>
>>> Hello experts
>>>
>>> As a POC project, I've built a Drill cluster on 5 VMs, each with the
>>> following specs:
>>>
>>> 32 GB RAM
>>> 1 TB storage
>>> 16 cores
>>>
>>> Zookeeper quorum & Apache Drill are installed on all 5 nodes. My storage
>>> engine is MongoDB, which has 5 million docs. (Our daily collection is
>>> close to 2.5 million tweets.)
>>>
>>> Problem
>>>
>>> My aggregation queries are slow, not fit for my realtime dashboard. A
>>> sample query as follows took 12+ secs:
>>>
>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>>> cnt desc limit 10;
>>>
>>> Pls. note that,
>>>
>>> In Drill
>>>
>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>
>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>
>>> 2) changed a few settings based on the docs
>>> </sort-based-and-hash-based-memory-constrained-operators/>
>>>
>>> In Mongo
>>>
>>> Data is sharded on 3 servers on shard key tweet_id, which distributes
>>> data evenly across all shards, and I created a compound index on the
>>> fields used by the above Drill query.
>>>
>>> sample data
>>>
>>> {
>>>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>     "rule_list" : [ "A", "B", "C", "D13" ],
>>>     "actor_friends_count" : 40,
>>>     "klout_score" : 28,
>>>     "actor_favorites_count" : 1697,
>>>     "actor_preferred_username" : "_20_xxxx",
>>>     "sentiment" : "neu",
>>>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>     "object_actor_followers_count" : 573,
>>>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>     "actor_display_name" : "xxxxxx",
>>>     "retweet_count" : 57,
>>>     "hashtag_list" : [ "c", "d", "E", "f", "VCVC", "XXCXCXC", "RRRR" ],
>>>     "body" : "some tweet blah blah",
>>>     "actor_followers_count" : 21,
>>>     "actor_status_count" : 1756,
>>>     "verb" : "share",
>>>     "posted_time" : "2016-08-01T23:47:43.000Z",
>>>     "object_actor_status_count" : 2796,
>>>     "lang" : "ar",
>>>     "object_actor_preferred_username" : "xxxxxxx",
>>>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>     "gender" : "male",
>>>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>     "favorites_count" : 0,
>>>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>     "object_actor_friends_count" : 69,
>>>     "generator_display_name" : "Twitter Web Client",
>>>     "object_actor_display_name" : "xxxxxxxxxxxx",
>>>     "actor_listed_count" : 0
>>> }
>>>
>>> Questions
>>>
>>> 1) How can I improve aggregation query performance?
>>>
>>> 2) Do I also need to do something in MongoDB to enhance performance?
>>> (I mean, optimize the source.)
>>>
>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>> in real time?
>>>
>>> Your early response is highly appreciated!
>>>
>>> Thank you & Kind Regards
>>>
>>> Yousuf
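[Editor's note] For readers following the thread, the semantics of the two Drill queries under discussion — the top-10 group-by over actor_preferred_username, and the flatten(hashtag_list) group-by that triggers the UnionListWriter error — can be sketched in plain Python over toy documents shaped like the sample data above. The docs and counts here are made up for illustration; this shows what the queries compute, not how Drill executes them.

```python
from collections import Counter
from itertools import chain

# Toy documents mirroring two fields of the sample tweet data in the thread.
docs = [
    {"actor_preferred_username": "_20_xxxx", "hashtag_list": ["c", "d"]},
    {"actor_preferred_username": "_20_xxxx", "hashtag_list": ["d", "E"]},
    {"actor_preferred_username": "other",    "hashtag_list": ["d"]},
]

# SELECT count(*) as cnt, actor_preferred_username FROM tweets
# GROUP BY actor_preferred_username ORDER BY cnt DESC LIMIT 10;
top_users = Counter(d["actor_preferred_username"] for d in docs).most_common(10)

# SELECT hashtag, count(*) as cnt FROM
#   (SELECT flatten(hashtag_list) AS hashtag FROM hashtags)
# GROUP BY hashtag ORDER BY cnt DESC LIMIT 10;
# flatten() unnests each array element into its own row, which chain.from_iterable mimics.
top_tags = Counter(chain.from_iterable(d["hashtag_list"] for d in docs)).most_common(10)

print(top_users)  # [('_20_xxxx', 2), ('other', 1)]
print(top_tags)   # [('d', 3), ('c', 1), ('E', 1)]
```

The failing query is exactly the flatten step: the reader hits a VarChar inside a list it is treating as a union, so the unnesting never happens in Drill even though the logical operation is this simple.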