From: Stefán Baxter
To: user, dev
Date: Fri, 13 Nov 2015 22:58:27 +0000
Subject: Re: Avro deserialization bug - 1.3-SNAPSHOT

So,

Could someone point me to the appropriate place in the Drill code to start
investigating this? (We would love to contribute, but getting up to speed
is a bit much.)

I realize that there are many good things happening and that v. 1.3 is
around the corner, but it seems that I incorrectly assumed that data
corruption issues would get a higher priority or that I would, at the very
least, get someone to confirm such a bug.

We are now impeded by this after having moved all our logging from JSON to
Avro to avoid the schema-related problems we have been running into with
the JSON reader (null interpreted as double, then failing when a string
eventually comes along).

- Stefan

On Wed, Nov 11, 2015 at 10:14 PM, Stefán Baxter wrote:

> Hi,
>
> Can someone please verify that this is in fact a bug so I can rule out our
> own mistakes?
>
> We have recently moved all our logging to Avro to compensate for schema
> differences in JSON that were causing various problems, and our latest
> release is now impeded by this.
> Alternatively, can someone please point me in the right direction if I
> were to try to fix this myself?
>
> Regards,
> -Stefán
>
> On Tue, Nov 10, 2015 at 2:41 PM, Stefán Baxter wrote:
>
>> Thank you Kamesh.
>>
>> I have created https://issues.apache.org/jira/browse/DRILL-4056 with the
>> description.
>> I will send a confidential test file to your private email.
>>
>> Regards,
>> -Stefan
>>
>> On Tue, Nov 10, 2015 at 2:30 PM, Kamesh wrote:
>>
>>> Hi Stefán,
>>> Could you please raise a Jira with a sample schema and sample input to
>>> reproduce it? I will look into this.
>>>
>>> On Tue, Nov 10, 2015 at 7:55 PM, Stefán Baxter <stefan@activitystream.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I have an Avro file that supports the following data/schema:
>>> >
>>> > {"field":"some", "classification":{"variant":"Gæst"}}
>>> >
>>> > When I select 10 rows from this file I get:
>>> >
>>> > +---------------------+
>>> > | EXPR$0              |
>>> > +---------------------+
>>> > | Gæst                |
>>> > | Voksen              |
>>> > | Voksen              |
>>> > | Invitation KIF KBH  |
>>> > | Invitation KIF KBH  |
>>> > | Ordinarie pris KBH  |
>>> > | Ordinarie pris KBH  |
>>> > | Biljetter 200 krBH  |
>>> > | Biljetter 200 krBH  |
>>> > | Biljetter 200 krBH  |
>>> > +---------------------+
>>> >
>>> > The bug is that the field values are incorrectly de-serialized: the
>>> > tail of the value from the previous row is retained if the subsequent
>>> > row is shorter.
>>> >
>>> > The SQL query:
>>> >
>>> > "select s.classification.variant variant from dfs. as s limit 10;"
>>> >
>>> > That way "Ordinarie pris" becomes "Ordinarie pris KBH" because the
>>> > previous row had the value "Invitation KIF KBH".
>>> >
>>> > Regards,
>>> > -Stefán
>>> >
>>>
>>> --
>>> Kamesh.
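
The symptom in the original report (a shorter value picking up the tail of the longer value from the previous row) is what one would expect if a byte buffer were reused across rows and then read in full instead of only up to the current value's length. The Python sketch below only illustrates that failure mode; it is not Drill code and does not confirm the root cause. `ReusedBuffer`, `load`, `read_buggy` and `read_correct` are hypothetical names, and the uncorrupted value "Biljetter 200 kr" is inferred from the output table rather than stated in the thread.

```python
class ReusedBuffer:
    """Hypothetical stand-in for a reader that reuses one byte buffer across rows."""

    def __init__(self):
        self.data = bytearray()
        self.length = 0  # number of valid bytes for the current row

    def load(self, text):
        encoded = text.encode("utf-8")
        # Grow the buffer when the new value is longer, but never shrink or clear it.
        if len(encoded) > len(self.data):
            self.data = bytearray(len(encoded))
        self.data[:len(encoded)] = encoded
        self.length = len(encoded)

    def read_buggy(self):
        # Decodes the whole backing buffer, including stale bytes from earlier rows.
        return self.data.decode("utf-8", errors="replace")

    def read_correct(self):
        # Decodes only the bytes that belong to the current row.
        return self.data[:self.length].decode("utf-8")


# Assumed true values; "Biljetter 200 kr" is inferred from the reported output.
rows = (
    ["Gæst"]
    + ["Voksen"] * 2
    + ["Invitation KIF KBH"] * 2
    + ["Ordinarie pris"] * 2
    + ["Biljetter 200 kr"] * 3
)

buf = ReusedBuffer()
buggy, correct = [], []
for value in rows:
    buf.load(value)
    buggy.append(buf.read_buggy())
    correct.append(buf.read_correct())

for b, c in zip(buggy, correct):
    print(f"{b!r:24} {c!r}")
```

If this is the mechanism, one place worth checking is wherever the Avro reader copies string values: in Avro's Java API, `Utf8.getBytes()` returns the backing array, which is only valid through `getByteLength()`, so a copy that uses the array's full length would pick up stale bytes in this way.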