From user-return-10172-apmail-drill-user-archive=drill.apache.org@drill.apache.org Fri Feb 22 14:40:19 2019 Return-Path: X-Original-To: apmail-drill-user-archive@www.apache.org Delivered-To: apmail-drill-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A55F918CFE for ; Fri, 22 Feb 2019 14:40:19 +0000 (UTC) Received: (qmail 58875 invoked by uid 500); 22 Feb 2019 14:40:19 -0000 Delivered-To: apmail-drill-user-archive@drill.apache.org Received: (qmail 58788 invoked by uid 500); 22 Feb 2019 14:40:19 -0000 Mailing-List: contact user-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@drill.apache.org Delivered-To: mailing list user@drill.apache.org Received: (qmail 58763 invoked by uid 99); 22 Feb 2019 14:40:18 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 22 Feb 2019 14:40:18 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 13DFFC5AF8 for ; Fri, 22 Feb 2019 14:40:18 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.6 X-Spam-Level: X-Spam-Status: No, score=0.6 tagged_above=-999 required=6.31 tests=[DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, KAM_ASCII_DIVIDERS=0.8, MIME_QP_LONG_LINE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 6EXgujv69Pog for ; Fri, 22 Feb 2019 14:40:16 +0000 (UTC) Received: from mail-ed1-f46.google.com (mail-ed1-f46.google.com [209.85.208.46]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 4899C5FAD7 for ; Fri, 22 Feb 2019 14:40:16 +0000 (UTC) Received: by mail-ed1-f46.google.com with SMTP id h58so1941149edb.5 for ; Fri, 22 Feb 2019 06:40:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=user-agent:date:subject:from:to:message-id:thread-topic:references :in-reply-to:mime-version:content-transfer-encoding; bh=y0Vx2ghLnIf6D+CWq7hc3BuDc317+H9349/xEeY2pBI=; b=JiiWhE425H4ZsZbh3D9HhqWnxZQUiby/jfqShnNaAijFbZyy7ZGtANjLkgjsAFj/Hu bNp8qSZ0HZYiDiGnJVbv+S2P2G6OD4vo/AcD3bcyeZtfikU1WnX1KAa2rdMEp6/8GlS4 L1o7MUpk9CeU1BmnAbzIl5z1/tnb8+C6pZLhiozMf9DrOpmVrBmv9LBl687G2R4LerOR iX+pr+2WVDeEkQUriUmx1EUgy+jgZ3A+0/mWLihd3qoRflxxWUUfvC3Lzor1TMBXGXfx PGKAwZAt0DMiw3yZWd2dEXQKYi/zOmHMsvPLlthIqKtSgLzHTDvng8+du5eCvGx6dE+N mKmg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:user-agent:date:subject:from:to:message-id :thread-topic:references:in-reply-to:mime-version :content-transfer-encoding; bh=y0Vx2ghLnIf6D+CWq7hc3BuDc317+H9349/xEeY2pBI=; b=H6umhQp2t68AJgN9Dlz8uYXjx1BQ8hyPPHBas3WeEwjpFOk06o1MoyoqI3WlEIxUuS KmrsTrQfnU5mNPR3qIfUT4ZqxybslxymCfCEKaj6FqwDV7GUNJnz1/UWqn3NUliwYM59 64UiHVm/FGEEBNK/NX1mKkdh7sYsjWEYM5ZjI1ZZ23pfwmM/lwQfYNK8pElItSEHwB05 BC9rkeAanqC+gm2D2ovRa2x0w+Qkr25qPM7w1COvPGbKUOH+comJlFTogZBNI84NqxX2 tRMgi+DtBbugGVS3QTLO1bXwM/oXep8YeZxKW+xCItJMAqO9/i1SDzzgbBj4qdF2ommw sSbw== X-Gm-Message-State: AHQUAuaIofZvfvyrxPzmAue4bqM3Y0ztVroRPvHkk9zp8/Wpj9ywRarW phq3k4f6nF6+Q4r5Wo9EaIkIXU4M68k= X-Google-Smtp-Source: AHgI3IbbSmYTQAMGTY9AfSW7wa2z/TbQeiQlL3Sj7/0Q60XUiR8epxkKhV7U1b1Q72dsHizkTl6hNw== X-Received: by 2002:a50:94c4:: with SMTP id t4mr3624764eda.28.1550846414705; Fri, 22 Feb 2019 06:40:14 -0800 (PST) Received: from [10.228.42.116] ([2001:420:44f0:1252:3910:620c:c1e:2080]) by smtp.gmail.com with ESMTPSA id 4sm431415edy.55.2019.02.22.06.40.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 22 Feb 2019 06:40:13 -0800 (PST) User-Agent: Microsoft-MacOutlook/10.10.7.190210 Date: Fri, 22 Feb 2019 15:40:16 +0100 Subject: Re: Drill fails to query pcap files From: Angelo Mantellini To: Message-ID: <0C60AEBC-1042-4891-B616-865EFE755B9A@gmail.com> Thread-Topic: Drill fails to query pcap files References: <1002D02D-FC78-456F-9F91-E1FA763AD2D6@gmail.com> <58C84879-B414-4FFC-91BA-0A43224C9FD8@gmail.com> <0DEBCA2A-EB3E-48B2-BF6D-1EB821880FBC@gmail.com> <977B672B-F3F6-464C-B231-59F7C777CB21@gmail.com> <9514D86E-5B72-4D7B-9B77-DCD4A99EC177@gmail.com> <9C723473-44F7-403A-91FA-A30D0700CDB6@gmail.com> <2114B4E5-918D-46D4-BB0C-2D0E3A73C60B@gmail.com> In-Reply-To: Mime-version: 1.0 Content-type: text/plain; charset="UTF-8" Content-transfer-encoding: quoted-printable Hi, I tried the patch, but I see that the lines are always corrupted after the = first exception. So, if my corrupted line is in the first row, the rest of the file is corru= pted. =EF=BB=BFOn 10/02/2019, 18:01, "Charles Givre" wrote: Actually, some good news here=E2=80=A6 =20 I ran some test queries on the corrupted file and it seemed to work pre= tty well. I didn=E2=80=99t get any exceptions! =20 jdbc:drill:zk=3Dlocal> select src_ip, COUNT(*) as packet_count from dfs.= test.`testv1.pcap`WHERE is_corrupt=3D1 GROUP BY src_ip ORDER BY packet_count D= ESC . . . . . . .semicolon> LIMIT 10; +-----------------------------------------+---------------+ | src_ip | packet_count | +-----------------------------------------+---------------+ | 150.249.255.161 | 176 | | 150.249.255.24 | 28 | | 131.38.3.15 | 26 | | 111.248.196.128 | 25 | | 202.13.230.242 | 20 | | 163.28.217.199 | 19 | | 27.18.36.151 | 18 | | 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed | 17 | | 203.70.190.81 | 16 | | 203.70.182.104 | 13 | +-----------------------------------------+---------------+ 10 rows selected (0.944 seconds) =20 =20 select src_ip, dst_ip from dfs.test.`testv1.pcap`WHERE is_corrupt=3D1 LIM= IT 10; +------------------+------------------+ | src_ip | dst_ip | +------------------+------------------+ | 118.233.244.60 | 150.249.255.161 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 172.40.96.180 | 131.39.133.22 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.255.161 | 165.63.110.188 | | 150.249.162.60 | 180.32.119.25 | | 150.249.255.161 | 165.63.110.188 | +------------------+------------------+ 10 rows selected (1.031 seconds) =20 =20 0: jdbc:drill:zk=3Dlocal> SELECT src_port , dst_port , src_mac_address ,= dst_mac_address . . . . . . .semicolon> FROM dfs.test.`testv1.pcap` . . . . . . .semicolon> WHERE is_corrupt =3D1 LIMIT 10; +-----------+-----------+--------------------+--------------------+ | src_port | dst_port | src_mac_address | dst_mac_address | +-----------+-----------+--------------------+--------------------+ | 57058 | 443 | 00:0C:DB:1F:72:41 | 88:E0:F3:7A:66:F0 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 443 | 55972 | 00:0C:DB:1F:72:41 | CC:4E:24:1F:4E:00 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 4016 | 7699 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | | 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 | +-----------+-----------+--------------------+--------------------+ 10 rows selected (0.751 seconds) =20 SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM= dfs.test.`testv1.pcap` WHERE is_corrupt=3D1 GROUP BY getCountryName(src_ip) = ORDER BY packet_count DESC LIMIT 10; +----------------+---------------+ | country | packet_count | +----------------+---------------+ | Japan | 269 | | Taiwan | 124 | | United States | 105 | | Unknown | 49 | | China | 26 | | South Korea | 8 | | Australia | 4 | | Germany | 3 | | Hong Kong | 2 | | Italy | 1 | +----------------+---------------+ 10 rows selected (1.519 seconds) =20 SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap`= GROUP BY is_corrupt; +-------------+---------------+ | is_corrupt | packet_count | +-------------+---------------+ | 0 | 6408 | | 1 | 592 | +-------------+---------------+ 2 rows selected (0.931 seconds) =20 =20 This PCAP file worked well with Superset also.=20 =20 =20 > On Feb 10, 2019, at 10:59, Charles Givre wrote: >=20 > If I can get some more examples of corrupted files I=E2=80=99ll test more t= horoughly. Also, we=E2=80=99ll need to apply the same methodology to PCAP-NG, so = I=E2=80=99ll need some examples there as well. My strategy is going to be get as = much data as possible out of the corrupt packet.=20 > =E2=80=94 C >=20 >=20 >=20 >> On Feb 10, 2019, at 10:54, Ted Dunning wrote= : >>=20 >> I think that accessing fields in corrupted packets will also cause >> exceptions. But this is a great start. Conditionalizing field access= on >> !is_corrupt() might be sufficient for the next step. >>=20 >>=20 >>=20 >> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre wro= te: >>=20 >>> All, >>> I posted the following PR for this issue: >>> https://github.com/apache/drill/pull/1637 < >>> https://github.com/apache/drill/pull/1637> >>>=20 >>> Basically this PR does two things. >>> 1. It creates a boolean column called is_corrupt and >>> 2. If the PCAP file has a corrupt row, it marks that row as corrup= t by >>> setting is_corrupt to true and keeps going >>>=20 >>> WIth the example from Giovanni, I was able to find 590 or so corrup= t rows >>> out of 7000 in that PCAP file. It was late and I don=E2=80=99t know if t= hat was >>> what ti was supposed to find, but it worked and was able to query t= hat. >>> If you guys could send a few more examples, I=E2=80=99d like to test this= on other >>> files to make sure it works with them. We=E2=80=99re also going to have = to do the >>> same thing for the PCAP-NG format I would assume. >>>=20 >>>> On Feb 10, 2019, at 03:07, Ted Dunning wro= te: >>>>=20 >>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis wrote: >>>>=20 >>>>> ... >>>>> And, I did indeed find a few and am just waiting for a formal rev= iew so >>> I >>>>> can submit them for the Drill dev & tests. >>>>>=20 >>>>=20 >>>> Awesome! >>>=20 >>>=20 >=20 =20 =20