drill-user mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: Drill fails to query pcap files
Date Fri, 22 Feb 2019 14:51:01 GMT
Angelo
Can you please share an example?  That will help in fixing this issue.  

Sent from my iPhone

> On Feb 22, 2019, at 09:40, Angelo Mantellini <amantellini@gmail.com> wrote:
> 
> Hi,
> I tried the patch, but I see that the lines are always corrupted after the first exception.
> So, if my corrupted line is in the first row, the rest of the file is corrupted.
> 
> 
> 
> On 10/02/2019, 18:01, "Charles Givre" <cgivre@gmail.com> wrote:
> 
>    Actually, some good news here…  
>    I ran some test queries on the corrupted file and it seemed to work pretty well.
>    I didn’t get any exceptions!
> 
>    jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY src_ip ORDER BY packet_count DESC
>    . . . . . . .semicolon> LIMIT 10;
>    +-----------------------------------------+---------------+
>    |                 src_ip                  | packet_count  |
>    +-----------------------------------------+---------------+
>    | 150.249.255.161                         | 176           |
>    | 150.249.255.24                          | 28            |
>    | 131.38.3.15                             | 26            |
>    | 111.248.196.128                         | 25            |
>    | 202.13.230.242                          | 20            |
>    | 163.28.217.199                          | 19            |
>    | 27.18.36.151                            | 18            |
>    | 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed  | 17            |
>    | 203.70.190.81                           | 16            |
>    | 203.70.182.104                          | 13            |
>    +-----------------------------------------+---------------+
>    10 rows selected (0.944 seconds)
> 
> 
>    select src_ip, dst_ip from dfs.test.`testv1.pcap` WHERE is_corrupt=1 LIMIT 10;
>    +------------------+------------------+
>    |      src_ip      |      dst_ip      |
>    +------------------+------------------+
>    | 118.233.244.60   | 150.249.255.161  |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 172.40.96.180    | 131.39.133.22    |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 150.249.255.161  | 165.63.110.188   |
>    | 150.249.162.60   | 180.32.119.25    |
>    | 150.249.255.161  | 165.63.110.188   |
>    +------------------+------------------+
>    10 rows selected (1.031 seconds)
> 
> 
>    0: jdbc:drill:zk=local> SELECT  src_port , dst_port , src_mac_address , dst_mac_address
>    . . . . . . .semicolon> FROM dfs.test.`testv1.pcap`
>    . . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10;
>    +-----------+-----------+--------------------+--------------------+
>    | src_port  | dst_port  |  src_mac_address   |  dst_mac_address   |
>    +-----------+-----------+--------------------+--------------------+
>    | 57058     | 443       | 00:0C:DB:1F:72:41  | 88:E0:F3:7A:66:F0  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 443       | 55972     | 00:0C:DB:1F:72:41  | CC:4E:24:1F:4E:00  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 4016      | 7699      | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
>    +-----------+-----------+--------------------+--------------------+
>    10 rows selected (0.751 seconds)
> 
>    SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY getCountryName(src_ip) ORDER BY packet_count DESC LIMIT 10;
>    +----------------+---------------+
>    |    country     | packet_count  |
>    +----------------+---------------+
>    | Japan          | 269           |
>    | Taiwan         | 124           |
>    | United States  | 105           |
>    | Unknown        | 49            |
>    | China          | 26            |
>    | South Korea    | 8             |
>    | Australia      | 4             |
>    | Germany        | 3             |
>    | Hong Kong      | 2             |
>    | Italy          | 1             |
>    +----------------+---------------+
>    10 rows selected (1.519 seconds)
> 
>    SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP BY is_corrupt;
>    +-------------+---------------+
>    | is_corrupt  | packet_count  |
>    +-------------+---------------+
>    | 0           | 6408          |
>    | 1           | 592           |
>    +-------------+---------------+
>    2 rows selected (0.931 seconds)
> 
> 
>    This PCAP file worked well with Superset also. 
> 
> 
>> On Feb 10, 2019, at 10:59, Charles Givre <cgivre@gmail.com> wrote:
>> 
>> If I can get some more examples of corrupted files I’ll test more thoroughly.
>> Also, we’ll need to apply the same methodology to PCAP-NG, so I’ll need some examples
>> there as well.  My strategy is going to be to get as much data as possible out of the
>> corrupt packet.
>> — C
>> 
>> 
>> 
>>> On Feb 10, 2019, at 10:54, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> 
>>> I think that accessing fields in corrupted packets will also cause
>>> exceptions. But this is a great start. Conditionalizing field access on
>>> !is_corrupt() might be sufficient for the next step.
>>> 
>>> 
>>> 
>>>> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <cgivre@gmail.com> wrote:
>>>> 
>>>> All,
>>>> I posted the following PR for this issue:
>>>> https://github.com/apache/drill/pull/1637
>>>> 
>>>> Basically this PR does two things.
>>>> 1.  It creates a boolean column called is_corrupt and
>>>> 2.  If the PCAP file has a corrupt row, it marks that row as corrupt by
>>>> setting is_corrupt to true and keeps going
>>>> 
>>>> With the example from Giovanni, I was able to find 590 or so corrupt rows
>>>> out of 7000 in that PCAP file.  It was late and I don’t know if that was
>>>> what it was supposed to find, but it worked and I was able to query the file.
>>>> If you guys could send a few more examples, I’d like to test this on other
>>>> files to make sure it works with them.  We’re also going to have to do the
>>>> same thing for the PCAP-NG format I would assume.
>>>> 
>>>>> On Feb 10, 2019, at 03:07, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>> 
>>>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <bob@rud.is> wrote:
>>>>> 
>>>>>> ...
>>>>>> And, I did indeed find a few and am just waiting for a formal review so I
>>>>>> can submit them for the Drill dev & tests.
>>>>>> 
>>>>> 
>>>>> Awesome!
>>>> 
>>>> 
>> 
> 
> 
> 
> 

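The approach described in the quoted PR message (add an is_corrupt column, flag a bad row, keep reading) can be illustrated with a minimal Python sketch. This is not the code from the Drill PR. It assumes each packet has already been split out as an Ethernet frame carrying IPv4, and the names decode_frame and read_rows are hypothetical. Fields of a flagged row are left as None, in the spirit of Ted's suggestion to make field access conditional on the corrupt flag.

```python
import ipaddress

ETH_HEADER_LEN = 14       # dst MAC (6) + src MAC (6) + EtherType (2)
IPV4_MIN_HEADER_LEN = 20  # IPv4 header without options
SRC_IP_OFFSET = ETH_HEADER_LEN + 12  # source address field in the IPv4 header
DST_IP_OFFSET = ETH_HEADER_LEN + 16  # destination address field


def decode_frame(frame):
    """Decode one frame (assumed Ethernet + IPv4); raise ValueError if truncated."""
    if len(frame) < ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        raise ValueError("frame shorter than Ethernet + IPv4 headers")
    src = ipaddress.IPv4Address(frame[SRC_IP_OFFSET:SRC_IP_OFFSET + 4])
    dst = ipaddress.IPv4Address(frame[DST_IP_OFFSET:DST_IP_OFFSET + 4])
    return {"src_ip": str(src), "dst_ip": str(dst), "is_corrupt": False}


def read_rows(frames):
    """Yield one row per frame; a bad frame is flagged instead of aborting the scan."""
    for frame in frames:
        try:
            yield decode_frame(frame)
        except ValueError:
            # Mark the row as corrupt, leave its fields empty, and keep going.
            yield {"src_ip": None, "dst_ip": None, "is_corrupt": True}
```

Note that this sketch only covers frames that are already delimited. It does not address the case Angelo reports, where rows after the first bad record also come out corrupted.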