drill-user mailing list archives

From Angelo Mantellini <amantell...@gmail.com>
Subject Re: Drill fails to query pcap files
Date Fri, 22 Feb 2019 14:40:16 GMT
Hi,
I tried the patch, but I see that all lines after the first exception always come out corrupted.
So, if the corrupted line is the first row, the rest of the file comes out corrupted as well.
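
A possible reason for this, stated as an assumption and not checked against Drill's PCAP reader: each pcap record starts with a 16-byte header whose incl_len field says how many payload bytes follow, so a damaged length can leave the reader misaligned for every later record. The Python sketch below is a hypothetical minimal reader (build_pcap and read_records are illustrative names, not Drill code) that flags implausible lengths and keeps going, which is enough to reproduce the effect.

```python
import struct

# Classic pcap layout: 24-byte global header, then 16-byte record headers.
GLOBAL_HEADER = struct.Struct("<IHHiIII")  # magic, major, minor, thiszone, sigfigs, snaplen, network
RECORD_HEADER = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len
PCAP_MAGIC = 0xA1B2C3D4


def build_pcap(payloads, snaplen=65535):
    """Build a little-endian pcap byte string (hypothetical test helper)."""
    out = GLOBAL_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, snaplen, 1)
    for i, payload in enumerate(payloads):
        out += RECORD_HEADER.pack(i, 0, len(payload), len(payload)) + payload
    return out


def read_records(data):
    """Yield (payload, is_corrupt) for each record, never raising on bad lengths."""
    magic, _, _, _, _, snaplen, _ = GLOBAL_HEADER.unpack_from(data, 0)
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian pcap file")
    pos = GLOBAL_HEADER.size
    while pos + RECORD_HEADER.size <= len(data):
        _, _, incl_len, _ = RECORD_HEADER.unpack_from(data, pos)
        pos += RECORD_HEADER.size
        remaining = len(data) - pos
        # A length above snaplen or past the end of the file cannot be trusted.
        is_corrupt = incl_len > snaplen or incl_len > remaining
        # Clamp the read so we keep going, but we may now be out of sync.
        take = min(incl_len, snaplen, remaining)
        yield data[pos:pos + take], is_corrupt
        pos += take


clean = build_pcap([b"\xff" * 20] * 3, snaplen=64)
# Overwrite incl_len of the first record (offset 24 + 8) with a bogus value.
damaged = clean[:32] + struct.pack("<I", 0x7FFFFFFF) + clean[36:]

print([flag for _, flag in read_records(clean)])
print([flag for _, flag in read_records(damaged)])
```

With three 20-byte packets and a snaplen of 64, the undamaged file yields three clean records, while the damaged one yields only records flagged as corrupt: the reader never finds a valid record boundary again after the first bad length.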



On 10/02/2019, 18:01, "Charles Givre" <cgivre@gmail.com> wrote:

    Actually, some good news here…  
    I ran some test queries on the corrupted file and it seemed to work pretty well.  I didn’t get any exceptions!
    
    jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY src_ip ORDER BY packet_count DESC
    . . . . . . .semicolon> LIMIT 10;
    +-----------------------------------------+---------------+
    |                 src_ip                  | packet_count  |
    +-----------------------------------------+---------------+
    | 150.249.255.161                         | 176           |
    | 150.249.255.24                          | 28            |
    | 131.38.3.15                             | 26            |
    | 111.248.196.128                         | 25            |
    | 202.13.230.242                          | 20            |
    | 163.28.217.199                          | 19            |
    | 27.18.36.151                            | 18            |
    | 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed  | 17            |
    | 203.70.190.81                           | 16            |
    | 203.70.182.104                          | 13            |
    +-----------------------------------------+---------------+
    10 rows selected (0.944 seconds)
    
    
    select src_ip, dst_ip from dfs.test.`testv1.pcap` WHERE is_corrupt=1 LIMIT 10;
    +------------------+------------------+
    |      src_ip      |      dst_ip      |
    +------------------+------------------+
    | 118.233.244.60   | 150.249.255.161  |
    | 150.249.255.161  | 165.63.110.188   |
    | 150.249.255.161  | 165.63.110.188   |
    | 172.40.96.180    | 131.39.133.22    |
    | 150.249.255.161  | 165.63.110.188   |
    | 150.249.255.161  | 165.63.110.188   |
    | 150.249.255.161  | 165.63.110.188   |
    | 150.249.255.161  | 165.63.110.188   |
    | 150.249.162.60   | 180.32.119.25    |
    | 150.249.255.161  | 165.63.110.188   |
    +------------------+------------------+
    10 rows selected (1.031 seconds)
    
    
    0: jdbc:drill:zk=local> SELECT  src_port , dst_port , src_mac_address , dst_mac_address
    . . . . . . .semicolon> FROM dfs.test.`testv1.pcap`
    . . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10;
    +-----------+-----------+--------------------+--------------------+
    | src_port  | dst_port  |  src_mac_address   |  dst_mac_address   |
    +-----------+-----------+--------------------+--------------------+
    | 57058     | 443       | 00:0C:DB:1F:72:41  | 88:E0:F3:7A:66:F0  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 443       | 55972     | 00:0C:DB:1F:72:41  | CC:4E:24:1F:4E:00  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 4016      | 7699      | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
    +-----------+-----------+--------------------+--------------------+
    10 rows selected (0.751 seconds)
    
    SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY getCountryName(src_ip) ORDER BY packet_count DESC LIMIT 10;
    +----------------+---------------+
    |    country     | packet_count  |
    +----------------+---------------+
    | Japan          | 269           |
    | Taiwan         | 124           |
    | United States  | 105           |
    | Unknown        | 49            |
    | China          | 26            |
    | South Korea    | 8             |
    | Australia      | 4             |
    | Germany        | 3             |
    | Hong Kong      | 2             |
    | Italy          | 1             |
    +----------------+---------------+
    10 rows selected (1.519 seconds)
    
    SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP BY is_corrupt;
    +-------------+---------------+
    | is_corrupt  | packet_count  |
    +-------------+---------------+
    | 0           | 6408          |
    | 1           | 592           |
    +-------------+---------------+
    2 rows selected (0.931 seconds)
    
    
    This PCAP file worked well with Superset also. 
    
    
    > On Feb 10, 2019, at 10:59, Charles Givre <cgivre@gmail.com> wrote:
    > 
    > If I can get some more examples of corrupted files I’ll test more thoroughly.
    > Also, we’ll need to apply the same methodology to PCAP-NG, so I’ll need some
    > examples there as well.  My strategy is going to be to get as much data as
    > possible out of the corrupt packet.
    > — C
    > 
    > 
    > 
    >> On Feb 10, 2019, at 10:54, Ted Dunning <ted.dunning@gmail.com> wrote:
    >> 
    >> I think that accessing fields in corrupted packets will also cause
    >> exceptions. But this is a great start. Conditionalizing field access on
    >> !is_corrupt() might be sufficient for the next step.
    >> 
    >> 
    >> 
    >> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <cgivre@gmail.com> wrote:
    >> 
    >>> All,
    >>> I posted the following PR for this issue:
    >>> https://github.com/apache/drill/pull/1637
    >>> 
    >>> Basically this PR does two things.
    >>> 1.  It creates a boolean column called is_corrupt and
    >>> 2.  If the PCAP file has a corrupt row, it marks that row as corrupt by
    >>> setting is_corrupt to true and keeps going
    >>> 
    >>> With the example from Giovanni, I was able to find 590 or so corrupt rows
    >>> out of 7000 in that PCAP file.  It was late and I don’t know if that was
    >>> what it was supposed to find, but it worked and I was able to query that file.
    >>> If you guys could send a few more examples, I’d like to test this on other
    >>> files to make sure it works with them.  We’re also going to have to do the
    >>> same thing for the PCAP-NG format, I would assume.
    >>> 
    >>>> On Feb 10, 2019, at 03:07, Ted Dunning <ted.dunning@gmail.com> wrote:
    >>>> 
    >>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <bob@rud.is> wrote:
    >>>> 
    >>>>> ...
    >>>>> And, I did indeed find a few and am just waiting for a formal review so I
    >>>>> can submit them for the Drill dev & tests.
    >>>>> 
    >>>> 
    >>>> Awesome!
    >>> 
    >>> 
    > 
    
    


