Actually, some good news here…
I ran some test queries on the corrupted file and it seemed to work pretty well. I didn’t
get any exceptions!
jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from dfs.test.`testv1.pcap`WHERE
is_corrupt=1 GROUP BY src_ip ORDER BY packet_count DESC
. . . . . . .semicolon> LIMIT 10;
+-----------------------------------------+---------------+
| src_ip | packet_count |
+-----------------------------------------+---------------+
| 150.249.255.161 | 176 |
| 150.249.255.24 | 28 |
| 131.38.3.15 | 26 |
| 111.248.196.128 | 25 |
| 202.13.230.242 | 20 |
| 163.28.217.199 | 19 |
| 27.18.36.151 | 18 |
| 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed | 17 |
| 203.70.190.81 | 16 |
| 203.70.182.104 | 13 |
+-----------------------------------------+---------------+
10 rows selected (0.944 seconds)
select src_ip, dst_ip from dfs.test.`testv1.pcap`WHERE is_corrupt=1 LIMIT 10;
+------------------+------------------+
| src_ip | dst_ip |
+------------------+------------------+
| 118.233.244.60 | 150.249.255.161 |
| 150.249.255.161 | 165.63.110.188 |
| 150.249.255.161 | 165.63.110.188 |
| 172.40.96.180 | 131.39.133.22 |
| 150.249.255.161 | 165.63.110.188 |
| 150.249.255.161 | 165.63.110.188 |
| 150.249.255.161 | 165.63.110.188 |
| 150.249.255.161 | 165.63.110.188 |
| 150.249.162.60 | 180.32.119.25 |
| 150.249.255.161 | 165.63.110.188 |
+------------------+------------------+
10 rows selected (1.031 seconds)
0: jdbc:drill:zk=local> SELECT src_port , dst_port , src_mac_address , dst_mac_address
. . . . . . .semicolon> FROM dfs.test.`testv1.pcap`
. . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10;
+-----------+-----------+--------------------+--------------------+
| src_port | dst_port | src_mac_address | dst_mac_address |
+-----------+-----------+--------------------+--------------------+
| 57058 | 443 | 00:0C:DB:1F:72:41 | 88:E0:F3:7A:66:F0 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 443 | 55972 | 00:0C:DB:1F:72:41 | CC:4E:24:1F:4E:00 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 4016 | 7699 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
| 80 | 20706 | 00:0C:DB:1F:72:41 | 00:12:E2:C0:3F:09 |
+-----------+-----------+--------------------+--------------------+
10 rows selected (0.751 seconds)
SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap`
WHERE is_corrupt=1 GROUP BY getCountryName(src_ip) ORDER BY packet_count DESC LIMIT 10;
+----------------+---------------+
| country | packet_count |
+----------------+---------------+
| Japan | 269 |
| Taiwan | 124 |
| United States | 105 |
| Unknown | 49 |
| China | 26 |
| South Korea | 8 |
| Australia | 4 |
| Germany | 3 |
| Hong Kong | 2 |
| Italy | 1 |
+----------------+---------------+
10 rows selected (1.519 seconds)
SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP BY is_corrupt;
+-------------+---------------+
| is_corrupt | packet_count |
+-------------+---------------+
| 0 | 6408 |
| 1 | 592 |
+-------------+---------------+
2 rows selected (0.931 seconds)
This PCAP file worked well with Superset also.
> On Feb 10, 2019, at 10:59, Charles Givre <cgivre@gmail.com> wrote:
>
> If I can get some more examples of corrupted files I’ll test more thoroughly. Also,
we’ll need to apply the same methodology to PCAP-NG, so I’ll need some examples there
as well. My strategy is going to be get as much data as possible out of the corrupt packet.
> — C
>
>
>
>> On Feb 10, 2019, at 10:54, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>> I think that accessing fields in corrupted packets will also cause
>> exceptions. But this is a great start. Conditionalizing field access on
>> !is_corrupt() might be sufficient for the next step.
>>
>>
>>
>> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <cgivre@gmail.com> wrote:
>>
>>> All,
>>> I posted the following PR for this issue:
>>> https://github.com/apache/drill/pull/1637 <
>>> https://github.com/apache/drill/pull/1637>
>>>
>>> Basically this PR does two things.
>>> 1. It creates a boolean column called is_corrupt and
>>> 2. If the PCAP file has a corrupt row, it marks that row as corrupt by
>>> setting is_corrupt to true and keeps going
>>>
>>> WIth the example from Giovanni, I was able to find 590 or so corrupt rows
>>> out of 7000 in that PCAP file. It was late and I don’t know if that was
>>> what ti was supposed to find, but it worked and was able to query that.
>>> If you guys could send a few more examples, I’d like to test this on other
>>> files to make sure it works with them. We’re also going to have to do the
>>> same thing for the PCAP-NG format I would assume.
>>>
>>>> On Feb 10, 2019, at 03:07, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>
>>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <bob@rud.is> wrote:
>>>>
>>>>> ...
>>>>> And, I did indeed find a few and am just waiting for a formal review
so
>>> I
>>>>> can submit them for the Drill dev & tests.
>>>>>
>>>>
>>>> Awesome!
>>>
>>>
>
|