drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Query Error on PCAP over MapR FS
Date Mon, 11 Sep 2017 20:07:56 GMT
On Mon, Sep 11, 2017 at 11:23 AM, Takeo Ogawara <ta-ogawara@kddi-research.jp> wrote:

> ...
>
> 1. Query error when cluster-name is not specified
> ...
>
> With this setting, the following query failed.
> > select * from mfs.`x.pcap` ;
> > Error: DATA_READ ERROR: /x.pcap (No such file or directory)
> >
> > File name: /x.pcap
> > Fragment 0:0
> >
> > [Error Id: 70b73062-c3ed-4a10-9a88-034b4e6d039a on node21:31010] (state=,code=0)
>
> But these queries passed.
> > select * from mfs.root.`x.pcap` ;
> > select * from mfs.`x.csv`;
> > select * from mfs.root.`x.csv`;
>

As Andries mentioned, the problem here comes down to understanding how
Drill manipulates paths. It has nothing to do with the PCAP capabilities.

Usually, I add entries to the configuration that point directly to the
directory above my data, but I can't add anything to Andries' comment.


> 2. Large PCAP file
> Query on very large PCAP file (larger than 100GB) failed with following
> error message.
> > Error: SYSTEM ERROR: IllegalStateException: Bad magic number = 0a0d0d0a
> >
> > Fragment 1:169
> >
> > [Error Id: 8882c359-c253-40c0-866c-417ef1ce5aa3 on node22:31010] (state=,code=0)
>
> This happens even on Linux FS not MapR FS
>

Can you provide the stack trace from the Drillbit that hit the problem?

I suspect that this has to do with splitting of the PCAP file. Normally, it
is assumed that parallelism will be achieved by having lots of smaller
files since it is difficult to jump into the middle of a PCAP file and get
good results.
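
To make the difficulty concrete: a classic PCAP file has one 24-byte global
header followed by back-to-back packet records, each a 16-byte record header
plus payload, with no sync marker between records. The sketch below is plain
Python, not Drill's reader, and the helper names build_capture and magic_at
are made up for illustration. It builds a tiny capture in memory and shows
that a reader landing at an arbitrary offset does not find the file magic
there.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap magic (microsecond timestamps)

def build_capture(payloads):
    """Build a little-endian classic pcap byte string from raw payloads."""
    # Global header: magic, version 2.4, thiszone, sigfigs, snaplen,
    # link type (1 = Ethernet). 24 bytes in total.
    out = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    for i, payload in enumerate(payloads):
        # Record header: ts_sec, ts_usec, incl_len, orig_len (16 bytes).
        out += struct.pack("<IIII", i, 0, len(payload), len(payload))
        out += payload
    return out

def magic_at(data, offset):
    """Interpret the 4 bytes at offset as a little-endian magic number."""
    return struct.unpack_from("<I", data, offset)[0]

capture = build_capture([b"\x00" * 60, b"\xff" * 60])
print(magic_at(capture, 0) == PCAP_MAGIC)   # start of file: the magic is there
print(magic_at(capture, 50) == PCAP_MAGIC)  # mid-file: packet bytes, not the magic
```

A fragment that is handed a byte range starting anywhere but offset 0 is in
the second situation, which would be consistent with a bad magic number
error.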

Even if we disable splitting to avoid this error, you will have the
complementary problem of slow queries due to single-threading. That doesn't
seem very satisfactory either.

A similar problem is that splitting a PCAP file pretty much requires a
single-threaded read of the file in question. The read doesn't need to
process very much data, but it does need to touch the whole file.
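
A minimal sketch of such a scan, in Python rather than anything Drill ships:
it reads only the 16-byte record headers and seeks over the payloads, so it
touches the whole file but processes little data. The function name
find_split_offsets, the target_bytes parameter, and the restriction to
little-endian classic PCAP (magic a1b2c3d4) are assumptions made for
illustration.

```python
import io
import struct

GLOBAL_HEADER_LEN = 24
RECORD_HEADER_LEN = 16

def find_split_offsets(f, target_bytes):
    """Return record-aligned offsets at which a classic pcap stream can be cut.

    Assumes a little-endian classic pcap file; illustrative helper only,
    not part of Drill.
    """
    magic, = struct.unpack("<I", f.read(4))
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian classic pcap: %08x" % magic)
    offsets = [GLOBAL_HEADER_LEN]   # the first chunk starts at the first record
    pos = GLOBAL_HEADER_LEN
    f.seek(pos)
    while True:
        header = f.read(RECORD_HEADER_LEN)
        if len(header) < RECORD_HEADER_LEN:
            break                   # end of file (or a truncated record)
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack("<IIII", header)
        if pos - offsets[-1] >= target_bytes:
            offsets.append(pos)     # start a new chunk at this record
        pos += RECORD_HEADER_LEN + incl_len
        f.seek(pos)                 # skip the payload without reading it
    return offsets

# Demo: three 100-byte packets, cut roughly every 200 bytes.
demo = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
for i in range(3):
    demo += struct.pack("<IIII", i, 0, 100, 100) + bytes(100)
print(find_split_offsets(io.BytesIO(demo), 200))  # [24, 256]
```

Each byte range between consecutive offsets is a run of whole records, so
writing it out after a copy of the 24-byte global header should give a
standalone PCAP file. The scan itself is inherently sequential, since each
record's position depends on the length stored in the one before it.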

Is it absolutely required to query large files like this? Would it be
acceptable to split the file first by making a quick scan over it?
