drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: known bug in csv header parsing
Date Thu, 20 May 2021 14:40:46 GMT
Luoc,

How do I use the CompliantTextBatchReader?

How is the speed?

Can you point me at the old CSV reader? I am not sure where it is.



On Thu, May 20, 2021 at 1:09 AM luoc <luoc@apache.org> wrote:

> Hello Ted,
> It's a nice idea. I did a quick review of the CSV reader but did not find
> any settings for handling such errors. Also, we have refactored the CSV
> format to use the EVF; please see CompliantTextBatchReader.java (it
> complies with the RFC 4180 standard for text/csv files).
>
> > On May 20, 2021, at 13:49, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > I have a csv file that causes an exception when read by Drill. The file is
> > slightly mal-formed (but R can read it).
> >
> > Interestingly, if I don't parse the header line, I don't get the exception
> > and the problematic embedded quotes are handled well. Likewise, deleting
> > the first data line (which is well-formed) causes the exception to go away.
> > Deleting the second data line also causes the exception to stop. Fixing the
> > quoting of the included quotes also fixes the problem. Swapping the lines
> > works like deleting the first line. Repeating the first line after the
> > second line still gets the exception.
> >
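
For reference, RFC 4180 expects a quote inside a quoted field to be doubled. The following is a minimal sketch in Python (not Drill code, and not taken from this thread) that shows what the corrected quoting of the sample rows looks like. It relies only on the standard csv module; the variable names are illustrative.

```python
import csv
import io

# The intended values of the sample file, including the embedded quotes.
rows = [
    ["desc", "name"],
    ["foo", "x"],
    ['manure called "foo"', "y"],
]

# QUOTE_ALL quotes every field; embedded quotes are doubled automatically.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
fixed = buf.getvalue()
print(fixed)

# A strict reader accepts the corrected text and recovers the same values.
round_trip = list(csv.reader(io.StringIO(fixed), strict=True))
```

The third line of the corrected text has the inner quotes doubled, which is the form that a strict RFC 4180 parser should accept.
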
> > The file is this:
> > -------------------------
> > desc,name
> > "foo","x"
> > "manure called "foo"","y"
> > -------------------------
> >
> >
> > The exception is shown below. My thought is that if the CSV file is
> > considered mal-formed, we should get an error on the line that says
> > something along the lines of "mal-formed input". Even better would be to
> > allow such lines to be omitted (up to some sanity limit) or to parse it
> > correctly (which happens without headers being parsed).
> >
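
To make the "report or skip, up to a sanity limit" idea concrete, here is a rough sketch in Python rather than in Drill's Java reader. The function name `parse_rows` and the `max_bad_lines` parameter are hypothetical, and the sketch parses one physical line at a time, so it assumes no quoted field contains a line break.

```python
import csv


def parse_rows(lines, max_bad_lines=10):
    """Parse CSV lines one at a time, skipping malformed ones.

    Returns (rows, errors), where errors is a list of
    (line_number, message) pairs. Raises ValueError once more than
    max_bad_lines lines have been rejected.
    """
    rows, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            # strict=True makes the csv module reject a stray quote
            # inside a quoted field instead of guessing.
            rows.append(next(csv.reader([line], strict=True)))
        except csv.Error as exc:
            errors.append((line_number, f"malformed input: {exc}"))
            if len(errors) > max_bad_lines:
                raise ValueError(f"more than {max_bad_lines} malformed lines")
    return rows, errors


sample = [
    "desc,name",
    '"foo","x"',
    '"manure called "foo"","y"',
]
rows, errors = parse_rows(sample)
```

With the sample file, the header and the first data line are kept and the third line is reported with its line number instead of surfacing later as an unrelated exception.
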
> > Anybody have any thoughts?
> >
> > Here is the R behavior (it omits the embedded quotes):
> >
> >> f = read.csv("v.csv")
> >> f
> >                desc name
> > 1               foo    x
> > 2 manure called foo    y
> >
> >
> > And here is the exception:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> > NegativeArraySizeException Please, refer to logs for more information.
> > [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
> > (java.lang.NegativeArraySizeException) null
> > org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
> > org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
> > org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
> > org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
> > org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> > java.security.AccessController.doPrivileged():-2
> > javax.security.auth.Subject.doAs():422
> > org.apache.hadoop.security.UserGroupInformation.doAs():1669
> > org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> > org.apache.drill.common.SelfCleaningRunnable.run():38
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> > java.lang.Thread.run():748
>
