drill-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: DRILL 1.4 - newline in strings not supported
Date Mon, 01 Feb 2016 17:00:11 GMT
Splitting the csv on newlines that are not enclosed in quotes would be a
solution, no? (I mean with a regex.)

Because a valid csv containing newlines in text fields must have quoted
strings, I guess.

Then it could be a kind of csv config parameter, allowNewlineInTexts=true
(like extractHeader, for example).
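
To make the idea concrete, here is a minimal sketch in Python (not Drill
code) of splitting on newlines that fall outside quoted fields. The function
name split_csv_records is hypothetical, and allowNewlineInTexts above is only
a name suggested in this message. The sketch assumes the double quote is the
quote character and that a quote inside a field is escaped by doubling it.

```python
def split_csv_records(text, quote='"'):
    """Split CSV text into records on newlines that fall outside quoted fields.

    Hypothetical helper for illustration only; assumes quotes inside a field
    are escaped by doubling them ("").
    """
    records = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == quote:
            # A doubled quote toggles the state twice, so it ends up unchanged.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "\n" and not in_quotes:
            # Newline outside quotes: this is a real record boundary.
            records.append("".join(current))
            current = []
        else:
            # Any other character, including a newline inside quotes.
            current.append(ch)
    if current:
        records.append("".join(current))
    return records


sample = 'id,text\n1,"first line\nsecond line"\n2,"plain"\n'
records = split_csv_records(sample)
for record in records:
    print(repr(record))
```

Note that this scan has to start from the beginning of the file to know
whether a given newline is inside quotes, which is probably the splitting
difficulty mentioned earlier in the thread.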


2016-02-01 17:40 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:

> Then it's similar to DRILL-3178 indeed.
> Unfortunately there is no way I can think of to read csv files in Drill
> without replacing the new line characters.
> As Ted mentioned, Drill expects one data row per line to allow easy
> splitting of csv files.
>
> On Mon, Feb 1, 2016 at 8:24 AM, Nicolas Paris <niparisco@gmail.com> wrote:
>
> > Abdel,
> >
> > select * on my csv file fails as well
> >
> > Thanks
> >
> > 2016-02-01 17:16 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> >
> > > When you run a select * on your csv file, does it succeed or fail ?
> > >
> > > On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > >
> > > > @Abdel,
> > > >
> > > > Yes, the problem is similar. By the way, the JIRA issue already exists,
> > > > doesn't it?
> > > >
> > > >
> > >
> >
> https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > If not, I would be glad to add one. Just tell me why
> > > >
> > > > @Ted,
> > > >
> > > > If you have new lines in your files then the files become unsuitable
> > > > for splitting.  This means that the only parallelism available in a ctas
> > > > statement is multiple files
> > > >
> > > > Does it mean newlines are incompatible with Drill's distributed
> > > > computation?
> > > >
> > > > Do you have a fair number of files?
> > > > I have one 30GB csv file. I don't know how many parquet files it could
> > > > create, as the process crashes because of newlines.
> > > > I can imagine approx. 5 parquet files of 500 MB.
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > > >
> > > > > Another user already reported some problems querying csv files with
> > > > > new line characters:
> > > > >
> > > > > http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> > > > >
> > > > > His particular problem was related to a bug in the LIKE function.
> > > > > Unfortunately he never got around to filing a JIRA for his issue.
> > > > >
> > > > > Is your problem similar? If yes, then can you please file a JIRA?
> > > > >
> > > > > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > >
> > > > > > Hello Abdel,
> > > > > >
> > > > > > I am creating parquet files from those CSV files (CREATE TABLE syntax).
> > > > > > Basically, I have a text column, with a maximum of 50k characters,
> > > > > > containing newlines (the texts are extracted from PDFs). I have
> > > > > > several million tuples of texts. I am subsetting texts containing some
> > > > > > patterns (LIKE '%foo%' or regex; sadly I haven't found any mention of
> > > > > > regex in the documentation, i.e. an equivalent of the postgresql "~"
> > > > > > operator).
> > > > > > Usually I use postgresql or monetdb to mine the texts, but I am
> > > > > > benchmarking/studying apache drill too.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > >
> > > > > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > > > > >
> > > > > > > Hey Nicolas,
> > > > > > >
> > > > > > > what kind of queries are you running on your csv file ?
> > > > > > >
> > > > > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am trying to import a csv containing large texts. They contain the
> > > > > > > > newline character "\n".
> > > > > > > > Apache Drill complains about that. There is a jira issue open on this:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > > > > >
> > > > > > > > Is there a workaround? (other than removing \n from the texts)
> > > > > > > >
> > > > > > > > Thanks in advance
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Abdelhakim Deneche
> > > > > > >
> > > > > > > Software Engineer
> > > > > > >
> > > > > > >   <http://www.mapr.com/>
> > > > > > >
> > > > > > >
> > > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
>
>
>
