drill-user mailing list archives

From Abdel Hakim Deneche <adene...@maprtech.com>
Subject Re: DRILL 1.4 - newline in strings not supported
Date Mon, 01 Feb 2016 16:40:00 GMT
Then it's similar to DRILL-3178 indeed.
Unfortunately there is no way I can think of to read csv files in Drill
without replacing the new line characters.
As Ted mentioned, Drill expects one data row per line so that csv files
can be split easily.
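
If replacing the newline characters is acceptable, the preprocessing step could look roughly like the sketch below. It is only an illustration in Python, not something Drill provides: the function name flatten_newlines and the file names are hypothetical, and it assumes that the fields containing line breaks are properly quoted in the source csv so that a standard CSV parser can tell them apart from row boundaries.

```python
import csv


def flatten_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing line breaks inside fields.

    src and dst are open text file objects (hypothetical names; open real
    files with newline="" so the csv module sees the raw line endings).
    Returns the number of rows written.
    """
    reader = csv.reader(src)
    # Use "\n" as the row terminator so each output row sits on one line.
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\n", replacement)
                 .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned)
        rows += 1
    return rows


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("texts.csv", newline="") as src, \
            open("texts_flat.csv", "w", newline="") as dst:
        flatten_newlines(src, dst)
```

The output then has one data row per line, which is the layout Drill expects. The original line breaks are lost, so pick a replacement string that does not matter for your LIKE patterns.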

On Mon, Feb 1, 2016 at 8:24 AM, Nicolas Paris <niparisco@gmail.com> wrote:

> Abdel,
>
> select * on my csv file fails as well
>
> Thanks
>
> 2016-02-01 17:16 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
>
> > When you run a select * on your csv file, does it succeed or fail ?
> >
> > On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> >
> > > @Abdel,
> > >
> > > Yes, the problem is similar. By the way, the JIRA issue already exists,
> > > doesn't it?
> > >
> > >
> >
> https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > If not, I would be glad to add one. Just tell me why
> > >
> > > @Ted,
> > >
> > > If you have new lines in your files then the files become unsuitable
> > > for splitting.  This means that the only parallelism available in a ctas
> > > statement is multiple files
> > >
> > > Does this mean newlines are incompatible with Drill's distributed
> > > computation?
> > >
> > > Do you have a fair number of files?
> > > I have one 30GB csv file. I don't know how many Parquet files it would
> > > create, as the process crashes because of the newlines.
> > > I would guess approximately 5 Parquet files of 500 MB.
> > >
> > > Thanks,
> > >
> > >
> > > 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > >
> > > > Another user already reported some problems querying csv files with
> > > > new line characters:
> > > >
> > > >
> http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> > > >
> > > > His particular problem was related to a bug in the LIKE function.
> > > > Unfortunately he never got around to filing a JIRA for his issue.
> > > >
> > > > Is your problem similar? If yes, can you please file a JIRA?
> > > >
> > > > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > >
> > > > > Hello Abdel,
> > > > >
> > > > > I am creating Parquet files from those CSV files (CREATE TABLE
> > > > > syntax).
> > > > > Basically, I have a text column, with a maximum of 50k characters,
> > > > > containing newlines (the texts are extracted from PDFs). I have
> > > > > several million tuples of text. I am subsetting the texts that
> > > > > contain certain patterns (LIKE '%foo%' or a regex; sadly I haven't
> > > > > found any mention of regex in the documentation, i.e. an equivalent
> > > > > of the PostgreSQL "~" operator).
> > > > > I usually use PostgreSQL or MonetDB to mine the texts, but I am
> > > > > benchmarking/studying Apache Drill too.
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > > > >
> > > > > > Hey Nicolas,
> > > > > >
> > > > > > what kind of queries are you running on your csv file ?
> > > > > >
> > > > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am trying to import a csv file containing large texts. They
> > > > > > > contain the newline character "\n".
> > > > > > > Apache Drill complains about that. There is an open JIRA issue at
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > > > >
> > > > > > > Is there a workaround (other than removing \n from the texts)?
> > > > > > >
> > > > > > > Thanks in advance
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Abdelhakim Deneche
> > > > > >
> > > > > > Software Engineer
> > > > > >
> > > > > >   <http://www.mapr.com/>
> > > > > >
> > > > > >
> > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>
