drill-user mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: DRILL 1.4 - newline in strings not supported
Date Mon, 01 Feb 2016 16:45:32 GMT
We should enhance Drill's text reader so that you can disable splitting.
Once done, an appropriately escaped newline character could be consumed.
This is future work and I'm not aware of any way to solve this without this
fix.
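
Until such an option exists, the newline replacement Abdel describes below
has to happen outside Drill, before the file is queried. A minimal sketch of
that preprocessing step, assuming the CSV uses standard double-quote quoting
so that Python's csv module can parse the multi-line fields. The function
name and the two-character placeholder are illustrative choices, not anything
Drill defines:

```python
import csv
import io


def flatten_newlines(src, dst, placeholder="\\n"):
    """Copy CSV records from src to dst, one record per physical line.

    Newlines embedded in a field are replaced by `placeholder`
    (hypothetical default: a literal backslash followed by 'n').
    Returns the number of records written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        cleaned = [
            # Handle Windows, old Mac and Unix line endings, in that order.
            field.replace("\r\n", placeholder)
                 .replace("\r", placeholder)
                 .replace("\n", placeholder)
            for field in row
        ]
        writer.writerow(cleaned)
        count += 1
    return count


if __name__ == "__main__":
    # Small in-memory demonstration; record 1 spans two physical lines.
    sample = 'id,text\n1,"first line\nsecond line"\n2,plain\n'
    out = io.StringIO()
    flatten_newlines(io.StringIO(sample, newline=""), out)
    print(out.getvalue(), end="")
```

For a real file, open both the input and the output with newline="" as the
csv module documentation recommends. The replacement is lossy if the texts
already contain the placeholder sequence, so pick a placeholder that cannot
occur in the data.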

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Feb 1, 2016 at 8:40 AM, Abdel Hakim Deneche <adeneche@maprtech.com>
wrote:

> Then it's similar to DRILL-3178 indeed.
> Unfortunately there is no way I can think of to read csv files in Drill
> without replacing the new line characters.
> As Ted mentioned, Drill expects one data row per line to allow easy
> splitting of csv files.
>
> On Mon, Feb 1, 2016 at 8:24 AM, Nicolas Paris <niparisco@gmail.com> wrote:
>
> > Abdel,
> >
> > select * on my csv file fails as well
> >
> > Thanks
> >
> > 2016-02-01 17:16 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> >
> > > When you run a select * on your csv file, does it succeed or fail ?
> > >
> > > On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > >
> > > > @Abdel,
> > > >
> > > > Yes, the problem is similar. By the way, the JIRA issue already
> > > > exists, doesn't it?
> > > >
> > > > https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > If not, I would be glad to add one. Just tell me.
> > > >
> > > > @Ted,
> > > >
> > > > If you have new lines in your files then the files become unsuitable
> > > > for splitting.  This means that the only parallelism available in a
> > > > CTAS statement is multiple files
> > > >
> > > > Does this mean newlines are incompatible with Drill's distributed
> > > > computation?
> > > >
> > > > Do you have a fair number of files?
> > > > I have one 30GB csv file. I don't know how many parquet files it
> > > > could create, as the process crashes because of the newlines.
> > > > I can imagine approx. 5 parquet files of 500 MB.
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > > >
> > > > > Another user already reported some problems querying csv files with
> > > > > new line characters:
> > > > >
> > > > > http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> > > > >
> > > > > His particular problem was related to a bug in the LIKE function.
> > > > > Unfortunately he never got around to filing a JIRA for his issue.
> > > > >
> > > > > Is your problem similar? If yes, can you please file a JIRA?
> > > > >
> > > > > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > >
> > > > > > Hello Abdel,
> > > > > >
> > > > > > I am creating parquet files from those CSV files (CREATE TABLE
> > > > > > syntax). Basically, I have a text column, with a maximum of 50k
> > > > > > characters, containing newlines (the texts are extracted from
> > > > > > PDFs). I have millions of tuples of texts. I am subsetting the
> > > > > > texts that contain some patterns (LIKE '%foo%' or regex; sadly I
> > > > > > haven't found any mention of regex in the documentation, i.e. an
> > > > > > equivalent of the postgresql "~" operator).
> > > > > > Usually I use postgresql or monetdb to mine the texts, but I am
> > > > > > benchmarking/studying apache drill too.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > >
> > > > > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adeneche@maprtech.com>:
> > > > > >
> > > > > > > Hey Nicolas,
> > > > > > >
> > > > > > > What kind of queries are you running on your csv file?
> > > > > > >
> > > > > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am trying to import a csv containing large texts. They
> > > > > > > > contain the newline character "\n".
> > > > > > > > Apache Drill complains about that. There is a JIRA issue open
> > > > > > > > for it:
> > > > > > > >
> > > > > > > > https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg
> > > > > > > >
> > > > > > > > Is there a workaround? (other than removing \n from the texts)
> > > > > > > >
> > > > > > > > Thanks in advance
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Abdelhakim Deneche
> > > > > > > Software Engineer
> > > > > > > <http://www.mapr.com/>
> > > > > > >
> > > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
