drill-user mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: Creating a single parquet or csv file using CTAS command?
Date Thu, 04 Feb 2016 21:13:32 GMT
While both parquet and JavaScript are widely used, they exist in somewhat
different worlds. I cannot find a JavaScript reader for parquet files.

That being said, I'm not so sure that one ought to exist, as parquet files
are designed specifically for storing large volumes of data for scan
efficiency.

Are you feeding a gig of data into D3? Or is the Python code performing
some analysis/reduction on the data that gets sent to D3?

Drill does include the option to output to JSON; could you just produce
JSON files directly with Drill?
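
For example, something along these lines should do it (untested here, and
the table name, columns, and paths are placeholders):

    ALTER SESSION SET `store.format` = 'json';

    CREATE TABLE dfs.tmp.`my_data_json` AS
    SELECT col_a, col_b, col_c
    FROM dfs.`/path/to/input`;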



On Thu, Feb 4, 2016 at 12:13 PM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:

> Hi, Jason.... sorry for the confusion; I'm generating both CSV files and
> parquet. Parquet is just an experiment for me to see if I get better
> performance than with CSV, or than loading the CSV into something like
> TinyDB or MongoDB.
>
> I've found a way to read the parquet files with a Python library, so the
> architecture is: static parquet file(s) of circa 1 GB -> Python in a Flask
> web app reads parquet and generates JSON -> D3.js reads the JSON and
> displays graphics.
>
> You are right: if I can figure out how to make the Python parquet library
> smart about which parquet files to read, that would be great. There should
> be no reason to load one big file into memory on the web server every time
> someone makes a request (although I'm sure this problem could perhaps be
> solved more simply by caching requests with memcache).
>
> Can't think of a reason to make life more complicated by running Drill on
> AWS, but perhaps I'll find one as I continue learning.
>
> Could JavaScript possibly read a parquet file? Perhaps such a library
> exists now that the Node.js world is so huge...?
>
> Thank you.
>
> On Thu, Feb 4, 2016 at 2:40 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
>
> > Are you even trying to write parquet files? In your original post you
> > said you were writing CSV files, but then gave file names with parquet
> > extensions as the ones you are trying to concatenate.
> >
> > I'm a little confused, though, if you are not working with big data
> > tools: concatenating parquet files is not trivial, as it requires
> > rewriting the file-level metadata after concatenating them. You can't
> > concatenate them the way you would CSV files.
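> >
> > If you really do need a single parquet file, a safer route is to let
> > Drill rewrite the data rather than concatenating it yourself, e.g.
> > (paths made up; combine with the single-thread setting Andries mentions
> > below so that only one file is written):
> >
> >   CREATE TABLE dfs.tmp.`merged_parquet` AS
> >   SELECT * FROM dfs.`/path/to/parquet_dir`;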
> >
> > On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
> >
> > > On a desktop you will likely be limited by memory.
> > >
> > > Perhaps set width to 1 to force single-threaded execution, and use
> > > 512MB or 1GB for the parquet block size, depending on how much direct
> > > memory the Drillbit has. This will limit the number of parquet files
> > > being created; see how much smaller the parquet file(s) are compared to
> > > CSV on a subset of the data. Will there be an issue with multiple
> > > parquet files in a directory?
> > >
> > > Before Drill can write the parquet file it needs to build the structure
> > > in memory, so it needs space for both the input data and the parquet
> > > file in memory. A desktop is unlikely to have enough memory for a single
> > > parquet file. Experiment with a subset of your data and see what works.
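> > >
> > > Concretely, the session options I have in mind are something like the
> > > following (double-check the option names on your Drill version):
> > >
> > >   ALTER SESSION SET `planner.width.max_per_node` = 1;
> > >   ALTER SESSION SET `store.parquet.block-size` = 536870912; -- 512MB
> > >
> > >   CREATE TABLE dfs.tmp.`single_parquet` AS
> > >   SELECT * FROM dfs.`/path/to/csv_subset`;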
> > >
> > > Have you looked at cloud solutions to process the data, apart from just
> > > a basic hosting solution? As an example, you will find MapR with Drill
> > > available on demand on Amazon AWS and Azure. You may want to look at
> > > that to spin up node(s), load/process/download your data, and then spin
> > > it down. Might be worth a look, depending on your needs/budget.
> > >
> > > --Andries
> > >
> > >
> > > > On Feb 4, 2016, at 10:57 AM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
> > > >
> > > > Hi Andries, the trouble is that I run Drill on my desktop machine, but
> > > > I have no server available to me that is capable of running Drill.
> > > > Most $10/month hosting accounts do not permit you to run Java apps.
> > > > For this reason I simply use Drill for "pre-processing" of the files
> > > > that I will eventually use in my simple 50-line Python web app.
> > > >
> > > > Even if I could run Drill on my server, this seems like a lot of
> > > > overhead for something as simple as a flat file with 6 columns.
> > > >
> > > > On Thu, Feb 4, 2016 at 1:31 PM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
> > > >
> > > >> You can create multiple parquet files and have the ability to query
> > > >> them all through the Drill SQL interface with minimal overhead.
> > > >>
> > > >> Creating a single 50GB parquet file is likely not the best option for
> > > >> performance; perhaps use Drill partitioning for the parquet files to
> > > >> speed up queries and reads in the future, although parquet should be
> > > >> more efficient than CSV for storing data. You can still limit Drill
> > > >> to a single thread to limit memory use for the parquet CTAS, and
> > > >> potentially the number of files created.
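> > > >>
> > > >> A partitioned CTAS looks something like this (the column names are
> > > >> hypothetical, and PARTITION BY needs Drill 1.1 or later):
> > > >>
> > > >>   CREATE TABLE dfs.tmp.`mydata_parquet`
> > > >>   PARTITION BY (record_year) AS
> > > >>   SELECT record_year, col_a, col_b
> > > >>   FROM dfs.`/path/to/csv_data`;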
> > > >>
> > > >> A bit of experimentation may help to find the optimum config for your
> > > >> use case.
> > > >>
> > > >> --Andries
> > > >>
> > > >>> On Feb 4, 2016, at 10:12 AM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
> > > >>>
> > > >>> Sorry, bad typo: I have 50GB of data, NOT 500GB ;). And I usually
> > > >>> only query a 1 GB subset of this data using Drill.
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
> > > >>>
> > > >>>> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
> > > >>>>
> > > >>>>> Is there a reason to create a single file? Typically you may want
> > > >>>>> more files to improve parallel operation on distributed systems
> > > >>>>> like Drill.
> > > >>>>>
> > > >>>>
> > > >>>> Good question. I'm not actually using Drill for "big data". In
> > > >>>> fact, I never deal with "big data", and I'm unlikely to ever do so.
> > > >>>>
> > > >>>> But I do have 500 GB of CSV files spread across about 100
> > > >>>> directories. They are all part of the same dataset, but this is how
> > > >>>> it's been organized by the government department who released it as
> > > >>>> an Open Data dump.
> > > >>>>
> > > >>>> Drill saves me the hassle of having to stitch these files together
> > > >>>> using Python or awk. I love being able to just query the files
> > > >>>> using SQL (so far it's slow though, I need to figure out why - 18
> > > >>>> seconds for a simple query is too much). The data eventually needs
> > > >>>> to end up on the web to share it with other people, and I use
> > > >>>> crossfilter.js and D3.js for presentation. I need fine-grained
> > > >>>> control over online data presentation, and all the BI tools I've
> > > >>>> seen are terrible in this department, e.g. Tableau.
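> > > >>>>
> > > >>>> In case it helps, a typical query against one of the directories
> > > >>>> looks something like this (Drill exposes headerless CSV fields via
> > > >>>> the columns array; the names here are made up):
> > > >>>>
> > > >>>>   SELECT columns[0] AS region, columns[3] AS amount
> > > >>>>   FROM dfs.`/data/opendata/dir_042`
> > > >>>>   WHERE columns[0] = 'Ontario';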
> > > >>>>
> > > >>>> So I need my data in a format that can be read by common web
> > > >>>> frameworks, and that usually implies dealing with a single file
> > > >>>> that can be uploaded to the web server. No need for a database,
> > > >>>> since I'm just reading a few columns from a big flat file.
> > > >>>>
> > > >>>> I run my apps on a low-cost virtual server. I don't have access to
> > > >>>> Java/VirtualBox/MongoDB etc. Nor do I think these things are
> > > >>>> necessary: K.I.S.S.
> > > >>>>
> > > >>>> So this use case may be quite different from that of many of the
> > > >>>> more "corporate" users, but Drill is so very useful regardless.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>
