drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Reg; Apache Drill
Date Wed, 07 Jun 2017 07:28:17 GMT

Let me rephrase what Bob has to say. It has some merit, but it also
probably has a bit more sting than it needs to have.

The first question that you need to look at in any kind of textual analysis
project is what kind of data you are likely to have. How will the data be
presented to you? For instance, at two different extremes there are the
Twitter API (with a very well specified data format and lots of well coded
meta-data) and there are patient notes in raw image form (hand-written data
with no transcriptions and possibly very little meta-data). As you can
imagine, the tasks that you need to do on each extreme are very, very
different.

Another key aspect of your data is how big it really is. If you only have
millions of examples, then big data is going to be just a hindrance, not a
help. If you have billions of text examples, then big data may become a

Beyond the data source, you need to look at what kind of analysis you need
to do. In particular, it is likely that there will be some sort of
statistical analysis of the data that you are looking at. You might be
looking at some indicators of particular test results that might be found
in social media. Or you might be looking to predict cases of misdiagnosis.
In either case Drill (or Hive) would only be useful for counting up the
cases that have specific features. Finding the features and interpreting
the counts you produce would require other software.

This means that a SQL system like Drill or Hive will have a very minor role
in your analysis. Indeed, many systems that are good for data reduction
(like R or Spark) can do all the counting that Drill or Hive can do.
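
To give a rough sense of how small the counting step is, here is a minimal
Python sketch. The records and feature names are made up for illustration,
and it assumes the features have already been extracted from the text, which
is the hard part and is not shown. The count itself is just the equivalent of
a SQL GROUP BY with COUNT(*).

```python
from collections import Counter

# Hypothetical, already-extracted records: (post_id, feature) pairs.
# Finding these features in raw text is the real work and is not shown here.
records = [
    (1, "mentions_test_result"),
    (2, "mentions_misdiagnosis"),
    (3, "mentions_test_result"),
    (4, "mentions_test_result"),
]


def count_by_feature(rows):
    """Count how many records carry each feature (like GROUP BY feature, COUNT(*))."""
    return Counter(feature for _, feature in rows)


counts = count_by_feature(records)
for feature, n in counts.most_common():
    print(feature, n)
```

Any general-purpose tool can do this much. Deciding what counts as a feature
and what the resulting numbers mean is where the effort goes.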

I hope this helps.

On Wed, Jun 7, 2017 at 3:32 AM, Bob Rudis <bob@rud.is> wrote:

> You should likely spend some time studying statistics and machine
> learning then examine the pluses and minuses of a few "data
> science"-oriented programming languages and focus on one that has
> idioms that make sense to you. Then you'll see just how inappropriate
> your question is.
> On Tue, Jun 6, 2017 at 8:07 AM, Pritam Tambe <pritamm@cdac.in> wrote:
> > Dear Sir,
> >
> > I want to do Social Media Data analysis for Health Domain using Big Data.
> >
> > I am confused whether to go for Apache Drill or HIVE.
> >
> > Please Guide.
> >
> >
> > --
> > Thanks & Regards,
> > Pritam Tambe,
> > Project Engineer - AAI Group,
> > Centre for Development of Advanced Computing [C-DAC],
> >
> > ------------------------------------------------------------
> > [ C-DAC is on Social-Media too. Kindly follow us at:
> > Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ]
> >
> > This e-mail is for the sole use of the intended recipient(s) and may
> > contain confidential and privileged information. If you are not the
> > intended recipient, please contact the sender by reply e-mail and destroy
> > all copies and the original message. Any unauthorized review, use,
> > disclosure, dissemination, forwarding, printing or copying of this email
> > is strictly prohibited and appropriate legal action will be taken.
> > ------------------------------------------------------------
> >
