spark-user mailing list archives

From Mal Edwin <mal.ed...@vinadionline.com>
Subject RE: RE: Fast write datastore...
Date Thu, 16 Mar 2017 11:18:12 GMT
Hi All,
I believe what we are looking for here is a serving layer where user queries can be executed
on a subset of the processed data.
We use Impala for this because it provides layered caching: in our use case it caches part
of the data in memory and part in HDFS, while the full set is on S3.

Our processing layer is Spark Streaming + HBase -> extracts to Parquet on S3 -> Impala as
the serving layer for user requests. Impala also has a SQL interface. The drawback is that
Impala is not managed via YARN and has its own resource manager, so you would have to figure
out a way to make YARN and Impala co-exist.
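
As a rough illustration of the layered read path described above, here is a toy sketch in plain
Python. It is not how Impala implements caching; the class name, tier names, capacities and
the simple oldest-first eviction are all assumptions made only to show the memory -> HDFS -> S3
lookup order.

```python
from typing import Dict, Tuple


class TieredStore:
    """Toy read-through lookup over three tiers (hypothetical, not Impala's API)."""

    def __init__(self, full_set: Dict[str, bytes],
                 memory_capacity: int, hdfs_capacity: int) -> None:
        self.full_set = full_set            # stands in for the full data set on S3
        self.memory: Dict[str, bytes] = {}  # smallest, fastest tier
        self.hdfs: Dict[str, bytes] = {}    # larger, slower tier
        self.memory_capacity = memory_capacity
        self.hdfs_capacity = hdfs_capacity

    @staticmethod
    def _put(tier: Dict[str, bytes], capacity: int, key: str, value: bytes) -> None:
        """Insert into a tier, evicting the oldest entry when the tier is full."""
        if capacity <= 0:
            return
        if key not in tier and len(tier) >= capacity:
            oldest = next(iter(tier))       # dicts preserve insertion order
            del tier[oldest]
        tier[key] = value

    def read(self, key: str) -> Tuple[bytes, str]:
        """Return the value and the name of the tier that served it."""
        if key in self.memory:
            return self.memory[key], "memory"
        if key in self.hdfs:
            value = self.hdfs[key]
            # Promote to the memory tier for later reads.
            self._put(self.memory, self.memory_capacity, key, value)
            return value, "hdfs"
        value = self.full_set[key]          # slowest path: the full set
        self._put(self.hdfs, self.hdfs_capacity, key, value)
        self._put(self.memory, self.memory_capacity, key, value)
        return value, "s3"
```

For example, with `TieredStore({"a": b"1", "b": b"2"}, memory_capacity=1, hdfs_capacity=2)`, the
first `read("a")` is served from the "s3" tier and the next one from "memory".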

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin <yohannjardin@hotmail.com>, wrote:
> Hello everyone,
>
> I'm also really interested in the answers as I will be facing the same issue soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I also came across Alluxio, which can store Spark results in memory; you might want to investigate it.
>
> In my case I want to use them to power a real-time dashboard (or one that takes only a few seconds to refine), and that use case seems similar to yours: filtering and aggregating previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz <rahvin@gmail.com>
> Sent: Thursday, 16 March 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, tiered HDFS storage with Parquet might also be an option. Of course, management-wise it has much more overhead than using ES, since you need to manually define partitions and buckets, which is suboptimal. On the other hand, for querying, you can probably get decent performance by hooking up Impala, Presto or LLAP-Hive, if Spark is too slow or cumbersome.
> Depending on your particular access patterns, this may not be very practical, but as a general approach it might be one way to get intermediate results quicker, and with less of a storage zoo than some alternatives.
>
> > On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.shiv@gmail.com> wrote:
> > > I do think Kafka is overkill in this case. There are no streaming use cases that need a queue for pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv" <vvshvv@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
> > > > >
> > > > > I do not think so. In this case you will be able to process Parquet files as usual, but Kafka will allow your Elasticsearch cluster to stay stable regardless of the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > >
> > > > >
> > > > > On Mar 16, 2017 7:52 AM, jasbir.sing@accenture.com wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Will MongoDB not fit this solution?
> > > > > >
> > > > > >
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvshvv@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar <babloo80@gmail.com>
> > > > > > Cc: vincent gromakowski <vincent.gromakowski@gmail.com>; Richard Siebeling <rsiebeling@gmail.com>; user <user@spark.apache.org>; Shiva Ramagopal <tr.shiv@gmail.com>
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > Hi Muthu,
> > > > > >
> > > > > > I did not catch this from your message: what performance do you expect from subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <babloo80@gmail.com> wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > From the Elasticsearch documentation (I have to see the logical plan of a query to confirm), the richness of filters (regex, etc.) is pretty good compared to Cassandra. As for aggregates, I think Spark DataFrames are rich enough to handle them.
> > > > > > > Let me know your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Muthu
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvshvv@gmail.com> wrote:
> > > > > > > > Hi Muthu,
> > > > > > > >
> > > > > > > > I agree with Shiva. Cassandra also supports SASI indexes, which can partially replace Elasticsearch functionality.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Uladzimir
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mar 15, 2017 5:57 PM, Shiva Ramagopal <tr.shiv@gmail.com> wrote:
> > > > > > > > > Cassandra is probably a good choice if you are mainly looking for a datastore that supports fast writes. You can ingest the data into a table and define one or more materialized views on top of it to support your queries. Since you mention that your queries are going to be simple, you can define your indexes in the materialized views according to how you want to query the data.
> > > > > > > > > Thanks,
> > > > > > > > > Shiva
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar <babloo80@gmail.com> wrote:
> > > > > > > > > > Hello Vincent,
> > > > > > > > > >
> > > > > > > > > > Cassandra may not fit the bill if I need to define my partitions and other indexes upfront. Is this right?
> > > > > > > > > >
> > > > > > > > > > Hello Richard,
> > > > > > > > > >
> > > > > > > > > > Let me evaluate Apache Ignite. I evaluated it 3 months ago, and back then the connector to Apache Spark did not support Spark 2.0.
> > > > > > > > > >
> > > > > > > > > > Another drastic thought may be to repartition the result to 1 partition (but I have to be cautious to make sure I don't run into heap issues if the result is too large to fit into an executor) and write to a relational database like MySQL / Postgres. But I believe I can do the same using Elasticsearch too.
> > > > > > > > > >
> > > > > > > > > > A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
> > > > > > > > > >
> > > > > > > > > > More thoughts welcome please.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Muthu
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebeling@gmail.com> wrote:
> > > > > > > > > > > Maybe Apache Ignite fits your requirements.
> > > > > > > > > > >
> > > > > > > > > > > On 15 March 2017 at 08:44, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
> > > > > > > > > > > > Hi
> > > > > > > > > > > > If queries are static and filters are on the same columns, Cassandra is a good option.
> > > > > > > > > > > >
> > > > > > > > > > > > On 15 March 2017 7:04 AM, "muthu" <babloo80@gmail.com> wrote:
> > > > > > > > > > > > > Hello there,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have one or more Parquet files to read and perform some aggregate
> > > > > > > > > > > > > queries on using Spark DataFrames. I would like to find a reasonably
> > > > > > > > > > > > > fast datastore that allows me to write the results for subsequent
> > > > > > > > > > > > > (simpler) queries.
> > > > > > > > > > > > > I did attempt to use Elasticsearch to write the query results using
> > > > > > > > > > > > > the Elasticsearch Hadoop connector. But I am running into connector
> > > > > > > > > > > > > write issues if there are too many Spark executors for Elasticsearch
> > > > > > > > > > > > > to handle.
> > > > > > > > > > > > > But in the schema sense, this seems a great fit as Elasticsearch has
> > > > > > > > > > > > > smarts in place to discover the schema. Also in the query sense, I can
> > > > > > > > > > > > > perform simple filters and sorts using Elasticsearch, and for more
> > > > > > > > > > > > > complex aggregates, Spark DataFrames can come back to the rescue :).
> > > > > > > > > > > > > Please advise on other possible datastores I could use.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Muthu
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
>
