kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Spark on Kudu
Date Wed, 08 Jun 2016 20:19:56 GMT
What's in this doc is what's gonna get released:
https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark

J-D
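For readers skimming the archive: the integration that doc describes is DataFrame-based. Below is a rough Java sketch of the read path, not taken from the doc itself — the datasource package name (`org.kududb.spark.kudu`), the option keys, the master address, and the table name are all assumptions based on the pre-1.0 artifact naming, and running it requires Spark 1.6-era jars plus a reachable Kudu cluster, so treat the linked developing.adoc as authoritative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch only: assumes a reachable Kudu master, an existing table named
// "my_table", and the pre-1.0 datasource package name. Check the linked
// developing.adoc for the authoritative example.
public class KuduSparkReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kudu-read").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Kudu table through the datasource API as a DataFrame.
        DataFrame df = sqlContext.read()
                .format("org.kududb.spark.kudu")
                .option("kudu.master", "kudu-master.example.com:7051")
                .option("kudu.table", "my_table")
                .load();

        // Register it so SparkSQL queries can run against it.
        df.registerTempTable("my_table");
        sqlContext.sql("SELECT * FROM my_table LIMIT 10").show();
        sc.stop();
    }
}
```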

On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> Will this be documented with examples once 0.9.0 comes out?
>
> Thanks,
> Ben
>
>
> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
> It will be in 0.9.0.
>
> J-D
>
> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Hi Chris,
>>
>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>
>> Thanks,
>> Ben
>>
>>
>> On May 18, 2016, at 9:01 AM, Chris George <Christopher.George@rms.com>
>> wrote:
>>
>> There is some code in review that needs some more refinement.
>> It will allow upsert/insert from a DataFrame using the datasource API. It
>> will also allow the creation and deletion of tables from a DataFrame:
>> http://gerrit.cloudera.org:8080/#/c/2992/
>>
>> Example usages will look something like:
>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>
>> -Chris George
>>
>>
>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>
>> Can someone tell me what the state is of this Spark work?
>>
>> Also, does anyone have any sample code on how to update/insert data in
>> Kudu using DataFrames?
>>
>> Thanks,
>> Ben
>>
>>
>> On Apr 13, 2016, at 8:22 AM, Chris George <Christopher.George@rms.com>
>> wrote:
>>
>> SparkSQL cannot support these types of statements, but we may be able to
>> implement similar functionality through the API.
>> -Chris
>>
>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>
>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it
>> were to be implemented.
>>
>> MERGE INTO table_name USING table_reference ON (condition)
>>   WHEN MATCHED THEN
>>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>>   WHEN NOT MATCHED THEN
>>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>
>> Cheers,
>> Ben
>>
>> On Apr 11, 2016, at 12:21 PM, Chris George <Christopher.George@rms.com>
>> wrote:
>>
>> I have a WIP kuduRDD that I made a few months ago. I pushed it into
>> gerrit if you want to take a look.
>> http://gerrit.cloudera.org:8080/#/c/2754/
>> It does push down predicates, which the existing input-format-based RDD
>> does not.
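To make the pushdown point concrete, here is a self-contained toy sketch (nothing from the actual patch — plain Java collections stand in for a Kudu tablet) contrasting a scan that applies the predicate at the "server" with an input-format-style read that ships every row back and filters on the client:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy illustration of why predicate pushdown matters: with pushdown the
// "tablet" filters rows before returning them, so only matching rows cross
// the wire; the input-format-style path ships every row and filters on the
// client. Both produce the same result; the transferred counters differ.
public class PushdownSketch {
    // Pretend tablet data: one int "key" column.
    static final List<Integer> TABLET = Arrays.asList(1, 5, 10, 50, 100, 500);

    /** Pushdown scan: predicate applied at the tablet; only matches "transfer". */
    static List<Integer> scanWithPushdown(Predicate<Integer> pred, int[] transferred) {
        List<Integer> out = new ArrayList<>();
        for (int row : TABLET) {
            if (pred.test(row)) out.add(row);
        }
        transferred[0] = out.size(); // only matching rows crossed the "wire"
        return out;
    }

    /** Naive read: the full scan is shipped to the client, then filtered there. */
    static List<Integer> readThenFilter(Predicate<Integer> pred, int[] transferred) {
        List<Integer> all = new ArrayList<>(TABLET);
        transferred[0] = all.size(); // every row crossed the "wire"
        all.removeIf(pred.negate());
        return all;
    }
}
```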
>>
>> Within the next two weeks I’m planning to implement a datasource for
>> Spark that will have pushdown predicates and insertion/update functionality
>> (I need to look more at the Cassandra and HBase datasources for the best
>> way to do this). I agree that server-side upsert would be helpful.
>> Having a datasource would give us useful DataFrames and also make Spark
>> SQL usable for Kudu.
>>
>> My reasoning for having a Spark datasource and not using Impala is:
>> 1. We have had trouble getting Impala to run fast with high concurrency
>> when compared to Spark.
>> 2. We interact with datasources which do not integrate with Impala.
>> 3. We have custom SQL query planners for extended SQL functionality.
>>
>> -Chris George
>>
>>
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcryans@apache.org> wrote:
>>
>> You guys make a convincing point, although on the upsert side we'll need
>> more support from the servers. Right now all you can do is an INSERT then,
>> if you get a dup key, do an UPDATE. I guess we could at least add an API on
>> the client side that would manage it, but it wouldn't be atomic.
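The non-atomic fallback J-D describes can be sketched as follows. This is a self-contained illustration, not the Kudu client: a map stands in for the table, and the two steps stand in for the Insert and Update operations a real KuduSession would apply; the point is only the insert-then-update shape and why it is not atomic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "INSERT, then UPDATE on duplicate key" fallback. The map
// stands in for a Kudu table; with the real Java client the two steps would
// be separate operations applied through a KuduSession, which is exactly why
// the whole thing is not atomic: another writer can insert or update the
// same key between our failed insert and our update.
public class ClientSideUpsert {
    private final Map<String, String> table = new ConcurrentHashMap<>();

    /** Returns "inserted" or "updated" depending on which path was taken. */
    public String upsert(String key, String value) {
        // Step 1: attempt the insert; putIfAbsent mimics an INSERT that
        // fails with a duplicate-key error when the row already exists.
        String existing = table.putIfAbsent(key, value);
        if (existing == null) {
            return "inserted";
        }
        // Step 2: duplicate key, so fall back to an UPDATE.
        table.put(key, value);
        return "updated";
    }

    public String get(String key) {
        return table.get(key);
    }
}
```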
>>
>> J-D
>>
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <mark@clearstorydata.com>
>> wrote:
>>
>>> It's pretty simple, actually.  I need to support versioned datasets in a
>>> Spark SQL environment.  Instead of a hack on top of a Parquet data store,
>>> I'm hoping (among other reasons) to be able to use Kudu's write and
>>> timestamp-based read operations to support not only appending data, but
>>> also updating existing data, and even some schema migration.  The most
>>> typical use case is a dataset that is updated periodically (e.g., weekly or
>>> monthly) in which the preliminary data in the previous window (week or
>>> month) is updated with values that are expected to remain unchanged from
>>> then on, and a new set of preliminary values for the current window need to
>>> be added/appended.
>>>
>>> Using Kudu's Java API and developing additional functionality on top of
>>> what Kudu has to offer isn't too much to ask, but the ease of integration
>>> with Spark SQL will gate how quickly we would move to using Kudu and how
>>> seriously we'd look at alternatives before making that decision.
>>>
>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcryans@apache.org
>>> > wrote:
>>>
>>>> Mark,
>>>>
>>>> Thanks for taking some time to reply in this thread, glad it caught the
>>>> attention of other folks!
>>>>
>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <mark@clearstorydata.com
>>>> > wrote:
>>>>
>>>>> Do they care being able to insert into Kudu with SparkSQL
>>>>>
>>>>>
>>>>> I care about insert into Kudu with Spark SQL.  I'm currently delaying
>>>>> a refactoring of some Spark SQL-oriented insert functionality while trying
>>>>> to evaluate what to expect from Kudu.  Whether Kudu does a good job
>>>>> supporting inserts with Spark SQL will be a key consideration as to whether
>>>>> we adopt Kudu.
>>>>>
>>>>
>>>> I'd like to know more about why SparkSQL inserts are necessary for you.
>>>> Is it just that you currently do it that way into some database or parquet
>>>> so with minimal refactoring you'd be able to use Kudu? Would re-writing
>>>> those SQL lines into Scala and directly using the Java API's KuduSession be
>>>> too much work?
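For what it's worth, the direct-Java-API alternative J-D mentions looks roughly like this. It is a sketch under stated assumptions: the pre-1.0 client package (`org.kududb.client`), a reachable master, and an existing "metrics" table with an INT32 `id` key and a STRING `value` column are all assumed, and method names should be double-checked against the client javadoc rather than taken from here.

```java
import org.kududb.client.Insert;
import org.kududb.client.KuduClient;
import org.kududb.client.KuduSession;
import org.kududb.client.KuduTable;
import org.kududb.client.PartialRow;

// Sketch of replacing a SQL "INSERT INTO metrics VALUES (1, 'x')" with
// direct KuduSession calls. Table and column names are illustrative.
public class KuduSessionInsertSketch {
    public static void main(String[] args) throws Exception {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build();
        try {
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession();

            // Build the row and apply it; with the default sync flush mode
            // the operation is sent when applied.
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addInt("id", 1);          // key column
            row.addString("value", "x");  // value column
            session.apply(insert);

            session.close();
        } finally {
            client.shutdown();
        }
    }
}
```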
>>>>
>>>> Additionally, what do you expect to gain from using Kudu VS your
>>>> current solution? If it's not completely clear, I'd love to help you think
>>>> through it.
>>>>
>>>>
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>
>>>>>> Yup, starting to get a good idea.
>>>>>>
>>>>>> What are your DS folks looking for in terms of functionality related
>>>>>> to Spark? A SparkSQL integration that's as fully featured as Impala's?
>>>>>> Do they care being able to insert into Kudu with SparkSQL or just being
>>>>>> able to query real fast? Anything more specific to Spark that I'm
>>>>>> missing?
>>>>>>
>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera
>>>>>> all our resources are committed to making things happen in time, and a
>>>>>> more fully featured Spark integration isn't in our plans during that
>>>>>> period. I'm really hoping someone in the community will help with
>>>>>> Spark, the same way we got a big contribution for the Flume sink.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But,
>>>>>>> since it’s not “production-ready”, upper management doesn’t want to
>>>>>>> fully deploy it yet. They just want to keep an eye on it though. Kudu
>>>>>>> was so much simpler and easier to use in every aspect compared to
>>>>>>> HBase. Impala was great for the report writers and analysts to
>>>>>>> experiment with for the short time it was up. But, once again, the
>>>>>>> only blocker was the lack of Spark support for our Data
>>>>>>> Developers/Scientists. So, production-level data population won’t
>>>>>>> happen until then.
>>>>>>>
>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> J-D,
>>>>>>>>
>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>> updatable hot data store to ensure that duplicates are taken care of
>>>>>>>> and idempotency is maintained. Whether data was directly retrieved
>>>>>>>> from Cassandra for analytics, reports, or searches, it was not clear
>>>>>>>> what its main use was. Some also just used it for a staging area to
>>>>>>>> populate downstream tables in parquet format. The last thing I heard
>>>>>>>> was that CQL was terrible, so that rules out much use of direct
>>>>>>>> queries against it.
>>>>>>>>
>>>>>>>
>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics,
>>>>>>> just ease of use instead of plainly using the APIs. Even then, Kudu
>>>>>>> should beat it easily on big scans. Same for HBase. We've done
>>>>>>> benchmarks against the latter, not the former.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> As for our company, we have been looking for an updatable data store
>>>>>>>> for a long time that can be quickly queried directly, either using
>>>>>>>> Spark SQL or Impala or some other SQL engine, and still handle TB or
>>>>>>>> PB of data without performance degradation and many configuration
>>>>>>>> headaches. For now, we are using HBase to take on this role, with
>>>>>>>> Phoenix as a fast way to directly query the data. I can see Kudu as
>>>>>>>> the best way to fill this gap easily, especially being the closest
>>>>>>>> thing to other relational databases out there in familiarity for the
>>>>>>>> many SQL analytics people in our company. The other alternative would
>>>>>>>> be to go with AWS Redshift for the same reasons, but it would come at
>>>>>>>> a cost, of course. If we went with either solution, Kudu or Redshift,
>>>>>>>> it would get rid of the need to extract from HBase to parquet tables
>>>>>>>> or export to PostgreSQL to support more of the SQL language used by
>>>>>>>> analysts or the reporting software we use.
>>>>>>>>
>>>>>>>
>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with
>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I hope this helps.
>>>>>>>>
>>>>>>>
>>>>>>> It does, thanks for the nice reply.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to
>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My
>>>>>>>> colleagues who were also there did say that the hype around Spark
>>>>>>>> isn't dying down.
>>>>>>>>
>>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase,
>>>>>>>> and Kudu cater to. I wouldn't go as far as saying that C* is just an
>>>>>>>> interim solution for the use case you describe.
>>>>>>>>
>>>>>>>> Nothing significant happened in Kudu over the past month, it's a
>>>>>>>> storage engine so things move slowly *smile*. I'd love to see more
>>>>>>>> contributions on the Spark front. I know there's code out there that
>>>>>>>> could be integrated in kudu-spark, it just needs to land in gerrit.
>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>
>>>>>>>> Do you have relevant experiences you can share? I'd love to learn
>>>>>>>> more about the use cases for which you envision using Kudu as a C*
>>>>>>>> replacement.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi J-D,
>>>>>>>>>
>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told
>>>>>>>>> me that everything was about Spark and there is a big buzz about the
>>>>>>>>> SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think
>>>>>>>>> that Cassandra is just an interim solution as a low-latency, easily
>>>>>>>>> queried data store. I was wondering if anything significant happened
>>>>>>>>> in regards to Kudu, especially on the Spark front. Plus, can you
>>>>>>>>> come up with your own proposed stack acronym to promote?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ben,
>>>>>>>>>
>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know
>>>>>>>>> of one person on the Kudu Slack who's working on a better RDD, but
>>>>>>>>> that's about it.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi J-D,
>>>>>>>>>>
>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>>>>>>>> version of Kudu to begin real testing of Spark against it for our
>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>
>>>>>>>>>> Just curious,
>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>
>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>
>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>>>>>>>>>> www.amobee.com
>>>>>>>>>>
>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>>>>>>> either.
>>>>>>>>>>
>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally we'd
>>>>>>>>>> use scans directly.
>>>>>>>>>>
>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>
>>>>>>>>>> The goal was to provide something for others to contribute to. We
>>>>>>>>>> have some basic unit tests that others can easily extend. None of
>>>>>>>>>> us on the team are Spark experts, but we'd be really happy to
>>>>>>>>>> assist anyone improving the kudu-spark code.
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuild11@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> J-D,
>>>>>>>>>>>
>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu
>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up
>>>>>>>>>>> more Spark SQL functionality (Dataframes) and doing the
>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>
>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu
>>>>>>>>>>> and compare it to HBase with Spark (not clean).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in
>>>>>>>>>>> for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>
>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on
>>>>>>>>>>> Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for
>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, will
>>>>>>>>>>>> this mean that Spark will be able to work with Kudu both
>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there more
>>>>>>>>>>>> work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
