kudu-user mailing list archives

From Dan Burkert <...@cloudera.com>
Subject Re: Spark on Kudu
Date Tue, 14 Jun 2016 23:07:07 GMT
Looks like we're missing an import statement in that example.  Could you
try:

import org.kududb.client._

and try again?

- Dan
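For anyone hitting the same error, a rough sketch of the full spark-shell sequence with the import in place might look like this. Illustrative only: it assumes the 0.9.0 kudu-spark jar is on the classpath, the master address and table name are placeholders, and the exact KuduContext constructor may differ between versions.

```scala
// Sketch only: assumes kudu-spark 0.9.0 on the spark-shell classpath.
// "kudu-master.example.com" and "my_table" are placeholder names.
import org.kududb.client._            // brings CreateTableOptions into scope
import org.kududb.spark.kudu._

val kuduContext = new KuduContext("kudu-master.example.com:7051")

// df is an existing DataFrame whose schema includes a "key" column
kuduContext.createTable("my_table", df.schema, Seq("key"),
  new CreateTableOptions().setNumReplicas(1))
```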

On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> I encountered an error trying to create a table from a DataFrame, following
> the documentation.
>
> <console>:49: error: not found: type CreateTableOptions
>               kuduContext.createTable(tableName, df.schema, Seq("key"),
> new CreateTableOptions().setNumReplicas(1))
>
> Is there something I’m missing?
>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
> It's only in Cloudera's maven repo:
> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>
> J-D
>
> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>
>> Hi J-D,
>>
>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
>> spark-shell to use. Can you show me where to find it?
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>> wrote:
>>
>> What's in this doc is what's gonna get released:
>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>
>> J-D
>>
>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>
>>> Will this be documented with examples once 0.9.0 comes out?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>> wrote:
>>>
>>> It will be in 0.9.0.
>>>
>>> J-D
>>>
>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On May 18, 2016, at 9:01 AM, Chris George <Christopher.George@rms.com>
>>>> wrote:
>>>>
>>>> There is some code in review that needs some more refinement.
>>>> It will allow upsert/insert from a dataframe using the datasource API.
>>>> It will also allow the creation and deletion of tables from a dataframe:
>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>
>>>> Example usages will look something like:
>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>
>>>> -Chris George
>>>>
>>>>
>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>
>>>> Can someone tell me what the state is of this Spark work?
>>>>
>>>> Also, does anyone have any sample code on how to update/insert data in
>>>> Kudu using DataFrames?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Apr 13, 2016, at 8:22 AM, Chris George <Christopher.George@rms.com>
>>>> wrote:
>>>>
>>>> SparkSQL cannot support these types of statements, but we may be able to
>>>> implement similar functionality through the API.
>>>> -Chris
>>>>
>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>
>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if
>>>> it were to be implemented.
>>>>
>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>  WHEN MATCHED THEN
>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>  WHEN NOT MATCHED THEN
>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>>
>>>> Cheers,
>>>> Ben
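For illustration, the matched/not-matched branches of a MERGE like the one above can be sketched against an in-memory key-value table (nothing Kudu-specific here; all names are made up):

```scala
// Illustrative upsert with MERGE-like semantics over an in-memory "table".
// WHEN MATCHED -> update the existing row's columns; WHEN NOT MATCHED -> insert.
object MergeSketch {
  type Row = Map[String, Any]

  def merge(target: collection.mutable.Map[Int, Row], source: Seq[(Int, Row)]): Unit =
    for ((key, row) <- source) {
      target.get(key) match {
        case Some(existing) => target(key) = existing ++ row // matched: update columns
        case None           => target(key) = row             // not matched: insert
      }
    }
}
```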
>>>>
>>>> On Apr 11, 2016, at 12:21 PM, Chris George <Christopher.George@rms.com>
>>>> wrote:
>>>>
>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>>>> gerrit if you want to take a look.
>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>> It does pushdown predicates which the existing input formatter based
>>>> rdd does not.
>>>>
>>>> Within the next two weeks I’m planning to implement a datasource for
>>>> Spark that will have pushdown predicates and insertion/update functionality
>>>> (need to look more at the Cassandra and HBase datasources for the best way
>>>> to do this). I agree that server-side upsert would be helpful.
>>>> Having a datasource would give us useful data frames and also make
>>>> Spark SQL usable for Kudu.
>>>>
>>>> My reasoning for having a Spark datasource and not using Impala is: 1.
>>>> We have had trouble getting Impala to run fast with high concurrency when
>>>> compared to Spark. 2. We interact with datasources which do not integrate
>>>> with Impala. 3. We have custom SQL query planners for extended SQL
>>>> functionality.
>>>>
>>>> -Chris George
>>>>
>>>>
>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcryans@apache.org> wrote:
>>>>
>>>> You guys make a convincing point, although on the upsert side we'll
>>>> need more support from the servers. Right now all you can do is an INSERT
>>>> then, if you get a dup key, do an UPDATE. I guess we could at least add an
>>>> API on the client side that would manage it, but it wouldn't be atomic.
>>>>
>>>> J-D
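A client-side wrapper for that insert-then-update fallback could look roughly like this sketch (made-up exception and function names, not the actual Kudu client API; the gap between the failed insert and the update is exactly what makes it non-atomic):

```scala
// Client-side upsert fallback: try the insert; if it fails with a
// duplicate-key error, fall back to an update. Not atomic: another
// writer can slip in between the two operations.
class DuplicateKeyException(msg: String) extends Exception(msg)

def upsert(insert: () => Unit, update: () => Unit): Unit =
  try insert()
  catch { case _: DuplicateKeyException => update() }
```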
>>>>
>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <mark@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> It's pretty simple, actually.  I need to support versioned datasets in
>>>>> a Spark SQL environment.  Instead of a hack on top of a Parquet data
>>>>> store, I'm hoping (among other reasons) to be able to use Kudu's write and
>>>>> timestamp-based read operations to support not only appending data, but
>>>>> also updating existing data, and even some schema migration.  The most
>>>>> typical use case is a dataset that is updated periodically (e.g., weekly
>>>>> or monthly) in which the preliminary data in the previous window (week or
>>>>> month) is updated with values that are expected to remain unchanged from
>>>>> then on, and a new set of preliminary values for the current window need
>>>>> to be added/appended.
>>>>>
>>>>> Using Kudu's Java API and developing additional functionality on top
>>>>> of what Kudu has to offer isn't too much to ask, but the ease of
>>>>> integration with Spark SQL will gate how quickly we would move to using
>>>>> Kudu and how seriously we'd look at alternatives before making that
>>>>> decision.
>>>>>
>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> Thanks for taking some time to reply in this thread, glad it caught
>>>>>> the attention of other folks!
>>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <
>>>>>> mark@clearstorydata.com> wrote:
>>>>>>
>>>>>>> Do they care being able to insert into Kudu with SparkSQL
>>>>>>>
>>>>>>>
>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently
>>>>>>> delaying a refactoring of some Spark SQL-oriented insert functionality
>>>>>>> while trying to evaluate what to expect from Kudu.  Whether Kudu does a
>>>>>>> good job supporting inserts with Spark SQL will be a key consideration
>>>>>>> as to whether we adopt Kudu.
>>>>>>>
>>>>>>
>>>>>> I'd like to know more about why SparkSQL inserts are necessary for
>>>>>> you. Is it just that you currently do it that way into some database or
>>>>>> Parquet, so with minimal refactoring you'd be able to use Kudu? Would
>>>>>> re-writing those SQL lines into Scala and directly using the Java API's
>>>>>> KuduSession be too much work?
>>>>>>
>>>>>> Additionally, what do you expect to gain from using Kudu vs. your
>>>>>> current solution? If it's not completely clear, I'd love to help you
>>>>>> think through it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>
>>>>>>>> What are your DS folks looking for in terms of functionality
>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured as
>>>>>>>> Impala's? Do they care about being able to insert into Kudu with
>>>>>>>> SparkSQL, or just being able to query real fast? Anything more
>>>>>>>> specific to Spark that I'm missing?
>>>>>>>>
>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>>> Cloudera all our resources are committed to making things happen in
>>>>>>>> time, and a more fully featured Spark integration isn't in our plans
>>>>>>>> during that period. I'm really hoping someone in the community will
>>>>>>>> help with Spark, the same way we got a big contribution for the Flume
>>>>>>>> sink.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuild11@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions.
>>>>>>>>> But, since it’s not “production-ready”, upper management doesn’t want
>>>>>>>>> to fully deploy it yet. They just want to keep an eye on it though.
>>>>>>>>> Kudu was so much simpler and easier to use in every aspect compared
>>>>>>>>> to HBase. Impala was great for the report writers and analysts to
>>>>>>>>> experiment with for the short time it was up. But, once again, the
>>>>>>>>> only blocker was the lack of Spark support for our Data
>>>>>>>>> Developers/Scientists. So, production-level data population won’t
>>>>>>>>> happen until then.
>>>>>>>>>
>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuild11@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> J-D,
>>>>>>>>>>
>>>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>>>> updatable hot data store to ensure that duplicates are taken care of
>>>>>>>>>> and idempotency is maintained. Whether data was directly retrieved
>>>>>>>>>> from Cassandra for analytics, reports, or searches, it was not clear
>>>>>>>>>> what its main use was. Some also just used it as a staging area to
>>>>>>>>>> populate downstream tables in Parquet format. The last thing I heard
>>>>>>>>>> was that CQL was terrible, so that rules out much use of direct
>>>>>>>>>> queries against it.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. Even
>>>>>>>>> then, Kudu should beat it easily on big scans. Same for HBase. We've
>>>>>>>>> done benchmarks against the latter, not the former.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As for our company, we have been looking for an updatable data
>>>>>>>>>> store for a long time that can be quickly queried directly, either
>>>>>>>>>> using Spark SQL or Impala or some other SQL engine, and still handle
>>>>>>>>>> TB or PB of data without performance degradation and many
>>>>>>>>>> configuration headaches. For now, we are using HBase to take on this
>>>>>>>>>> role, with Phoenix as a fast way to directly query the data. I can
>>>>>>>>>> see Kudu as the best way to fill this gap easily, especially being
>>>>>>>>>> the closest thing to other relational databases out there in
>>>>>>>>>> familiarity for the many SQL analytics people in our company. The
>>>>>>>>>> other alternative would be to go with AWS Redshift for the same
>>>>>>>>>> reasons, but it would come at a cost, of course. If we went with
>>>>>>>>>> either solution, Kudu or Redshift, it would get rid of the need to
>>>>>>>>>> extract from HBase to Parquet tables or export to PostgreSQL to
>>>>>>>>>> support more of the SQL language used by the analysts or the
>>>>>>>>>> reporting software we use.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with
>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope this helps.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <
>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to
>>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My
>>>>>>>>>> colleagues who were also there did say that the hype around Spark
>>>>>>>>>> isn't dying down.
>>>>>>>>>>
>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra,
>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* is
>>>>>>>>>> just an interim solution for the use case you describe.
>>>>>>>>>>
>>>>>>>>>> Nothing significant happened in Kudu over the past month; it's a
>>>>>>>>>> storage engine, so things move slowly *smile*. I'd love to see more
>>>>>>>>>> contributions on the Spark front. I know there's code out there that
>>>>>>>>>> could be integrated in kudu-spark, it just needs to land in gerrit.
>>>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>>>
>>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn
>>>>>>>>>> more about the use cases for which you envision using Kudu as a C*
>>>>>>>>>> replacement.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuild11@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>
>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They
>>>>>>>>>>> told me that everything was about Spark and there is a big buzz
>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I
>>>>>>>>>>> still think that Cassandra is just an interim solution as a
>>>>>>>>>>> low-latency, easily queried data store. I was wondering if anything
>>>>>>>>>>> significant happened in regards to Kudu, especially on the Spark
>>>>>>>>>>> front. Plus, can you come up with your own proposed stack acronym
>>>>>>>>>>> to promote?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>
>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I
>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better RDD,
>>>>>>>>>>> but that's about it.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <bkim@amobee.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target
>>>>>>>>>>>> a version of Kudu to begin real testing of Spark against it for
>>>>>>>>>>>> our devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious,
>>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>>
>>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>>
>>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>>>>>>>>>>>> www.amobee.com
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's
>>>>>>>>>>>> needed either.
>>>>>>>>>>>>
>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally
>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>
>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>
>>>>>>>>>>>> The goal was to provide something for others to contribute to.
>>>>>>>>>>>> We have some basic unit tests that others can easily extend. None
>>>>>>>>>>>> of us on the team are Spark experts, but we'd be really happy to
>>>>>>>>>>>> assist anyone improving the kudu-spark code.
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <
>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu
>>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up
>>>>>>>>>>>>> more Spark SQL functionality (DataFrames) and doing the
>>>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with
>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this
>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on
>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it
>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for
>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, will
>>>>>>>>>>>>>> this mean that Spark will be able to work with Kudu both
>>>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there more
>>>>>>>>>>>>>> work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
