kudu-user mailing list archives

From Grant Henke <ghe...@cloudera.com>
Subject Re: Check existing range partitions using the Java API
Date Wed, 06 Mar 2019 15:33:02 GMT
The work to add a public partition info API is tracked in KUDU-1872
<https://issues.apache.org/jira/browse/KUDU-1872>. I agree with Adar that
using the KuduPartitioner to detect rows in uncovered ranges is likely
the best option that exists today.
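
To make the "detect rows in uncovered ranges before writing" idea concrete,
here is a minimal plain-Java sketch. It does not use the Kudu client at all:
RangeCoverage and its methods are hypothetical names, and it assumes a single
integer range key with non-overlapping [lower, upper) ranges that the caller
already knows about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a table's range partitions; not part of the Kudu API.
final class RangeCoverage {
    // Maps each range's inclusive lower bound to its exclusive upper bound.
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    // Registers the range [lower, upper). Assumes ranges never overlap.
    void addRange(long lower, long upper) {
        if (lower >= upper) {
            throw new IllegalArgumentException("empty range");
        }
        ranges.put(lower, upper);
    }

    // True if some registered range contains the key.
    boolean isCovered(long key) {
        // The only candidate is the range with the greatest lower bound <= key.
        Map.Entry<Long, Long> candidate = ranges.floorEntry(key);
        return candidate != null && key < candidate.getValue();
    }

    // Returns the keys that fall outside every registered range, in input order.
    List<Long> uncovered(Iterable<Long> keys) {
        List<Long> missing = new ArrayList<>();
        for (long key : keys) {
            if (!isCovered(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

The caller would run uncovered(...) over the batch first, add a partition for
each missing key, and only then write, instead of discovering the gap from
failed tasks.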

I don't know of any way to handle the errors coming from the KuduContext. The
writeRows method catches all errors and throws a RuntimeException without
enough context to handle the errors themselves. It would be cool if the API
allowed the user to provide an error handler function, or at minimum threw
the exception with enough context to handle. I filed KUDU-2737
<https://issues.apache.org/jira/browse/KUDU-2737> to track supporting row
error handling in KuduContext.
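
For illustration only, the error handler idea could look roughly like the
plain-Java sketch below. Nothing here exists in kudu-spark: BatchWriter,
RowFailure, and this writeRows signature are hypothetical, and the Predicate
is just a stand-in for "the table has a partition that covers this row".

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical model of a batch writer; none of these names exist in kudu-spark.
final class BatchWriter<R> {
    // Describes one row that could not be written, and why.
    static final class RowFailure<R> {
        final R row;
        final String reason;

        RowFailure(R row, String reason) {
            this.row = row;
            this.reason = reason;
        }
    }

    // Stand-in for the server accepting a row (e.g. a covering partition exists).
    private final Predicate<R> accepts;

    BatchWriter(Predicate<R> accepts) {
        this.accepts = accepts;
    }

    // Attempts every row. A rejected row is handed to onError instead of
    // aborting the whole batch. Returns the number of rows written.
    int writeRows(List<R> rows, Consumer<RowFailure<R>> onError) {
        int written = 0;
        for (R row : rows) {
            if (accepts.test(row)) {
                written++;
            } else {
                onError.accept(new RowFailure<>(row, "Not found: non-covered range"));
            }
        }
        return written;
    }
}
```

With a callback like this, the caller could collect the rejected rows, add the
missing range partitions, and retry only those rows.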

It's worth mentioning, if you would like to contribute to Kudu and provide
patches for the functionality you need, we would be happy to review and
commit those patches.

On Wed, Mar 6, 2019 at 3:14 AM Adar Lieber-Dembo <adar@cloudera.com> wrote:

> FWIW, you can use a newer Kudu client with an older server as we take care
> to preserve backwards compatibility. The decoupling of client and server
> artifacts sort of makes sense anyway, because the server artifacts are
> found on the cluster nodes and the client artifacts are typically
> distributed along with the application.
>
> In any case, I agree that I don't see an obvious way to get at the
> underlying per-row errors if you're using the KuduContext. Maybe someone
> more familiar with the Kudu Spark bindings can chime in with suggestions.
>
> On Wed, Mar 6, 2019 at 12:57 AM Nabeelah Harris <
> nabeelah.harris@impact.com> wrote:
>
>> Hi Adar
>>
>> Thanks
>>
>> Option 1 isn't really viable, since we're running Cloudera with Kudu 1.7,
>> thus using the 1.7 client libraries. Option 2 seems to be the way to go,
>> though since I am using KuduContext, I'm not sure that there is a clean way
>> for me to check for errors row by row. Based on naively wrapping my
>> kuduContext.upsert call in a try...catch, and running an alterTable if a
>> SparkException is caught - I'm able to catch the SparkException that occurs
>> with 'java.lang.RuntimeException: failed to write 1 rows from DataFrame to
>> Kudu; sample errors: Not found: non-covered range' on the tasks, but of
>> course I still end up with a bunch of failed tasks, and the partition is
>> only added once all my tasks have failed.
>>
>> Do you perhaps have some guidance in this regard?
>>
>> On Wed, Mar 6, 2019 at 7:58 AM Adar Lieber-Dembo <adar@cloudera.com>
>> wrote:
>>
>>> Here are some other options:
>>> 1. Use the new KuduPartitioner class, available in master but not yet
>>> in any releases. Given a PartialRow (i.e. a row to be inserted), you
>>> can find its "partition index" and, more importantly for your use
>>> case, receive an exception if no partition exists for the row.
>>> 2. Insert the data anyway, and rely on per-row errors to tell you that
>>> a partition is missing. This is a more "optimistic" approach, but a
>>> somewhat expensive one at that.
>>>
>>> Would either of these work for you?
>>>
>>> On Tue, Mar 5, 2019 at 6:33 AM Nabeelah Harris
>>> <nabeelah.harris@impact.com> wrote:
>>> >
>>> > Hi there
>>> >
>>> > Currently, the only method available on KuduTable to check which
>>> > partitions already exist is 'KuduTable.getFormattedRangePartitions'.
>>> > This however looks to be experimental and only intended for use by
>>> > Impala. Other than replicating the logic used in the above-mentioned
>>> > method, is there any way I can easily retrieve the range partitions
>>> > (or partitions at all) using the Java API? My use-case at the moment
>>> > is to create range partitions based on the data I am about to insert,
>>> > and to do so I want to first check if that range partition already
>>> > exists, to prevent errors.
>>> >
>>> > Thanks
>>> > Nabeelah
>>>
>>
>>
>> --
>> Nabeelah Harris
>> nabeelah.harris@impact.com |
>> https://impact.com
>> <https://www.linkedin.com/company/impact-martech/>
>> <https://www.facebook.com/ImpactMarTech/>
>> <https://twitter.com/impactmartech>
>> <https://www.youtube.com/c/impactmartech>
>> <https://impactgrowth.com/>
>>
>

-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
