kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'
Date Thu, 16 Feb 2017 18:17:15 GMT
On Tue, Feb 14, 2017 at 11:51 PM, Frank Heimerzheim <fh.ordix@gmail.com>
wrote:

> Hello Todd,
>
> I was so naive as to assume a decent automatic type assignment. With
> explicit type assignment via a schema, everything works fine.
>
> From a Pythonic viewpoint, having to specify data types is not what I
> want to do all day. But this is a philosophical discussion, and I just
> got caught in this way of thinking.
>
> I expected that the attempt to store 42 in an int8 would work and the
> attempt to store 4242 would raise an error. But the driver is not
> "intelligent" enough to test every individual value; it checks once
> for a data type. Not Pythonic, but now that I'm really aware of the
> topic: no further problem.
>

Yea, I see that the "static typing" we are doing isn't very pythonic.

Our thinking here (which really comes from the thinking on the C++ side) is
that, given the underlying Kudu data has static column typing, we didn't
want to have a scenario where someone uses a too-small column, and then
tests on data that happens to fit in range. Then, they get a surprise one
day when all of their inserts start failing with "data out of range for
int32" errors or whatever. Forcing people to evaluate the column sizes up
front avoids nasty surprises later.
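To make the failure mode concrete, here's a sketch in plain Python (a hypothetical per-value check, not actual Kudu client code) of why checking each value individually gives a false sense of safety:

```python
import struct

# Hypothetical per-value check: 'b' is a signed 8-bit integer, the same
# range as a Kudu int8 column (-128..127).
def fits_int8(value):
    try:
        struct.pack('b', value)
        return True
    except struct.error:
        return False

print(fits_int8(42))    # True  -- test data happens to fit in range
print(fits_int8(4242))  # False -- a later insert would suddenly fail
```

A per-value check like this passes happily during testing and only blows up once real data exceeds the range, which is exactly the surprise we wanted to rule out up front.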

But, maybe you can see my biases towards static-typed languages leaking
through here ;-)
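For what it's worth, the wide inference itself is understandable: a Python int is arbitrary precision and carries no width information, so a connector picking a fixed-width type without a schema hint has to guess wide to be safe. A quick plain-Python illustration of the storage cost (a sketch, not connector code):

```python
import struct

# Guessing int64 instead of int8 costs 8x the storage per value:
print(struct.calcsize('b'))  # signed 8-bit  -> 1 byte  (Kudu int8)
print(struct.calcsize('q'))  # signed 64-bit -> 8 bytes (Kudu int64)
```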

-Todd


> 2017-02-14 19:44 GMT+01:00 Todd Lipcon <todd@cloudera.com>:
>
>> Hi Frank,
>>
>> Could you try something like:
>>
>> from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType
>>
>> data = [(42, 2017, 'John')]
>> schema = StructType([
>>     StructField("id", LongType(), True),      # int64 in the Kudu schema
>>     StructField("year", IntegerType(), True), # int32 in the Kudu schema
>>     StructField("name", StringType(), True)])
>> df = sqlContext.createDataFrame(data, schema)
>>
>> That should explicitly set the types (based on my reading of the pyspark
>> docs for createDataFrame)
>>
>> -Todd
>>
>>
>> On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh.ordix@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> here is a snippet that reproduces the error.
>>>
>>> Call from the shell:
>>> spark-submit --jars /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar test.py
>>>
>>>
>>> Snippet from the python-code test.py:
>>>
>>> (..)
>>> builder = kudu.schema_builder()
>>> builder.add_column('id', kudu.int64, nullable=False)
>>> builder.add_column('year', kudu.int32)
>>> builder.add_column('name', kudu.string)
>>> (..)
>>>
>>> (..)
>>> data = [(42, 2017, 'John')]
>>> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
>>> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>>>                                              .option('kudu.table', kudu_table)\
>>>                                              .mode('append')\
>>>                                              .save()
>>> (..)
>>>
>>> Error:
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
>>> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
>>> 	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
>>> 	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
>>> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
>>> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> 	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
>>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>>> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 	at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for me. The
>>> problem persists whether the attribute is part of the key or not.
>>>
>>> Greetings
>>>
>>> Frank
>>>
>>>
>>> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <todd@cloudera.com>:
>>>
>>>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh.ordix@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> For quite a while I've worked successfully with
>>>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>>>
>>>>> For a while I ignored a problem with the Kudu data type int8. With
>>>>> the connector I can't write int8, as int in Python always brings up
>>>>> errors like
>>>>>
>>>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8,
>>>>> Type: unixtime_micros, size: 8], it's int8"
>>>>>
>>>>> As Python isn't statically typed, the connector has to find a
>>>>> suitable type for a Python int in Java/Kudu. Apparently a Python int
>>>>> is matched to int64/unixtime_micros and not the int8 that Kudu
>>>>> expects at this place.
>>>>>
>>>>> As a quick workaround, all my ints in Kudu are int64 at the moment.
>>>>>
>>>>> In the long run I can't accept this waste of disk space or, even
>>>>> worse, I/O. Any idea when I'll be able to store int8 from
>>>>> Python/Spark into Kudu?
>>>>>
>>>>> With the "normal" Python API everything works fine; only the
>>>>> Spark/Kudu/Python connector shows the problem.
>>>>>
>>>>
>>>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>>>> bit of sample code that reproduces the issue?
>>>>
>>>> -Todd
>>>>
>>>>
>>>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh.ordix@gmail.com>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> within the impala-shell I can create an external table and
>>>>>> thereafter select and insert data from an underlying Kudu table.
>>>>>> Within the statement for creating the table, a 'StorageHandler' is
>>>>>> set to 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything
>>>>>> works fine, so apparently there exists a *.jar containing the
>>>>>> referenced class.
>>>>>>
>>>>>> When trying to select from a hive-shell, there is an error saying
>>>>>> the handler is not available. Trying to 'rdd.collect()' from a
>>>>>> hiveCtx within a sparkSession, I also get a
>>>>>> JavaClassNotFoundException, as the KuduStorageHandler is not
>>>>>> available.
>>>>>>
>>>>>> I then tried to find a jar in my system with the intention of
>>>>>> copying it to all my data nodes. Sadly, I couldn't find the
>>>>>> specific jar. I think it exists in the system, as Impala apparently
>>>>>> is using it. For a test I changed the 'StorageHandler' in the
>>>>>> creation statement to
>>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create
>>>>>> statement worked, as did the select from Impala, but it didn't
>>>>>> return any data. There was no error, although I expected one. The
>>>>>> test was just for the case that Impala would in some magic way
>>>>>> select data from Kudu without a correct 'StorageHandler'.
>>>>>> Apparently this is not the case, and Impala does have access to a
>>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>>>
>>>>>> Long story, short question:
>>>>>> In which *.jar can I find the
>>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>>>> Is copying the jar by hand to all nodes an appropriate way to put
>>>>>> Spark in a position to work with Kudu?
>>>>>> What about the beeline-shell from Hive and the possibility of
>>>>>> reading from Kudu?
>>>>>>
>>>>>> My environment: Cloudera 5.7 with kudu and impala-kudu installed
>>>>>> from parcels. Built a working python-kudu library successfully from
>>>>>> scratch (git).
>>>>>>
>>>>>> Thanks a lot!
>>>>>> Frank
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
