spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: HiveContext saveAsTable create wrong partition
Date Wed, 17 Jun 2015 22:25:58 GMT
Thanks for reporting this. Would you mind helping to create a JIRA for this?

On 6/16/15 2:25 AM, patcharee wrote:
> I found that if I move the partitioned columns in schemaString and in the 
> Row to the end of the sequence, then it works correctly...
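>
> A minimal sketch of that reordering, assuming the same column names, sample 
> values, and ORC/partitionBy write as in the code quoted below:
>
>   // Sketch of the workaround: list the partition columns (zone, z, year,
>   // month) last in both schemaString and each Row, in the same order as
>   // in partitionBy(...).
>   import org.apache.spark.sql.Row
>   import org.apache.spark.sql.types.{StructType, StructField, IntegerType, FloatType}
>
>   val intCols = Set("zone", "z", "year", "month", "date", "hh", "x", "y")
>   val schemaString = "date hh x y height u v w ph phb t p pb qvapor " +
>     "qgraup qnice qnrain tke_pbl el_pbl zone z year month"
>   val schema = StructType(schemaString.split(" ").map(name =>
>     StructField(name, if (intCols.contains(name)) IntegerType else FloatType, true)))
>
>   // Same sample values as below, with zone=2, z=42, year=2009, month=3
>   // moved to the end so they line up with the reordered schema.
>   val row = Row(1, 0, 218, 365,
>     9989.497f, 29.627113f, 19.071793f, 0.11982734f, 3174.6812f,
>     97735.2f, 16.389032f, -96.62891f, 25135.365f, 2.6476808E-5f,
>     0.0f, 13195.351f, 0.0f, 0.1f, 0.0f,
>     2, 42, 2009, 3)
>
>   val df = sqlContext.createDataFrame(sc.parallelize(Seq(row)), schema)
>   df.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>     .mode(org.apache.spark.sql.SaveMode.Append)
>     .partitionBy("zone", "z", "year", "month")
>     .saveAsTable("test4DimBySpark")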
>
> On 16 June 2015 11:14, patcharee wrote:
>> Hi,
>>
>> I am using Spark 1.4 and HiveContext to append data into a 
>> partitioned Hive table. I found that the data inserted into the table 
>> is correct, but the partition (folder) created is totally wrong. 
>> Below is my code snippet:
>>
>> ---------------------------------------------------------------------------
>>
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.types.{StructType, StructField, IntegerType, FloatType}
>>
>> val schemaString = "zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
>> val schema =
>>   StructType(
>>     schemaString.split(" ").map(fieldName =>
>>       if (fieldName.equals("zone") || fieldName.equals("z") ||
>>           fieldName.equals("year") || fieldName.equals("month") ||
>>           fieldName.equals("date") || fieldName.equals("hh") ||
>>           fieldName.equals("x") || fieldName.equals("y"))
>>         StructField(fieldName, IntegerType, true)
>>       else
>>         StructField(fieldName, FloatType, true)
>>     ))
>>
>> val pairVarRDD =
>>   sc.parallelize(Seq(
>>     Row(2, 42, 2009, 3, 1, 0, 218, 365,
>>         9989.497f, 29.627113f, 19.071793f, 0.11982734f, 3174.6812f,
>>         97735.2f, 16.389032f, -96.62891f, 25135.365f, 2.6476808E-5f,
>>         0.0f, 13195.351f, 0.0f, 0.1f, 0.0f)
>>   ))
>>
>> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
>>
>> partitionedTestDF2.write
>>   .format("org.apache.spark.sql.hive.orc.DefaultSource")
>>   .mode(org.apache.spark.sql.SaveMode.Append)
>>   .partitionBy("zone", "z", "year", "month")
>>   .saveAsTable("test4DimBySpark")
>>
>> ---------------------------------------------------------------------------
>>
>> The table contains 23 columns (more than the 22-element Tuple maximum), so 
>> I use a Row object to store the raw data, not a Tuple.
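>>
>> (Just as a sketch, assuming the same sample values as above: such a wide 
>> row can also be built with Row.fromSeq over a Seq of values.)
>>
>>   // Sketch: build the 23-column row from a Seq instead of passing
>>   // every value as a separate Row(...) argument.
>>   import org.apache.spark.sql.Row
>>   val values: Seq[Any] =
>>     Seq(2, 42, 2009, 3, 1, 0, 218, 365) ++
>>     Seq(9989.497f, 29.627113f, 19.071793f, 0.11982734f, 3174.6812f,
>>         97735.2f, 16.389032f, -96.62891f, 25135.365f, 2.6476808E-5f,
>>         0.0f, 13195.351f, 0.0f, 0.1f, 0.0f)
>>   val wideRow = Row.fromSeq(values)
>>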
>> Here are some messages from Spark when it saved the data:
>>
>> 15/06/16 10:39:22 INFO metadata.Hive: Renaming src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0/part-00001; dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001; Status:true
>>
>> 15/06/16 10:39:22 INFO metadata.Hive: New loading path = hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0 with partSpec {zone=13195, z=0, year=0, month=0}
>>
>> From the raw data (pairVarRDD), zone = 2, z = 42, year = 2009, and month = 3, 
>> but Spark created the partition {zone=13195, z=0, year=0, month=0}.
>>
>> When I queried from Hive:
>>
>> hive> select * from test4dimBySpark;
>> OK
>> 2    42    2009    3    1.0    0.0    218.0    365.0    9989.497    29.627113    19.071793    0.11982734    -3174.6812    97735.2    16.389032    -96.62891    25135.365    2.6476808E-5    0.0    13195    0    0    0
>> hive> select zone, z, year, month from test4dimBySpark;
>> OK
>> 13195    0    0    0
>> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
>> Found 2 items
>> -rw-r--r--   3 patcharee hdfs       1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001
>>
>> The data stored in the table is correct (zone = 2, z = 42, year = 2009, 
>> month = 3), but the partition created was wrong: 
>> "zone=13195/z=0/year=0/month=0".
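>>
>> As a quick check from the Spark side (a sketch, assuming the table name 
>> above), the partition columns can be read back; since they are resolved 
>> from the partition directories, they should show the same values as the 
>> Hive query above:
>>
>>   // Sketch: the partition columns come from the partition directories,
>>   // so this should print 13195/0/0/0 rather than 2/42/2009/3.
>>   sqlContext.table("test4dimBySpark")
>>     .select("zone", "z", "year", "month")
>>     .show()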
>>
>> Is this a bug or what could be wrong? Any suggestion is appreciated.
>>
>> BR,
>> Patcharee


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

