spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com>
Subject Re: Spark Dataframe and HIVE
Date Sun, 11 Feb 2018 05:44:18 GMT
NO, can you pease explain the command ? Let me try now.

Best,

On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <deepakmca05@gmail.com>
wrote:

> I am not sure about the exact issue bjt i see you are partioning while
> writing from spark.
> Did you tried msck repair on the table before reading it in hive ?
>
> Thanks
> Deepak
>
> On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" <
> ravishankar.nair@gmail.com> wrote:
>
>> All,
>>
>> Thanks for the inputs. Again I am not successful. I think, we need to
>> resolve this, as this is a very common requirement.
>>
>> Please go through my complete code:
>>
>> STEP 1:  Started Spark shell as spark-shell --master yarn
>>
>> STEP 2: Flowing code is being given as inout to shark shell
>>
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.SparkSession
>> val warehouseLocation ="/user/hive/warehouse"
>>
>> val spark = SparkSession.builder().appName("Spark Hive
>> Example").config("spark.sql.warehouse.dir",
>> warehouseLocation).enableHiveSupport().getOrCreate()
>>
>> import org.apache.spark.sql._
>> var passion_df = spark.read.
>> format("jdbc").
>> option("url", "jdbc:mysql://localhost:3307/policies").
>> option("driver" ,"com.mysql.jdbc.Driver").
>> option("user", "root").
>> option("password", "root").
>> option("dbtable", "insurancedetails").
>> option("partitionColumn", "policyid").
>> option("lowerBound", "1").
>> option("upperBound", "100000").
>> option("numPartitions", "4").
>> load()
>> //Made sure that passion_df is created, as passion_df.show(5) shows me
>> correct data.
>> passion_df.write.saveAsTable("default.mine") //Default parquet
>>
>> STEP 3: Went to HIVE. Started HIVE prompt.
>>
>> hive> show tables;
>> OK
>> callcentervoicelogs
>> mine
>> Time taken: 0.035 seconds, Fetched: 2 row(s)
>> //As you can see HIVE is showing the table "mine" in default schema.
>>
>> STEP 4: HERE IS THE PROBLEM.
>>
>> hive> select * from mine;
>> OK
>> Time taken: 0.354 seconds
>> hive>
>> //Where is the data ???
>>
>> STEP 5:
>>
>> See the below command on HIVE
>>
>> describe formatted mine;
>> OK
>> # col_name             data_type           comment
>>
>> policyid             int
>> statecode           string
>> socialid             string
>> county               string
>> eq_site_limit       decimal(10,2)
>> hu_site_limit       decimal(10,2)
>> fl_site_limit       decimal(10,2)
>> fr_site_limit       decimal(10,2)
>> tiv_2014             decimal(10,2)
>> tiv_2015             decimal(10,2)
>> eq_site_deductible   int
>> hu_site_deductible   int
>> fl_site_deductible   int
>> fr_site_deductible   int
>> latitude             decimal(6,6)
>> longitude           decimal(6,6)
>> line                 string
>> construction         string
>> point_granularity   int
>>
>> # Detailed Table Information
>> Database:           default
>> Owner:               ravishankarnair
>> CreateTime:         Sun Feb 11 00:26:40 EST 2018
>> LastAccessTime:     UNKNOWN
>> Protect Mode:       None
>> Retention:           0
>> Location:           file:/Users/ravishankarnair/spark-warehouse/mine
>> Table Type:         MANAGED_TABLE
>> Table Parameters:
>> spark.sql.sources.provider parquet
>> spark.sql.sources.schema.numParts 1
>> spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\
>> ":[{\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":
>> true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},{\"na
>> me\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"
>> metadata\":{\"name\":\"statecode\",\"scale\":0}},{\"name\":\
>> "Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata
>> \":{\"name\":\"Socialid\",\"scale\":0}},{\"name\":\"
>> county\",\"type\":\"string\",\"nullable\":true,\"metadata\":
>> {\"name\":\"county\",\"scale\":0}},{\"name\":\"eq_site_
>> limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
>> metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},{\"
>> name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullab
>> le\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"sca
>> le\":2}},{\"name\":\"fl_site_limit\",\"type\":\"decimal(10,
>> 2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_
>> site_limit\",\"scale\":2}},{\"name\":\"fr_site_limit\",\"typ
>> e\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"
>> name\":\"fr_site_limit\",\"scale\":2}},{\"name\":\"tiv_2014\
>> ",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":
>> {\"name\":\"tiv_2014\",\"scale\":2}},{\"name\":\"tiv_
>> 2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
>> metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},{\"name\"
>> :\"eq_site_deductible\",\"type\":\"integer\",\"nullable\
>> ":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"
>> scale\":0}},{\"name\":\"hu_site_deductible\",\"type\":\"
>> integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site
>> _deductible\",\"scale\":0}},{\"name\":\"fl_site_deductible\"
>> ,\"type\":\"integer\",\"nullable\":true,\"metadata\":{
>> \"name\":\"fl_site_deductible\",\"scale\":0}},{\"name\":\"
>> fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,
>> \"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>> {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\
>> ":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},{\"
>> name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":
>> true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},{\"
>> name\":\"line\",\"type\":\"string\",\"nullable\":true,\"
>> metadata\":{\"name\":\"line\",\"scale\":0}},{\"name\":\"
>> construction\",\"type\":\"string\",\"nullable\":true,\"
>> metadata\":{\"name\":\"construction\",\"scale\":0}},{
>> \"name\":\"point_granularity\",\"type\":\"integer\",\"
>> nullable\":true,\"metadata\":{\"name\":\"point_granularity\"
>> ,\"scale\":0}}]}
>> transient_lastDdlTime 1518326800
>>
>> # Storage Information
>> SerDe Library:       org.apache.hadoop.hive.ql.io.p
>> arquet.serde.ParquetHiveSerDe
>> InputFormat:         org.apache.hadoop.hive.ql.io.p
>> arquet.MapredParquetInputFormat
>> OutputFormat:       org.apache.hadoop.hive.ql.io.p
>> arquet.MapredParquetOutputFormat
>> Compressed:         No
>> Num Buckets:         -1
>> Bucket Columns:     []
>> Sort Columns:       []
>> Storage Desc Params:
>> path                 hdfs://localhost:8020/user/hive/warehouse/mine
>> serialization.format 1
>> Time taken: 0.077 seconds, Fetched: 48 row(s)
>>
>> Now, I see your advise and support. Whats the issue? Am I doing wrong, it
>> it a bug ? I am using Spark 2.2.1, HIVE 1.2.1, HADOOP 2.7.3. All class
>> path, configuration are set properly.
>>
>> Best,
>>
>> Ravion
>>
>> On Fri, Feb 9, 2018 at 1:29 PM, Nicholas Hakobian <
>> nicholas.hakobian@rallyhealth.com> wrote:
>>
>>> Its possible that the format of your table is not compatible with your
>>> version of hive, so Spark saved it in a way such that only Spark can read
>>> it. When this happens it prints out a very visible warning letting you know
>>> this has happened.
>>>
>>> We've seen it most frequently when trying to save a parquet file with a
>>> column in date format into a Hive table. In older versions of hive, its
>>> parquet reader/writer did not support Date formats (among a couple others).
>>>
>>> Nicholas Szandor Hakobian, Ph.D.
>>> Staff Data Scientist
>>> Rally Health
>>> nicholas.hakobian@rallyhealth.com
>>>
>>>
>>> On Fri, Feb 9, 2018 at 9:59 AM, Prakash Joshi <prakashcjoshi5@gmail.com>
>>> wrote:
>>>
>>>> Ravi,
>>>>
>>>> Can you send the result of
>>>> Show create table your_table_name
>>>>
>>>> Thanks
>>>> Prakash
>>>>
>>>> On Feb 9, 2018 8:20 PM, "☼ R Nair (रविशंकर नायर)"
<
>>>> ravishankar.nair@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> It has been three days continuously I am on this issue. Not getting
>>>>> any clue.
>>>>>
>>>>> Environment: Spark 2.2.x, all configurations are correct.
>>>>> hive-site.xml is in spark's conf.
>>>>>
>>>>> 1) Step 1: I created a data frame DF1 reading a csv file.
>>>>>
>>>>> 2) Did  manipulations on DF1. Resulting frame is passion_df.
>>>>>
>>>>> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>>>>>
>>>>> 4) The metastore shows the hive table., when I do "show tables" in
>>>>> HIVE, I can see table name
>>>>>
>>>>> 5) I can't select in HIVE, though I can select from SPARK as
>>>>> spark.sql("select * from sampledb.passion")
>>>>>
>>>>> Whats going on here? Please help. Why I am not seeing data from HIVE
>>>>> prompt?
>>>>> The "describe formatted " command on the table in HIVE shows he data
>>>>> is is in default warehouse location ( /user/hive/warehouse) since I set
it.
>>>>>
>>>>> I am not getting any definite answer anywhere. Many suggestions and
>>>>> answers given in Stackoverflow et al.Nothing really works.
>>>>>
>>>>> So asking experts here for some light on this, thanks
>>>>>
>>>>> Best,
>>>>> Ravion
>>>>>
>>>>>
>>>>>
>>>
>>
>>
>> --
>>
>>


--

Mime
View raw message