spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com>
Subject Re: Spark Dataframe and HIVE
Date Sun, 11 Feb 2018 05:36:40 GMT
All,

Thanks for the inputs. Again I am not successful. I think, we need to
resolve this, as this is a very common requirement.

Please go through my complete code:

STEP 1:  Started Spark shell as spark-shell --master yarn

STEP 2: Flowing code is being given as inout to shark shell

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val warehouseLocation ="/user/hive/warehouse"

val spark = SparkSession.builder().appName("Spark Hive
Example").config("spark.sql.warehouse.dir",
warehouseLocation).enableHiveSupport().getOrCreate()

import org.apache.spark.sql._
var passion_df = spark.read.
format("jdbc").
option("url", "jdbc:mysql://localhost:3307/policies").
option("driver" ,"com.mysql.jdbc.Driver").
option("user", "root").
option("password", "root").
option("dbtable", "insurancedetails").
option("partitionColumn", "policyid").
option("lowerBound", "1").
option("upperBound", "100000").
option("numPartitions", "4").
load()
//Made sure that passion_df is created, as passion_df.show(5) shows me
correct data.
passion_df.write.saveAsTable("default.mine") //Default parquet

STEP 3: Went to HIVE. Started HIVE prompt.

hive> show tables;
OK
callcentervoicelogs
mine
Time taken: 0.035 seconds, Fetched: 2 row(s)
//As you can see HIVE is showing the table "mine" in default schema.

STEP 4: HERE IS THE PROBLEM.

hive> select * from mine;
OK
Time taken: 0.354 seconds
hive>
//Where is the data ???

STEP 5:

See the below command on HIVE

describe formatted mine;
OK
# col_name             data_type           comment

policyid             int
statecode           string
socialid             string
county               string
eq_site_limit       decimal(10,2)
hu_site_limit       decimal(10,2)
fl_site_limit       decimal(10,2)
fr_site_limit       decimal(10,2)
tiv_2014             decimal(10,2)
tiv_2015             decimal(10,2)
eq_site_deductible   int
hu_site_deductible   int
fl_site_deductible   int
fr_site_deductible   int
latitude             decimal(6,6)
longitude           decimal(6,6)
line                 string
construction         string
point_granularity   int

# Detailed Table Information
Database:           default
Owner:               ravishankarnair
CreateTime:         Sun Feb 11 00:26:40 EST 2018
LastAccessTime:     UNKNOWN
Protect Mode:       None
Retention:           0
Location:           file:/Users/ravishankarnair/spark-warehouse/mine
Table Type:         MANAGED_TABLE
Table Parameters:
spark.sql.sources.provider parquet
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0
{\"type\":\"struct\",\"fields\":[{\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},{\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},{\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},{\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},{\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},{\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},{\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},{\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},{\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},{\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},{\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},{\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},{\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},{\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},{\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},{\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},{\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},{\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},{\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}
transient_lastDdlTime 1518326800

# Storage Information
SerDe Library:
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:         No
Num Buckets:         -1
Bucket Columns:     []
Sort Columns:       []
Storage Desc Params:
path                 hdfs://localhost:8020/user/hive/warehouse/mine
serialization.format 1
Time taken: 0.077 seconds, Fetched: 48 row(s)

Now, I see your advise and support. Whats the issue? Am I doing wrong, it
it a bug ? I am using Spark 2.2.1, HIVE 1.2.1, HADOOP 2.7.3. All class
path, configuration are set properly.

Best,

Ravion

On Fri, Feb 9, 2018 at 1:29 PM, Nicholas Hakobian <
nicholas.hakobian@rallyhealth.com> wrote:

> Its possible that the format of your table is not compatible with your
> version of hive, so Spark saved it in a way such that only Spark can read
> it. When this happens it prints out a very visible warning letting you know
> this has happened.
>
> We've seen it most frequently when trying to save a parquet file with a
> column in date format into a Hive table. In older versions of hive, its
> parquet reader/writer did not support Date formats (among a couple others).
>
> Nicholas Szandor Hakobian, Ph.D.
> Staff Data Scientist
> Rally Health
> nicholas.hakobian@rallyhealth.com
>
>
> On Fri, Feb 9, 2018 at 9:59 AM, Prakash Joshi <prakashcjoshi5@gmail.com>
> wrote:
>
>> Ravi,
>>
>> Can you send the result of
>> Show create table your_table_name
>>
>> Thanks
>> Prakash
>>
>> On Feb 9, 2018 8:20 PM, "☼ R Nair (रविशंकर नायर)" <
>> ravishankar.nair@gmail.com> wrote:
>>
>>> All,
>>>
>>> It has been three days continuously I am on this issue. Not getting any
>>> clue.
>>>
>>> Environment: Spark 2.2.x, all configurations are correct. hive-site.xml
>>> is in spark's conf.
>>>
>>> 1) Step 1: I created a data frame DF1 reading a csv file.
>>>
>>> 2) Did  manipulations on DF1. Resulting frame is passion_df.
>>>
>>> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>>>
>>> 4) The metastore shows the hive table., when I do "show tables" in HIVE,
>>> I can see table name
>>>
>>> 5) I can't select in HIVE, though I can select from SPARK as
>>> spark.sql("select * from sampledb.passion")
>>>
>>> Whats going on here? Please help. Why I am not seeing data from HIVE
>>> prompt?
>>> The "describe formatted " command on the table in HIVE shows he data is
>>> is in default warehouse location ( /user/hive/warehouse) since I set it.
>>>
>>> I am not getting any definite answer anywhere. Many suggestions and
>>> answers given in Stackoverflow et al.Nothing really works.
>>>
>>> So asking experts here for some light on this, thanks
>>>
>>> Best,
>>> Ravion
>>>
>>>
>>>
>


--

Mime
View raw message