spark-user mailing list archives

From "Wollert, Fabian" <fabian.woll...@zalando.de>
Subject Re: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
Date Thu, 02 Apr 2015 10:03:06 GMT
Hey Christopher,

I'm working with Teng on this issue. Thank you for the explanation. I tried
both workarounds:

Just leaving hive.metastore.warehouse.dir empty does not change anything:
the tmp data is still written to S3, and the job still attempts the
rename (copy + delete) from S3 to S3. But since the setting never had the
desired effect anyway, we will discard it and leave it empty in the
future.
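
As a further, untested idea from our side: since the tmp files end up
under a tmp path on S3, it might be worth explicitly pointing
hive.exec.scratchdir at HDFS so the intermediate data never touches S3.
A sketch of the hive-site.xml fragment (the HDFS path below is a
placeholder, and we have not verified this helps on EMR):

```xml
<!-- Hypothetical workaround: keep Hive scratch data on HDFS instead of S3.
     hive.exec.scratchdir is a standard Hive property, but the path is a
     placeholder and the effect on EMR is unverified. -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>hdfs:///tmp/hive-scratch</value>
</property>
```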

I also tried your approach of copying the Hive jars onto the Spark
classpath. The query then failed to execute at all, with this error
message:

errorMessage:java.lang.NoSuchMethodException:
org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(com.esotericsoftware.kryo.Kryo,
java.io.InputStream, java.lang.Class)

I used a super simple query to produce this:

SELECT CONCAT('some_string', some_string_col) FROM some_table;
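
For the record, the copy itself followed your one-liner. A minimal
sketch of its mechanics against throwaway mock directories (all paths
and jar names here are placeholders, not the real EMR layout):

```shell
# Sketch of the jar-copy workaround using mock directories, so the
# command shape can be checked without an EMR node.
src=$(mktemp -d)/hive-0.13.1/lib
dst=$(mktemp -d)/spark/classpath/emr
mkdir -p "$src" "$dst"
touch "$src/hive-exec-0.13.1.jar" "$src/hive-metastore-0.13.1.jar"

# Same shape as the EMR one-liner: list every Hive jar, copy each one.
/bin/ls "$src"/*.jar | xargs -n 1 -I %% cp %% "$dst"

ls "$dst"   # lists the copied jars
```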

We suspect this comes from a Hive version mismatch: the Spark build in
your GitHub project was compiled against a Hive version that is too old,
or is otherwise out of date. We would suggest recompiling that Spark
build against the AWS Hive version, which already has the Hive
adaptations you mentioned. What do you think?
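
To narrow the mismatch down, it might also help to list which Hive
version is actually sitting on the Spark classpath (on EMR that would be
/home/hadoop/spark/classpath/emr; mocked here with a placeholder
directory and jar names):

```shell
# Hypothetical check: extract the Hive version from the jar names in the
# classpath directory. The mock directory and jars stand in for the real
# /home/hadoop/spark/classpath/emr contents.
cp_dir=$(mktemp -d)
touch "$cp_dir/hive-exec-0.13.1.jar" "$cp_dir/hive-serde-0.13.1.jar"
ls "$cp_dir" | grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+' | sort -u
```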

Cheers
Fabian

2015-04-02 10:19 GMT+02:00 Teng Qiu <tengqiu@gmail.com>:

> ---------- Forwarded message ----------
> From: Bozeman, Christopher <bozemanc@amazon.com>
> Date: 2015-04-01 22:43 GMT+02:00
> Subject: RE: Issue on Spark SQL insert or create table with Spark
> running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
> To: chutium <teng.qiu@gmail.com>, "user@spark.apache.org"
> <user@spark.apache.org>
>
>
> Teng,
>
> There is no need to alter hive.metastore.warehouse.dir. Leave it as
> is and just create external tables with their location pointing to S3.
> What I suspect you are seeing is that spark-sql writes to a temp
> directory within S3 and then issues a rename to the final location, as
> it would on HDFS. But S3 has no rename operation, so there is a
> performance hit while S3 performs a copy followed by a delete. I
> tested 1 TB from/to S3 external tables and it worked; there is just
> the additional delay for the rename (copy).
>
> EMR has modified Hive to avoid the expensive rename, and you can take
> advantage of this with Spark SQL too, by copying the EMR Hive
> jars onto the Spark classpath. Like:
> /bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp
> %% ~/spark/classpath/emr
>
> Please note that since EMR Hive is 0.13 at this time, this does break
> some other features already supported by spark-sql when using the
> built-in Hive library (for example, Avro support). So if you use this
> workaround to get better-performing writes to S3, be sure to test
> your use case.
>
> Thanks
> Christopher
>
>
> -----Original Message-----
> From: chutium [mailto:teng.qiu@gmail.com]
> Sent: Wednesday, April 01, 2015 9:34 AM
> To: user@spark.apache.org
> Subject: Issue on Spark SQL insert or create table with Spark running
> on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
>
> Hi,
>
> we always run into an issue when inserting or creating a table with
> the Amazon EMR Spark build: when inserting a result set of about 1 GB,
> the Spark SQL query never finishes.
>
> Inserting a small result set (around 500 MB) works fine.
>
> *spark.sql.shuffle.partitions* at its default of 200, or *set
> spark.sql.shuffle.partitions=1*, does not help.
>
> the log stopped at:
> */15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename
>
> s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-10000
> s3://hive-db/db_xxx/some_huge_table/*
>
> then only metrics.MetricsSaver logs.
>
> we set
> /  <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>s3://hive-db</value>
>   </property>/
> but hive.exec.scratchdir is not set; I have no idea why the tmp files
> were created in /s3://hive-db/tmp/hive-hadoop//
>
> we just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6
> (
> https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md
> ),
> and it still does not work.
>
> Has anyone hit the same issue? Any idea how to fix it?
>
> I believe Amazon EMR's Spark build uses
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3,
> rather than the original Hadoop s3n implementation, right?
>
> /home/hadoop/spark/classpath/emr/*
> and
> /home/hadoop/spark/classpath/emrfs/*
> is in classpath
>
> By the way, is there any plan to use the new Hadoop s3a implementation
> instead of s3n?
>
> Thanks for any help.
>
> Teng
>
>
>
>
> --
> View this message in context:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-on-Spark-SQL-insert-or-create-table-with-Spark-running-on-AWS-EMR-s3n-S3NativeFileSystem-renamd-tp22340.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org For
> additional commands, e-mail: user-help@spark.apache.org
>



-- 
*Fabian Wollert*
Business Intelligence



*POSTANSCHRIFT*
Zalando SE
11501 Berlin

*STANDORT*
Zalando SE
Mollstraße 1
10178 Berlin
Germany

Phone: +49 30 20968 1819
Fax:   +49 30 27594 693
E-Mail: fabian.wollert@zalando.de
Web: www.zalando.de
Jobs: jobs.zalando.de

Zalando SE, Tamara-Danz-Straße 1, 10243 Berlin
Handelsregister: Amtsgericht Charlottenburg, HRB 158855 B
Steuer-Nr. 29/560/00596 * USt-ID-Nr. DE 260543043
Vorstand: Robert Gentz, David Schneider, Rubin Ritter
Vorsitzende des Aufsichtsrates: Cristina Stenbeck
Sitz der Gesellschaft: Berlin
