spark-issues mailing list archives

From "malouke (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-12553) join is absloutly slow
Date Tue, 29 Dec 2015 09:48:49 GMT
malouke created SPARK-12553:
-------------------------------

             Summary: join is absloutly slow 
                 Key: SPARK-12553
                 URL: https://issues.apache.org/jira/browse/SPARK-12553
             Project: Spark
          Issue Type: Bug
         Environment: Cloudera CDH 5, CentOS 6
            Reporter: malouke


Hello,
I have 7 tables to join with left joins. I did this:

start = time.time()
df_test = hc.sql("select * from rapexp201412 left join CLIENT1412 on rapNUMCNT = CLINMCLI \
               left join SRN1412 on SRNSIRET = CLISIRET \
               left join bodacc2014 on SRNSIREN = bodSORCS \
               left join sinagr2014 on rapNUMCNT = sinagNUMCNT \
               left join sinfix2014 on rapNUMCNT = sinfiNUMCNT \
               left join sinimag2014 on rapNUMCNT = sinimNUMCNT \
               left join up2014 on rapNUMCNT = up2NUMCNT \
               left join upagr2014 on rapNUMCNT = upaNUMCNT \
               left join aeveh on rapNUMCNT = aevNUMCNT \
               left join premiums on rapNUMCNT = priNUMCNT")

time.time() - start
This takes: 2.289154052734375 s
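
For readability, a query with this many joins can be assembled from a list of join specifications. The following is only a sketch: `build_left_join_query` is a hypothetical helper, not part of Spark, and the table and column names are taken from the query above (only the first three joins are shown).

```python
def build_left_join_query(base_table, joins):
    """Return a 'select *' query that left-joins each (table, left_col, right_col) in order."""
    lines = ["select * from {}".format(base_table)]
    for table, left_col, right_col in joins:
        lines.append("left join {} on {} = {}".format(table, left_col, right_col))
    return "\n".join(lines)


# Join specifications copied from the report: (table, left column, right column).
joins = [
    ("CLIENT1412", "rapNUMCNT", "CLINMCLI"),
    ("SRN1412", "SRNSIRET", "CLISIRET"),
    ("bodacc2014", "SRNSIREN", "bodSORCS"),
]
query = build_left_join_query("rapexp201412", joins)
print(query)

# Assumption: hc is the HiveContext used in the report.
# df_test = hc.sql(query)
```

Keeping the join keys in one list makes a misplaced `on` condition easier to spot than in a long backslash-continued string.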


After that, I run:


df_test.save("/group/afra_churn_auto/raw/IARD_ENTREPRISE/data/dfc2_join/",source='parquet',mode='overwrite'\
      , partitionBy = "date_part")
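
Because Spark evaluates DataFrames lazily, `hc.sql` only builds the query plan; the join itself runs when an action such as this save is executed, so this is the step worth timing. A minimal sketch, assuming a hypothetical helper `timed` (not a Spark API) and a stand-in workload so that it runs without a cluster:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# Stand-in workload; any callable can be measured the same way.
total, seconds = timed(sum, range(1000000))
print("sum=%d took %.3f s" % (total, seconds))

# With Spark, the wrapper would go around the action instead, for example
# (the output path below is a placeholder, not the path from the report):
# _, seconds = timed(df_test.save, "/some/output/path",
#                    source="parquet", mode="overwrite")
```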




Here is my PySpark configuration:
sc._conf.set(u'spark.dynamicAllocation.enabled', u'false') \
 .set(u'spark.eventLog.enabled', u'true') \
 .set(u'spark.shuffle.service.enabled', u'false') \
 .set(u'spark.yarn.historyServer.address', u'http://prssncdhna02.bigplay.bigdata:18088') \
 .set(u'spark.driver.port', u'54330') \
 .set(u'spark.eventLog.dir', u'hdfs://bigplay-nameservice/user/spark/applicationHistory') \
 .set(u'spark.blockManager.port', u'54332') \
 .set(u'spark.yarn.jar', u'local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar') \
 .set(u'spark.dynamicAllocation.executorIdleTimeout', u'60') \
 .set(u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer') \
 .set(u'spark.authenticate', u'false') \
 .set(u'spark.serializer.objectStreamReset', u'100') \
 .set(u'spark.submit.deployMode', u'client') \
 .set(u'spark.executor.memory', u'4g') \
 .set(u'spark.master', u'yarn-client') \
 .set(u'spark.driver.memory', u'10g') \
 .set(u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
 .set(u'spark.dynamicAllocation.schedulerBacklogTimeout', u'1') \
 .set(u'spark.executor.instances', u'8') \
 .set(u'spark.shuffle.service.port', u'7337') \
 .set(u'spark.fileserver.port', u'54331') \
 .set(u'spark.app.name', u'PySparkShell') \
 .set(u'spark.yarn.config.gatewayPath', u'/opt/cloudera/parcels') \
 .set(u'spark.rdd.compress', u'True') \
 .set(u'spark.yarn.config.replacementPath', u'{{HADOOP_COMMON_HOME}}/../../..') \
 .set(u'spark.yarn.isPython', u'true') \
 .set(u'spark.dynamicAllocation.minExecutors', u'0') \
 .set(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
 .set(u'spark.ui.proxyBase', u'/proxy/application_1450819756020_0615') \
 .set(u'spark.yarn.am.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
 .set(u'hadoop.major.version', u'yarn') \
 .set(u'spark.version', u'1.5.2')
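
Settings like these can also be passed on the command line when the shell is launched. The sketch below turns a dictionary of properties into `--conf key=value` arguments for `pyspark` or `spark-submit`; `to_conf_args` is a hypothetical helper, and only a few of the properties listed above are shown.

```python
def to_conf_args(settings):
    """Flatten {property: value} into ['--conf', 'property=value', ...], sorted by property name."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", "{}={}".format(key, settings[key])])
    return args


# A subset of the properties from the configuration above.
settings = {
    "spark.executor.memory": "4g",
    "spark.executor.instances": "8",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}
conf_args = to_conf_args(settings)
print(" ".join(["pyspark"] + conf_args))
```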

I do not understand why the join does not work.
Thank you in advance.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
