spark-issues mailing list archives

From "Kushal Mahajan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-28575) Time lag between two consecutive spark actions using Spark 2.3.1
Date Mon, 05 Aug 2019 08:22:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-28575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kushal Mahajan updated SPARK-28575:
-----------------------------------
    Description: 
Steps to reproduce:
 # Read a directory (consisting of txt files) using SparkContext's wholeTextFiles method.
 # Perform a transformation on the resulting pair RDD.
 # Perform an action (foreach) on each entry corresponding to each txt file.
 # A time lag can be seen between these actions in the Spark UI.

The action itself does not take much time. There is a time lag between the start times of
consecutive actions (excluding the time taken by the job itself). Kindly refer to the attachments.

PS: This time lag is not seen when running the job in Spark 2.1.1.
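
The steps above could be sketched in PySpark roughly as follows. This is only an illustration under assumptions, not the reporter's actual job: the report does not include code, so the `mapValues(len)` transformation, the one-job-per-file loop, the app name, and the helper names `run_repro` and `idle_gaps` are all hypothetical. `idle_gaps` computes the lag described here (the next job's start minus the previous job's start plus its duration) from values read off the "Submitted" and "Duration" columns of the Spark UI jobs page.

```python
from typing import List, Tuple


def idle_gaps(jobs: List[Tuple[float, float]]) -> List[float]:
    """Return the idle time, in seconds, between consecutive jobs.

    jobs holds (submitted_s, duration_s) pairs, e.g. copied by hand from
    the Spark UI. The gap excludes the time taken by the job itself.
    """
    ordered = sorted(jobs)  # order by submission time
    gaps = []
    for (start, duration), (next_start, _) in zip(ordered, ordered[1:]):
        gaps.append(next_start - (start + duration))
    return gaps


def run_repro(input_dir: str) -> None:
    """Hypothetical reproduction of the reported steps (needs PySpark)."""
    from pyspark import SparkContext  # imported lazily on purpose

    sc = SparkContext(appName="SPARK-28575-repro")  # assumed app name
    try:
        # Step 1: one (path, content) pair per text file in the directory.
        files = sc.wholeTextFiles(input_dir)
        # Step 2: placeholder transformation; the report does not say which.
        lengths = files.mapValues(len).cache()
        # Step 3: one foreach action per file, so that each file appears
        # as a separate job in the Spark UI.
        for path in lengths.keys().collect():
            per_file = lengths.filter(lambda kv, p=path: kv[0] == p)
            per_file.foreach(lambda kv: None)
    finally:
        sc.stop()
```

For example, with made-up timings of three jobs submitted at 0 s, 35 s and 100 s that ran for 5 s, 4 s and 1 s, `idle_gaps([(0.0, 5.0), (35.0, 4.0), (100.0, 1.0)])` gives gaps of 30 s and 61 s, which is the kind of lag described in this report.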

  was:
I am running a Spark job on a standalone cluster. The cluster was upgraded from Spark 2.1.1
to Spark 2.3.1, after which the job's performance dropped considerably (roughly 3-4 times slower).
Upon investigation, I found a considerable time lag (ranging from 30 sec to 2 min) between
the start times of different Spark actions (excluding the time taken by the action itself),
as can be seen from the start time of each job on the Spark UI page. This lag was not present
in Spark 2.1.1. Can anybody tell what the issue is here?

PS: I am reading multiple text files from S3 using wholeTextFiles, creating multiple DataFrames
for those text files and writing them out to S3 in CSV format.


> Time lag between two consecutive spark actions using Spark 2.3.1
> ----------------------------------------------------------------
>
>                 Key: SPARK-28575
>                 URL: https://issues.apache.org/jira/browse/SPARK-28575
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.3.1
>            Reporter: Kushal Mahajan
>            Priority: Major
>         Attachments: spark_2.1_screenshot.PNG, spark_2.3_screenshot.PNG
>
>
> Steps to reproduce:
>  # Read a directory (consisting of txt files) using SparkContext's wholeTextFiles method.
>  # Perform a transformation on the resulting pair RDD.
>  # Perform an action (foreach) on each entry corresponding to each txt file.
>  # A time lag can be seen between these actions in the Spark UI.
> The action itself does not take much time. There is a time lag between the start times of
> consecutive actions (excluding the time taken by the job itself). Kindly refer to the attachments.
> PS: This time lag is not seen when running the job in Spark 2.1.1.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


