spark-issues mailing list archives

From "antonkulaga (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
Date Wed, 31 Jul 2019 13:48:08 GMT

    [ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897194#comment-16897194
] 

antonkulaga commented on SPARK-28547:
-------------------------------------

[~maropu] I think I was quite clear: even `describe` is extremely slow. So the easiest way
to reproduce is just to run `describe` on all numeric columns of GTEx.
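To make the reported workload concrete: Spark's `DataFrame.describe()` computes count, mean, stddev, min, and max for every numeric column, so a wide GTEx-style matrix means tens of thousands of such aggregations. Below is a minimal stdlib-only sketch of that per-column work on synthetic data (no Spark involved; the shapes are small stand-ins for the real ~11k samples x ~56k genes, which are assumptions, not measurements from the issue):

```python
# Stdlib sketch of what Spark's describe() computes per numeric column.
# Synthetic data stands in for GTEx; shapes here are illustrative only.
import random
import statistics

random.seed(0)
n_rows, n_cols = 100, 500  # stand-in for a much larger samples-x-genes matrix
data = [[random.random() for _ in range(n_cols)] for _ in range(n_rows)]

def describe(matrix):
    """Per-column count/mean/stddev/min/max, mirroring describe()'s output."""
    columns = list(zip(*matrix))  # transpose to a column-major view
    return [
        {
            "count": len(col),
            "mean": statistics.fmean(col),
            "stddev": statistics.stdev(col),
            "min": min(col),
            "max": max(col),
        }
        for col in columns
    ]

stats = describe(data)
print(len(stats))  # one summary row per column
```

The point of the sketch is that the work is embarrassingly parallel per column, which is why the hours-long runtimes on a cluster reported in the issue are surprising compared to a single-machine pandas run.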

> Make it work for wide (> 10K columns data)
> ------------------------------------------
>
>                 Key: SPARK-28547
>                 URL: https://issues.apache.org/jira/browse/SPARK-28547
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4, 2.4.3
>         Environment: Ubuntu server, Spark 2.4.3 (Scala), >64 GB RAM per node, 32
> cores (tried different executor configurations)
>            Reporter: antonkulaga
>            Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15k columns and >15k rows).
> Most genomics/transcriptomics data is wide because the number of genes is usually >20k,
> and the number of samples as well. The very popular GTEx dataset is a good example (see,
> for instance, the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data,
> where a .gct file is just a .tsv file with two comment lines at the beginning). Everything
> done on wide tables (even a simple "describe" applied to all the gene columns) either takes
> hours or gets frozen (because of lost executors), irrespective of memory and number of cores,
> while the same operations run fast (minutes) in pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

