spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dhaval Patel <dhaval1...@gmail.com>
Subject DataFrame/JDBC very slow performance
Date Mon, 24 Aug 2015 15:17:33 GMT
I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried with BIG
table (5B records) then no results returned upon completion of query.

I am using Spark 1.4.1. and is setup on a very powerful machine(2 cpu, 24
cores, 126G RAM).

I have tried several memory setup and tuning options to make it work
faster, but neither of them made a huge impact.

I am sure there is something I am missing and below is my final try that
took about 11 minutes to get this simple counts vs it only took 40 seconds
using a JDBC connection through R to get the counts.


bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()


[image: Inline image 1]

Mime
View raw message