spark-dev mailing list archives

From Ignacio Zendejas
Subject createDataframe from s3 results in error
Date Tue, 02 Jun 2015 22:13:40 GMT
I've run into an error when trying to create a dataframe. Here's the code:

from pyspark import StorageLevel
from pyspark.sql import HiveContext, Row

table = 'blah'
sqlContext = HiveContext(sc)  # sc is the SparkContext from the pyspark shell

data = sc.textFile('s3://bucket/some.tsv')

def deserialize(s):
  p = s.strip().split('\t')
  p[-1] = float(p[-1])
  return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
             created_at=p[3], layer_id=p[4], score=p[5])

blah = data.map(deserialize)
df = sqlContext.inferSchema(blah)
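Since the failure only shows up when building the DataFrame, it can help to confirm that the per-line parsing itself is sound before involving Spark at all. Below is a minimal sketch of the same deserialize logic in plain Python (no Spark dependency), using a made-up sample line; the field layout mirrors the Row above but the data is purely illustrative:

```python
# Standalone check of the per-line parsing logic; no SparkContext needed.
# Returns a dict instead of pyspark.sql.Row so it runs anywhere.

def deserialize(s):
    p = s.strip().split('\t')
    p[-1] = float(p[-1])  # score arrives as text in the TSV
    return dict(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
                created_at=p[3], layer_id=p[4], score=p[5])

# Hypothetical sample line with six tab-separated fields.
line = 'abc123\tsome-name\tsome phrase\t2015-06-02\t7\t0.93'
row = deserialize(line)
print(row['score'])  # 0.93 (a float, not the original string)
```

If a call like this raises (say, a ValueError from float() on a malformed row), the problem is in the input data rather than in the S3/EMR setup.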


I've also tried s3n and using createDataFrame. Our setup is on EMR
instances, using the setup script Amazon provides. After lots of debugging,
I suspect there may be a problem with this setup.

What's weird is that if I run this in the pyspark shell and re-run the last
line (inferSchema/createDataFrame), it actually works.

We're getting warnings like this:

Here's the actual error:

Any help would be greatly appreciated.

