spark-user mailing list archives

From Bertrand
Subject Question about Google Books Ngrams with pyspark (1.4.1)
Date Tue, 01 Sep 2015 15:39:20 GMT
Hello everybody,

I am trying to read the Google Books Ngrams with pyspark on Amazon EC2. 

I followed the steps from :

and everything is working fine.

I am able to read the file  :
lines =

If I now want to read the file using :

lines =

I have the following error message :

15/09/01 15:28:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
(TID 1, java.lang.IllegalArgumentException: Unknown codec:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/", line 544, in sequenceFile
    keyConverter, valueConverter, minSplits, batchSize)
  File "/root/spark/python/lib/", line 538, in __call__
  File "/root/spark/python/lib/", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 4, java.lang.IllegalArgumentException: Unknown codec:
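Since the codec name is cut off in the message above, one way to see which codec the file itself declares, independently of Spark, might be to read the SequenceFile header directly. The following is only a minimal sketch, assuming the standard Hadoop SequenceFile header layout (version 5 or later) and class names shorter than 128 bytes. The function names and the codec name `com.example.DemoCodec` in the demo are hypothetical and used purely for illustration.

```python
import io
import struct


def _read_text(stream):
    """Read a Hadoop Text-style string: a length prefix, then UTF-8 bytes."""
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("unexpected end of header")
    (length,) = struct.unpack("b", first)  # interpret as a signed byte
    if length < 0:
        # Simplification (assumption): only single-byte length prefixes
        # (0..127) are handled, which is enough for typical class names.
        raise ValueError("length prefix not supported in this sketch")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of header")
    return data.decode("utf-8")


def _read_bool(stream):
    """Read a one-byte boolean (any non-zero byte counts as true)."""
    raw = stream.read(1)
    if len(raw) != 1:
        raise ValueError("unexpected end of header")
    return raw != b"\x00"


def read_sequencefile_header(stream):
    """Return the class names and compression settings from a header."""
    magic = stream.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = magic[3]
    if version < 5:
        # Assumption: older versions do not store a codec class name.
        raise ValueError("header version too old for this sketch")
    key_class = _read_text(stream)
    value_class = _read_text(stream)
    compressed = _read_bool(stream)
    block_compressed = _read_bool(stream)
    # The codec class name is only present when compression is enabled.
    codec = _read_text(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


if __name__ == "__main__":
    def _text(s):
        # Encode a short string the way the header stores it.
        raw = s.encode("utf-8")
        return bytes([len(raw)]) + raw

    # Synthetic header for demonstration; the codec name is made up.
    sample = (
        b"SEQ\x06"
        + _text("org.apache.hadoop.io.LongWritable")
        + _text("org.apache.hadoop.io.Text")
        + b"\x01\x01"
        + _text("com.example.DemoCodec")
    )
    print(read_sequencefile_header(io.BytesIO(sample)))
```

With a local copy of the first few hundred bytes of the file, opening it with `open(path, "rb")` and passing the handle to `read_sequencefile_header` would show the declared codec class. If that class is not on the cluster's classpath or not registered with Hadoop, an "Unknown codec" error like the one above would be a plausible result.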

Could you please help me read the file with pyspark?

Thank you for your help,


