spark-user mailing list archives

From Fernando Paladini <fnpalad...@gmail.com>
Subject Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark
Date Mon, 05 Oct 2015 20:04:09 GMT
I don't know what this method does, but it runs with no issues on Spark
:3. Here's the full log from the spark-submit command
<https://gist.github.com/paladini/b5f8982c8ec4035d5dfe> (strange, but
small).

And here is the output that matters to you (I think):

[inline image: output of dataframe.show(); not preserved in the archive]
Is there any other test I should run to validate the DataFrame, or can I
trust that it is actually working fine?

Thank you for all the help, guys!!

2015-10-05 16:44 GMT-03:00 Michael Armbrust <michael@databricks.com>:

> Looks correct to me.  Try for example:
>
> from pyspark.sql.functions import *
> df.withColumn("value", explode(df['values'])).show()
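>
> A rough sketch of what to expect (the sample record here is made up,
> but 'values' matches the column your code references):
>
> # {"values": [1, 2]} yields two rows after the explode:
> #   Row(values=[1, 2], value=1)
> #   Row(values=[1, 2], value=2)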
>
> On Mon, Oct 5, 2015 at 2:15 PM, Fernando Paladini <fnpaladini@gmail.com>
> wrote:
>
>> Update:
>>
>> I've updated my code and now I have the following JSON:
>> https://gist.github.com/paladini/27bb5636d91dec79bd56
>> At the same link you can check the output from "spark-submit
>> myPythonScript.py", where I call "myDataframe.show()". The following is
>> printed by Spark (among other useless debug output):
>>
>>
>> [inline image: output of myDataframe.show(); not preserved in the archive]
>>
>> Is that correct for the given JSON input
>> <https://gist.github.com/paladini/27bb5636d91dec79bd56> (gist link
>> above)? How can I test whether Spark understands this DataFrame and can
>> perform complex manipulations on it?
>>
>> Thank you! Hope you can help me soon :3
>> Fernando Paladini.
>>
>> 2015-10-05 15:23 GMT-03:00 Fernando Paladini <fnpaladini@gmail.com>:
>>
>>> Thank you for the replies, and sorry about the delay; my e-mail client
>>> sent this conversation to Spam (??).
>>>
>>> I'll take a look at your tips and come back later to post my questions /
>>> progress. Again, thank you so much!
>>>
>>> 2015-09-30 18:37 GMT-03:00 Michael Armbrust <michael@databricks.com>:
>>>
>>>> I think the problem here is that you are passing in parsed JSON that is
>>>> stored as a dictionary (which is converted to a HashMap when going into
>>>> the JVM).  You should instead be passing in the path to the json file
>>>> (formatted as Akhil suggests) so that Spark can do the parsing in
>>>> parallel.  The other option would be to construct an RDD of JSON strings
>>>> and pass that to the json method.
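>>>>
>>>> A rough sketch of both options (untested; the file path here is just an
>>>> example, this assumes json_object is a list of parsed documents, and
>>>> depending on your Spark version the RDD variant may need
>>>> sqlContext.jsonRDD(rdd) instead of read.json):
>>>>
>>>> # Option 1: point Spark at a file containing one JSON object per line,
>>>> # so the parsing happens in parallel on the workers.
>>>> dataframe = sqlContext.read.json("/tmp/response.json")
>>>>
>>>> # Option 2: build an RDD of JSON strings from the parsed response.
>>>> rdd = sc.parallelize([json.dumps(doc) for doc in json_object])
>>>> dataframe = sqlContext.read.json(rdd)  # or sqlContext.jsonRDD(rdd)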
>>>>
>>>> On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <akhil@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> Each JSON doc should be on a single line, I guess.
>>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>>
>>>>> Note that the file that is offered as *a json file* is not a typical
>>>>> JSON file. Each line must contain a separate, self-contained valid JSON
>>>>> object. As a consequence, a regular multi-line JSON file will most often
>>>>> fail.
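>>>>>
>>>>> For example, a (hypothetical) file with two records would look like
>>>>> this, with each object self-contained on its own line and no enclosing
>>>>> array:
>>>>>
>>>>> {"name": "sensor.1", "values": [1, 2, 3]}
>>>>> {"name": "sensor.2", "values": [4, 5]}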
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <
>>>>> fnpaladini@gmail.com> wrote:
>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>> I'm very new to Spark and I'm having some trouble reading a JSON
>>>>>> into a DataFrame on PySpark.
>>>>>>
>>>>>> I'm getting a JSON object from an API response and I would like to
>>>>>> store it in Spark as a DataFrame (I've read that DataFrame is better
>>>>>> than RDD; is that accurate?). From what I've read
>>>>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>>>>>> in the documentation, I just need to call the method
>>>>>> sqlContext.read.json in order to do what I want.
>>>>>>
>>>>>> *The following is the code from my test application:*
>>>>>> import json
>>>>>> from pyspark import SparkContext
>>>>>> from pyspark.sql import SQLContext
>>>>>>
>>>>>> # 'response' comes from an earlier API call (not shown here)
>>>>>> json_object = json.loads(response.text)
>>>>>> sc = SparkContext("local", appName="JSON to RDD")
>>>>>> sqlContext = SQLContext(sc)
>>>>>> dataframe = sqlContext.read.json(json_object)
>>>>>> dataframe.show()
>>>>>>
>>>>>> *The problem is that when I run "spark-submit myExample.py" I get
>>>>>> the following error:*
>>>>>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>>>>>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>>>>>> localhost, 48634)
>>>>>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>>>>>> Traceback (most recent call last):
>>>>>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>>>>>> line 35, in <module>
>>>>>>     dataframe = sqlContext.read.json(json_object)
>>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>>>>> line 144, in json
>>>>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>>>>> line 538, in __call__
>>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>>>>>> line 36, in deco
>>>>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>>>>> line 304, in get_return_value
>>>>>> py4j.protocol.Py4JError: An error occurred while calling o21.json.
>>>>>> Trace:
>>>>>> py4j.Py4JException: Method json([class java.util.HashMap]) does not
>>>>>> exist
>>>>>>     at
>>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>>>>     at
>>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>>>>>     at py4j.Gateway.invoke(Gateway.java:252)
>>>>>>     at
>>>>>> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>>>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> *What am I doing wrong?*
>>>>>> Check out this gist
>>>>>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the
>>>>>> JSON I'm trying to load.
>>>>>>
>>>>>> Thanks!
>>>>>> Fernando Paladini
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Fernando Paladini
>>>
>>
>>
>>
>> --
>> Fernando Paladini
>>
>
>


-- 
Fernando Paladini
