spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ali Bajwa <ali.ba...@gmail.com>
Subject Re: PySpark: slicing issue with dataframes
Date Sun, 03 May 2015 17:36:50 GMT
Friendly reminder on this one. Just wanted to get a confirmation that this
is not by design before I logged a JIRA

Thanks!
Ali


On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa <ali.bajwa@gmail.com> wrote:

> Hi experts,
>
> Trying to use the "slicing" functionality in strings as part of a Spark
> program (PySpark) I get this error:
>
> **** Code ****
>
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'], 'Lastname':
> ['Jones', 'Bajwa', 'Day']})
> a = hc.createDataFrame(A)
> print A
>
> b = a.select(a.Firstname[:2])
> print b.toPandas()
> c = a.select(a.Lastname[2:])
> print c.toPandas()
>
> Output:
>
>  Firstname Lastname
> 0     James    Jones
> 1       Ali    Bajwa
> 2    Daniel      Day
>   SUBSTR(Firstname, 0, 2)
> 0                      Ja
> 1                      Al
> 2                      Da
>
> ---------------------------------------------------------------------------
> Py4JError                                 Traceback (most recent call last)
> <ipython-input-17-6ee5d7d069ce> in <module>()
>      10 b = a.select(a.Firstname[:2])
>      11 print b.toPandas()
> ---> 12 c = a.select(a.Lastname[2:])
>      13 print c.toPandas()
>
> /home/jupyter/spark-1.3.1/python/pyspark/sql/dataframe.pyc in substr(self,
> startPos, length)
>    1089             raise TypeError("Can not mix the type")
>    1090         if isinstance(startPos, (int, long)):
> -> 1091             jc = self._jc.substr(startPos, length)
>    1092         elif isinstance(startPos, Column):
>    1093             jc = self._jc.substr(startPos._jc, length._jc)
>
> /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
> in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer,
> self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
>     302                 raise Py4JError(
>     303                     'An error occurred while calling {0}{1}{2}.
> Trace:\n{3}\n'.
> --> 304                     format(target_id, '.', name, value))
>     305         else:
>     306             raise Py4JError(
>
> Py4JError: An error occurred while calling o1887.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class
> java.lang.Long]) does not exist
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
> at py4j.Gateway.invoke(Gateway.java:252)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
>
> Looks like X[:2] works but X[2:] fails with the error above
> Anyone else have this issue?
>
> Clearly I can use substr() to workaround this, but if this is a confirmed
> bug we should open a JIRA.
>
> Thanks,
> Ali
>

Mime
View raw message