spark-user mailing list archives

From R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com>
Subject Re: Create an Empty dataframe
Date Sun, 08 Jul 2018 14:50:57 GMT
From Stack Overflow:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sc)  # createDataFrame needs a SparkSession, not just a SparkContext

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])

# Empty DataFrame with the desired schema
empty = spark.createDataFrame(sc.emptyRDD(), schema)

# addOndata is the DataFrame holding the rows you want to append;
# union replaces unionAll, which is deprecated since Spark 2.0
empty = empty.union(addOndata)
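
To append inside a for loop, as asked, the same pattern works; a sketch
(the batches below are made-up sample data):

for batch in [[("a", "1")], [("b", "2")]]:
    empty = empty.union(spark.createDataFrame(batch, schema))

Note that each union extends the query lineage, so for many iterations it
is usually cheaper to collect the rows into one local list and call
createDataFrame once.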

Best,
Ravion

On Sun, Jul 8, 2018 at 10:44 AM Shmuel Blitz <shmuel.blitz@similarweb.com>
wrote:

> Hi Dimitris,
>
> Could you explain your use case in a bit more details?
>
> What you are asking for, if I understand you correctly, is not the
> advised way to go about it.
>
> If you're running analytics and expect their output to be a Dataframe with
> the specified columns, then you should compose your queries in such a way
> that they result in a DataFrame.
>
> If you're preparing data to be analyzed (i.e. getting the input ready for
> manipulation), then I expect you to be doing one of the following (a
> sketch follows the list):
> a. Read in the data using one of Spark's provided input APIs (e.g. reading
> a parquet file directly into a DataFrame)
> b. Read/prepare your data as a standard collection in your language
> (Python, in your case, but the same in Scala/Java/etc.), and then use
> Spark's API to parallelize the data and/or convert it into a DataFrame.
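>
> A rough sketch of both options (the parquet path and the sample rows are
> made up for illustration):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # a. read input directly into a DataFrame (hypothetical path)
> df_a = spark.read.parquet("/data/input.parquet")
>
> # b. prepare a plain Python collection, then hand it to Spark
> rows = [("a", 1), ("b", 2)]
> df_b = spark.createDataFrame(rows, ["letter", "count"])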
>
> One way or another, you want to use the Spark API for work that should
> be distributed to workers (heavy load, large amounts of data), and your
> native language's API, which is usually much more powerful, for
> bootstrapping and lightweight preparation.
>
> Regards,
> Shmuel
>
> On Sat, Jun 30, 2018 at 6:51 PM Apostolos N. Papadopoulos <
> papadopo@csd.auth.gr> wrote:
>
>> Hi Dimitris,
>>
>> you can do the following:
>>
>> 1. create an initial DataFrame from an empty CSV
>>
>> 2. use "union" to insert new rows
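>>
>> A minimal sketch of that pattern (using an empty in-memory list instead
>> of a CSV file, to keep the example self-contained):
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, StringType
>>
>> spark = SparkSession.builder.getOrCreate()
>> schema = StructType([StructField("Column1", StringType(), True),
>>                      StructField("Column2", StringType(), True)])
>>
>> df = spark.createDataFrame([], schema)   # 1. initial empty DataFrame
>> new_rows = spark.createDataFrame([("a", "b")], schema)
>> df = df.union(new_rows)                  # 2. insert new rows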
>>
>> Do not forget that Spark cannot replace a DBMS. Spark is mainly used
>> for analytics.
>>
>> If you need select/insert/delete/update capabilities, perhaps you should
>> look at a DBMS.
>>
>>
>> Another alternative, in case you need "append-only" semantics, is to use
>> streaming or structured streaming.
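>>
>> For illustration, a minimal structured streaming sketch with append
>> output (the built-in "rate" source is used just as a stand-in input):
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.getOrCreate()
>> stream = spark.readStream.format("rate").load()  # emits timestamp, value
>> query = stream.writeStream.outputMode("append").format("console").start()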
>>
>>
>> regards,
>>
>> Apostolos
>>
>>
>>
>>
>> On 30/06/2018 05:46 PM, dimitris plakas wrote:
>> > I am new to PySpark and want to initialize a new empty dataframe with
>> > sqlContext() with two columns ("Column1", "Column2"), and I want to
>> > append rows dynamically in a for loop.
>> > Is there any way to achieve this?
>> >
>> > Thank you in advance.
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: +0030312310991918
>> email: papadopo@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://delab.csd.auth.gr/~apostol
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.blitz@similarweb.com
> www.similarweb.com
>
