spark-user mailing list archives

From Shmuel Blitz <shmuel.bl...@similarweb.com>
Subject Re: Create an Empty dataframe
Date Sun, 08 Jul 2018 14:43:56 GMT
Hi Dimitris,

Could you explain your use case in a bit more detail?

What you are asking for, if I understand you correctly, is not the
recommended way to go about it.

If you're running analytics and expect the output to be a DataFrame with
the specified columns, then you should compose your queries so that they
produce that DataFrame.

If you're preparing data to be analyzed (i.e. getting the input ready for
manipulation), then I expect you to be doing one of the following:
a. Read in the data using one of Spark's provided input APIs (e.g. reading
a parquet file directly into a DataFrame)
b. Read/prepare your data as a standard collection in your language
(Python, in your case, but the same in Scala/Java/etc.), and then use
Spark's API to parallelize the data and/or convert it into a DataFrame.
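
A minimal sketch of options (a) and (b) in PySpark. The helper names and
the default column names (taken from the original question) are only
illustrative, and `spark` is assumed to be an existing SparkSession:

```python
def read_parquet_input(spark, path):
    """Option (a): let Spark read the file directly into a DataFrame."""
    return spark.read.parquet(path)


def collection_to_dataframe(spark, rows, columns=("Column1", "Column2")):
    """Option (b): convert an ordinary Python collection into a DataFrame.

    `rows` is a non-empty list of tuples, one tuple per row. Column types
    are inferred from the data, so an empty list would need an explicit
    schema instead of bare column names.
    """
    return spark.createDataFrame(rows, schema=list(columns))
```

For example, collection_to_dataframe(spark, [(1, "a"), (2, "b")]) would give
a two-row DataFrame with columns Column1 and Column2.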

Either way, you want to use the Spark API for work that should be
distributed to workers (heavy load, large amounts of data), and use your
native language API, which is usually much more powerful, to run
bootstrapping and lightweight preparation.
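
Applied to the original question, this could look roughly like the sketch
below: the for loop appends to a plain Python list, and Spark is involved
only once, at the end. The function names, the row values, and the local
master setting are assumptions made for illustration:

```python
def build_rows(n):
    """Lightweight preparation in plain Python: append rows in a loop."""
    rows = []
    for i in range(n):
        rows.append((i, i * i))  # one (Column1, Column2) tuple per row
    return rows


def demo():
    """Convert the prepared rows into a DataFrame in a single step."""
    # Imported here so that build_rows() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("rows-demo").getOrCreate()
    )
    df = spark.createDataFrame(build_rows(5), schema=["Column1", "Column2"])
    df.show()
    spark.stop()
```

Calling demo() requires a working PySpark installation. Building the list
first avoids creating a new DataFrame on every iteration of the loop.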

Regards,
Shmuel

On Sat, Jun 30, 2018 at 6:51 PM Apostolos N. Papadopoulos <
papadopo@csd.auth.gr> wrote:

> Hi Dimitri,
>
> you can do the following:
>
> 1. create an initial dataframe from an empty csv
>
> 2. use "union" to insert new rows
>
> Do not forget that Spark cannot replace a DBMS. Spark is mainly used
> for analytics.
>
> If you need select/insert/delete/update capabilities, perhaps you should
> look at a DBMS.
>
>
> Another alternative, in case you need "append only" semantics, is to use
> streaming or structured streaming.
>
>
> regards,
>
> Apostolos
>
>
>
>
> On 30/06/2018 05:46 PM, dimitris plakas wrote:
> > I am new to PySpark and want to initialize a new empty DataFrame with
> > sqlContext() with two columns ("Column1", "Column2"), and I want to
> > append rows dynamically in a for loop.
> > Is there any way to achieve this?
> >
> > Thank you in advance.
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papadopo@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://delab.csd.auth.gr/~apostol
>

-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.blitz@similarweb.com
www.similarweb.com
