spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cody Koeninger <c...@koeninger.org>
Subject Re: Structured streaming use of DataFrame vs Datasource
Date Thu, 16 Jun 2016 21:05:33 GMT
I'm clear on what a type alias is.  My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information.  Reading through code that uses the Dataframe alias, it's
a little hard for me to know when that's intentional or not.


On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
<tathagata.das1565@gmail.com> wrote:
> There are different ways to view this. If its confusing to think that Source
> API returning DataFrames, its equivalent to thinking that you are returning
> a Dataset[Row], and DataFrame is just a shorthand. And
> DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object] is
> to Array[String]. DataFrame is more general in a way, as every Dataset can
> be boiled down to a DataFrame. So to keep the Source APIs general (and also
> source-compatible with 1.x), they return DataFrame.
>
> On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <cody@koeninger.org> wrote:
>>
>> Is this really an internal / external distinction?
>>
>> For a concrete example, Source.getBatch seems to be a public
>> interface, but returns DataFrame.
>>
>> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
>> <tathagata.das1565@gmail.com> wrote:
>> > DataFrame is a type alias of Dataset[Row], so externally it seems like
>> > Dataset is the main type and DataFrame is a derivative type.
>> > However, internally, since everything is processed as Rows, everything
>> > uses
>> > DataFrames, Type classes used in a Dataset is internally converted to
>> > rows
>> > for processing. . Therefore internally DataFrame is like "main" type
>> > that is
>> > used.
>> >
>> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <cody@koeninger.org>
>> > wrote:
>> >>
>> >> Sorry, meant DataFrame vs Dataset
>> >>
>> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <cody@koeninger.org>
>> >> wrote:
>> >> > Is there a principled reason why sql.streaming.* and
>> >> > sql.execution.streaming.* are making extensive use of DataFrame
>> >> > instead of Datasource?
>> >> >
>> >> > Or is that just a holdover from code written before the move / type
>> >> > alias?
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: dev-help@spark.apache.org
>> >>
>> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Mime
View raw message