spark-user mailing list archives

From Gonzalo Zarza <gonzalo.za...@globant.com>
Subject Re: Spark Questions
Date Mon, 14 Jul 2014 14:08:51 GMT
Thanks for your answers Shuo Xiang and Aaron Davidson!

Regards,


--
*Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist |
*GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494


On Sat, Jul 12, 2014 at 9:02 PM, Aaron Davidson <ilikerps@gmail.com> wrote:

> I am not entirely certain I understand your questions, but let me assume
> you are mostly interested in SparkSQL and are thinking about your problem
> in terms of SQL-like tables.
>
> 1. Shuo Xiang mentioned Spark partitioning strategies, but in case you are
> talking about data partitioning or sharding as they exist in Hive, SparkSQL
> does not currently support this, though it is on the roadmap. We can read
> from partitioned Hive tables, however.
>
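[Editor's note: point 1 refers to Spark's key-based partitioners. As a minimal, self-contained sketch of the hash-partitioning idea behind Spark's default HashPartitioner (plain Python, not Spark's actual Scala implementation; integer keys are used so the result is deterministic):]

```python
from collections import defaultdict

def get_partition(key, num_partitions):
    # Assign a key to a partition by hashing it modulo the partition count --
    # a simplified stand-in for Spark's default HashPartitioner.
    return hash(key) % num_partitions

# Group (key, value) records by their assigned partition, the way a
# shuffle with a hash partitioner would co-locate equal keys.
records = [(10, "a"), (7, "b"), (10, "c"), (3, "d")]
partitions = defaultdict(list)
for key, value in records:
    partitions[get_partition(key, 4)].append((key, value))

# Records with the same key always land in the same partition.
print(dict(partitions))
```

The key property this illustrates: all records sharing a key end up in one partition, which is what makes per-key operations like `reduceByKey` local after the shuffle.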
> 2. If by entries/record you mean something like columns/row, SparkSQL does
> allow you to project out the columns you want, or select all columns. The
> efficiency of such a projection is determined by how the data is
> stored, however: If your data is stored in an inherently row-based format,
> this projection will be no faster than doing an initial map() over the data
> to only select the desired columns. If it's stored in something like
> Parquet, or cached in memory, however, we would avoid ever looking at the
> unused columns.
>
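[Editor's note: point 2's row-based vs. columnar distinction can be sketched in a few lines of plain Python. The dict-of-lists layout below is a toy stand-in for a columnar format like Parquet, not its real on-disk encoding:]

```python
# Row-oriented storage: projecting two columns still touches every full row.
rows = [{"id": i, "name": "user%d" % i, "score": i * 2} for i in range(5)]
proj_rows = [(r["id"], r["score"]) for r in rows]  # scans whole rows

# Column-oriented storage: only the requested columns are ever read;
# the "name" column is never touched.
columns = {
    "id": list(range(5)),
    "name": ["user%d" % i for i in range(5)],
    "score": [i * 2 for i in range(5)],
}
proj_cols = list(zip(columns["id"], columns["score"]))

print(proj_cols)
```

Both layouts yield the same projection, but the columnar one can skip unused columns entirely, which is the saving Aaron describes for Parquet or in-memory cached tables.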
> 3. Spark has a very generalized data source API, so it is capable of
> interacting with arbitrary data sources. However, I don't think we currently
> have any SparkSQL connectors to RDBMSes that would support column pruning
> or other push-downs. This is all very much viable, however.
>
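[Editor's note: on point 3, Spark's core API did ship a low-level JdbcRDD at the time, which parallelizes a bounded SQL query by splitting a numeric key range into one sub-range per partition. A self-contained sketch of that range-splitting idea (plain Python; the partition bounds would be bound to the query's lower/upper placeholders):]

```python
def split_range(lower, upper, num_partitions):
    # Split an inclusive [lower, upper] id range into contiguous sub-ranges,
    # one per partition -- the same idea JdbcRDD uses so each partition can
    # run "SELECT ... WHERE id >= ? AND id <= ?" against the database.
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds

print(split_range(1, 100, 4))
```

Note that no column pruning or predicate push-down happens here beyond the range bounds themselves, matching Aaron's caveat that richer push-downs to RDBMSes were not yet supported.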
>
> On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza <gonzalo.zarza@globant.com>
> wrote:
>
>> Hi all,
>>
>> We've been evaluating Spark for a long-term project. Although we've been
>> reading several threads in the forum, any hints on the following topics
>> would be extremely welcome:
>>
>> 1. Which are the data partition strategies available in Spark? How
>> configurable are these strategies?
>>
>> 2. What would be the best way to use Spark if queries touch only 3-5
>> entries/records? Which strategy is best if they need to perform a full
>> scan of the entries?
>>
>> 3. Is Spark capable of interacting with RDBMS?
>>
>> Thanks a lot!
>>
>> Best regards,
>>
>>
>
>
