spark-user mailing list archives

From Feynman Liang <fli...@databricks.com>
Subject Re: custom RDD in java
Date Wed, 01 Jul 2015 19:52:38 GMT
AFAIK RDDs can only be created on the driver, not on the executors. Also,
`saveAsTextFile(...)` is an action, so it too can only be invoked from
the driver.

As Silvio already mentioned, Sqoop may be a good option.
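
On the memory concern in the quoted message below: one way to avoid holding a
whole table in a list is to return a lazy iterator that produces rows on
demand. The sketch below is plain Java with no Spark or JDBC dependency.
`LazyTableRows`, `fetchRows`, and `allRows` are hypothetical names, and
`fetchRows` only simulates a database cursor. In Spark the same shape could be
returned from a `flatMap` or `mapPartitions` function (whether that function
returns an `Iterable` or an `Iterator` depends on the Spark version), but that
wiring is an assumption and is not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class LazyTableRows {

    // Hypothetical stand-in for a JDBC cursor: yields rowCount rows, one at a time.
    static Iterator<String> fetchRows(final String table, final int rowCount) {
        return new Iterator<String>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < rowCount;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return table + ",row" + (next++);
            }
        };
    }

    // Lazily chains the rows of every table; a table's cursor is only opened
    // when the previous one is exhausted, and no list of rows is ever built.
    static Iterator<String> allRows(final List<String> tables, final int rowsPerTable) {
        final Iterator<String> tableIt = tables.iterator();
        return new Iterator<String>() {
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                while (!current.hasNext() && tableIt.hasNext()) {
                    current = fetchRows(tableIt.next(), rowsPerTable);
                }
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("db.t1", "db.t2");
        Iterator<String> rows = allRows(tables, 3);
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
    }
}
```

The caller pulls one row at a time, so memory use stays bounded by a single
row (plus whatever buffering a real JDBC driver does) rather than by the size
of the table.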

On Wed, Jul 1, 2015 at 12:46 PM, Shushant Arora <shushantarora09@gmail.com>
wrote:

> The list of tables is not large; the RDD is created on the table list to
> parallelize the work of fetching tables in multiple mappers at the same time.
> Since the time taken to fetch a table is significant, I can't run that sequentially.
>
>
> The content of a table fetched by a map job is large, so one option is to dump
> the content to HDFS using the filesystem API from inside the map function for
> every few rows of the table fetched.
>
> I cannot keep the complete table in memory and then dump it to HDFS using the
> map function below:
>
> JavaRDD<Iterable<String>> tablecontent = tablelistrdd.map(
>     new Function<String, Iterable<String>>() {
>       public Iterable<String> call(String tablename) {
>         // ..make jdbc connection get table data and populate in list and return
>         // that..
>       }
>     });
> tablecontent.saveAsTextFile("hdfspath");
>
> Here I wanted to create a custom RDD whose partitions would be in memory on
> multiple executors and contain parts of the table data. I would then have
> called saveAsTextFile on the custom RDD directly to save it to HDFS.
>
>
>
> On Thu, Jul 2, 2015 at 12:59 AM, Feynman Liang <fliang@databricks.com>
> wrote:
>
>>
>> On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora <shushantarora09@gmail.com
>> > wrote:
>>
>>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>>
>>
>> You are already creating an RDD in Java here ;)
>>
>> However, it's not clear to me why you'd want to make this an RDD. Is the
>> list of tables so large that it doesn't fit on a single machine? If not,
>> you may be better off spinning up one spark job for dumping each table in
>> tables using a JDBC datasource
>> <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>
>> .
>>
>> On Wed, Jul 1, 2015 at 12:00 PM, Silvio Fiorito <
>> silvio.fiorito@granturing.com> wrote:
>>
>>>   Sure, you can create custom RDDs. Haven’t done so in Java, but in
>>> Scala absolutely.
>>>
>>>   From: Shushant Arora
>>> Date: Wednesday, July 1, 2015 at 1:44 PM
>>> To: Silvio Fiorito
>>> Cc: user
>>> Subject: Re: custom RDD in java
>>>
>>>   ok..will evaluate these options but is it possible to create RDD in
>>> java?
>>>
>>>
>>> On Wed, Jul 1, 2015 at 8:29 PM, Silvio Fiorito <
>>> silvio.fiorito@granturing.com> wrote:
>>>
>>>>  If all you’re doing is just dumping tables from SQLServer to HDFS,
>>>> have you looked at Sqoop?
>>>>
>>>>  Otherwise, if you need to run this in Spark could you just use the
>>>> existing JdbcRDD?
>>>>
>>>>
>>>>   From: Shushant Arora
>>>> Date: Wednesday, July 1, 2015 at 10:19 AM
>>>> To: user
>>>> Subject: custom RDD in java
>>>>
>>>>   Hi
>>>>
>>>>  Is it possible to write custom RDD in java?
>>>>
>>>>  The requirement is: I have a list of SQL Server tables that need to be
>>>> dumped to HDFS.
>>>>
>>>>  So I have a
>>>> List<String> tables = {dbname.tablename,dbname.tablename2......};
>>>>
>>>>  then
>>>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>>>>
>>>>  JavaRDD<Iterable<String>> tablecontent = rdd.map(new
>>>> Function<String,Iterable<String>>(){fetch table and return populated
>>>> iterable});
>>>>
>>>>  tablecontent.saveAsTextFile("hdfs path");
>>>>
>>>>
>>>>  In rdd.map(new Function<String,>) I cannot keep the complete table
>>>> content in memory, so I want to create my own RDD to handle it.
>>>>
>>>>  Thanks
>>>> Shushant
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
