spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?
Date Mon, 02 Nov 2015 07:31:30 GMT
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/"), since you
want to store the Status objects. For every batch it will create a directory
under /status (the name will mostly be the timestamp). Since the data is
small (hardly a couple of MB per 1-second interval), it will not overwhelm
the cluster.
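
To make the per-batch naming concrete, here is a minimal sketch of the "prefix-TIME_IN_MS[.suffix]" rule quoted from the docs below. It does not call Spark; the class BatchFileName and the helper batchPath are hypothetical names used only for illustration, and the batch time is an arbitrary example value. It assumes saveAsObjectFiles follows the same rule the docs describe for the other saveAs* operations.

```java
// Hypothetical illustration only; this is not Spark code.
final class BatchFileName {

    // Builds the per-batch output name following the documented
    // "prefix-TIME_IN_MS[.suffix]" pattern.
    static String batchPath(String prefix, long batchTimeMs, String suffix) {
        String result = Long.toString(batchTimeMs);
        if (prefix != null && !prefix.isEmpty()) {
            result = prefix + "-" + result;
        }
        if (suffix != null && !suffix.isEmpty()) {
            result = result + "." + suffix;
        }
        return result;
    }

    public static void main(String[] args) {
        long batchTimeMs = 1446449490000L; // arbitrary example batch time

        // Prefix ending in "/": the directory name is "-" plus the timestamp.
        System.out.println(batchPath("hdfs://sigmoid/twitter/status/", batchTimeMs, ""));
        // prints hdfs://sigmoid/twitter/status/-1446449490000

        // Prefix without a trailing "/" and with a suffix.
        System.out.println(batchPath("hdfs://sigmoid/twitter/status/tweets", batchTimeMs, "obj"));
        // prints hdfs://sigmoid/twitter/status/tweets-1446449490000.obj
    }
}
```

If the rule is applied literally, a prefix that ends in "/" gives directory names that start with a hyphen, so a prefix such as ".../status/tweets" may be easier to work with later.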

Thanks
Best Regards

On Sat, Oct 24, 2015 at 7:05 AM, Andy Davidson <
Andy@santacruzintegration.com> wrote:

> I need to save the Twitter statuses I receive so that I can do additional
> batch-based processing on them in the future. Is it safe to assume HDFS is
> the best way to go?
>
> Any idea what is the best way to save twitter status to HDFS?
>
>         JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
>
>         Authorization twitterAuth = setupTwitterAuthorization();
>
>         JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
>
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>
>
>
> *saveAsHadoopFiles*(*prefix*, [*suffix*]): Save this DStream's contents as
> Hadoop files. The file name at each batch interval is generated based on
> *prefix* and *suffix*: *"prefix-TIME_IN_MS[.suffix]"*.
> Python API: This is not available in the Python API.
>
> However, JavaDStream<> does not support any saveAs* functions.
>
>
>         DStream<Status> dStream = tweets.dstream();
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html
>
> DStream<Status> only supports saveAsObjectFiles()
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>
> and saveAsTextFiles()
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>.
>
>
> saveAsTextFiles
>
> public void saveAsTextFiles(java.lang.String prefix,
>                    java.lang.String suffix)
>
> Save each RDD in this DStream as a text file, using string representation
> of elements. The file name at each batch interval is generated based on
> prefix and suffix: "prefix-TIME_IN_MS.suffix".
>
>
> Any idea where I would find these files? I assume they will be spread out
> all over my cluster?
>
>
> Also, I wonder if using the saveAs*() functions is going to cause other
> problems. My duration is set to 1 sec. Am I going to overwhelm the system
> with a bunch of tiny files? Many of them will be empty.
>
>
> Kind regards
>
>
> Andy
>
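
For a rough sense of scale on the tiny-files question, the arithmetic can be sketched as below. It assumes one output directory per batch, as described in the reply. The class BatchOutputCount and the helper outputsPer are hypothetical names, and the 60-second interval is just a comparison value, not something suggested in the thread.

```java
import java.time.Duration;

// Hypothetical illustration only; this is not Spark code.
final class BatchOutputCount {

    // Number of per-batch outputs written over a period,
    // assuming exactly one output per batch interval.
    static long outputsPer(Duration period, Duration batchInterval) {
        return period.toMillis() / batchInterval.toMillis();
    }

    public static void main(String[] args) {
        Duration day = Duration.ofDays(1);

        // 1-second batches, as in the original question.
        System.out.println(outputsPer(day, Duration.ofSeconds(1)));  // prints 86400

        // 60-second batches, for comparison.
        System.out.println(outputsPer(day, Duration.ofSeconds(60))); // prints 1440
    }
}
```

The data volume stays the same either way; only the number of directories and files changes with the batch interval.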
