spark-user mailing list archives

From Krakna H <>
Subject Re: example of non-line oriented input data?
Date Mon, 17 Mar 2014 16:09:52 GMT

Not sure if this is what you had in mind, but here's some simple pyspark
code that I recently wrote to deal with JSON files.

from pyspark import SparkContext, SparkConf
from operator import add
import json

def concatenate_paragraphs(sentence_array):
	return ' '.join(sentence_array).split(' ')

def count_words(s):
	# Each line is a JSON object with the structure:
	# {"key": {"paragraphs": [sentence1, sentence2, ...]}}
	key, body = json.loads(s).items()[0]
	return (key, len(concatenate_paragraphs(body['paragraphs'])))

logFile = 'foo.json'
conf = SparkConf()
sc = SparkContext(conf=conf)
logData = sc.textFile(logFile).cache()
num_lines = logData.count()
print 'Number of lines: %d' % num_lines
tm = logData.map(count_words)
tm = tm.reduceByKey(add)
op = tm.collect()
for key, num_words in op:
	print 'key: %s, num_words: %d' % (key, num_words)
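As an aside, the brace-tracking pre-processing step Nick suggests below (insert a break after each top-level object without fully parsing the document) could be sketched roughly like this. This isn't code from the thread, just an illustration; `split_top_level_objects` is a made-up name, and it only handles top-level objects (not arrays):

```python
import json

def split_top_level_objects(stream):
    # Yield one complete top-level JSON object at a time by tracking
    # brace depth. String contents and escape sequences are skipped so
    # that braces inside string values don't confuse the counter.
    depth = 0
    in_string = False
    escaped = False
    buf = []
    for ch in stream:
        buf.append(ch)
        if escaped:
            escaped = False
            continue
        if ch == '\\':
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    # Reached the end of a top-level object.
                    yield ''.join(buf).strip()
                    buf = []

blob = '{"a": {"paragraphs": ["x y"]}} {"b": {"paragraphs": ["z"]}}'
lines = list(split_top_level_objects(blob))
# Each element is now a standalone JSON document, one per "line",
# so sc.textFile() + json.loads() works on the result.
```

Writing the yielded strings out newline-separated gives you exactly the line-oriented file the rest of the job expects.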

On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User
List] <> wrote:

> I don't actually have any data.  I'm writing a course that teaches
> students how to do this sort of thing and am interested in looking at a
> variety of real life examples of people doing things like that.  I'd love
> to see some working code implementing the "obvious work-around" you
> describe. Do you have any to share?  It's an approach that makes a lot of
> sense, and as I said, I'd love to not have to re-invent the wheel if
> someone else has already written that code.  Thanks!
> Diana
> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas <[hidden email]> wrote:
>> There was a previous discussion about this here:
>> How big are the XML or JSON files you're looking to deal with?
>> It may not be practical to deserialize the entire document at once. In
>> that case an obvious work-around would be to have some kind of
>> pre-processing step that separates XML nodes/JSON objects with newlines so
>> that you *can* analyze the data with Spark in a "line-oriented format".
>> Your preprocessor wouldn't have to parse/deserialize the massive document;
>> it would just have to track open/closed tags/braces to know when to insert
>> a newline.
>> Then you'd just open the line-delimited result and deserialize the
>> individual objects/nodes with map().
>> Nick
>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <[hidden email]> wrote:
>>> Has anyone got a working example of a Spark application that analyzes
>>> data in a non-line-oriented format, such as XML or JSON?  I'd like to do
>>> this without re-inventing the wheel...anyone care to share?  Thanks!
>>> Diana
