spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: [pyspark] Load a master data file to spark ecosystem
Date Sat, 25 Apr 2020 04:58:33 GMT
How does your tree_lookup_value function work?

Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Fri, Apr 24, 2020 at 8:47 PM Arjun Chundiran <arjuncec@gmail.com> wrote:

> Hi Team,
>
> I have asked this question in stack overflow
> <https://stackoverflow.com/questions/61386719/load-a-master-data-file-to-spark-ecosystem>
> and I didn't really get any convincing answers. Can somebody help me to
> solve this issue?
>
> Below is my problem
>
> While building a log processing system, I came across a scenario where I
> need to look up data from a tree file (Like a DB) for each and every log
> line for corresponding value. What is the best approach to load an external
> file which is very large into the spark ecosystem? The tree file is of size
> 2GB.
>
> Here is my scenario
>
>    1. I have a file contains huge number of log lines.
>    2. Each log line needs to be split by a delimiter to 70 fields
>    3. Need to lookup the data from tree file for one of the 70 fields of
>    a log line.
>
> I am using Apache Spark Python API and running on a 3 node cluster.
>
> Below is the code which I have written. But it is really slow
>
> def process_logline(line, tree):
>     row_dict = {}
>     line_list = line.split(" ")
>     row_dict["host"] = tree_lookup_value(tree, line_list[0])
>     new_row = Row(**row_dict)
>     return new_row
> def run_job(vals):
>     spark.sparkContext.addFile('somefile')
>     tree_val = open(SparkFiles.get('somefile'))
>     lines = spark.sparkContext.textFile("log_file")
>     converted_lines_rdd = lines.map(lambda l: process_logline(l, tree_val))
>     log_line_rdd = spark.createDataFrame(converted_lines_rdd)
>     log_line_rdd.show()
>
> Basically I need some option to load the file one time in memory of workers and start
using it entire job life time using Python API.
>
> Thanks in advance
> Arjun
>
>
>
>

Mime
View raw message