spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: How to consider HTML files in Spark
Date Thu, 12 Mar 2015 17:36:16 GMT
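sc.wholeTextFiles() is what you need. It reads each file as a single
record and returns (path, content) pairs, instead of one record per line.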

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
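
A minimal sketch of that approach, assuming BeautifulSoup 4 (the bs4 package) is installed on the driver and on every worker. The helper names page_title and titles_from_directory are hypothetical, and extracting the <title> is only a stand-in for whatever processing you actually need:

```python
def page_title(path_and_html):
    """Parse one (path, content) pair as produced by wholeTextFiles()."""
    # Assumption: bs4 is installed on the workers; importing inside the
    # function means the import is resolved where the task actually runs.
    from bs4 import BeautifulSoup

    path, html = path_and_html
    # The whole file arrives as one string, so the parser sees the full document.
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    # Return plain Python values rather than the soup object itself.
    title = title_tag.get_text() if title_tag is not None else None
    return path, title


def titles_from_directory(sc, directory):
    """Build an RDD of (path, title) pairs, one per file in `directory`.

    `sc` is an existing SparkContext; `directory` is a placeholder for
    your own path, e.g. a file:/// or hdfs:// URI.
    """
    # wholeTextFiles() yields one (file path, file content) record per file.
    return sc.wholeTextFiles(directory).map(page_title)
```

In the PySpark shell, where sc already exists, you would call titles_from_directory(sc, filepath).collect(). Each file is loaded into memory as a single string, so this works best when the individual HTML files are reasonably small.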

On Thu, Mar 12, 2015 at 9:26 AM, yh18190 <yh18190@gmail.com> wrote:
> Hi. I am fascinated by the Spark framework. I am trying to use PySpark with
> BeautifulSoup to parse HTML files, but I am having trouble loading an HTML
> file into BeautifulSoup.
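> Example:
> from bs4 import BeautifulSoup
> filepath = "file:///path to html directory"
> def readhtml(inputhtml):
>     # load the HTML content
>     soup = BeautifulSoup(inputhtml, "html.parser")
>     return soup
> # textFile() passes one line at a time to readhtml
> loaddata = sc.textFile(filepath).map(readhtml)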
>
> The problem is that Spark treats each loaded file as a text file and
> processes it line by line. I want to load the entire HTML content of a
> file into BeautifulSoup for further processing.
> Does anyone know how to take the whole HTML file as input instead of
> processing it line by line?
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
