spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yh18190 <yh18...@gmail.com>
Subject How to consider HTML files in Spark
Date Thu, 12 Mar 2015 16:26:34 GMT
Hi.I am very much fascinated to Spark framework.I am trying to use Pyspark +
Beautifulsoup to parse HTML files.I am facing problems to load html file
into beautiful soup.
Example
filepath= file:///path to html directory
def readhtml(inputhtml):
{
soup=Beautifulsoup(inputhtml) //to load html content
}
loaddata=sc.textFile(filepath).map(readhtml)

The problem is here spark considers loaded file as textfile and goes through
process line by line.I want to consider to load the entire html content into
Beautifulsoup for further processing..
Does anyone have any idea to how to take the whole html file as input
instead of linebyline processing?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-consider-HTML-files-in-Spark-tp22017.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message