I want to know how I can filter the spam out of a fetched HTML page.

For example, Nutch fetched a news page, but there is a lot of spam content alongside the useful information. How can I filter it out?

Any reply will be appreciated!