Sanitize Content: removing markup from Amazon's content

Posted by StackOverflowNewbie on Stack Overflow
Published on 2011-01-31T09:13:13Z

I'm using Amazon Web Services to fetch product descriptions for various items. The problem is that Amazon's content contains markup that sometimes breaks the layout of my web page (e.g. unclosed DIVs).

I want to sanitize the content I get from Amazon. My plan so far is to do the following (an initial list):

  • Remove unnecessary tags such as div and span, while keeping tags like p, ul, and ol
  • Remove all attributes from all tags (e.g. some of the tags seem to carry style attributes)
  • Collapse excess whitespace (e.g. multiple spaces, carriage returns, newlines, tabs)
  • Etc.
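
To make the list above concrete, here is a minimal sketch of the three steps, written in Python with only the standard library. Everything named here is an assumption for illustration: the `ALLOWED_TAGS` whitelist, the `Sanitizer` class, and the `sanitize` helper are hypothetical, and the tag balancing is deliberately naive (it closes whatever is still open; it does not apply HTML's implicit-close rules).

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever the page layout can tolerate.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "br", "b", "i", "em", "strong"}
VOID_TAGS = {"br"}                  # allowed tags that never take a closing tag
DROP_CONTENT = {"script", "style"}  # discard these tags AND the text inside them


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the cleaned markup
        self.open_tags = []  # allowed tags emitted but not yet closed
        self.skip = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep the text inside it
        self.out.append(f"<{tag}>")  # attrs are ignored, so all attributes go
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
            return
        if tag not in self.open_tags:
            return  # disallowed tag, or a stray close with no matching open
        # Close inner tags that were left open, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))

    def result(self):
        # Close anything the source never closed.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of spaces, tabs, and line breaks into one space.
    return re.sub(r"\s+", " ", parser.result()).strip()


print(sanitize('<div class="x"><p style="color:red">Hello   <span>world</span></p><ul><li>one'))
```

A whitelist seems safer than a blacklist here, because any tag not explicitly allowed is dropped by default. For production use, an established HTML sanitizing library is likely a better choice than a hand-rolled parser like this one.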

Before I go off and build my own solution, I'm wondering if anyone has a better idea or knows of an existing solution. Thanks.

© Stack Overflow or respective owner
