I am developing a website updater. The front end uses HTML, CSS and JavaScript, and the backend uses Python.
The way it works is that <p/>, <b/> and some other HTML elements can be updated by the user. To enable this, I load the web page and, with jQuery, convert all those elements to <textarea/> elements. Once the content of a text area is changed, I apply the change to the original element and send it to a Python script to store the new content.
The problem is that I'm finding that different browsers change the original HTML.
How do you get around this issue?
What Python libraries do you use?
What techniques or application designs do you use to avoid or overcome this issue?
The problems I found are:
IE removes the quotes around class and id attributes. For example, <img class='abc'/> becomes <img class=abc/>.
Firefox removes the trailing slash from line breaks: <br /> becomes <br>.
Some websites have very specific display technicalities, so the insertion of a simple "\n" (which IE does) can affect how the page renders. Example: changing <img class='headingpic' /><div id="maincontent"> to <img class='headingpic'/>\n <div id="maincontent"> inserts a vertical gap in IE.
The things I have unsuccessfully tried to overcome these issues:
Using either jQuery or Python to remove all >\n< occurrences, <br> etc. But this fails because I get different patterns in IE, sometimes a ·\n, sometimes a \n···.
In Python, parsing the new HTML, extracting the new text/content and inserting it into the old HTML, so that the elements and formatting never change, just the content. This is very difficult and seems to be overkill.
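To make the second attempt concrete, here is a minimal sketch of the idea, not a finished solution. It assumes the user only edits text and never adds or removes elements, and that a naive regular expression is enough to find tags (no ">" inside attribute values, comments or scripts). The names TAG, text_chunks and transplant_text are made up for this illustration.

```python
import re

# Naive tag matcher (an assumption): breaks on '>' inside attributes,
# comments or <script> bodies. A real HTML parser would be more robust.
TAG = re.compile(r"(<[^>]+>)")


def text_chunks(html):
    """Return the non-blank text runs between tags, in document order."""
    parts = TAG.split(html)
    # With one capturing group, even indices are text and odd indices are tags.
    return [p for i, p in enumerate(parts) if i % 2 == 0 and p.strip()]


def transplant_text(old_html, new_html):
    """Keep old_html's markup and whitespace; take only the text from new_html."""
    new_texts = text_chunks(new_html)
    if len(new_texts) != len(text_chunks(old_html)):
        raise ValueError("number of text runs differs; cannot align old and new")
    replacements = iter(new_texts)
    out = []
    for i, part in enumerate(TAG.split(old_html)):
        if i % 2 == 0 and part.strip():
            out.append(next(replacements))  # edited text from the browser
        else:
            out.append(part)  # original tag or original whitespace, untouched
    return "".join(out)


if __name__ == "__main__":
    stored = "<img class='headingpic' /><div id=\"maincontent\">Hello</div>"
    from_browser = "<img class=headingpic>\n <div id=maincontent>Hello world</div>"
    print(transplant_text(stored, from_browser))
```

With this, the unquoted attributes, the rewritten <br> and the inserted "\n" never reach the stored file, because only the text runs are taken from what the browser sends back. The obvious weakness is the alignment step: if the user empties a text run, or the browser splits one, the counts no longer match and the function gives up.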