I have the following input:
Hi! How are you? <script>//NOT EVIL!</script>
Wassup? :P
LOOOL!!! :D :D :D
This is then run through an emoticon library and becomes:
Hi! How are you? <script>//NOT EVIL!</script>
Wassup? <img class="smiley" alt="" title="tongue, :P" src="ui/emoticons/15.gif">
LOOOL!!! <img class="smiley" alt="" title="big grin, :D" src="ui/emoticons/5.gif"> <img class="smiley" alt="" title="big grin, :P" src="ui/emoticons/5.gif"> <img class="smiley" alt="" title="big grin, :P" src="ui/emoticons/5.gif">
I have a function that escapes HTML entities to prevent XSS. Running it on the raw input for the first line would produce:
Hi! How are you? &lt;script&gt;//NOT EVIL!&lt;/script&gt;
Now I need to escape all of the input while preserving the emoticons in their original form. So when there is a <:-P emoticon, it stays like that and does not become &lt;:-P.
I was thinking of running a regex split on the emotified text, processing each part on its own, and then concatenating the parts back together, but I am not sure how easily the regex could be bypassed. I know the format will always be this:
[<img class="smiley" alt="]
[empty string]
[" title="]
[one of the values from a big list]
[, ]
[another value from the list (may be matching original emoticon)]
[" src="ui/emoticons/]
[integer from Y to X]
[.gif">]
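A rough sketch of the split idea, to make it concrete. The names escapeHtml, SMILEY_TAG and escapeAroundSmileys are made up for illustration, and the pattern assumes the two title values never contain a double quote. It does not check the title against the list, only the fixed parts of the tag:

```javascript
// Hypothetical helper: escape the characters that matter in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Matches only the exact tag shape listed above (assumption: no double
// quote inside the title). The capturing group makes split() keep the
// matched tags in the resulting array.
var SMILEY_TAG = /(<img class="smiley" alt="" title="[^"]*, [^"]*" src="ui\/emoticons\/\d+\.gif">)/;

function escapeAroundSmileys(emotified) {
  return emotified
    .split(SMILEY_TAG)
    .map(function (part, i) {
      // With one capturing group, odd indexes hold the matched tags;
      // even indexes hold the text between them, which gets escaped.
      return i % 2 === 1 ? part : escapeHtml(part);
    })
    .join("");
}
```

Anything that is not an exact match for the tag shape, such as an img tag with an extra attribute, ends up in a text part and gets escaped. A user who types the exact tag by hand would get it through unescaped, but under the assumptions above that only renders an emoticon image.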
Matching against the list MAY be slow, since I need to run that regex on text that may contain 20 to 40 emoticons, and there may be 5 to 15 text messages to process. What would be an elegant solution to this? I am ready to use a third-party library or jQuery. PHP preprocessing is possible as well.