Detect if 2 HTML fragments have identical hierarchical structure

Posted by sergzach on Stack Overflow
Published on 2012-10-28T10:41:12Z

An example of fragments that have identical hierarchical structure:

(1)
<div>
  <span>It's a message</span>
</div>

(2)
<div>
  <span class='bold'>This is a new text</span>
</div>

An example of fragments that have different structure:

(1)
<div>
  <span><b>It's a message</b></span>
</div>

(2)
<div>
  <span>This is a new text</span>
</div>

So, fragments with identical structure correspond to the same hierarchical tree: the same tag names in the same nesting and order. Text content and attributes are ignored, as the first pair of examples shows.
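
One way to make this definition concrete is to reduce each fragment to a nested tuple of tag names and compare the tuples. The following is a minimal sketch under stated assumptions, not a definitive implementation. The helper name structure_signature is hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml only so that the example runs without extra dependencies. Elements from lxml.html.fromstring expose the same .tag attribute and child iteration, so the same function should accept them, although lxml may also yield comment nodes as children, which would then count toward the structure.

```python
import xml.etree.ElementTree as ET


def structure_signature(el):
    """Return a nested tuple of tag names; text and attributes are ignored."""
    return (el.tag, tuple(structure_signature(child) for child in el))


# The fragments from the examples above
same_1 = ET.fromstring("<div><span>It's a message</span></div>")
same_2 = ET.fromstring("<div><span class='bold'>This is a new text</span></div>")
diff_1 = ET.fromstring("<div><span><b>It's a message</b></span></div>")

print(structure_signature(same_1) == structure_signature(same_2))  # True
print(structure_signature(diff_1) == structure_signature(same_2))  # False
```

Because the signature is an ordinary hashable tuple, it could also serve as a dictionary key for grouping many fragments by structure.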

How can I detect, simply with lxml, whether 2 elements (HTML fragments) have the same structure?

I have a function, but it does not work properly for cases more complex than the examples above:

def _is_equal( el1, el2 ):      
    # input: 2 elements with possible equal structure and tag names
    # e.g. root = lxml.html.fromstring( buf )
    # el1 = root[ 0 ]
    # el2 = root[ 1 ]
    # move from top to bottom, compare elements
    if el1.tag == el2.tag:
        if len( el1 ) == len( el2 ):
            # has no children
            if len( el1 ) == 0:
                return True
            else:
                # iterate one of them, for example el1
                i = 0
                for child1 in el1:
                    child2 = el2[ i ]
                    is_equal2 = _is_equal( child1, child2 )
                    if not is_equal2:
                        return False
                return True                     
        else:
            return False
    else:
        return False
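
A likely cause of the failure, judging from the code above: i is never incremented, so every child of el1 is compared with el2[ 0 ]. That goes unnoticed as long as each element has at most one child, as in the simple examples. Below is a minimal sketch of a version that pairs the children with zip instead. The name same_structure and the sample fragments are hypothetical, and xml.etree.ElementTree stands in for lxml only to keep the example dependency-free; lxml elements support the same .tag, len() and iteration.

```python
import xml.etree.ElementTree as ET


def same_structure(el1, el2):
    """True if both elements have the same tag and pairwise-matching children."""
    if el1.tag != el2.tag or len(el1) != len(el2):
        return False
    # zip pairs the i-th child of el1 with the i-th child of el2
    return all(same_structure(c1, c2) for c1, c2 in zip(el1, el2))


# Hypothetical reduced fragments with more than one child per element
a = ET.fromstring("<div><h2><a>x</a></h2><ul><li>y</li></ul></div>")
b = ET.fromstring("<div><h2><a>z</a></h2><ul><li>w</li></ul></div>")
c = ET.fromstring("<div><h2><a>z</a></h2><ul><li><b>w</b></li></ul></div>")

print(same_structure(a, b))  # True
print(same_structure(a, c))  # False
```

With the original loop, a and b would be reported as different, because the ul child of a would be compared with the h2 child of b.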

The code fails to detect that 2 divs with class='tovar2' have an identical structure:

<body>


    <div class="tovar2">
        <h2 class="new">
            <a href="http://modnyedeti-krsk.ru/magazin/product/333193003">
                ??????  ?/?
            </a>
        </h2>
        <ul class="art">
            <li>
                ???????: <span>1759</span>
            </li>
        </ul>
        <div>
            <div class="wrap" style="width:180px;"> 
                <div class="new">
                    <img src="shop_files/new-t.png" alt="">
                </div>     
                <a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)"> 
                    <img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="??????  ?/?" height="160" width="180"> 
                </a>     
            </div>
        </div>

        <form action="" onsubmit="return addProductForm(17094601,333193003,3150.00,this,false);">
            <ul class="bott ">
                <li class="price">????:<br>
                    <span>
                        <b>
                            3 150
                        </b> ???.
                    </span>
                </li>
                <li class="amount">???-??:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
                </li>
                <li class="buy"><input value="" type="submit">
                </li>
            </ul>
        </form>
    </div>


    <div class="tovar2">
        <h2 class="new">
            <a href="http://modnyedeti-krsk.ru/magazin/product/333124803">??????  ?/?</a>
        </h2>
        <ul class="art">
            <li>
                ???????: <span>1759</span>
            </li>
        </ul>
        <div>
            <div class="wrap" style="width:180px;"> 
                <div class="new">
                    <img src="shop_files/new-t.png" alt="">
                </div>     
                <a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)"> 
                    <img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="??????  ?/?" height="160" width="180"> 
                </a>      
            </div>
        </div>      

        <form action="" onsubmit="return addProductForm(17094601,333124803,3150.00,this,false);">
            <ul class="bott ">
                <li class="price">????:<br>
                    <span>
                        <b>3 150</b> ???.
                    </span>
                </li>
                <li class="amount">???-??:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
                </li>
                <li class="buy">
                    <input value="" type="submit">
                </li>
            </ul>
        </form>
    </div>

    </body>        
