Detect if 2 HTML fragments have identical hierarchical structure
Posted by sergzach on Stack Overflow
Published on 2012-10-28T10:41:12Z
An example of fragments that have identical hierarchical structure:
(1)
<div>
<span>It's a message</span>
</div>
(2)
<div>
<span class='bold'>This is a new text</span>
</div>
An example of fragments that have different structure:
(1)
<div>
<span><b>It's a message</b></span>
</div>
(2)
<div>
<span>This is a new text</span>
</div>
So, two fragments have the same structure when they correspond to the same hierarchical tree: the same tag names with the same nesting. Attributes and text content do not matter.
How can I detect, using just lxml, whether 2 elements (HTML fragments) have the same structure?
I have a function, but it does not work properly for cases more complex than the examples above:
def _is_equal( el1, el2 ):
    # input: 2 elements with possible equal structure and tag names
    # e.g. root = lxml.html.fromstring( buf )
    #      el1 = root[ 0 ]
    #      el2 = root[ 1 ]
    # move from top to bottom, compare elements
    result = False
    if el1.tag == el2.tag:
        # has no children
        if len( el1 ) == len( el2 ):
            if len( el1 ) == 0:
                return True
            else:
                # iterate one of them, for example el1
                i = 0
                for child1 in el1:
                    child2 = el2[ i ]
                    is_equal2 = _is_equal( child1, child2 )
                    if not is_equal2:
                        return False
                return True
        else:
            return False
    else:
        return False
The code fails to detect that 2 divs with class='tovar2' have an identical structure:
<body>
<div class="tovar2">
<h2 class="new">
<a href="http://modnyedeti-krsk.ru/magazin/product/333193003">
?????? ?/?
</a>
</h2>
<ul class="art">
<li>
???????: <span>1759</span>
</li>
</ul>
<div>
<div class="wrap" style="width:180px;">
<div class="new">
<img src="shop_files/new-t.png" alt="">
</div>
<a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
<img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="?????? ?/?" height="160" width="180">
</a>
</div>
</div>
<form action="" onsubmit="return addProductForm(17094601,333193003,3150.00,this,false);">
<ul class="bott ">
<li class="price">????:<br>
<span>
<b>
3 150
</b> ???.
</span>
</li>
<li class="amount">???-??:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
</li>
<li class="buy"><input value="" type="submit">
</li>
</ul>
</form>
</div>
<div class="tovar2">
<h2 class="new">
<a href="http://modnyedeti-krsk.ru/magazin/product/333124803">?????? ?/?</a>
</h2>
<ul class="art">
<li>
???????: <span>1759</span>
</li>
</ul>
<div>
<div class="wrap" style="width:180px;">
<div class="new">
<img src="shop_files/new-t.png" alt="">
</div>
<a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
<img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="?????? ?/?" height="160" width="180">
</a>
</div>
</div>
<form action="" onsubmit="return addProductForm(17094601,333124803,3150.00,this,false);">
<ul class="bott ">
<li class="price">????:<br>
<span>
<b>3 150</b> ???.
</span>
</li>
<li class="amount">???-??:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
</li>
<li class="buy">
<input value="" type="submit">
</li>
</ul>
</form>
</div>
</body>
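A likely cause of the failure is that the index i in _is_equal is never incremented inside the loop, so every child of el1 is compared against el2[ 0 ]. For the two tovar2 divs this means the ul of the first div is compared with the h2 of the second, and the function returns False. Below is a minimal sketch of a corrected comparison that pairs children with zip instead of a manual index. The name same_structure is hypothetical, and skipping comment nodes is an assumption about what "structure" should mean. The function uses only .tag and child iteration, which lxml elements share with the standard library's xml.etree.ElementTree; the demo parses with the standard library only so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def same_structure(el1, el2):
    """Return True if both elements have the same tag and, recursively,
    the same sequence of child tags. Attributes and text are ignored."""
    if el1.tag != el2.tag:
        return False
    # Keep only real elements. Comments and processing instructions have
    # a non-string .tag and are skipped here (an assumption, see above).
    children1 = [c for c in el1 if isinstance(c.tag, str)]
    children2 = [c for c in el2 if isinstance(c.tag, str)]
    if len(children1) != len(children2):
        return False
    # zip pairs the i-th child of el1 with the i-th child of el2,
    # which is what the manual index in _is_equal was meant to do.
    return all(same_structure(c1, c2) for c1, c2 in zip(children1, children2))


# Demo with the small fragments from the question.
root = ET.fromstring(
    "<body>"
    "<div><span>It's a message</span></div>"
    "<div><span class='bold'>This is a new text</span></div>"
    "<div><span><b>It's a message</b></span></div>"
    "</body>"
)
identical = same_structure(root[0], root[1])  # differ only in attribute and text
different = same_structure(root[0], root[2])  # <b> adds an extra level
print(identical, different)  # prints: True False
```

With lxml, the same function should accept the elements from the question directly, for example root = lxml.html.fromstring( buf ) followed by same_structure( root[ 0 ], root[ 1 ] ).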
© Stack Overflow or respective owner