Can not parse table information from html document.
Posted
by Harikrishna
on Stack Overflow
See other posts from Stack Overflow
or by Harikrishna
Published on 2010-03-31T09:18:39Z
Indexed on
2010/03/31
9:23 UTC
Read the original article
Hit count: 532
I am parsing many html documents.I am using html agility pack And I want to parse the tabular information from each document. And there may be any number of tables in each document.But I want to extract only one table from each document which has column header name NAME,PHONE NO,ADDRESS.And this table can be anywhere in the document,like in the document there is ten tables and from ten table there is one table which has many nested tables and from nested table there may be a table what I want to extract means table can be anywhere in the document and I want to find that table from the document by column header name.If I got that table then I want to then extract the information from that table.
Now I can find the table which has column header NAME,PHONE NO,ADDRESS and also can extract the information from that.I am doing for that is, first I find the all tables in a document by
foreach (var table in doc.DocumentNode.Descendants("table"))
then for each table got I find the row for each table like,
var rows = table.Descendants("tr");
and then for each row I am checking that row has that header name NAME,ADDRESS,PHONENO and if it is then I skip that row and extract all information after that row
foreach (var row in rows.Skip(rowNo))
{
var data = new List<string>();
foreach (var column in row.Descendants("td"))
{
data.Add(properText);
}
}
Such that I am extracting all information from almost many document. But now problem is sometimes what happened that in some document I can not parse the information.Like a document in which there are like 10 tables and from these 10 tables 1 table is like there are many nested tables in that table. And from these nested tables I want to find the table which tabel has column header like NAME,ADDRESS,PHONE NO.So if table may be anywhere in the document even in the nested tables or anywhere it can be find through column header name.So I can parse the information from that table and skip the outer tabular information from that table.
© Stack Overflow or respective owner