Parse and transform XML with missing elements into table structure
- by dnlbrky
I'm trying to parse an XML file. A simplified version of it looks like this:
x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'
library(XML)
xmlRoot(xmlTreeParse(x))
## <grandparent>
## <parent>
## <child1>ABC123</child1>
## <child2>1381956044</child2>
## </parent>
## <parent>
## <child2>1397527137</child2>
## </parent>
## <parent>
## <child3>4675</child3>
## </parent>
## <parent>
## <child1>DEF456</child1>
## <child3>3735</child3>
## </parent>
## <parent>
## <child1/>
## <child3>3735</child3>
## </parent>
## </grandparent>
I'd like to transform the XML into a data.frame / data.table that looks like this:
parent <- data.frame(child1=c("ABC123",NA,NA,"DEF456",NA), child2=c(1381956044, 1397527137, rep(NA, 3)), child3=c(rep(NA, 2), 4675, 3735, 3735))
parent
## child1 child2 child3
## 1 ABC123 1381956044 NA
## 2 <NA> 1397527137 NA
## 3 <NA> NA 4675
## 4 DEF456 NA 3735
## 5 <NA> NA 3735
If each parent node always contained all of the possible elements ("child1", "child2", "child3", etc.), I could use xmlToList and unlist to flatten it, and then dcast to put it into a table. But the XML often has missing child elements. Here is an attempt with incorrect output:
library(data.table)
## Flatten:
dt <- as.data.table(unlist(xmlToList(x)), keep.rownames=TRUE)
setnames(dt, c("column", "value"))
## Add row numbers, but they're incorrect due to missing XML elements:
dt[, row:=seq_len(.N), by=column][]
## column value row
## 1: parent.child1 ABC123 1
## 2: parent.child2 1381956044 1
## 3: parent.child2 1397527137 2
## 4: parent.child3 4675 1
## 5: parent.child1 DEF456 2
## 6: parent.child3 3735 2
## 7: parent.child3 3735 3
## Reshape from long to wide, but some values are in the wrong row:
dcast.data.table(dt, row~column, value.var="value", fill=NA)
## row parent.child1 parent.child2 parent.child3
## 1: 1 ABC123 1381956044 4675
## 2: 2 DEF456 1397527137 3735
## 3: 3 NA NA 3735
I won't know ahead of time the names of the child elements, or how many distinct child element names occur across the parent nodes, so the answer should be flexible.
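For reference, here is a minimal sketch of one possible approach, not a definitive solution. Instead of flattening the whole document, it builds one named list per parent node and lets rbindlist(fill=TRUE) pad the missing columns with NA, so the child names never have to be known in advance. The names doc, parents, rows and wide are just illustrative, and the sketch assumes that an empty element such as <child1/> should become NA and that numeric-looking columns should be converted from character.

```r
library(XML)
library(data.table)

x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

## Parse to an internal document so XPath can be used:
doc <- xmlParse(x, asText=TRUE)
parents <- getNodeSet(doc, "/grandparent/parent")

## Build one named list per <parent>, using whatever child elements it has:
rows <- lapply(parents, function(p) {
  kids <- getNodeSet(p, "./*")   # element children only
  nms <- vapply(kids, xmlName, character(1), USE.NAMES=FALSE)
  vals <- vapply(kids, function(k) {
    v <- xmlValue(k)
    ## Assumption: an empty element (e.g. <child1/>) is treated as NA
    if (length(v) == 0L || is.na(v) || !nzchar(v)) NA_character_ else v
  }, character(1), USE.NAMES=FALSE)
  as.list(setNames(vals, nms))
})

## Stack the rows; fill=TRUE adds NA for elements a parent doesn't have:
wide <- rbindlist(rows, fill=TRUE)

## Everything is character at this point; convert numeric-looking columns:
wide[, (names(wide)) := lapply(.SD, type.convert, as.is=TRUE)]
wide
```

Because each parent keeps its own row from the start, values can't drift into the wrong row the way they do when row numbers are assigned per column after unlist.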