Efficient alternative to merge() when building dataframe from json files with R?
Posted by Bryan on Stack Overflow
Published on 2011-03-03T20:11:06Z
I have written the following code which works, but is painfully slow once I start executing it over thousands of records:
require("RJSONIO")
people_data <- data.frame(person_id=numeric(0))
json_data <- fromJSON(json_file)
n_people <- length(json_data)
for (person in 1:n_people) {
  person_dataframe <- as.data.frame(t(unlist(json_data[[person]])))
  people_data <- merge(people_data, person_dataframe, all=TRUE)
}
output_file <- paste("people_data", ".csv", sep="")
write.csv(people_data, file=output_file)
I am attempting to build a unified data table from a series of JSON-formatted files. The fromJSON() function reads in the data as lists of lists. Each element of the list is a person, which then contains a list of the attributes for that person.
For example:
[[1]]
person_id
name
gender
hair_color
[[2]]
person_id
name
location
gender
height
[[...]]
structure(list(person_id = "Amy123", name = "Amy", gender = "F",
hair_color = "brown"),
.Names = c("person_id", "name", "gender", "hair_color"))
structure(list(person_id = "matt53", name = "Matt",
location = structure(c(47231, "IN"),
.Names = c("zip_code", "state")),
gender = "M", height = 172),
.Names = c("person_id", "name", "location", "gender", "height"))
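To illustrate the flattening step (this snippet is just a small demonstration, with the nested record typed in by hand rather than read from a file):

```r
# unlist() flattens a nested named list into a single named vector,
# joining the names of nested elements with a dot.
matt <- list(person_id = "matt53", name = "Matt",
             location = list(zip_code = "47231", state = "IN"),
             gender = "M", height = 172)
flat <- unlist(matt)
names(flat)
# "person_id" "name" "location.zip_code" "location.state" "gender" "height"
```

Note that unlist() also coerces everything to a common type, so the numeric height becomes the character string "172".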
The end result of the code above is a matrix where the columns are every person-attribute that appears in the structure above, and the rows are the relevant values for each person. As you can see, though, some data is missing for some of the people, so I need to ensure those show up as NA and that things end up in the right columns. Further, location itself is a vector with two components, state and zip_code, meaning it needs to be flattened to location.state and location.zip_code before it can be merged with another person record; this is what I use unlist() for. I then keep the running master table in people_data.
The above code works, but do you know of a more efficient way to accomplish what I'm trying to do? It appears that merge() is slowing this to a crawl; I have hundreds of files with hundreds of people in each file.
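For reference, here is a minimal sketch of the kind of alternative I have in mind (using two hand-typed records in place of fromJSON(json_file)): flatten every record first, take the union of all column names once, and fill a pre-allocated matrix instead of calling merge() once per person.

```r
# Stand-in for fromJSON(json_file): a list of person records.
json_data <- list(
  list(person_id = "Amy123", name = "Amy", gender = "F",
       hair_color = "brown"),
  list(person_id = "matt53", name = "Matt",
       location = list(zip_code = "47231", state = "IN"),
       gender = "M", height = 172)
)

rows <- lapply(json_data, unlist)                # flatten each person
all_cols <- unique(unlist(lapply(rows, names)))  # union of all attributes

# Pre-allocate one character matrix; missing attributes stay NA.
people_mat <- matrix(NA_character_,
                     nrow = length(rows), ncol = length(all_cols),
                     dimnames = list(NULL, all_cols))
for (i in seq_along(rows)) {
  people_mat[i, names(rows[[i]])] <- rows[[i]]   # fill known columns by name
}
people_data <- as.data.frame(people_mat, stringsAsFactors = FALSE)
```

This does two passes over the data but avoids the repeated merge() calls, which re-scan and re-copy the growing master table on every iteration.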
Thanks! Bryan