One of the rather idiosyncratic things about working in pharma (although I am sure this is true for other highly regulated environments) is systems stay around for a while before they are replaced. Because of these old systems, this week I ran into the need to read in a bunch of tables from an xml document. For those who have never used an xml document, technically there isn’t such as thing as a table, but style sheets can be used to create tables. So you get something that looks like this when the xml is opened in a browser:
But, if you look at xml in a text editor you get something like this:
There are two main packages in R to handle xml files, XML and xml2. XML is the older package and contains a bunch of functions. xml2 is the RStudio backed xml package. It has fewer functions, but it is quite slick with good documentation.
I stuck with XML for my project because I found it easier to navigate the XML file I was using. The first step to reading in your table is to read in the document using the xmlParse function.
library(XML)
library(purrr)
library(dplyr)
doc <- xmlParse("data/document.xml")
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
This creates an XMLInternalDocument, which we can get information from.
Before we can get too far into making a dataframe out of the xml file, we need to understand a bit more about how xml files are structured. Basically, xml files are just trees. Everything in the xml is part of the tree and you need to use xpath, the xml selection language, in order to find the branch/leaf you are looking for.
From looking at the xml file we can see catalog is the highest level in the tree, followed by pet, and then information about the pet.
For my table I want get all the pet information in the pet lowest node. So I am going to use the getNodeSet
function, which will return a list of nodes. In this case each node will be all the information about one pet
pet_list <- getNodeSet(doc, "//catalog/pet")
pet_list[[1]]
## <pet>
## <name>Mittens</name>
## <owner>Isabelle Khan</owner>
## <species>cat</species>
## <breed>Scottish fold</breed>
## <age>4</age>
## </pet>
Now that I have all the nodes I want in list I am able to convert the list to a dataframe using xmlToDataFrame
.
xmlToDataFrame(nodes = pet_list)
## name owner species breed age
## 1 Mittens Isabelle Khan cat Scottish fold 4
## 2 Cody Layton Claytonn dog Collie 8
## 3 Arrow Chihiro Yamada bird Grey Parrot 14
## 4 Jim Alex Miller cat British Shorthair 4
## 5 Cashew Hafsah Downs cat Norwegian Forest Cat 2
## 6 Chase Carrie Pope dog Labrador Retriever 1
## 7 Otis Jim Chandler turtle Yellow-Bellied Slider 20
More complicated xml documents will split information between attributes and elements. In this dummy example we can see the name, which was an element before is now a name attribute.
mix_doc <- xmlParse("data/document2.xml")
pet_list <- getNodeSet(mix_doc, "//catalog/pet")
pet_list[1]
## [[1]]
## <pet name="Mittens">
## <owner>Isabelle Khan</owner>
## <species>cat</species>
## <breed>Scottish fold</breed>
## <age>4</age>
## </pet>
This means that if we do the same thing we did above, we would not get the pet’s name in our dataset.
xmlToDataFrame(nodes = pet_list)
## owner species breed age
## 1 Isabelle Khan cat Scottish fold 4
## 2 Layton Claytonn dog Collie 8
## 3 Chihiro Yamada bird Grey Parrot 14
## 4 Alex Miller cat British Shorthair 4
## 5 Hafsah Downs cat Norwegian Forest Cat 2
## 6 Carrie Pope dog Labrador Retriever 1
## 7 Jim Chandler turtle Yellow-Bellied Slider 20
To get the pet’s name we will need to get the information from the attribute as well as the elements.
For the names we will use the xmlAttrs
function to get the name attribute from each node. Unlike, the xmlToDataFrame
function xmlAttrs
expects a node rather than a node set. To ensure only a node is passed we will use map_chr
with xmlAttrs
to get the names out.
For the information in the elements, we can use the xmlToDataFrame
as before.
xmlToDataFrame(nodes = pet_list) %>%
mutate(pet = pet_list %>%
map_chr(xmlAttrs, "name"))
## owner species breed age pet
## 1 Isabelle Khan cat Scottish fold 4 Mittens
## 2 Layton Claytonn dog Collie 8 Cody
## 3 Chihiro Yamada bird Grey Parrot 14 Arrow
## 4 Alex Miller cat British Shorthair 4 Jim
## 5 Hafsah Downs cat Norwegian Forest Cat 2 Cashew
## 6 Carrie Pope dog Labrador Retriever 1 Chase
## 7 Jim Chandler turtle Yellow-Bellied Slider 20 Otis
With these functions (along with xmlValue
, which can be quite helpful), you can parse most xml’s you’ll run across. I think the biggest thing I learned when working with xml, is to look at the data. Cause xml is a tree, if you get the wrong branch nothing will make sense. So looking at the data is super helpful when debugging the xpath and when figuring out what information is an element and what is an attribute.