2.2.3. Working with XML Data
1. Introduction to XML
- XML (Extensible Markup Language) is a widely used, human- and machine-readable format for data storage and exchange.
- XML is hierarchical and nested, similar to JSON, but uses nodes, tags, and elements instead of key-value pairs.
- Each XML document has a tree structure, with elements that can contain attributes, text, or other elements.
- XML is common in web services, APIs, regulatory submissions (e.g., SDTM Define.xml), and data interchange between systems.
Example of XML Structure:
<Subject>
<USUBJID>1015</USUBJID>
<SEX>M</SEX>
<AGE>56</AGE>
<ARM>Placebo</ARM>
</Subject>
- Tags (e.g., <USUBJID>) define the type of data.
- Elements can be nested to represent complex relationships.
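To make the tree structure concrete, the following minimal sketch parses the example record from a literal string and inspects it with the xml2 package (covered in section 3). The variable name `subject` is illustrative only.

```r
library(xml2)

# Parse the example record from a literal string (no file needed)
subject <- read_xml("<Subject>
  <USUBJID>1015</USUBJID>
  <SEX>M</SEX>
  <AGE>56</AGE>
  <ARM>Placebo</ARM>
</Subject>")

# Name of the root element
xml_name(subject)                 # "Subject"

# Names of the child elements, in document order
xml_name(xml_children(subject))   # "USUBJID" "SEX" "AGE" "ARM"

# Text stored inside each child element (always character)
xml_text(xml_children(subject))   # "1015" "M" "56" "Placebo"
```

Note that every value comes back as text; converting AGE to a number is a separate step.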
2. XML vs JSON and Tabular Data
- Tabular data (CSV, Excel) is flat, with rows and columns.
- JSON and XML both support nested, hierarchical data.
- XML is more verbose than JSON and uses explicit opening and closing tags.
- XML is often used for data exchange in clinical research, regulatory submissions, and legacy systems.
3. Importing XML Data in R
- The xml2 package is the standard for reading and working with XML in R.
3.1. Reading XML from a File
# install.packages("xml2")
library(xml2)
doc <- read_xml("sdtm_subjects.xml")
doc
Input Example (sdtm_subjects.xml):
<Subjects>
<Subject>
<USUBJID>1015</USUBJID>
<SEX>M</SEX>
<AGE>56</AGE>
<ARM>Placebo</ARM>
</Subject>
<Subject>
<USUBJID>1016</USUBJID>
<SEX>F</SEX>
<AGE>62</AGE>
<ARM>Active</ARM>
</Subject>
</Subjects>
Expected Output (xml_document):
> doc
{xml_document}
<Subjects>
[1] <Subject>\n <USUBJID>1015</USUBJID>\n <SEX>M</SEX>\n <AGE>56</AGE>\n <ARM>Placebo</ARM>\n</Subject>
[2] <Subject>\n <USUBJID>1016</USUBJID>\n <SEX>F</SEX>\n <AGE>62</AGE>\n <ARM>Active</ARM>\n</Subject>
3.2. Extracting Data from XML
- Use xml_find_all() and xml_text() to extract values.
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_text(xml_find_all(subjects, "./USUBJID"))
sex <- xml_text(xml_find_all(subjects, "./SEX"))
age <- as.integer(xml_text(xml_find_all(subjects, "./AGE")))
arm <- xml_text(xml_find_all(subjects, "./ARM"))
sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
sdtm_df
Output (Data Frame):
> sdtm_df
USUBJID SEX AGE ARM
1 1015 M 56 Placebo
2 1016 F 62 Active
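One caveat with the approach above: xml_find_all() returns a flat set of matches, so if a subject lacks one of the elements, the extracted vectors can end up with different lengths. The sketch below, which uses a hypothetical record with a missing AGE, shows how xml_find_first() keeps one result per subject instead.

```r
library(xml2)

# Hypothetical input: the second subject has no <AGE> element
doc <- read_xml("<Subjects>
  <Subject><USUBJID>1015</USUBJID><SEX>M</SEX><AGE>56</AGE></Subject>
  <Subject><USUBJID>1016</USUBJID><SEX>F</SEX></Subject>
</Subjects>")
subjects <- xml_find_all(doc, ".//Subject")

# xml_find_all() flattens matches across subjects: only one AGE node is found
n_age_nodes <- length(xml_find_all(subjects, "./AGE"))
n_age_nodes   # 1, although there are 2 subjects

# xml_find_first() returns exactly one result per subject;
# a missing node yields NA when passed to xml_text()
usubjid <- xml_text(xml_find_first(subjects, "./USUBJID"))
age     <- as.integer(xml_text(xml_find_first(subjects, "./AGE")))

aligned_df <- data.frame(USUBJID = usubjid, AGE = age, stringsAsFactors = FALSE)
aligned_df
```

With xml_find_first(), the second row carries AGE = NA rather than silently shifting values between subjects.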
4. Exporting Data to XML in R
- R does not have a base function for exporting data frames to XML, but you can use xml2 to build XML nodes programmatically.
library(xml2)
subjects_xml <- xml_new_root("Subjects")
for (i in seq_len(nrow(sdtm_df))) {
subject <- xml_add_child(subjects_xml, "Subject")
xml_add_child(subject, "USUBJID", sdtm_df$USUBJID[i])
xml_add_child(subject, "SEX", sdtm_df$SEX[i])
xml_add_child(subject, "AGE", as.character(sdtm_df$AGE[i]))
xml_add_child(subject, "ARM", sdtm_df$ARM[i])
}
write_xml(subjects_xml, "exported_subjects.xml")
Output (exported_subjects.xml; write_xml() also writes an XML declaration line at the top of the file, omitted here):
<Subjects>
<Subject>
<USUBJID>1015</USUBJID>
<SEX>M</SEX>
<AGE>56</AGE>
<ARM>Placebo</ARM>
</Subject>
<Subject>
<USUBJID>1016</USUBJID>
<SEX>F</SEX>
<AGE>62</AGE>
<ARM>Active</ARM>
</Subject>
</Subjects>
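A quick way to check an export is to read the file back and compare it with the original data frame. The sketch below is one possible round-trip check, assuming the same four columns as above; it writes to a temporary file, and the names `root`, `out_file`, and `round_trip` are illustrative.

```r
library(xml2)

sdtm_df <- data.frame(
  USUBJID = c("1015", "1016"),
  SEX     = c("M", "F"),
  AGE     = c(56L, 62L),
  ARM     = c("Placebo", "Active"),
  stringsAsFactors = FALSE
)

# Build the XML tree: one <Subject> per row, one child element per column
root <- xml_new_root("Subjects")
for (i in seq_len(nrow(sdtm_df))) {
  subject <- xml_add_child(root, "Subject")
  for (col in names(sdtm_df)) {
    xml_add_child(subject, col, as.character(sdtm_df[[col]][i]))
  }
}

# Write to a temporary file and read it back
out_file <- tempfile(fileext = ".xml")
write_xml(root, out_file)
back  <- read_xml(out_file)
nodes <- xml_find_all(back, ".//Subject")

# Rebuild the data frame from the exported file
round_trip <- data.frame(
  USUBJID = xml_text(xml_find_first(nodes, "./USUBJID")),
  SEX     = xml_text(xml_find_first(nodes, "./SEX")),
  AGE     = as.integer(xml_text(xml_find_first(nodes, "./AGE"))),
  ARM     = xml_text(xml_find_first(nodes, "./ARM")),
  stringsAsFactors = FALSE
)

# TRUE when the export preserved every value and type
round_trip_ok <- isTRUE(all.equal(round_trip, sdtm_df))
round_trip_ok
```

Because XML stores everything as text, the type conversion (here as.integer() for AGE) has to be reapplied on import for the comparison to succeed.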
5. Input and Output Table Summary
| Operation | R Function / Package | Input Example | Output Example |
|---|---|---|---|
| Import XML file | read_xml() | .xml file | xml_document |
| Extract data to table | xml_find_all() | xml_document | Data frame |
| Export to XML | xml_new_root(), xml_add_child(), write_xml() | Data frame | .xml file |
6. Beyond the Basics: Exploring XML in R
Working with XML in real-world scenarios often involves more than just reading and writing simple files. Here are advanced techniques and considerations:
Handling XML Attributes
- XML elements can store data as attributes (e.g., <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>).
- Use xml_attr() to extract attribute values from nodes.
- This is common in regulatory files (e.g., SDTM Define.xml), where metadata is often stored as attributes.
Input Example (attributes):
<Subjects>
<Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
<Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
</Subjects>
R Example:
library(xml2)
doc <- read_xml("sdtm_subjects_attributes.xml")
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_attr(subjects, "USUBJID")
sex <- xml_attr(subjects, "SEX")
age <- as.integer(xml_attr(subjects, "AGE"))
arm <- xml_attr(subjects, "ARM")
sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
sdtm_df
Output:
> sdtm_df
USUBJID SEX AGE ARM
1 1015 M 56 Placebo
2 1016 F 62 Active
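When every node carries the same set of attributes, xml_attrs() (plural) can pull them all at once instead of calling xml_attr() per column. The following is a sketch under that assumption; it would need extra handling if some nodes had missing or differently ordered attributes. The names `attr_list` and `attr_df` are illustrative.

```r
library(xml2)

doc <- read_xml('<Subjects>
  <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
  <Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
</Subjects>')
subjects <- xml_find_all(doc, ".//Subject")

# xml_attrs() returns a list with one named character vector per node
attr_list <- xml_attrs(subjects)

# Stack the vectors into a character matrix, then convert to a data frame
# (assumes identical attribute names and order on every node)
attr_df <- as.data.frame(do.call(rbind, attr_list), stringsAsFactors = FALSE)

# Attribute values are always text, so convert numeric columns explicitly
attr_df$AGE <- as.integer(attr_df$AGE)
attr_df
```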
Parsing Complex or Nested XML
- Real-world XML often contains deeply nested structures.
- Use XPath queries with xml_find_all() or xml_find_first() to navigate and extract data from nested nodes.
- XPath expressions like .//Node/SubNode help target specific elements regardless of depth.
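The difference between `./` (direct children only) and `.//` (descendants at any depth) is easy to miss. The short sketch below illustrates it on a small nested record parsed from a literal string; the variable names are illustrative.

```r
library(xml2)

doc <- read_xml("<Study><Subjects><Subject>
  <Demographics><USUBJID>1015</USUBJID><SEX>M</SEX></Demographics>
  <ARM>Placebo</ARM>
</Subject></Subjects></Study>")
subject <- xml_find_first(doc, ".//Subject")

# "./USUBJID" only looks at direct children of <Subject>, so nothing matches
n_direct <- length(xml_find_all(subject, "./USUBJID"))
n_direct   # 0

# ".//USUBJID" searches all descendants, whatever their depth
id_any_depth <- xml_text(xml_find_first(subject, ".//USUBJID"))
id_any_depth   # "1015"

# A full relative path is more explicit and avoids accidental matches
sex_by_path <- xml_text(xml_find_first(subject, "./Demographics/SEX"))
sex_by_path   # "M"
```

An explicit path is usually the safer choice when the same element name can appear at several levels of the document.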
Example:
<Study>
<Subjects>
<Subject>
<Demographics>
<USUBJID>1015</USUBJID>
<SEX>M</SEX>
</Demographics>
<ARM>Placebo</ARM>
</Subject>
</Subjects>
</Study>
R Example:
doc <- read_xml("nested_subjects.xml")
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_text(xml_find_all(subjects, "./Demographics/USUBJID"))
sex <- xml_text(xml_find_all(subjects, "./Demographics/SEX"))
arm <- xml_text(xml_find_all(subjects, "./ARM"))
data.frame(USUBJID = usubjid, SEX = sex, ARM = arm)
Output:
USUBJID SEX ARM
1 1015 M Placebo
Exploring XML Structure and Comments
- Sometimes, you need to explore an unknown XML structure or inspect comments and non-element nodes.
- Use xml_children() to list the child element nodes of a node (comments and text nodes are not included).
- Use xml_type() to identify node types (element, comment, text, etc.).
- Use xml_contents() to access all content, including comments and text nodes.
Example:
<Subjects>
<!-- This is a comment -->
<Subject USUBJID="1015"/>
<Subject USUBJID="1016"/>
</Subjects>
R Example:
doc <- read_xml("subjects_with_comments.xml")
children <- xml_children(doc)
types <- sapply(xml_contents(doc), xml_type)
children
types
Output Example:
> children
{xml_nodeset (2)}
[1] <Subject USUBJID="1015"/>
[2] <Subject USUBJID="1016"/>
> types
[1] "comment" "element" "element"
Working with Regulatory Files
- Regulatory XML files (e.g., SDTM Define.xml) often use attributes and nested metadata.
- Use XPath to extract specific metadata, and combine with xml_attr() for attributes.
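As a rough illustration, the sketch below extracts variable-level metadata from a heavily simplified, hypothetical fragment in the style of Define-XML. The fragment is not a valid Define.xml file; it only mimics the ODM convention of ItemDef elements with OID, Name, and DataType attributes. Such files declare an XML namespace, so the sketch assumes it is acceptable to drop the default namespace with xml_ns_strip() before querying (this modifies the parsed document in place).

```r
library(xml2)

# Hypothetical, simplified fragment (not a complete or valid Define.xml)
define_txt <- '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="STUDY01">
    <MetaDataVersion OID="MDV.1" Name="Example">
      <ItemDef OID="IT.DM.AGE" Name="AGE" DataType="integer"/>
      <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text"/>
    </MetaDataVersion>
  </Study>
</ODM>'
define_doc <- read_xml(define_txt)

# Remove the default namespace so plain XPath names match
xml_ns_strip(define_doc)

# Locate the metadata nodes with XPath, then read their attributes
items <- xml_find_all(define_doc, ".//ItemDef")
item_meta <- data.frame(
  OID      = xml_attr(items, "OID"),
  Name     = xml_attr(items, "Name"),
  DataType = xml_attr(items, "DataType"),
  stringsAsFactors = FALSE
)
item_meta
```

If the namespace must be preserved, an alternative is to keep it and use the prefixes reported by xml_ns() in the XPath expressions.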
Converting XML to Other Formats
- Use jsonlite::toJSON() to convert parsed XML data frames to JSON.
- Use writexl::write_xlsx() to export data frames extracted from XML to Excel.
Example:
library(jsonlite)
json_data <- toJSON(sdtm_df, pretty = TRUE)
write(json_data, "subjects.json")
library(writexl)
write_xlsx(sdtm_df, "subjects.xlsx")
Validating XML
- XML validation ensures the file conforms to a schema (XSD).
- Use xml_validate() from the xml2 package (requires schema file).
Example:
schema <- read_xml("define_schema.xsd")
valid <- xml_validate(doc, schema)
valid # TRUE if valid, FALSE otherwise
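For a self-contained illustration, the sketch below validates two small documents against a toy schema defined inline. The schema is hypothetical and far simpler than any regulatory XSD; it only requires each Subject to contain a USUBJID followed by an integer AGE. When validation fails, xml_validate() attaches the parser messages to its result in an "errors" attribute.

```r
library(xml2)

# Toy schema (an assumption for illustration, not a real regulatory XSD)
schema <- read_xml('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Subjects">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Subject" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="USUBJID" type="xs:string"/>
              <xs:element name="AGE" type="xs:integer"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

good_doc <- read_xml(
  "<Subjects><Subject><USUBJID>1015</USUBJID><AGE>56</AGE></Subject></Subjects>"
)
# AGE is not an integer here, so validation should fail
bad_doc <- read_xml(
  "<Subjects><Subject><USUBJID>1016</USUBJID><AGE>sixty</AGE></Subject></Subjects>"
)

good_result <- xml_validate(good_doc, schema)
bad_result  <- xml_validate(bad_doc, schema)

as.logical(good_result)      # TRUE
as.logical(bad_result)       # FALSE

# Messages explaining why validation failed
attr(bad_result, "errors")
```

Inspecting the "errors" attribute is usually more helpful than the bare TRUE/FALSE when tracking down which element violates the schema.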
Summary Table: Advanced XML Operations
| Task | Function/Package | Example Input/Output |
|---|---|---|
| Extract attributes | xml_attr() | <Subject USUBJID="1015"/> |
| Parse nested XML | xml_find_all() + XPath | Nested XML structure |
| Explore structure/comments | xml_children(), xml_type(), xml_contents() | XML with comments |
| Convert to JSON/Excel | jsonlite, writexl | Data frame to JSON/Excel |
| Validate XML | xml_validate() | XML + XSD schema |
Best Practices:
- Always inspect the XML structure before extraction.
- Use XPath for flexible and powerful data selection.
- Validate XML files when working with regulatory or critical data.
- Document your extraction and transformation logic for reproducibility.
7. Summary and Best Practices
- XML is ideal for hierarchical, nested, and metadata-rich data, and is widely used in clinical and regulatory domains.
- Use the xml2 package for robust XML import, parsing, and export in R.
- Always inspect the structure of imported XML, especially for nested or attribute-based data.
- Use XPath queries for flexible extraction of information.
- For reproducibility, document the structure of your XML data and any transformations applied.