2.2.3. Working with XML Data

1. Introduction to XML

XML (Extensible Markup Language) is a widely used, human- and machine-readable format for data storage and exchange.
XML is hierarchical and nested, similar to JSON, but represents data as elements enclosed in tags (optionally with attributes) rather than as key-value pairs.
Each XML document has a tree structure, with elements that can contain attributes, text, or other elements.
XML is common in web services, APIs, regulatory submissions (e.g., SDTM Define.xml), and data interchange between systems.

Example of XML Structure:

<Subject>
  <USUBJID>1015</USUBJID>
  <SEX>M</SEX>
  <AGE>56</AGE>
  <ARM>Placebo</ARM>
</Subject>

Tags (e.g., <USUBJID>) name each element and indicate what its content represents; the content itself is stored as text.
Elements can be nested to represent complex relationships.
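
As a small illustration of nesting, the sketch below parses an XML string in R with the xml2 package (covered in Section 3) and lists the element names. The <Visits> and <Visit> elements are made up for this example and do not appear in the sample files.

```r
library(xml2)

# read_xml() accepts a literal XML string as well as a file path.
# <Visits>/<Visit> are hypothetical elements used only to show nesting.
doc <- read_xml(
  "<Subject>
     <USUBJID>1015</USUBJID>
     <Visits><Visit>1</Visit><Visit>2</Visit></Visits>
   </Subject>"
)

xml_name(doc)                # name of the root element: "Subject"
xml_name(xml_children(doc))  # names of its child elements: "USUBJID" "Visits"

# Elements nested two levels below the root
xml_text(xml_find_all(doc, "./Visits/Visit"))  # "1" "2"
```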

2. XML vs JSON and Tabular Data

Tabular data (CSV, Excel) is flat, with rows and columns.
JSON and XML both support nested, hierarchical data.
XML is more verbose than JSON and uses explicit opening and closing tags.
XML is often used for data exchange in clinical research, regulatory submissions, and legacy systems.
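
To make the comparison concrete, the following sketch writes the same subject record as JSON and as XML. It assumes the jsonlite package is installed alongside xml2; the variable names are illustrative only.

```r
library(xml2)
library(jsonlite)

# One subject as a named list (values taken from the example in Section 1)
subject <- list(USUBJID = "1015", SEX = "M", AGE = 56, ARM = "Placebo")

# JSON: key-value pairs, and numbers stay numeric
json_txt <- toJSON(subject, auto_unbox = TRUE)
json_txt

# XML: one element per field, with every value stored as text
node <- xml_new_root("Subject")
for (field in names(subject)) {
  xml_add_child(node, field, as.character(subject[[field]]))
}
xml_txt <- as.character(node)
cat(xml_txt)

# The XML form needs more characters than the JSON form for the same record
nchar(xml_txt) > nchar(json_txt)
```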

3. Importing XML Data in R

The xml2 package is a widely used choice for reading and working with XML in R.

3.1. Reading XML from a File

# install.packages("xml2")
library(xml2)
doc <- read_xml("sdtm_subjects.xml")
doc

Input Example (sdtm_subjects.xml):

<Subjects>
  <Subject>
    <USUBJID>1015</USUBJID>
    <SEX>M</SEX>
    <AGE>56</AGE>
    <ARM>Placebo</ARM>
  </Subject>
  <Subject>
    <USUBJID>1016</USUBJID>
    <SEX>F</SEX>
    <AGE>62</AGE>
    <ARM>Active</ARM>
  </Subject>
</Subjects>

Expected Output (xml_document; the console may truncate long lines):

> doc
{xml_document}
<Subjects>
[1] <Subject>\n  <USUBJID>1015</USUBJID>\n  <SEX>M</SEX>\n  <AGE>56</AGE>\n  <ARM>Placebo</ARM>\n</Subject>
[2] <Subject>\n  <USUBJID>1016</USUBJID>\n  <SEX>F</SEX>\n  <AGE>62</AGE>\n  <ARM>Active</ARM>\n</Subject>
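
If the course file is not at hand, the import step can be tried with a temporary file. The sketch below writes a compact copy of the sample data to a temporary path (a stand-in for sdtm_subjects.xml) and checks what read_xml() returns.

```r
library(xml2)

# Write a small sample document to a temporary file (stand-in for sdtm_subjects.xml)
sample_file <- tempfile(fileext = ".xml")
writeLines(c(
  "<Subjects>",
  "  <Subject><USUBJID>1015</USUBJID><SEX>M</SEX><AGE>56</AGE><ARM>Placebo</ARM></Subject>",
  "  <Subject><USUBJID>1016</USUBJID><SEX>F</SEX><AGE>62</AGE><ARM>Active</ARM></Subject>",
  "</Subjects>"
), sample_file)

doc <- read_xml(sample_file)

class(doc)       # "xml_document" "xml_node"
xml_name(doc)    # name of the root element: "Subjects"
xml_length(doc)  # number of child elements under the root: 2
```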

3.2. Extracting Data from XML

Use xml_find_all() and xml_text() to extract values.

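# Select every <Subject> node, then pull the text of each child element
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_text(xml_find_all(subjects, "./USUBJID"))
sex <- xml_text(xml_find_all(subjects, "./SEX"))
# XML stores AGE as text, so convert it to integer
age <- as.integer(xml_text(xml_find_all(subjects, "./AGE")))
arm <- xml_text(xml_find_all(subjects, "./ARM"))
# Combine the vectors into a data frame with one row per subject
sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
sdtm_df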

Output (Data Frame):

> sdtm_df
  USUBJID SEX AGE     ARM
1    1015   M  56 Placebo
2    1016   F  62  Active
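
The extraction above works when every <Subject> contains every child element. When an element can be missing, xml_find_all() on a node set returns only the matches that exist, so the resulting vectors may differ in length. A minimal sketch of one way to handle this, using xml_find_first() on made-up data in which the second subject has no <AGE>:

```r
library(xml2)

# The second subject has no <AGE> element (made-up data for illustration)
doc <- read_xml(
  "<Subjects>
     <Subject><USUBJID>1015</USUBJID><AGE>56</AGE></Subject>
     <Subject><USUBJID>1016</USUBJID></Subject>
   </Subjects>"
)
subjects <- xml_find_all(doc, ".//Subject")

# xml_find_all() returns only existing matches: one <AGE> for two subjects
length(xml_find_all(subjects, "./AGE"))  # 1

# xml_find_first() returns one entry per subject, with NA where the element is missing
usubjid <- xml_text(xml_find_first(subjects, "./USUBJID"))
age <- as.integer(xml_text(xml_find_first(subjects, "./AGE")))
data.frame(USUBJID = usubjid, AGE = age)
```

Because xml_find_first() always returns as many results as there are input nodes, the columns stay aligned row by row.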

4. Exporting Data to XML in R

R does not have a base function for exporting data frames to XML, but you can use xml2 to build XML nodes programmatically.

library(xml2)
# Create the root element <Subjects>
subjects_xml <- xml_new_root("Subjects")
for (i in seq_len(nrow(sdtm_df))) {
  # Add one <Subject> per row, then one child element per column
  subject <- xml_add_child(subjects_xml, "Subject")
  xml_add_child(subject, "USUBJID", as.character(sdtm_df$USUBJID[i]))
  xml_add_child(subject, "SEX", as.character(sdtm_df$SEX[i]))
  xml_add_child(subject, "AGE", as.character(sdtm_df$AGE[i]))
  xml_add_child(subject, "ARM", as.character(sdtm_df$ARM[i]))
}
# Write the document to disk
write_xml(subjects_xml, "exported_subjects.xml")

Output (exported_subjects.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Subjects>
  <Subject>
    <USUBJID>1015</USUBJID>
    <SEX>M</SEX>
    <AGE>56</AGE>
    <ARM>Placebo</ARM>
  </Subject>
  <Subject>
    <USUBJID>1016</USUBJID>
    <SEX>F</SEX>
    <AGE>62</AGE>
    <ARM>Active</ARM>
  </Subject>
</Subjects>
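
The loop above names each column explicitly. As a sketch of a more general approach, the helper below loops over whatever columns the data frame has and then reads the file back as a check. The function name df_to_xml() and its arguments are hypothetical, not part of xml2, and the helper assumes that column names are valid XML element names.

```r
library(xml2)

# Hypothetical helper: one <row_tag> element per row, one child element per column.
# Assumes column names are valid XML element names.
df_to_xml <- function(df, root_tag = "Subjects", row_tag = "Subject") {
  root <- xml_new_root(root_tag)
  for (i in seq_len(nrow(df))) {
    row_node <- xml_add_child(root, row_tag)
    for (col in names(df)) {
      xml_add_child(row_node, col, as.character(df[[col]][i]))
    }
  }
  root
}

sdtm_df <- data.frame(
  USUBJID = c("1015", "1016"), SEX = c("M", "F"),
  AGE = c(56L, 62L), ARM = c("Placebo", "Active")
)

out_file <- tempfile(fileext = ".xml")
write_xml(df_to_xml(sdtm_df), out_file)

# Round trip: read the file back and confirm the values survived
check <- read_xml(out_file)
xml_text(xml_find_all(check, ".//Subject/ARM"))  # "Placebo" "Active"
```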

5. Input and Output Table Summary

Operation	R Function / Package	Input Example	Output Example
Import XML file	`read_xml()`	.xml file	xml_document
Extract data to table	`xml_find_all()`, `xml_text()`, `data.frame()`	xml_document	Data frame
Export to XML	`xml_new_root()`, `xml_add_child()`, `write_xml()`	Data frame	.xml file

6. Beyond the Basics: Exploring XML in R

Working with XML in real-world scenarios often involves more than just reading and writing simple files. Here are advanced techniques and considerations:

Handling XML Attributes

XML elements can store data as attributes (e.g., <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>).
Use xml_attr() to extract attribute values from nodes.
This is common in regulatory files (e.g., SDTM Define.xml), where metadata is often stored as attributes.

Input Example (attributes):

<Subjects>
    <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
    <Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
</Subjects>

R Example:

library(xml2)
doc <- read_xml("sdtm_subjects_attributes.xml")
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_attr(subjects, "USUBJID")
sex <- xml_attr(subjects, "SEX")
age <- as.integer(xml_attr(subjects, "AGE"))
arm <- xml_attr(subjects, "ARM")
sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
sdtm_df

Output:

> sdtm_df
  USUBJID SEX AGE     ARM
1    1015   M  56 Placebo
2    1016   F  62  Active
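
When a node carries many attributes, xml_attrs() can retrieve all of them at once. The sketch below stacks the results into a data frame; it assumes every <Subject> has the same attributes in the same order, which may not hold for real files. The RACE attribute queried at the end is deliberately absent from the data.

```r
library(xml2)

doc <- read_xml(
  '<Subjects>
     <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
     <Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
   </Subjects>'
)
subjects <- xml_find_all(doc, ".//Subject")

# xml_attrs() returns a list with one named character vector per node
attr_list <- xml_attrs(subjects)

# Stack the vectors into a character matrix, then convert to a data frame.
# Assumption: every <Subject> carries the same attributes in the same order.
attr_df <- as.data.frame(do.call(rbind, attr_list), stringsAsFactors = FALSE)
attr_df$AGE <- as.integer(attr_df$AGE)
attr_df

# A missing attribute comes back as NA rather than raising an error
xml_attr(subjects, "RACE")  # NA NA
```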

Parsing Complex or Nested XML

Real-world XML often contains deeply nested structures.
Use XPath queries with xml_find_all() or xml_find_first() to navigate and extract data from nested nodes.
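XPath expressions like .//Node/SubNode help target specific elements regardless of depth: a leading .// matches descendants at any level below the current node, while ./ matches only its direct children.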

Example:

<Study>
    <Subjects>
        <Subject>
            <Demographics>
                <USUBJID>1015</USUBJID>
                <SEX>M</SEX>
            </Demographics>
            <ARM>Placebo</ARM>
        </Subject>
    </Subjects>
</Study>

doc <- read_xml("nested_subjects.xml")
subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_text(xml_find_all(subjects, "./Demographics/USUBJID"))
sex <- xml_text(xml_find_all(subjects, "./Demographics/SEX"))
arm <- xml_text(xml_find_all(subjects, "./ARM"))
data.frame(USUBJID = usubjid, SEX = sex, ARM = arm)

Output:

  USUBJID SEX     ARM
1    1015   M Placebo

Exploring XML Structure and Comments

Sometimes, you need to explore an unknown XML structure or inspect comments and non-element nodes.
Use xml_children() to list the child elements of a node (comments and text nodes are not included).
Use xml_type() to identify node types (element, comment, text, etc.).
Use xml_contents() to access all content, including comments and text nodes.

Example:

<Subjects>
    <!-- This is a comment -->
    <Subject USUBJID="1015"/>
    <Subject USUBJID="1016"/>
</Subjects>

doc <- read_xml("subjects_with_comments.xml")
children <- xml_children(doc)
types <- xml_type(xml_contents(doc))  # xml_type() is vectorized over a node set
children
types

Output Example:

> children
{xml_nodeset (2)}
[1] <Subject USUBJID="1015"/>
[2] <Subject USUBJID="1016"/>
> types
[1] "comment" "element" "element"
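
Two further helpers can be useful when exploring an unfamiliar file: xml_structure() prints an outline of the document, and the XPath node test comment() selects comment nodes directly. A brief sketch using the same sample content as above, supplied here as a string:

```r
library(xml2)

doc <- read_xml(
  '<Subjects>
     <!-- This is a comment -->
     <Subject USUBJID="1015"/>
     <Subject USUBJID="1016"/>
   </Subjects>'
)

# Print an outline of the element names and attribute names in the document
xml_structure(doc)

# Select comment nodes with the XPath node test comment() and read their text
comments <- xml_find_all(doc, ".//comment()")
xml_text(comments)  # " This is a comment "

# xml_path() gives the XPath location of each child element
xml_path(xml_children(doc))  # "/Subjects/Subject[1]" "/Subjects/Subject[2]"
```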

Working with Regulatory Files

- Regulatory XML files (e.g., SDTM Define.xml) often use attributes and nested metadata.
- Use XPath to extract specific metadata, and combine with xml_attr() for attributes.
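
Regulatory XML files commonly declare a default namespace, and in that case an unprefixed XPath query returns no matches. The sketch below shows one way to handle this with xml2. The document is made up: the namespace URI and the element and attribute names only imitate regulatory metadata and do not form a valid define.xml.

```r
library(xml2)

# Made-up document with a default namespace. The names only imitate
# regulatory metadata; this is not a valid define.xml.
doc <- read_xml(
  '<Metadata xmlns="http://example.org/ns/metadata">
     <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text"/>
     <ItemDef OID="IT.DM.AGE" Name="AGE" DataType="integer"/>
   </Metadata>'
)

# An unprefixed XPath matches nothing, because ItemDef belongs to a namespace
length(xml_find_all(doc, ".//ItemDef"))  # 0

# xml2 exposes the default namespace under the prefix d1 (see xml_ns(doc))
items <- xml_find_all(doc, ".//d1:ItemDef", ns = xml_ns(doc))

# Collect the attributes of each matched node into a data frame
item_df <- data.frame(
  OID = xml_attr(items, "OID"),
  Name = xml_attr(items, "Name"),
  DataType = xml_attr(items, "DataType"),
  stringsAsFactors = FALSE
)
item_df
```

An alternative is xml_ns_strip(doc), which removes the default namespace so that unprefixed XPath expressions match. Note that it modifies the document in place.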

Converting XML to Other Formats

- Use jsonlite::toJSON() to convert data frames extracted from XML to JSON.
- Use writexl::write_xlsx() to export data frames extracted from XML to Excel.

Example:

```
library(jsonlite)
json_data <- toJSON(sdtm_df, pretty = TRUE)
write(json_data, "subjects.json")

library(writexl)
write_xlsx(sdtm_df, "subjects.xlsx")
```
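
As a variation on the JSON step, jsonlite also provides write_json(), which converts and writes in a single call. The sketch below writes to a temporary file and reads it back to check the result; the data frame is re-created here so the example is self-contained.

```r
library(jsonlite)

sdtm_df <- data.frame(
  USUBJID = c("1015", "1016"), SEX = c("M", "F"),
  AGE = c(56L, 62L), ARM = c("Placebo", "Active"),
  stringsAsFactors = FALSE
)

# Convert and write in one step
json_file <- tempfile(fileext = ".json")
write_json(sdtm_df, json_file, pretty = TRUE)

# Read the file back into a data frame and inspect the column types
back <- fromJSON(json_file)
str(back)
```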

Validating XML

- XML validation checks that a file conforms to a schema (XSD).
- Use xml_validate() from the xml2 package (requires a schema file).

Example:

```
schema <- read_xml("define_schema.xsd")
valid <- xml_validate(doc, schema)
valid  # TRUE if valid, FALSE otherwise; messages are in attr(valid, "errors")
```
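
For a self-contained illustration, the sketch below validates two small documents against an inline schema. The schema is a toy written for this example, not the real Define-XML schema; it requires a USUBJID attribute and an integer AGE.

```r
library(xml2)

# Toy schema for illustration only: <Subjects> holds one or more <Subject>
# elements, each with a required USUBJID and an optional integer AGE.
schema <- read_xml(
  '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
     <xs:element name="Subjects">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="Subject" maxOccurs="unbounded">
             <xs:complexType>
               <xs:attribute name="USUBJID" type="xs:string" use="required"/>
               <xs:attribute name="AGE" type="xs:integer"/>
             </xs:complexType>
           </xs:element>
         </xs:sequence>
       </xs:complexType>
     </xs:element>
   </xs:schema>'
)

good <- read_xml('<Subjects><Subject USUBJID="1015" AGE="56"/></Subjects>')
bad  <- read_xml('<Subjects><Subject AGE="fifty"/></Subjects>')

ok_good <- xml_validate(good, schema)
ok_bad  <- xml_validate(bad, schema)

as.logical(ok_good)     # TRUE: the document conforms to the schema
as.logical(ok_bad)      # FALSE: missing USUBJID and a non-integer AGE
attr(ok_bad, "errors")  # validator messages describing the problems
```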

Summary Table: Advanced XML Operations

Task	Function/Package	Example Input/Output
Extract attributes	`xml_attr()`	`<Subject USUBJID="1015"/>`
Parse nested XML	`xml_find_all()` + XPath	Nested XML structure
Explore structure/comments	`xml_children()`, `xml_type()`, `xml_contents()`	XML with comments
Convert to JSON/Excel	`jsonlite`, `writexl`	Data frame to JSON/Excel
Validate XML	`xml_validate()`	XML + XSD schema

Best Practices:

Always inspect the XML structure before extraction.
Use XPath for flexible and powerful data selection.
Validate XML files when working with regulatory or critical data.
Document your extraction and transformation logic for reproducibility.

7. Summary and Best Practices

XML is well suited to hierarchical, nested, and metadata-rich data, and is widely used in clinical and regulatory domains.
Use the xml2 package for robust XML import, parsing, and export in R.
Always inspect the structure of imported XML, especially for nested or attribute-based data.
Use XPath queries for flexible extraction of information.
For reproducibility, document the structure of your XML data and any transformations applied.
