
2.2.3. Working with XML Data

1. Introduction to XML

  • XML (Extensible Markup Language) is a widely used, human- and machine-readable format for data storage and exchange.
  • XML is hierarchical, like JSON, but represents data with tagged elements and attributes rather than key-value pairs.
  • Each XML document has a tree structure, with elements that can contain attributes, text, or other elements.
  • XML is common in web services, APIs, regulatory submissions (e.g., SDTM Define.xml), and data interchange between systems.

Example of XML Structure:

<Subject>
  <USUBJID>1015</USUBJID>
  <SEX>M</SEX>
  <AGE>56</AGE>
  <ARM>Placebo</ARM>
</Subject>

  • Tags (e.g., <USUBJID>) name each element and indicate what kind of data it holds.
  • Elements can be nested to represent complex relationships.
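
To make the tree structure concrete, the <Subject> example can be parsed from a string and its child elements listed. This is a minimal sketch that assumes the xml2 package is installed; the variable names are illustrative only.

```r
library(xml2)

# Parse the <Subject> example directly from a string
subject <- read_xml(
  "<Subject>
     <USUBJID>1015</USUBJID>
     <SEX>M</SEX>
     <AGE>56</AGE>
     <ARM>Placebo</ARM>
   </Subject>"
)

root_name    <- xml_name(subject)                 # name of the root element
child_names  <- xml_name(xml_children(subject))   # names of the nested elements
child_values <- xml_text(xml_children(subject))   # text stored in each element

# Pair each element name with its value
setNames(child_values, child_names)
```

Each child element becomes one named value, which is the same mapping used later when building a data frame.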

2. XML vs JSON and Tabular Data

  • Tabular data (CSV, Excel) is flat, with rows and columns.
  • JSON and XML both support nested, hierarchical data.
  • XML is more verbose than JSON and uses explicit opening and closing tags.
  • XML is often used for data exchange in clinical research, regulatory submissions, and legacy systems.

3. Importing XML Data in R

  • The xml2 package is a widely used choice for reading and working with XML in R.

3.1. Reading XML from a File

# install.packages("xml2")
library(xml2)
doc <- read_xml("sdtm_subjects.xml")
doc

Input Example (sdtm_subjects.xml):

<Subjects>
  <Subject>
    <USUBJID>1015</USUBJID>
    <SEX>M</SEX>
    <AGE>56</AGE>
    <ARM>Placebo</ARM>
  </Subject>
  <Subject>
    <USUBJID>1016</USUBJID>
    <SEX>F</SEX>
    <AGE>62</AGE>
    <ARM>Active</ARM>
  </Subject>
</Subjects>

Expected Output (xml_document):

> doc
{xml_document}
<Subjects>
[1] <Subject>\n  <USUBJID>1015</USUBJID>\n  <SEX>M</SEX>\n  <AGE>56</AGE>\n  <ARM>Placebo</ARM>\n</Subject>
[2] <Subject>\n  <USUBJID>1016</USUBJID>\n  <SEX>F</SEX>\n  <AGE>62</AGE>\n  <ARM>Active</ARM>\n</Subject>
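
Before extracting values, it can help to check a few basic facts about the parsed document. The sketch below writes a small sample to a temporary file (a stand-in for sdtm_subjects.xml, so the example is self-contained) and then inspects it; the variable names are assumptions for illustration.

```r
library(xml2)

# Write a small sample file to a temporary location
tmp_file <- tempfile(fileext = ".xml")
writeLines(c(
  "<Subjects>",
  "  <Subject><USUBJID>1015</USUBJID><SEX>M</SEX><AGE>56</AGE><ARM>Placebo</ARM></Subject>",
  "  <Subject><USUBJID>1016</USUBJID><SEX>F</SEX><AGE>62</AGE><ARM>Active</ARM></Subject>",
  "</Subjects>"
), tmp_file)

doc <- read_xml(tmp_file)

root_name   <- xml_name(doc)      # name of the root element
n_subjects  <- xml_length(doc)    # number of child elements under the root
field_names <- xml_name(xml_children(xml_child(doc, 1)))  # fields of the first subject

root_name
n_subjects
field_names
```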

3.2. Extracting Data from XML

  • Use xml_find_all() to select nodes with an XPath expression and xml_text() to read their text content.

subjects <- xml_find_all(doc, ".//Subject")
usubjid <- xml_text(xml_find_all(subjects, "./USUBJID"))
sex <- xml_text(xml_find_all(subjects, "./SEX"))
age <- as.integer(xml_text(xml_find_all(subjects, "./AGE")))
arm <- xml_text(xml_find_all(subjects, "./ARM"))
sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
sdtm_df

Output (Data Frame):

> sdtm_df
  USUBJID SEX AGE     ARM
1    1015   M  56 Placebo
2    1016   F  62  Active
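
The extraction above assumes every <Subject> contains all four child elements. If one is missing, xml_find_all() returns fewer values and the columns no longer line up. One possible safeguard, sketched here, is xml_find_first(), which returns exactly one result per subject and yields NA where the child is absent. The helper child_text() is hypothetical, not part of xml2.

```r
library(xml2)

# The second subject deliberately has no <AGE> element
doc <- read_xml(
  "<Subjects>
     <Subject><USUBJID>1015</USUBJID><SEX>M</SEX><AGE>56</AGE><ARM>Placebo</ARM></Subject>
     <Subject><USUBJID>1016</USUBJID><SEX>F</SEX><ARM>Active</ARM></Subject>
   </Subjects>"
)

subjects <- xml_find_all(doc, ".//Subject")

# Hypothetical helper: one value per subject, NA when the child is absent
child_text <- function(nodes, name) {
  xml_text(xml_find_first(nodes, paste0("./", name)))
}

sdtm_df <- data.frame(
  USUBJID = child_text(subjects, "USUBJID"),
  SEX     = child_text(subjects, "SEX"),
  AGE     = as.integer(child_text(subjects, "AGE")),
  ARM     = child_text(subjects, "ARM"),
  stringsAsFactors = FALSE
)
sdtm_df
```

The missing age appears as NA in the second row instead of shifting values between subjects.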

4. Exporting Data to XML in R

  • R does not have a base function for exporting data frames to XML, but you can use xml2 to build XML nodes programmatically.
library(xml2)
subjects_xml <- xml_new_root("Subjects")
for (i in seq_len(nrow(sdtm_df))) {
  subject <- xml_add_child(subjects_xml, "Subject")
  xml_add_child(subject, "USUBJID", sdtm_df$USUBJID[i])
  xml_add_child(subject, "SEX", sdtm_df$SEX[i])
  xml_add_child(subject, "AGE", as.character(sdtm_df$AGE[i]))
  xml_add_child(subject, "ARM", sdtm_df$ARM[i])
}
write_xml(subjects_xml, "exported_subjects.xml")

Output (exported_subjects.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Subjects>
  <Subject>
    <USUBJID>1015</USUBJID>
    <SEX>M</SEX>
    <AGE>56</AGE>
    <ARM>Placebo</ARM>
  </Subject>
  <Subject>
    <USUBJID>1016</USUBJID>
    <SEX>F</SEX>
    <AGE>62</AGE>
    <ARM>Active</ARM>
  </Subject>
</Subjects>
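
The export loop can be generalized so that it does not hard-code column names, and the result can be checked by reading the file back. This is a sketch under stated assumptions: df_to_xml() is a hypothetical helper, and it assumes every column name is also a valid XML element name.

```r
library(xml2)

sdtm_df <- data.frame(
  USUBJID = c("1015", "1016"),
  SEX     = c("M", "F"),
  AGE     = c(56L, 62L),
  ARM     = c("Placebo", "Active"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row element per data frame row,
# one child element per column
df_to_xml <- function(df, root_name = "Subjects", row_name = "Subject") {
  root <- xml_new_root(root_name)
  for (i in seq_len(nrow(df))) {
    row_node <- xml_add_child(root, row_name)
    for (col in names(df)) {
      xml_add_child(row_node, col, as.character(df[[col]][i]))
    }
  }
  root
}

out_file <- tempfile(fileext = ".xml")
write_xml(df_to_xml(sdtm_df), out_file)

# Read the file back and confirm the values survived the round trip
back     <- read_xml(out_file)
back_ids <- xml_text(xml_find_all(back, ".//USUBJID"))
back_age <- as.integer(xml_text(xml_find_all(back, ".//AGE")))
back_ids
back_age
```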

5. Input and Output Table Summary

Operation              R Function / Package                          Input Example  Output Example
---------------------  --------------------------------------------  -------------  --------------
Import XML file        read_xml()                                    .xml file      xml_document
Extract data to table  xml_find_all(), xml_text()                    xml_document   Data frame
Export to XML          xml_new_root(), xml_add_child(), write_xml()  Data frame     .xml file

6. Beyond the Basics: Exploring XML in R

Working with XML in real-world scenarios often involves more than just reading and writing simple files. Here are advanced techniques and considerations:

  • Handling XML Attributes

    • XML elements can store data as attributes (e.g., <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>).
    • Use xml_attr() to extract attribute values from nodes.
    • This is common in regulatory files (e.g., SDTM Define.xml), where metadata is often stored as attributes.

    Input Example (attributes):

    <Subjects>
        <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
        <Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
    </Subjects>
    

    R Example:

    library(xml2)
    doc <- read_xml("sdtm_subjects_attributes.xml")
    subjects <- xml_find_all(doc, ".//Subject")
    usubjid <- xml_attr(subjects, "USUBJID")
    sex <- xml_attr(subjects, "SEX")
    age <- as.integer(xml_attr(subjects, "AGE"))
    arm <- xml_attr(subjects, "ARM")
    sdtm_df <- data.frame(USUBJID = usubjid, SEX = sex, AGE = age, ARM = arm)
    sdtm_df
    

    Output:

    > sdtm_df
      USUBJID SEX AGE     ARM
    1    1015   M  56 Placebo
    2    1016   F  62  Active
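
When there are many attributes, naming each one in a separate xml_attr() call becomes tedious. As an alternative sketch, xml_attrs() returns all attributes of each node at once. The code below assumes every <Subject> carries the same attributes in the same order; if that does not hold, the per-attribute approach above is safer.

```r
library(xml2)

doc <- read_xml(
  '<Subjects>
     <Subject USUBJID="1015" SEX="M" AGE="56" ARM="Placebo"/>
     <Subject USUBJID="1016" SEX="F" AGE="62" ARM="Active"/>
   </Subjects>'
)
subjects <- xml_find_all(doc, ".//Subject")

# xml_attrs() returns one named character vector per node
attr_list <- xml_attrs(subjects)

# Stack the vectors row-wise; column names come from the attribute names
sdtm_df <- as.data.frame(do.call(rbind, attr_list), stringsAsFactors = FALSE)

# Attributes are always read as text, so convert numeric columns explicitly
sdtm_df$AGE <- as.integer(sdtm_df$AGE)
sdtm_df
```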
    
  • Parsing Complex or Nested XML

    • Real-world XML often contains deeply nested structures.
    • Use XPath queries with xml_find_all() (all matches) or xml_find_first() (first match per node) to navigate and extract data from nested nodes.
    • An XPath expression like .//Node/SubNode matches every SubNode that is a direct child of a Node, at any depth below the current node.

    Example:

    <Study>
        <Subjects>
            <Subject>
                <Demographics>
                    <USUBJID>1015</USUBJID>
                    <SEX>M</SEX>
                </Demographics>
                <ARM>Placebo</ARM>
            </Subject>
        </Subjects>
    </Study>
    
    doc <- read_xml("nested_subjects.xml")
    subjects <- xml_find_all(doc, ".//Subject")
    usubjid <- xml_text(xml_find_all(subjects, "./Demographics/USUBJID"))
    sex <- xml_text(xml_find_all(subjects, "./Demographics/SEX"))
    arm <- xml_text(xml_find_all(subjects, "./ARM"))
    data.frame(USUBJID = usubjid, SEX = sex, ARM = arm)
    

    Output:

      USUBJID SEX     ARM
    1    1015   M Placebo
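
The difference between a depth-independent search and a fully spelled-out path can be sketched on the same nested document. The example parses the XML from a string so that it is self-contained; xml_path() is used only to show where each match sits in the tree.

```r
library(xml2)

doc <- read_xml(
  "<Study>
     <Subjects>
       <Subject>
         <Demographics>
           <USUBJID>1015</USUBJID>
           <SEX>M</SEX>
         </Demographics>
         <ARM>Placebo</ARM>
       </Subject>
     </Subjects>
   </Study>"
)

# Depth-independent search: USUBJID anywhere below the root
ids       <- xml_find_all(doc, ".//USUBJID")
id_values <- xml_text(ids)

# Absolute path: every step from the root is spelled out
arm_value <- xml_text(xml_find_first(doc, "/Study/Subjects/Subject/ARM"))

# xml_path() reports the location of each matched node
id_paths <- xml_path(ids)

id_values
arm_value
id_paths
```

The depth-independent form keeps working if the file later gains or loses a wrapper level, while the absolute form fails fast when the structure changes; which behavior is preferable depends on how strictly the layout is specified.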
    
  • Exploring XML Structure and Comments

    • Sometimes you need to explore an unknown XML structure or inspect comments and other non-element nodes.
    • Use xml_children() to list the child elements of a node (comments and text nodes are skipped).
    • Use xml_type() to identify node types (element, comment, text, etc.).
    • Use xml_contents() to access all child nodes, including comments and text nodes.

    Example:

    <Subjects>
        <!-- This is a comment -->
        <Subject USUBJID="1015"/>
        <Subject USUBJID="1016"/>
    </Subjects>
    
    doc <- read_xml("subjects_with_comments.xml")
    children <- xml_children(doc)          # element children only
    types <- xml_type(xml_contents(doc))   # all child nodes, including the comment
    children
    types
    

    Output Example:

    > children
    {xml_nodeset (2)}
    [1] <Subject USUBJID="1015"/>
    [2] <Subject USUBJID="1016"/>
    > types
    [1] "comment" "element" "element"
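
Once the node types are known, they can be used to pick out specific nodes, for example to read the text of the comments. The sketch below also calls xml_structure(), which prints an outline of the tree and can be a convenient first look at an unfamiliar file. The XML is parsed from a string here so the example runs on its own.

```r
library(xml2)

doc <- read_xml(
  '<Subjects><!-- This is a comment --><Subject USUBJID="1015"/><Subject USUBJID="1016"/></Subjects>'
)

contents   <- xml_contents(doc)   # all child nodes of the root
node_types <- xml_type(contents)  # one type label per node

# Keep only the comment nodes and read their text
comment_nodes <- contents[node_types == "comment"]
comment_text  <- trimws(xml_text(comment_nodes))
comment_text

# Print an outline of the whole tree
xml_structure(doc)
```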
    
  • Working with Regulatory Files

    • Regulatory XML files (e.g., SDTM Define.xml) often use attributes and nested metadata.
    • Use XPath to extract specific metadata, and combine with xml_attr() for attributes.
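
Files of this kind often declare a default XML namespace, in which case un-prefixed XPath expressions match nothing. The sketch below uses a simplified, made-up document (the namespace URI and the attribute values are placeholders, not taken from a real Define.xml) to show one way of combining a namespace-aware XPath query with xml_attr().

```r
library(xml2)

# Simplified stand-in for a metadata file; the namespace URI is a placeholder
doc <- read_xml(
  '<ODM xmlns="http://example.org/ns/demo">
     <ItemDef OID="IT.DM.AGE" Name="AGE" DataType="integer"/>
     <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text"/>
   </ODM>'
)

# Without a prefix, the query finds no nodes in a namespaced document
n_unprefixed <- length(xml_find_all(doc, ".//ItemDef"))

# xml_ns() lists the namespaces; xml2 exposes the default one as "d1"
ns    <- xml_ns(doc)
items <- xml_find_all(doc, ".//d1:ItemDef", ns)

item_df <- data.frame(
  OID      = xml_attr(items, "OID"),
  Name     = xml_attr(items, "Name"),
  DataType = xml_attr(items, "DataType"),
  stringsAsFactors = FALSE
)

# An XPath predicate narrows the selection by attribute value
int_names <- xml_attr(
  xml_find_all(doc, ".//d1:ItemDef[@DataType='integer']", ns),
  "Name"
)

n_unprefixed
item_df
int_names
```

If prefixes are unwanted, xml_ns_strip() can remove the default namespace from a parsed document, at the cost of modifying it in place.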
  • Converting XML to Other Formats

    • Use jsonlite::toJSON() to convert data frames extracted from XML to JSON.
    • Use writexl::write_xlsx() to export data frames extracted from XML to Excel.

    Example:

    library(jsonlite)
    json_data <- toJSON(sdtm_df, pretty = TRUE)
    write(json_data, "subjects.json")
    
    library(writexl)
    write_xlsx(sdtm_df, "subjects.xlsx")
    
  • Validating XML

    • XML validation ensures the file conforms to a schema (XSD).
    • Use xml_validate() from the xml2 package; it takes the parsed document and a parsed XSD schema.

    Example:

    schema <- read_xml("define_schema.xsd")
    valid <- xml_validate(doc, schema)
    valid  # TRUE if valid, FALSE otherwise; messages are in attr(valid, "errors")
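
To try validation without an external schema file, a small XSD can be parsed from a string. The schema below is a made-up illustration written for this sketch (it is not a Define.xml schema); it requires each <Subject> to contain a USUBJID followed by an integer AGE.

```r
library(xml2)

# Illustrative schema, written for this example only
schema <- read_xml(
  '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
     <xs:element name="Subjects">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="Subject" maxOccurs="unbounded">
             <xs:complexType>
               <xs:sequence>
                 <xs:element name="USUBJID" type="xs:string"/>
                 <xs:element name="AGE" type="xs:integer"/>
               </xs:sequence>
             </xs:complexType>
           </xs:element>
         </xs:sequence>
       </xs:complexType>
     </xs:element>
   </xs:schema>'
)

good <- read_xml("<Subjects><Subject><USUBJID>1015</USUBJID><AGE>56</AGE></Subject></Subjects>")
bad  <- read_xml("<Subjects><Subject><USUBJID>1016</USUBJID><AGE>unknown</AGE></Subject></Subjects>")

good_result <- xml_validate(good, schema)
bad_result  <- xml_validate(bad, schema)

as.logical(good_result)      # the document conforms to the schema
as.logical(bad_result)       # AGE is not an integer
attr(bad_result, "errors")   # validator messages explaining the failure
```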
    

Summary Table: Advanced XML Operations

Task                        Function / Package                          Example Input / Output
--------------------------  ------------------------------------------  -------------------------
Extract attributes          xml_attr()                                  <Subject USUBJID="1015"/>
Parse nested XML            xml_find_all() + XPath                      Nested XML structure
Explore structure/comments  xml_children(), xml_type(), xml_contents()  XML with comments
Convert to JSON/Excel       jsonlite, writexl                           Data frame to JSON/Excel
Validate XML                xml_validate()                              XML + XSD schema

Best Practices:

  • Always inspect the XML structure before extraction.
  • Use XPath for flexible and powerful data selection.
  • Validate XML files when working with regulatory or critical data.
  • Document your extraction and transformation logic for reproducibility.

7. Summary and Best Practices

  • XML is ideal for hierarchical, nested, and metadata-rich data, and is widely used in clinical and regulatory domains.
  • Use the xml2 package for robust XML import, parsing, and export in R.
  • Always inspect the structure of imported XML, especially for nested or attribute-based data.
  • Use XPath queries for flexible extraction of information.
  • For reproducibility, document the structure of your XML data and any transformations applied.
