There are a multitude of business use cases for data from XML. For instance, some XML sources are converted into tabular or other forms to enrich existing datasets (facts, dimensions, measures) for a DW, Data Lake or other security and digital services, while others just need capabilities for querying the XML contents in isolation using XQuery and XPath expressions. The hierarchical data model, deeply nested structures and extremely loose schemas of XML can result in highly mutable schemas and data instances from source to source.
Whatever the use may be, prospecting starts with a careful look into the XML sources to ensure safe and smooth capture, ingestion, processing and transformation of the data before it is consumed by services. In this blog, I share some useful tools I developed in a recent project and methods for going about such tasks using Python. The full sample of Python code is linked here (Google Colab Notebook) and at the bottom of this page.
2.0. How to get started?
Python and ElementTree (or ET) are a powerful combination for both parsing and creating XML for various purposes. The xml.etree.ElementTree module in Python represents XML data in tree form through two classes: i) ElementTree - representing the whole document, and ii) Element - a single node in that tree. I combined the utilities of both and formulated the following steps to i) assess and evaluate XML data sources; and ii) extract data into tabular format for business use.
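For orientation, here is a minimal sketch (using a made-up order document) of the two views: the root Element of the tree and the child nodes beneath it.

```python
import xml.etree.ElementTree as ET

# A minimal illustration with a made-up document: fromstring() returns
# the root Element, and iterating an Element yields its child nodes.
sample = """<orders>
  <order id="1"><item>book</item><qty>2</qty></order>
  <order id="2"><item>pen</item><qty>5</qty></order>
</orders>"""

root = ET.fromstring(sample)
print(root.tag)  # 'orders'
for order in root:
    print(order.attrib, {child.tag: child.text for child in order})
```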
3.0. Step 1 - Prepare XML Parser for Evaluation Tasks
The XMLPullParser class in ET is useful for this purpose as it does not require a blocking read to obtain the data within the XML document; instead, event-based data is fed into the parser incrementally. This parser is used in the following steps to traverse the XML for various types of inspection.
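As a rough sketch (the file name and chunk size here are assumptions of mine), the parser can be set up and fed incrementally like this:

```python
import xml.etree.ElementTree as ET

# A minimal sketch: feed the document to XMLPullParser in chunks and
# consume start/end events incrementally, rather than blocking on a
# full read. The file name and chunk size are illustrative only.
parser = ET.XMLPullParser(events=("start", "end"))

with open("source.xml", "rb") as f:
    while chunk := f.read(64 * 1024):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            print(event, elem.tag)

parser.close()  # signal that no more data will be fed
```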
Step 2 - Examining the XML Content (Single Document)
Once fed, the parser is passed into a function that traverses the elements and their various nodes to view the content. The function displays all the elements (tags, text, attributes and namespaces), as shown in diagram 3 below, along with the time taken to traverse and read all nodes in the XML document (using a simple custom Python timer function).
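A simplified version of such a traversal function might look like the sketch below; the plain time.perf_counter() calls stand in for the custom timer, and the file name is an assumption.

```python
import time
import xml.etree.ElementTree as ET

# Sketch of the inspection step: print tag, attributes and text for
# every element the parser produced, and report the time taken.
# time.perf_counter() stands in for the custom timer mentioned above.
def inspect_elements(parser):
    start = time.perf_counter()
    count = 0
    for event, elem in parser.read_events():
        print(f"tag={elem.tag!r} attrib={elem.attrib} text={elem.text!r}")
        count += 1
    print(f"Read {count} elements in {time.perf_counter() - start:.4f}s")

parser = ET.XMLPullParser(events=("end",))  # 'end' = element fully read
with open("source.xml", "rb") as f:         # fed in one go just to keep the snippet short
    parser.feed(f.read())
parser.close()
inspect_elements(parser)
```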
If hundreds or thousands of lines are expected, alternatively i) the contents can be dumped into a tabular CSV or Parquet file for quick inspection (as shown in examples in later parts of this blog); or ii) the output can be truncated to a limited number of leading and trailing elements.
Step 3 - Render the Data Instance
Next, render the data instance based on its data hierarchy. This function can easily be upgraded for graph generation or other visual representations of XML data.
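As a starting point, a rendering function can simply indent each element by its depth; the sketch below assumes the document has already been parsed into a root Element, and the file name is illustrative.

```python
import xml.etree.ElementTree as ET

# Sketch of rendering the data instance: indent each element by its
# depth so the hierarchy is visible at a glance. This could later be
# replaced by graph generation or another visual representation.
def render(elem, depth=0):
    text = (elem.text or "").strip()
    print("  " * depth + f"{elem.tag} {elem.attrib} {text}")
    for child in elem:
        render(child, depth + 1)

root = ET.parse("source.xml").getroot()  # file name is illustrative
render(root)
```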
Step 4 - Summary of Elements and Relevant Nodes
Last but not least, a function to summarise the total counts of elements helps determine the depth and width of the document.
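A sketch of such a summary, assuming the document fits in memory as a parsed tree, could be as simple as:

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Sketch of the summary step: count how often each tag occurs (width)
# and find the deepest level of nesting (depth). File name is illustrative.
def summarise(elem, depth=1, counts=None):
    counts = counts if counts is not None else Counter()
    counts[elem.tag] += 1
    deepest = depth
    for child in elem:
        _, child_depth = summarise(child, depth + 1, counts)
        deepest = max(deepest, child_depth)
    return counts, deepest

root = ET.parse("source.xml").getroot()
counts, depth = summarise(root)
print(f"max depth: {depth}")
print(f"element counts: {dict(counts)}")
```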
Step 5 - Extracting Data for Use
If an XML source passes the initial evaluation and shows great promise for the business (Steps 1 to 4 above), start building the functions needed to extract data from the XML into the required target format. In our case, we came across many XMLs with deep structures and nested data. Some XML sources even had a slightly different schema each time (e.g. missing sub-elements). Our use case required the data to be flattened into a tabular form (or with nested data) before further processing, so we went for CSV or Parquet as the transformation targets, as our OLAP database (e.g. DuckDB) had an API for both.
What tools? To extract data we switched to the xml.etree.ElementTree.fromstring() and Element.findall() functions, moving from one element to another with a recursive function to capture all data correctly. Using this top-down approach, every element is captured with the necessary data nodes as keys and values in a dictionary structure, which at the end of the deepest structure (the end of a row of data) is appended to an array. When the recursive function finishes traversing all elements, the array is converted into a tabular CSV using a Python DataFrame.
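The sketch below illustrates the idea under a few assumptions of mine rather than reproducing the production function: the record element is called 'order', leaf text values become columns, and pandas writes the final CSV.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Simplified sketch of the extraction step: recurse top-down with
# Element.findall('*'), collect leaf tag/text pairs into a row dict,
# and emit one row per record element. The record tag ('order'), file
# names and column handling are illustrative assumptions.
def flatten(elem, row):
    children = elem.findall("*")      # direct child elements
    if not children:                  # leaf node: capture as a column
        row[elem.tag] = (elem.text or "").strip()
    for child in children:
        flatten(child, row)
    return row

with open("source.xml", encoding="utf-8") as f:
    root = ET.fromstring(f.read())

rows = [flatten(record, dict(record.attrib)) for record in root.findall("order")]
pd.DataFrame(rows).to_csv("orders.csv", index=False)  # or .to_parquet("orders.parquet")
```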
We could not use the XMLPullParser here because we needed a mechanism that updates the dictionary structure at the end of a complete record (a row of data), not for every element (every column). We also needed to maintain state for the start and end keys of the dictionary to handle missing elements/data in the deep structures of some XMLs.
Conclusion - Mature Tools, Process and Best Practices
Now that the basic functions and tools needed for working with XMLs have been built, the business can start testing and evaluating more XML sources and their potential. This allows the tools and methods to mature, common or recurring problems to be addressed, and the features of each tool to be upgraded (e.g. XML tree and sub-tree clustering, a custom XML validator and schema crystalliser to implement internal schemas and rules, and others to address bulk-processing challenges).
Happy coding and feel free to share if you have an idea for this space!