When dealing with large volumes of XML, three types of problems come up, as discussed in my previous post (see the references section below). The first (problem 1) is extracting the raw data to feed other target formats, including JSON and tabular dimension, fact or one big table (OBT) structures for analytics or other digital applications. The second (problem 2) is visualising the structure of the document to expose the schema and the data instance with all of its attributes, including subpaths of nodes (or subtrees and subset trees). The third (problem 3) is determining similarities between two different XML trees based on their link structures.
In my previous post I covered measures for problem 1, which map XML into a more tabular form while still preserving nested data structures. In this post, we'll discuss strategies for problem 2, another piece of the XML puzzle, which centres on mapping XMLs to graphs for improved visibility and subpath extraction. This step is critical, especially if you plan to compute tree kernels to quantify structural similarities between XMLs.
The Tools
The playground material for this post is built using Python and various libraries on Google Colab: ElementTree for XML parsing, NetworkX for graph construction and analysis, Graphviz for layout, and Matplotlib to display the graph.
NetworkX provides several native layouts for displaying graphs and also supports custom layouts, though the latter is tedious because each node position must be specified explicitly. Graphviz, on the other hand, offers a broad range of predefined layouts but requires a pydot or pygraphviz interface to lay out graphs from NetworkX. Graphviz is mostly a visualisation tool and has fewer capabilities for analysing graphs than NetworkX.
In this sample, constructing the graph and analysing different parts of it (the upward, downward and adjacent neighbours of a node) is critical, as XMLs can contain hundreds or thousands of nodes. That became the main reason NetworkX was chosen for constructing the graph. NetworkX also has an active community that maintains the library and resolves user problems promptly.
The Process
The process of mapping XMLs into a graph is broken into three simple steps. In step 1, data (the key features of the XML) is extracted from the XML document and stored in an array structure. In step 2, the array output of step 1 is used to generate a new array containing the element label, element node ID, subpath and ancestor node ID. In step 3, the array output of step 2, which contains all the subpaths, is consumed to construct a graph reflecting the XML structure.
This design leaves room for future improvement. If a new attribute has to be added or an existing one needs enhancing, only the relevant step is revisited, with minimal impact on the other steps.
Step 1 - Prepare features from raw XML File
As mentioned above, step 1 is all about extracting the ‘right stuff’ from the XMLs. The last post presented a recursive function capable of crawling into deep XML structures. The same function is reused here, with minor alterations, to perform the tasks in step 1. The data extracted are the element tag, the parent tag, the structural level of the element (depth), and a system-generated unique node ID for each element. Each element's data is held in a dictionary, and the dictionaries are stored in an array.
In later stages, system-generated attributes such as the element level and node ID help to identify specific nodes in large XMLs (hundreds or thousands of elements) with deep structures and repeated element labels.
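As a rough illustration of step 1, here is a minimal sketch of a recursive extractor. It is not the function from the sample notebook: the name extract_features, the dictionary keys and the sample XML are assumptions made for this example, and parent_id is an extra field added so that later steps can tell apart elements that share a label.

```python
import xml.etree.ElementTree as ET
from itertools import count


def extract_features(element, parent=None, depth=0, rows=None, ids=None):
    """Recursively collect one dictionary per XML element (hypothetical helper)."""
    if rows is None:
        rows, ids = [], count(1)  # node IDs are generated in document order
    current = {
        "node_id": next(ids),
        "tag": element.tag,
        "parent_tag": parent["tag"] if parent is not None else None,
        "parent_id": parent["node_id"] if parent is not None else None,  # assumed extra field
        "depth": depth,
    }
    rows.append(current)
    for child in element:  # iterating an Element yields its direct children
        extract_features(child, current, depth + 1, rows, ids)
    return rows


# Made-up sample document with a repeated element label ("book").
sample_xml = """
<library>
  <book><title/><author/></book>
  <book><title/></book>
</library>
"""

features = extract_features(ET.fromstring(sample_xml))
for row in features:
    print(row)
```

Note that the two title elements share a tag and a depth, so only the generated node ID (and the parent ID) distinguishes them.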
Step 2 - Generate subpaths for XMLs
Step 2 consumes the output of step 1, which is an array of the XML features described above.
In this step, the Python function (generate_subpaths() in the sample code) traces back all predecessors of an element node in order and captures the result as a subpath attribute. The same function maps every element node to its parent and ancestors.
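The idea of tracing back predecessors can be sketched as follows. This is only an approximation of what a function like generate_subpaths() might do, not the notebook's actual code: the record keys (node_id, tag, parent_id), the "/" separator and the ancestor_ids output field are assumptions for this example.

```python
def generate_subpaths(features):
    """Attach a root-to-node subpath and ancestor IDs to each record (sketch)."""
    by_id = {row["node_id"]: row for row in features}
    result = []
    for row in features:
        tags, ancestor_ids = [row["tag"]], []
        parent_id = row["parent_id"]
        # Walk upward until the root (which has no parent) is reached.
        while parent_id is not None:
            parent = by_id[parent_id]
            tags.append(parent["tag"])
            ancestor_ids.append(parent_id)
            parent_id = parent["parent_id"]
        result.append({
            "node_id": row["node_id"],
            "tag": row["tag"],
            "subpath": "/".join(reversed(tags)),            # root first
            "ancestor_ids": list(reversed(ancestor_ids)),   # root first, parent last
        })
    return result


# Hand-written input in the assumed step 1 shape.
features = [
    {"node_id": 1, "tag": "library", "parent_id": None},
    {"node_id": 2, "tag": "book", "parent_id": 1},
    {"node_id": 3, "tag": "title", "parent_id": 2},
    {"node_id": 4, "tag": "book", "parent_id": 1},
]

subpaths = generate_subpaths(features)
for row in subpaths:
    print(row["node_id"], row["subpath"], row["ancestor_ids"])
```

Tracing by node ID rather than by tag keeps the two book elements separate even though their subpaths are identical.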
Step 3 - Constructing the graph from XML Data
Step 3 is easy if steps 1 and 2 are working correctly. The output of step 2 (the array containing the subpaths and accessory data) is consumed by the Python function (e.g. construct_graph() in the sample code) to construct the graph.
The graph generated in this step is now ready for further analysis using NetworkX capabilities (e.g. neighbours, predecessors, successors, adjacency). Slice and zoom into parts or subgraphs of the graph to extract patterns as necessary.
To visualise full graphs or subgraphs, use NetworkX’s draw function (e.g. the sample code in diagram 6) in conjunction with Graphviz layouts and Matplotlib for plotting. Graph visualisation in Python is a bit of a labyrinth at the moment, but as needs mature, tools and skills can be reorganised by task (graph construction, analysis and visualisation).
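A minimal sketch of the graph construction could look like the following. It assumes NetworkX is installed and that each step 2 record carries node_id, tag, subpath and a root-first ancestor_ids list; those field names are assumptions for this example, and the notebook's construct_graph() may be organised differently.

```python
import networkx as nx


def construct_graph(subpath_rows):
    """Build a directed parent -> child graph from step 2 style records (sketch)."""
    graph = nx.DiGraph()
    for row in subpath_rows:
        # The node ID is the graph key; the label and subpath travel as attributes.
        graph.add_node(row["node_id"], label=row["tag"], subpath=row["subpath"])
    for row in subpath_rows:
        if row["ancestor_ids"]:
            # The last ancestor in the root-first list is the direct parent.
            graph.add_edge(row["ancestor_ids"][-1], row["node_id"])
    return graph


# Hand-written input in the assumed step 2 shape.
rows = [
    {"node_id": 1, "tag": "library", "subpath": "library", "ancestor_ids": []},
    {"node_id": 2, "tag": "book", "subpath": "library/book", "ancestor_ids": [1]},
    {"node_id": 3, "tag": "title", "subpath": "library/book/title", "ancestor_ids": [1, 2]},
    {"node_id": 4, "tag": "book", "subpath": "library/book", "ancestor_ids": [1]},
]

graph = construct_graph(rows)
print(graph.number_of_nodes(), graph.number_of_edges())
print(nx.is_arborescence(graph))  # a well-formed XML tree should pass this check
```

Using the node ID as the graph key, rather than the tag, is what keeps repeated labels from collapsing into a single node.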
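To give a feel for this kind of analysis, here is a short sketch on a small hand-built graph. The node numbers are made up for the example; the NetworkX calls used (predecessors, successors, ancestors, descendants, subgraph) are standard library functions.

```python
import networkx as nx

# A small made-up tree: 1 is the root, 2 and 5 are its children.
graph = nx.DiGraph()
graph.add_edges_from([(1, 2), (2, 3), (2, 4), (1, 5), (5, 6)])

parent = list(graph.predecessors(3))      # upward neighbour (direct parent)
children = list(graph.successors(2))      # downward neighbours (direct children)
ancestors = nx.ancestors(graph, 3)        # every node on the way up to the root
descendants = nx.descendants(graph, 2)    # everything below node 2

# Adjacent nodes at the same level: other children of the same parent.
siblings = [n for p in graph.predecessors(3) for n in graph.successors(p) if n != 3]

# "Zoom in" on the subtree rooted at node 2.
subtree = graph.subgraph({2} | descendants)

print(parent, children, sorted(ancestors), sorted(descendants), siblings)
print(sorted(subtree.nodes), sorted(subtree.edges))
```

The subgraph call returns a view of the original graph, so it is cheap to create even when the full XML graph is large.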
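When Graphviz is not available, one fallback is to compute node positions by hand. The sketch below is not the Graphviz approach used in the sample code: it is a deliberately naive layered layout, and the helper name layered_positions is hypothetical. It places each node on a row given by its depth and spreads the nodes of a row evenly, so edges may cross on larger trees.

```python
import networkx as nx


def layered_positions(graph, root):
    """Return {node: (x, y)} with y = -depth and x spread evenly per level (naive sketch)."""
    depths = nx.shortest_path_length(graph, source=root)  # node -> distance from root
    layers = {}
    for node, depth in depths.items():
        layers.setdefault(depth, []).append(node)
    positions = {}
    for depth, nodes in layers.items():
        for index, node in enumerate(sorted(nodes)):
            positions[node] = ((index + 1) / (len(nodes) + 1), -depth)
    return positions


# Same made-up tree shape as a small XML document.
graph = nx.DiGraph()
graph.add_edges_from([(1, 2), (2, 3), (2, 4), (1, 5), (5, 6)])

pos = layered_positions(graph, root=1)
for node, xy in sorted(pos.items()):
    print(node, xy)
```

The resulting dictionary can be passed to NetworkX's draw function through its pos argument, for example nx.draw(graph, pos=pos, with_labels=True), with Matplotlib handling the actual plotting.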
Conclusion - Experiment till the right tools emerge
There are many other ways to map XML to a graph, and requirements can differ from one business to another. Experiment with several viable options, compare them and pick what works best. Think long term, and weigh features, support and the adoption ecosystem before expanding use cases.
Happy coding!
Some References:
Link to sample code on Google Colab; Parsing XMLs; Processing data from XMLs