Python Pandas Tree Structure

Posted by admin

I have an XML document that contains a hierarchical, tree-like structure; see the example below.


The document contains several <Message> tags (I only copied one of them for convenience).


Each <Message> has some associated data (id, status, priority) on its own.

Besides, each <Message> can contain one or more <Street> children which again have some relevant data (<name>, <length>).

Moreover, each <Street> can have one or more <Link> children which again have their own relevant data (<id>, <direction>).

Example XML document:
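The XML listing itself is not preserved here. As a stand-in, here is a hypothetical document that matches the description above, wrapped in a Python string and walked with the standard library's ElementTree. It assumes that id, status and priority are attributes of <Message> and that the street and link fields are child elements; the <Messages> root tag and all values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag layout follows the description in the question,
# the root tag <Messages> and every value are invented for illustration.
XML_DOC = """
<Messages>
  <Message id="1" status="Active" priority="High">
    <Street>
      <name>Main Street</name>
      <length>1200</length>
      <Link><id>101</id><direction>north</direction></Link>
      <Link><id>102</id><direction>south</direction></Link>
    </Street>
    <Street>
      <name>Side Street</name>
      <length>300</length>
      <Link><id>201</id><direction>east</direction></Link>
    </Street>
  </Message>
</Messages>
"""

root = ET.fromstring(XML_DOC)

# Walk the three levels of the hierarchy: Message -> Street -> Link
for message in root.findall("Message"):
    print("Message", message.get("id"), message.get("status"), message.get("priority"))
    for street in message.findall("Street"):
        print("  Street", street.findtext("name"), street.findtext("length"))
        for link in street.findall("Link"):
            print("    Link", link.findtext("id"), link.findtext("direction"))
```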

Parsing the XML with Python and storing the relevant data in variables is not the problem: I can, for example, use the lxml library and either read the whole document and then run XPath expressions to get the relevant fields, or process it incrementally with the iterparse method.

However, I would like to put the data into a pandas dataframe while preserving the hierarchy. The goal is to query for single messages with Boolean expressions (e.g. if status is Active, get the Message with all its streets and its streets' links) and get all the data that belongs to the matching message. How would this best be done?

I tried different approaches but ran into problems with all of them.

If I create one dataframe row for each XML element that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged), because when a message row is created its child streets and links are not known yet. Besides, I would not know how to select a sub-dataset by Boolean condition so that I get the matching rows together with their children rather than only the single rows.


When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object since there is nothing to aggregate here (no mean/std/sum/whatsoever, the values should stay the same).

Any suggestions how this could be handled efficiently?

Dirk

1 Answer

I finally managed to solve the problem as described above and this is how.

I extended the XML document given above to include two messages instead of one. This is how it looks as a valid Python string (it could of course also be loaded from a file):

To parse the hierarchical XML structure into a flat pandas dataframe, I used Python's ElementTree iterparse method, which provides a SAX-like interface that iterates through an XML document element by element and emits events when XML tags start or end.

For each parsed XML element, the information it carries is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that is to be stored in its own dataframe row later on. Once all the information for one such row has been collected, the dictionary is appended to a list that holds all rows in their proper order.

This is what the XML parsing looks like (see inline comments for further explanation):
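The original listing is not preserved here, so the following is only a sketch of the idea, not the code from the answer. It assumes the hypothetical layout used for illustration (id, status and priority as <Message> attributes, street and link fields as child elements) and buffers the link rows of a street until the street ends, rather than juggling three dictionaries exactly as described. The function name `parse_rows` and all XML values are made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-message document; tag layout and values are assumptions.
XML_DOC = """<Messages>
  <Message id="1" status="Active" priority="High">
    <Street><name>Main Street</name><length>1200</length>
      <Link><id>101</id><direction>north</direction></Link>
      <Link><id>102</id><direction>south</direction></Link>
    </Street>
  </Message>
  <Message id="2" status="Inactive" priority="Low">
    <Street><name>Side Street</name><length>300</length>
      <Link><id>201</id><direction>east</direction></Link>
    </Street>
  </Message>
</Messages>"""


def parse_rows(xml_text):
    """Flatten Message/Street/Link elements into a list of row dictionaries."""
    list_of_rows = []
    message_id = None
    link_rows = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "Message":
            # Attributes are already available at the start event
            message_id = elem.get("id")
            list_of_rows.append({"messageId": message_id,
                                 "status": elem.get("status"),
                                 "priority": elem.get("priority")})
        elif event == "start" and elem.tag == "Street":
            link_rows = []  # collect this street's links until the street ends
        elif event == "end" and elem.tag == "Link":
            link_rows.append({"messageId": message_id,
                              "linkId": elem.findtext("id"),
                              "direction": elem.findtext("direction")})
        elif event == "end" and elem.tag == "Street":
            # The street is complete now: emit its row first, then its links
            street_name = elem.findtext("name")
            list_of_rows.append({"messageId": message_id,
                                 "streetName": street_name,
                                 "length": elem.findtext("length")})
            for row in link_rows:
                row["streetName"] = street_name
                list_of_rows.append(row)
        elif event == "end" and elem.tag == "Message":
            elem.clear()  # free memory once a message is fully processed
    return list_of_rows


listOfRows = parse_rows(XML_DOC)
for row in listOfRows:
    print(row)
```

Each message row is followed by its street rows, and each street row by its link rows, which is the ordering the later steps rely on.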

listOfRows is now a list of dictionaries, where each dictionary holds the information that goes into one dataframe row. Passing this list to the pandas DataFrame constructor gives the 'raw' dataframe:
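Neither the call nor its output is preserved here. A minimal sketch, assuming pandas is installed and using hand-written hypothetical rows in the shape described above (the column names other than messageId, streetName and linkId are assumptions):

```python
import pandas as pd

# Hypothetical rows: one per message, street and link, values invented
listOfRows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101", "direction": "north"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Side Street", "length": "300"},
    {"messageId": "2", "streetName": "Side Street", "linkId": "201", "direction": "east"},
]

# Keys that are missing from a dictionary become NaN in that row
df = pd.DataFrame(listOfRows)
print(df)
```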

We can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex on that dataframe:

which gives:
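The original call and output are not preserved. One way this step might look, assuming pandas and the same kind of hypothetical rows as before, is `DataFrame.set_index` with the three column names:

```python
import pandas as pd

# Hypothetical rows: one per message, street and link, values invented
listOfRows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101", "direction": "north"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Side Street", "length": "300"},
    {"messageId": "2", "streetName": "Side Street", "linkId": "201", "direction": "east"},
]

# The three key columns move into a three-level MultiIndex;
# message and street rows keep NaN in the levels they do not have.
df = pd.DataFrame(listOfRows).set_index(["messageId", "streetName", "linkId"])
print(df)
```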

Even though having NaN in an index is generally discouraged, I don't see any problem with it for this use case.

Finally, to get the desired effect of accessing single messages by their messageId, including all of their 'children' streets and links, the MultiIndexed dataframe has to be grouped by the outermost index level:

Now you can, for example, loop over all messages (and do whatever you like with each of them):
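The grouping and the loop are not preserved in this copy. A sketch under the same assumptions (pandas installed, hypothetical rows) groups on index level 0 and iterates over the resulting (key, sub-dataframe) pairs:

```python
import pandas as pd

# Hypothetical rows: one per message, street and link, values invented
listOfRows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101", "direction": "north"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Side Street", "length": "300"},
    {"messageId": "2", "streetName": "Side Street", "linkId": "201", "direction": "east"},
]
df = pd.DataFrame(listOfRows).set_index(["messageId", "streetName", "linkId"])

# Group by the outermost index level (messageId); nothing is aggregated
grouped = df.groupby(level=0)

# Each iteration yields the messageId and the rows of that message,
# i.e. the message row plus its street and link rows
rows_per_message = {}
for message_id, message_df in grouped:
    rows_per_message[message_id] = len(message_df)
    print(message_id)
    print(message_df)
```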

Or you can access a specific message by its messageId, which returns the row containing the messageId together with the rows of all its streets and links:
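The lookup itself is missing from this copy; with a pandas GroupBy object it could be done with `get_group`. The sketch below again uses hypothetical rows. The last three lines go one step beyond the answer and show one possible way to get the status-based selection asked for in the question; they are an assumption, not part of the original solution.

```python
import pandas as pd

# Hypothetical rows: one per message, street and link, values invented
listOfRows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101", "direction": "north"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Side Street", "length": "300"},
    {"messageId": "2", "streetName": "Side Street", "linkId": "201", "direction": "east"},
]
df = pd.DataFrame(listOfRows).set_index(["messageId", "streetName", "linkId"])
grouped = df.groupby(level=0)

# All rows that belong to message "2": the message row, its streets, its links
message = grouped.get_group("2")
print(message)

# Possible extension: find the ids of all Active messages, then keep every row
# (message, street or link) whose messageId is one of them
message_ids = df.index.get_level_values("messageId")
active_ids = message_ids[(df["status"] == "Active").to_numpy()]
active = df[message_ids.isin(active_ids)]
print(active)
```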

Hope this will be helpful for somebody sometime.

Dirk

I am trying to plot a tree from a pandas dataframe, but I don't know which data structure I should use or how to go about it.

The dataset has 4 columns: source, destination, application and timemark.

For example, a row in the dataset could be:

I would like to plot a tree graph with one node for each source. Each source should be connected to the destinations it has communicated with, each destination to all the applications it has used with that source, and each (source, destination, application) leaf to all the timemarks of the sessions that used this application between this destination and this source.

Could you please tell me how I can do this in Python?

Thanks a lot!


Pablo Ibañez Matía

1 Answer

You should look at NetworkX:

'NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.'

You can feed your dataset to populate a graph and then plot the graph.


For example:
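The code from the original answer is not preserved, so this is only a sketch of how the dataset might be fed into NetworkX. It assumes pandas and networkx are installed; the sample rows are invented, and the choice to use the path from the source down to each level as the node key is mine, made so that the same destination under two different sources stays two separate nodes and the result is a tree per source.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample rows; column names follow the question, values are invented
df = pd.DataFrame({
    "source": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "destination": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "application": ["http", "http", "ssh"],
    "timemark": ["09:00", "09:05", "10:00"],
})

levels = ["source", "destination", "application", "timemark"]
G = nx.DiGraph()

for row in df[levels].itertuples(index=False):
    for depth in range(1, len(levels)):
        # A node is identified by its full path from the source,
        # e.g. ("10.0.0.1", "10.0.0.9") for a destination node
        parent = tuple(row[:depth])
        child = tuple(row[:depth + 1])
        G.add_edge(parent, child)
        # Keep the short value as a display label
        G.nodes[parent]["label"] = row[depth - 1]
        G.nodes[child]["label"] = row[depth]

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

With matplotlib available, something like `nx.draw(G, labels=nx.get_node_attributes(G, "label"), with_labels=True)` should render the graph; a hierarchical layout needs extra work or an external layout tool.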

would give you a plot of the resulting graph.

dportman