Find all descendants for points in Python

python xml get attribute value
python elementtree
elementtree python 3
python elementtree tutorial
how to extract data from xml file using python
python xml xpath
python xml to dict
python read xml file line by line

I need to get all descendants point of links represented with side_a - side_b (in one dataframe) until reach for each side_a their end_point (in other dataframe). So:

side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

The point is to get all points for each side_a value until reach end_point from df2 for that value. If it has two end_point values (like "k" does) that it should be two lists.

I have some code but it's not written with this approach, it drops all rows from df1 if df1['side_a'] == df2['end_points'] and that causes certain problems. But if someone wants me to post the code I will, of course.

The desired output would be something like this:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

And one more thing, if there is the same both side, that point doesn't need to be listed at all, I can append it later, whatever it's easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

So, the problem with this code is that works if I drop rows where side_a is equal to end_point from df2. I don't know how to implement condition that if catch the df2 in side_b column, then stop, don't go further.

Any help or hint is welcomed here, truly. Thanks in advance.

You can use networkx library and graphs:

import networkx as nx
G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                     nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)


  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]

[Tutor] How to find descendants recursively?, I can't see any way of doing recursion without writing a function. For example, in a Unix directory tree, if a symlink points back up the tree,  Iterate through nodes_df again in reverse order to get node_descendants, a Can I use a python graph library like Networkx or Shortest path around a set of points.

Your rules are inconsistent and your definitions are unclear so you may need to add some constraints here and there because it is unclear exactly what you are asking. By organizing the data-structure to fit the problem and building a more robust function for traversal (shown below) it will be easier to add/edit constraints as needed - and solve the problem completely.

Transform the df to a dict to better represent a tree structure

This problem is a lot simpler if you transform the data structure to be more intuitive to the problem, instead of trying to solve the problem in the context of the current structure.

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    tree[row['side_a']] =  set(list(tree[row['side_a']]) + list(row['side_b']))
Traverse the Tree

This is much more straightforward now that the data structure is intuitive. Any standard Depth-First-Search algorithm w/ path saving will do the trick. I modified the one in the link to work with this example.

Edit: Reading again it looks you have a condition for search termination in endpoint (you need to be more clear in your question what is input and what is output). You can adjust dfs_path(tree,**target**, root) and change the termination condition to return only the correct paths.

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition. 
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the 
            ### constraints of your goal
            if not tree[nextNode]:
                yield set(list(path) + list(nextNode)) - set(root)
                stack.append((nextNode, path + [nextNode]))
Build a dataframe from the generators we yielded

If you're not super comfortable with generators, you can structure the DFS traversal so that it outputs in a list. instead of a generator

set_a = []
end_points = []
gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items]
for gen in gen_dict:
    for row in list(gen.values()).pop():

## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}

20.5. xml.etree.ElementTree — The ElementTree XML API, Element.findall() finds only elements with a tag which are direct children of the current but the contents of the text and tail attributes are undefined at that point. Get all points of a straight line in python. Very simply, given a point A(x,y) and another point B(m,n), I need a function that can return in any iterable object a list[k,z] of all points in between.

If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:

import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))

outdf = df2.apply(lambda row: [row.side_a, 
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}], 

outdf looks like this. Note that this contains sets instead of lists as in your desired output - this allows all the paths to be combined in a simple way.

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}

19.7. xml.etree.ElementTree — The ElementTree XML API, Element.findall() finds only elements with a tag which are direct children of the current but the contents of the text and tail attributes are undefined at that point. I need to find all paths from a given graph. I can do that for now, however my recursive code is not efficient and my graphs are very complicated, hence I need a better algorithm. def findLeaves(g

xml.etree.ElementTree — The ElementTree XML API, Element.findall() finds only elements with a tag which are direct children of the current but the contents of the text and tail attributes are undefined at that point. From this point, we can now use the Beautiful Soup methods on that soup object. We can use the prettify method to turn a BS parse tree into a nicely formatted Unicode string The Find_all method The find_all method is one of the most common methods in BeautifulSoup.

Working With Tree Data Structures, Getting Leaves, Descendants and Node's Relatives; Traversing (browsing) trees Let's say we want get all deepest nodes in a tree whose branch length is larger but using a sister as the starting # point R = A.add_child(name="R") # Adds a  If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False .

Strabo on the Troad: Book XIII, Cap. I, The descendants of this holy personage too claim to have inherited the kinsman of a python, offer the whole of his month's pay to save one of these Here we find all the chief points reported by the ancients, except the descent from snakes. In this chapter, we shall discuss about Navigating by Tags. One of the important pieces of element in any piece of HTML document are tags, which may contain other tags/strings (tag’s children). Beautiful Soup provides different ways to navigate and iterate over’s tag’s children

  • You do of course remember it's not "please write my code for me" site? Can you please show us your work? What is the exact problem with your code?
  • @rsm As I said, I can post my code, but it would make a post huge and I don't think it will be used by any of helpers. You could just write a comment that I need to add my code to and I will, just not to be arrogant.
  • Can you please show us your work, add any relevant(!) code you have? And explain problem you have with it? If you expect us to come up with an algorithm and implementation to your problem - it's not how this site works.
  • @jovicbg There is a typo in df2 : the end_point of q should be r, not s. I probably have a simple solution for your problem, but its performance is not good on large dataframes. What's the approximate size of your dataframes ?
  • @QusaiAlothman Thanks, I have edit it. It was a good ent_point, but I skip one step (q - r). It's not really large, 4000 of rows.
  • @chthonicdaemon I just ask that you upvote it first :) Enjoy.
  • I have data in numeric type but I have cast to string and now got an erro: ('Either source 7163 or target 1019 is not in G', u'occurred at index 5'). Is there any way to skip if catch error like this, maybe really is not defined the target in df2 for some side_a.
  • Hrm... You might need to filter your df2 dataframe first for where side_a and end_point both appear in df.
  • @ScottBoston I did, but doesn't help. Your code works fine, but have no idea where is the problem.
  • I'm going to try with this later, as soon as possible. Column end_point in df2 represent where iteration for each side_a should stop. So end_point for A should not contain D. In df2 for A end_point is C. Maybe is confusing because I named output column end_point too. I'm sorry.
  • The comment i added should fix that. Just add a parameter for endpoint and make the termination condition nextNode == endPoint
  • Will this work with a numeric datatype, not strings?
  • I got an error : KeyErrorTraceback (most recent call last) <ipython-input-45-29d3775fa7ea> in <module>() 3 gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items] 4 for gen in gen_dict: ----> 5 for row in list(gen.values()).pop(): 6 set_a.append(list(gen.keys()).pop()) 7 end_points.append(row) KeyError: '1'
  • I just ran the exact code posted and it output the correct values. Did you name a variable list at some point?