Pandas - compute new column based on the relative value in other rows

Given data like the following:

import pandas as pd
from io import StringIO

data = """
Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

Each row contains data pertaining to a location. For each location, I need to find the distance to every other location (row) using the following formula (simplified for ease):

distance = sqrt((Long1-Long2)^2 + (Lat1-Lat2)^2)
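
As a quick sanity check of the formula, here is a minimal sketch that applies it to a single pair, ABC11 and ABC20, with the coordinates copied from the sample data. The variable names are only illustrative; math.hypot(dx, dy) computes sqrt(dx^2 + dy^2).

```python
import math

# Coordinates copied from the sample data above (ABC11 and ABC20)
long1, lat1 = 139.6295542, 35.61144069
long2, lat2 = 139.630596, 35.61045559

# Straight-line (Euclidean) distance in degree units, as in the simplified formula
dist = math.hypot(long1 - long2, lat1 - lat2)
print(dist)
```

This should print roughly 0.001434, which is the value expected for this pair further down.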

If this were done outside pandas, I would do it as follows:

import math

rows = df.to_dict('records')

# distance of each location w.r.t other locations excluding self
results = {}
for row in rows:
    loc = row['Location']
    results[loc] = {}
    # get a new list excl the curr row
    nrows = [r for r in rows if r['Location'] != loc]
    for nrow in nrows:
        dist = math.sqrt((row["Long"] - nrow["Long"])**2 + (row["Lat"] - nrow["Lat"])**2)
        results[loc][nrow["Location"]] = dist

# find the location with min distance 
fin_results = {}
for k, v in results.items():
    fin_results[k] = {}
    minValKey = min(v, key = v.get)
    fin_results[k]["location"] = minValKey 
    fin_results[k]["dist"] = v[minValKey]

This gives an output like the one below: for each location, the nearest other location and the distance to it.

{'ABC11': {'location': 'ABC20', 'dist': 0.001433795400325211}, 'ABC20': {'location': 'ABC11', 'dist': 0.001433795400325211}, 'ABC03': {'location': 'ABC11', 'dist': 0.001897909941062068}, 'ABC54': {'location': 'ABC06', 'dist': 0.006614555169662396}, 'ABC05': {'location': 'ABC06', 'dist': 0.002631545857463665}, 'ABC06': {'location': 'ABC05', 'dist': 0.002631545857463665}, 'ABC24': {'location': 'ABC06', 'dist': 0.004054030973106164}}

While this works functionally, I would like to know the pandas way of doing it.

The desired output

+----------+-------------------+----------------------------+
| location |  nearest_location |  nearest_location_distance |
+----------+-------------------+----------------------------+
| 'ABC11'  | 'ABC20'           | 0.001433795400325211       |
| 'ABC20'  | 'ABC11'           | 0.001433795400325211       |
| 'ABC03'  | 'ABC11'           | 0.001897909941062068       |
| 'ABC54'  | 'ABC06'           | 0.006614555169662396       |
| 'ABC05'  | 'ABC06'           | 0.002631545857463665       |
| 'ABC06'  | 'ABC05'           | 0.002631545857463665       |
| 'ABC24'  | 'ABC06'           | 0.004054030973106164       |
+----------+-------------------+----------------------------+

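You can use NumPy broadcasting to compute every pairwise distance in one step:

import numpy as np

long_ = df.Long.to_numpy()
lat   = df.Lat.to_numpy()

# long_[:, None] is a column vector, so the subtraction broadcasts to an n x n matrix
distances = np.sqrt((long_ - long_[:, None]) ** 2 + (lat - lat[:, None]) ** 2)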

dist_df = pd.DataFrame(distances, index=df.Location, columns=df.Location)

Location     ABC11     ABC20     ABC03     ABC54     ABC05     ABC06     ABC24

ABC11     0.000000  0.001434  0.001898  0.167940  0.167348  0.165044  0.163559
ABC20     0.001434  0.000000  0.002878  0.167472  0.166822  0.164528  0.163012
ABC03     0.001898  0.002878  0.000000  0.166680  0.166151  0.163836  0.162385
ABC54     0.167940  0.167472  0.166680  0.000000  0.007295  0.006615  0.010666
ABC05     0.167348  0.166822  0.166151  0.007295  0.000000  0.002632  0.004558
ABC06     0.165044  0.164528  0.163836  0.006615  0.002632  0.000000  0.004054
ABC24     0.163559  0.163012  0.162385  0.010666  0.004558  0.004054  0.000000

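# mask the zero self-distances on the diagonal so they are never picked as the minimum
m = dist_df[dist_df > 0]
pd.concat([m.idxmin(axis=1).rename('nearest_location'),
           m.min(axis=1).rename('nearest_location_distance')], axis=1)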

The output data frame would be something like

        nearest_location  nearest_location_distance
Location                                            
ABC11               ABC20                   0.001434
ABC20               ABC11                   0.001434
ABC03               ABC11                   0.001898
ABC54               ABC06                   0.006615
ABC05               ABC06                   0.002632
ABC06               ABC05                   0.002632
ABC24               ABC06                   0.004054

This finds the distance from each row to all the others. That is how I interpreted the question; I am not sure whether it is your goal.
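
One caveat with the dist_df > 0 mask: it would also hide a genuinely different location that happens to share the same coordinates. A minimal sketch of an alternative, assuming you only want to exclude each row's distance to itself, is to overwrite the diagonal with infinity. The frame below rebuilds the sample data inline so the snippet runs on its own; the names coords, delta and nearest are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Location": ["ABC11", "ABC20", "ABC03", "ABC54", "ABC05", "ABC06", "ABC24"],
    "Long": [139.6295542, 139.630596, 139.6300307, 139.7787818,
             139.7814447, 139.7788191, 139.7790396],
    "Lat": [35.61144069, 35.61045559, 35.61327781, 35.68847945,
            35.6816882, 35.681865, 35.67781697],
})

coords = df[["Long", "Lat"]].to_numpy()

# (n, 1, 2) - (1, n, 2) broadcasts to (n, n, 2): every pairwise coordinate difference
delta = coords[:, None, :] - coords[None, :, :]
distances = np.sqrt((delta ** 2).sum(axis=-1))

# Exclude only the self-distance of each row, not every zero in the matrix
np.fill_diagonal(distances, np.inf)

nearest_idx = distances.argmin(axis=1)
nearest = pd.DataFrame({
    "location": df["Location"],
    "nearest_location": df["Location"].to_numpy()[nearest_idx],
    "nearest_location_distance": distances[np.arange(len(df)), nearest_idx],
})
print(nearest)
```

For the sample data this should produce the same pairs as the table above.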


You can use scipy's distance_matrix, which is actually what @rafaelc coded:

from scipy.spatial import distance_matrix

dist_mat = distance_matrix(df[['Long','Lat']],df[['Long','Lat']])

# assign distance matrix with appropriate name
dist_mat = pd.DataFrame(dist_mat, 
                        index=df.Location, 
                        columns=df.Location)

# convert the data frame to dict
(dist_mat.where(dist_mat>0)
     .agg(('idxmin', 'min'))
     .to_dict()
)

Output:

{'ABC11': {'idxmin': 'ABC20', 'min': 0.001433795400325211},
 'ABC20': {'idxmin': 'ABC11', 'min': 0.001433795400325211},
 'ABC03': {'idxmin': 'ABC11', 'min': 0.001897909941062068},
 'ABC54': {'idxmin': 'ABC06', 'min': 0.006614555169662396},
 'ABC05': {'idxmin': 'ABC06', 'min': 0.002631545857463665},
 'ABC06': {'idxmin': 'ABC05', 'min': 0.002631545857463665},
 'ABC24': {'idxmin': 'ABC06', 'min': 0.004054030973106164}}

If you want the dataframe only:

(dist_mat.where(dist_mat>0)
     .agg(('idxmin', 'min'))
     .T
)

Output:

      idxmin         min
ABC11  ABC20   0.0014338
ABC20  ABC11   0.0014338
ABC03  ABC11  0.00189791
ABC54  ABC06  0.00661456
ABC05  ABC06  0.00263155
ABC06  ABC05  0.00263155
ABC24  ABC06  0.00405403
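
The full distance matrix grows as n x n, which can become expensive for many locations. As a rough sketch, and not part of the answer above, scipy's cKDTree can return only the nearest neighbour of each point. It assumes no two locations share identical coordinates, so that the first hit for each point (distance 0) is the point itself. The frame is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({
    "Location": ["ABC11", "ABC20", "ABC03", "ABC54", "ABC05", "ABC06", "ABC24"],
    "Long": [139.6295542, 139.630596, 139.6300307, 139.7787818,
             139.7814447, 139.7788191, 139.7790396],
    "Lat": [35.61144069, 35.61045559, 35.61327781, 35.68847945,
            35.6816882, 35.681865, 35.67781697],
})

coords = df[["Long", "Lat"]].to_numpy()
tree = cKDTree(coords)

# k=2 asks for the two closest points: column 0 is the point itself
# (assuming no duplicate coordinates), column 1 is its nearest other point
dist, idx = tree.query(coords, k=2)

nearest = pd.DataFrame({
    "location": df["Location"],
    "nearest_location": df["Location"].to_numpy()[idx[:, 1]],
    "nearest_location_distance": dist[:, 1],
})
print(nearest)
```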


You can also use df.iterrows:

distance_min=[]
location_min=[]
output_df=df.copy()
for i, col in df.iterrows():
    dist=((col['Long']-df['Long']).pow(2)+(col['Lat']-df['Lat']).pow(2)).pow(1/2)
    location_min.append(df.at[dist[dist>0].idxmin(),'Location'])
    distance_min.append(dist[dist>0].min())

output_df['nearest_location']=location_min
output_df['nearest_location_distance']=distance_min
output_df=output_df.reindex(columns=['Location','nearest_location','nearest_location_distance'])
print(output_df)

 Location  nearest_location  nearest_location_distance
0    ABC11            ABC20                   0.001434
1    ABC20            ABC11                   0.001434
2    ABC03            ABC11                   0.001898
3    ABC54            ABC06                   0.006615
4    ABC05            ABC06                   0.002632
5    ABC06            ABC05                   0.002632
6    ABC24            ABC06                   0.004054


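As ansev proposed, here is the same idea in a more compact form. Note that diff(-1) only compares each row with the row after it, so result holds the distance to the next row rather than to the nearest location: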

import pandas as pd 
from io import StringIO

df = pd.read_csv(StringIO(data))
df['result']= (df['Lat'].diff(-1).pow(2)+df['Long'].diff(-1).pow(2)).pow(1/2)
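
To make the behaviour of diff(-1) concrete, here is a tiny sketch on a made-up Series (the values are illustrative only). Each element has the following element subtracted from it, and the last element has nothing to compare against.

```python
import pandas as pd

# Made-up values, only to show which rows diff(-1) pairs up
s = pd.Series([1.0, 4.0, 9.0])

# Element i becomes s[i] - s[i + 1]; the final row has no successor and becomes NaN
step = s.diff(-1)
print(step)
```

The result is -3.0, -5.0 and NaN, so a column built this way describes consecutive rows only and leaves the last row empty.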
