## Pandas - compute new column based on the relative value in other rows

Given data like the following:

```python
from io import StringIO

import pandas as pd

data = """Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))
```

Each row contains data pertaining to a location. For each location, I need to find the distance to every other location (row), computed as follows (simplified for ease):

distance = sqrt((Long1-Long2)^2 + (Lat1-Lat2)^2)

If I did this outside pandas, I would do it as follows:

```python
import math

rows = df.to_dict('records')

# distance of each location w.r.t. other locations, excluding self
results = {}
for row in rows:
    loc = row['Location']
    results[loc] = {}
    # get a new list excluding the current row
    nrows = [row for row in rows if row['Location'] != loc]
    for nrow in nrows:
        dist = math.sqrt((row["Long"] - nrow["Long"])**2 + (row["Lat"] - nrow["Lat"])**2)
        results[loc][nrow["Location"]] = dist

# find the location with min distance
fin_results = {}
for k, v in results.items():
    fin_results[k] = {}
    minValKey = min(v, key=v.get)
    fin_results[k]["location"] = minValKey
    fin_results[k]["dist"] = v[minValKey]
```

This gives output like the following: for each location, the nearest other location and the distance to it.

```python
{'ABC11': {'location': 'ABC20', 'dist': 0.001433795400325211},
 'ABC20': {'location': 'ABC11', 'dist': 0.001433795400325211},
 'ABC03': {'location': 'ABC11', 'dist': 0.001897909941062068},
 'ABC54': {'location': 'ABC06', 'dist': 0.006614555169662396},
 'ABC05': {'location': 'ABC06', 'dist': 0.002631545857463665},
 'ABC06': {'location': 'ABC05', 'dist': 0.002631545857463665},
 'ABC24': {'location': 'ABC06', 'dist': 0.004054030973106164}}
```

While this works functionally, I would like to know what the `pandas` way of doing this would be.

The desired output:

```
+----------+-------------------+----------------------------+
| location | nearest_location  | nearest_location_distance  |
+----------+-------------------+----------------------------+
| 'ABC11'  | 'ABC20'           | 0.001433795400325211       |
| 'ABC20'  | 'ABC11'           | 0.001433795400325211       |
| 'ABC03'  | 'ABC11'           | 0.001897909941062068       |
| 'ABC54'  | 'ABC06'           | 0.006614555169662396       |
| 'ABC05'  | 'ABC06'           | 0.002631545857463665       |
| 'ABC06'  | 'ABC05'           | 0.002631545857463665       |
| 'ABC24'  | 'ABC06'           | 0.004054030973106164       |
+----------+-------------------+----------------------------+
```

You can use `numpy` broadcasting:

```python
import numpy as np

long_ = df.Long.to_numpy()
lat = df.Lat.to_numpy()

# long_[:, None] is a column vector, so the subtraction yields an n x n matrix
distances = np.sqrt((long_ - long_[:, None]) ** 2 + (lat - lat[:, None]) ** 2)

dist_df = pd.DataFrame(distances, index=df.Location, columns=df.Location)
```

```
Location     ABC11     ABC20     ABC03     ABC54     ABC05     ABC06     ABC24
ABC11     0.000000  0.001434  0.001898  0.167940  0.167348  0.165044  0.163559
ABC20     0.001434  0.000000  0.002878  0.167472  0.166822  0.164528  0.163012
ABC03     0.001898  0.002878  0.000000  0.166680  0.166151  0.163836  0.162385
ABC54     0.167940  0.167472  0.166680  0.000000  0.007295  0.006615  0.010666
ABC05     0.167348  0.166822  0.166151  0.007295  0.000000  0.002632  0.004558
ABC06     0.165044  0.164528  0.163836  0.006615  0.002632  0.000000  0.004054
ABC24     0.163559  0.163012  0.162385  0.010666  0.004558  0.004054  0.000000
```

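```python
# Mask the zero distances (each location to itself) so they are ignored
m = dist_df[dist_df > 0]

# Pass axis by keyword: recent pandas versions no longer accept it positionally
pd.concat(
    [
        m.idxmin(axis=1).rename('nearest_location'),
        m.min(axis=1).rename('nearest_location_distance'),
    ],
    axis=1,
)
```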

The output data frame would be something like:

```
         nearest_location  nearest_location_distance
Location
ABC11               ABC20                   0.001434
ABC20               ABC11                   0.001434
ABC03               ABC11                   0.001898
ABC54               ABC06                   0.006615
ABC05               ABC06                   0.002632
ABC06               ABC05                   0.002632
ABC24               ABC06                   0.004054
```

This finds the distance from each row to *all* other rows. That is how I interpreted the question; I am not sure whether that is your goal.
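One caveat with masking on `> 0`: if two different locations happened to share identical coordinates, their legitimate zero distance would be masked as well. A minimal sketch of an alternative, assuming the same sample data as in the question, excludes the self-distance by position with `np.fill_diagonal` instead. The names `coords`, `nearest_pos`, and `result` are illustrative, not from the original answer.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

coords = df[["Long", "Lat"]].to_numpy()

# Pairwise coordinate differences via broadcasting: shape (n, n, 2)
delta = coords[:, None, :] - coords[None, :, :]
distances = np.sqrt((delta ** 2).sum(axis=-1))

# Exclude each row's distance to itself by position rather than by value
np.fill_diagonal(distances, np.inf)

# Column position of the smallest remaining distance in each row
nearest_pos = distances.argmin(axis=1)

result = pd.DataFrame(
    {
        "nearest_location": df["Location"].to_numpy()[nearest_pos],
        "nearest_location_distance": distances.min(axis=1),
    },
    index=df["Location"],
)
print(result)
```

Both variants build the full n x n matrix, so memory grows quadratically with the number of rows.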


You can use `scipy`'s `distance_matrix`, which is actually what @rafaelc coded:

```python
from scipy.spatial import distance_matrix

dist_mat = distance_matrix(df[['Long', 'Lat']], df[['Long', 'Lat']])

# assign distance matrix with appropriate name
dist_mat = pd.DataFrame(dist_mat, index=df.Location, columns=df.Location)

# convert the data frame to dict
(dist_mat.where(dist_mat > 0)
    .agg(('idxmin', 'min'))
    .to_dict()
)
```

Output:

```python
{'ABC11': {'idxmin': 'ABC20', 'min': 0.001433795400325211},
 'ABC20': {'idxmin': 'ABC11', 'min': 0.001433795400325211},
 'ABC03': {'idxmin': 'ABC11', 'min': 0.001897909941062068},
 'ABC54': {'idxmin': 'ABC06', 'min': 0.006614555169662396},
 'ABC05': {'idxmin': 'ABC06', 'min': 0.002631545857463665},
 'ABC06': {'idxmin': 'ABC05', 'min': 0.002631545857463665},
 'ABC24': {'idxmin': 'ABC06', 'min': 0.004054030973106164}}
```

If you want the dataframe only:

```python
(dist_mat.where(dist_mat > 0)
    .agg(('idxmin', 'min'))
    .T
)
```

Output:

```
      idxmin         min
ABC11  ABC20   0.0014338
ABC20  ABC11   0.0014338
ABC03  ABC11  0.00189791
ABC54  ABC06  0.00661456
ABC05  ABC06  0.00263155
ABC06  ABC05  0.00263155
ABC24  ABC06  0.00405403
```
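For larger inputs, the full distance matrix can become expensive. As a hedged alternative that is not part of the original answer, `scipy.spatial.cKDTree` can look up each point's nearest neighbour without materialising all pairs. The sketch below assumes the question's sample data and that no two locations share coordinates, so that the closest hit for every point is the point itself. The names `tree` and `nearest` are illustrative.

```python
from io import StringIO

import pandas as pd
from scipy.spatial import cKDTree

data = """Class,Location,Long,Lat
A,ABC11,139.6295542,35.61144069
A,ABC20,139.630596,35.61045559
A,ABC03,139.6300307,35.61327781
B,ABC54,139.7787818,35.68847945
B,ABC05,139.7814447,35.6816882
B,ABC06,139.7788191,35.681865
B,ABC24,139.7790396,35.67781697
"""
df = pd.read_csv(StringIO(data))

coords = df[["Long", "Lat"]].to_numpy()
tree = cKDTree(coords)

# k=2 returns the two closest points for every query point:
# column 0 is the point itself (distance 0), column 1 its nearest neighbour.
# Assumption: coordinates are unique, otherwise column 0 may be a duplicate.
dist, idx = tree.query(coords, k=2)

nearest = pd.DataFrame(
    {
        "Location": df["Location"],
        "nearest_location": df["Location"].to_numpy()[idx[:, 1]],
        "nearest_location_distance": dist[:, 1],
    }
)
print(nearest)
```

The result has the same columns as the desired output in the question, with one row per location.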


Also you can use df.iterrows:

```python
distance_min = []
location_min = []
output_df = df.copy()

for i, col in df.iterrows():
    # distance from the current row to every row (including itself)
    dist = ((col['Long'] - df['Long']).pow(2) + (col['Lat'] - df['Lat']).pow(2)).pow(1/2)
    # ignore the zero self-distance when picking the minimum
    location_min.append(df.at[dist[dist > 0].idxmin(), 'Location'])
    distance_min.append(dist[dist > 0].min())

output_df['nearest_location'] = location_min
output_df['nearest_location_distance'] = distance_min
output_df = output_df.reindex(columns=['Location', 'nearest_location', 'nearest_location_distance'])
print(output_df)
```

```
  Location nearest_location  nearest_location_distance
0    ABC11            ABC20                   0.001434
1    ABC20            ABC11                   0.001434
2    ABC03            ABC11                   0.001898
3    ABC54            ABC06                   0.006615
4    ABC05            ABC06                   0.002632
5    ABC06            ABC05                   0.002632
6    ABC24            ABC06                   0.004054
```


Here is the same approach that ansev proposed, in a slightly more complete form:

```python
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data))
df['result'] = (df['Lat'].diff(-1).pow(2) + df['Long'].diff(-1).pow(2)).pow(1/2)
```
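It is worth checking what `diff(-1)` computes before relying on this version: it subtracts the *following* row from each row, so `result` holds the distance to the next row only, not to the nearest location, and the last row is `NaN`. A small sketch with made-up coordinates (the `toy` frame is hypothetical, chosen so the arithmetic is easy to follow):

```python
import pandas as pd

# Made-up coordinates: rows 0 and 1 are 5 units apart, rows 1 and 2 coincide
toy = pd.DataFrame({"Long": [0.0, 3.0, 3.0], "Lat": [0.0, 4.0, 4.0]})

# diff(-1) gives row[i] - row[i + 1]; the last row has no successor
toy["result"] = (toy["Lat"].diff(-1).pow(2) + toy["Long"].diff(-1).pow(2)).pow(1/2)
print(toy)
```

Here row 0 gets 5.0 (its distance to row 1), row 1 gets 0.0 (its distance to row 2), and row 2 gets `NaN`. Row 1 is never compared with row 0, which is why this does not answer the nearest-location question in general.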
