left_join based on closest LAT_LON in R

I am trying to find the ID of the closest LAT_LON in a data.frame with reference to my original data.frame. I have already figured this out by merging both data.frames on a unique identifier and then calculating the distance with the distHaversine function from geosphere. Now I want to take this a step further: join the data.frames without the unique identifier and find the ID of the nearest LAT-LON. I used the following code after merging:

v3 <- v2 %>% mutate(CTD = distHaversine(cbind(LON.x, LAT.x), cbind(LON.y, LAT.y)))

DATA:

loc <- data.frame(station = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c('NW1','EC3V'))
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                lat = c(51.53253,51.520865,51.490281,51.51224),
                lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                postcode = c('EC1V','EC1A', 'W14', 'W2'))

As a final result I would like something like this:

df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker Street','Bank'), 
              stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'), 
              dist = c('x','x','x','x','x','x','x','x'), 
              lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224), 
              lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
              postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
              )

Any help is appreciated. Thanks.

As the distances between the objects are small, we can speed up the computation by using the Euclidean distance between the coordinates. As we are not around the equator, the lng coordinates are squished a bit; we can make the comparison slightly better by scaling lng by the cosine of the latitude.

# One shared scale factor so both data sets end up in the same coordinate space
lng_scale <- cos(mean(c(stop$lat, loc$lat), na.rm = TRUE) / 180 * pi)
cor_stop <- stop[, c("lat", "lng")]
cor_stop$lng <- cor_stop$lng * lng_scale
cor_loc <- loc[, c("lat", "lng")]
cor_loc$lng <- cor_loc$lng * lng_scale
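As a rough sanity check of the planar approximation, the following base R sketch compares it with a Haversine distance for one pair of points from the question. A degree of longitude covers about cos(latitude) times the ground distance of a degree of latitude, which is what the scaling relies on. The helper names `haversine` and `flat` are made up for this illustration, and the radius of 6378137 m is an assumption chosen to match the default of geosphere's distHaversine.

```r
# Haversine great-circle distance in metres on a sphere of radius r
haversine <- function(lat1, lng1, lat2, lng2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

# Planar approximation: longitude differences scaled by cos(mean latitude)
flat <- function(lat1, lng1, lat2, lng2, r = 6378137) {
  to_rad <- pi / 180
  lng_scale <- cos((lat1 + lat2) / 2 * to_rad)
  r * to_rad * sqrt((lat2 - lat1)^2 + ((lng2 - lng1) * lng_scale)^2)
}

# Baker Street to Bayswater, coordinates taken from the question
h <- haversine(51.522236, -0.157080, 51.51224, -0.187569)
f <- flat(51.522236, -0.157080, 51.51224, -0.187569)
c(haversine = h, flat = f, rel_diff = abs(h - f) / h)
```

At these distances the two values should agree closely, so ranking neighbours by the scaled Euclidean distance is a reasonable shortcut.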

We can then calculate the closest stop for each location using the FNN package, which uses a tree-based search to quickly find the K nearest neighbours. This should scale to big data sets (I have used this for data sets with millions of records):

library(FNN)
matches <- knnx.index(cor_stop, cor_loc, k = 1)
matches
##      [,1]
## [1,]    4
## [2,]    2

We can then construct the end result:

res <- loc
res$stop_station  <- stop$station[matches[,1]]
res$stop_lat      <- stop$lat[matches[,1]]
res$stop_lng      <- stop$lng[matches[,1]]
res$stop_postcode <- stop$postcode[matches[,1]]

And calculate the actual distance:

library(geosphere)
res$dist <- distHaversine(res[, c("lng", "lat")], res[, c("stop_lng", "stop_lat")])
res
##          station      lat         lng postcode stop_station stop_lat  stop_lng
## 1 Baker Street 51.52224 -0.15708000      NW1    Bayswater 51.51224 -0.187569
## 2         Bank 51.51340 -0.08905843     EC3V     Barbican 51.52087 -0.097758
##   stop_postcode     dist
## 1            W2 2387.231
## 2          EC1A 1026.091

If you are unsure that the closest point in lat-long is also the closest point 'as the crow flies', you could use this method to first select the K closest points in lat-long, then calculate the actual distances for those points, and finally select the closest one.
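A minimal sketch of that two-step idea, assuming the FNN and geosphere packages are installed. The value k = 3 is an arbitrary choice for illustration, and the variable names `cand`, `cand_dist` and `best` are hypothetical.

```r
library(FNN)        # knnx.index()
library(geosphere)  # distHaversine()

loc <- data.frame(station = c("Baker Street", "Bank"),
                  lat = c(51.522236, 51.5134047),
                  lng = c(-0.157080, -0.08905843))
stop <- data.frame(station = c("Angel", "Barbican", "Barons Court", "Bayswater"),
                   lat = c(51.53253, 51.520865, 51.490281, 51.51224),
                   lng = c(-0.10579, -0.097758, -0.214340, -0.187569))

# Approximate planar coordinates: shrink longitude by cos(latitude)
lng_scale <- cos(mean(c(loc$lat, stop$lat)) * pi / 180)
cor_loc  <- cbind(loc$lat,  loc$lng  * lng_scale)
cor_stop <- cbind(stop$lat, stop$lng * lng_scale)

# Step 1: k candidate stops per location (k = 3 is an arbitrary choice)
k <- min(3, nrow(stop))
cand <- knnx.index(cor_stop, cor_loc, k = k)  # nrow(loc) x k row indices into stop

# Step 2: Haversine distance from each location to each of its candidates
cand_dist <- vapply(seq_len(k), function(j) {
  distHaversine(as.matrix(loc[, c("lng", "lat")]),
                as.matrix(stop[cand[, j], c("lng", "lat")]))
}, numeric(nrow(loc)))
cand_dist <- matrix(cand_dist, nrow = nrow(loc))  # stays a matrix for one location

# Step 3: keep the closest candidate for each location
best_col <- apply(cand_dist, 1, which.min)
pick <- cbind(seq_len(nrow(loc)), best_col)
best <- cand[pick]

data.frame(loc = loc$station,
           stop = stop$station[best],
           dist = cand_dist[pick])
```

Only k distances per location are computed exactly, so the cost stays far below that of a full distance matrix.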


All of the joining, distance calculations, and plotting can be done with available R packages.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
library(nngeo)
library(mapview)

## Original data
loc <- data.frame(station = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c('NW1','EC3V'))

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                   lat = c(51.53253,51.520865,51.490281,51.51224),
                   lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                   postcode = c('EC1V','EC1A', 'W14', 'W2'))

df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker Street','Bank'), 
                 stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'), 
                 dist = c('x','x','x','x','x','x','x','x'), 
                 lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224), 
                 lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
                 postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
)



## Create sf objects from lat/lon points
loc_sf <- loc %>% st_as_sf(coords = c('lng', 'lat'), remove = TRUE) %>%
  st_set_crs(4326) 

stop_sf <- stop %>% st_as_sf(coords = c('lng', 'lat'), remove = TRUE) %>%
  st_set_crs(4326) 


# Use st_nearest_feature to cbind loc to stop by nearest points
joined_sf <- stop_sf %>% 
  cbind(
    loc_sf[st_nearest_feature(stop_sf, loc_sf),])


## mutate to add column showing distance between geometries
joined_sf %>%
  mutate(dist = st_distance(geometry, geometry.1, by_element = TRUE))
#> Simple feature collection with 4 features and 5 fields
#> Active geometry column: geometry
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: -0.21434 ymin: 51.49028 xmax: -0.097758 ymax: 51.53253
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>        station postcode    station.1 postcode.1                   geometry
#> 1        Angel     EC1V         Bank       EC3V  POINT (-0.10579 51.53253)
#> 2     Barbican     EC1A         Bank       EC3V POINT (-0.097758 51.52087)
#> 3 Barons Court      W14 Baker Street        NW1  POINT (-0.21434 51.49028)
#> 4    Bayswater       W2 Baker Street        NW1 POINT (-0.187569 51.51224)
#>                    geometry.1         dist
#> 1 POINT (-0.08905843 51.5134) 2424.102 [m]
#> 2 POINT (-0.08905843 51.5134) 1026.449 [m]
#> 3   POINT (-0.15708 51.52224) 5333.417 [m]
#> 4   POINT (-0.15708 51.52224) 2390.791 [m]



## Use nngeo and mapview to plot lines on a map
# NOT run for reprex, output image attached 
#connected <- st_connect(stop_sf, loc_sf)
# mapview(connected) + 
#   mapview(loc_sf, color = 'red') +
#   mapview(stop_sf, color = 'black')

Created on 2020-01-21 by the reprex package (v0.3.0)
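The reprex above attaches the nearest location to each stop. For the direction asked about in the question (one row per location, with its nearest stop attached, as in a left join), a minimal sketch using the same sf functions could look like this. The result column names `stop_station`, `stop_postcode` and `dist` are chosen here for illustration only.

```r
library(sf)

loc <- data.frame(station = c("Baker Street", "Bank"),
                  lat = c(51.522236, 51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c("NW1", "EC3V"))
stop <- data.frame(station = c("Angel", "Barbican", "Barons Court", "Bayswater"),
                   lat = c(51.53253, 51.520865, 51.490281, 51.51224),
                   lng = c(-0.10579, -0.097758, -0.214340, -0.187569),
                   postcode = c("EC1V", "EC1A", "W14", "W2"))

loc_sf  <- st_as_sf(loc,  coords = c("lng", "lat"), crs = 4326)
stop_sf <- st_as_sf(stop, coords = c("lng", "lat"), crs = 4326)

# Row index of the nearest stop for every location
nearest <- st_nearest_feature(loc_sf, stop_sf)

# One output row per location, as a left join would give
res <- loc
res$stop_station  <- stop$station[nearest]
res$stop_postcode <- stop$postcode[nearest]
res$dist <- st_distance(loc_sf, stop_sf[nearest, ], by_element = TRUE)
res
```

Because the result starts from `loc`, it always has exactly as many rows as the left-hand table.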


You can avoid searching for nearest neighbours completely if you are able to use a projected coordinate system. If you can, then you can cheaply construct Voronoi polygons around each location - these polygons define areas that are closest to each of the input points.

You can then just use GIS intersections to find which points lie in which polygons and then calculate the distances for known pairs of closest points. I think this should be much faster. However, you can't use Voronoi polygons with geographic coordinates.

library(sf)

loc <- data.frame(station = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c('NW1','EC3V'))

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                lat = c(51.53253,51.520865,51.490281,51.51224),
                lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                postcode = c('EC1V','EC1A', 'W14', 'W2'))

# Convert to a suitable PCS (in this case OSGB)
stop <- st_as_sf(stop, coords=c('lng','lat'), crs=4326)
stop <- st_transform(stop, crs=27700)
loc <- st_as_sf(loc, coords=c('lng','lat'), crs=4326)
loc <- st_transform(loc, crs=27700)

# Extract Voronoi polygons around locations and convert to an sf object
loc_voronoi <- st_collection_extract(st_voronoi(do.call(c, st_geometry(loc))))
loc_voronoi <- st_sf(loc_voronoi, crs=st_crs(loc))

# Match Voronoi polygons to locations and select that geometry
loc$voronoi <- loc_voronoi$loc_voronoi[unlist(st_intersects(loc, loc_voronoi))]
st_geometry(loc) <- 'voronoi'

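# Find which location is closest to each stop (i.e. whose Voronoi cell it falls in)
stop$loc <- loc$station[unlist(st_intersects(stop, loc))]

# Reset locs to use the point geometry and get distances
st_geometry(loc) <- 'geometry'
# Look rows up by station name explicitly; this works whether station is a factor or a character
stop$loc_dist <- st_distance(stop, loc[match(stop$loc, loc$station),], by_element=TRUE)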

That gives you the following output:

Simple feature collection with 4 features and 4 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 524069.7 ymin: 178326.3 xmax: 532074.6 ymax: 183213.9
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
       station postcode                  geometry          loc     loc_dist
1        Angel     EC1V POINT (531483.8 183213.9)         Bank 2423.722 [m]
2     Barbican     EC1A POINT (532074.6 181931.2)         Bank 1026.289 [m]
3 Barons Court      W14 POINT (524069.7 178326.3) Baker Street 5332.478 [m]
4    Bayswater       W2 POINT (525867.7 180813.9) Baker Street 2390.377 [m]


Comments
  • This might be helpful stackoverflow.com/questions/21977720/…
  • @RonakShah, it does not solve the question as my dataset is too large. It keeps computing for a long time.
  • Here is another potential option. stackoverflow.com/questions/58831578/…. This is an M*N problem: as either data frame grows, it just takes longer. To improve performance, reduce the size of the problem, either by using a divide and conquer algorithm or by reducing the precision of the starting locations from 5 decimal places down to 3. If you round the starting locations, you may find a large number of duplicates and thus save the time of recalculating.
  • Thanks for that @Dave2e. I cannot reduce the precision as I am dealing with objects very close to each other. I can reduce the size of the problem. Does distmatrix calculate Haversine distance by default? Thanks
  • I believe it uses the distGeo method, which assumes an ellipsoid and not a sphere.
  • Thanks, @Jan van der Laan. Will check this today; a bit occupied with something else. Thanks
  • @Jan van der Laan, any idea why our distance calculations are so far apart? ~600m off for Baker to Bayswater.
  • @mrhellmann I switched around the long and lat in the distHaversine call. I'll correct once I get to a computer.