How to replace the missing values with the median for the variable using gsub in R?

replace na with mean in r
replacing missing values with mean
median imputation in r

I have a data frame that I extracted from an html file of a Wikipedia page table. I want to replace the missing values with the median of each variable.

From the hints given I know that I need to convert the factor type to numeric values, and I likely need to use as.numeric(gsub()).

renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))

I tried using grep() to show that '\\s' is the pattern for extracting spaces, but the spaces were actually excluded from the output and only digits were shown.

When I tried using as.numeric(gsub()), the output looked like:

[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01

which does not at all resemble the data frame which looks like:

[1] 1035.3   7782     72       7109                       30134.8  2351.2            15318   

I expected the output to look exactly like the original data frame but with the spaces filled in with the column medians.

Edit: This is what the beginning of the data frame looks like. It's from "https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources".

> renew
                             Country    Hydro     Wind     Bio   Solar
1                        Afghanistan   1035.3      0.1            35.5
2                            Albania     7782                      1.9
3                            Algeria       72     19.4           339.1
4                             Angola     7109              155    18.3
5                           Anguilla                               2.4
6                Antigua and Barbuda                               5.5
7                          Argentina  30134.8    554.1  1820.4    14.5
8                            Armenia   2351.2      1.8             1.2
9                              Aruba             130.3     8.9     9.2
10                         Australia    15318    12199    3722    6209
11                           Austria    42919     5235    4603    1096
12                        Azerbaijan   1959.3     22.8   174.5    35.3
13                           Bahamas                               1.9
14                           Bahrain               1.2             8.3
15                        Bangladesh      946      5.1     7.7   224.3

I'd like to note that the data isn't clean yet just after scraping, since lapply(renew, function(x) grep(",", x)) yields something.

Clean it first with gsub to avoid these values being converted to NAs when you convert the data to numeric. Here a one step solution, correct NAs are created automatically:

renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))

After that you could run a sapply

# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))

or of course a shorter adaption of @Ronak Shah's second base R code line, which is a lot better:

renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))

Result

summary(renew)
#                      country        hydro                wind                bio              solar        
# Afghanistan        :  1   Min.   :      0.8   Min.   :     0.00   Min.   :    0.2   Min.   :    0.1  
# Albania            :  1   1st Qu.:    907.8   1st Qu.:    50.45   1st Qu.:  151.1   1st Qu.:    4.8  
# Algeria            :  1   Median :   2595.0   Median :   109.00   Median :  242.5   Median :   22.3  
# Angola             :  1   Mean   :  19989.3   Mean   :  4324.13   Mean   : 2136.3   Mean   : 1483.3  
# Anguilla           :  1   3rd Qu.:   7992.4   3rd Qu.:   293.55   3rd Qu.:  344.4   3rd Qu.:  124.5  
# Antigua and Barbuda:  1   Max.   :1193370.0   Max.   :242387.70   Max.   :69017.0   Max.   :67874.1  
# (Other)            :209                                                                              

Data

library(rvest)
renew <- setNames(html_table(
  read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
                   "_by_electricity_production_from_renewable_sources")),
  fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)

Replacing NA values with different values in Data Frames in R , This is just a tutorial showing how you can replace NA's in a data frame with other values Duration: 5:32 Posted: Feb 18, 2015 The most widely used technique for imputing values for a numerical variable is to replace the missing values with the mean or the median value. In the lines of code below, we replace missing values in 'Loan_amount' with the median value, while the missing values in 'Term_months' are replaced by the mean value.

As you have empty spaces in your dataframe the columns are turned as characters and there is no meaning in taking median of character columns. We can first replace the empty spaces to NA, convert the columns to numeric and then replace NAs with median of the column. Using dplyr we could do the following steps.

library(dplyr)
renew[renew == ""] <- NA

renew %>%
   mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
   mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))


#             Country   Hydro    Wind    Bio  Solar
#1        Afghanistan  1035.3     0.1  174.5   35.5
#2            Albania  7782.0    21.1  174.5    1.9
#3            Algeria    72.0    19.4  174.5  339.1
#4             Angola  7109.0    21.1  155.0   18.3
#5           Anguilla  4730.1    21.1  174.5    2.4
#6  AntiguaandBarbuda  4730.1    21.1  174.5    5.5
#7          Argentina 30134.8   554.1 1820.4   14.5
#8            Armenia  2351.2     1.8  174.5    1.2
#9              Aruba  4730.1   130.3    8.9    9.2
#10         Australia 15318.0 12199.0 3722.0 6209.0
#11           Austria 42919.0  5235.0 4603.0 1096.0
#12        Azerbaijan  1959.3    22.8  174.5   35.3
#13           Bahamas  4730.1    21.1  174.5    1.9
#14           Bahrain  4730.1     1.2  174.5    8.3
#15        Bangladesh   946.0     5.1    7.7  224.3

We could do the same using base R

renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x) 
      as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))

How to replace all occurrences of a character in a character column , If you used sub() to replace the string, then use gsub() function instead replace all the missing values in a vector with the mean of values in R? How to change the value of a variable using R programming in a data frame? It is probably simpler to obtain the medians using -summarize- sum Inc if Inc>=0 & Inc<200, d replace Inc=r (p50) if Range==1 & missing (Inc) sum Inc if Inc>=200 & Inc<600, d replace Inc=r (p50) if Range==2 & missing (Inc) Having said that you need to be careful about imputing a median like this.

We could do this in a compact way with na.aggregate from zoo

library(dplyr)
library(hablar)
library(zoo)
renew %>%
    retype %>% # change the type of columns
    # replace missing value of numeric columns with median
     mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
#   Country              Hydro    Wind    Bio  Solar
#   <chr>                <dbl>   <dbl>  <dbl>  <dbl>
# 1 Afghanistan          1035.     0.1  174.    35.5
# 2 Albania              7782     21.1  174.     1.9
# 3 Algeria                72     19.4  174.   339. 
# 4 Angola               7109     21.1  155     18.3
# 5 Anguilla             4730.    21.1  174.     2.4
# 6 Antigua and Barbuda  4730.    21.1  174.     5.5
# 7 Argentina           30135.   554.  1820.    14.5
# 8 Armenia              2351.     1.8  174.     1.2
# 9 Aruba                4730.   130.     8.9    9.2
#10 Australia           15318  12199   3722   6209  
#11 Austria             42919   5235   4603   1096  
#12 Azerbaijan           1959.    22.8  174.    35.3
#13 Bahamas              4730.    21.1  174.     1.9
#14 Bahrain              4730.     1.2  174.     8.3
#15 Bangladesh            946      5.1    7.7  224. 
data
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria", 
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia", 
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", 
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "", 
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "", 
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1", 
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("", 
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5", 
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5, 
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"), class = "data.frame")

Replacing values with NA, To do so we use the replace argument, and specify a named list, which contains the names of the variable and the value it would take to replace� Now let us replace the missing values in the Item_Weight variable. There are many ways to deal with missing values in a continuous variable which includes mean replacement, median replacement, replacing with an arbitrary constant, regression methods etc. I will be using mean replacement and regression in this example.

[PDF] An introduction to data cleaning with R, the reader to build data cleaning scripts for data suffering from a wide like imputation of missing values will influence statistical results and so mean all the methods and tricks in R that allow you to select and The most basic variable in R is a vector. Hint: use gsub to replace comma's with a period. I have a data frame that I extracted from an html file of a Wikipedia page table. I want to replace the missing values with the median of each variable. From the hints given I know that I need to convert the factor type to numeric values, and I likely need to use as.numeric(gsub()).

Session 2: Data Wrangling and Visualization in R, It looks like each column is a variable (like “reviews_per_month” or “price”) and We can prevent R from converting string-like or non-number-y values into (In fact, you can install the entire suite with the library call tidyverse .) Here we used the argument na.rm=TRUE in mean , which just removes any NA values from the� Data cleaning is one of the most important aspects of data science.. As a data scientist, you can expect to spend up to 80% of your time cleaning data.. In a previous post I walked through a number of data cleaning tasks using Python and the Pandas library.

Data Wrangling in R: Regular Expressions, Regular Expressions are the language we # use to describe the pattern. Replace Repeated Whitespace with a Single Space gsub(" +"," ", someText) [1] NA "2" "19" "4" "10" NA NA "1" NA NA NA "22" "3" "21" ## [15] "22" [0-9]{1,3}", replacement = "", allStocks$Stock) # and let's make our new variable a factor:� Imputation with mean / median / mode Replacing the missing values with the mean / median / mode is a crude way of treating missing values. Depending on the context, like if the variation is low or if the variable has low leverage over the response, such a rough approximation is acceptable and could possibly give satisfactory results.

Comments
  • I'd be happy to help with this if you could provide some example data.
  • added head of data frame, hope that helps
  • It might be worth going back a few steps and trying to set up your data nicely in the first place. Country should be a character/factor, but all the other columns you show should be numeric, so you should see if you can a) read them in as numeric, or b) convert them to numeric right after reading them in. gsub() is generally for changing text data, so you will probably have a bad time applying it to numeric data.
  • This seems to work well, but when I run the base R code all of the data frame values change. I get [1] 6.0 143.0 141.0 140.0 81.5 81.5 81.0 rather than [1] 1035.3 7782 72 7109
  • @LaurelDaly seems to work for me. Did you apply the base R code on changed renew dataframe?