Create new variable in dataframe based on condition in one column, pulling from other column? (dplyr)

I have the following dataframe:

    df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", 
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana", 
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper", 
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds", 
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
    8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L, 
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
    country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia", 
    "Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country", 
"CommodRank", "topCommodInCountry", "Main_Commod"))

df

            country CommodRank topCommodInCountry Main_Commod
1             Ghana          1               TRUE        Gold
2           Eritrea          2              FALSE      Copper
3          Ethiopia          3              FALSE      Nickel
4          Ethiopia          1               TRUE        Gold
5  Congo - Kinshasa          3              FALSE        Gold
6          Ethiopia          1               TRUE        Gold
7          Ethiopia          1               TRUE        Gold
8             Ghana          1               TRUE        Gold
9          Botswana          1               TRUE    Diamonds
10          Nigeria          1               TRUE    Iron Ore  

I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.

df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))


            country CommodRank topCommodInCountry Main_Commod topcommod
1             Ghana          1               TRUE        Gold      Gold
2           Eritrea          2              FALSE      Copper   unknown
3          Ethiopia          3              FALSE      Nickel   unknown
4          Ethiopia          1               TRUE        Gold      Gold
5  Congo - Kinshasa          3              FALSE        Gold   unknown
6          Ethiopia          1               TRUE        Gold      Gold
7          Ethiopia          1               TRUE        Gold      Gold
8             Ghana          1               TRUE        Gold      Gold
9          Botswana          1               TRUE    Diamonds  Diamonds
10          Nigeria          1               TRUE    Iron Ore  Iron Ore

I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.

IIUC, there are multiple ways to do this, for example:

df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
                       else Main_Commod[which.max(topCommodInCountry)])

# A tibble: 10 x 5
# Groups:   country [6]
   country          CommodRank topCommodInCountry Main_Commod topCom  
   <chr>                 <int> <lgl>              <chr>       <chr>   
 1 Ghana                     1 TRUE               Gold        Gold    
 2 Eritrea                   2 FALSE              Copper      unknown 
 3 Ethiopia                  3 FALSE              Nickel      Gold    
 4 Ethiopia                  1 TRUE               Gold        Gold    
 5 Congo - Kinshasa          3 FALSE              Gold        unknown 
 6 Ethiopia                  1 TRUE               Gold        Gold    
 7 Ethiopia                  1 TRUE               Gold        Gold    
 8 Ghana                     1 TRUE               Gold        Gold    
 9 Botswana                  1 TRUE               Diamonds    Diamonds
10 Nigeria                   1 TRUE               Iron Ore    Iron Ore
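
Note that the result above relies on df already being grouped by country (its class includes grouped_df), so the if/else is evaluated once per country. As a minimal sketch, assuming a plain ungrouped data frame and a recent dplyr version, the grouping can be made explicit. The name df_plain is made up for illustration; the values are copied from the question.

```r
library(dplyr)

# Hypothetical ungrouped copy of the question's data
df_plain <- tibble(
  country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", "Congo - Kinshasa",
              "Ethiopia", "Ethiopia", "Ghana", "Botswana", "Nigeria"),
  CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 1L),
  topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE,
                         TRUE, TRUE, TRUE, TRUE, TRUE),
  Main_Commod = c("Gold", "Copper", "Nickel", "Gold", "Gold",
                  "Gold", "Gold", "Gold", "Diamonds", "Iron Ore")
)

res <- df_plain %>%
  group_by(country) %>%                       # evaluate the condition per country
  mutate(topCom = if (any(topCommodInCountry)) {
    Main_Commod[which.max(topCommodInCountry)] # first row flagged TRUE in the group
  } else {
    "unknown"                                  # no rank-1 commodity in this country
  }) %>%
  ungroup()                                    # drop grouping for later pipe steps
```

Without the group_by() step, any() and which.max() would look at the whole column, and every row would get the same value.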

Regarding the OP's comment asking how to handle ties between multiple top commodities, you could do the following:

df %>% 
  mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
              else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))

If there are multiple unique top commodities in a country, they will be pasted together into a single string, separated by /.
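
The question's data has no ties, so here is a small sketch of the tie case. The data frame ties and its values are invented purely for illustration, and the explicit group_by() assumes an ungrouped input.

```r
library(dplyr)

# Made-up data: "Mali" has two distinct rank-1 commodities, "Chad" has none
ties <- tibble(
  country = c("Mali", "Mali", "Mali", "Chad"),
  topCommodInCountry = c(TRUE, TRUE, FALSE, FALSE),
  Main_Commod = c("Gold", "Diamonds", "Copper", "Oil")
)

ties_res <- ties %>%
  group_by(country) %>%
  mutate(topCom = if (!any(topCommodInCountry)) {
    "unknown"
  } else {
    # keep each top commodity once, then join them with "/"
    paste(unique(Main_Commod[topCommodInCountry]), collapse = "/")
  }) %>%
  ungroup()
```

Here every Mali row should get "Gold/Diamonds" and the Chad row "unknown".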

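Another pattern with dplyr (it relies on df being grouped by country, and on the sort putting the best-ranked row first in each group):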

df %>% arrange(CommodRank) %>%
    mutate(topCommod = Main_Commod[1])
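
If sorting is a concern, a possible variant (a sketch, not part of the answer above) picks the best-ranked row in each group with which.min() instead of reordering the rows. The data frame ranked is a made-up subset for illustration.

```r
library(dplyr)

# Hypothetical small example, deliberately not sorted by rank
ranked <- tibble(
  country = c("Ethiopia", "Ethiopia", "Eritrea", "Ghana"),
  CommodRank = c(3L, 1L, 2L, 1L),
  Main_Commod = c("Nickel", "Gold", "Copper", "Gold")
)

ranked_res <- ranked %>%
  group_by(country) %>%
  # which.min() returns the position of the lowest rank within each group,
  # so the original row order is irrelevant and left untouched
  mutate(topCommod = Main_Commod[which.min(CommodRank)]) %>%
  ungroup()
```

Like the arrange() version, this labels a country by its best available rank, so Eritrea gets "Copper" here rather than "unknown".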

This isn't an answer, but I learned a lot from @docendo discimus's answer. It took me a second to parse the negated condition (!any(topCommodInCountry)), and I wondered whether it was just me or whether my computer would need a second longer for it too :)

Using the same dataset, I tried writing the if/else with a positive condition instead. First I checked that the two solutions give identical results:

identical(
  #Negative
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)]), 
  #Positive
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else "unknown"))

[1] TRUE

Next, I benchmarked the two:

require(rbenchmark)

benchmark("Negative" = {
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else  "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))

The difference is small (about 1.5% in this run), but I'm assuming it will grow with a bigger dataset.

      test replications elapsed relative user.self sys.self
1 Negative        10000   12.59    1.015     12.44        0
2 Positive        10000   12.41    1.000     12.30        0 

Comments
  • Thanks so much! Off the top of your head is there an obvious way to split and label ties here, such that topCom gets assigned to something like "Gold / Diamonds / ..."? (Say there are 2 or more Main_Commods with CommodRank == 1)
  • Never mind, you just use 'which' instead of 'which.max' to get all row indices, which you can then access and paste the unique names together: df %>% mutate(topCom = Main_Commod[which(topCommodInCountry == max(topCommodInCountry))] %>% unique %>% paste(sep = '', collapse = '/'))
  • Sorting the entire data frame (group) will be much slower than getting the max of a single column (group)
  • In addition to @Ryan's comment, if you don't arrange your dataset correctly, doing 'Main_Commod[1]' can be very dangerous/wrong
  • On my four year old laptop, running for(i in 1:1e6) !TRUE takes about 1/10th of a second. Not worth worrying about.
  • It may be worth removing unnecessary ! just for readability, but for what it's worth I think it's pretty intuitive if ! is read as "not" i.e. "If not any topCommodInCountry"