R tidyverse, Reshaping data to one row per subject, but several columns are affected
dplyr : : cheat sheet
spread function in r
gather in r
separate function in r
tidyverse cheat sheet
mutate in r
I have a sample data frame like this:
sample2<-structure(list(`Full Name` = c("Smith, Jane", NA, NA, NA, NA, NA, "Doe, John", NA, NA, NA), `Age (Y)` = c("24", NA, NA, NA, NA, NA, "22", NA, NA, NA), Gender = c("F", NA, NA, NA, NA, NA, "M", NA, NA, NA), `Procedure Performed (ICD9 Code)` = c("34.04 INSERTION OF INTERCOSTAL CATHETER FOR DRAINAGE", "86.59 CLOSURE OF SKIN AND SUBCUTANEOUS TISSUE OTHER SITES", "87.03 COMPUTERIZED AXIAL TOMOGRAPHY OF HEAD", "88.01 COMPUTERIZED AXIAL TOMOGRAPHY OF ABDOMEN", "87.41 COMPUTERIZED AXIAL TOMOGRAPHY OF THORAX", NA, "96.04 INSERTION OF ENDOTRACHEAL TUBE", "57.94 INSERTION OF INDWELLING URINARY CATHETER", "99.29 INJECTION OR INFUSION OF OTHER THERAPEUTIC OR PROPHYLACTIC SUBSTANCE", "38.02 INCISION OF OTHER VESSELS OF HEAD AND NECK"), `Interventions RH` = c("xray", "CT Head", NA, NA, NA, NA, "CT Chest - Referring Hospital", "Chest Tube Placement", "Ct Head", NA)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame" ))
As you can see, there are lots of mostly empty rows. All of the mostly empty rows below a subjects name are associated with that subject. I'd like to reshape my data set to be one row per subject, and I need some help.
I've looked at other answers on here for this, and I've asked my friends about it, and I keep seeing the "gather and then spread it" answer, but this particular situation is challenging me for a few specific reasons:
- The rows are empty in the "full name" column that I'd usually key by. I.e. I can't tell R to gather up all the Jane Smith rows, because they don't have her name in them.
- There are several columns I want to spread out. I would like to create multiple columns out of the "procedure performed" column, i.e. procedure1, procedure2, etc.. and I'd like to create multiple columns out of the "Interventions Rih" column.
- I'm probably going to end up text searching the data for particular words or phrases during the analysis, so if the coding is easier to just squish every procedure for a particular patient into the "procedure performed" column for that patient (in one row), I'm ok with that too.
So for expected output, I am ok with either:
Thank you for your help!
This can be pretty readily tackled with the newest updates to
I'm also going to rename your columns so they are a bit easier to work with:
library(tidyr) library(dplyr) sample_pivot_df <- sample2 %>% # First renaming cols, but you can choose not to do this rename(full_name = `Full Name`, age = `Age (Y)`, gender = Gender, procedure = `Procedure Performed (ICD9 Code)`, intervention = `Interventions RH`) %>% # Fill for repeated entries fill(full_name, age, gender) %>% # Add a dummy column for number of procedures group_by(full_name) %>% mutate(procedure_count = row_number()) %>% ungroup() %>% # pivot the data wider for `procedure` and `intervention` pivot_wider(names_from = procedure_count, values_from = c(procedure, intervention))
Which gives us the following:
> sample_pivot_df # A tibble: 2 x 15 full_name age gender procedure_1 procedure_2 procedure_3 procedure_4 procedure_5 procedure_6 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Smith, J… 24 F 34.04 INSE… 86.59 CLOS… 87.03 COMP… 88.01 COMP… 87.41 COMP… NA 2 Doe, John 22 M 96.04 INSE… 57.94 INSE… 99.29 INJE… 38.02 INCI… NA NA # … with 6 more variables: intervention_1 <chr>, intervention_2 <chr>, intervention_3 <chr>, # intervention_4 <chr>, intervention_5 <chr>, intervention_6 <chr>
The caveat here is that using
tidyr::fill may give you unintended results if any patients have an
Data wrangling in R, Tidy data has a simple convention: put variables in the columns and pick observations by their values ( filter() ), But Jenny Bryan has also released this as an R package, so you could format each row is often a site/subject/patient and you have multiple These data formats mainly affect readability. Spread the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data.
You can also do this using
data.table for the aggregation and
zoo to fill
NA values. I've changed your column names to make the code more readable.
library(data.table) library(zoo) setDT(sample2) names(sample2) <- c("Name", "Age", "Gender", "Procedure", "Interventions") sample2[, Name := na.locf(Name)] newSample = sample2[,.( Age = first(Age), Gender = first(Gender), aggProcedure = paste(Procedure[!is.na(Procedure)],collapse=","), aggInterventions = paste(Interventions[!is.na(Interventions)],collapse=",")), by= Name]
Tidying/reshaping tables using tidyr, Combining elements from many columns into a single column to think of a table as consisting of three components: rows, columns and data values. Another way of describing a dataset is by defining its variable(s), values and observations. but is the format of choice for many data analysis and visualization operations. <tidy-select> Columns to unite. sep: Separator to use between values. remove: If TRUE, remove input columns from output data frame. na.rm: If TRUE, missing values will be remove prior to uniting each value.
You can do that in a much easier format first you use fill() function which fills null values with previous data, then you group your data and summarize it with pasting/concatenating it.
library(dplyr) library(tidyr) sample2<- structure(list( `Full Name` = c("Smith, Jane", NA, NA, NA, NA, NA, "Doe, John", NA, NA, NA), `Age (Y)` = c("24", NA, NA, NA, NA, NA, "22", NA, NA, NA), Gender = c("F", NA, NA, NA, NA, NA, "M", NA, NA, NA), `Procedure Performed (ICD9 Code)` = c("34.04 INSERTION OF INTERCOSTAL CATHETER FOR DRAINAGE", "86.59 CLOSURE OF SKIN AND SUBCUTANEOUS TISSUE OTHER SITES", "87.03 COMPUTERIZED AXIAL TOMOGRAPHY OF HEAD", "88.01 COMPUTERIZED AXIAL TOMOGRAPHY OF ABDOMEN", "87.41 COMPUTERIZED AXIAL TOMOGRAPHY OF THORAX", NA, "96.04 INSERTION OF ENDOTRACHEAL TUBE", "57.94 INSERTION OF INDWELLING URINARY CATHETER", "99.29 INJECTION OR INFUSION OF OTHER THERAPEUTIC OR PROPHYLACTIC SUBSTANCE", "38.02 INCISION OF OTHER VESSELS OF HEAD AND NECK"), `Interventions RH` = c("xray", "CT Head", NA, NA, NA, NA, "CT Chest - Referring Hospital", "Chest Tube Placement", "Ct Head", NA)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame")) new_sample <- sample2 %>% fill("Full Name") %>% fill("Age (Y)") %>% fill("Gender") %>% mutate(`Procedure Performed (ICD9 Code)` = replace_na(`Procedure Performed (ICD9 Code)`, ""), `Interventions RH` = replace_na(`Interventions RH`, "")) %>% group_by(`Full Name`, `Age (Y)`, `Gender`) %>% summarise(`Procedure Performed (ICD9 Code)` = paste(`Procedure Performed (ICD9 Code)`, collapse =" "), `Interventions RH` = paste(`Interventions RH`, collapse = " "))
12 Tidy data, “Tidy datasets are all alike, but every messy dataset is messy in its own way. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will a dataset tidy: variables are in columns, observations are in rows, and values are in cells. One variable might be spread across multiple columns. The goal is generally to get the data into a “tidy” format: one row per case, one column per field, and one cell per value. (You can read more about “tidy” data here .) One of the most common examples of this issue encountered is data where things that should be in the rows are in the columns or vice versa.
[PDF] Tidy Data, each variable is a column, each observation is a row, and each type of Keywords: data cleaning, data tidying, relational databases, R. 1. “philosophy of data”: one that underlies my work in the plyr (Wickham 2011) and ggplot2 Like families, tidy datasets are all alike but every messy dataset is messy in its own way. This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.
A Scientist's Guide to R: Step 2.1 Data Transformation, Combine multiple columns into a single column using unite(). The examples below will make use of 2 datasets provided by R packages: #For now this gives us access to both the pipe operator and the glimpse function. library(tidyverse) while the OBrienKaiser data has only 16 rows but 17 columns. The length of sep should be one less than into. remove: If TRUE, remove input column from output data frame. convert: If TRUE, will run type.convert() with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical. NB: this will cause string "NA"s to be converted to NAs. extra
Pivoting • tidyr, There are two important new features inspired by other R packages that have been to solve a variety of data reshaping challenges ranging from simple to complex. We can break these variables up by specifying multiple column names in been working with data frames that have one observation per row, but many� 1 R tidyverse, Reshaping data to one row per subject, but several columns are affected Jan 2 1 Drawing driving routes in R Jan 16 1 Fill column value conditionally on two different columns Mar 23 '18
- Please show your expected output
- absolutely need to rename, the column names I'm dealing with give me all sorts of other problems. This solution is perfect, thank you.
- In that case I'd recommend you check out the
janitor::clean_names. It was a lifesaver for dealing with healthcare data :)
- I just noticed one slight problem. It absolutely splits the procedure and intervention columns into multiple columns as intended, and by filling the name column I know which procedure is associated with who, but it did not leave me with 1 row per subject. Rather I have as many rows as that person had procedures. Is there a way to "collapse" them down? i.e. if one row has a blank for procedure2 but the next row has something.... squish those? That make sense?
- Hmm, it sounds as if (1) one of your columns did not get renamed appropriately, or (2) you have additional columns associated with procedures/interventions. (1) Using the two columns in the
values_fromargument should collapse them down. Make sure these names are correct. (2) IF you have additional columns associated with procedures/interventions, you will need to drop those, or include those in the
values_fromcall as well
- Im 99% sure everything was renamed correctly. What do you mean by additional columns associated with procedures/interventions? Sorry I'm dense, could you give me an example? And why that would cause problems?