Collapsing rows where some are all NA, others are disjoint with some NAs

combine rows with na in r
dplyr collapse rows
r collapse duplicate rows
r combine rows by group
r combine multiple rows into one
concatenate rows in r
combine rows in r
dplyr first non na

I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reformat it as such:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

STEP TWO: collapse the remaining two rows into one.

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.

Thanks for any advice.

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 

Edit: To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

r, Hi, I want collapse rows in columns Month & Value together while Coalesce rows and remove NAs but keep all non-NA rows for each That works but can you make the function more generic such as accepting any column names (not just provide the group columns (and apply na.omit to all the other  On the surface this looked simple, but I've run across some problems. complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA. Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

Here's a data table approach that uses na.omit() across the columns, grouped by ID.

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40

Coalesce rows and remove NAs but keep all non-NA rows for each , NA is a special value whose properties are different from other values. NA is one of NAs can arise when you read in a Excel spreadsheet with empty cells, for example. You will Here are some examples of operations that produce NA's. > var (8) To find all the rows in a data frame with at least one NA, try this: > unique  That is, you’d like to collapse rows so that if there’s a NA in a column it is replaced by the value found in some other line. We assume there’s only one value to be found, so no need to worry to decide which value to take. A simple, tidy way is to summarise each column using sum, where NAs are ignored.

Here's a couple of aggregate attempts:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit )
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Missing Values in R, a few issues with R about string processing, some of us argue that R can be very well However, even if you don't plan to do text analysis, text mining, or natural language pro- The labels on the rows such as Alabama or Alaska are displayed strings. indicate if we want all the terms to be collapsed into a single string. It is not clear what is the ultimate goal and there are several paths: provide the group columns (and apply na.omit to all the other columns); provide the "coalesce" columns (but too many to type)

the simple way is:

as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)[!is.na(x)]))

but if not all columns have the same number of non-NA values then you'll need to trim them like so:

temp  <-  lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)])
len  <-  min(sapply(temp,length))
as.data.frame(lapply(temp,`[`,seq(len)))

[PDF] Handling and Processing Strings in R, NA - missing or undefined data 5 + NA # When used in an expression, the result is dfr1[1,] # First row, all columns dfr1[,1] # First column, all rows dfr1$Age # Age R comes with some predefined palette function that can generate those for us. Combine graphs (disjoint union, assuming separate vertex sets): %du% Remove rows with all or some NAs (missing values) in data.frame. Asked 9 years ago. Active 8 months ago. Viewed 1.5m times. I'd like to remove the lines in this data frame that: a) contain NAs across all columns. Below is my example data frame. gene hsap mmul mmus rnor cfam 1 ENSG00000208234 0 NA NA NA NA 2 ENSG00000199674 0 2 2 2 2 3

Network Analysis and Visualization with R and igraph, as_points = any(st_dimension(by) == 2, na.rm = TRUE), In case of multiple objects, all objects should have the same c(new) # collapses two arrays into one with an additional dimension other parameters passed on to contour downsampling rate: e.g. 3 keeps rows and cols 1, 4, 7, 10 etc.; a value of  Hi, I would like to find a short/efficient manner to replace the string "#N/A" with "." in all variables (there are a lot of

[PDF] Package 'stars', methods and classes are used internally by other packages on CRAN that the author has published. If you consider implement some of the above, make sure it is not already A character string used for collapsing the captured rows. except that a (character) NA is return if any argument is NA. This function exists on all  Abstractly speaking, the function allows one to collapse the rows of a numeric matrix, e.g. by forming an average or selecting one representative row for each group of rows specified by a grouping variable (referred to as rowGroup). The word "collapse" reflects the fact that the method yields a new matrix whose rows correspond to other rows of the original input data.

[PDF] Package 'R.utils' - CRAN, if(!all(floor(x) == x, na.rm = TRUE) & max(x, na.rm = TRUE) < 100) diagnostic = '​logged' Coerce a SingleCellExperiment to some class defined in MAST warning('Dimnames mismatch on assays', paste(which(!dn_equal), collapse = ', ')​, '. if(ncol(assays[[1]]) != nrow(cData)) stop('`cData` must contain as many rows as  13 Find fastest way to get all intervals between identical elements in a vector 12 How to speed up or vectorize a for loop? 12 Collapsing rows where some are all NA, others are disjoint with some NAs

Comments
  • Is the data always 100% numeric?
  • related: blend of na.omit and na.pass using aggregate in r
  • Is there always exactly one non-NA value per column per group or could there be more?
  • More! For better or worse
  • I think summarise_each(funs(na.omit(.)) ) also works
  • Yes, but I think that would yield a little dissonant result if any of the columns had more than 2 non-NA observations.
  • This solution doesn't work if any column has all NAs for a certain ID. For example replace 25 by NA for ID = 2 in Col1. Whereas funs(na.omit(.)) will throw error Error: Column Col1 must be length 1 (a summary value), not 0
  • na.omit is clever. Better than sum and will work on non-numeric data. I'm discreetly going to edit...