Subsetting R data frame results in mysterious NA rows

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.

When subsetting my data, I occasionally get mysterious NA rows that aren't in my original data frame. Even the row names are NA. For example:

example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example

  var1 var2
1    A    X
2    B    Y
3    A    Z

then I run:

example[example$var1=="A",]

  var1 var2
1    A    X
3    A    Z
NA <NA> <NA>

Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.

Maybe it has to do with the fact that I'm importing my original data set using read.xlsx and then doing a wide-to-long reshape before subsetting.

Thanks

Wrap the condition in which:

df[which(df$number1 < df$number2), ]

How it works:

which() returns the row numbers where the condition is TRUE, skipping positions where it is FALSE or NA, and the data frame is then subset on those rows.

Say that:

which(df$number1 < df$number2)

returns row numbers 1, 2, 3, 4 and 5.

As such, writing:

df[which(df$number1 < df$number2), ]

is the same as writing:

df[c(1, 2, 3, 4, 5), ]

Or an even simpler version is:

df[1:5, ]
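
For anyone who wants to see the intermediate steps, here is a minimal sketch. It assumes the same sample df that is built in the next answer (10 rows, with NA in the last two values of number2); the names cond, keep and result are only illustrative.

```r
# Sample data, assumed to match the df used in the answer below
df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))

# The raw comparison is a logical vector that contains NA for rows 9 and 10
cond <- df$number1 < df$number2
print(cond)

# which() keeps only the positions that are TRUE, so FALSE and NA both drop out
keep <- which(cond)
print(keep)

# Indexing by those row numbers cannot produce all-NA rows
result <- df[keep, ]
print(result)
```

Because keep is an ordinary integer vector with no NA in it, the row lookup has nothing to turn into an NA row.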

I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).

First of all, some sample data:

> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
   name number1 number2
1     A       1      10
2     B       2       9
3     C       3       8
4     D       4       7
5     E       5       6
6     F       6       5
7     G       7       4
8     H       8       3
9     I       9      NA
10    J      10      NA

Now for a simple filter:

> df[df$number1 < df$number2, ]
     name number1 number2
1       A       1      10
2       B       2       9
3       C       3       8
4       D       4       7
5       E       5       6
NA   <NA>      NA      NA
NA.1 <NA>      NA      NA

The problem here is that the NAs in number2 make the comparison return NA for rows 9 and 10, and R returns an all-NA row (row name included) for every NA in a logical row index. Here's my fix, which requires knowledge of which column contains the NAs:

> df[df$number1 < df$number2 & !is.na(df$number2), ]
  name number1 number2
1    A       1      10
2    B       2       9
3    C       3       8
4    D       4       7
5    E       5       6
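
If you don't know in advance which column holds the NAs, one possible variation is to build the condition first and turn its NA entries into FALSE before indexing. This is just a sketch on the same sample data; cond and cleaned are made-up names.

```r
# Same sample data as above
df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))

# Build the condition once and inspect it
cond <- df$number1 < df$number2
sum(is.na(cond))      # 2 NA entries, one per row where number2 is NA
nrow(df[cond, ])      # 7 rows: 5 real matches plus 2 all-NA rows

# Treat "unknown" as "does not match" by replacing NA with FALSE
cond[is.na(cond)] <- FALSE
cleaned <- df[cond, ]
nrow(cleaned)         # 5
```

This keeps the filter itself a single expression and moves the NA handling to one line that works no matter which column the NA came from.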

I get the same problem when using code similar to what you posted. If I use the function subset() instead:

subset(example,example$var1=="A")

the NA row gets excluded.
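
A small sketch to check this, assuming a version of example with an NA in var1 (the original question's data has none). Inside subset() the column can be referred to by its bare name; kept is just an illustrative variable.

```r
# Hypothetical variant of the question's data with an NA in the filter column
example <- data.frame(var1 = c("A", NA, "A"), var2 = c("X", "Y", "Z"))

# subset() evaluates the condition inside the data frame and
# drops rows where the condition is NA as well as rows where it is FALSE
kept <- subset(example, var1 == "A")
print(kept)
```

Only rows 1 and 3 come back, with their original row names.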

Using dplyr:

library(dplyr)
filter(df, number1 < number2)

> example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
> example
  var1 var2
1    A    X
2 <NA>    Y
3    A    Z
> example[example$var1=="A",]
   var1 var2
1     A    X
NA <NA> <NA>
3     A    Z

This is probably the result you are seeing. Try wrapping the condition in which() to avoid the NA rows:

> example[which(example$var1=="A"),]
  var1 var2
1    A    X
3    A    Z

Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels.

Comments
  • While it's impossible to be sure without seeing your data, the problem is almost certainly that some of your indices are greater than the number of rows in the data. For example, try example[c(1, 2, 4),] or example[c(TRUE, TRUE, FALSE, TRUE),] using your data frame above. Check the length (if it's boolean) and the maximum (if it's numeric) of the vector you are using to subset the rows.
  • ...and/or some of your indices are NA themselves.
  • As David said, we need to know more... but looking at str(yourdata) and summary(yourdata) will help you out a lot. I have a feeling you have at least one NA in your var1 column. Test it: example <- data.frame("var1"=c("A", "B", "A", NA), "var2"=c("Q", "X", "Y", "Z")); example[example$var1=='A',]
  • If your code is analogous to this example (of the form d[d$v == x, ]), your problem is indeed almost certainly NAs in your column.
  • Answered! I have NAs in the index column. I can't believe I've never come across this before. It's funny to me that R "censors" the data in other columns with NAs (even the row name!) when you hit an NA in your index column. I'm new to posting on StackOverflow so it will take me a minute to figure out how to designate this question answered.
  • This is how I've always dealt with this issue, but is there a way to combine the !is.na and < into one command?
  • @Nova, I don't think so, since they are two distinct logical tests. I'd love to be proven wrong, though.
  • Answered above, the which() function may fit that role but it's less than satisfactory. I strongly believe this to be a bug imho and it's unfortunate that this "feature" (NA selection craziness) won't be fixed.
  • This is helpful, but please beware of the potential problems of using subset anywhere other than in an interactive R session. From the function's help page: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."
  • Indeed that library doesn't suffer from that NA affliction.
  • Dear downvoters, please explain the reason for downvoting, thanks!
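
The two causes named in the first two comments can be reproduced on the question's three-row example. This is only a sketch; out_of_range and na_index are made-up names.

```r
# The question's original data frame (no NA anywhere in it)
example <- data.frame(var1 = c("A", "B", "A"), var2 = c("X", "Y", "Z"))

# Cause 1: a numeric index beyond the last row yields an all-NA row
out_of_range <- example[c(1, 2, 4), ]
print(out_of_range)

# Cause 2: an NA inside the index itself also yields an all-NA row
na_index <- example[c(1, NA, 3), ]
print(na_index)
```

In both cases the data frame has no missing values at all, so the NA rows come purely from the index vector.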