Pandas: Find rows where a particular column is not NA but all other columns are

I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.

I can easily get a DataFrame where the column of interest is not NA:

df[df.interesting_column.notna()]

However, I can't figure out how to then say "from that DataFrame, return only rows where every column other than 'interesting_column' is NA". I can't use .dropna(), as all rows and columns will contain at least one NA value.

I realise this is probably embarrassingly simple. I have tried lots of .loc variations and join/merges in various configurations, and I am not getting anywhere.

Any pointers before I just do a for loop over this thing would be appreciated.

The & operator "ands" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to get a column of True/False values. You can repeat this for every column, using notna() or isna() as appropriate, and combine the results with &.

For example, if you have columns a, b, and c, and you want to find rows where the value in column a is not NaN while the values in the other columns are NaN, do the following:

df[df.a.notna() & df.b.isna() & df.c.isna()]
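
As a minimal runnable sketch of the same idea (the toy values here are made up for illustration):

import numpy as np
import pandas as pd

# Only row 1 has a value in 'a' together with NaN in both 'b' and 'c'
df = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                   'b': [np.nan, np.nan, 3.0],
                   'c': [4.0, np.nan, np.nan]})

print(df[df.a.notna() & df.b.isna() & df.c.isna()])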

This is clear and simple when you have a small number of columns that you know ahead of time. But if you have many columns, or don't know the column names in advance, you would want a solution that loops over all columns, checking notna() for interesting_column and isna() for every other column. The solution by @AmiTavory is a clever way to achieve this, but if you didn't know about that solution, here is a simpler approach.

for colName in df.columns:
    if colName == "interesting_column":
        # keep rows where the interesting column has a value
        df = df[df[colName].notna()]
    else:
        # keep rows where this other column is NA
        df = df[df[colName].isna()]
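
If you would rather not re-filter df on every iteration, an equivalent sketch builds one boolean mask and applies it once; the x and y column names here are hypothetical stand-ins for the unknown columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'interesting_column': [1.0, np.nan, 2.0],
                   'x': [np.nan, 5.0, np.nan],
                   'y': [np.nan, np.nan, 3.0]})

# Start with an all-True mask and "and" in one condition per column
mask = pd.Series(True, index=df.index)
for colName in df.columns:
    if colName == "interesting_column":
        mask &= df[colName].notna()
    else:
        mask &= df[colName].isna()

print(df[mask])  # keeps only row 0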

You can use:

rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()

Example (suppose c is the interesting column):

In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})

In [100]: df
Out[100]: 
     a    b    c
0  1.0  1.0  4.0
1  NaN  NaN  5.0
2  2.0  3.0  NaN

In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()

In [102]: rows
Out[102]: 
0    False
1     True
2    False
dtype: bool

In [103]: df[rows]
Out[103]: 
    a   b    c
1 NaN NaN  5.0

Comments
  • Given the complexity of the filter logic, I would probably generate a row mask, like @liliscent did in their post, to improve readability. In this case it would look something like interesting_data_provided = df.interesting_column.notna(), then other_data_not_provided = df.isnull().sum(axis=1) == len(df.columns) - 1, and finally df[interesting_data_provided & other_data_not_provided] (a runnable version of this appears after these comments).
  • Yes, I agree that this is only reasonable for a small number of columns. If you have a large number of columns, then your solution would be cleaner.
  • However, I like your detailed explanation of the parts.
  • Thanks Tim, I did think about doing it this way but unfortunately I don't know beforehand what all the column names will be.
  • This is just missing the final step where you select the rows from df, by doing df[rows].
  • @TimJohns I interpret rows as that mask, but I added the final df[rows] step in case others interpret it differently.
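
For completeness, here is a runnable version of the mask approach described in the first comment; the x and y column names are hypothetical, and the toy values are made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({'interesting_column': [1.0, np.nan, 2.0],
                   'x': [np.nan, 5.0, np.nan],
                   'y': [np.nan, np.nan, 3.0]})

# The row has a value in the interesting column...
interesting_data_provided = df.interesting_column.notna()
# ...and all of the remaining len(df.columns) - 1 cells are null
other_data_not_provided = df.isnull().sum(axis=1) == len(df.columns) - 1

print(df[interesting_data_provided & other_data_not_provided])  # row 0 only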