Keep certain columns in a pandas DataFrame, deleting everything else

Say I have a data table

    1  2  3  4  5  6 ..  n
A   x  x  x  x  x  x ..  x
B   x  x  x  x  x  x ..  x
C   x  x  x  x  x  x ..  x

And I want to slim it down so that I only have, say, columns 3 and 5, deleting all the others while maintaining the structure. How could I do this with pandas? I think I understand how to delete a single column, but I don't know how to keep a select few and delete all the others.

If you have a list of columns you can just select those:

In [11]: df
Out[11]:
   1  2  3  4  5  6
A  x  x  x  x  x  x
B  x  x  x  x  x  x
C  x  x  x  x  x  x

In [12]: col_list = [3, 5]

In [13]: df = df[col_list]

In [14]: df
Out[14]:
   3  5
A  x  x
B  x  x
C  x  x

You could reassign a new value to your DataFrame, df:

df = df.loc[:,[3, 5]]

As long as there are no other references to the original DataFrame, the old DataFrame will get garbage collected.

Note that df.loc selects by label. The 3 and 5 above are therefore column labels, not ordinal positions. If you wish to specify the columns by ordinal position, use df.iloc.
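
To make the label versus position distinction concrete, here is a minimal sketch. The frame and its integer column labels are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical frame whose column labels are the integers 1..6.
df = pd.DataFrame([list("abcdef")] * 3, index=list("ABC"), columns=range(1, 7))

# Label-based: keeps the columns *named* 3 and 5.
by_label = df.loc[:, [3, 5]]

# Position-based: keeps the columns at zero-based positions 3 and 5,
# which are the columns labelled 4 and 6 in this frame.
by_position = df.iloc[:, [3, 5]]
```

With labels that start at 1, the same pair of integers picks different columns depending on the accessor, so it is worth checking which one you mean.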

How do I keep certain columns in a pandas DataFrame, deleting everything else?

The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.

DataFrame.loc

One simple option is selection, as mentioned in other answers:

# Setup.
df
   1  2  3  4  5  6
A  x  x  x  x  x  x
B  x  x  x  x  x  x
C  x  x  x  x  x  x

cols_to_keep = [3,5]

df[cols_to_keep]

   3  5
A  x  x
B  x  x
C  x  x

Or,

df.loc[:, cols_to_keep]

   3  5
A  x  x
B  x  x
C  x  x

DataFrame.reindex with axis=1 or 'columns' (0.21+)

We also have reindex. In recent versions, you specify axis=1 to keep only the listed columns:

df.reindex(cols_to_keep, axis=1)
# df.reindex(cols_to_keep, axis='columns')

# for versions < 0.21, use
# df.reindex(columns=cols_to_keep)

   3  5
A  x  x
B  x  x
C  x  x

On older versions, you can also use reindex_axis: df.reindex_axis(cols_to_keep, axis=1). Note that reindex_axis was deprecated in 0.21 and is no longer available in pandas 1.0 and later.
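
One behavioural difference between reindex and plain selection is worth seeing before choosing between them. The sketch below uses a made-up frame and a deliberately missing label "z", and assumes a reasonably recent pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
wanted = ["a", "z"]  # "z" is deliberately absent from df

# reindex keeps "a" and creates a new column "z" filled with NaN.
reindexed = df.reindex(wanted, axis=1)

# Plain selection refuses to invent a column and raises KeyError instead.
try:
    df[wanted]
    raised = False
except KeyError:
    raised = True
```

If a silently added all-NaN column would hide a typo in your column list, selection with df[...] or df.loc is probably the safer choice.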


DataFrame.drop

Another alternative is to use drop, computing the columns to remove with pd.Index.difference:

# df.drop(cols_to_drop, axis=1)
df.drop(df.columns.difference(cols_to_keep), axis=1)

   3  5
A  x  x
B  x  x
C  x  x
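
A small sketch of how the drop approach treats column order, again on a made-up frame. It uses the columns= keyword of drop, which is equivalent to passing axis=1.

```python
import pandas as pd

df = pd.DataFrame([list("abcdef")] * 3, index=list("ABC"), columns=range(1, 7))
cols_to_keep = [5, 3]  # intentionally not in the frame's column order

# drop removes everything else, so the survivors keep the frame's order.
dropped = df.drop(columns=df.columns.difference(cols_to_keep))

# Selection returns the columns in the order they were requested.
selected = df[cols_to_keep]
```

Here dropped has columns 3 then 5, while selected has 5 then 3, so the two approaches are interchangeable only when the order of cols_to_keep already matches the frame.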

Performance

The methods are roughly the same in terms of performance; reindex is faster for smaller N, while drop is faster for larger N. The timings should be read as relative, since the Y-axis of the plot is logarithmic.

Setup and Code

import numpy as np
import pandas as pd
import perfplot

def make_sample(n):
    # Build an n x n frame of 'x' and pick a random quarter of its columns to keep.
    np.random.seed(0)
    df = pd.DataFrame(np.full((n, n), 'x'))
    cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False)

    return df, cols_to_keep

perfplot.show(
    setup=lambda n: make_sample(n),
    kernels=[
        lambda inp: inp[0][inp[1]],
        lambda inp: inp[0].loc[:, inp[1]],
        lambda inp: inp[0].reindex(inp[1], axis=1),
        lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1)
    ],
    labels=['__getitem__', 'loc', 'reindex', 'drop'],
    n_range=[2**k for k in range(2, 13)],
    xlabel='N',   
    logy=True,
    equality_check=lambda x, y: (x.reindex_like(y) == y).values.all()
)

For those who are searching for a method to do this in place:

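from typing import Any, Set

from pandas import DataFrame

def remove_others(df: DataFrame, columns: Set[Any]) -> None:
    """Drop, in place, every column of df that is not in columns."""
    cols_total: Set[Any] = set(df.columns)
    # Columns present in df but not requested; empty if nothing needs dropping.
    diff: Set[Any] = cols_total - columns
    df.drop(columns=list(diff), inplace=True)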

This computes the set difference between all the columns in the DataFrame and the columns which should be kept. Whatever is left over can safely be removed. drop works even on an empty set, and names that do not exist in the DataFrame are simply ignored.

>>> df = DataFrame({"a":[1,2,3],"b":[2,3,4],"c":[3,4,5]})
>>> df
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5

>>> remove_others(df, {"a","b","c"})
>>> df
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5

>>> remove_others(df, {"a"})
>>> df
   a
0  1
1  2
2  3

>>> remove_others(df, {"a","not","existent"})
>>> df
   a
0  1
1  2
2  3
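
If mutating the frame in place is not required, a similar effect can be sketched with a boolean column mask. The helper name keep_existing is hypothetical, not part of pandas; it returns a new frame and leaves the original untouched.

```python
import pandas as pd

def keep_existing(df, columns):
    """Hypothetical helper: return a new frame holding only the listed columns that exist."""
    # Index.isin yields one boolean per column; unknown names match nothing.
    return df.loc[:, df.columns.isin(columns)]

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5]})
result = keep_existing(df, {"a", "not", "existent"})
```

Like remove_others, this tolerates names that are not in the frame, and the surviving columns stay in their original order.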

Comments
  • More (faster) options along with timings are available in this answer.
  • @andyhayden how about deleting all columns except the nth column (without using the column naming),