Selecting specific columns from df -h output in Python

I'm trying to create a simple script that will select specific columns from the output of the Unix df -h command. I can use awk to do this, but how can we do it in Python?

Here is df -h output:

Filesystem                    Size  Used  Avail  Use%  Mounted on
/dev/mapper/vg_base-lv_root   28G   4.8G    22G   19%  /
tmpfs                        814M   176K   814M    1%  /dev/shm
/dev/sda1                    485M   120M   340M   27%  /boot

I want something like:

Column 1:

Filesystem
/dev/mapper/vg_base-lv_root           
tmpfs                 
/dev/sda1

Column 2:

Size
28G
814M 
485M   

You can use os.popen to run the command and retrieve its output, then splitlines and split to separate the lines and fields. Run df -Ph rather than df -h so that a line is not wrapped when a column is too long.

import os

df_output_lines = [s.split() for s in os.popen("df -Ph").read().splitlines()]

The result is a list of lines. To extract the first column, you can use [line[0] for line in df_output_lines] (note that columns are numbered from 0) and so on. You may want to use df_output_lines[1:] instead of df_output_lines to strip the title line.
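
As a minimal sketch of the indexing described above, the following parses a hard-coded copy of the sample output from the question instead of calling df, so it runs on any machine. SAMPLE_OUTPUT and the column helper are hypothetical names introduced only for illustration.

```python
# Hard-coded copy of the sample output from the question (an assumption
# made so the example does not depend on the local machine).
SAMPLE_OUTPUT = """\
Filesystem                    Size  Used  Avail  Use%  Mounted on
/dev/mapper/vg_base-lv_root   28G   4.8G    22G   19%  /
tmpfs                        814M   176K   814M    1%  /dev/shm
/dev/sda1                    485M   120M   340M   27%  /boot
"""

# One list of fields per line, built the same way as df_output_lines above.
df_output_lines = [s.split() for s in SAMPLE_OUTPUT.splitlines()]


def column(rows, index, skip_header=False):
    """Return field `index` (counted from 0) of every row."""
    if skip_header:
        rows = rows[1:]
    return [row[index] for row in rows]


print(column(df_output_lines, 0))                    # Filesystem column, with title
print(column(df_output_lines, 1, skip_header=True))  # Size column, without title
```

Passing skip_header=True corresponds to using df_output_lines[1:] to strip the title line.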

If you already have the output of df -h stored in a file somewhere, you'll first need to rejoin any lines that df wrapped.

import re

# raw_df_output is an open file object holding the saved df -h output
fixed_df_output = re.sub(r'\n\s+', ' ', raw_df_output.read())
df_output_lines = [s.split() for s in fixed_df_output.splitlines()]

Note that this assumes that neither the filesystem name nor the mount point contains whitespace. If they do (which is possible with some setups on some Unix variants), it's practically impossible to parse the output of df, even df -P. You can use os.statvfs to obtain information on a given filesystem (this is the Python interface to the C function that df calls internally for each filesystem), but there's no portable way of enumerating the filesystems.
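
A minimal sketch of the os.statvfs approach for a single, known path follows. It assumes a Unix-like system (os.statvfs is not available on Windows) and that "/" is readable. The usage helper is a hypothetical name, and the figures are in bytes rather than the human-readable units that df -h prints.

```python
import os


def usage(path):
    """Return (total, used, available) in bytes for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize      # size of the whole filesystem
    free = st.f_bfree * st.f_frsize        # free space, including reserved blocks
    available = st.f_bavail * st.f_frsize  # free space for unprivileged users
    return total, total - free, available


total, used, available = usage("/")
print("/ size=%d used=%d avail=%d" % (total, used, available))
```

This sidesteps parsing entirely, but only for mount points you already know about.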

Here is the complete example:

import subprocess
import re

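# universal_newlines=True makes communicate() return text rather than bytes
p = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True,
                     universal_newlines=True)
dfdata, _ = p.communicate()

# Join the two-word header so that it splits into a single field
dfdata = dfdata.replace("Mounted on", "Mounted_on")

columns = [list() for i in range(10)]
# splitlines() avoids the empty trailing entry that split("\n") produces
for line in dfdata.splitlines():
    line = re.sub(" +", " ", line)
    for i, l in enumerate(line.split(" ")):
        columns[i].append(l)

print(columns[0])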

It assumes that mount points do not contain spaces.

Here is a more complete (and more complicated) solution that does not hard-code the number of columns:

import subprocess
import re

def yield_lines(data):
    for line in data.split("\n"):
        yield line

def line_to_list(line):
    return re.sub(" +", " ", line).split()

# universal_newlines=True makes communicate() return text rather than bytes
p = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True,
                     universal_newlines=True)
dfdata, _ = p.communicate()

dfdata = dfdata.replace("Mounted on", "Mounted_on")

lines = yield_lines(dfdata)

# The first line holds the headers and determines the number of columns
headers = line_to_list(next(lines))

columns = [list() for i in range(len(headers))]
for i, h in enumerate(headers):
    columns[i].append(h)

for line in lines:
    for i, l in enumerate(line_to_list(line)):
        columns[i].append(l)

print(columns[0])

Not an answer to the question, but I tried to solve the problem. :)

from os import statvfs

with open("/proc/mounts", "r") as mounts:
    split_mounts = [s.split() for s in mounts.read().splitlines()]

    print("{0:24} {1:24} {2:16} {3:16} {4:15} {5:13}".format(
            "FS", "Mountpoint", "Blocks", "Blocks Free", "Size", "Free"))
    for p in split_mounts:
        stat = statvfs(p[1])
        # f_blocks and f_bavail are counted in units of f_frsize
        block_size = stat.f_frsize
        blocks_total = stat.f_blocks
        blocks_free = stat.f_bavail

        size_mb = float(blocks_total * block_size) / 1024 / 1024
        free_mb = float(blocks_free * block_size) / 1024 / 1024

        print("{0:24} {1:24} {2:16} {3:16} {4:10.2f}MiB {5:10.2f}MiB".format(
                p[0], p[1], blocks_total, blocks_free, size_mb, free_mb))

I am not using os.popen because the Python 2 documentation marks it as deprecated (http://docs.python.org/library/os#os.popen).

I have put the output of df -h in a file, test.txt, and simply read from that file, but you can read it using subprocess too. Assuming you can read each line of the output of df -h, the following code will help:

# The with block closes the file; blank lines are skipped
with open('test.txt') as f:
    lines = [line.strip() for line in f if line.strip()]
splittedLines = (line.split() for line in lines)
# zip(*rows) transposes the rows into columns
listOfColumnData = zip(*splittedLines)
for eachColumn in listOfColumnData:
    print(eachColumn)

eachColumn holds one entire column as a tuple, which you can iterate over. To remove the dependency on test.txt, read the output of df -h with the subprocess module instead; the subprocess documentation shows how to do that.
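
One possible way to drop the test.txt dependency is sketched below. It assumes df is on the PATH and accepts -Ph; read_df_output and transpose are hypothetical helper names. The demonstration runs on a small stand-in string so that the result does not depend on the local machine.

```python
import subprocess


def read_df_output():
    """Run `df -Ph` and return its output as text (requires df on the PATH)."""
    return subprocess.check_output(["df", "-Ph"], universal_newlines=True)


def transpose(text):
    """Split each non-empty line into fields and turn the rows into columns."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # zip stops at the shortest row, so the extra "on" field that the
    # header gains from "Mounted on" is dropped.
    return list(zip(*rows))


# Stand-in for real output; use transpose(read_df_output()) to query df itself.
sample = (
    "Filesystem Size Used Avail Use% Mounted on\n"
    "tmpfs 814M 176K 814M 1% /dev/shm\n"
    "/dev/sda1 485M 120M 340M 27% /boot\n"
)
for each_column in transpose(sample):
    print(each_column)
```

As with the file-based version, this still assumes that no field contains whitespace.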

I had a mount point with a space in it, which threw off most of the examples. This borrows a lot from @ZarrHai's example but puts the result in a dict.

#!/usr/bin/python
import subprocess
import re
from pprint import pprint

DF_OPTIONS = "-laTh" # remove h if you want bytes.

def yield_lines(data):
    for line in data.split("\n"):
        yield line

def line_to_list(line):
    # Every fragment is a raw string so that \s and \d reach the regex engine unchanged
    pattern = re.compile(r"([\w\/\s\-\_]+)\s+(\w+)\s+([\d\.]+?[GKM]|\d+)"
                         r"\s+([\d\.]+[GKM]|\d+)\s+([\d\.]+[GKM]|\d+)\s+"
                         r"(\d+%)\s+(.*)")
    matches = pattern.search(line)
    if matches:
        return matches.groups()
    _line = re.sub(r" +", " ", line).split()
    return _line

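# universal_newlines=True makes communicate() return text rather than bytes
p = subprocess.Popen(["df", DF_OPTIONS], stdout=subprocess.PIPE,
                     universal_newlines=True)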
dfdata, _ = p.communicate()

dfdata = dfdata.replace("Mounted on", "Mounted_on")

lines = yield_lines(dfdata)

headers = line_to_list(next(lines))

columns = [list() for i in range(len(headers))]
for i,h in enumerate(headers):
    columns[i].append(h)

grouped = {}
for li, line in enumerate(lines):
    if not line:
        continue
    grouped[li] = {}
    for i,l in enumerate(line_to_list(line)):
        columns[i].append(l)
        key = headers[i].lower().replace("%","")
        grouped[li][key] = l.strip()

pprint(grouped)
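
A simpler alternative to the regular expression is sketched here, under two assumptions: the output comes from df -P (six columns, one line per filesystem), and the filesystem name itself contains no whitespace. Limiting the number of splits keeps everything after the fifth field together as the mount point. parse_df_line and the sample line are hypothetical, made up for illustration.

```python
def parse_df_line(line):
    """Split one `df -P` data line into six fields, keeping spaces in the mount point."""
    # At most 5 splits: the remainder of the line becomes the sixth field.
    return line.strip().split(None, 5)


# Made-up data line with a space-containing mount point.
sample_line = "/dev/sdb1   1.8T  1.1T  700G  62%  /media/My Backup Disk"
fields = parse_df_line(sample_line)
print(fields[0])  # filesystem
print(fields[5])  # mount point, spaces preserved
```

This does not help when the filesystem name contains spaces; that case still needs something like the regular expression above.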

Comments
  • In addition to os.statvfs Python 3.3 will add a new function shutil.disk_usage which returns a named tuple with the attributes total, used and free space.
  • Prefer subprocess instead of os.popen, as os.popen is deprecated (docs.python.org/library/os#os.popen).
  • Thanks @Gilles. I've tried the first option and it worked. I had to do like this for subprocess module. df_output_lines = [s.split() for s in subprocess.Popen(["df", "-Ph"], stdout=subprocess.PIPE).communicate()[0].strip().splitlines()]
  • This does not work if the logical file system name has a space in it, like map auto_home or Mac HD. The Python split will split on all single spaces.
  • Hard-coding the columns as 10 is not recommended. It should be kept dynamic so as to make sure that the code works on various flavors of linux/unix
  • You are right of course. I've assumed that question author is a Python beginner and did not want to unnecessarily complicate the answer. I've added alternative code that does not assume number of columns.
  • You may want to add attribution links to those answers you drew this from.
  • @NathanTuggy I don't usually comment on these things. How would you give attribution? A link to the original example?
  • Yeah, that should be fine.
  • This does not work, as we get the following output: [['Filesystem', 'Size', 'Used Avail Use% Mounted on'], ['/dev/mapper/vg_base-lv_root', '28G', '4.8G', '22G', '19% /'], ['tmpfs', '814M', '176K', '814M', '1% /dev/shm'], ['/dev /sda1', '485M', '120M', '340M', '27% /boot']] There are two problems in this code: 1. The items are not split properly, as is evident in the first item of the list. 2. Once problem 1 is solved, user1610085's requirement is to get the data in a transposed format. Your output gives all the lines split, but not transposed.
  • I could solve problem 2 by for i in [line[0] for line in df_output_lines]: ... print i