Add a new column to a dataframe based on first value in row

I have a dataframe like such:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

I would like to add another field that says whether the first value of the first field is a comment character, //. So far I have something like this:

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')  

What would be the correct way to add on a new column with this value, so that the result is something like:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False

What goes wrong when you simply assign the result of your command to a new column?

df['comment_flag'] = df[0].str.startswith('//')
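
To make that concrete, here is a minimal sketch of the assignment on a small inline sample. The sample text is an assumption modelled on the question's file (comma-separated, read with header=None so the column is addressed as 0), not the actual CSV.

```python
import io
import pandas as pd

# Inline stand-in for the question's CSV (assumed to be comma-separated)
raw = """//field,field2
//first field is time,
132605,1.0
132750,2.0
"""

# header=None keeps the commented header lines as ordinary rows,
# so the first column is labelled 0
df = pd.read_csv(io.StringIO(raw), header=None)

# The boolean Series from str.startswith can be assigned directly
df['comment_flag'] = df[0].str.startswith('//')
print(df)
```

Because the first column contains text, every value in it is read as a string, which is why the .str accessor works on all rows here.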

Or do you indeed have mixed type columns as mentioned by jpp?


EDIT: I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. If what you actually want is to load the data without the comment lines, while still using the field names hidden in the commented header as column names, you might want to check this out. Based on this text file:

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You could do:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # leave that out if collecting all comments of entire file is ok/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']  

This way the header information is ready to be used, e.g. for column names. Taking the names from the first header line and using them for the pandas import would look like this:

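nms = header[0][len(cmt):].split()   # strip the leading '//' and split into names
# read_csv's comment option only accepts a single character, so skip the
# collected header lines by count instead of passing comment='//'
df = pd.read_csv(textfilename, skiprows=len(header), names=nms, sep=r'\s+')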

    field  field2                                           
0  132605     1.0                                         
1  132750     2.0                                       
2  132772     3.0                                      
3  132773     4.0                                       
4  133065     5.0                                       
5  133150     6.0                                       
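
For anyone who wants to try this without a file on disk, here is a self-contained sketch of the same idea. The inline text stands in for the text file, and it skips the leading comment lines with skiprows, on the assumption that read_csv's comment option only handles a single character.

```python
import io
import pandas as pd

# Inline stand-in for the whitespace-separated text file shown above
raw = """//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
"""

cmt = '//'

# Collect the comment lines at the top of the text, stopping at the first data line
header = []
for line in raw.splitlines():
    if not line.startswith(cmt):
        break
    header.append(line)

# Column names come from the first comment line, minus the comment marker
names = header[0][len(cmt):].split()

# Skip exactly the collected header lines and split the rest on whitespace
df = pd.read_csv(io.StringIO(raw), skiprows=len(header), names=names, sep=r'\s+')
print(df.dtypes)
```

Since no comment text ends up in the data, both columns come out with numeric dtypes.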

One way is to utilise pd.to_numeric, assuming non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

Just note that mixing types within a series like this is strongly discouraged. Your first two series will be stored with object dtype and will no longer support vectorised operations, so you lose some of the main benefits of Pandas.
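
As a rough illustration of that point, the sketch below flags the comment rows with pd.to_numeric and then converts only the remaining rows back to numeric columns. The inline sample and the variable names is_comment and data are assumptions made for the example.

```python
import io
import pandas as pd

# Inline stand-in for the question's CSV (assumed to be comma-separated)
raw = """//field,field2
//first field is time,
132605,1.0
132750,2.0
"""

df = pd.read_csv(io.StringIO(raw), header=None)

# Anything in column 0 that cannot be parsed as a number is treated as a comment
is_comment = pd.to_numeric(df[0], errors='coerce').isnull()
df['_starts_with_comment'] = is_comment

# Columns 0 and 1 hold strings at this point; converting only the
# non-comment rows gives numeric columns again
data = df.loc[~is_comment, [0, 1]].apply(pd.to_numeric)
print(data.dtypes)
```

The flag column is still there if needed, but any arithmetic is better done on the converted copy.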

A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. Here's an example of how you can achieve this.
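
A minimal sketch of that idea (not the linked example, just one way it could look) reads the rows with csv.reader, keeps the commented rows in a separate list, and builds the dataframe from the data rows only. The inline sample and the names comments and rows are assumptions.

```python
import csv
import io
import pandas as pd

# Inline stand-in for the question's CSV (assumed to be comma-separated)
raw = """//field,field2
//first field is time,
132605,1.0
132750,2.0
"""

comments = []   # rows whose first cell starts with the comment marker
rows = []       # ordinary data rows
for row in csv.reader(io.StringIO(raw)):
    if not row:
        continue                      # ignore blank lines
    if row[0].startswith('//'):
        comments.append(row)
    else:
        rows.append(row)

# Use the first commented row as column names, dropping the '//' marker
names = [comments[0][0][2:]] + comments[0][1:]

# csv.reader yields strings, so convert each column to a numeric dtype
df = pd.DataFrame(rows, columns=names).apply(pd.to_numeric)
print(df)
```

The commented rows stay available in comments, and the dataframe itself holds only numeric data.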

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)

Comments
  • In case you would rather optimize the import of named columns when dealing with commented headers, please consider looking at my edit.
  • thanks for mentioning this approach. About the answer you linked, what if the header itself has a comment? I've actually seen that quite frequently to designate that the first row of the csv file is a header and not data.
  • @David542, You'll have to write some logic to store the header separately, then add it later via df.columns = [...], where [...] represents a list of strings.
  • Or just df[0].str.startswith(r'//').. np.where not necessary. You also need df.loc[:, '_startswith_comment'].
  • @jorge could you please explain the difference between doing np.where and just doing it without that?
  • David542 as @jpp pointed out, in this example, there is no difference. If you have other options in column [0] that you want to use in the new column, you can nest more np.where calls inside the np.where that I wrote. Something like np.where(df[0].str.startswith(r'//'), 'Starts with //', np.where(df[0] == '132750', 'number', 'Something_else')). Note that the values in column [0] are strings here, so compare against '132750' rather than the number. Just keep track of the parentheses and where you place them. I find np.where very useful in my work.
  • @Jorge thanks for the explanation. This may be a silly question, but does pandas automatically import numpy or do I need to import that separately?
  • @David542, no, pandas does NOT import numpy for you. You need to import it separately. As for your second question, both produce the same results. You may get a warning from pandas with df['_starts_with_comment']; .loc is used for label-based indexing. I found this site that explains some of the differences: shanelynn.ie/…
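
The np.where points from the comments can be checked with a small sketch. The Series s is a made-up stand-in for column 0, and the labels are the ones suggested in the comment above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for column 0 of the dataframe (all values are strings)
s = pd.Series(['//field', '132605', '132750'])

# The plain boolean result and the np.where(..., True, False) version agree
flag_direct = s.str.startswith('//')
flag_where = np.where(flag_direct, True, False)
same = bool((flag_direct.to_numpy() == flag_where).all())

# Nesting np.where gives more than two outcomes
labels = np.where(s.str.startswith('//'), 'Starts with //',
                  np.where(s == '132750', 'number', 'Something_else'))
print(same, labels.tolist())
```

So np.where only earns its place once there are more than two outcomes to choose from; for a plain True/False flag the boolean Series is enough.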