Reshaping data frame and counting values based on criteria

I have the data set below and am trying to tag each customer with a type. Excel crashes on the full data set when I attempt this, so I am trying to do it in Python instead.

item  customer qty
------------------
ProdA CustA    1 
ProdA CustB    1
ProdA CustC    1
ProdA CustD    1
ProdB CustA    1
ProdB CustB    1

In Excel, I would:

1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF Customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")


customer ProdA ProdB Type
--------------------------
CustA    1     1     Both
CustB    1     1     Both
CustC    1     0     One
CustD    1     0     One
Method 1

We can build the per-customer indicator table with pd.crosstab, then map the row sum of the ProdA and ProdB columns to a label (2 -> 'Both', 1 -> 'One') with Series.map:

import pandas as pd

dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2: 'Both', 1: 'One'})

Alternatively, the last line can use np.where to assign 'Both' or 'One' conditionally:

import numpy as np

dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
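For reference, Method 1 runs end-to-end on the sample data like this (a minimal sketch; the DataFrame construction is my addition, not part of the original answer):

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'item':     ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty':      [1, 1, 1, 1, 1, 1],
})

# One row per customer, one indicator column per product.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()

# A row sum of 2 means the customer bought both products, 1 means one.
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2: 'Both', 1: 'One'})
print(dfn)
```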

Method 2

We can also lean on pd.crosstab's margins=True argument, which appends a totals row and a totals column named after margins_name; iloc[:-1] drops the totals row, and the totals column already holds the per-customer sum:

dfn = pd.crosstab(df['customer'], df['item'], 
                  margins=True, 
                  margins_name='Type').iloc[:-1].reset_index()

dfn['Type'] = dfn['Type'].map({2: 'Both', 1: 'One'})

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
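To see what the margin actually adds: margins=True produces both a totals row and a totals column under the margins_name, and only the row has to be discarded (a sketch, assuming the same sample df as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'item':     ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty':      [1, 1, 1, 1, 1, 1],
})

ct = pd.crosstab(df['customer'], df['item'], margins=True, margins_name='Type')
# The last row is the column-totals margin (dropped by iloc[:-1]);
# the 'Type' column is the per-customer total that gets mapped to 'Both'/'One'.
print(ct)
```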

Try using set_index, unstack and np.select:

import numpy as np

df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
df_out['Type'] = np.select([SumProd == 2, SumProd == 1, SumProd == 0],
                           ['Both', 'One', 'None'])
print(df_out)

Output:

item      ProdA  ProdB  Type
customer                    
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
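A shorter variant along the same lines, not from the original answers: counting distinct products per customer with groupby/nunique yields the label directly, though it skips building the indicator columns (my sketch):

```python
import pandas as pd

df = pd.DataFrame({
    'item':     ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty':      [1, 1, 1, 1, 1, 1],
})

# Number of distinct products each customer appears with.
n_products = df.groupby('customer')['item'].nunique()
customer_type = n_products.map({2: 'Both', 1: 'One'})
print(customer_type)
```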

In addition to the other suggestions, you could skip Pandas entirely:

################################################################################
## Data ingestion
################################################################################
import csv
from io import StringIO  # Python 3 (the original used Python 2's StringIO module)

# Formatted to make the example more straightforward.
input_data = StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

records = []
reader = csv.DictReader(input_data)
for row in reader:
  records.append(row)

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer. 
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}

for r in records:
  customer_id = r['customer']
  if customer_id not in customer_data:
    customer_data[customer_id] = {}
  customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type. 
for c in customer_data:
  c_data = customer_data[c]
  missing_product = products.difference(c_data.keys())
  matching_product = products.intersection(c_data.keys())
  if missing_product:
    for missing_p in missing_product:
      c_data[missing_p] = 0
    c_data['type'] = 'One'
  else:
    c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
for i, c in enumerate(customer_data):
  if i == 0:
    print('\t'.join(['ID'] + list(customer_data[c].keys())))
  print('\t'.join([c] + [str(x) for x in customer_data[c].values()]))

Which, on Python 3 (where dicts keep insertion order), prints this

ID      ProdA   ProdB   type
CustA   1       1       Both
CustB   1       1       Both
CustC   1       0       One
CustD   1       0       One
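The ingestion and transformation steps above can be compressed with collections.defaultdict; a sketch under the same assumptions (Python 3, same CSV input):

```python
import csv
from collections import defaultdict
from io import StringIO

input_data = StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

products = {'ProdA', 'ProdB'}
customer_data = defaultdict(dict)

# One inner dict per customer, keyed by product.
for row in csv.DictReader(input_data):
    customer_data[row['customer']][row['item']] = int(row['qty'])

for c_data in customer_data.values():
    # Fill in the products the customer never bought.
    for missing in products - c_data.keys():
        c_data[missing] = 0
    c_data['type'] = 'Both' if all(c_data[p] for p in products) else 'One'
```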

Comments
  • can you post a part of the dataset?