PySpark: how to return the average of a column based on the value of another column?

I wouldn't expect this to be difficult, but I'm having trouble understanding how to take the average of a column in my Spark DataFrame.

The dataframe looks like:

+-------+------------+--------+------------------+
|Private|Applications|Accepted|              Rate|
+-------+------------+--------+------------------+
|    Yes|         417|     349|0.8369304556354916|
|    Yes|        1899|    1720|0.9057398630858347|
|    Yes|        1732|    1425|0.8227482678983834|
|    Yes|         494|     313|0.6336032388663968|
|     No|        3540|    2001|0.5652542372881356|
|     No|        7313|    4664|0.6377683577191303|
|    Yes|         619|     516|0.8336025848142165|
|    Yes|         662|     513|0.7749244712990937|
|    Yes|         761|     725|0.9526938239159002|
|    Yes|        1690|    1366| 0.808284023668639|
|    Yes|        6075|    5349|0.8804938271604938|
|    Yes|         632|     494|0.7816455696202531|
|     No|        1208|     877|0.7259933774834437|
|    Yes|       20192|   13007|0.6441660063391442|
|    Yes|        1436|    1228|0.8551532033426184|
|    Yes|         392|     351|0.8954081632653061|
|    Yes|       12586|    3239|0.2573494358811378|
|    Yes|        1011|     604|0.5974282888229476|
|    Yes|         848|     587|0.6922169811320755|
|    Yes|        8728|    5201|0.5958982584784601|
+-------+------------+--------+------------------+

I want to return the average of the Rate column when Private is equal to "Yes". How can I do this?

One way to do this would be:

from pyspark.sql.functions import col, avg

# Keep only the "Yes" rows, then aggregate the Rate column
df_avg = df.filter(df["Private"] == "Yes").agg(avg(col("Rate")))
df_avg.show()
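
If you need the result as a plain Python float rather than a one-row DataFrame (see the comments below), you can pull it out of the aggregated Row. A small sketch, continuing from the snippet above:

# first() returns the single aggregated Row; index 0 is its avg(Rate) field
avg_rate = df.filter(df["Private"] == "Yes").agg(avg(col("Rate"))).first()[0]
print(avg_rate)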

This would work in Scala; the PySpark code should be very similar.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// In an application (outside spark-shell) you also need:
// import spark.implicits._  // for toDF and the $"col" syntax

val df = List(
  ("yes", 10),
  ("yes", 30),
  ("No", 40)).toDF("private", "rate")

// Average "rate" within each "private" partition, nulled out for the "No" rows
val window = Window.partitionBy($"private")

df.withColumn("avg",
    when($"private" === "No", lit(null))
      .otherwise(avg($"rate").over(window)))
  .show()

Input DF

+-------+----+
|private|rate|
+-------+----+
|    yes|  10|
|    yes|  30|
|     No|  40|
+-------+----+

Output DF

+-------+----+----+
|private|rate| avg|
+-------+----+----+
|     No|  40|null|
|    yes|  10|20.0|
|    yes|  30|20.0|
+-------+----+----+
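
As noted above, the PySpark code should be very similar; for completeness, a rough translation (a sketch, assuming a SparkSession named spark and the same toy data):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col, lit, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("yes", 10), ("yes", 30), ("No", 40)], ["private", "rate"])

# Average "rate" within each "private" partition, nulled out for the "No" rows
window = Window.partitionBy("private")
df.withColumn(
    "avg",
    when(col("private") == "No", lit(None)).otherwise(avg(col("rate")).over(window)),
).show()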

Try

df.filter(df['Private'] == 'Yes').agg({'Rate': 'avg'}).collect()[0]
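
Note that collect()[0] returns a Row, not a bare number, which is why printing the un-collected DataFrame only shows DataFrame[avg(Rate): double] (the asker's issue in the comments). Indexing one level deeper gives the float:

# first [0] -> the single Row, second [0] -> the avg(Rate) value itself
df.filter(df['Private'] == 'Yes').agg({'Rate': 'avg'}).collect()[0][0]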

Try:

from pyspark.sql.functions import col, mean, lit

# where is an alias for filter, and mean is an alias for avg
df.where(col("Private") == lit("Yes")).select(mean(col("Rate"))).collect()


Comments
  • Filter and then calculate the average?
  • Exactly, I am not sure how to calculate the average.
  • I think this, along with Vishnudev's answer, is working, but the answer isn't displaying properly. Like I mentioned before, printing this gives me: DataFrame[avg(Rate): double] and I need to see the number. I'm curious about the display function you mentioned; it seems like I need to import something first for that to work.
  • df_avg.show() gives me a very long error (one I have been seeing a lot when trying very different things)
  • I'm not sure filter works this way. Filter works on the index and column headers.
  • I have tried privateRate = df.filter(df['Private'] == 'Yes').agg({'Rate': 'avg'}), but print(privateRate) returns DataFrame[avg(Rate): double]. Seems like it's close, but I need to see the actual number @Vishnudev
  • spyder is warning me "undefined name display"
  • spyder gives me a code analysis warning, "'lit' may be undefined or defined from star imports". This also causes an error while running the code
  • Oh, try now - F.lit(...) should do
  • I'm receiving a big long error. I've seen it several times on this project but it doesn't always come up. I'm beginning to wonder if the problem isn't this part of my code but something else? Which would be odd because I can run up to these lines without issue.
  • Maybe it's this: stackoverflow.com/questions/46799137/…
  • I'm starting to think - from pyspark.sql import functions as F is the problem here (which shouldn't be the case in general). You can check now - but if it's still not it, that's some environmental discrepancy, which I just cannot reproduce...