How to create a new DataFrame from a dict

I had one dict, like:

cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}

and one DataFrame A, like:

| k1|
| k2|
| k3|
| k4|

I created the DataFrame above with this code:

data = [('k1',), ('k2',), ('k3',), ('k4',)]
A = spark.createDataFrame(data, ['key'])

I want to get the new DataFrame, like:

|key|   v1     |    v2    |
| k1|true      |false     |
| k2|true      |false     |
| k3|false     |true      |
| k4|false     |true      |

Any suggestions would be appreciated, thanks!

I just wanted to contribute a different and possibly easier way to solve this.

In my code I first convert the dict to a pandas DataFrame, which I find much easier, and then convert the pandas DataFrame directly to a Spark DataFrame.

import pandas as pd

data = {'visitor': ['foo', 'bar', 'jelmer'],
        'A': [0, 1, 0],
        'B': [1, 0, 1],
        'C': [1, 0, 0]}

df = pd.DataFrame(data)
ddf = spark.createDataFrame(df)  # assumes an active SparkSession named spark

|  A|  B|  C|visitor|
|  0|  1|  1|    foo|
|  1|  0|  0|    bar|
|  0|  1|  0| jelmer|
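
The same pandas route can be applied to the cMap from the question. Here is a minimal sketch, assuming every distinct value in the dict should become its own boolean column. The helper name cmap_to_columns is made up for illustration and is not part of pandas or Spark.

```python
def cmap_to_columns(cmap):
    """Turn {key: value} into a dict of equal-length lists:
    a 'key' list plus one boolean list per distinct value."""
    keys = sorted(cmap)
    values = sorted(set(cmap.values()))
    columns = {"key": keys}
    for v in values:
        # True where this key maps to the value v, False otherwise
        columns[v] = [cmap[k] == v for k in keys]
    return columns


cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}
columns = cmap_to_columns(cMap)
print(columns)
```

The resulting dict has the same shape as data above, so it should be usable the same way: pd.DataFrame(columns) first, then spark.createDataFrame on the result.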

The dictionary can be converted to a DataFrame and joined with the other one. Here is my code:

from pyspark.sql import functions as F

data = sc.parallelize([(k, v) for k, v in cMap.items()]).toDF(['key', 'val'])
keys = sc.parallelize([('k1',), ('k2',), ('k3',), ('k4',)]).toDF(['key'])
# one "True"/"False" string column per value
newDF = data.join(keys, 'key').select(
    'key',
    F.when(F.col('val') == 'v1', 'True').otherwise('False').alias('v1'),
    F.when(F.col('val') == 'v2', 'True').otherwise('False').alias('v2'))

 |key|   v1|   v2|
 | k1| True|False|
 | k2| True|False|
 | k3|False| True|
 | k4|False| True|

If there are more values, you can wrap that when clause in a UDF and reuse it.
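
For more values, a UDF may not even be needed: one comparison column per distinct value can be generated in a loop. Here is a minimal sketch, assuming data is the key/val DataFrame built above. add_flag_columns is a hypothetical helper name, and it produces real booleans rather than the "True"/"False" strings used above.

```python
def add_flag_columns(data, values, key_col="key", val_col="val"):
    """Return data with one boolean column per entry of values.

    data is assumed to be a pyspark.sql.DataFrame that has the
    columns key_col and val_col.
    """
    # (data[val_col] == v) is a Column expression; alias names it after v
    flags = [(data[val_col] == v).alias(v) for v in values]
    return data.select(key_col, *flags)


# Usage, with cMap and data as defined above:
# newDF = add_flag_columns(data, sorted(set(cMap.values())))
```

The list of values is passed in explicitly so the column order is under your control.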

I parallelize cMap.items() and check whether each value equals v1 or v2. Then I join the result back to DataFrame A on the key column.

from pyspark.sql import Row

# example dataframe A
df_A = spark.sparkContext.parallelize(['k1', 'k2', 'k3', 'k4']).map(lambda x: Row(key=x)).toDF()

# one row per dict entry, with a boolean flag for each value
cmap_rdd = spark.sparkContext.parallelize(list(cMap.items()))
cmap_df = cmap_rdd.map(lambda x: Row(key=x[0], v1=x[1] == 'v1', v2=x[1] == 'v2')).toDF()

df_A.join(cmap_df, on='key').orderBy('key').show()


|key|   v1|   v2|
| k1| true|false|
| k2| true|false|
| k3|false| true|
| k4|false| true|

Thanks, everyone, for the suggestions. I found another way to solve my problem, using pivot. The code is:

cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}
a_cMap = [(k,)+(v,) for k,v in cMap.items()] 
data = spark.createDataFrame(a_cMap, ['key','val'])

from pyspark.sql.functions import count
data = data.groupBy('key').pivot('val').agg(count('val'))

|key|  v1|  v2|
| k2|   1|null|
| k4|null|   1|
| k1|   1|null|
| k3|null|   1|

data = data.fillna(0)  # presumably: replace the nulls left by pivot with 0

|key| v1| v2|
| k2|  1|  0|
| k4|  0|  1|
| k1|  1|  0|
| k3|  0|  1|

keys = spark.createDataFrame([('k1','2'),('k2','3'),('k3','4'),('k4','5'),('k5','6')], ["key",'temp'])

newDF = keys.join(data,'key')

|key|temp| v1| v2|
| k2|   3|  1|  0|
| k4|   5|  0|  1|
| k1|   2|  1|  0|
| k3|   4|  0|  1|

But I can't work out how to convert 1 to true and 0 to false.
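
One way to get true/false out of the 0/1 columns is to cast them to boolean: Spark's integer-to-boolean cast maps 0 to false and any non-zero value to true. Here is a minimal sketch, assuming newDF is the joined DataFrame above. flags_to_bool is a hypothetical helper name.

```python
def flags_to_bool(df, flag_cols):
    """Cast each 0/1 column named in flag_cols to boolean.

    df is assumed to be a pyspark.sql.DataFrame whose flag columns
    hold 0 or 1.
    """
    for name in flag_cols:
        # withColumn with an existing name replaces that column
        df = df.withColumn(name, df[name].cast("boolean"))
    return df


# Usage, with newDF as defined above:
# newDF = flags_to_bool(newDF, ["v1", "v2"])
```

Nulls stay null under the cast, so this relies on the nulls from the pivot having been replaced with 0 first.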

I just wanted to add an easy way to create a DataFrame using PySpark:

values = [("K1","true","false"),("K2","true","false")]
columns = ['Key', 'V1', 'V2']
df = spark.createDataFrame(values, columns)

  • Actually, there are more values. Could you tell me how to construct the UDF?