How to create a DataFrame from a list in Spark SQL?

Spark version: 2.1

For example, in PySpark, I create a list:

test_list = [['Hello', 'world'], ['I', 'am', 'fine']]

then how do I create a DataFrame from the test_list, where the DataFrame's type is like below:

DataFrame[words: array<string>]

Here is how:

from pyspark.sql.types import *

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

# notice the extra square brackets around each element: each row holds one array column
test_list = [[['Hello', 'world']], [['I', 'am', 'fine']]]

df = spark.createDataFrame(test_list,schema=cSchema) 
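A quick check (a sketch, continuing from the code above) should confirm the schema the question asked for:

df.printSchema()
# root
#  |-- WordList: array (nullable = true)
#  |    |-- element: string (containsNull = true)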

PySpark can also convert a standard (one-dimensional) list to a DataFrame by passing the element type directly:

from pyspark.sql.types import IntegerType

mylist = [1, 2, 3, 4]
# notice the parens after the type name: IntegerType() is an instance, not the class
spark.createDataFrame(mylist, IntegerType()).show()

A DataFrame is the representation of a matrix-like table whose columns may have different datatypes, though all values within one column share the same type.
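By the same pattern, a flat list of strings can be loaded by passing StringType() as the element type (a minimal sketch, assuming an active spark session):

from pyspark.sql.types import StringType

words = ['Hello', 'world']
# yields a single column named "value"
spark.createDataFrame(words, StringType()).show()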

I had to work with multiple columns and types - the example below has one string column and one integer column. A slight adjustment to Pushkr's code (above) gives:

from pyspark.sql.types import *

cSchema = StructType([StructField("Words", StringType())\
                      ,StructField("total", IntegerType())])

test_list = [['Hello', 1], ['I am fine', 3]]

df = spark.createDataFrame(test_list,schema=cSchema) 

output:

df.show()
+---------+-----+
|    Words|total|
+---------+-----+
|    Hello|    1|
|I am fine|    3|
+---------+-----+
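If explicit types aren't needed, the same frame can be built by passing only the column names and letting Spark infer the types (a sketch using the test_list above; note that Python ints infer to long):

df2 = spark.createDataFrame(test_list, ["Words", "total"])
df2.printSchema()
# root
#  |-- Words: string (nullable = true)
#  |-- total: long (nullable = true)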

A related gotcha: if you have a list of numpy float64 values, createDataFrame may reject the type. Hard-coded literals are plain Python floats, which work fine; numpy's datatypes need converting to Python's native ones first.
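One way around it is converting to native Python floats before building the frame (a minimal sketch; np_values is a hypothetical numpy array):

import numpy as np
from pyspark.sql.types import DoubleType

np_values = np.array([1.0, 2.5, 3.7])        # elements are numpy.float64
native = [float(x) for x in np_values]       # convert to native Python floats
spark.createDataFrame(native, DoubleType()).show()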

You should use a list of Row objects (List[Row]) to create the DataFrame.

from pyspark.sql import Row

spark.createDataFrame(list(map(lambda x: Row(words=x), test_list)))
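With the question's original test_list (a list of word lists), this should infer the array column automatically; checking the schema (a sketch, continuing from the line above):

df = spark.createDataFrame(list(map(lambda x: Row(words=x), test_list)))
df.printSchema()
# root
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)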

toDF() provides a concise syntax for creating DataFrames and can be called on a sequence object once Spark implicits are imported. By default it names the columns "_1", "_2", and so on (one per field in each row), and infers each column's datatype from the data; another signature takes arguments for custom column names, as in the sketch below. toDF() is limited in that the column types and nullable flags cannot be customized.
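PySpark exposes the same idea on RDDs once a SparkSession exists; a minimal sketch (assuming spark is the active session):

rdd = spark.sparkContext.parallelize([("Hello", 1), ("I am fine", 3)])
rdd.toDF().show()                    # default column names _1, _2
rdd.toDF(["Words", "total"]).show()  # custom column names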

You can create an RDD first from the input and then convert it to a DataFrame:

// Spark 2.x: import implicits from the SparkSession (on 1.x, use sqlContext.implicits._)
import spark.implicits._

val testList = Array(Array("Hello", "world"), Array("I", "am", "fine"))
// create the RDD
val testListRDD = sc.parallelize(testList)
// flatten the nested arrays: each word becomes its own element
val flatTestListRDD = testListRDD.flatMap(entry => entry)
// convert the RDD to a DataFrame: a single string column named "value", one word per row
val testListDF = flatTestListRDD.toDF
testListDF.show
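Note that the flatMap above flattens the nested arrays, so the result has one word per row rather than the array<string> column the question asked for. A PySpark variant that keeps each inner list intact might look like this (a minimal sketch, assuming an active spark session):

rdd = spark.sparkContext.parallelize([(['Hello', 'world'],), (['I', 'am', 'fine'],)])
df = rdd.toDF(["words"])  # inferred schema: words: array<string>
df.show(truncate=False)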

Convert a List to a Spark DataFrame in Scala: first convert the List[Iterable[Any]] to a List[Row], then put the rows in an RDD and prepare a schema for the DataFrame. To convert List[Iterable[Any]] to List[Row], we can say val rows = values.map{x => Row(x: _*)}, and with the schema in hand we can then build the RDD and the DataFrame.
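The PySpark analogue of that Scala unpacking is Row(*x) (a sketch; values is a hypothetical list of equal-length rows):

from pyspark.sql import Row

values = [['Hello', 1], ['I am fine', 3]]   # hypothetical input rows
rows = [Row(*x) for x in values]            # unpack each list into a positional Row
df = spark.createDataFrame(rows, ["Words", "total"])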

Comments
  • For anyone who just wants to convert a list of strings and is impressed by the ridiculous lack of proper documentation: you cannot convert 1d objects, you have to transform it into a list of tuples like: [(t,) for t in list_of_strings] (see the sketch after these comments)
  • Is there a reason why from ... import *, almost universally considered an antipattern in Python, is advisable here?
  • Same question I asked on another answer: Is there a reason why from ... import *, almost universally considered an antipattern in Python, is advisable here?
  • Should be spark.createDataFrame
  • This appears to be Scala code and not Python, for anyone wondering why this is downvoted. The question is explicitly tagged pyspark.
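Following up on the first comment above, a minimal sketch of the tuple-wrapping workaround (assuming an active spark session; list_of_strings is a hypothetical input):

list_of_strings = ['Hello', 'world']
df = spark.createDataFrame([(t,) for t in list_of_strings], ["words"])
df.show()
# +-----+
# |words|
# +-----+
# |Hello|
# |world|
# +-----+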