Read Csv Spark Scala With Automatic Schema
PySpark Read CSV file : In this tutorial, I will explain how to create a PySpark DataFrame from a CSV file.
Introduction
CSV is a widely used data format for processing data. The read.csv() function in PySpark allows you to read a CSV file and load it into a PySpark DataFrame. In this tutorial we will therefore see how to read one or more CSV files from a local directory and use the different transformations available through the options of the function.
If you need to install Spark on your machine, you can consult the first part of the tutorial :
Pyspark read csv Syntax
To illustrate the different examples, we will use a file which contains the list of the different Pokémon. You can download it via this link :
This file contains 13 columns which are as follows :
- Index
- Name
- Type1
- Type2
- Total
- HP
- Attack
- Defense
- SpecialAtk
- SpecialDef
- Speed
- Generation
- Legendary
The basic syntax for using the read.csv() function is as follows:
# "path" is the location of the CSV file
spark.read.csv("path")

To read the CSV file as an example, proceed as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType

spark = SparkSession.builder.appName('pyspark - example read csv').getOrCreate()
sc = spark.sparkContext

df = spark.read.csv("amiradata/pokedex.csv")
df.printSchema()
df.show(5, False)

# Result of the printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)

# Result of the show() function
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|_c0  |_c1                  |_c2  |_c3   |_c4  |_c5|_c6   |_c7    |_c8       |_c9       |_c10 |_c11      |_c12     |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|Index|Name                 |Type1|Type2 |Total|HP |Attack|Defense|SpecialAtk|SpecialDef|Speed|Generation|Legendary|
|1    |Bulbasaur            |Grass|Poison|318  |45 |49    |49     |65        |65        |45   |1         |False    |
|2    |Ivysaur              |Grass|Poison|405  |60 |62    |63     |80        |80        |60   |1         |False    |
|3    |Venusaur             |Grass|Poison|525  |80 |82    |83     |100       |100       |80   |1         |False    |
|3    |VenusaurMega Venusaur|Grass|Poison|625  |80 |100   |123    |122       |120       |80   |1         |False    |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
only showing top 5 rows

By default, when only the path of the file is specified, the header is treated as False even though the file contains a header on the first line, and all columns are read as strings.
To solve these problems the read.csv() function takes several optional arguments, the most common of which are :
- header : uses the first line as the names of the columns. By default, the value is False
- sep : sets a separator for each field and value. By default, the value is a comma
- schema : an optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string
- path : string, or list of strings, for input path(s), or RDD of strings storing CSV rows
You will find the complete list of parameters on the official spark website.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv#pyspark.sql.DataFrameReader
If your file already contains a header on the first line, you must say so explicitly by setting the header parameter to True.
# Specifies the header to True
df = spark.read.csv("amiradata/pokedex.csv", header=True)
df.printSchema()

root
 |-- Index: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- HP: string (nullable = true)
 |-- Attack: string (nullable = true)
 |-- Defense: string (nullable = true)
 |-- SpecialAtk: string (nullable = true)
 |-- SpecialDef: string (nullable = true)
 |-- Speed: string (nullable = true)
 |-- Generation: string (nullable = true)
 |-- Legendary: string (nullable = true)

With printSchema(), we can see that the header has been taken into account.
Read CSV file using a user custom schema
As we have seen, by default all columns are considered as strings. If we want to change this, we can use structures. Once our structure is created, we can pass it in the schema parameter of the read.csv() function.
# Schema of the table
schema = StructType() \
    .add("Index", IntegerType(), True) \
    .add("Name", StringType(), True) \
    .add("Type1", StringType(), True) \
    .add("Type2", StringType(), True) \
    .add("Total", IntegerType(), True) \
    .add("HP", IntegerType(), True) \
    .add("Attack", IntegerType(), True) \
    .add("Defense", IntegerType(), True) \
    .add("SpecialAtk", IntegerType(), True) \
    .add("SpecialDef", IntegerType(), True) \
    .add("Speed", IntegerType(), True) \
    .add("Generation", IntegerType(), True) \
    .add("Legendary", BooleanType(), True)

df = spark.read.csv("amiradata/pokedex.csv", header=True, schema=schema)
df.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: integer (nullable = true)
 |-- HP: integer (nullable = true)
 |-- Attack: integer (nullable = true)
 |-- Defense: integer (nullable = true)
 |-- SpecialAtk: integer (nullable = true)
 |-- SpecialDef: integer (nullable = true)
 |-- Speed: integer (nullable = true)
 |-- Generation: integer (nullable = true)
 |-- Legendary: boolean (nullable = true)

As you can see, the schema has been changed and contains the types we specified in our structure.
Read multiple CSV files
With this function it is possible to read several files directly (either by listing all the paths of each file or by specifying the folder where your different files are located):
# Reads the 3 files specified in the path parameter
df = spark.read.csv(["amiradata/pokedex.csv", "amiradata/pokedex2.csv", "amiradata/pokedex3.csv"])
# Reads all the CSV files in the folder
df = spark.read.csv("amiradata/")

Conclusion
In this tutorial we have learned how to read a CSV file using the read.csv() function in Spark. This function is very useful and we have only seen a tiny part of the options it offers us.
If you want to learn more about PySpark, you can read this book ( As an Amazon Partner, I earn from qualifying purchases ) :
In our next article, we will see how to create a CSV file from a PySpark DataFrame. Stay tuned! 🙂
Source: https://amiradata.com/pyspark-read-csv-file-into-pyspark-dataframe/