Read Csv Spark Scala With Automatic Schema
PySpark Read CSV file : In this tutorial, I will explain how to create a PySpark DataFrame from a CSV file.
Introduction
CSV is a widely used data format for processing data. The read.csv() function in PySpark allows you to read a CSV file and load it into a PySpark DataFrame. In this tutorial we will therefore see how to read one or more CSV files from a local directory and use the different transformations available through the options of the function.
If you need to install Spark on your machine, you can consult the first part of the tutorial :
Pyspark read csv Syntax
To illustrate the different examples, we will use a file which contains the list of the different Pokémon. You can download it via this link :
This file contains 13 columns which are as follows :
- Index
- Name
- Type1
- Type2
- Total
- HP
- Attack
- Defense
- SpecialAtk
- SpecialDef
- Speed
- Generation
- Legendary
The basic syntax for using the read.csv() function is as follows:
# "path" is the location of the CSV file
spark.read.csv("path")

To read the CSV file as an example, proceed as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType

spark = SparkSession.builder.appName('pyspark - example read csv').getOrCreate()
sc = spark.sparkContext

df = spark.read.csv("amiradata/pokedex.csv")
df.printSchema()
df.show(5, False)

# Result of the printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)

# Result of the show() function
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|_c0  |_c1                  |_c2  |_c3   |_c4  |_c5|_c6   |_c7    |_c8       |_c9       |_c10 |_c11      |_c12     |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|Index|Name                 |Type1|Type2 |Total|HP |Attack|Defense|SpecialAtk|SpecialDef|Speed|Generation|Legendary|
|1    |Bulbasaur            |Grass|Poison|318  |45 |49    |49     |65        |65        |45   |1         |False    |
|2    |Ivysaur              |Grass|Poison|405  |60 |62    |63     |80        |80        |60   |1         |False    |
|3    |Venusaur             |Grass|Poison|525  |80 |82    |83     |100       |100       |80   |1         |False    |
|3    |VenusaurMega Venusaur|Grass|Poison|625  |80 |100   |123    |122       |120       |80   |1         |False    |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
only showing top 5 rows

By default, when only the path of the file is specified, the header is treated as False even though the file contains a header on the first line, and all columns are read as strings.
To solve these problems the read.csv() function takes several optional arguments, the most common of which are :
- header : uses the first line as the names of the columns. By default, the value is False
- sep : sets a separator for each field and value. By default, the value is a comma
- schema : an optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string
- path : string, or list of strings, for input path(s), or RDD of strings storing CSV rows
You will find the complete list of parameters on the official spark website.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv#pyspark.sql.DataFrameReader
If your file already contains a header on the first line, you must say so explicitly by setting the header parameter to True.
# Specifies the header to True
df = spark.read.csv("amiradata/pokedex.csv", header=True)
df.printSchema()

root
 |-- Index: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- HP: string (nullable = true)
 |-- Attack: string (nullable = true)
 |-- Defense: string (nullable = true)
 |-- SpecialAtk: string (nullable = true)
 |-- SpecialDef: string (nullable = true)
 |-- Speed: string (nullable = true)
 |-- Generation: string (nullable = true)
 |-- Legendary: string (nullable = true)

With printSchema(), we can see that the header has been taken into account.
Read CSV file using a user custom schema
As we have seen, by default all columns are considered as strings. If we want to change this, we can use structures. Once our structure is created, we can pass it in the schema parameter of the read.csv() function.
# Schema of the table
schema = StructType() \
    .add("Index", IntegerType(), True) \
    .add("Name", StringType(), True) \
    .add("Type1", StringType(), True) \
    .add("Type2", StringType(), True) \
    .add("Total", IntegerType(), True) \
    .add("HP", IntegerType(), True) \
    .add("Attack", IntegerType(), True) \
    .add("Defense", IntegerType(), True) \
    .add("SpecialAtk", IntegerType(), True) \
    .add("SpecialDef", IntegerType(), True) \
    .add("Speed", IntegerType(), True) \
    .add("Generation", IntegerType(), True) \
    .add("Legendary", BooleanType(), True)

df = spark.read.csv("amiradata/pokedex.csv", header=True, schema=schema)
df.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: integer (nullable = true)
 |-- HP: integer (nullable = true)
 |-- Attack: integer (nullable = true)
 |-- Defense: integer (nullable = true)
 |-- SpecialAtk: integer (nullable = true)
 |-- SpecialDef: integer (nullable = true)
 |-- Speed: integer (nullable = true)
 |-- Generation: integer (nullable = true)
 |-- Legendary: boolean (nullable = true)

As you can see, the schema has been changed and contains the types we specified in our structure.
Read multiple CSV files
With this function it is possible to read several files directly (either by listing all the paths of each file or by specifying the folder where your different files are located):
# Reads the 3 files specified in the path parameter
df = spark.read.csv(["amiradata/pokedex.csv", "amiradata/pokedex2.csv", "amiradata/pokedex3.csv"])
# Reads all the CSV files in the folder
df = spark.read.csv("amiradata/")

Conclusion
In this tutorial we have learned how to read a CSV file using the read.csv() function in Spark. This function is very useful and we have only seen a tiny part of the options it offers us.
If you want to learn more about PySpark, you can read this book ( As an Amazon Partner, I earn from qualifying purchases ) :
In our next article, we will see how to create a CSV file from a PySpark DataFrame. Stay tuned! 🙂
Source: https://amiradata.com/pyspark-read-csv-file-into-pyspark-dataframe/