mysticpandakid · 3 months ago
How to Read and Write Data in PySpark 
PySpark is the Python API for Apache Spark, used to run big data workloads from Python. One of the most important PySpark skills is reading and writing data in formats such as CSV, JSON, and Parquet.
In this blog, you’ll learn how to: 
Initialize a Spark session 
Read data from various formats 
Write data to different formats 
See expected outputs for each operation 
Let’s dive in step-by-step. 
Getting Started 
Before reading or writing, start by initializing a SparkSession. 
Reading Data in PySpark 
1. Reading CSV Files 
[Images in the original post: code, sample CSV data (sample.csv), and output]
2. Reading JSON Files 
[Images in the original post: code, sample JSON (sample.json), and output]
3. Reading Parquet Files 
Parquet is a columnar format that stores the schema alongside the data, which makes reads fast; it is widely used in big data pipelines.
[Images in the original post: code and output, assuming the Parquet file has similar content]
4. Reading from a Database (JDBC) 
[Images in the original post: code, sample table employees in MySQL, and output]
Writing Data in PySpark 
1. Writing to CSV 
[Images in the original post: code, output files in the folder output/employees_csv/, and sample content]
2. Writing to JSON 
[Images in the original post: code and sample JSON output (employees_json/part-*.json)]
3. Writing to Parquet 
Output: binary Parquet files are saved inside output/employees_parquet/.
You can verify the contents by reading the files back.
4. Writing to a Database 
Check the new_employees table in your database — it should now include all the records. 
Write Modes in PySpark 
overwrite: replaces any existing data at the target
append: adds the new rows to the existing data
ignore: silently skips the write if the target already exists
error (default, also accepted as errorifexists): fails if the target already exists
Real-Life Use Case 
[Images in the original post: code and filtered output]
Wrap-Up 
Reading and writing data in PySpark is efficient, scalable, and easy once you understand the syntax and options. This blog covered: 
Reading from CSV, JSON, Parquet, and JDBC
Writing to CSV, JSON, Parquet, and back to databases
Example outputs for every format
Write modes and a real-life use case
Keep experimenting and building real-world data pipelines — and you’ll be a PySpark pro in no time! 