Big Data Pipeline on Hadoop & Apache Spark

At a glance

Project overview

Course: Big Data and Applications — Semester 2, 2025–2026
Score: 10/10
Team size: 3
Role: Team Leader
Tools / methods: Hadoop · Apache Spark · PySpark · Spark SQL · Machine Learning · Big Data

The problem: big data doesn’t create value on its own

Businesses need to extract value from large datasets to support decision-making, but the data is often scattered, unclean, and hard to process with traditional tools such as Excel or a single database. This capstone project spans Big Data, Data Engineering, and Machine Learning. Rather than focusing only on building a predictive model, it walks through the entire lifecycle of a real data project: choosing a dataset, cleaning it, exploring it, engineering features, training a model, and extracting business insight.

The team set up a Hadoop and Apache Spark environment to store, process, and analyze data at scale, then built a pipeline on that distributed platform to standardize the data, extract information, and train a Machine Learning model for analysis and prediction.

Role: Team Leader, coordinating while staying hands-on

Within a 3-person team, the Team Leader role wasn’t just about assigning tasks — it meant directly executing most of the core technical steps:

Assigned work and tracked the team’s progress
Researched the dataset and prepared the data
Loaded data into Hadoop HDFS
Read data from HDFS using Spark and ran sets of Spark SQL queries
Performed Exploratory Data Analysis (EDA)
Preprocessed data for Feature Selection and Machine Learning
Wrote the report and designed the presentation slides

Beyond the assigned scope, the Team Leader also researched Kafka and PowerShell, along with EDA techniques that weren’t covered in class.

The process: when the hardest technical problem isn’t the model

The most interesting part of this project: the hardest problem wasn’t Machine Learning. It was getting the Hadoop and Apache Spark environment set up and running in the first place. During implementation, configuring Spark in PyCharm repeatedly ran into JDK conflicts. After multiple failed attempts, development switched to running the Python files from Visual Studio Code together with PowerShell, which stabilized the environment and kept the work moving. It was a practical engineering call that wasn’t in any textbook, but it was necessary to keep the project alive.
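
The write-up doesn’t say what the JDK conflicts actually were, so the following is only a sketch of one common diagnostic, assuming the trouble came from more than one JDK being visible to the tools. The helper name find_java_problems is hypothetical; it checks that JAVA_HOME is set, that it contains a java executable, and that the java found on PATH is the same one.

```python
import os
import shutil
from pathlib import Path


def find_java_problems(env):
    """Return human-readable problems with the Java setup described by `env`.

    `env` is any mapping of environment variables, for example os.environ.
    This is an illustrative check, not the procedure used in the project.
    """
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return ["JAVA_HOME is not set"]

    # The executable is java.exe on Windows and java elsewhere.
    java_name = "java.exe" if os.name == "nt" else "java"
    java_bin = Path(java_home) / "bin" / java_name
    if not java_bin.is_file():
        return [f"no java executable at {java_bin}"]

    problems = []
    # Look up java using the PATH from `env`, not the current process.
    on_path = shutil.which("java", path=env.get("PATH", ""))
    if on_path is None:
        problems.append("java is not on PATH")
    elif Path(on_path).resolve() != java_bin.resolve():
        # Two different JDKs are visible: a typical source of conflicts.
        problems.append(
            f"PATH resolves java to {on_path}, but JAVA_HOME points to {java_bin}"
        )
    return problems
```

Calling find_java_problems(os.environ) before starting Spark returns an empty list when the two locations agree, and otherwise a short list of mismatches to fix first.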

The workflow covered a complete Big Data + ML pipeline:

Select and research the dataset
Set up the Hadoop and Spark environment
Load data into Hadoop HDFS
Read data from HDFS using Spark
Run Spark SQL queries for initial data exploration
Clean and preprocess the data
Perform EDA to understand the data’s distribution and characteristics
Select appropriate features
Train and evaluate the Machine Learning model
Extract business insight from the analysis results
Finalize the report and presentation
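
The load, read, and query steps in the list above can be sketched as follows. This is a minimal illustration rather than the project’s actual code: the file name, the HDFS directory, the view name records, and the hdfs://localhost:9000 address are all assumptions, and the dataset is assumed to be a CSV file with a header row.

```python
import subprocess


def hdfs_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command; -f overwrites an existing file."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def upload_to_hdfs(local_path, hdfs_dir):
    """Create the target directory in HDFS and copy the local file into it."""
    # Assumption: the `hdfs` launcher is on PATH. On Windows it may need to
    # be given as "hdfs.cmd" instead.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(hdfs_put_command(local_path, hdfs_dir), check=True)


def count_rows_with_spark_sql(hdfs_uri, view_name="records"):
    """Read a CSV from HDFS, register it as a view, and run a first query."""
    # Imported here so the HDFS helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-exploration").getOrCreate()
    # header and inferSchema are assumptions about the (unnamed) dataset.
    df = spark.read.csv(hdfs_uri, header=True, inferSchema=True)
    df.printSchema()
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT COUNT(*) AS row_count FROM {view_name}")
```

Under these assumptions, a run might call upload_to_hdfs("data.csv", "/data/raw") and then count_rows_with_spark_sql("hdfs://localhost:9000/data/raw/data.csv").show(), with further Spark SQL queries issued against the same view.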

Data preprocessing was also the part that took the most time. It was the first project that made clear why data needs to be carefully cleaned, normalized, and reduced to well-chosen features before it goes into Machine Learning, rather than jumping straight to the training step as in many other assignments.
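
As a rough illustration of that preprocessing, the sketch below drops duplicate and incomplete rows, keeps only the numeric columns as candidate features, and standardizes them with Spark ML. The project’s real columns, cleaning rules, and feature-selection method aren’t described here, so the function names, the label column, and the "numeric columns only" rule are all hypothetical stand-ins.

```python
# Type names as they appear in DataFrame.dtypes for common numeric columns.
NUMERIC_TYPES = ("tinyint", "smallint", "int", "bigint", "float", "double")


def numeric_feature_columns(dtypes, label_col):
    """Pick numeric columns, excluding the label, from (name, type) pairs."""
    return [
        name
        for name, dtype in dtypes
        if name != label_col and dtype in NUMERIC_TYPES
    ]


def clean_and_scale(df, label_col):
    """Clean a Spark DataFrame and add a standardized `features` column."""
    # Imported here so the helper above can be used without PySpark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    # Cleaning: remove exact duplicate rows and rows with any null value.
    cleaned = df.dropDuplicates().na.drop()

    feature_cols = numeric_feature_columns(cleaned.dtypes, label_col)

    # Assemble the chosen columns into one vector, then standardize it
    # to zero mean and unit variance.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
    scaler = StandardScaler(
        inputCol="raw_features",
        outputCol="features",
        withMean=True,
        withStd=True,
    )
    model = Pipeline(stages=[assembler, scaler]).fit(cleaned)
    return model.transform(cleaned)
```

The resulting features column is in the form Spark ML estimators expect, so a model can be trained on it directly. Dropping every row with a null is the bluntest possible cleaning rule; a real dataset would usually call for per-column decisions.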

Results

Successfully set up a stable, working Hadoop and Apache Spark environment
Completed an end-to-end data pipeline from raw data to Machine Learning
Performed Data Cleaning and Feature Engineering to support model training
Applied Spark SQL to query and extract data at scale
Extracted concrete business insight from the analysis results
Completed the project with a perfect score of 10/10

The biggest takeaway

This was my favorite technical project of the entire degree, not because of the Machine Learning model, but because it was the first time I set up and ran a real Big Data environment from scratch. Debugging that environment, from JDK conflicts to switching development tools, taught a skill more valuable than the code itself: the ability to solve problems when nothing works the way the theory says it should.

Beyond that, the project built an intuition for how Data Engineering, Data Analytics, and Machine Learning connect as one continuous chain, rather than three separate, disconnected pieces.

Limitations

The deployment ran on a single node, so it didn’t fully exploit the true distributed capability of Hadoop and Spark
The Kafka streaming component remained at the research and basic experimentation stage
Not deployed on an actual Cloud platform or cluster
Performance wasn’t optimized for very large-scale data

If I did it again

Deploy Hadoop and Spark on a multi-node cluster to properly leverage their distributed nature
Extend into real-time data processing using Kafka and Spark Structured Streaming
Incorporate Airflow to build an automated ETL pipeline
Optimize Spark performance using Partitioning, Caching, and Broadcast Join
Deploy on a Cloud platform such as AWS EMR or Databricks
Build a visual dashboard to monitor the entire pipeline and the Machine Learning results
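
To make the Spark optimization item above more concrete, here is a small sketch of how partitioning, caching, and a broadcast join might fit together. None of this was part of the project: the function names, the join key, and the partition count are hypothetical. The one fixed fact is the 10 MB figure, which is Spark’s default value for spark.sql.autoBroadcastJoinThreshold.

```python
# Spark's default for spark.sql.autoBroadcastJoinThreshold, in bytes (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def small_enough_to_broadcast(size_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Rule of thumb: broadcast only tables at or below the threshold."""
    return 0 <= size_bytes <= threshold


def join_with_lookup(facts, lookup, key, num_partitions=8):
    """Join a large DataFrame to a small lookup table without shuffling it."""
    # Imported here so the size check above can be used without PySpark.
    from pyspark.sql.functions import broadcast

    # Partitioning: spread the large table across partitions by the join key.
    # Caching: keep it in memory because later queries are assumed to reuse it.
    facts = facts.repartition(num_partitions, key).cache()

    # Broadcast join: ship the small table to every executor.
    return facts.join(broadcast(lookup), on=key, how="left")
```

On a single node these choices change little, which is why they belong with the multi-node deployment: the partition count and the decision to cache only pay off once data is actually spread across executors.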
