Pillar project / 05

Big Data Pipeline on Hadoop & Apache Spark

Building an end-to-end big data and machine learning workflow

The problem: big data doesn’t create value on its own

Businesses need to extract value from large datasets to support decision-making, but that data is often scattered, messy, and hard to process with traditional tools such as Excel or a single database. This capstone project spans Big Data, Data Engineering, and Machine Learning. Rather than focusing only on building a predictive model, it walks through the entire lifecycle of a real data project: choosing a dataset, cleaning it, exploring it, engineering features, training a model, and extracting business insight.

The team set up a Hadoop and Apache Spark environment to store, process, and analyze data with tools designed for scale, then built a pipeline on this distributed platform to standardize the data, extract information, and train a Machine Learning model for analysis and prediction.

Role: Team Leader, coordinating while staying hands-on

In a three-person team, the Team Leader role went beyond assigning tasks; it also meant carrying out most of the core technical steps directly:

  • Assigned work and tracked the team’s progress
  • Researched the dataset and prepared the data
  • Loaded data into Hadoop HDFS
  • Read data from HDFS using Spark and ran sets of Spark SQL queries
  • Performed Exploratory Data Analysis (EDA)
  • Preprocessed data for Feature Selection and Machine Learning
  • Wrote the report and designed the presentation slides

Beyond the assigned scope, the role also involved independently researching Kafka, PowerShell, and EDA techniques that weren’t covered in class.

The process: when the hardest technical problem isn’t the model

The most interesting part of this project is that the hardest problem wasn’t Machine Learning. It was getting the Hadoop and Apache Spark environment set up and running in the first place. During implementation, configuring Spark in PyCharm repeatedly ran into JDK conflicts. After multiple failed attempts, the workaround was to run the Python files from Visual Studio Code combined with PowerShell, which stabilized the environment and kept development moving. It was a practical engineering call that wasn’t in any textbook, but it was necessary to keep the project alive.
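
The case study doesn’t record the exact cause of the JDK conflicts, so the following is only a minimal sketch of the kind of pre-flight check that can surface one common cause: `JAVA_HOME` pointing at one JDK while `PATH` resolves `java` to another. The function name `find_java_problems` and its parameters are hypothetical and not taken from the project’s code.

```python
import os
import shutil
from pathlib import Path


def find_java_problems(env=None, which=shutil.which):
    """Return a list of readable problems with the Java setup.

    An empty list means nothing obviously wrong was found.
    `env` and `which` are injectable so the check is easy to test.
    """
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    else:
        java_bin = Path(java_home) / "bin"
        # The launcher is java.exe on Windows and java elsewhere.
        if not any((java_bin / name).exists() for name in ("java", "java.exe")):
            problems.append(f"no java launcher found under {java_bin}")

    java_on_path = which("java")
    if java_on_path is None:
        problems.append("java is not on PATH")
    elif java_home:
        home = os.path.normcase(os.path.abspath(java_home))
        found = os.path.normcase(os.path.abspath(java_on_path))
        # Two different JDKs (one in JAVA_HOME, one on PATH) is a
        # frequent source of confusing Spark startup errors.
        if not found.startswith(home + os.sep):
            problems.append(
                f"PATH resolves java to {java_on_path}, outside JAVA_HOME"
            )

    return problems


if __name__ == "__main__":
    for problem in find_java_problems():
        print("WARNING:", problem)
```

Running a check like this in the same terminal that launches the Spark job shows what that shell actually sees, which can differ from what an IDE’s own run configuration uses.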

The workflow covered a complete Big Data + ML pipeline:

  1. Select and research the dataset
  2. Set up the Hadoop and Spark environment
  3. Load data into Hadoop HDFS
  4. Read data from HDFS using Spark
  5. Run Spark SQL queries for initial data exploration
  6. Clean and preprocess the data
  7. Perform EDA to understand the data’s distribution and characteristics
  8. Select appropriate features
  9. Train and evaluate the Machine Learning model
  10. Extract business insight from the analysis results
  11. Finalize the report and presentation
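
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration rather than the project’s actual code: it assumes a CSV dataset, the `hdfs` command-line tool on `PATH`, and PySpark installed. The function names, the `raw` view name, and any paths are hypothetical.

```python
import subprocess


def hdfs_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a local file into HDFS."""
    # -f overwrites an existing file so the step can be re-run safely.
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def load_into_hdfs(local_path, hdfs_dir):
    """Step 3: copy the raw file into HDFS (needs the hdfs CLI on PATH)."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(hdfs_put_command(local_path, hdfs_dir), check=True)


def explore_with_spark_sql(hdfs_uri, group_col):
    """Steps 4-5: read the CSV from HDFS and run a first Spark SQL query."""
    # Imported here so the helpers above work even where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
    df = spark.read.csv(hdfs_uri, header=True, inferSchema=True)
    df.createOrReplaceTempView("raw")
    # Row counts per category are a typical first look at a new dataset.
    result = spark.sql(
        f"SELECT `{group_col}`, COUNT(*) AS n FROM raw "
        f"GROUP BY `{group_col}` ORDER BY n DESC"
    )
    result.show(10)
    return result
```

A call might look like `explore_with_spark_sql("hdfs://localhost:9000/data/raw/dataset.csv", "category")`, where the URI and column name are placeholders that depend on the local Hadoop configuration and the dataset.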

Data preprocessing also took the most time. This was the first project to make it truly clear why data needs to be carefully cleaned, normalized, and reduced to well-chosen features before it goes into Machine Learning, rather than jumping straight to training as in many other assignments.
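
To make the cleaning and normalization point concrete, here is a small sketch in plain Python so it runs without a Spark installation. The rows and the column names `income` and `age` are invented for illustration and are not the project’s dataset. In PySpark, similar steps would typically map to `DataFrame.dropna` and `MinMaxScaler` from `pyspark.ml.feature`.

```python
def drop_incomplete(rows):
    """Keep only rows where every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def min_max_scale(rows, column):
    """Rescale one numeric column to the [0, 1] range."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column carries no signal; map it to 0.0 rather than
    # dividing by zero.
    return [
        {**row, column: (row[column] - lo) / span if span else 0.0}
        for row in rows
    ]


# Invented toy rows: one has a missing value, and the two columns
# live on very different scales.
raw = [
    {"income": 20000, "age": 25},
    {"income": None, "age": 40},
    {"income": 80000, "age": 55},
    {"income": 50000, "age": 40},
]

clean = drop_incomplete(raw)
scaled = min_max_scale(min_max_scale(clean, "income"), "age")
print(scaled)
```

Without the scaling step, a distance-based or gradient-based model would see `income` differences in the tens of thousands next to `age` differences in the tens, so one feature would dominate simply because of its units.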

Results

  • Successfully set up a stable, working Hadoop and Apache Spark environment
  • Completed an end-to-end data pipeline from raw data to Machine Learning
  • Performed Data Cleaning and Feature Engineering to support model training
  • Applied Spark SQL to query and extract data at scale
  • Extracted concrete business insight from the analysis results
  • Completed the project with a perfect score of 10/10

The biggest takeaway

This was the favorite technical project of the entire degree, not because of the Machine Learning model but because it was the first time a real Big Data environment was set up and run successfully from scratch. Debugging the environment, from JDK conflicts to switching development tools, taught a skill more valuable than the code itself: the ability to solve problems when nothing works the way the theory says it should.

Beyond that, the project built an intuition for how Data Engineering, Data Analytics, and Machine Learning connect as one continuous chain rather than three disconnected pieces.

Limitations

  • The deployment ran on a single node, so it didn’t exercise the distributed capabilities of Hadoop and Spark
  • The Kafka streaming component remained at the research and basic experimentation stage
  • Not deployed on an actual Cloud platform or cluster
  • Performance wasn’t optimized for very large-scale data

If I did it again

  • Deploy Hadoop and Spark on a multi-node cluster to properly leverage their distributed nature
  • Extend into real-time data processing using Kafka and Spark Structured Streaming
  • Incorporate Airflow to build an automated ETL pipeline
  • Optimize Spark performance using Partitioning, Caching, and Broadcast Join
  • Deploy on a Cloud platform such as AWS EMR or Databricks
  • Build a visual dashboard to monitor the entire pipeline and the Machine Learning results
