Distributed Movie Recommendation Pipelines with Apache Spark
Problem Statement & Motivation
Modern data-driven applications in video streaming rely on real-time analytics and personalization to provide relevant content. This project aims to explore large-scale, distributed data processing for movie recommendation tasks using Apache Spark. By implementing Spark-based pipelines, we address:
- The need for scalable movie rating aggregation and analytics.
- The demand for efficient personalized recommendation systems using distributed collaborative filtering.
- The use of locality-sensitive hashing (LSH) for scalable similarity search over keyword-based movie descriptions (e.g., genre sets).
Our Method
We implemented a multi-phase pipeline over the MovieLens dataset, covering the phases below (illustrative code sketches for each phase follow the list):
1. Data Preprocessing & Analytics
- Loaded movie metadata and rating logs from pipe-separated files.
- Implemented Spark RDD transformations for grouping, filtering, and rating statistics (e.g., most-rated movies by year or genre).
2. Rating Aggregation & Incremental Updates
- Built an aggregator for computing average ratings per movie.
- Implemented incremental update logic so aggregates can be maintained efficiently as new ratings arrive.
3. Keyword-based Filtering
- Supported genre/keyword filtering in two variants:
- A standard RDD filter
- An optimized filter that ships the keyword set to executors via broadcast variables
4. Prediction Models & Recommenders
- Implemented a baseline predictor using user rating normalization.
- Applied collaborative filtering via ALS using Spark MLlib.
- Developed LSH-based indexing for near-neighbor search on movie genres.
- Combined LSH + prediction for personalized movie recommendations.
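The sketches below illustrate the four phases under simplified assumptions: file names, column layouts, and parameter values are illustrative placeholders rather than the exact ones used in the project. The first sketch loads hypothetical pipe-separated movie and rating files into RDDs and computes one of the analytics mentioned above, the most-rated movie per release year.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record layouts; the real project schema may differ.
case class MovieRecord(id: Int, title: String, year: Int, genres: Set[String])
case class RatingRecord(userId: Int, movieId: Int, rating: Double)

object PreprocessingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("preprocessing-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical pipe-separated layouts:
    //   data/movies.dat : movieId|title|year|genre1,genre2,...
    //   data/ratings.dat: userId|movieId|rating
    val movies = sc.textFile("data/movies.dat").map { line =>
      val f = line.split('|')
      MovieRecord(f(0).toInt, f(1), f(2).toInt, f(3).split(',').toSet)
    }
    val ratings = sc.textFile("data/ratings.dat").map { line =>
      val f = line.split('|')
      RatingRecord(f(0).toInt, f(1).toInt, f(2).toDouble)
    }

    // Example analytic: the most-rated movie for each release year.
    val ratingCounts = ratings.map(r => (r.movieId, 1L)).reduceByKey(_ + _)
    val mostRatedPerYear = movies
      .map(m => (m.id, m))
      .join(ratingCounts)                                   // (movieId, (movie, count))
      .map { case (_, (m, count)) => (m.year, (m.title, count)) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)    // keep the larger count per year

    mostRatedPerYear.sortByKey().collect().foreach {
      case (year, (title, count)) => println(s"$year: $title ($count ratings)")
    }

    spark.stop()
  }
}
```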
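For phase 2, one way to make the aggregates incrementally updatable is to keep a per-movie (sum, count) pair instead of the average itself, so a batch of new ratings can be folded into the existing state without rescanning the full rating history. The state layout below is an assumption for illustration, not necessarily the project's exact aggregator.

```scala
import org.apache.spark.rdd.RDD

// Incremental average-rating aggregation: state maps movieId -> (sum, count).
object IncrementalAverages {
  type Agg = (Double, Long) // (sum of ratings, number of ratings)

  // Build the initial state from historical (movieId, rating) pairs.
  def initial(ratings: RDD[(Int, Double)]): RDD[(Int, Agg)] =
    ratings.mapValues(r => (r, 1L))
           .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

  // Fold a batch of new ratings into the existing state.
  def update(state: RDD[(Int, Agg)], newRatings: RDD[(Int, Double)]): RDD[(Int, Agg)] =
    state.fullOuterJoin(initial(newRatings)).mapValues {
      case (oldAgg, newAgg) =>
        val (s1, c1) = oldAgg.getOrElse((0.0, 0L))
        val (s2, c2) = newAgg.getOrElse((0.0, 0L))
        (s1 + s2, c1 + c2)
    }

  // Derive the average rating per movie from the aggregate state.
  def averages(state: RDD[(Int, Agg)]): RDD[(Int, Double)] =
    state.mapValues { case (sum, count) => sum / count }
}
```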
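For phase 3, the broadcast-variable variant ships the keyword set to each executor once instead of serializing it with every task closure, which matters when the filter set is reused across many partitions or jobs. A minimal sketch, assuming movies are represented as (movieId, genre set) pairs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object KeywordFilter {
  // Keep movies whose genre set contains all requested keywords.
  def filterWithBroadcast(sc: SparkContext,
                          movies: RDD[(Int, Set[String])],
                          keywords: Set[String]): RDD[(Int, Set[String])] = {
    val kw: Broadcast[Set[String]] = sc.broadcast(keywords) // shipped once per executor
    movies.filter { case (_, genres) => kw.value.subsetOf(genres) }
  }
}
```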
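For phase 4, MLlib's RDD-based ALS can be trained directly on Rating records. The hyperparameters below (rank, iteration count, regularization) are placeholders rather than the values tuned in the project, and the normalization-based baseline predictor is omitted for brevity.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

object AlsRecommender {
  // Train a matrix-factorization model with illustrative hyperparameters.
  def train(ratings: RDD[Rating]): MatrixFactorizationModel =
    ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10, /* lambda = */ 0.1)

  // Top-N recommended movies for one user, as (user, product, predicted rating) triples.
  def topN(model: MatrixFactorizationModel, userId: Int, n: Int): Array[Rating] =
    model.recommendProducts(userId, n)
}
```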
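Finally, a MinHash-based LSH index over genre sets groups movies into buckets so that near neighbors (high Jaccard similarity) can be found without comparing every pair of movies; candidates from matching buckets can then be scored by the predictor to produce the personalized recommendations described above. The construction below is one possible sketch, assuming genres have been mapped to integer ids and every movie has at least one genre; signature and band sizes are illustrative.

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

object GenreLsh extends Serializable {
  val numHashes = 16
  val bands = 4
  val rowsPerBand = numHashes / bands
  val prime = 2147483647L // large prime for the hash family (a * x + b) mod p

  // One random linear hash function per signature position.
  private val rng = new Random(42)
  private val hashParams: Seq[(Long, Long)] =
    Seq.fill(numHashes)((rng.nextInt(Int.MaxValue).toLong + 1, rng.nextInt(Int.MaxValue).toLong))

  // MinHash signature of a (non-empty) genre-id set.
  private def minHash(genreIds: Set[Int]): Array[Long] =
    hashParams.map { case (a, b) => genreIds.map(g => (a * g + b) % prime).min }.toArray

  // Bucket movies by (band id, band signature); movies sharing a bucket are candidate neighbors.
  def buckets(movies: RDD[(Int, Set[Int])]): RDD[((Int, Seq[Long]), Iterable[Int])] =
    movies.flatMap { case (movieId, genreIds) =>
      val sig = minHash(genreIds)
      sig.grouped(rowsPerBand).zipWithIndex.map {
        case (band, bandId) => ((bandId, band.toSeq), movieId)
      }
    }.groupByKey()
}
```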
Evaluation
The system supports both local development and distributed Spark execution on a cluster (via EPFL IC infrastructure). It was validated through:
- Automated unit tests provided by CS460 instructors.
- Performance benchmarking on input datasets of increasing size.
- Checking that top-N recommendations for users are consistent with their previous ratings.