Distributed Movie Recommendation Pipelines with Apache Spark

Problem Statement & Motivation

Modern data-driven applications in video streaming rely on real-time analytics and personalization to provide relevant content. This project explores large-scale, distributed data processing for movie recommendation tasks using Apache Spark. By implementing Spark-based pipelines, we address:

  • The need for scalable movie rating aggregation and analytics.
  • The demand for efficient personalized recommendation systems using distributed collaborative filtering.
  • The use of locality-sensitive hashing (LSH) for scalable similarity search in keyword-based movie descriptions.

Our Method

We implemented a multi-phase pipeline based on the MovieLens dataset, covering:

1. Data Preprocessing & Analytics

  • Loaded movie metadata and rating logs from pipe-separated files.
  • Implemented Spark RDD transformations for grouping, filtering, and rating statistics (e.g., most-rated movies by year or genre); a sketch follows this list.
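
The snippet below is a minimal spark-shell sketch of this phase; the file paths, column layouts, and variable names are illustrative assumptions, not the project's exact schema.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MovieAnalytics").master("local[*]").getOrCreate()
val sc: SparkContext = spark.sparkContext

// Assumed pipe-separated layouts (illustrative):
//   data/movies.dat : movieId|title|year|genres
//   data/ratings.dat: userId|movieId|rating|timestamp
val movies = sc.textFile("data/movies.dat")
  .map(_.split('|'))
  .map(f => (f(0).toInt, (f(1), f(2).toInt)))   // movieId -> (title, year)

val ratingCounts = sc.textFile("data/ratings.dat")
  .map(_.split('|'))
  .map(f => (f(1).toInt, 1L))                   // movieId -> one rating event
  .reduceByKey(_ + _)                           // movieId -> number of ratings

// Most-rated movies: join counts with metadata and take the top 10 by count.
val mostRated = ratingCounts
  .join(movies)
  .map { case (_, (count, (title, year))) => (count, (title, year)) }
  .sortByKey(ascending = false)
  .take(10)

mostRated.foreach { case (count, (title, year)) => println(s"$title ($year): $count ratings") }
```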

2. Rating Aggregation & Incremental Updates

  • Built an aggregator for computing average ratings per movie.
  • Enabled incremental update logic so aggregates can be maintained efficiently as new ratings arrive (see the sketch after this list).
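
A minimal sketch of the (sum, count) representation behind such an aggregator, assuming ratings arrive as (movieId, rating) pairs; the function names are illustrative, not the project's API.

```scala
import org.apache.spark.rdd.RDD

// Per-movie state is (sum of ratings, number of ratings): the average is
// derivable from it and it is cheap to update incrementally.
def aggregate(ratings: RDD[(Int, Double)]): RDD[(Int, (Double, Long))] =
  ratings
    .mapValues(r => (r, 1L))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

// Fold a (typically much smaller) batch of new ratings into the existing
// aggregates instead of recomputing from the full rating log.
def incrementalUpdate(current: RDD[(Int, (Double, Long))],
                      newRatings: RDD[(Int, Double)]): RDD[(Int, (Double, Long))] =
  current.fullOuterJoin(aggregate(newRatings)).mapValues {
    case (Some((s, c)), Some((ds, dc))) => (s + ds, c + dc)
    case (Some(agg), None)              => agg
    case (None, Some(delta))            => delta
    case (None, None)                   => (0.0, 0L)   // unreachable; keeps the match exhaustive
  }

def averageRatings(aggs: RDD[(Int, (Double, Long))]): RDD[(Int, Double)] =
  aggs.mapValues { case (sum, count) => sum / count }
```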

3. Keyword-based Filtering

  • Supported genre/keyword filtering in two variants (a broadcast-based sketch follows this list):
    • Standard filtering via plain RDD closures.
    • Broadcast variables, which ship the keyword set to each executor once instead of with every task closure.
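
A sketch of the broadcast variant, assuming movies are keyed by id with a genre/keyword set in the value; names and types are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Keep only movies whose genre/keyword set intersects the requested keywords.
def filterByKeywords(sc: SparkContext,
                     movies: RDD[(Int, (String, Set[String]))],   // movieId -> (title, genres)
                     keywords: Set[String]): RDD[(Int, (String, Set[String]))] = {
  // Broadcast the (small) keyword set once per executor instead of
  // serializing it into every task closure.
  val bKeywords = sc.broadcast(keywords)
  movies.filter { case (_, (_, genres)) => genres.exists(bKeywords.value.contains) }
}
```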

4. Prediction Models & Recommenders

  • Implemented a baseline predictor based on per-user rating normalization.
  • Applied collaborative filtering via ALS from Spark MLlib (an ALS sketch follows this list).
  • Developed LSH-based indexing for near-neighbor search over movie genres (a MinHash sketch also follows).
  • Combined the LSH index with the predictors to produce personalized movie recommendations.
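
The ALS step can be sketched with Spark MLlib's RDD-based API roughly as follows; the hyperparameters and function names are illustrative, not the project's tuned values.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Train an ALS factorization on (userId, movieId, rating) triples.
def trainALS(ratings: RDD[(Int, Int, Double)]): MatrixFactorizationModel = {
  val data = ratings.map { case (user, movie, rating) => Rating(user, movie, rating) }
  ALS.train(data, rank = 8, iterations = 10, lambda = 0.1)
}

// Top-N personalized recommendations for a single user.
def recommendTopN(model: MatrixFactorizationModel, userId: Int, n: Int): Array[Rating] =
  model.recommendProducts(userId, n)
```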
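
For the LSH index, the core idea is a MinHash signature per movie over its genre/keyword set; the class below is only a sketch of that idea, not the project's exact index structure.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

class MinHashIndex(numHashes: Int, seed: Int = 42) extends Serializable {
  // One random seed per hash function.
  private val seeds: Array[Int] = {
    val rng = new Random(seed)
    Array.fill(numHashes)(rng.nextInt())
  }

  // MinHash signature: for each hash function, the minimum hash over the
  // keyword set (assumes the set is non-empty).
  def signature(keywords: Set[String]): Array[Int] =
    seeds.map(s => keywords.map(k => MurmurHash3.stringHash(k, s)).min)

  // Fraction of matching positions estimates the Jaccard similarity of the sets;
  // bucketing signatures into bands then yields candidate near neighbors.
  def estimatedJaccard(a: Array[Int], b: Array[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / numHashes
}
```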

Evaluation

The system supports both local development and distributed Spark execution on a cluster (via EPFL IC infrastructure). It was validated through:

  • Automated unit tests provided by CS460 instructors.
  • Performance benchmarking on datasets of varying sizes (small to large).
  • Checks that top-N recommendations for users are consistent with their previous ratings.