GemmaEdu: Enhancing Scientific Learning via Fine-Tuned Language Models and RAG


Problem Statement & Motivation

Large language models are increasingly central to student learning, yet they often answer scientific multiple-choice questions (MCQs) incorrectly, particularly in STEM subjects. Off-the-shelf models are poorly aligned with educational contexts and struggle to return concise, well-grounded responses.

To address this, GemmaEdu is a fine-tuned, preference-aligned educational assistant that integrates retrieval to improve factual precision and contextual relevance.

Our Method

We built an educational chatbot on top of a quantized Gemma 7B model and improved it through three components:

1. Fine-Tuning with Direct Preference Optimization (DPO)

  • Preference data was generated by EPFL Master's students for roughly 2,000 MCQs spanning CS, physics, EE, and AI topics.
  • We fine-tuned the model with both DPO-only and SFT + DPO approaches.
  • The best-performing model used DPO alone on a merged dataset (M1 + Argilla), improving both accuracy and margin scores; a rough training sketch follows.
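
Neither the training code nor the exact hyperparameters are reproduced in this write-up; the snippet below is only a minimal sketch of DPO fine-tuning a 4-bit quantized Gemma 7B with LoRA adapters using Hugging Face TRL. The model ID, dataset path, column names, and hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Hedged sketch: DPO fine-tuning of a 4-bit quantized Gemma 7B with LoRA (recent TRL).
# The model ID, dataset path, and hyperparameters below are illustrative assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_id = "google/gemma-7b-it"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Preference pairs with "prompt", "chosen", "rejected" columns
# (e.g., the merged student + Argilla data described above).
train_dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

args = DPOConfig(
    output_dir="gemma-edu-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
    beta=0.1,                      # DPO temperature
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # older TRL versions use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```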

2. Retrieval-Augmented Generation (RAG)

  • We built a RAG pipeline using LlamaIndex + ChromaDB and the BAAI/bge-small-en-v1.5 embedding model; an indexing sketch follows this list.
  • Documents included:
    • Open-source textbooks across relevant subjects.
    • Wikipedia pages retrieved via GPT-generated keywords.
    • Keyword summaries extracted from student questions with RAKE.
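
As a rough illustration of how such an index can be assembled from these components (the corpus directory, collection name, and top-k value are assumptions, not the project's actual settings):

```python
# Hedged sketch: building and querying a RAG index with LlamaIndex + ChromaDB
# and the BAAI/bge-small-en-v1.5 embedding model. Paths and parameters are
# illustrative assumptions.
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Textbooks, Wikipedia extracts, and RAKE keyword summaries gathered under ./corpus
documents = SimpleDirectoryReader("./corpus").load_data()

chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("gemma_edu")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Retrieve the top-k chunks that get prepended to the MCQ prompt.
retriever = index.as_retriever(similarity_top_k=3)
for node in retriever.retrieve("What is the time complexity of binary search?"):
    print(node.score, node.get_content()[:80])
```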

3. Prompt Engineering

  • We experimented with Chain-of-Thought (CoT) prompts, few-shot examples, and verbalizers to extract single-letter MCQ answers.
  • The final pipeline first generates a hint, then prompts the model for a concise final choice (e.g., “A”, “B”); a sketch of this two-step pattern follows.
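
The exact prompt templates are not shown here; the snippet below only sketches the two-step pattern (hint first, then a constrained single-letter answer) with a simple regex verbalizer. The `generate` callable is a hypothetical wrapper around the fine-tuned model, and the prompt wording is illustrative.

```python
# Hedged sketch of the two-step MCQ prompting pattern with a regex verbalizer.
# `generate` is a hypothetical helper wrapping the fine-tuned Gemma model;
# prompt wording is illustrative, not the project's exact template.
import re

def answer_mcq(generate, question: str, options: dict[str, str], context: str = "") -> str:
    option_block = "\n".join(f"{letter}. {text}" for letter, text in options.items())

    # Step 1: ask for a short hint / reasoning step before committing to a letter.
    hint_prompt = (
        f"{context}\n\nQuestion: {question}\n{option_block}\n\n"
        "Give a short hint explaining which option is correct."
    )
    hint = generate(hint_prompt)

    # Step 2: ask for a single-letter answer, conditioned on the hint.
    answer_prompt = (
        f"Question: {question}\n{option_block}\nHint: {hint}\n"
        "Answer with a single letter (A, B, C, or D):"
    )
    raw = generate(answer_prompt)

    # Verbalizer: extract the first standalone option letter from the output.
    match = re.search(r"\b([A-D])\b", raw)
    return match.group(1) if match else "A"  # fall back to a default letter if none found
```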

Evaluation & Benchmarks

  • Dataset: Subset of MMLU (focusing on 22 STEM subjects).
  • Metrics: Accuracy and margin on MCQ responses (a small metric sketch follows this list).
  • Baselines:
    • Gemma-7B
    • Llama2-7B
    • Mistral-7B
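
The write-up does not spell out how the margin metric is computed; one common reading is the gap between the model's score for the correct option and its best-scoring incorrect option. The helper below is a small sketch under that assumption, using a hypothetical per-option score format.

```python
# Hedged sketch: MCQ accuracy and margin. "Margin" is assumed to be the gap
# between the score of the correct option and the best-scoring wrong option;
# the original write-up does not define it precisely.
from statistics import mean

def accuracy_and_margin(examples: list[dict]) -> tuple[float, float]:
    """Each example: {"scores": {"A": float, ...}, "answer": "B"}."""
    correct, margins = [], []
    for ex in examples:
        scores, gold = ex["scores"], ex["answer"]
        predicted = max(scores, key=scores.get)      # option with the highest score
        correct.append(predicted == gold)
        best_wrong = max(v for k, v in scores.items() if k != gold)
        margins.append(scores[gold] - best_wrong)    # positive when the gold answer is preferred
    return mean(correct), mean(margins)

# Usage with per-option log-likelihoods (numbers are illustrative):
examples = [
    {"scores": {"A": -2.1, "B": -0.4, "C": -3.0, "D": -2.8}, "answer": "B"},
    {"scores": {"A": -1.0, "B": -1.2, "C": -0.9, "D": -2.5}, "answer": "A"},
]
acc, margin = accuracy_and_margin(examples)
print(f"accuracy={acc:.2f}, margin={margin:.2f}")
```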

Results

  • Accuracy improved across all categories, notably in physics and EE.
  • Our model outperforms all baselines in most subjects and in overall MMLU STEM performance.
  • RAG integration showed selective improvements, especially outside the computer science domain.

Accuracy (%) on MMLU STEM subsets:

| Subject | GemmaEdu (RAG) | Gemma | Mistral | Llama2 |
| --- | --- | --- | --- | --- |
| Physics | 41.2 | 34.3 | 24.5 | 21.5 |
| EE | 51.7 | 46.2 | 39.3 | 24.0 |
| CS | 42.0 | 42.0 | 32.0 | 23.0 |
| Overall STEM | 44.4 | 36.4 | 36.0 | 35.2 |

Challenges & Future Work

  • Noise in the student-generated preference data limited the effectiveness of DPO training.
  • Tuning RAG chunk sizes and prompt wording proved crucial for effectiveness.
  • Future improvements include better prompt patterns for concise answer generation.