Speeding Up Graph Similarity Matching with Efficient Tensor Ops
The graph similarity algorithm for matching image-text graph pairs was too slow, with the pairwise comparison step dominating runtime; batched tensor operations removed the bottleneck.
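As a minimal sketch of the batched approach: instead of looping over node pairs in Python, represent each graph as a matrix of node embeddings and compute all pairwise similarities in one matrix multiply. The names and shapes here are illustrative, not the post's actual implementation.

```python
import numpy as np

def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`
    in a single batched matrix multiply (no Python-level pair loop)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T  # shape: (len(a), len(b))

# Hypothetical image/text node embeddings for one graph pair.
image_nodes = np.random.rand(128, 64)  # 128 nodes, 64-dim embeddings
text_nodes = np.random.rand(96, 64)    # 96 nodes, same embedding dim
sim = pairwise_cosine(image_nodes, text_nodes)  # (128, 96) similarity matrix
```

The same pattern applies directly in PyTorch (`a_norm @ b_norm.T` on GPU tensors), where it also avoids host-device synchronization inside the loop.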
Grouping similar-length samples into the same batch minimizes padding-related VRAM waste and stabilizes throughput in NLP tasks.
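A minimal sketch of length bucketing, assuming samples are already token-ID lists; the helper name is illustrative:

```python
import random

def bucket_by_length(samples, batch_size):
    """Sort samples by token length, cut into batches of similar-length
    sequences (so padding-to-max wastes little memory), then shuffle the
    batch order so training still sees varied lengths each step."""
    ordered = sorted(samples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)
    return batches

# Mixed-length dummy samples: without bucketing, a length-3 sequence
# batched with a length-52 one pads 49 tokens of pure waste.
data = [[0] * n for n in [3, 50, 4, 48, 5, 52]]
batches = bucket_by_length(data, batch_size=2)
```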
PPO and GRPO training with models larger than 7B parameters caused OOM errors on A100 GPUs due to the multiple full model replicas these algorithms keep in memory. This post details optimization strategies to fix it.
Improving distributed training speed using vLLM, Flash Attention, LoRA, gradient checkpointing, and stable checkpoint recovery across multi-node systems.
Efficiently mining structured text or graphs using GPT-4 APIs while staying under a 2M tokens-per-minute (TPM) rate limit.
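One way to stay under a TPM budget is client-side throttling with a sliding one-minute window. This is a generic sketch, not the post's pipeline: `estimate_tokens` and `call_api` are hypothetical callables you supply (e.g. a tokenizer-based estimate and the actual API client).

```python
import time

def run_under_tpm(requests, estimate_tokens, call_api, tpm_limit=2_000_000):
    """Client-side throttle: track tokens spent in the last 60 seconds
    and sleep before a request would push usage past the TPM budget."""
    window = []   # (timestamp, tokens) for recent calls
    results = []
    for req in requests:
        cost = estimate_tokens(req)
        now = time.monotonic()
        # Drop entries older than one minute from the window.
        window = [(t, n) for t, n in window if now - t < 60]
        used = sum(n for _, n in window)
        if used + cost > tpm_limit and window:
            # Wait until the oldest in-window call ages out.
            time.sleep(max(0.0, 60 - (now - window[0][0])))
            window = window[1:]
        window.append((time.monotonic(), cost))
        results.append(call_api(req))
    return results
```

In practice you would also honor the rate-limit headers the API returns, since server-side accounting is authoritative; the client-side window just keeps you from hitting 429s in the first place.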
Correctly configuring AMP and autocast led to 2× faster training on NVIDIA GPUs.
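The core AMP pattern, sketched with a toy model: wrap the forward pass in `torch.autocast` and scale the loss with `GradScaler` so fp16 gradients do not underflow. This is a generic illustration of the API, not the post's training loop; on CPU it falls back to bfloat16, where the scaler is a no-op.

```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
use_cuda = torch.cuda.is_available()
# GradScaler only matters for fp16 on GPU; disabled, it passes through.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16)
y = torch.randn(8, 4)

# Forward pass under autocast: eligible ops (matmuls, convolutions)
# run in reduced precision; numerically sensitive ops stay in fp32.
with torch.autocast(device_type="cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # scale up loss so small grads survive fp16
scaler.step(opt)               # unscale grads, skip step if inf/nan found
scaler.update()
```

A common misconfiguration is autocasting the backward pass or the optimizer step as well; only the forward pass (and loss computation) should be inside the `autocast` context.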
Avoiding redundant tokenizer calls accelerated validation by up to 3× during fine-tuning.
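Since validation texts are identical every epoch, each unique string only needs to be tokenized once. A minimal sketch of the caching pattern with `functools.lru_cache`; `slow_tokenize` is a hypothetical stand-in for a real (e.g. Hugging Face) tokenizer call:

```python
from functools import lru_cache

def slow_tokenize(text: str) -> list:
    """Placeholder for an expensive tokenizer call."""
    return [ord(c) for c in text]  # stands in for real subword tokenization

@lru_cache(maxsize=None)
def cached_tokenize(text: str) -> tuple:
    # Returns a tuple (hashable/immutable) so results are safe to cache.
    # Repeated validation passes hit the cache instead of the tokenizer.
    return tuple(slow_tokenize(text))

val_texts = ["hello world", "foo bar", "hello world"]
for epoch in range(3):
    ids = [cached_tokenize(t) for t in val_texts]
```

For a fixed validation set, pre-tokenizing once up front (and storing the encoded batches) achieves the same effect without a cache at all; the `lru_cache` variant is just the smallest change to an existing per-epoch loop.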