Speeding Up Graph Similarity Matching with Efficient Tensor Ops
The graph similarity algorithm for matching image-text graph pairs was too slow, with the pairwise comparison step dominating runtime; batched tensor operations removed the bottleneck.
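As a minimal sketch of the batched approach: instead of looping over node pairs in Python, represent each graph as a matrix of node embeddings and compute all pairwise similarities in one matrix multiply. The names and shapes here are illustrative, not the post's actual implementation.

```python
import numpy as np

def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`
    in a single batched matrix multiply (no Python-level pair loop)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T  # shape: (len(a), len(b))

# Hypothetical image/text node embeddings for one graph pair.
image_nodes = np.random.rand(128, 64)  # 128 nodes, 64-dim embeddings
text_nodes = np.random.rand(96, 64)    # 96 nodes, same embedding dim
sim = pairwise_cosine(image_nodes, text_nodes)  # (128, 96) similarity matrix
```

The same pattern applies directly in PyTorch (`a_norm @ b_norm.T` on GPU tensors), where it also avoids host-device synchronization inside the loop.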
Grouping similar-length samples into the same batch minimizes padding-related VRAM waste and stabilizes throughput in NLP tasks.
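A minimal sketch of length bucketing, assuming samples are already token-ID lists; the helper name is illustrative:

```python
import random

def bucket_by_length(samples, batch_size):
    """Sort samples by token length, cut into batches of similar-length
    sequences (so padding-to-max wastes little memory), then shuffle the
    batch order so training still sees varied lengths each step."""
    ordered = sorted(samples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)
    return batches

# Mixed-length dummy samples: without bucketing, a length-3 sequence
# batched with a length-52 one pads 49 tokens of pure waste.
data = [[0] * n for n in [3, 50, 4, 48, 5, 52]]
batches = bucket_by_length(data, batch_size=2)
```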
PPO and GRPO training with models larger than 7B parameters caused OOM errors on A100 GPUs due to the multiple full model replicas these algorithms keep in memory. This post details optimization strategies to fix it.
Improving distributed training speed using vLLM, Flash Attention, LoRA, gradient checkpointing, and stable checkpoint recovery across multi-node systems.
Efficiently mining structured text or graphs using GPT-4 APIs while staying under a 2M tokens-per-minute (TPM) rate limit.
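One way to stay under a TPM budget is client-side throttling with a sliding one-minute window. This is a generic sketch, not the post's pipeline: `estimate_tokens` and `call_api` are hypothetical callables you supply (e.g. a tokenizer-based estimate and the actual API client).

```python
import time

def run_under_tpm(requests, estimate_tokens, call_api, tpm_limit=2_000_000):
    """Client-side throttle: track tokens spent in the last 60 seconds
    and sleep before a request would push usage past the TPM budget."""
    window = []   # (timestamp, tokens) for recent calls
    results = []
    for req in requests:
        cost = estimate_tokens(req)
        now = time.monotonic()
        # Drop entries older than one minute from the window.
        window = [(t, n) for t, n in window if now - t < 60]
        used = sum(n for _, n in window)
        if used + cost > tpm_limit and window:
            # Wait until the oldest in-window call ages out.
            time.sleep(max(0.0, 60 - (now - window[0][0])))
            window = window[1:]
        window.append((time.monotonic(), cost))
        results.append(call_api(req))
    return results
```

In practice you would also honor the rate-limit headers the API returns, since server-side accounting is authoritative; the client-side window just keeps you from hitting 429s in the first place.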
Correctly configuring AMP and autocast led to 2× faster training on NVIDIA GPUs.
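The core AMP pattern, sketched with a toy model: wrap the forward pass in `torch.autocast` and scale the loss with `GradScaler` so fp16 gradients do not underflow. This is a generic illustration of the API, not the post's training loop; on CPU it falls back to bfloat16, where the scaler is a no-op.

```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
use_cuda = torch.cuda.is_available()
# GradScaler only matters for fp16 on GPU; disabled, it passes through.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16)
y = torch.randn(8, 4)

# Forward pass under autocast: eligible ops (matmuls, convolutions)
# run in reduced precision; numerically sensitive ops stay in fp32.
with torch.autocast(device_type="cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # scale up loss so small grads survive fp16
scaler.step(opt)               # unscale grads, skip step if inf/nan found
scaler.update()
```

A common misconfiguration is autocasting the backward pass or the optimizer step as well; only the forward pass (and loss computation) should be inside the `autocast` context.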
Avoiding redundant tokenizer calls accelerated validation by up to 3× during fine-tuning.
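Since validation texts are identical every epoch, each unique string only needs to be tokenized once. A minimal sketch of the caching pattern with `functools.lru_cache`; `slow_tokenize` is a hypothetical stand-in for a real (e.g. Hugging Face) tokenizer call:

```python
from functools import lru_cache

def slow_tokenize(text: str) -> list:
    """Placeholder for an expensive tokenizer call."""
    return [ord(c) for c in text]  # stands in for real subword tokenization

@lru_cache(maxsize=None)
def cached_tokenize(text: str) -> tuple:
    # Returns a tuple (hashable/immutable) so results are safe to cache.
    # Repeated validation passes hit the cache instead of the tokenizer.
    return tuple(slow_tokenize(text))

val_texts = ["hello world", "foo bar", "hello world"]
for epoch in range(3):
    ids = [cached_tokenize(t) for t in val_texts]
```

For a fixed validation set, pre-tokenizing once up front (and storing the encoded batches) achieves the same effect without a cache at all; the `lru_cache` variant is just the smallest change to an existing per-epoch loop.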