Scaling Data Mining with API Efficiency Under TPM Limits

Issue

We needed to mine and transform large text datasets into structured formats like:

  • Structured JSON from raw text
  • Graphs from code or document relationships

However, GPT-4’s 2M tokens-per-minute (TPM) rate limit and high per-request latency created a scaling bottleneck.

Solution

We implemented a parallel, optimized API pipeline using LangChain, with full token tracking and queue management:

1. Query Batching with Subtasks

  • Split tasks into independent subtasks (e.g., paragraph → entity list)
  • Batched multiple prompts per request using LangChain’s map_reduce chain
  • Used the OpenAI function-calling API to enforce structured output (sketched below)
  • Compressed prompts before each call to fit more content into each request
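
A minimal sketch of the batched, structured-output call: it uses the raw openai Python SDK rather than the LangChain wrapper for clarity, and the extract_entities schema, model settings, and helper name are illustrative assumptions rather than the exact production code.

import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Illustrative schema: one function call returns entities for a whole batch of paragraphs
ENTITY_FN = {
    "name": "extract_entities",
    "description": "Extract named entities from each numbered paragraph.",
    "parameters": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "paragraph": {"type": "integer"},
                        "name": {"type": "string"},
                        "type": {"type": "string"},
                    },
                    "required": ["paragraph", "name", "type"],
                },
            }
        },
        "required": ["entities"],
    },
}

def extract_entities(paragraphs):
    # Number the paragraphs so a single request covers the whole batch and the
    # results can still be mapped back to their source paragraphs.
    prompt = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        functions=[ENTITY_FN],
        function_call={"name": "extract_entities"},  # Force structured output
    )
    # The function-call arguments come back as a JSON string matching the schema
    return json.loads(resp.choices[0].message.function_call.arguments)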

2. TPM-Conscious Scheduling

  • Used LangChain’s token-aware throttling to avoid rate-limit breaches
  • Distributed load across multiple API keys/orgs to parallelize requests
  • Tracked token usage per key using a sliding token_window structure
  • Maintained a pending job queue with TTL and retries (sketched below)
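
Each queued job carried enough metadata to enforce the TTL and retry policy. A minimal sketch, where the Job fields, the TTL, and the retry cap are illustrative values rather than the exact production settings:

import time
from dataclasses import dataclass, field

JOB_TTL_SECONDS = 600   # Drop jobs that sat in the queue too long (illustrative value)
MAX_RETRIES = 3         # Drop jobs that keep failing or getting requeued (illustrative value)

@dataclass
class Job:
    payload: dict                                          # Subtask to transform (e.g., one paragraph)
    created_at: float = field(default_factory=time.time)   # Enqueue time, used for the TTL check
    retries: int = 0                                        # Incremented on every failure/requeue

def should_requeue(job: Job) -> bool:
    # A job goes back onto the queue only while it is still fresh and has retries left
    expired = time.time() - job.created_at > JOB_TTL_SECONDS
    return not expired and job.retries < MAX_RETRIES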

3. Throttling Logic with Queue and Token Tracker

  • Used a deque-based job queue for pending transformations
  • Created a token_usage_log = {api_key: [(timestamp, token_count), ...]} structure
  • In each loop:
from time import sleep

# job_queue is the deque of pending transformations described above
while job_queue:
    job = job_queue.popleft()
    if not token_within_limit(api_key):
        job_queue.append(job)  # Requeue if this key is over its TPM budget
        sleep(1)               # Back off briefly before trying again
        continue

    output = call_openai(job)
    tokens_used = output.usage.total_tokens  # Assumes call_openai returns the raw API response
    log_token_use(api_key, tokens_used)
    save_output(output)
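
The helpers token_within_limit and log_token_use are left as stubs above. A minimal sketch of how they could work against the token_usage_log structure, assuming a 60-second sliding window and the 2M TPM ceiling; the upcoming_tokens parameter and the constants are illustrative:

import time

TPM_LIMIT = 2_000_000   # GPT-4 tokens-per-minute ceiling mentioned above
WINDOW_SECONDS = 60     # Sliding window over which usage counts toward the limit

token_usage_log = {}    # {api_key: [(timestamp, token_count), ...]}

def log_token_use(api_key, token_count):
    # Record how many tokens this key just consumed, with a timestamp
    token_usage_log.setdefault(api_key, []).append((time.time(), token_count))

def token_within_limit(api_key, upcoming_tokens=0):
    # Sum tokens recorded for this key inside the sliding window, then check
    # whether the next call would still fit under the TPM ceiling
    cutoff = time.time() - WINDOW_SECONDS
    recent = [(ts, n) for ts, n in token_usage_log.get(api_key, []) if ts >= cutoff]
    token_usage_log[api_key] = recent   # Drop entries that fell out of the window
    used = sum(n for _, n in recent)
    return used + upcoming_tokens < TPM_LIMIT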