Daily-It

A tech blog where I try out and write up development, AI, infrastructure, automation, and reviews of everyday IT products.

Why Claude Code /deep-research Hits API Rate Limits During Verify — and How Queue Batches Fixed It

Summary

API rate limits in Claude Code /deep-research are not just a vague “the Verify step is heavy” problem. In my case, many agents repeatedly searched, fetched, summarized, and re-verified material, and the accumulated token usage finally hit the Anthropic API limit during the Verify phase.

The practical fix was to stop treating the whole research run as one large parallel job. I split the work into a queue-style pipeline: Verify ran in small claim batches, Fetch no longer overlapped with Verify, and a fresh run avoided carrying over the failed Fetch cache. The restructured run was stable and still kept the three-vote verification quality.


The situation: Verify failed at the end

While using Claude Code /deep-research, I hit an API rate limit during the final Verify phase. The earlier research steps had mostly progressed, but the run failed right when the result was being checked.

At first glance, it feels like Verify itself is the problem. In reality, Verify was closer to the point where the accumulated cost became visible: many agents had already produced large intermediate results, and the final check needed to read and compare them again.

The core cause: too many agents used too many tokens. The Verify phase did not create the whole problem by itself; it exposed the token spike that had built up across the run.

Why the rate limit happened

Anthropic’s API rate limits are not measured only by request count. The Messages API can be limited by RPM, input tokens per minute (ITPM), and output tokens per minute (OTPM). If any one of those is exceeded, a 429 error can occur.

| Limit | Meaning | Why it matters in deep-research |
| --- | --- | --- |
| RPM | Requests per minute | Parallel agents can increase calls quickly |
| ITPM | Input tokens per minute | Search results, fetched pages, previous summaries, and claim lists keep entering prompts |
| OTPM | Output tokens per minute | Agents produce long analysis and verification notes |
| Acceleration limit | A protection against sudden usage spikes | A wide research run can jump from low usage to a large burst |
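
To make the "any one of those" rule concrete, here is a minimal sketch that checks one minute of usage against three ceilings. The `MinuteBudget` class, the `exceeded_limits` helper, and every number in it are made up for illustration; real limits depend on your account tier and model, and this is not how the API itself computes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteBudget:
    """Per-minute ceilings. The values used below are hypothetical."""
    rpm: int   # requests per minute
    itpm: int  # input tokens per minute
    otpm: int  # output tokens per minute


def exceeded_limits(budget: MinuteBudget, requests: int,
                    input_tokens: int, output_tokens: int) -> list[str]:
    """Return the name of every limit that one minute of usage would exceed."""
    checks = {
        "RPM": (requests, budget.rpm),
        "ITPM": (input_tokens, budget.itpm),
        "OTPM": (output_tokens, budget.otpm),
    }
    return [name for name, (used, limit) in checks.items() if used > limit]


budget = MinuteBudget(rpm=50, itpm=40_000, otpm=8_000)

# Few requests but very large prompts: only the input-token limit trips.
print(exceeded_limits(budget, requests=9, input_tokens=90_000, output_tokens=6_000))
# → ['ITPM']
```

The point of the example is that a run can stay far below the request limit and still fail, because long prompts alone are enough to cross the input-token ceiling.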

Why it is common in deep-research

Claude Code subagents are useful because exploration can happen in a separate context window. That is exactly why deep research works well for complex work. But the same structure can also multiply token usage when the task fans out too broadly.

  1. Several subtopics are explored in parallel.
  2. Each agent reads documents, code, issues, and web pages.
  3. Each agent creates intermediate summaries.
  4. The main agent merges those summaries.
  5. Verify rereads the claims, evidence, and conflicts.

That is good for quality, but expensive for tokens. Verify is not a tiny final checkbox; it can become a second pass over everything that was already produced.
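
A rough back-of-the-envelope calculation shows why the final pass can dominate. In the sketch below, only the 25 claims and the three votes come from my run; the agent count, sources per agent, and all token sizes are assumptions chosen to illustrate the shape of the cost, not measurements.

```python
def estimate_input_tokens(agents: int, sources_per_agent: int, tokens_per_source: int,
                          summary_tokens: int, claims: int, votes: int,
                          evidence_tokens: int) -> dict[str, int]:
    """Estimate input tokens for each phase of a fan-out research run."""
    explore = agents * sources_per_agent * tokens_per_source  # agents read sources
    merge = agents * summary_tokens                           # main agent reads summaries
    verify = claims * votes * evidence_tokens                 # each vote rereads evidence
    return {"explore": explore, "merge": merge, "verify": verify,
            "total": explore + merge + verify}


estimate = estimate_input_tokens(
    agents=5, sources_per_agent=3, tokens_per_source=4_000,  # assumed
    summary_tokens=1_500,                                    # assumed
    claims=25, votes=3,                                      # from the run described here
    evidence_tokens=3_000,                                   # assumed
)
print(estimate)
# → {'explore': 60000, 'merge': 7500, 'verify': 225000, 'total': 292500}
```

Under these assumed sizes, Verify alone accounts for more input tokens than exploration and merging combined, which matches the experience of the run failing at the very end.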

Splitting deep-research into a queue

The fix that worked for me was to change the execution shape, not just ask the model to “try again.” The goal was to reduce peak concurrency and keep tokens-per-minute (TPM) usage under the ceiling.

The exact restructuring I used

| Stage | Before | After | Effect |
| --- | --- | --- | --- |
| Verify | Parallel over all 75 claim checks | Process 3 claims per batch and sequentially await each batch | Reduced peak concurrency, which had reached about 9 simultaneous checks, below the TPM ceiling |
| Fetch | Nested parallel inside the pipeline | Finish Search first as a barrier, then Fetch 4 items per sequential batch | Fetch and Verify no longer overlap |
| Total work | 25 claims and 15 fetches | 18 claims and 12 fetches | Lower total token volume while preserving the important checks |
| Verification quality | Large one-shot verification | Kept the three-vote verification quality | Reduced the bottleneck without dropping the quality bar |
| Run state | Could reuse the failed Fetch cache from the previous attempt | Fresh run and clean refetch | Avoided carrying over stale failure state |

The biggest improvement came from changing Verify from a 75-item parallel verification into 3-claim sequential batches. The quality target stayed the same; only the burst pattern changed. The Search barrier also mattered because Fetch and Verify no longer competed for token budget at the same time.
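
The batching idea can be sketched in plain Python with asyncio. This is not what Claude Code runs internally; `check_claim_once` is a hypothetical stand-in for one verification call, and running the three votes for a claim one after another is my assumption for the sketch. The counters exist only to show the peak number of calls in flight.

```python
import asyncio

VOTES_PER_CLAIM = 3   # three-vote verification
CLAIMS_PER_BATCH = 3  # batch size used in the restructured run

in_flight = 0
peak_in_flight = 0


async def check_claim_once(claim: str, vote: int) -> bool:
    """Hypothetical stand-in for one verification call to a model."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # simulate latency
    in_flight -= 1
    return True  # placeholder verdict


async def verify_claim(claim: str) -> bool:
    """Collect the votes for one claim sequentially and take the majority."""
    yes = 0
    for vote in range(VOTES_PER_CLAIM):
        if await check_claim_once(claim, vote):
            yes += 1
    return yes * 2 > VOTES_PER_CLAIM


async def verify_in_batches(claims: list[str]) -> dict[str, bool]:
    """Verify claims batch by batch; the next batch starts only after this one ends."""
    results: dict[str, bool] = {}
    for start in range(0, len(claims), CLAIMS_PER_BATCH):
        batch = claims[start:start + CLAIMS_PER_BATCH]
        verdicts = await asyncio.gather(*(verify_claim(c) for c in batch))
        results.update(zip(batch, verdicts))
    return results


claims = [f"claim-{i}" for i in range(18)]
results = asyncio.run(verify_in_batches(claims))
print(len(results), peak_in_flight)
# → 18 3
```

All 18 claims still receive three votes each, so the quality bar is unchanged; only the number of calls that can be in flight at the same moment is capped by the batch size.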

A prompt you can reuse

This is the kind of instruction I would use next time before starting a large /deep-research run.

Run /deep-research, but split the whole process into a queue-style pipeline.

Rules:
1. Do not verify all claims in one parallel block. Verify 3 claims per batch and await each batch sequentially.
2. Do not run Fetch as nested parallel work inside the pipeline.
3. Finish Search first, then Fetch 4 items per sequential batch.
4. Keep Fetch and Verify separated by a barrier so they do not overlap.
5. Reduce total volume if needed, for example claims 25→18 and fetches 15→12, while preserving the core three-vote verification quality.
6. Use a fresh run if a previous Fetch stage failed, so stale cache does not affect the next attempt.
7. If token usage may spike, stop and report the current queue state.

Practical failure cases and fixes

Case 1. Verify hits 429 or a rate limit error

Do not blindly retry the same shape. First compress the current findings, split the remaining claims into small batches, and verify them one batch at a time.

Compress the current research into 15 lines.
Split the remaining Verify work into 3-claim batches.
Verify the most important batch first and report the queue state after each batch.
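
If you drive the verification from your own script instead of a prompt, the same idea looks roughly like the sketch below. The `VerifyQueue` class and the numeric priority attached to each claim are hypothetical; how you rank claims by importance is up to you.

```python
from dataclasses import dataclass, field


@dataclass
class VerifyQueue:
    """Hypothetical queue of remaining claims; lower priority number = more important."""
    pending: list[tuple[int, str]]
    done: list[str] = field(default_factory=list)
    batch_size: int = 3

    def next_batch(self) -> list[str]:
        """Pop the most important remaining claims, up to batch_size."""
        self.pending.sort(key=lambda item: item[0])  # stable sort keeps ties in order
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return [claim for _, claim in batch]

    def mark_done(self, claims: list[str]) -> None:
        self.done.extend(claims)

    def state(self) -> str:
        """Short report to print after each batch."""
        return f"done={len(self.done)} pending={len(self.pending)}"


queue = VerifyQueue(pending=[(2, "c"), (1, "a"), (3, "e"), (1, "b"), (2, "d")])
while queue.pending:
    batch = queue.next_batch()
    # ... verify the claims in `batch` here ...
    queue.mark_done(batch)
    print(batch, queue.state())
# → ['a', 'b', 'c'] done=3 pending=2
# → ['d', 'e'] done=5 pending=0
```

Reporting the queue state after every batch means that if a later batch does hit a 429, you know exactly which claims are already verified and which remain.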

Case 2. Fetch and Verify overlap

If fetching sources and verifying claims happen at the same time, both input and output token usage can spike. Add a barrier: complete Search, then Fetch in small batches, then Verify.
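
A barrier is simply a point that nothing passes until the previous stage has fully finished. Here is a minimal asyncio sketch of that ordering; `search` and `fetch` are hypothetical placeholders, and the event log is there only to make the ordering visible.

```python
import asyncio

FETCH_BATCH_SIZE = 4
events: list[str] = []  # records the order in which stages actually ran


async def search(query: str) -> list[str]:
    """Hypothetical search step: returns source ids for one query."""
    await asyncio.sleep(0.01)
    events.append(f"search:{query}")
    return [f"{query}-src{i}" for i in range(4)]


async def fetch(source: str) -> str:
    """Hypothetical fetch step: returns the content of one source."""
    await asyncio.sleep(0.01)
    events.append(f"fetch:{source}")
    return f"content of {source}"


async def pipeline(queries: list[str]) -> list[str]:
    # Barrier 1: gather() returns only after every search has finished.
    found = await asyncio.gather(*(search(q) for q in queries))
    sources = [s for group in found for s in group]

    # Fetch in sequential batches of 4; a batch starts after the previous one ends.
    pages: list[str] = []
    for start in range(0, len(sources), FETCH_BATCH_SIZE):
        batch = sources[start:start + FETCH_BATCH_SIZE]
        pages.extend(await asyncio.gather(*(fetch(s) for s in batch)))

    # Barrier 2: Verify would begin only here, after the last fetch batch.
    events.append("verify:start")
    return pages


pages = asyncio.run(pipeline(["q1", "q2", "q3"]))
stages = [event.split(":")[0] for event in events]
print(len(pages), stages.index("fetch"), stages[-1])
# → 12 3 verify
```

The first fetch appears at index 3 of the log, after all three searches, and the Verify marker is the last entry, so the stages never overlap.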

Case 3. A failed Fetch cache affects the next run

If the previous run failed in Fetch, a clean fresh run can be better than resuming from a messy intermediate state. This is especially true when the failure happened near a rate-limit boundary.
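
I do not know how the tool stores its Fetch results internally, so the following is only a generic illustration of why stale state hurts. It assumes a hypothetical dictionary cache in which a failed fetch left an empty body behind; `load_sources` and `fake_fetch` are invented for the example.

```python
from typing import Callable


def load_sources(urls: list[str], cache: dict[str, str],
                 fetch: Callable[[str], str]) -> dict[str, str]:
    """Return page text per URL, reusing whatever the cache already holds."""
    pages: dict[str, str] = {}
    for url in urls:
        if url not in cache:
            cache[url] = fetch(url)
        pages[url] = cache[url]
    return pages


def fake_fetch(url: str) -> str:
    """Hypothetical fetch that always succeeds."""
    return f"full text of {url}"


urls = ["https://example.com/a", "https://example.com/b"]
stale_cache = {"https://example.com/a": ""}  # empty body left by a failed fetch

resumed = load_sources(urls, dict(stale_cache), fake_fetch)  # resume with old cache
fresh = load_sources(urls, {}, fake_fetch)                   # clean run, empty cache

print(repr(resumed[urls[0]]))
# → ''
print(repr(fresh[urls[0]]))
# → 'full text of https://example.com/a'
```

The resumed run silently verifies claims against an empty page, while the fresh run refetches it. That is the kind of carried-over failure a clean run avoids.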

Conclusion

When Claude Code /deep-research fails during Verify with an API rate limit, the direct cause is often not one bad Verify call. It is the accumulated work of many agents, many fetched sources, many intermediate summaries, and a final verification pass that tries to read too much at once.

For large research tasks, I would now start with a queue design: small Verify batches, Fetch separated from Verify, lower total volume, and a clean run when a previous attempt failed. It is slightly slower, but much more reliable than hitting the TPM ceiling at the end.

References

Original Korean version: This article is based on the Korean version and lightly adapted for English readers. Read the original Korean post. Please show some love to Korean, too.