Research breakdown on why having Model B review Model A's code catches ~10% more issues. Includes Sonar benchmark data on where GPT/Gemini/Claude each fail differently, and context engineering techniques to avoid the "Lost in the Middle" problem in review prompts.