> DeepResearch is a cosmetic enhancement that wraps the results in a "report"
No, that's not what Xiao said here. Here's the relevant quote:
> It often begins by creating a table of contents, then systematically applies DeepSearch to each required section – from introduction through related work and methodology, all the way to the conclusion. Each section is generated by feeding specific research questions into the DeepSearch. The final phase involves consolidating all sections into a single prompt to improve the overall narrative coherence.
(I also recommend that you stare very hard at the diagrams.)
Let me paraphrase what Xiao is saying here:
A DeepSearch is a primitive — it does mostly the same thing a regular LLM query does, but with a lot of trained-in thinking and searching work, to ensure that it is producing a rigorous answer to your question. Which is great: it means that DeepSearch is more likely to say "I don't know" than to hallucinate an answer. (This is extremely important as a building block; an agent needs to know when a query has failed so it can try again / try something else.)
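To make that building-block contract concrete, here's a minimal sketch in Python. This is purely illustrative (not anyone's actual API); the point is the return type: a grounded, cited answer, or an explicit failure.

```python
# Hypothetical sketch of the DeepSearch contract described above.
# The internals are elided; the explicit-failure return type is the point.
from dataclasses import dataclass, field

@dataclass
class SearchResult:
    answer: str | None                  # None means "I don't know"
    citations: list[str] = field(default_factory=list)
    thinking_log: list[str] = field(default_factory=list)

def deep_search(question: str) -> SearchResult:
    """Placeholder for the trained-in think/search loop: iterate
    search-read-reason until it converges on a cited answer, or give
    up explicitly by returning answer=None."""
    ...
```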
However, DeepSearch alone still "hallucinates" in one particular way: it "hallucinates understanding" of the topic, thinking that it already has a complete mental toolkit of concepts needed to solve your problem. It will never say "solving this sub-problem seems to require inventing a new tool" and so "branch off" to another recursed DeepSearch to determine how to do that. Instead, it'll try to solve your problem with the toolkit it has — and if that toolkit is insufficient, it will simply fail.
Which, again, is great in some ways. It means that a single DeepSearch will do a (semi-)bounded amount of work. Which means that the costs of each marginal additional DeepSearch call are predictable.
But it also means that you can't ask DeepSearch itself to:
• come up with a mathematical proof of something, where any useful proof strategy will implicitly require inventing new math concepts to use as tools in solving the problem.
• do investigative journalism that involves "chasing leads" down a digraph of paths; evaluating what those leads have to say; and using that info to determine new leads.
• "code me a Facebook clone" — and have it understand that doing so involves iteratively/recursively building out a software architecture composed of many modules — where it won't be able to see the need for many of those modules at "design time", but will only "discover" the need to write them once it gets to implementation time of dependent modules and realizes that to achieve some goal, it must call into some code / entire library that doesn't exist yet. (And then make a buy-vs-build decision on writing that code vs pulling in a dependency... which requires researching the space of available packages in the ecosystem, and how well they solve the problem... and so on.)
A DeepResearch model, meanwhile, is a model that looks at a question, and says "is this a leaf question that can be answered directly — or is this a question that needs to be broken down and tackled by parts, perhaps with some of the parts themselves being unknowns until earlier parts are solved?"
A DeepResearch model does a lot of top-level work — probably using DeepSearch! — to test the "leaf-ness" of your question; and to break down non-leaf questions into a "battle plan" for solving the problem. It then attempts solutions to these component problems — not by calling DeepSearch, but by recursively calling itself (where that forked child will call DeepSearch if the sub-problem is leaf-y, or break down the sub-problem further if not).
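In the same illustrative Python as the earlier sketch (every helper name here is an assumption, not a known API), that dispatch might look like:

```python
def is_leaf(question: str) -> bool:
    """Hypothetical: test 'leaf-ness', probably via DeepSearch itself."""
    ...

def decompose(question: str) -> list[str]:
    """Hypothetical: break a non-leaf question into a 'battle plan'."""
    ...

def deep_research(question: str) -> SearchResult:
    if is_leaf(question):
        return deep_search(question)    # leaf: hand off to the primitive
    plan = decompose(question)
    return solve_plan(question, plan)   # recurse over sub-problems (sketched below)
```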
A DeepResearch model will then take the solutions derived for prerequisite problems into account in the solution space for the problems that depend on them. (A DeepResearch model may also be trained to notice when it's "worked into a corner" by coming up with early-phase solutions that make later phases impossible; and to backtrack and solve the earlier phases differently, now with in-context knowledge of the constraints of the later phases.)
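Continuing the sketch, solving the plan might thread earlier answers into later sub-problems and back up on a dead end (again, all of these names are hypothetical):

```python
def with_context(sub: str, solved: dict) -> str: ...           # hypothetical
def revise_plan(plan, solved, failed_at): ...                  # hypothetical backtrack
def synthesize(question: str, solved: dict) -> SearchResult: ...  # hypothetical

def solve_plan(question: str, plan: list[str]) -> SearchResult:
    solved: dict[str, str] = {}
    i = 0
    while i < len(plan):
        # Feed the solutions of prerequisite problems into this one.
        result = deep_research(with_context(plan[i], solved))
        if result.answer is None:
            # Worked into a corner: revise the plan and re-solve an earlier
            # phase, now aware of the later phase's constraint.
            plan, i = revise_plan(plan, solved, failed_at=i)
            continue
        solved[plan[i]] = result.answer
        i += 1
    return synthesize(question, solved)
```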
Once a DeepResearch model finds a successful solution to all subproblems, it takes the hierarchy of thinking/searching logs it generated in the process, and strips out all the dead-ends and backtracking, to present a comprehensible linear "success path." (Probably it does this as the end-step of each recursive self-call, before returning to its parent, to minimize the amount of data returned.)
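That pruning step could be as simple as the following (illustrative only; `on_success_path` is a stand-in for however the model distinguishes dead ends from the path that worked):

```python
def consolidate(log: list[str], on_success_path) -> list[str]:
    # Drop the dead ends and backtracking; keep only the steps that led
    # to the final answer, so the parent call's context isn't flooded.
    return [step for step in log if on_success_path(step)]
```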
Note how this last reporting step isn't "generating a report" for human consumption; it's a DeepResearch call "generating a report" for its parent DeepResearch call to consume. That's special sauce. (And if you think about it, the top-level call to this whole thing is probably going to use a non-DeepResearch model at the end to rephrase the top-level DeepResearch result from a machine-readable recurse-result report into a human-readable report. It might even use a DeepSearch model to do that!)
---
Bonus tangent:
Despite DeepSearch + DeepResearch using a scientific-research metaphor, I think an enlightening comparison is with intelligence agencies.
DeepSearch alone does what an individual intelligence analyst does. You hand them an individually-actionable question; they run through a "branching, but vaguely bounded in time" process of thinking and searching, generating a thinking log in the process, eventually arriving at a conclusion; they hand you back an answer to your question, with a lot of citations — or they "throw an exception" and tell you that the facts available to the agency cannot support a conclusion at this time.
Meanwhile, DeepResearch does what an intelligence agency as a whole does:
1. You send the agency a high-level strategic Request For Information;
2. the agency puts together a workgroup composed of people with trained-in expertise in breaking down problems (Intelligence Managers) and subject-matter experts with a wide-ranging gestalt picture of the problem space (Senior Intelligence Analysts), and tasks them with breaking down the problem into sub-problems;
3. some of these sub-problems are actionable — they can be assigned directly for research by a ground-level analyst; some of these sub-problems have prerequisite work that must be done to gather intelligence in the field; and some of these sub-problems are unknown unknowns — missing parts of the map that cannot be "planned into" until other sub-problems are resolved.
4. from there, the problem gets "scheduled" — in parallel, (the first batch of) individually-actionable questions get sent to analysts, and any field missions to gather prerequisite intelligence are kicked off for planning (involving spawning new sub-workgroups!);
5. the top-level workgroup persists after their first meeting, asynchronously observing the reports from actionable questions; scheduling newly-actionable questions to analysts once field data comes in to be chewed on; and exploring newly-legible parts of the map to outline further sub-problems.
6. If this scheduling process runs out of work to schedule, it's either because the top-level question is now answerable, or because the process has worked itself into a corner. In the former case, a final summary reporting step is kicked off, usually assigned to a senior analyst. In the latter case, the workgroup reconvenes to figure out how to backtrack out of the corner and pursue alternate avenues. (Note that, if they have the time, they'll probably make "if this strategy produces results that are unworkable in a later step" plans for every possible step in their original plan, in advance, so that the "scheduling engine" of analyst assignments and fieldwork need never run dry waiting for the workgroup to come up with a new plan.)
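In agent terms, that scheduling engine is just a work loop. A rough sketch of steps 1 through 6, with every name made up for illustration:

```python
def workgroup_decompose(rfi): ...                # steps 2-3: plan, with .actionable
def analyst(task, context): ...                  # one DeepSearch-style worker
def workgroup_replan(plan, solved, failed): ...  # step 6: backtrack and replan
def senior_summary(rfi, solved): ...             # step 6: final summary report

def run_agency(rfi: str) -> str:
    plan = workgroup_decompose(rfi)              # step 1: the RFI comes in
    pending = set(plan.actionable)               # step 3: what can run right now
    solved: dict[str, str] = {}
    while pending:                               # step 4: parallel in real life
        task = pending.pop()
        result = analyst(task, context=solved)
        if result is None:                       # worked into a corner
            plan = workgroup_replan(plan, solved, failed=task)
        else:
            solved[task] = result
        # step 5: unlock newly-actionable questions as field data comes in
        pending |= plan.newly_actionable(solved) - solved.keys()
    return senior_summary(rfi, solved)
```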
You're right, Han didn't define DeepResearch as "a cosmetic enhancement". I quoted his one-sentence definition:
> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports.
But I then called it "a cosmetic enhancement", really to be slightly dismissive of it - I'm a skeptic of the report format because I think the way it's presented makes the information look more solid than it actually is. My complaint is at the aesthetic level, not relating to the (impressive) way the report synthesis is engineered.
So yeah, I'm being inaccurate and a bit catty about it.
Your explanation is much closer to what Han described, and much more useful than mine.