TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Yizhi Li1,3*, Qingshui Gu1*, Zhoufutu Wen2*, Ziniu Li2, Tianshun Xing1, Shuyue Guo1, Tianyu Zheng1, Xin Zhou2, Xingwei Qu1,2,3, Wangchunshu Zhou1, Zheng Zhang2, Wei Shen2, Qian Liu1, Chenghua Lin3♢, Jian Yang1♢, Ge Zhang1,2♢, Wenhao Huang2
1M-A-P    2Bytedance Seed    3The University of Manchester
*Equal contribution    ♢Corresponding Author

TL;DR

Standard reinforcement learning methods for LLMs are inefficient, wasting compute by generating many independent reasoning paths from scratch. We propose TreePO, a new method that organizes generation into a tree structure. This allows for sharing common reasoning steps (like a shared trunk and branches), which makes the process much faster and more stable. Our method saves up to 43% of GPU time while achieving state-of-the-art performance on reasoning benchmarks.

Validation performance curves during training with Qwen2.5-7B (left, middle) and an illustration of TreePO sampling (right).

The Problem: Wasted Effort in Reasoning

When using reinforcement learning to teach LLMs to reason, a common strategy is to generate multiple candidate solutions for a single problem and reward the good ones. However, this is like asking 16 different people to solve a math problem, only to find that they all started by writing down the same first few steps. It's redundant and inefficient: each of these independent solutions (or "trajectories") re-computes the same initial reasoning steps, wasting valuable GPU time and memory.

Case study showing shared reasoning paths

Multiple sampled trajectories from the same prompt, with shared reasoning segments highlighted in matching colors. Despite stochastic generation, key problem-solving steps are consistently reproduced.

Our Solution: Thinking in Trees

TreePO introduces a smarter way to explore the solution space. Instead of independent paths, we grow a single tree of possibilities: the model generates a segment of text and can then "branch" off, creating multiple continuations from that point. This has two major benefits:

1. Efficiency: continuations that share a prefix compute it only once, so the common "trunk" of reasoning is not regenerated for every trajectory.
2. Stability: trajectories that share a prefix form natural sub-groups, which provide a finer-grained signal for advantage estimation.
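
To make the sampling procedure concrete, here is a minimal sketch of a segment-wise tree rollout loop. It is illustrative only: `generate_segment` and `is_finished` are placeholder callables standing in for the actual decoding backend, and the budget handling is simplified compared to the real TreePO sampler.

```python
# Minimal sketch of tree-structured sampling (not the official TreePO code).
# `generate_segment(prefix, max_tokens)` and `is_finished(text)` are assumed
# callables provided by the decoding backend.

def tree_sample(prompt, generate_segment, is_finished,
                max_depth=14, segment_len=512,
                branch_factor=2, max_rollouts=16):
    """Grow a tree of trajectories by branching every `segment_len` tokens."""
    frontier = [prompt]   # partial trajectories still being extended
    finished = []         # trajectories that reached an end-of-answer marker

    for _ in range(max_depth):
        # Branch only while under the rollout budget; otherwise just extend.
        width = len(frontier) + len(finished)
        n_children = branch_factor if width < max_rollouts else 1

        next_frontier = []
        for prefix in frontier:
            # Siblings share `prefix`, so its prefill / KV cache can be reused
            # instead of being recomputed for every continuation.
            for _ in range(n_children):
                candidate = prefix + generate_segment(prefix, max_tokens=segment_len)
                (finished if is_finished(candidate) else next_frontier).append(candidate)
        frontier = next_frontier
        if not frontier:
            break

    # Trajectories still open at max depth are returned truncated.
    return finished + frontier
```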

TreePO Advantage Estimation

Demonstration of the TreePO Advantage Estimation, which calculates rewards based on sub-groups of trajectories within the tree.

The Payoff: Better, Faster, Cheaper

Our experiments show that TreePO is not just a theoretical improvement. It delivers concrete benefits.

State-of-the-Art Performance

TreePO achieves top results on several challenging math and reasoning benchmarks.

| Model | AIME | AMC | MATH | Overall |
|---|---|---|---|---|
| GRPO (Baseline) | 17.13% | 44.42% | 72.89% | 46.63% |
| TreePO (Ours) | 27.83% | 55.53% | 85.34% | 58.21% |

Massive Efficiency Gains

This is where TreePO truly shines. By avoiding redundant computations, we see significant reductions in the GPU hours required for training, making the whole process more scalable and accessible.

| Model | Sampling Method | Overall Accuracy ↑ | GPU Hours ↓ |
|---|---|---|---|
| TreePO w/ More Init Divergence | Sequential | 58.21% | 6.40 |
| TreePO w/ More Init Divergence | Tree-based (b=2) | 54.67% | 3.65 (↓43%) |

Efficiency Analysis: A Deeper Look

We benchmarked the throughput of TreePO against conventional sampling. On average, our tree-based sampling yields a +40% increase in trajectories per second and a +30% increase in tokens per second across three different models. The optimal configuration, however, depends on the model and the task.

Finding the Sweet Spot: Depth vs. Segment Length

Efficiency peaks at an intermediate trade-off between tree depth and segment length. Shorter segments allow for more branching and parallelism, but increase computational overhead. Longer segments are better for prefilling, but limit the tree's depth. The ideal balance is model-specific.

Performance comparison across different tree depths.

Scaling with More Rollouts

For instruction-tuned models, throughput increases almost linearly as we generate more rollouts (solutions) per prompt. For base models, throughput peaks and then declines: their outputs are more diverse, so trajectories share fewer prefixes and the computational savings shrink.

Performance comparison across different numbers of rollouts.

Further Analysis

How Should We Calculate Rewards in a Tree?

We explored several ways to calculate the reward signal (the "advantage"). The plots below show that a simple averaging of rewards across different subgroups in the tree works best. More complex weighting schemes or filtering out certain subgroups can actually hurt performance by creating a biased signal.
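
As a rough illustration of the subgroup-averaging idea, the sketch below assigns each trajectory the average of its GRPO-style normalized advantages over every subgroup (tree node) it belongs to. This is a simplified stand-in, not the exact TreePO estimator; see the paper for the precise formulation.

```python
import numpy as np

def subgroup_average_advantage(rewards, subgroups):
    """Illustrative subgroup-averaged advantage (not the exact TreePO formula).

    rewards:   one scalar reward per complete trajectory.
    subgroups: list of index lists; each sub-list holds the trajectories that
               share a common prefix (one internal node of the sampling tree).
               The full rollout group can itself be included as one subgroup.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    adv_sum = np.zeros_like(rewards)
    adv_cnt = np.zeros_like(rewards)

    for idx in subgroups:
        idx = np.asarray(idx, dtype=int)
        group = rewards[idx]
        centered = group - group.mean()          # GRPO-style group baseline
        if group.std() > 1e-8:
            centered = centered / group.std()    # normalize when non-degenerate
        adv_sum[idx] += centered
        adv_cnt[idx] += 1

    # Simple, unweighted average over all subgroups a trajectory appears in.
    return adv_sum / np.maximum(adv_cnt, 1)

# Example: 4 trajectories; the first two share one branch, the last two another.
adv = subgroup_average_advantage(
    rewards=[1.0, 0.0, 1.0, 1.0],
    subgroups=[[0, 1, 2, 3], [0, 1], [2, 3]],
)
```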

Advantage estimation ablation plots (4 panels).

How Deep Should the Tree Be?

How should we slice up the generation? Should the tree be deep with short text segments, or shallow with long segments? We tested various combinations and found a sweet spot: a moderately deep tree with 512-token segments (depth 14 × 512 tokens) gave the best results, while very long segments underperformed. This suggests that giving the model more frequent opportunities to branch is beneficial.
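
For intuition on the trade-off, the same response budget can be split in different ways; the snippet below (using hypothetical alternative segment lengths) shows how longer segments directly reduce the number of branching opportunities under the fixed 14 × 512 = 7,168-token budget mentioned above.

```python
# Illustrative arithmetic only: alternative (segment_length, depth) splits of
# the ~7K-token budget implied by the 14 x 512 configuration above.
budget = 14 * 512  # 7168 tokens per trajectory
for seg_len in (256, 512, 1024, 2048):  # hypothetical sweep values
    depth = budget // seg_len
    print(f"segment={seg_len:4d} tokens -> depth={depth:2d} branching points")
```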

Tree depth / segment length ablation plots (4 panels).

Is More Exploration Always Better?

Can we guide the search using the model's own confidence? We tested a heuristic that gives more computational budget to less likely paths to encourage exploration. The results show this backfires: forcing the model down low-probability paths leads to longer, less coherent answers and worse performance. A balanced exploration strategy is more effective.
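
For concreteness, the kind of allocation rule we tested looks roughly like the sketch below, where paths with lower mean log-probability receive more children. This is illustrative; the exact weighting in our experiments may differ, and, as noted above, this biased allocation hurt performance.

```python
import numpy as np

def exploration_biased_budget(mean_logprobs, total_children, temperature=1.0):
    """Sketch of the rejected heuristic: allocate more children to paths the
    model considers LESS likely (lower mean log-probability per path)."""
    lp = np.asarray(mean_logprobs, dtype=np.float64)
    weights = np.exp(-lp / temperature)        # unlikely paths get larger weight
    weights /= weights.sum()
    children = np.floor(weights * total_children).astype(int)
    children[np.argmin(lp)] += total_children - children.sum()  # spend leftovers
    return children

# Example: three live paths, 8 children to distribute.
print(exploration_biased_budget([-0.2, -1.5, -3.0], total_children=8))
```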

Log-probability-guided exploration ablation plots (4 panels).

Citation

If you find this work useful, please consider citing the paper:

@misc{li2025treepo,
  title         = {TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling},
  author        = {Yizhi Li and Qingshui Gu and Zhoufutu Wen and Ziniu Li and Tianshun Xing and Shuyue Guo and Tianyu Zheng and Xin Zhou and Xingwei Qu and Wangchunshu Zhou and Zheng Zhang and Wei Shen and Qian Liu and Chenghua Lin and Jian Yang and Ge Zhang and Wenhao Huang},
  year          = {2025},
  eprint        = {2508.17445},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2508.17445},
  howpublished  = {\url{https://m-a-p.ai/TreePO}}
}