Standard reinforcement learning methods for LLMs are inefficient, wasting compute by generating many independent reasoning paths from scratch. We propose TreePO, a new method that organizes generation into a tree structure. This allows for sharing common reasoning steps (like a shared trunk and branches), which makes the process much faster and more stable. Our method saves up to 43% of GPU time while achieving state-of-the-art performance on reasoning benchmarks.
Validation performance curves during training with Qwen2.5-7B (left, middle), and an illustration of TreePO sampling (right).
When using reinforcement learning to teach LLMs how to reason, a common strategy is to generate multiple possible solutions for a single problem and reward the good ones. However, this is like asking 16 different people to solve a math problem, and finding out they all started by writing down the same first few steps. It's redundant and inefficient. Each of these independent solutions (or "trajectories") re-computes the same initial reasoning steps, wasting valuable GPU time and memory.
Multiple sampled trajectories from the same prompt, with shared reasoning segments highlighted in matching colors. Despite stochastic generation, key problem-solving steps are consistently reproduced.
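To make the redundancy concrete, here is a back-of-the-envelope sketch of the token accounting. The rollout count, prefix length, and response length are hypothetical values chosen only to illustrate the bookkeeping, not figures from the paper.

```python
# Back-of-the-envelope illustration of the redundancy (hypothetical numbers,
# not measurements from the paper).
num_rollouts = 16            # trajectories sampled per prompt
shared_prefix_tokens = 256   # opening reasoning the rollouts have in common
response_tokens = 1024       # total tokens decoded per rollout

# Independent sampling decodes every rollout from scratch.
independent_cost = num_rollouts * response_tokens
# Sharing the common prefix decodes it once, plus only the divergent suffixes.
shared_cost = shared_prefix_tokens + num_rollouts * (response_tokens - shared_prefix_tokens)

print(f"independent decoding  : {independent_cost} tokens")   # 16384
print(f"prefix-shared decoding: {shared_cost} tokens")         # 12544
print(f"tokens saved          : {1 - shared_cost / independent_cost:.0%}")  # ~23%
```

The deeper the shared prefix and the more rollouts per prompt, the larger the fraction of decoding that is pure repetition.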
TreePO introduces a smarter way to explore the solution space. Instead of sampling independent paths, we grow a single tree of possibilities: the model generates a segment of text and can then "branch", producing multiple continuations of the same prefix. This has two major benefits:

- **Efficiency:** shared reasoning segments are generated (and cached) once rather than recomputed for every trajectory, eliminating redundant GPU work.
- **Stability:** the tree naturally groups trajectories by their shared prefixes, so rewards can be estimated over sub-groups of related trajectories, yielding a more stable training signal.
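As a rough illustration of the mechanics, here is a minimal sketch of segment-wise tree sampling under our reading of the idea. The `toy_generate_segment` stub, the branching factor, and the segment length are placeholders, not the actual TreePO implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                                   # prompt plus tokens decoded along this path
    children: list = field(default_factory=list)
    done: bool = False                          # True once a final answer / EOS is reached

def toy_generate_segment(prefix: str, max_tokens: int) -> tuple[str, bool]:
    # Stand-in for the real decoding call (max_tokens unused in this stub):
    # appends a dummy segment and terminates with some probability.
    return f" [segment after {len(prefix)} chars]", random.random() < 0.3

def expand_tree(prompt, depth=4, branch=2, seg_len=512, generate=toy_generate_segment):
    """Grow a tree of continuations: at each level, every unfinished node
    is extended by `branch` independently sampled segments."""
    root = Node(text=prompt)
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            if node.done:
                continue
            # Branch: sample several continuations of the SAME prefix;
            # the prefix itself is held only once.
            for _ in range(branch):
                segment, finished = generate(node.text, seg_len)
                child = Node(text=node.text + segment, done=finished)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root

tree = expand_tree("Problem: ...", depth=3, branch=2, seg_len=512)
```

Collecting the leaves of the returned tree yields the full set of trajectories; in a real serving engine the shared prefix of each node would typically live once in the prefix/KV cache, which is where the compute savings come from.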
Demonstration of the TreePO Advantage Estimation, which calculates rewards based on sub-groups of trajectories within the tree.
Our experiments show that TreePO is not just a theoretical improvement. It delivers concrete benefits.
TreePO achieves top results on several challenging math and reasoning benchmarks.
Model | AIME | AMC | MATH | Overall |
---|---|---|---|---|
GRPO (Baseline) | 17.13% | 44.42% | 72.89% | 46.63% |
TreePO (Ours) | 27.83% | 55.53% | 85.34% | 58.21% |
This is where TreePO truly shines. By avoiding redundant computations, we see significant reductions in the GPU hours required for training, making the whole process more scalable and accessible.
Model | Sampling Method | Overall Accuracy ↑ | GPU Hours ↓ |
---|---|---|---|
TreePO w/ More Init Divergence | Sequential | 58.21% | 6.40 |
TreePO w/ More Init Divergence | Tree-based (b=2) | 54.67% | 3.65 (↓43%) |
We benchmarked the throughput of TreePO against conventional sampling. On average, our tree-based sampling yields a +40% increase in trajectories per second and a +30% increase in tokens per second across three different models. The optimal configuration, however, depends on the model and the task.
Efficiency peaks at an intermediate trade-off between tree depth and segment length. Shorter segments allow for more branching and parallelism, but increase computational overhead. Longer segments are better for prefilling, but limit the tree's depth. The ideal balance is model-specific.
Performance comparison across different tree depths.
For instruction-tuned models, throughput increases almost linearly as we generate more rollouts (solutions) per prompt. For base models, throughput peaks and then declines, as the increased diversity of solutions leads to less sharing of the computational path.
Performance comparison across different numbers of rollouts.
We explored several ways to calculate the reward signal (the "advantage"). The plots below show that a simple averaging of rewards across different subgroups in the tree works best. More complex weighting schemes or filtering out certain subgroups can actually hurt performance by creating a biased signal.
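As a concrete, if simplified, illustration of the "simple averaging" variant, here is a sketch of subgroup-baselined advantages. The grouping scheme, the function `subgroup_advantages`, and its toy inputs are our own illustrative construction; the exact formulation in the paper may differ.

```python
import statistics

def subgroup_advantages(rewards: dict, subgroups: dict) -> dict:
    """rewards:   trajectory id -> scalar reward.
    subgroups: subgroup id -> list of trajectory ids sharing that subtree.
    Each trajectory's advantage is the simple (unweighted) mean of
    (reward - subgroup mean reward) over the subgroups it belongs to."""
    baselines = {g: statistics.mean(rewards[t] for t in traj_ids)
                 for g, traj_ids in subgroups.items()}
    advantages = {}
    for t, r in rewards.items():
        groups_of_t = [g for g, traj_ids in subgroups.items() if t in traj_ids]
        advantages[t] = statistics.mean(r - baselines[g] for g in groups_of_t)
    return advantages

# Toy example: four leaf trajectories, one root group plus two branch-level subgroups.
rewards = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 1.0}
subgroups = {"root": ["t1", "t2", "t3", "t4"],
             "left": ["t1", "t2"],
             "right": ["t3", "t4"]}
print(subgroup_advantages(rewards, subgroups))
# {'t1': 0.375, 't2': -0.625, 't3': 0.125, 't4': 0.125}
```

Swapping the plain mean for a size-weighted mean, or dropping some subgroups entirely, corresponds to the weighting and filtering variants that performed worse in our ablations.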
How should we slice up the generation? Should the tree be deep with short text segments, or shallow with long segments? We tested various combinations and found a sweet spot. A moderately deep tree with 512-token segments (14x512) gave the best results, while very long segments underperformed. This suggests that giving the model more frequent opportunities to branch is beneficial.
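For reference, the 14x512 setting caps each response at 14 × 512 = 7168 tokens. The snippet below simply enumerates how that same budget could be split differently; the alternative segment lengths are illustrative, not configurations reported in the paper.

```python
# Best-performing setting: 14 segments of 512 tokens = 7168-token response budget.
# Other splits of the same budget trade branching opportunities against
# per-segment prefill; the alternatives below are illustrative only.
max_response_tokens = 14 * 512   # 7168

for seg_len in (256, 512, 1024, 2048):
    depth = max_response_tokens // seg_len   # branch points available per rollout
    print(f"segment length {seg_len:>4} tokens -> up to {depth:>2} branch points")
```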
Can we guide the search using the model's own confidence? We tested a heuristic where we gave more computational budget to less likely paths to encourage exploration. The results show this is a bad idea. Forcing the model down low-probability paths leads to longer, less coherent answers and worse performance. A balanced exploration strategy is more effective.
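For concreteness, here is a sketch of the kind of inverse-confidence heuristic being described. It is our illustrative reading, not the paper's exact allocation rule, and `allocate_budget` together with its toy inputs is hypothetical.

```python
import math

def allocate_budget(logprobs: dict, total_branches: int) -> dict:
    """logprobs: prefix id -> average token log-probability of that prefix.
    Returns prefix id -> number of continuations to sample from it,
    giving less likely prefixes a larger share of the branching budget."""
    weights = {p: math.exp(-lp) for p, lp in logprobs.items()}   # lower log-prob -> larger weight
    total = sum(weights.values())
    return {p: max(1, round(total_branches * w / total)) for p, w in weights.items()}

# Toy example: the least likely prefix ("c") receives the most branches.
print(allocate_budget({"a": -0.2, "b": -0.6, "c": -1.5}, total_branches=8))
# {'a': 1, 'b': 2, 'c': 5}
```

Even in this toy form the failure mode is visible: the least likely prefix ends up with the most continuations, so low-quality paths consume a disproportionate share of compute, matching the degradation we observed.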
If you find this work useful, please consider citing the paper: