V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

1 Shenzhen Campus of Sun Yat-sen University 2 Shenzhen Loop Area Institute
3 Sichuan University 4 EPIC Lab, Shanghai Jiao Tong University

Project leader: seanleo666@gmail.com. Corresponding author: renwq3@mail.sysu.edu.cn.

⚡ A training-free and plug-and-play curvature-aware spatio-temporal pruning framework for efficient long-context video inference.
V-CAST teaser

Highlight of V-CAST. Compared to existing token compression methods, V-CAST delivers up to +1.5-point gains on long-video benchmarks, retains 98.2% of the original performance (a +1.9% margin over the second-best baseline), and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct, respectively.

Contributions

(1) Spatio-temporal Coverage Rethinking: We revisit visual token compression for VideoLLMs under tight token budgets and identify spatio-temporal information coverage as the core bottleneck, together with two failure modes in prior pipelines: discontinuous coverage and misaligned spatio-temporal information.

(2) Video Curvature-Aware Spatio-Temporal Pruning: We propose V-CAST, a plug-and-play token pruning approach that couples per-frame token budgeting with content-driven token selection from the perspective of video spatio-temporal curvature, while preserving on-grid coordinates.

(3) State-of-the-Art Performance and Efficiency: Experiments on long-video benchmarks show that V-CAST achieves the best accuracy-efficiency trade-off on Qwen3-VL-8B-Instruct, Qwen3-VL-32B-Instruct, Qwen3-VL-30B-A3B-Instruct, and LLaVA-OV/Video-7B, while reducing peak memory and end-to-end latency.

Motivation

V-CAST budget analysis

Existing token compression methods often underperform because they allocate budgets in ways that fail to preserve critical temporal turns and informative spatial evidence. This analysis motivates V-CAST to prioritize spatio-temporal coverage instead of relying on uniform or myopic compression under tight budgets.

Comparisons of different spatial selection strategies at R=25%.

| Spatial Selection | MVBench | LongVideoBench | VideoMME (All) | VideoMME (S) | VideoMME (M) | VideoMME (L) |
|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 68.6 | 60.3 | 64.5 | 76.0 | 60.4 | 57.0 |
| **(a) Token Merging** | | | | | | |
| HoliTom [NeurIPS'25] | 63.0 | 56.8 | 59.7 | 71.4 | 54.6 | 53.1 |
| AvgPool (2×2) | 65.4 | 58.6 | 60.5 | 72.3 | 56.7 | 52.4 |
| **(b) Token Pruning** | | | | | | |
| Random (2×2) | 66.9 | 57.8 | 61.9 | 73.6 | 58.7 | 53.4 |
| First (2×2) | 67.7 | 57.7 | 61.1 | 72.2 | 56.2 | 55.0 |
V-CAST rope analysis

Misaligned spatio-temporal information from token merging. Token merging drifts off the discrete (t,h,w) grid and weakens MRoPE bindings, while token pruning preserves on-grid coordinates.
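The coordinate-drift argument can be made concrete with a toy example (illustrative only; the grid sizes and the 2×2 average merge below are assumptions, not the V-CAST implementation):

```python
import numpy as np

# Toy illustration: pruning keeps each surviving token's original integer
# (t, h, w) index, so MRoPE position bindings stay intact, while 2x2
# merging yields fractional coordinates that fall off the discrete grid.
T, H, W = 4, 6, 8                      # frames, grid height, grid width
t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([t, h, w], axis=-1).reshape(-1, 3).astype(float)

# Pruning: keep a subset of tokens -> coordinates remain integral.
rng = np.random.default_rng(0)
keep = rng.choice(coords.shape[0], coords.shape[0] // 4, replace=False)
pruned = coords[keep]
assert np.all(pruned == np.round(pruned))   # still on the (t, h, w) grid

# Merging: average 2x2 spatial neighbours -> fractional (h, w) positions.
merged = coords.reshape(T, H // 2, 2, W // 2, 2, 3).mean(axis=(2, 4))
print(merged[0, 0, 0])                      # [0.  0.5 0.5]
```

The merged token at the top-left of frame 0 sits at (0, 0.5, 0.5), a position that never existed in the encoder's grid, which is exactly the off-grid drift the figure above describes.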

Method

V-CAST formulates token pruning for VideoLLMs as an optimal semantic-trajectory approximation problem under a fixed budget. It applies Curvature-Guided Temporal Allocation to assign per-frame budgets by tracking semantic transitions, and then performs Dual-Anchor Spatial Token Selection to retain diverse and salient tokens while preserving original on-grid coordinates.
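The temporal allocation step can be sketched in a few lines. This is an illustrative approximation, not the released implementation: curvature is taken here as the discrete second difference of L2-normalized frame features, and the function name `curvature_budgets` and its `floor` parameter are assumptions.

```python
import numpy as np

def curvature_budgets(frame_feats, total_budget, floor=1):
    """Sketch: allocate per-frame token budgets from the curvature of the
    frame-level semantic trajectory. High-curvature frames mark semantic
    transitions and get larger budgets; every frame keeps >= `floor` tokens."""
    x = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    # Discrete curvature via the second difference x[t-1] - 2 x[t] + x[t+1].
    curv = np.zeros(len(x))
    curv[1:-1] = np.linalg.norm(x[:-2] - 2.0 * x[1:-1] + x[2:], axis=1)
    curv[0], curv[-1] = curv[1], curv[-2]           # pad the endpoints
    w = curv + 1e-6                                 # avoid all-zero weights
    budgets = np.full(len(x), floor, dtype=int)     # reserve the floor first
    extra = total_budget - floor * len(x)
    budgets += np.floor(extra * w / w.sum()).astype(int)
    # Hand rounding leftovers to the sharpest semantic turns.
    for i in np.argsort(-curv)[: total_budget - budgets.sum()]:
        budgets[i] += 1
    return budgets
```

For a clip whose features jump once, the two frames around the transition receive most of the budget while flat segments keep only the floor, which matches the coverage behaviour the method targets.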

V-CAST framework

Overall framework of V-CAST. The method first allocates temporal budget with curvature cues, then performs dual-anchor spatial token pruning using contextual diversity and feature activation anchors.
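The spatial step can likewise be sketched as a greedy selection balancing two anchors. This is a minimal sketch under stated assumptions: activation is approximated by the feature norm, diversity by the distance to the already-selected set, and the name `dual_anchor_select` and the `alpha` weight are illustrative, not the paper's exact scoring.

```python
import numpy as np

def dual_anchor_select(tokens, k, alpha=0.5):
    """Sketch: pick k token indices for one frame by greedily maximising
    alpha * activation + (1 - alpha) * diversity. Returning indices (not
    merged features) preserves the original on-grid (t, h, w) coordinates."""
    act = np.linalg.norm(tokens, axis=1)
    act = act / (act.max() + 1e-8)
    selected = [int(np.argmax(act))]        # activation anchor seeds the set
    while len(selected) < k:
        # Diversity anchor: distance to the closest already-selected token.
        d = np.min(
            np.linalg.norm(tokens[:, None] - tokens[selected][None], axis=-1),
            axis=1,
        )
        d = d / (d.max() + 1e-8)
        score = alpha * act + (1 - alpha) * d
        score[selected] = -np.inf           # never re-pick a token
        selected.append(int(np.argmax(score)))
    return np.array(selected)
```

Because the output is a set of indices, the kept tokens can be fed to the LLM with their untouched MRoPE positions, in contrast to merging-based compressors.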

Main Results

- **98.6%** of the original performance retained
- **+1.1%** over the second-best baseline
- **86.7%** peak memory of vanilla Qwen3-VL-8B
- **86.4%** total latency of vanilla Qwen3-VL-8B
Performance comparison with other baselines on Qwen3-VL-8B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| **Max Input Frames = 64** | | | | | | | | | |
| Qwen3-VL-8B-Instruct | 69.2 | 62.8 | 68.9 | 66.9 | 78.8 | 66.2 | 55.8 | 67.0 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 64.8 | 58.9 | 63.4 | 63.1 | 73.7 | 60.3 | 55.4 | 62.6 | 93.4 |
| VidCom2 [EMNLP'25] | 67.5 | 59.6 | 64.0 | 64.9 | 75.4 | 63.4 | 55.9 | 64.0 | 95.5 |
| FastVID [NeurIPS'25] | 66.8 | 60.4 | 65.0 | 62.3 | 73.9 | 60.7 | 52.2 | 63.6 | 94.9 |
| HoliTom [NeurIPS'25] | 64.8 | 58.9 | 63.4 | 62.7 | 74.4 | 60.6 | 53.2 | 62.5 | 93.3 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 68.4 | 61.2 | 65.4 | 65.1 | 76.8 | 63.2 | 55.3 | 65.0 | 97.0 |
| **Retention Ratio = 15%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.0 | 57.7 | 61.4 | 60.4 | 70.3 | 58.4 | 52.6 | 60.4 | 90.1 |
| VidCom2 [EMNLP'25] | 64.3 | 57.4 | 60.3 | 62.9 | 72.6 | 61.3 | 54.7 | 61.2 | 91.3 |
| FastVID [NeurIPS'25] | 64.9 | 59.7 | 63.3 | 60.5 | 72.2 | 57.2 | 52.1 | 62.1 | 92.7 |
| HoliTom [NeurIPS'25] | 62.7 | 58.3 | 61.8 | 60.9 | 72.3 | 58.2 | 52.2 | 60.9 | 90.9 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 66.3 | 59.6 | 64.4 | 63.9 | 76.3 | 61.4 | 53.8 | 63.6 | 94.9 |
| **Max Input Frames = 32** | | | | | | | | | |
| Qwen3-VL-8B-Instruct | 68.6 | 60.3 | 63.5 | 64.5 | 76.0 | 60.4 | 57.0 | 64.2 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.2 | 56.7 | 60.8 | 60.1 | 69.6 | 56.7 | 54.2 | 60.0 | 93.5 |
| VidCom2 [EMNLP'25] | 67.0 | 58.0 | 60.6 | 62.4 | 72.1 | 59.1 | 56.1 | 62.0 | 96.6 |
| FastVID [NeurIPS'25] | 67.3 | 58.7 | 60.7 | 60.5 | 72.0 | 56.8 | 52.7 | 61.8 | 96.3 |
| HoliTom [NeurIPS'25] | 63.0 | 56.8 | 61.2 | 59.7 | 71.4 | 54.6 | 53.1 | 60.2 | 93.8 |
| FlashVID [ICLR'26] | 67.5 | 58.8 | 61.7 | 62.3 | 74.4 | 58.7 | 53.9 | 62.6 | 97.5 |
| V-CAST [Ours] | 67.9 | 58.2 | 63.5 | 63.5 | 74.4 | 60.2 | 55.8 | 63.3 | 98.6 |
| **Retention Ratio = 15%** | | | | | | | | | |
| VisionZip [CVPR'25] | 60.1 | 56.2 | 60.0 | 58.2 | 66.9 | 54.9 | 52.9 | 58.6 | 91.3 |
| VidCom2 [EMNLP'25] | 64.2 | 56.0 | 57.7 | 59.6 | 68.9 | 56.7 | 53.3 | 59.4 | 92.5 |
| FastVID [NeurIPS'25] | 66.1 | 57.1 | 59.0 | 58.3 | 69.6 | 54.3 | 51.1 | 60.1 | 93.6 |
| HoliTom [NeurIPS'25] | 60.0 | 55.8 | 59.6 | 58.3 | 68.9 | 54.7 | 51.7 | 58.4 | 91.0 |
| FlashVID [ICLR'26] | 66.5 | 57.8 | 60.0 | 61.4 | 72.4 | 57.6 | 54.1 | 61.4 | 95.6 |
| V-CAST [Ours] | 66.2 | 57.7 | 60.1 | 62.0 | 71.9 | 58.3 | 55.9 | 61.5 | 95.8 |
Performance comparison on the larger model Qwen3-VL-32B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-32B-Instruct | 73.2 | 62.4 | 66.1 | 69.3 | 78.3 | 67.1 | 62.4 | 67.8 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.9 | 60.0 | 62.2 | 64.8 | 72.8 | 63.1 | 58.6 | 62.5 | 92.2 |
| VidCom2 [EMNLP'25] | 70.2 | 60.4 | 64.9 | 67.0 | 75.6 | 64.3 | 61.2 | 65.6 | 96.8 |
| FastVID [NeurIPS'25] | 71.0 | 60.9 | 64.5 | 65.1 | 75.1 | 62.6 | 57.7 | 65.4 | 96.5 |
| HoliTom [NeurIPS'25] | 62.3 | 60.2 | 63.2 | 64.9 | 74.2 | 61.2 | 59.1 | 62.6 | 92.3 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 71.8 | 61.5 | 64.7 | 68.1 | 77.7 | 65.9 | 60.8 | 66.5 | 98.1 |
Performance comparison on the MoE-based model Qwen3-VL-30B-A3B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - |
| **Retention Ratio = 25%** | | | | | | | | |
| VisionZip [CVPR'25] | 61.3 | 62.4 | 64.1 | 62.7 | 70.7 | 61.3 | 56.0 | 62.6 |
| VidCom2 [EMNLP'25] | 69.7 | 65.5 | 68.5 | 68.1 | 79.1 | 68.0 | 58.9 | 68.0 |
| FastVID [NeurIPS'25] | 68.2 | 60.7 | 64.8 | 62.4 | 70.3 | 61.6 | 55.2 | 64.0 |
| HoliTom [NeurIPS'25] | 66.0 | 63.9 | 64.7 | 64.6 | 73.6 | 63.9 | 56.2 | 64.8 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - |
| V-CAST [Ours] | 69.2 | 66.6 | 68.2 | 68.2 | 78.9 | 65.9 | 59.8 | 68.1 |
V-CAST frame scaling

Consistent gains with more frames on Qwen3-VL-8B-Instruct. Performance trends on LongVideoBench, MLVU, VideoMME (Long), and EgoSchema as input frames increase. V-CAST improves accuracy and scales to longer inputs, while some baselines show limited gains or OOM failures at larger frame counts.

Performance comparison on LLaVA-OV-7B.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-OV-7B | 58.3 | 56.6 | 63.1 | 58.4 | 69.9 | 56.7 | 48.8 | 59.1 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| FastV [ECCV'24] | 55.5 | 53.3 | 59.6 | 55.3 | 65.0 | 53.8 | 47.0 | 55.9 | 94.6 |
| PDrop [CVPR'25] | 55.3 | 51.3 | 57.1 | 55.5 | 64.7 | 53.1 | 48.7 | 54.8 | 92.7 |
| SparseVLM [ICML'25] | 56.4 | 53.9 | 60.7 | 57.3 | 68.4 | 55.2 | 48.1 | 57.1 | 96.6 |
| VisionZip [CVPR'25] | 56.9 | 56.0 | 62.9 | 58.0 | 68.9 | 57.4 | 47.6 | 58.5 | 99.0 |
| PruneVid [ACL'25] | 55.7 | 55.1 | 63.4 | 57.0 | 68.8 | 54.4 | 47.7 | 57.8 | 97.8 |
| FrameFusion [ICCV'25] | 56.0 | 54.8 | 61.7 | 57.5 | 68.2 | 55.7 | 48.6 | 57.5 | 97.3 |
| FastVID [NeurIPS'25] | 56.5 | 56.3 | 60.9 | 58.3 | 69.4 | 58.2 | 47.2 | 58.0 | 98.1 |
| VidCom2 [EMNLP'25] | 57.0 | 55.4 | 62.8 | 58.4 | 69.3 | 56.3 | 49.4 | 58.4 | 98.8 |
| V-CAST [Ours] | 57.4 | 56.4 | 62.9 | 58.6 | 70.7 | 56.0 | 49.1 | 58.8 | 99.5 |
Performance comparison on LLaVA-Video-7B.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Video-7B | 60.4 | 58.9 | 67.3 | 64.4 | 77.3 | 62.4 | 53.4 | 62.8 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| FastV [ECCV'24] | 52.1 | 54.8 | 57.8 | 58.6 | 68.7 | 58.4 | 48.7 | 55.8 | 88.9 |
| SparseVLM [ICML'25] | 55.4 | 54.2 | 58.9 | 60.1 | 71.1 | 59.1 | 50.1 | 57.2 | 91.1 |
| VisionZip [CVPR'25] | 57.9 | 56.3 | 62.6 | 62.5 | 73.6 | 62.3 | 51.9 | 59.8 | 95.2 |
| HoliTom [NeurIPS'25] | 58.4 | 57.1 | 60.5 | 63.0 | 74.6 | 62.3 | 52.1 | 59.8 | 95.2 |
| VidCom2 [EMNLP'25] | 57.0 | 57.1 | 58.7 | 61.7 | 73.0 | 61.7 | 50.0 | 58.6 | 93.4 |
| V-CAST [Ours] | 58.0 | 57.3 | 61.6 | 62.7 | 74.0 | 61.2 | 52.4 | 59.9 | 95.4 |
Inference efficiency. Qwen3-VL-8B-Instruct on VideoMME at R=25% with max input frames = 32. Lower is better for latency and memory; higher is better for throughput and performance.

| Methods | Prefilling Latency (s) ↓ | LLM Generation Latency (s) ↓ | Total Latency (s) ↓ | GPU Peak Memory (MB) ↓ | Throughput (item/s) ↑ | Performance (Score) ↑ |
|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 243.6 | 280.9 | 1440.4 | 22478.0 | 1.87 | 64.5 |
| Random | 116.9 | 147.3 | 1271.9 | 19547.5 | 2.12 | 61.5 |
| VisionZip [CVPR'25] | 155.6 | 190.5 | 1276.8 | 19974.7 | 2.11 | 60.1 |
| VidCom2 [EMNLP'25] | 119.0 | 150.2 | 1271.0 | 19547.5 | 2.12 | 62.4 |
| FastVID [NeurIPS'25] | 120.6 | 153.6 | 1317.6 | 19547.5 | 2.05 | 60.5 |
| HoliTom [NeurIPS'25] | 134.3 | 160.8 | 1261.8 | 19974.7 | 2.14 | 59.7 |
| FlashVID [ICLR'26] | 146.6 | 175.0 | 1308.7 | 41363.6 | 2.06 | 62.3 |
| V-CAST [Ours] | 118.8 | 149.9 | 1245.1 | 19547.5 | 2.17 | 63.5 |
V-CAST efficiency comparison

Efficiency comparison on LLaVA-OneVision-7B. We compare Vanilla, FastVID, VidCom2, and V-CAST on Prefill Latency, Total Latency, and Peak Memory.

Analysis

V-CAST frame budget allocation

Visualization of frame budget allocation. We compare Uniform allocation, Global Uniqueness, and Curvature-Aware allocation under R=25%. A higher curve indicates a larger per-frame token budget.

Qualitative Results

V-CAST qualitative comparison

Qualitative comparison. V-CAST highlights task-critical moments and yields correct answers where baselines and even the vanilla model fail.

Citation

@misc{lin2026vcast,
  title={V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models},
  author={Xinying Lin and Xuyang Liu and Yiyu Wang and Teng Ma and Wenqi Ren},
  year={2026},
  howpublished={\url{https://github.com/xinyouu/V-CAST}}
}