V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

1 Shenzhen Campus of Sun Yat-sen University 2 Shenzhen Loop Area Institute
3 Sichuan University 4 EPIC Lab, Shanghai Jiao Tong University

Project leader: seanleo666@gmail.com. Corresponding author: renwq3@mail.sysu.edu.cn.

⚡ A training-free and plug-and-play curvature-aware spatio-temporal pruning framework for efficient long-context video inference.
V-CAST teaser

Highlight of V-CAST. Compared to existing token compression methods, V-CAST delivers up to +1.5-point gains on long-video benchmarks, retains 98.2% of the original performance (a +1.9% margin over the second-best baseline), and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct, respectively.

Contributions

(1) Spatio-temporal Coverage Rethinking: We revisit visual token compression for VideoLLMs under tight token budgets and identify spatio-temporal information coverage as the core bottleneck, together with two failure modes in prior pipelines: discontinuous coverage and misaligned spatio-temporal information.

(2) Video Curvature-Aware Spatio-Temporal Pruning: We propose V-CAST, a plug-and-play token pruning approach that couples per-frame token budgeting with content-driven token selection from the perspective of video spatio-temporal curvature, while preserving on-grid coordinates.

(3) State-of-the-Art Performance and Efficiency: Experiments on long-video benchmarks show that V-CAST achieves the best accuracy-efficiency trade-off on Qwen3-VL-8B-Instruct, Qwen3-VL-32B-Instruct, Qwen3-VL-30B-A3B-Instruct, and LLaVA-OV/Video-7B, while reducing peak memory and end-to-end latency.

Motivation

V-CAST budget analysis

Existing token compression methods often underperform because they allocate budgets in ways that fail to preserve critical temporal turns and informative spatial evidence. This analysis motivates V-CAST to prioritize spatio-temporal coverage instead of relying on uniform or myopic compression under tight budgets.

Comparisons of different spatial selection strategies at R=25%.

| Spatial Selection | MVBench | LongVideoBench | VideoMME (All) | VideoMME (S) | VideoMME (M) | VideoMME (L) |
|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 68.6 | 60.3 | 64.5 | 76.0 | 60.4 | 57.0 |
| **(a) Token Merging** | | | | | | |
| HoliTom [NeurIPS'25] | 63.0 | 56.8 | 59.7 | 71.4 | 54.6 | 53.1 |
| AvgPool (2×2) | 65.4 | 58.6 | 60.5 | 72.3 | 56.7 | 52.4 |
| **(b) Token Pruning** | | | | | | |
| Random (2×2) | 66.9 | 57.8 | 61.9 | 73.6 | 58.7 | 53.4 |
| First (2×2) | 67.7 | 57.7 | 61.1 | 72.2 | 56.2 | 55.0 |
V-CAST rope analysis

Misaligned spatio-temporal information from token merging. Token merging drifts off the discrete (t,h,w) grid and weakens MRoPE bindings, while token pruning preserves on-grid coordinates.
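The coordinate-drift argument can be made concrete with a toy example (illustrative only; the grid sizes and the 2×2 average merge below are assumptions, not the V-CAST implementation):

```python
import numpy as np

# Toy illustration: pruning keeps each surviving token's original integer
# (t, h, w) index, so MRoPE position bindings stay intact, while 2x2
# merging yields fractional coordinates that fall off the discrete grid.
T, H, W = 4, 6, 8                      # frames, grid height, grid width
t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([t, h, w], axis=-1).reshape(-1, 3).astype(float)

# Pruning: keep a subset of tokens -> coordinates remain integral.
rng = np.random.default_rng(0)
keep = rng.choice(coords.shape[0], coords.shape[0] // 4, replace=False)
pruned = coords[keep]
assert np.all(pruned == np.round(pruned))   # still on the (t, h, w) grid

# Merging: average 2x2 spatial neighbours -> fractional (h, w) positions.
merged = coords.reshape(T, H // 2, 2, W // 2, 2, 3).mean(axis=(2, 4))
print(merged[0, 0, 0])                      # [0.  0.5 0.5]
```

The merged token at the top-left of frame 0 sits at (0, 0.5, 0.5), a position that never existed in the encoder's grid, which is exactly the off-grid drift the figure above describes.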

Method

V-CAST formulates token pruning for VideoLLMs as an optimal semantic-trajectory approximation problem under a fixed budget. It applies Curvature-Guided Temporal Allocation to assign per-frame budgets by tracking semantic transitions, and then performs Dual-Anchor Spatial Token Selection to retain diverse and salient tokens while preserving original on-grid coordinates.
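The temporal allocation step can be sketched in a few lines. This is an illustrative approximation, not the released implementation: curvature is taken here as the discrete second difference of L2-normalized frame features, and the function name `curvature_budgets` and its `floor` parameter are assumptions.

```python
import numpy as np

def curvature_budgets(frame_feats, total_budget, floor=1):
    """Sketch: allocate per-frame token budgets from the curvature of the
    frame-level semantic trajectory. High-curvature frames mark semantic
    transitions and get larger budgets; every frame keeps >= `floor` tokens."""
    x = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    # Discrete curvature via the second difference x[t-1] - 2 x[t] + x[t+1].
    curv = np.zeros(len(x))
    curv[1:-1] = np.linalg.norm(x[:-2] - 2.0 * x[1:-1] + x[2:], axis=1)
    curv[0], curv[-1] = curv[1], curv[-2]           # pad the endpoints
    w = curv + 1e-6                                 # avoid all-zero weights
    budgets = np.full(len(x), floor, dtype=int)     # reserve the floor first
    extra = total_budget - floor * len(x)
    budgets += np.floor(extra * w / w.sum()).astype(int)
    # Hand rounding leftovers to the sharpest semantic turns.
    for i in np.argsort(-curv)[: total_budget - budgets.sum()]:
        budgets[i] += 1
    return budgets
```

For a clip whose features jump once, the two frames around the transition receive most of the budget while flat segments keep only the floor, which matches the coverage behaviour the method targets.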

V-CAST framework

Overall framework of V-CAST. The method first allocates temporal budget with curvature cues, then performs dual-anchor spatial token pruning using contextual diversity and feature activation anchors.
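The spatial step can likewise be sketched as a greedy selection balancing two anchors. This is a minimal sketch under stated assumptions: activation is approximated by the feature norm, diversity by the distance to the already-selected set, and the name `dual_anchor_select` and the `alpha` weight are illustrative, not the paper's exact scoring.

```python
import numpy as np

def dual_anchor_select(tokens, k, alpha=0.5):
    """Sketch: pick k token indices for one frame by greedily maximising
    alpha * activation + (1 - alpha) * diversity. Returning indices (not
    merged features) preserves the original on-grid (t, h, w) coordinates."""
    act = np.linalg.norm(tokens, axis=1)
    act = act / (act.max() + 1e-8)
    selected = [int(np.argmax(act))]        # activation anchor seeds the set
    while len(selected) < k:
        # Diversity anchor: distance to the closest already-selected token.
        d = np.min(
            np.linalg.norm(tokens[:, None] - tokens[selected][None], axis=-1),
            axis=1,
        )
        d = d / (d.max() + 1e-8)
        score = alpha * act + (1 - alpha) * d
        score[selected] = -np.inf           # never re-pick a token
        selected.append(int(np.argmax(score)))
    return np.array(selected)
```

Because the output is a set of indices, the kept tokens can be fed to the LLM with their untouched MRoPE positions, in contrast to merging-based compressors.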

Main Results

- **98.6%** of the original performance retained
- **+1.1%** over the second-best baseline
- **86.7%** peak memory of vanilla Qwen3-VL-8B
- **86.4%** total latency of vanilla Qwen3-VL-8B
Performance comparison with other baselines on Qwen3-VL-8B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| **Max Input Frames = 64** | | | | | | | | | |
| Qwen3-VL-8B-Instruct | 69.2 | 62.8 | 68.9 | 66.9 | 78.8 | 66.2 | 55.8 | 67.0 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 64.8 | 58.9 | 63.4 | 63.1 | 73.7 | 60.3 | 55.4 | 62.6 | 93.4 |
| VidCom2 [EMNLP'25] | 67.5 | 59.6 | 64.0 | 64.9 | 75.4 | 63.4 | 55.9 | 64.0 | 95.5 |
| FastVID [NeurIPS'25] | 66.8 | 60.4 | 65.0 | 62.3 | 73.9 | 60.7 | 52.2 | 63.6 | 94.9 |
| HoliTom [NeurIPS'25] | 64.8 | 58.9 | 63.4 | 62.7 | 74.4 | 60.6 | 53.2 | 62.5 | 93.3 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 68.4 | 61.2 | 65.4 | 65.1 | 76.8 | 63.2 | 55.3 | 65.0 | 97.0 |
| **Retention Ratio = 15%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.0 | 57.7 | 61.4 | 60.4 | 70.3 | 58.4 | 52.6 | 60.4 | 90.1 |
| VidCom2 [EMNLP'25] | 64.3 | 57.4 | 60.3 | 62.9 | 72.6 | 61.3 | 54.7 | 61.2 | 91.3 |
| FastVID [NeurIPS'25] | 64.9 | 59.7 | 63.3 | 60.5 | 72.2 | 57.2 | 52.1 | 62.1 | 92.7 |
| HoliTom [NeurIPS'25] | 62.7 | 58.3 | 61.8 | 60.9 | 72.3 | 58.2 | 52.2 | 60.9 | 90.9 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 66.3 | 59.6 | 64.4 | 63.9 | 76.3 | 61.4 | 53.8 | 63.6 | 94.9 |
| **Max Input Frames = 32** | | | | | | | | | |
| Qwen3-VL-8B-Instruct | 68.6 | 60.3 | 63.5 | 64.5 | 76.0 | 60.4 | 57.0 | 64.2 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.2 | 56.7 | 60.8 | 60.1 | 69.6 | 56.7 | 54.2 | 60.0 | 93.5 |
| VidCom2 [EMNLP'25] | 67.0 | 58.0 | 60.6 | 62.4 | 72.1 | 59.1 | 56.1 | 62.0 | 96.6 |
| FastVID [NeurIPS'25] | 67.3 | 58.7 | 60.7 | 60.5 | 72.0 | 56.8 | 52.7 | 61.8 | 96.3 |
| HoliTom [NeurIPS'25] | 63.0 | 56.8 | 61.2 | 59.7 | 71.4 | 54.6 | 53.1 | 60.2 | 93.8 |
| FlashVID [ICLR'26] | 67.5 | 58.8 | 61.7 | 62.3 | 74.4 | 58.7 | 53.9 | 62.6 | 97.5 |
| V-CAST [Ours] | 67.9 | 58.2 | 63.5 | 63.5 | 74.4 | 60.2 | 55.8 | 63.3 | 98.6 |
| **Retention Ratio = 15%** | | | | | | | | | |
| VisionZip [CVPR'25] | 60.1 | 56.2 | 60.0 | 58.2 | 66.9 | 54.9 | 52.9 | 58.6 | 91.3 |
| VidCom2 [EMNLP'25] | 64.2 | 56.0 | 57.7 | 59.6 | 68.9 | 56.7 | 53.3 | 59.4 | 92.5 |
| FastVID [NeurIPS'25] | 66.1 | 57.1 | 59.0 | 58.3 | 69.6 | 54.3 | 51.1 | 60.1 | 93.6 |
| HoliTom [NeurIPS'25] | 60.0 | 55.8 | 59.6 | 58.3 | 68.9 | 54.7 | 51.7 | 58.4 | 91.0 |
| FlashVID [ICLR'26] | 66.5 | 57.8 | 60.0 | 61.4 | 72.4 | 57.6 | 54.1 | 61.4 | 95.6 |
| V-CAST [Ours] | 66.2 | 57.7 | 60.1 | 62.0 | 71.9 | 58.3 | 55.9 | 61.5 | 95.8 |
Performance comparison on the larger model Qwen3-VL-32B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-32B-Instruct | 73.2 | 62.4 | 66.1 | 69.3 | 78.3 | 67.1 | 62.4 | 67.8 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| VisionZip [CVPR'25] | 62.9 | 60.0 | 62.2 | 64.8 | 72.8 | 63.1 | 58.6 | 62.5 | 92.2 |
| VidCom2 [EMNLP'25] | 70.2 | 60.4 | 64.9 | 67.0 | 75.6 | 64.3 | 61.2 | 65.6 | 96.8 |
| FastVID [NeurIPS'25] | 71.0 | 60.9 | 64.5 | 65.1 | 75.1 | 62.6 | 57.7 | 65.4 | 96.5 |
| HoliTom [NeurIPS'25] | 62.3 | 60.2 | 63.2 | 64.9 | 74.2 | 61.2 | 59.1 | 62.6 | 92.3 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - | - |
| V-CAST [Ours] | 71.8 | 61.5 | 64.7 | 68.1 | 77.7 | 65.9 | 60.8 | 66.5 | 98.1 |
Performance comparison on the MoE-based model Qwen3-VL-30B-A3B-Instruct.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - |
| **Retention Ratio = 25%** | | | | | | | | |
| VisionZip [CVPR'25] | 61.3 | 62.4 | 64.1 | 62.7 | 70.7 | 61.3 | 56.0 | 62.6 |
| VidCom2 [EMNLP'25] | 69.7 | 65.5 | 68.5 | 68.1 | 79.1 | 68.0 | 58.9 | 68.0 |
| FastVID [NeurIPS'25] | 68.2 | 60.7 | 64.8 | 62.4 | 70.3 | 61.6 | 55.2 | 64.0 |
| HoliTom [NeurIPS'25] | 66.0 | 63.9 | 64.7 | 64.6 | 73.6 | 63.9 | 56.2 | 64.8 |
| FlashVID [ICLR'26] | OOM | OOM | OOM | OOM | OOM | OOM | OOM | - |
| V-CAST [Ours] | 69.2 | 66.6 | 68.2 | 68.2 | 78.9 | 65.9 | 59.8 | 68.1 |
V-CAST frame scaling

Consistent gains with more frames on Qwen3-VL-8B-Instruct. Performance trends on LongVideoBench, MLVU, VideoMME (Long), and EgoSchema as input frames increase. V-CAST improves accuracy and scales to longer inputs, while some baselines show limited gains or OOM failures at larger frame counts.

Performance comparison on LLaVA-OV-7B.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-OV-7B | 58.3 | 56.6 | 63.1 | 58.4 | 69.9 | 56.7 | 48.8 | 59.1 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| FastV [ECCV'24] | 55.5 | 53.3 | 59.6 | 55.3 | 65.0 | 53.8 | 47.0 | 55.9 | 94.6 |
| PDrop [CVPR'25] | 55.3 | 51.3 | 57.1 | 55.5 | 64.7 | 53.1 | 48.7 | 54.8 | 92.7 |
| SparseVLM [ICML'25] | 56.4 | 53.9 | 60.7 | 57.3 | 68.4 | 55.2 | 48.1 | 57.1 | 96.6 |
| VisionZip [CVPR'25] | 56.9 | 56.0 | 62.9 | 58.0 | 68.9 | 57.4 | 47.6 | 58.5 | 99.0 |
| PruneVid [ACL'25] | 55.7 | 55.1 | 63.4 | 57.0 | 68.8 | 54.4 | 47.7 | 57.8 | 97.8 |
| FrameFusion [ICCV'25] | 56.0 | 54.8 | 61.7 | 57.5 | 68.2 | 55.7 | 48.6 | 57.5 | 97.3 |
| FastVID [NeurIPS'25] | 56.5 | 56.3 | 60.9 | 58.3 | 69.4 | 58.2 | 47.2 | 58.0 | 98.1 |
| VidCom2 [EMNLP'25] | 57.0 | 55.4 | 62.8 | 58.4 | 69.3 | 56.3 | 49.4 | 58.4 | 98.8 |
| V-CAST [Ours] | 57.4 | 56.4 | 62.9 | 58.6 | 70.7 | 56.0 | 49.1 | 58.8 | 99.5 |
Performance comparison on LLaVA-Video-7B.

| Methods | MVBench | LongVideoBench | MLVU | VideoMME (Overall) | VideoMME (Short) | VideoMME (Medium) | VideoMME (Long) | Avg. Score | Avg. % |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Video-7B | 60.4 | 58.9 | 67.3 | 64.4 | 77.3 | 62.4 | 53.4 | 62.8 | 100.0 |
| **Retention Ratio = 25%** | | | | | | | | | |
| FastV [ECCV'24] | 52.1 | 54.8 | 57.8 | 58.6 | 68.7 | 58.4 | 48.7 | 55.8 | 88.9 |
| SparseVLM [ICML'25] | 55.4 | 54.2 | 58.9 | 60.1 | 71.1 | 59.1 | 50.1 | 57.2 | 91.1 |
| VisionZip [CVPR'25] | 57.9 | 56.3 | 62.6 | 62.5 | 73.6 | 62.3 | 51.9 | 59.8 | 95.2 |
| HoliTom [NeurIPS'25] | 58.4 | 57.1 | 60.5 | 63.0 | 74.6 | 62.3 | 52.1 | 59.8 | 95.2 |
| VidCom2 [EMNLP'25] | 57.0 | 57.1 | 58.7 | 61.7 | 73.0 | 61.7 | 50.0 | 58.6 | 93.4 |
| V-CAST [Ours] | 58.0 | 57.3 | 61.6 | 62.7 | 74.0 | 61.2 | 52.4 | 59.9 | 95.4 |
Inference efficiency. Qwen3-VL-8B-Instruct on VideoMME at R=25% with max input frames = 32. Lower is better for latency and memory; higher is better for throughput and performance.

| Methods | Prefilling Latency (s) ↓ | LLM Generation Latency (s) ↓ | Total Latency (s) ↓ | GPU Peak Memory (MB) ↓ | Throughput (item/s) ↑ | Performance (Score) ↑ |
|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 243.6 | 280.9 | 1440.4 | 22478.0 | 1.87 | 64.5 |
| Random | 116.9 | 147.3 | 1271.9 | 19547.5 | 2.12 | 61.5 |
| VisionZip [CVPR'25] | 155.6 | 190.5 | 1276.8 | 19974.7 | 2.11 | 60.1 |
| VidCom2 [EMNLP'25] | 119.0 | 150.2 | 1271.0 | 19547.5 | 2.12 | 62.4 |
| FastVID [NeurIPS'25] | 120.6 | 153.6 | 1317.6 | 19547.5 | 2.05 | 60.5 |
| HoliTom [NeurIPS'25] | 134.3 | 160.8 | 1261.8 | 19974.7 | 2.14 | 59.7 |
| FlashVID [ICLR'26] | 146.6 | 175.0 | 1308.7 | 41363.6 | 2.06 | 62.3 |
| V-CAST [Ours] | 118.8 | 149.9 | 1245.1 | 19547.5 | 2.17 | 63.5 |
V-CAST efficiency comparison

Efficiency comparison on LLaVA-OneVision-7B. We compare Vanilla, FastVID, VidCom2, and V-CAST on Prefill Latency, Total Latency, and Peak Memory.

Analysis

V-CAST frame budget allocation

Visualization of frame budget allocation. We compare Uniform allocation, Global Uniqueness, and Curvature-Aware allocation under R=25%. A higher curve indicates a larger per-frame token budget.

Qualitative Results

V-CAST qualitative comparison

Qualitative comparison. V-CAST highlights task-critical moments and yields correct answers where baselines and even the vanilla model fail.

Citation

@misc{lin2026vcast,
  title={V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models},
  author={Xinying Lin and Xuyang Liu and Yiyu Wang and Teng Ma and Wenqi Ren},
  year={2026},
  howpublished={\url{https://github.com/xinyouu/V-CAST}}
}