
Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities

Generative AI has evolved from text-based models to multimodal models, with a recent expansion into video, opening up new potential uses across industries. Video models can create new experiences for users or simulate scenarios for training autonomous agents at scale, and they are helping revolutionize industries including robotics, autonomous vehicles, and entertainment.

The development of video foundation models presents unique challenges due to the vast and varied nature of video data, which underscores the need for scalable pipelines to curate data and to train models that can comprehend temporal and spatial dynamics.

We are announcing new video foundation model capabilities in the NVIDIA NeMo framework, an end-to-end training framework that enables you to pretrain and fine-tune your own video foundation models. The framework includes high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized in-framework inference.

High-throughput video curation through optimized pipelines

NeMo Curator improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. 

Using NeMo Curator’s scalable data pipelines, you can efficiently clip, annotate, and filter 100 PB or more of videos. To remove bottlenecks and optimize performance, NeMo Curator uses the following combination:

  • NVDEC: Hardware decoder
  • NVENC: Hardware encoder
  • Ray: Compute framework for scaling AI applications

The NeMo Curator autobalancing techniques can leverage heterogeneous clusters with multiple GPU types to take advantage of NVENC on L40S GPUs and the performance of H100 and GB200 GPUs. 
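
As a hedged illustration of how a Ray-based stage can fan work out across a heterogeneous cluster, the sketch below schedules decode workers onto GPUs tagged with a custom "l40s" resource label. The worker class, resource label, and decode logic are placeholders for illustration, not NeMo Curator's actual API.

```python
# Minimal sketch (not NeMo Curator code): scaling one pipeline stage with Ray actors.
# The "l40s" custom resource label is assumed to be configured on the cluster nodes.
import ray

ray.init()  # in practice, connect to an existing cluster with ray.init(address="auto")

@ray.remote(num_gpus=1, resources={"l40s": 1})
class DecodeWorker:
    """One actor per GPU; NVDEC-accelerated decoding would run inside decode()."""
    def decode(self, video_path: str) -> dict:
        # Placeholder for hardware-accelerated decode and clip splitting.
        return {"path": video_path, "status": "decoded"}

# Fan a batch of videos out across a small pool of workers.
workers = [DecodeWorker.remote() for _ in range(4)]
videos = [f"raw/video_{i:05d}.mp4" for i in range(16)]
futures = [workers[i % len(workers)].decode.remote(v) for i, v in enumerate(videos)]
print(ray.get(futures))
```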

Figure 1 shows how NeMo Curator can process 20M hours of video data, reducing the processing time from years to days and achieving an 89x speedup with 1K GPUs compared to unoptimized CPU pipelines at ISO power usage.

The bar chart compares the processing time of curating 20M hours of video data with unoptimized and optimized pipelines on H100 and GB200 GPU systems. An unoptimized pipeline takes 3.4 years as opposed to 40 days on the H100 and L40S system and 14 days with the GB200 and L40S system.
Figure 1. NeMo Curator delivers 89x faster video data processing

NeMo Curator provides two pipelines for building video foundation model training and fine-tuning datasets: clipping and sharding.

The clipping pipeline starts by decoding raw videos and splitting them into short, continuous clips based on frame-to-frame color changes. The stitching stage then uses image embedding similarities to merge visually similar adjacent clips. The resulting clips are transcoded to a high-quality encoding (H.264) and annotated with video embeddings and captions, either existing or synthetically generated by a vision language model (VLM), to enable semantic search.
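
To make the splitting step concrete, here is a small, hedged sketch of shot-boundary detection from frame-to-frame color changes using OpenCV histogram comparison. It illustrates the idea only; NeMo Curator's clipping stage is GPU-accelerated and more sophisticated, and the threshold below is an arbitrary illustrative value.

```python
# Conceptual sketch of splitting a video into clips at sharp color changes.
# Requires: pip install opencv-python
import cv2

def split_into_clips(path: str, threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) ranges separated by abrupt color-histogram changes."""
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, frame_idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Low correlation between consecutive histograms marks a cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    boundaries.append(frame_idx)
    return list(zip(boundaries[:-1], boundaries[1:]))
```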

The diagram shows the clipping pipeline with video download, decode & splitting, stitching, transcoding, video embedding, and captioning. It also shows the sharding pipeline, with text embedding and dataset creation. Both are supported by NeMo Curator for processing video data.
Figure 2. Video curation clipping and sharding pipelines

Sharding generates text embeddings for captions to create the final WebDataset used for training. NeMo Curator also uses Ray streaming to build an auto-balancing system and deploy an optimal number of workers for each stage in the pipeline to avoid being bottlenecked by any stage (Figure 3). 
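
As a hedged sketch of what the sharding output looks like, the snippet below packages clips, captions, and text embeddings into WebDataset tar shards. The keys, paths, and zero-valued embedding are placeholders for illustration, not NeMo Curator's actual schema.

```python
# Minimal sketch of writing clips, captions, and text embeddings into WebDataset shards.
# Requires: pip install webdataset numpy
import io
import numpy as np
import webdataset as wds

clips = [
    {"id": "clip_00001", "video": "clips/clip_00001.mp4", "caption": "a car turns left at night"},
    {"id": "clip_00002", "video": "clips/clip_00002.mp4", "caption": "a robot arm picks up a box"},
]

with wds.ShardWriter("shards/train-%06d.tar", maxcount=10_000) as sink:
    for clip in clips:
        text_emb = np.zeros(512, dtype=np.float32)  # stand-in for a real text encoder output
        buf = io.BytesIO()
        np.save(buf, text_emb)
        sink.write({
            "__key__": clip["id"],                    # groups files belonging to one sample
            "mp4": open(clip["video"], "rb").read(),  # raw video bytes
            "txt": clip["caption"],                   # caption as plain text
            "text_emb.npy": buf.getvalue(),           # serialized embedding bytes
        })
```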

The diagram shows the before and after, where the before stages have wasted GPU cycles and only produce 1 video with 3 GPUs and the after stages produce 4 videos using 7 GPUs for a 1.7x improvement.
Figure 3. Auto-balancing system to match the throughput of the overall pipeline

Efficient multimodal dataloading

Video models can be trained on billions of images and millions of videos, necessitating an efficient data loading strategy to achieve high throughput during training time. 

This is accomplished in the NeMo framework through the Megatron-Energon data loader:

  • Shard large-scale data: Uses the WebDataset format to shard a TB-size dataset into compressed files to help reduce I/O overhead during training. 
  • Deterministic save and load: Enables the dataset to be visited in one pass without repetition when the training job is disrupted, ensuring consistency across different training cluster setups.  
  • Sequence packing: Packs variable-length or variable-resolution images and videos together up to the maximum sequence length, minimizing compute wasted on padding while simplifying the data loading logic. NeMo uses the special THD attention kernel from Transformer Engine to support accelerated training with sequence packing (see the sketch after this list). 
The diagram shows how a single frame, a short video of 2-30 frames, and a long video of 30-100 frames are batched with a pad for training.
Figure 4. Mixed image-video training with sequence packing
  • Reduce network bandwidth strain: Each model parallel rank downloads a different subset of data instead of the whole dataset, and then all-gathers the data across ranks to get an identical dataloader.
The diagram shows that reducing the strain on the network bandwidth, by downloading a different subset of data, improves training throughput.
Figure 5. Reducing network bandwidth strain to improve training throughput
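
To make the sequence packing idea concrete, here is a small, hedged sketch of greedy first-fit packing of variable-length samples into a fixed token budget. It illustrates the concept only; NeMo and Megatron-Energon handle packing (and the THD attention layout) internally.

```python
# Conceptual sketch of sequence packing: place variable-length samples into packed
# sequences of at most max_seq_len tokens so that padding is minimized.

def pack_sequences(lengths: list[int], max_seq_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing; returns sample indices per packed sequence."""
    bins: list[tuple[int, list[int]]] = []  # (tokens_used, sample_indices)
    for idx, length in sorted(enumerate(lengths), key=lambda x: -x[1]):
        for i, (used, members) in enumerate(bins):
            if used + length <= max_seq_len:
                bins[i] = (used + length, members + [idx])
                break
        else:
            bins.append((length, [idx]))
    return [members for _, members in bins]

# Example: an image (256 tokens), a short clip (4,096), and a long clip (30,720).
print(pack_sequences([256, 4_096, 30_720], max_seq_len=32_768))  # -> [[2, 0], [1]]
```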

Scaling video foundation model training

Video foundation models can be either autoregressive or diffusion models. 

The well-established suite of NeMo tools for large language models (LLMs) can be reused for autoregressive models, while support for diffusion transformers such as DiT, MovieGen, and the latest NVIDIA Cosmos world foundation models for physical AI has been newly added.

The NeMo tech stack is highly optimized and provides more than 40% Model FLOPs utilization (MFU) in the latest benchmark (Table 1).

Model size | Context length | Training config | GPU utilization (TFLOPS/s) | Throughput (tokens/s/GPU)
DiT 7B | 8k | Baseline, no optimization | OOM | –
DiT 7B | 8k | CP=2 | 457 | 8,969
DiT 7B | 74k | TP=4 SP CP=4 | 414 | 2,933
DiT 28B | 8k | TP=2 SP PP=2 | 435 | 2,392
DiT 28B | 74k | TP=8 SP CP=4 PP=4 | 411 | 994
Table 1. GPU utilization and throughput benchmark for NVIDIA NeMo framework on diffusion transformers (DiT)

Legend: CP=context parallelism; TP=tensor parallelism; SP=sequence parallelism; PP=pipeline parallelism
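
As a quick sanity check on the Table 1 numbers, MFU is the achieved model TFLOPS per GPU divided by the GPU's peak TFLOPS. A hedged back-of-the-envelope estimate, where the peak value is an assumption for illustration:

```python
# Hedged MFU estimate: achieved model TFLOPS per GPU / peak TFLOPS per GPU.
achieved_tflops = 457   # DiT 7B, 8k context, CP=2 row from Table 1
peak_tflops = 989       # assumed H100 BF16 dense peak, for illustration only
print(f"MFU ≈ {achieved_tflops / peak_tflops:.0%}")  # ≈ 46%, consistent with >40% MFU
```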

Overview of the video diffusion pipeline

A video diffusion training pipeline is generally composed of the following major steps:

  • Tokenize the input image and video with a causal temporal 3D tokenizer to generate 3D spatio-temporal tokens. 
  • Use a transformer decoder conditioned on the diffusion noise schedule timestep t and the text input.
    • Timestep conditioning is applied through an Adaptive Layer Normalization (AdaLN) mechanism, with an option to use AdaLN-LoRA, which further improves Model FLOPs Utilization (MFU) during training (a minimal sketch follows this list). 
    • Text conditioning is applied through a cross attention layer in each transformer block. 
    • The NeMo framework enables you to initialize your transformer decoder based on the canonical DiT architecture or the MovieGen Llama architecture, which uses Grouped-Query Attention (GQA). 
  • Compute the diffusion loss with the parallelized EDM diffusion pipeline using the noise prediction from the diffusion transformer.
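
The sketch referenced in the timestep conditioning bullet is shown below: a minimal, hedged PyTorch illustration of AdaLN-style modulation, in which the timestep embedding produces per-channel scale, shift, and gate parameters. It shows the mechanism only, not NeMo's implementation; AdaLN-LoRA reduces the cost of the modulation projection with low-rank factors.

```python
# Hedged sketch of AdaLN-style timestep conditioning in a DiT block (PyTorch).
import torch
from torch import nn

class AdaLNBlock(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        # Maps the timestep embedding to per-channel shift, scale, and gate parameters.
        self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 3 * hidden))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); t_emb: (batch, hidden)
        shift, scale, gate = self.adaln(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # In a full block, h would pass through attention or the MLP before the gated residual.
        return x + gate.unsqueeze(1) * h
```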

NeMo also applies additional Root Mean Square Layer Normalization (RMSNorm) on the queries and keys before attention blocks to stabilize diffusion training. RMSNorm is applied per attention head to remain compatible with tensor parallelism. 
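
Below is a minimal, hedged sketch of per-head RMSNorm applied to queries and keys before attention; normalizing over each head's channel dimension keeps the operation independent of how heads are sharded under tensor parallelism. It is illustrative, not NeMo's implementation.

```python
# Hedged sketch of per-head RMSNorm on queries and keys (QK-norm) before attention.
import torch
from torch import nn

class QKNorm(nn.Module):
    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(head_dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, seq, head_dim); normalize over each head's channel dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

q_norm, k_norm = QKNorm(128), QKNorm(128)
q = q_norm(torch.randn(2, 16, 1024, 128))  # normalized queries, then attention as usual
k = k_norm(torch.randn(2, 16, 1024, 128))  # normalized keys
```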

The diagram shows data tokenization, training, and decoding.
Figure 6. NeMo video diffusion training pipeline

Parallelism optimizations for video diffusion models

NeMo and Megatron-Core enable various model parallelism techniques:

  • Tensor parallel (TP)
  • Sequence parallel (SP)
  • Pipeline parallel (PP)
  • Context parallel (CP)

However, these techniques face unique challenges when applied to video diffusion transformers. Here’s how NeMo solves these challenges to achieve scalable and performant training:

  • Efficient pipeline parallelism for conditioning
  • Support for Spatio-Temporal DiT (ST-DiT) architecture
  • Customized random seeding mechanism

The traditional approach communicates conditioning information across pipeline stages, incurring additional communication cost and requiring nontrivial modifications to the pipeline schedule. NeMo solves this problem by recomputing the conditional embeddings at each pipeline stage. This recomputation is much cheaper than the communication it replaces, improving training throughput.
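
A simplified, hedged sketch of the idea: each pipeline stage owns a copy of the small conditioning embedders and recomputes the embeddings locally from the replicated conditioning inputs, so only the hidden states travel between stages. The module names and structure are illustrative, not NeMo's API.

```python
# Hedged sketch: recompute conditioning embeddings per pipeline stage instead of
# communicating them between stages.
import torch
from torch import nn

class PipelineStage(nn.Module):
    def __init__(self, hidden: int, blocks: int):
        super().__init__()
        # Cheap embedder replicated on every stage rather than sent over the pipeline.
        self.t_embedder = nn.Sequential(nn.Linear(256, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, 8, batch_first=True) for _ in range(blocks)
        )

    def forward(self, hidden_states: torch.Tensor, t_freq: torch.Tensor) -> torch.Tensor:
        t_emb = self.t_embedder(t_freq)  # recomputed locally; negligible FLOPs
        for block in self.blocks:
            hidden_states = block(hidden_states + t_emb.unsqueeze(1))
        return hidden_states  # only this tensor is sent to the next pipeline stage
```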

The diagram shows pipeline parallelism mechanism for conditionings in a video diffusion model.
Figure 7. Trading communication for compute in conditioning pipeline parallelism

The Spatio-Temporal DiT (ST-DiT) architecture adds spatial and temporal self-attention layers to each transformer block as an alternative to training with full self-attention on long video sequences. Under context parallelism, these layers expose communication overhead because they perform relatively little compute over short input sequences. NeMo addresses this by using local attention computation with all-to-all (A2A) communication for the spatial and temporal attention layers, while maintaining the P2P ring topology for full self-attention. This hybrid approach reduces the bandwidth needed for temporal and spatial attention while still benefiting from context parallelism on the full self-attention layers (Table 2).
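
Below is a hedged, simplified sketch of the A2A reshard used for temporal self-attention: the sequence starts sharded over time, and an all-to-all regroups it so each rank holds the full time axis for a slice of spatial positions, letting temporal attention run as a purely local computation (a reverse A2A restores the original layout afterward). It illustrates the communication pattern only, not NeMo's implementation.

```python
# Hedged sketch: reshard from time-sharded to space-sharded layout with all-to-all.
import torch
import torch.distributed as dist

def reshard_time_to_space(x_local: torch.Tensor, cp_group) -> torch.Tensor:
    # x_local: (b, t_local, hw, d), with the time axis sharded across cp ranks.
    cp = dist.get_world_size(cp_group)
    b, t_local, hw, d = x_local.shape
    # Split the spatial axis into cp chunks and lay the chunks out along dim 0 for A2A.
    send = x_local.reshape(b, t_local, cp, hw // cp, d).permute(2, 0, 1, 3, 4).contiguous()
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=cp_group)
    # recv[i] holds rank i's time slice for this rank's spatial chunk; stitch time back together.
    full_t = recv.permute(1, 0, 2, 3, 4).reshape(b, cp * t_local, hw // cp, d)
    return full_t  # (b, t, hw/cp, d): run temporal self-attention locally on this layout
```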

The diagram shows the spatial-temporal diffusion transformer block, where input is passed into the full attention layers, followed by spatial and temporal self attention layers, followed by the cross attention layer and MLP layer.
Figure 8. Spatial-temporal DiT transformer block

Layer | Input seq | Communication primitive | Communication bandwidth
Temporal self-attention | Short seq | Local compute & A2A | (b*h*w/cp, t, d)
Spatial self-attention | Short seq | Local compute & A2A | (b*t/cp, h*w, d)
Full attention | Long seq | CP with P2P | (b, h*w*t/cp, d)
Table 2. NeMo communication strategies for each kind of layer

Legend: b=batch size; h*w=spatial size; t=temporal size; cp=context parallel size; d=hidden size, with input size being (b, t*h*w, d).

The goal of the customized random seeding mechanism is to ensure that random seeds are correctly initialized across the following components: 

  • Time step
  • Gaussian noise
  • The actual model weights

Table 3 shows NeMo’s initialization strategy. 

RNG seed | Data parallel | Context parallel | Pipeline parallel | Tensor parallel
Time step (t) | Diff | Same | Same | Same
Gaussian noise | Diff | Diff | Same | Same
Weight initialization | Same | Same | Diff | Diff
Table 3. Customized random seeding for parallelized diffusion transformers

Legend: Diff=Different random seed from other parallel ranks; Same=Same random seed as other parallel ranks.
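
A hedged sketch of how per-component seeds could be derived so they vary or stay fixed across ranks exactly as in Table 3; the offsets are illustrative, and NeMo/Megatron-Core manage RNG state internally.

```python
# Hedged sketch of per-rank seed derivation matching Table 3.
def derive_seeds(base_seed: int, dp_rank: int, cp_rank: int, pp_rank: int, tp_rank: int) -> dict:
    return {
        # Timestep t: varies only across data-parallel ranks.
        "timestep": base_seed + dp_rank,
        # Gaussian noise: varies across data- and context-parallel ranks.
        "noise": base_seed + 1_000 * dp_rank + cp_rank,
        # Weight initialization: varies across pipeline- and tensor-parallel ranks only,
        # so data- and context-parallel replicas start from identical weights.
        "weights": base_seed + 1_000 * pp_rank + tp_rank,
    }

print(derive_seeds(1234, dp_rank=0, cp_rank=1, pp_rank=0, tp_rank=2))
```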

Efficient in-framework inference

The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. After parallel denoising, the latent tensors are combined to reconstruct the video sequence before decoding with the Cosmos video tokenizer. 
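
A hedged, simplified sketch of this flow is shown below: each GPU denoises its shard of the latent sequence (cross-shard attention is handled by context parallelism inside the model), and the shards are gathered before the video tokenizer decodes them. The denoise_shard callable and the shapes are placeholders, not NeMo APIs.

```python
# Hedged sketch of context-parallel denoising followed by an all-gather of latent shards.
import torch
import torch.distributed as dist

def parallel_denoise(latent_shard: torch.Tensor, denoise_shard, cp_group, num_steps: int = 35):
    # latent_shard: (batch, seq_len // cp_size, hidden), this rank's slice of the noisy latents.
    for step in reversed(range(num_steps)):
        latent_shard = denoise_shard(latent_shard, step)  # CP attention happens inside the model
    # Reassemble the full latent sequence on every rank before tokenizer decoding.
    shards = [torch.empty_like(latent_shard) for _ in range(dist.get_world_size(cp_group))]
    dist.all_gather(shards, latent_shard, group=cp_group)
    return torch.cat(shards, dim=1)  # (batch, seq_len, hidden)
```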

Benchmarks show 80–90% scaling efficiency on up to 32 H100 GPUs, with FP8 Multi-Head Attention providing 28% and 48% performance improvements over BF16 on 1 and 32 GPUs, respectively. 

The diagram shows noise sent through context parallelism on an 8-GPU node for denoising and producing a generated video.
Figure 9. Parallelized video generation with context parallelism
The bar graph shows inference speedup on FP16, FP8 (FP16 MHA), and FP8 (FP8 MHA) on 1, 8, 16, and 32 GPU systems.
Figure 10. Inference performance at different GPU counts

Conclusion

In this post, we covered the NVIDIA NeMo framework features that help you pretrain or fine-tune video foundation models effectively and efficiently. 

NeMo Curator offers high-throughput data curation through clipping and sharding pipelines, and the Megatron-Energon library offers efficient multimodal data loading. The NeMo framework enables scalable video foundation model training by supporting various model parallelism techniques optimized for diffusion and autoregressive models. In addition, it provides efficient in-framework inference by distributing denoising operations across multiple GPUs and incorporating FP8 Multi-Head Attention. 

You can curate your video data with the NeMo Curator early access program, tokenize it, pretrain (diffusion, autoregressive), fine-tune (diffusion, autoregressive), and run multi-GPU in-framework inference (diffusion, autoregressive) with the NeMo framework today.

You can also try the NVIDIA Cosmos world foundation models at build.nvidia.com and watch the CES keynote from NVIDIA CEO Jensen Huang to learn more about the NVIDIA Cosmos world foundation model platform.

Acknowledgements

Thanks to the following contributors: Parth Mannan, Xiaowei Ren, Zhuoyao Wang, Carl Wang, Jack Chang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Linnan Wang, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Jacob Huffman, Tommy Huang, Nima Tajbakhsh, and Ashwath Aithal.
