
Western AI

Video Virality Predictor

October 2025 — March 2026

A multimodal YouTube Shorts forecasting platform that collects short-form video data, generates video/audio/text embeddings, fuses modalities, trains leakage-resistant predictors, and serves 7-day and 30-day view forecasts through a conference-demoed web application.

GitHub →

Overview

Video Virality Predictor is an end-to-end multimodal machine learning system for forecasting the future performance of YouTube Shorts. The project was designed around a practical question: can early video, audio, text, and metadata signals be used to estimate how many views a short-form video will receive after fixed post-publish horizons?

The final system predicts YouTube Shorts view counts at 7-day and 30-day horizons using a reproducible pipeline that collects videos, extracts raw media artifacts, generates dense embeddings for each modality, fuses those embeddings into model-ready feature sets, trains multiple supervised regressors, evaluates model behavior across missing-modality slices, and exposes the result through a live web application for inference on uploaded MP4 files and metadata.

My work focused on redesigning the project from a weak early supervised-learning setup into a more rigorous multimodal forecasting system. The key change was introducing fixed forecasting horizons and stronger data-alignment contracts. Instead of training on inconsistent mixed-age targets, the redesigned pipeline labels each video at explicit post-publish ages, joins all modalities through a canonical video_id, and trains models against clear 7-day and 30-day targets.

The system was showcased at CUCAI through a co-authored paper and demo application.

Problem

Short-form video virality is difficult to model because view counts are noisy, heavy-tailed, and heavily dependent on timing. A raw snapshot of a video's view count is not necessarily a fair label: a video with 5,000 views after one hour is very different from a video with 5,000 views after 30 days.

The original project faced three major issues:

  1. Temporal label inconsistency — videos were being compared at different ages, producing noisy supervision.
  2. Multimodal alignment complexity — metadata, downloaded video, extracted audio, transcripts, embeddings, and fused vectors all had to refer to the same item without silent mismatches.
  3. Missing modality behavior — transcripts were not always available, and missing text needed to be represented explicitly instead of accidentally corrupting downstream models.

The redesigned project addresses these issues by making the entire pipeline video_id-native, horizon-labeled, restartable, and schema-locked.

My Role

I contributed across the data engineering, machine learning, evaluation, and demo layers of the project. My work included:

System Architecture

At a high level, the system follows this pipeline:

YouTube Shorts discovery
        ↓
Fixed-horizon metadata labels
        ↓
Video download
        ↓
Audio extraction
        ↓
Transcript collection / ASR
        ↓
Video, audio, and text embeddings
        ↓
Multimodal fusion
        ↓
Supervised model training
        ↓
Model evaluation and stacking
        ↓
Clustering and interpretation
        ↓
Web application inference

Each stage uses deterministic naming, per-stage state tracking, and canonical video_id joins so that data can be processed incrementally without losing alignment between modalities.

Data Collection Pipeline

The data collection layer constructs the canonical dataset used by all downstream modeling stages. It discovers YouTube Shorts, labels them at fixed post-publish horizons, downloads raw videos, derives audio, and collects transcript text.

Fixed-Horizon Metadata Labels

The most important redesign was replacing arbitrary view-count snapshots with fixed-horizon labels.

For each video, the pipeline records:

This turns the prediction task into a supervised regression problem with well-defined targets:

This change improved dataset quality because the model no longer had to learn from mixed-age labels where some videos were only hours old and others were weeks old.

YouTube Shorts Discovery

The collector searches YouTube Shorts using broad seed queries and short-duration filters. For each run, it builds publish windows centered around now - horizon_days with a tolerance window, deduplicates discovered video_ids, fetches video and channel metadata, and appends horizon-labeled rows to the canonical metadata CSV.

Primary artifacts:

The SQLite label store deduplicates (video_id, horizon_days) so that repeated runs do not repeatedly label the same video-horizon pair.

Canonical Identity and Delta Processing

Every downstream stage uses video_id as the canonical identity. If a metadata row is missing a video_id, the pipeline can parse it from common YouTube URL formats such as:

Each stage also computes a deterministic source_hash, normally from the normalized URL. This lets the system detect whether a video has already been processed, whether it needs to be retried, or whether it should be skipped as unchanged.

The delta-processing contract is:

This makes the pipeline restartable and avoids unnecessary recomputation.
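The delta-processing contract can be sketched as a small skip/retry decision over a per-stage SQLite state table. This is a minimal sketch; the table layout and status names (`done`, `terminal_failure`) are hypothetical stand-ins for the real pipeline's schema.

```python
import hashlib
import sqlite3

def source_hash(url: str) -> str:
    """Deterministic hash of the (assumed pre-normalized) source URL."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()

def needs_processing(db: sqlite3.Connection, video_id: str, h: str) -> bool:
    """Decide whether a stage should process, retry, or skip an item."""
    row = db.execute(
        "SELECT status, source_hash FROM stage_state WHERE video_id = ?",
        (video_id,),
    ).fetchone()
    if row is None:
        return True                      # never seen: process
    status, stored_hash = row
    if status == "terminal_failure":
        return False                     # known-unrecoverable: skip
    if stored_hash != h:
        return True                      # source changed: reprocess
    return status != "done"              # retry anything unresolved
```

Because the hash is computed from the normalized URL, repeated daily runs see the same `source_hash` for unchanged items and skip them without recomputation.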

Video Download

The video downloader consumes metadata delta rows and materializes MP4 files keyed by video_id.

Output pattern:

Data/raw/Video/raw_data/<video_id>.mp4
clipfarm/raw/video/<video_id>.mp4

The downloader includes retry logic and multiple yt-dlp fallback strategies. It also classifies terminal failures such as removed, private, unavailable, upcoming, age-restricted, and authentication-blocked videos.

Audio Extraction

Audio is extracted from downloaded MP4 files using FFmpeg and normalized into a model-friendly waveform format:

ffmpeg -y -i input.mp4 -vn -ac 1 -ar 16000 -acodec pcm_s16le output.wav

The chosen audio format is:

This standardization supports downstream ASR and Wav2Vec2 embedding generation.

Output pattern:

Data/raw/Audio/raw_data/<video_id>.wav
clipfarm/raw/audio/<video_id>.wav

Transcript Collection

The text stage attempts transcript generation using a caption-first strategy:

  1. Try YouTube subtitles when available.
  2. Fall back to ASR over the extracted WAV file.

The operational ASR backend is whisper_cpp. Each transcript is saved as a JSON file containing the final transcript, transcript source, language metadata, caption metadata, optional timestamps, and error/status information.

Output pattern:

Data/raw/Text/raw_data/<video_id>.json
clipfarm/raw/text/<video_id>.json

The system also handles empty transcripts carefully. A first empty result is treated as retryable; a second consecutive empty result is promoted to a terminal empty-transcript status. This matters because fusion later uses terminal text failures to decide whether a zero text placeholder is valid.

Embedding Pipeline

The embedding stage converts raw media artifacts into dense modality-specific vectors. Each modality is processed independently as a delta job against the current metadata horizon CSV and its own state database.

Video Embeddings

Video embeddings are generated using a VideoMAE model.

Key details:

The frame sampling logic is deterministic for a fixed decode order. If a video has fewer frames than required, the final frame is repeated to pad the sequence.
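A minimal sketch of that sampling rule, assuming evenly spaced frame indices and a 16-frame requirement (typical for VideoMAE, though the real configuration may differ):

```python
import numpy as np

def sample_frames(frames: np.ndarray, num_required: int = 16) -> np.ndarray:
    """Deterministically sample evenly spaced frames for a fixed decode order;
    repeat the final frame to pad clips that are too short."""
    n = len(frames)
    if n >= num_required:
        idx = np.linspace(0, n - 1, num_required).round().astype(int)
        return frames[idx]
    pad = np.repeat(frames[-1:], num_required - n, axis=0)
    return np.concatenate([frames, pad], axis=0)
```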

Audio Embeddings

Audio embeddings are generated using Wav2Vec2.

Key details:

Text Embeddings

Text embeddings combine metadata text and transcript text.

Key details:

This design lets the text representation capture both creator-provided context and spoken/captioned content.

Multimodal Fusion

The fusion stage converts per-modality embeddings into one model-ready fused representation per video_id. It emits sharded .npz vector files, a Parquet manifest, and a schema lock file.

Primary script:

Data/common/fuse_embeddings_delta.py

Supported fusion strategies:

Missing-Modality Policy

The project assumes that video and audio embeddings are required. Text is conditionally optional because some Shorts have no usable captions or speech.

The fusion policy is:

This avoids silently mixing rows with different feature semantics. The text_present mask is appended to fused vectors so downstream models can explicitly learn from missing-text behavior.

Fusion Strategies

Let:

The implemented strategies are:

concat:   fused = [v || a || t || m]
sum_pool: fused = [pad(v) + pad(a) + pad(t) || m]
max_pool: fused = [elementwise_max(pad(v), pad(a), pad(t)) || m]

With default 768-dimensional video, audio, and text embeddings, the fused dimensions are:

| Strategy | Dimension |
|---|---:|
| concat | 2305 |
| sum_pool | 769 |
| max_pool | 769 |

The production fusion stage intentionally avoids randomly initialized learned projections. This keeps fusion deterministic, stable across daily runs, and compatible with historical model checkpoints.
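The three strategies can be sketched as follows. This is a simplified illustration: the appended `text_present` mask plays the role of `m`, all embeddings are assumed 768-dimensional (so `pad()` is the identity), and terminal missing text becomes a zero placeholder. The resulting dimensions match the table above.

```python
import numpy as np

def fuse(v, a, t, strategy="concat", text_present=True):
    """Fuse 768-d video/audio/text vectors and append the text_present mask."""
    if not text_present:
        t = np.zeros_like(t)              # terminal missing text -> zero placeholder
    mask = np.array([float(text_present)])
    if strategy == "concat":
        body = np.concatenate([v, a, t])
    elif strategy == "sum_pool":
        body = v + a + t
    elif strategy == "max_pool":
        body = np.maximum.reduce([v, a, t])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.concatenate([body, mask])
```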

Schema Locking

Each fusion strategy writes a schema.json file containing:

If a future run produces vectors with incompatible dimensions, the stage fails instead of silently creating train/serve skew.
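The fail-fast check can be sketched like this; the `strategy` and `dim` keys are assumed field names, not necessarily those of the real schema.json.

```python
import json
from pathlib import Path

def check_schema_lock(schema_path: Path, produced_dim: int, strategy: str) -> None:
    """Fail the stage if newly produced vectors disagree with the locked schema,
    instead of silently creating train/serve skew."""
    schema = json.loads(schema_path.read_text())
    if schema["strategy"] != strategy or schema["dim"] != produced_dim:
        raise RuntimeError(
            f"schema mismatch for {strategy}: "
            f"locked dim {schema['dim']}, produced {produced_dim}"
        )
```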

Supervised Training Pipeline

The training stage builds supervised virality predictors from two inputs:

  1. Horizon-labeled metadata.
  2. Fused embedding pointers and shards.

Primary script:

Super_Predict/train_suite_from_horizon.py

For each (fusion_strategy, target_horizon_days) pair, the training pipeline reconstructs vectors from fused shards, joins them with horizon-labeled metadata, creates train/validation/test splits, trains several model families, and writes snapshot artifacts to S3.

Target Definition

The supervised target is:

target_raw = horizon_view_count
target_log = log1p(target_raw)

The log transform is important because view counts are heavy-tailed. Training and primary evaluation happen in log space, while raw-scale metrics are still computed for interpretability.
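A minimal sketch of the transform and its inverse; `log1p` also handles zero-view videos cleanly, and `expm1` recovers raw counts for the raw-scale metrics.

```python
import numpy as np

def to_log_target(raw_views):
    """Train and evaluate in log space to tame the heavy-tailed distribution."""
    return np.log1p(raw_views)

def to_raw_views(log_pred):
    """Invert the transform when reporting raw-scale metrics."""
    return np.expm1(log_pred)
```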

Leakage Prevention

The training pipeline explicitly excludes target-leaking features such as:

A hard leakage assertion fails training if any disallowed columns survive feature selection.

This was a major reliability improvement because virality forecasting can easily become inflated by accidental access to post-outcome metadata.
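The hard assertion can be sketched as follows; the disallowed column names and prefixes here are hypothetical examples, not the project's actual denylist.

```python
# Hypothetical disallowed-column names for illustration.
DISALLOWED = {"horizon_view_count", "target_raw", "target_log"}
DISALLOWED_PREFIXES = ("view_count_at_", "like_count_at_")

def assert_no_leakage(feature_columns):
    """Fail training outright if any target-leaking column survives feature selection."""
    leaked = [
        c for c in feature_columns
        if c in DISALLOWED or c.startswith(DISALLOWED_PREFIXES)
    ]
    if leaked:
        raise AssertionError(f"target-leaking features present: {leaked}")
```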

Splitting Strategy

The implemented split is a deterministic random split:

A split manifest is saved with each run so the exact data partition can be audited later.

One limitation is that this is not a strict chronological holdout. The pipeline prevents direct feature leakage and duplicate-video leakage, but not all forms of temporal leakage that could occur from nearby capture times. This is documented as a future improvement.

Model Families

The project benchmarks multiple model families to compare linear, tree-based, neural, and multimodal-gated approaches.

Concat MLP

The concat_mlp model consumes:

Architecture:

This model is directly affected by fusion strategy because it consumes the fused vector.

Gated Fusion MLP

The gated_fusion_mlp model consumes separate video, audio, and text vectors instead of the already fused vector.

Architecture:

This model explicitly models modality reliability and text missingness. It is more expressive but operationally heavier than simpler baselines.

Ridge Regression

The ridge baseline uses:

This provides a strong linear baseline and helps identify whether gains come from nonlinear modeling or simply from better feature construction.

GBDT with Neural Projection

The best-performing base model in the March 6 snapshot was a hybrid projected-GBDT model.

It works in two stages:

  1. Train a neural projector that maps the high-dimensional fused vector into a lower-dimensional latent representation.
  2. Train a HistGradientBoostingRegressor on the projected latent features plus metadata.

The projector uses:

The GBDT model uses early stopping and tree-based nonlinear modeling over the reduced representation.

Model Evaluation

The project evaluates models using both log-space and raw-space metrics.

Primary log-space metrics:

Raw-scale metrics:

Slice metrics:

The text slice evaluation is important because the system intentionally supports both full multimodal examples and examples where transcript text is missing but terminally explainable.

March 6 Model Snapshot

A detailed March 6, 2026 snapshot compared 24 configurations across:

The snapshot evaluated 18,492 prediction rows across both horizons.

Main Findings

The strongest base model was GBDT with concat fusion.

Key findings:

Best base configurations:

| Horizon | Best Base Model | Strategy | MAE(log) | RMSE(log) | R²(log) |
|---|---|---|---:|---:|---:|
| 7 days | GBDT | concat | 1.1127 | 1.3695 | 0.4122 |
| 30 days | GBDT | concat | 1.2264 | 1.6212 | 0.7173 |

The 30-day R² was higher despite a larger MAE because the 30-day target distribution had more explainable variance in log space.

Stacked Ensemble Evaluation

After benchmarking the base models, I implemented an out-of-fold stacking workflow to evaluate whether a learned ensemble could outperform the strongest single model.

The selected four base learners were:

  1. gbdt + concat
  2. concat_mlp + max_pool
  3. gated_fusion_mlp + concat
  4. ridge + sum_pool

The stacking dataset was built from aligned residual-analysis outputs. For each horizon, the pipeline pivots base predictions into a wide matrix:

X = [base_model_1_pred_log, base_model_2_pred_log, base_model_3_pred_log, base_model_4_pred_log]
y = y_true_log

The stacker is trained on base predictions, not on residuals. Residuals are used for diagnostics and evaluation.

OOF Protocol

The ensemble search used repeated K-fold out-of-fold prediction:

This reduces optimistic bias because each scored prediction comes from a model that did not train on that row.
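The repeated K-fold OOF protocol can be sketched as follows, assuming a wide matrix of base predictions in log space; the fold counts are illustrative.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def oof_stack_predictions(X_base, y, meta_model_factory,
                          n_splits=5, n_repeats=3, seed=0):
    """Score every row with a meta-model that never trained on that row,
    then average the per-repeat out-of-fold predictions."""
    oof = np.zeros((len(y), n_repeats))
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    for i, (train_idx, test_idx) in enumerate(rkf.split(X_base)):
        repeat = i // n_splits            # folds are yielded repeat by repeat
        model = meta_model_factory().fit(X_base[train_idx], y[train_idx])
        oof[test_idx, repeat] = model.predict(X_base[test_idx])
    return oof.mean(axis=1)
```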

Candidate meta-models included:

Stacking Results

The learned stacker improved RMSE(log) at both horizons.

| Horizon | Best Base RMSE(log) | Best Stack | Best Stack RMSE(log) | RMSE Gain vs Base |
|---|---:|---|---:|---:|
| 7 days | 1.3695 | Linear stacker | 1.3584 | +0.0111 / 0.81% |
| 30 days | 1.6212 | ElasticNetCV stacker | 1.6005 | +0.0207 / 1.28% |

Portfolio-scale practical impact from the ensemble improvements:

The gains were modest relative to the best base model but meaningful: they reduced larger misses and improved calibration. At 30 days, the stacker cut high-target tail risk and reduced bias from a noticeable underprediction pattern to near zero.

Clustering and Latent-Space Analysis

The project also includes an unsupervised clustering and latent-space interpretation workflow. This helps analyze whether groups of Shorts share similar multimodal style patterns.

The clustering pipeline loads fused vectors from S3, reconstructs them from manifest pointers, applies dimensionality reduction, evaluates cluster candidates, and writes cluster assignments.

Dimensionality Reduction

The pipeline standardizes fused vectors and applies PCA:

Cluster Selection

For each fusion strategy and each k in a configured range, the pipeline evaluates KMeans candidates across multiple random seeds.

The selection score combines:

Composite scoring:

0.55 * silhouette + 0.35 * stability - 0.10 * tiny_cluster_penalty

The final KMeans run uses deterministic seeding and canonical label remapping so that cluster IDs remain stable for fixed inputs.
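The candidate scoring can be sketched as follows. Assumptions: stability is measured as mean pairwise agreement (ARI) across seeds, and the tiny-cluster penalty is the fraction of clusters below 2% of the data; the real definitions may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def score_candidate(X, k, seeds=(0, 1, 2), tiny_frac=0.02):
    """Composite: 0.55 * silhouette + 0.35 * stability - 0.10 * tiny_cluster_penalty."""
    labelings = [
        KMeans(n_clusters=k, random_state=s, n_init=10).fit_predict(X)
        for s in seeds
    ]
    sil = silhouette_score(X, labelings[0])
    # Stability: mean pairwise agreement across seeds.
    pairs = [(i, j) for i in range(len(seeds)) for j in range(i + 1, len(seeds))]
    stability = np.mean([adjusted_rand_score(labelings[i], labelings[j])
                         for i, j in pairs])
    # Penalty: fraction of clusters smaller than tiny_frac of the data.
    counts = np.bincount(labelings[0], minlength=k)
    tiny_penalty = (counts < tiny_frac * len(X)).mean()
    return 0.55 * sil + 0.35 * stability - 0.10 * tiny_penalty
```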

Cluster Outputs

Main outputs:

This workflow made the embedding space more interpretable and supported qualitative analysis of what kinds of Shorts the model was grouping together.

Interpretation Pipeline

The interpretation stage converts cluster assignments into human-auditable style attributes.

For each clustered video, it joins the canonical URL and computes lightweight media features such as:

These features are intentionally lightweight and scalable. They are not meant to be a full cinematic analysis system; instead, they provide practical signals for comparing cluster profiles.

The output includes availability flags so missing media does not get confused with truly low motion or low audio energy.

Main output:

Interpretation/interpretation.csv

Web Application Demo

The final system was integrated into a web application for live inference. The app allowed users to upload an MP4 file and provide metadata, then generated 7-day and 30-day forecast outputs using the trained multimodal models.

The demo system mirrored the training pipeline as closely as possible:

  1. Extract media-derived inputs from the uploaded video.
  2. Generate or load the required modality embeddings.
  3. Apply the same fusion strategy and schema expectations used during training.
  4. Run selected model checkpoints.
  5. Return forecasted view counts for 7-day and 30-day horizons.
  6. Display uncertainty-style ranges based on robust model prediction intervals.

This application was showcased at CUCAI as part of the project presentation and co-authored paper.

Technical Highlights

End-to-End Multimodal Pipeline

The project processes raw YouTube Shorts into model-ready features across video, audio, text, and metadata. Each modality has its own extraction and embedding path, but all stages join through a shared video_id identity contract.

Fixed-Horizon Forecasting

The redesigned labeling framework introduced explicit 7-day and 30-day targets. This eliminated inconsistent mixed-age labels and made the forecasting task clearer, more defensible, and easier to evaluate.

Restartable Daily Processing

Each stage uses per-stage SQLite state and deterministic source hashing. This means processing can resume after interruption, skip unchanged items, and retry only unresolved failures.

AWS S3 Artifact Backbone

S3 stores raw media, embeddings, fused shards, manifests, model snapshots, and state databases. This enables the pipeline to scale beyond local files while keeping artifacts discoverable and reproducible.

Schema-Locked Fusion

Fusion writes schema locks so vector dimensions and mask behavior cannot silently change between runs. This reduces train/serve skew risk and catches upstream drift early.

Leakage-Resistant Training

The supervised pipeline explicitly removes target and post-outcome columns, deduplicates videos, stores split manifests, and evaluates both aggregate performance and missing-text slices.

Model Breadth

The project compares neural networks, gated multimodal architectures, linear models, projected tree models, deterministic blends, and learned stacked ensembles.

Interpretability Workflows

The clustering and interpretation stages add qualitative insight into the embedding space by combining PCA/UMAP projections, KMeans clusters, URL joins, and human-auditable motion/audio/visual-density features.

Scale

The project was designed for daily incremental data growth.

Notable scale points:

Results Summary

The project improved from an initially weak supervised-learning setup into a reproducible multimodal forecasting system with clear labels, stronger model comparisons, and a working inference demo.

Most important outcomes:

Challenges

Heavy-Tailed Targets

YouTube view counts are extremely skewed. A few very large videos can dominate raw-scale metrics. The project addresses this by optimizing and ranking primarily in log space while still reporting raw-scale metrics for interpretability.

Missing Text

Not all Shorts have usable transcripts. Instead of dropping these videos or silently filling text with zeros, the pipeline uses terminal text-state statuses and an explicit text_present mask.

Data Alignment

The system must align metadata rows, video files, audio files, transcripts, embeddings, fused vectors, labels, and model predictions. The canonical video_id contract and manifest-based pointer system were central to making this reliable.

Train/Serve Consistency

The demo application needs to reproduce training-time preprocessing. Schema locks, explicit model snapshots, fusion strategy metadata, and stored preprocessing configuration reduce the risk of train/serve skew.

Reproducibility vs. External APIs

Some parts of the pipeline depend on external systems such as YouTube search results, video availability, model hub artifacts, and cloud storage timing. The pipeline controls what it can through deterministic naming, state databases, run IDs, stored configs, and artifact snapshots.

What I Learned

This project taught me how much real-world machine learning performance depends on data contracts, labeling quality, and evaluation design. The modeling architecture mattered, but the largest improvements came from making the target definition precise, preventing leakage, making modality alignment reliable, and evaluating failure cases explicitly.

I also learned how to design ML systems that are not just notebooks, but reproducible pipelines: restartable ingestion, schema-locked feature generation, model snapshotting, slice evaluation, and deployment-aware inference workflows.

Technologies Used

Future Improvements

Potential next steps include: