Overview
Video Virality Predictor is an end-to-end multimodal machine learning system for forecasting the future performance of YouTube Shorts. The project was designed around a practical question: can early video, audio, text, and metadata signals be used to estimate how many views a short-form video will receive after fixed post-publish horizons?
The final system predicts YouTube Shorts view counts at 7-day and 30-day horizons. A reproducible pipeline collects videos, extracts raw media artifacts, generates dense embeddings for each modality, fuses those embeddings into model-ready feature sets, trains multiple supervised regressors, evaluates model behavior across missing-modality slices, and exposes the result through a live web application that runs inference on uploaded MP4 files and metadata.
My work focused on redesigning the project from a weak early supervised-learning setup into a more rigorous multimodal forecasting system. The key change was introducing fixed forecasting horizons and stronger data-alignment contracts. Instead of training on inconsistent mixed-age targets, the redesigned pipeline labels each video at explicit post-publish ages, joins all modalities through a canonical video_id, and trains models against clear 7-day and 30-day targets.
The system was showcased at CUCAI through a co-authored paper and demo application.
Problem
Short-form video virality is difficult to model because view counts are noisy, heavy-tailed, and heavily dependent on timing. A raw snapshot of a video's view count is not necessarily a fair label: a video with 5,000 views after one hour is very different from a video with 5,000 views after 30 days.
The original project faced three major issues:
- Temporal label inconsistency — videos were being compared at different ages, producing noisy supervision.
- Multimodal alignment complexity — metadata, downloaded video, extracted audio, transcripts, embeddings, and fused vectors all had to refer to the same item without silent mismatches.
- Missing modality behavior — transcripts were not always available, and missing text needed to be represented explicitly instead of accidentally corrupting downstream models.
The redesigned project addresses these issues by making the entire pipeline video_id-native, horizon-labeled, restartable, and schema-locked.
My Role
I contributed across the data engineering, machine learning, evaluation, and demo layers of the project. My work included:
- Redesigning the labeling framework around fixed 7-day and 30-day supervision horizons.
- Building and standardizing a multimodal data pipeline for metadata, video, audio, text, embeddings, fusion, training, and interpretation.
- Implementing leakage-resistant feature generation and model-evaluation workflows.
- Benchmarking neural, linear, tree-based, gated-fusion, and stacked-ensemble models.
- Adding slice-based evaluation for text-present and text-missing examples.
- Implementing out-of-fold stacked ensemble analysis across four selected base models.
- Supporting a web application capable of live 7-day and 30-day forecasting from uploaded video files and metadata.
- Contributing to a co-authored CUCAI paper and conference demo.
System Architecture
At a high level, the system follows this pipeline:
YouTube Shorts discovery
↓
Fixed-horizon metadata labels
↓
Video download
↓
Audio extraction
↓
Transcript collection / ASR
↓
Video, audio, and text embeddings
↓
Multimodal fusion
↓
Supervised model training
↓
Model evaluation and stacking
↓
Clustering and interpretation
↓
Web application inference
Each stage uses deterministic naming, per-stage state tracking, and canonical video_id joins so that data can be processed incrementally without losing alignment between modalities.
Data Collection Pipeline
The data collection layer constructs the canonical dataset used by all downstream modeling stages. It discovers YouTube Shorts, labels them at fixed post-publish horizons, downloads raw videos, derives audio, and collects transcript text.
Fixed-Horizon Metadata Labels
The most important redesign was replacing arbitrary view-count snapshots with fixed-horizon labels.
For each video, the pipeline records:
- horizon_days: either 7 or 30
- horizon_view_count: the observed view count at the labeling time
- horizon_label_type: views_at_age
- captured_at: timestamp of the metadata capture
- canonical video_id
This turns the prediction task into a supervised regression problem with well-defined targets:
- 7-day forecast: predict views at approximately 7 days after publish.
- 30-day forecast: predict views at approximately 30 days after publish.
This change improved dataset quality because the model no longer had to learn from mixed-age labels where some videos were only hours old and others were weeks old.
YouTube Shorts Discovery
The collector searches YouTube Shorts using broad seed queries and short-duration filters. For each run, it builds publish windows centered around now - horizon_days with a tolerance window, deduplicates discovered video_ids, fetches video and channel metadata, and appends horizon-labeled rows to the canonical metadata CSV.
Primary artifacts:
- Data/raw/Metadata/shorts_metadata_horizon.csv
- Data/raw/Metadata/horizon_labels.sqlite
The SQLite label store deduplicates (video_id, horizon_days) so that repeated runs do not repeatedly label the same video-horizon pair.
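A minimal sketch of how a labeling run might derive its publish window; the tolerance width and function name here are illustrative assumptions, not the project's exact configuration:

```python
from datetime import datetime, timedelta, timezone

def publish_window(horizon_days: int, tolerance_hours: int = 12):
    """Sketch: build the publish window for one labeling run.

    Videos published inside this window are roughly `horizon_days` old right now,
    so their current view count can serve as the fixed-horizon label.
    """
    target = datetime.now(timezone.utc) - timedelta(days=horizon_days)
    tol = timedelta(hours=tolerance_hours)
    return target - tol, target + tol

# Example: discover candidates for the 7-day label
window_start, window_end = publish_window(7)
```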
Canonical Identity and Delta Processing
Every downstream stage uses video_id as the canonical identity. If a metadata row is missing a video_id, the pipeline can parse it from common YouTube URL formats such as:
- raw 11-character YouTube IDs
- youtube.com/watch?v=...
- /shorts/...
- /embed/...
- /v/...
- youtu.be/...
Each stage also computes a deterministic source_hash, normally from the normalized URL. This lets the system detect whether a video has already been processed, whether it needs to be retried, or whether it should be skipped as unchanged.
The delta-processing contract is:
- If an item has no prior state, process it.
- If its hash changed, process it again.
- If its previous status is non-terminal, retry it.
- If its previous status is terminal and unchanged, skip it.
This makes the pipeline restartable and avoids unnecessary recomputation.
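The contract can be summarized in a few lines of Python; the status names and state-record layout below are illustrative rather than the pipeline's exact schema:

```python
# Illustrative terminal statuses; the real pipeline's status vocabulary may differ.
TERMINAL = {"done", "removed", "private", "unavailable", "empty_transcript"}

def should_process(prev_state: dict | None, source_hash: str) -> bool:
    """Sketch of the delta-processing contract described above."""
    if prev_state is None:                          # never seen -> process
        return True
    if prev_state["source_hash"] != source_hash:    # input changed -> reprocess
        return True
    if prev_state["status"] not in TERMINAL:        # unresolved -> retry
        return True
    return False                                    # terminal and unchanged -> skip
```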
Video Download
The video downloader consumes metadata delta rows and materializes MP4 files keyed by video_id.
Output pattern:
Data/raw/Video/raw_data/<video_id>.mp4
clipfarm/raw/video/<video_id>.mp4
The downloader includes retry logic and multiple yt-dlp fallback strategies. It also classifies terminal failures such as removed, private, unavailable, upcoming, age-restricted, and authentication-blocked videos.
Audio Extraction
Audio is extracted from downloaded MP4 files using FFmpeg and normalized into a model-friendly waveform format:
ffmpeg -y -i input.mp4 -vn -ac 1 -ar 16000 -acodec pcm_s16le output.wav
The chosen audio format is:
- WAV container
- PCM 16-bit little-endian
- Mono channel
- 16 kHz sample rate
This standardization supports downstream ASR and Wav2Vec2 embedding generation.
Output pattern:
Data/raw/Audio/raw_data/<video_id>.wav
clipfarm/raw/audio/<video_id>.wav
Transcript Collection
The text stage attempts transcript generation using a caption-first strategy:
- Try YouTube subtitles when available.
- Fall back to ASR over the extracted WAV file.
The operational ASR backend is whisper_cpp. Each transcript is saved as a JSON file containing the final transcript, transcript source, language metadata, caption metadata, optional timestamps, and error/status information.
Output pattern:
Data/raw/Text/raw_data/<video_id>.json
clipfarm/raw/text/<video_id>.json
The system also handles empty transcripts carefully. A first empty result is treated as retryable; a second consecutive empty result is promoted to a terminal empty-transcript status. This matters because fusion later uses terminal text failures to decide whether a zero text placeholder is valid.
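The two-strike rule can be sketched as a small status function; the status strings and field names here are assumptions for illustration:

```python
def transcript_status(captions: str, asr_text: str, prev_status: str | None) -> dict:
    """Sketch of the caption-first policy and the two-strike empty-transcript rule.

    `captions` is whatever YouTube subtitles returned (possibly empty); `asr_text`
    is the whisper_cpp fallback output. Status names are illustrative.
    """
    text, source = (captions, "captions") if captions.strip() else (asr_text, "asr")
    if text.strip():
        return {"status": "ok", "transcript": text, "transcript_source": source}
    # A first empty result is retryable; a second consecutive empty result is terminal.
    if prev_status == "empty_retryable":
        return {"status": "empty_terminal", "transcript": "", "transcript_source": source}
    return {"status": "empty_retryable", "transcript": "", "transcript_source": source}
```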
Embedding Pipeline
The embedding stage converts raw media artifacts into dense modality-specific vectors. Each modality is processed independently as a delta job against the current metadata horizon CSV and its own state database.
Video Embeddings
Video embeddings are generated using a VideoMAE model.
Key details:
- Model class: transformers.VideoMAEModel
- Default model: MCG-NJU/videomae-base
- Input: clipfarm/raw/video/<video_id>.mp4
- Output: clipfarm/embeddings/video/<video_id>.npy
- Frame sampling: uniformly sample a fixed number of frames, defaulting to 16
- Vector extraction: CLS token from the model output
- Normalization: L2-normalized float32 vector
- Expected default dimension: 768
The frame sampling logic is deterministic for a fixed decode order. If a video has fewer frames than required, the final frame is repeated to pad the sequence.
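A hedged sketch of this stage using the Hugging Face VideoMAE classes named above; the PyAV decoding loop, the last-frame padding branch, and the use of the first output token (described above as the CLS token) follow the description, but exact preprocessing may differ from the project's code:

```python
import av
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base").eval()

def embed_video(path: str, num_frames: int = 16) -> np.ndarray:
    # Decode frames with PyAV, then take a fixed-size uniform sample.
    with av.open(path) as container:
        frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    if not frames:
        raise ValueError("no decodable frames")
    if len(frames) >= num_frames:
        idx = np.linspace(0, len(frames) - 1, num_frames).round().astype(int)
        sampled = [frames[i] for i in idx]
    else:
        # Fewer frames than required: repeat the final frame to pad the sequence.
        sampled = frames + [frames[-1]] * (num_frames - len(frames))
    inputs = processor(sampled, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    vec = out.last_hidden_state[0, 0].numpy()      # first output token, per the description above
    return (vec / (np.linalg.norm(vec) + 1e-12)).astype(np.float32)  # L2-normalize
```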
Audio Embeddings
Audio embeddings are generated using Wav2Vec2.
Key details:
- Model class: transformers.Wav2Vec2Model
- Default model: facebook/wav2vec2-base-960h
- Input: clipfarm/raw/audio/<video_id>.wav
- Output: clipfarm/embeddings/audio/<video_id>.npy
- Sample rate: 16 kHz
- Max duration: 90 seconds by default for memory safety
- Pooling: mean-pooling over the final hidden states
- Normalization: L2-normalized float32 vector
- Expected default dimension: 768
Text Embeddings
Text embeddings combine metadata text and transcript text.
Key details:
- Model class: sentence_transformers.SentenceTransformer
- Default model: all-MiniLM-L6-v2
- Input: transcript JSON plus metadata row
- Output: clipfarm/embeddings/text/<video_id>.npy
- Metadata text fields: title, description, and query joined with [SEP]
- Transcript field: final transcript text
- Embedding construction: concatenate metadata embedding and transcript embedding
- Normalization: L2-normalized final vector
- Expected default dimension: 768, from two 384-dimensional MiniLM vectors
This design lets the text representation capture both creator-provided context and spoken/captioned content.
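A minimal sketch of this construction with SentenceTransformers; the exact [SEP] joining and empty-field handling are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim sentence embeddings

def embed_text(title: str, description: str, query: str, transcript: str) -> np.ndarray:
    # Metadata side: title, description, and query joined with [SEP].
    meta_text = " [SEP] ".join([title or "", description or "", query or ""])
    meta_vec = encoder.encode(meta_text)
    transcript_vec = encoder.encode(transcript or "")
    vec = np.concatenate([meta_vec, transcript_vec]).astype(np.float32)  # 2 x 384 = 768 dims
    return vec / (np.linalg.norm(vec) + 1e-12)       # L2-normalize the final vector
```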
Multimodal Fusion
The fusion stage converts per-modality embeddings into one model-ready fused representation per video_id. It emits sharded .npz vector files, a Parquet manifest, and a schema lock file.
Primary script:
Data/common/fuse_embeddings_delta.py
Supported fusion strategies:
- concat
- sum_pool
- max_pool
Missing-Modality Policy
The project assumes that video and audio embeddings are required. Text is conditionally optional because some Shorts have no usable captions or speech.
The fusion policy is:
- Missing video embedding → skip as missing required modality.
- Missing audio embedding → skip as missing required modality.
- Text present → perform full fusion.
- Text missing with approved terminal text status → use a zero text vector and set text_present = 0.
- Text missing with non-terminal text status → defer fusion.
This avoids silently mixing rows with different feature semantics. The text_present mask is appended to fused vectors so downstream models can explicitly learn from missing-text behavior.
Fusion Strategies
Let:
- v = video vector
- a = audio vector
- t = text vector or zero placeholder
- m = text-present mask
The implemented strategies are:
concat: fused = [v || a || t || m]
sum_pool: fused = [pad(v) + pad(a) + pad(t) || m]
max_pool: fused = [elementwise_max(pad(v), pad(a), pad(t)) || m]
With default 768-dimensional video, audio, and text embeddings, the fused dimensions are:
| Strategy | Dimension |
|---|---:|
| concat | 2305 |
| sum_pool | 769 |
| max_pool | 769 |
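A compact sketch of the three strategies, assuming 768-dimensional inputs and zero-padding to a common width for the pooled variants; details such as the padding rule are illustrative:

```python
import numpy as np

def fuse(v: np.ndarray, a: np.ndarray, t: np.ndarray | None, strategy: str) -> np.ndarray:
    text_present = 0.0 if t is None else 1.0
    if t is None:
        t = np.zeros_like(v)                       # zero text placeholder (assumes text dim == video dim)
    mask = np.array([text_present], dtype=np.float32)
    if strategy == "concat":
        fused = np.concatenate([v, a, t])          # 3 x 768 = 2304 dims before the mask
    else:
        dim = max(v.size, a.size, t.size)
        pad = lambda x: np.pad(x, (0, dim - x.size))
        stacked = np.stack([pad(v), pad(a), pad(t)])
        fused = stacked.sum(0) if strategy == "sum_pool" else stacked.max(0)
    return np.concatenate([fused, mask]).astype(np.float32)   # +1 for the text_present mask
```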
The production fusion stage intentionally avoids randomly initialized learned projections. This keeps fusion deterministic, stable across daily runs, and compatible with historical model checkpoints.
Schema Locking
Each fusion strategy writes a schema.json file containing:
- fusion_strategy
- video_dim
- audio_dim
- text_dim
- fused_dim
- mask_appended
- updated_at
If a future run produces vectors with incompatible dimensions, the stage fails instead of silently creating train/serve skew.
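A sketch of the kind of check this implies; the field names follow the schema list above, but the validation code itself is illustrative:

```python
import json

def check_schema(schema_path: str, strategy: str, fused_dim: int) -> None:
    """Fail fast if this run's fused vectors do not match the locked schema."""
    with open(schema_path) as f:
        locked = json.load(f)
    if locked["fusion_strategy"] != strategy or locked["fused_dim"] != fused_dim:
        raise RuntimeError(
            f"schema mismatch: locked {locked['fusion_strategy']}/{locked['fused_dim']}, "
            f"got {strategy}/{fused_dim}"
        )
```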
Supervised Training Pipeline
The training stage builds supervised virality predictors from two inputs:
- Horizon-labeled metadata.
- Fused embedding pointers and shards.
Primary script:
Super_Predict/train_suite_from_horizon.py
For each (fusion_strategy, target_horizon_days) pair, the training pipeline reconstructs vectors from fused shards, joins them with horizon-labeled metadata, creates train/validation/test splits, trains several model families, and writes snapshot artifacts to S3.
Target Definition
The supervised target is:
target_raw = horizon_view_count
target_log = log1p(target_raw)
The log transform is important because view counts are heavy-tailed. Training and primary evaluation happen in log space, while raw-scale metrics are still computed for interpretability.
Leakage Prevention
The training pipeline explicitly excludes target-leaking features such as:
- horizon_view_count
- horizon_days
- horizon_label_type
- current/post-outcome counters such as view_count, like_count, and comment_count
- rate-derived columns such as views_per_day
- age_days
- virality_score
A hard leakage assertion fails training if any disallowed columns survive feature selection.
This was a major reliability improvement because virality forecasting can easily become inflated by accidental access to post-outcome metadata.
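The assertion amounts to a set intersection over the selected feature columns; the column list below mirrors the examples above and is not exhaustive:

```python
# Columns that directly or indirectly encode the post-publish outcome.
LEAKY_COLUMNS = {
    "horizon_view_count", "horizon_days", "horizon_label_type",
    "view_count", "like_count", "comment_count",
    "views_per_day", "age_days", "virality_score",
}

def assert_no_leakage(feature_columns: list[str]) -> None:
    leaked = LEAKY_COLUMNS & set(feature_columns)
    if leaked:   # hard failure: never train on post-outcome signals
        raise AssertionError(f"target-leaking columns in feature set: {sorted(leaked)}")
```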
Splitting Strategy
The implemented split is a deterministic random split:
- 70% train
- 15% validation
- 15% test
A split manifest is saved with each run so the exact data partition can be audited later.
One limitation is that this is not a strict chronological holdout. The pipeline prevents direct feature leakage and duplicate-video leakage, but not all forms of temporal leakage that could occur from nearby capture times. This is documented as a future improvement.
Model Families
The project benchmarks multiple model families to compare linear, tree-based, neural, and multimodal-gated approaches.
Concat MLP
The concat_mlp model consumes:
- strategy-specific fused vector
- standardized numeric metadata
- categorical metadata embeddings
Architecture:
- input assembly: fused vector + numeric features + categorical embeddings
- MLP hidden layers: 1024 → 512 → 256
- activation: GELU
- dropout: 0.20
- output: scalar log-view prediction
This model is directly affected by fusion strategy because it consumes the fused vector.
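A PyTorch sketch of this architecture; the categorical embedding width and the dropout placement between layers are assumptions beyond what is stated above:

```python
import torch
import torch.nn as nn

class ConcatMLP(nn.Module):
    """Sketch of concat_mlp: fused vector + numeric metadata + categorical embeddings."""
    def __init__(self, fused_dim: int, numeric_dim: int,
                 cat_cardinalities: list[int], cat_dim: int = 8):
        super().__init__()
        self.cat_embeds = nn.ModuleList([nn.Embedding(n, cat_dim) for n in cat_cardinalities])
        in_dim = fused_dim + numeric_dim + cat_dim * len(cat_cardinalities)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.GELU(), nn.Dropout(0.20),
            nn.Linear(1024, 512), nn.GELU(), nn.Dropout(0.20),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(0.20),
            nn.Linear(256, 1),                      # scalar log-view prediction
        )

    def forward(self, fused, numeric, cats):
        cat_vecs = [emb(cats[:, i]) for i, emb in enumerate(self.cat_embeds)]
        x = torch.cat([fused, numeric, *cat_vecs], dim=1)
        return self.mlp(x).squeeze(-1)
```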
Gated Fusion MLP
The gated_fusion_mlp model consumes separate video, audio, and text vectors instead of the already fused vector.
Architecture:
- separate modality towers for video, audio, and text
- each tower maps to a shared 256-dimensional space
- a metadata-conditioned gate predicts softmax weights over modalities
- the weighted modality sum is passed into a prediction head with metadata and missing-text context
This model explicitly models modality reliability and text missingness. It is more expressive but operationally heavier than simpler baselines.
Ridge Regression
The ridge baseline uses:
- median-imputed and standardized numeric features
- one-hot categorical features
- fused vector dimensions
- RidgeCV over a regularization grid
This provides a strong linear baseline and helps identify whether gains come from nonlinear modeling or simply from better feature construction.
GBDT with Neural Projection
The best-performing base model in the March 6 snapshot was a hybrid projected-GBDT model.
It works in two stages:
- Train a neural projector that maps the high-dimensional fused vector into a lower-dimensional latent representation.
- Train a HistGradientBoostingRegressor on the projected latent features plus metadata.
The projector uses:
- Linear(fused_dim → 256)
- GELU
- dropout
- Linear(256 → projector_dim)
The GBDT model uses early stopping and tree-based nonlinear modeling over the reduced representation.
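A sketch of the two-stage setup; how the projector is supervised (here, against the log target through a small training head), the dropout rate, and the latent dimension are assumptions beyond the description above:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import HistGradientBoostingRegressor

class Projector(nn.Module):
    def __init__(self, fused_dim: int, projector_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, projector_dim),
        )
        self.head = nn.Linear(projector_dim, 1)    # training head, unused at inference

    def forward(self, x):
        z = self.net(x)
        return z, self.head(z).squeeze(-1)

# Stage 1 (assumed): fit the projector on (fused, target_log) with an MSE loss, then freeze it.
# Stage 2: boost trees on the projected latent features plus tabular metadata.
def fit_gbdt(projector: Projector, fused: np.ndarray, meta: np.ndarray, y_log: np.ndarray):
    projector.eval()
    with torch.no_grad():
        z, _ = projector(torch.as_tensor(fused, dtype=torch.float32))
    X = np.hstack([z.numpy(), meta])
    gbdt = HistGradientBoostingRegressor(early_stopping=True)
    gbdt.fit(X, y_log)
    return gbdt
```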
Model Evaluation
The project evaluates models using both log-space and raw-space metrics.
Primary log-space metrics:
- mae_log
- rmse_log
- r2_log
Raw-scale metrics:
- mae_raw
- rmse_raw
Slice metrics:
- text_present = 1
- text_present = 0
The text slice evaluation is important because the system intentionally supports both full multimodal examples and examples where transcript text is missing but terminally explainable.
March 6 Model Snapshot
A detailed March 6, 2026 snapshot compared 24 configurations across:
- 4 model families: concat_mlp, gated_fusion_mlp, gbdt, ridge
- 3 fusion strategies: concat, sum_pool, max_pool
- 2 horizons: 7 days and 30 days
The snapshot evaluated 18,492 prediction rows across both horizons.
Main Findings
The strongest base model was GBDT with concat fusion.
Key findings:
- GBDT won every strategy/horizon comparison in test MAE(log).
- The 7-day horizon was easier than the 30-day horizon for all model families.
- GBDT degraded much less than the neural and linear baselines when moving from 7-day to 30-day forecasting.
- Ridge was the weakest overall baseline, especially at the 30-day horizon.
- Text-missing examples were harder for all model families, but GBDT was least sensitive to missing text.
Best base configurations:
| Horizon | Best Base Model | Strategy | MAE(log) | RMSE(log) | R²(log) |
|---|---|---|---:|---:|---:|
| 7 days | GBDT | concat | 1.1127 | 1.3695 | 0.4122 |
| 30 days | GBDT | concat | 1.2264 | 1.6212 | 0.7173 |
The 30-day R² was higher despite a larger MAE because the 30-day target distribution had more explainable variance in log space.
Stacked Ensemble Evaluation
After benchmarking the base models, I implemented an out-of-fold stacking workflow to evaluate whether a learned ensemble could outperform the strongest single model.
The selected four base learners were:
- gbdt + concat
- concat_mlp + max_pool
- gated_fusion_mlp + concat
- ridge + sum_pool
The stacking dataset was built from aligned residual-analysis outputs. For each horizon, the pipeline pivots base predictions into a wide matrix:
X = [base_model_1_pred_log, base_model_2_pred_log, base_model_3_pred_log, base_model_4_pred_log]
y = y_true_log
The stacker is trained on base predictions, not on residuals. Residuals are used for diagnostics and evaluation.
OOF Protocol
The ensemble search used repeated K-fold out-of-fold prediction:
- 5 folds
- 3 repeats
- fixed seed
- train on fold complements
- predict held-out fold
- average OOF predictions per row across repeats
This reduces optimistic bias because each scored prediction comes from a model that did not train on that row.
Candidate meta-models included:
- Linear Regression
- RidgeCV
- LassoCV
- ElasticNetCV
- HuberRegressor
- RandomForestRegressor
- HistGradientBoostingRegressor
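A sketch of the OOF scoring loop for one candidate meta-model; the helper structure and defaults (for example ElasticNetCV's settings) are illustrative:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNetCV

def oof_stack_rmse(X_base_preds: np.ndarray, y_log: np.ndarray, seed: int = 42) -> float:
    """Repeated K-fold OOF evaluation of a candidate meta-model over base predictions."""
    rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=seed)
    oof_sum = np.zeros(len(y_log))
    oof_count = np.zeros(len(y_log))
    for train_idx, test_idx in rkf.split(X_base_preds):
        meta = ElasticNetCV()                      # one of the candidate meta-models above
        meta.fit(X_base_preds[train_idx], y_log[train_idx])
        oof_sum[test_idx] += meta.predict(X_base_preds[test_idx])
        oof_count[test_idx] += 1
    oof = oof_sum / oof_count                      # average OOF predictions across repeats
    return float(np.sqrt(np.mean((oof - y_log) ** 2)))   # RMSE(log)
```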
Stacking Results
The learned stacker improved RMSE(log) at both horizons.
| Horizon | Best Base RMSE(log) | Best Stack | Best Stack RMSE(log) | RMSE Gain vs Base |
|---|---:|---|---:|---:|
| 7 days | 1.3695 | Linear stacker | 1.3584 | +0.0111 / 0.81% |
| 30 days | 1.6212 | ElasticNetCV stacker | 1.6005 | +0.0207 / 1.28% |
Portfolio-scale practical impact from the ensemble improvements:
- The 7-day stack improved RMSE(log) by the raw-scale equivalent of approximately 42,780 views.
- The 30-day stack improved RMSE(log) by the raw-scale equivalent of approximately 130,973 views.
The gains were modest relative to the best base model but meaningful because they reduced larger misses and improved calibration behavior. At 30 days, the stacker reduced high-target tail risk and improved bias from a noticeable underprediction pattern to near zero.
Clustering and Latent-Space Analysis
The project also includes an unsupervised clustering and latent-space interpretation workflow. This helps analyze whether groups of Shorts share similar multimodal style patterns.
The clustering pipeline loads fused vectors from S3, reconstructs them from manifest pointers, applies dimensionality reduction, evaluates cluster candidates, and writes cluster assignments.
Dimensionality Reduction
The pipeline standardizes fused vectors and applies PCA:
- cluster-space PCA up to 50 dimensions
- visualization PCA to 2 dimensions
- optional UMAP for clustering and visualization
Cluster Selection
For each fusion strategy and each k in a configured range, the pipeline evaluates KMeans candidates across multiple random seeds.
The selection score combines:
- silhouette score
- seed stability via adjusted Rand index
- tiny-cluster penalty
Composite scoring:
0.55 * silhouette + 0.35 * stability - 0.10 * tiny_cluster_penalty
The final KMeans run uses deterministic seeding and canonical label remapping so that cluster IDs remain stable for fixed inputs.
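A sketch of the candidate scoring described above; the tiny-cluster threshold, the fixed seed set, and the way stability is averaged over seed pairs are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def score_k(X: np.ndarray, k: int, seeds=(0, 1, 2), min_frac: float = 0.02) -> float:
    """Composite score: 0.55*silhouette + 0.35*stability - 0.10*tiny-cluster penalty."""
    labelings = [KMeans(n_clusters=k, random_state=s, n_init=10).fit_predict(X) for s in seeds]
    sil = silhouette_score(X, labelings[0])
    # Seed stability: mean pairwise adjusted Rand index across seed runs.
    pairs = [(i, j) for i in range(len(labelings)) for j in range(i + 1, len(labelings))]
    stability = np.mean([adjusted_rand_score(labelings[i], labelings[j]) for i, j in pairs])
    sizes = np.bincount(labelings[0], minlength=k)
    tiny_penalty = float(np.mean(sizes < min_frac * len(X)))   # fraction of tiny clusters
    return 0.55 * sil + 0.35 * stability - 0.10 * tiny_penalty
```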
Cluster Outputs
Main outputs:
- Unsup_Cluster/cluster_results.csv
- Unsup_Cluster/cluster_diagnostics.json
- PCA/UMAP latent-space artifacts
- visualization-ready 2D projections
This workflow made the embedding space more interpretable and supported qualitative analysis of what kinds of Shorts the model was grouping together.
Interpretation Pipeline
The interpretation stage converts cluster assignments into human-auditable style attributes.
For each clustered video, it joins the canonical URL and computes lightweight media features such as:
- motion_mean
- cut_rate_per_min
- audio_rms_mean
- audio_rms_std
- visual_density
These features are intentionally lightweight and scalable. They are not meant to be a full cinematic analysis system; instead, they provide practical signals for comparing cluster profiles.
The output includes availability flags so missing media does not get confused with truly low motion or low audio energy.
Main output:
Interpretation/interpretation.csv
Web Application Demo
The final system was integrated into a web application for live inference. The app allowed users to upload an MP4 file and provide metadata, then generated 7-day and 30-day forecast outputs using the trained multimodal models.
The demo system mirrored the training pipeline as closely as possible:
- Extract media-derived inputs from the uploaded video.
- Generate or load the required modality embeddings.
- Apply the same fusion strategy and schema expectations used during training.
- Run selected model checkpoints.
- Return forecasted view counts for 7-day and 30-day horizons.
- Display uncertainty-style ranges based on robust model prediction intervals.
This application was showcased at CUCAI as part of the project presentation and co-authored paper.
Technical Highlights
End-to-End Multimodal Pipeline
The project processes raw YouTube Shorts into model-ready features across video, audio, text, and metadata. Each modality has its own extraction and embedding path, but all stages join through a shared video_id identity contract.
Fixed-Horizon Forecasting
The redesigned labeling framework introduced explicit 7-day and 30-day targets. This eliminated inconsistent mixed-age labels and made the forecasting task clearer, more defensible, and easier to evaluate.
Restartable Daily Processing
Each stage uses per-stage SQLite state and deterministic source hashing. This means processing can resume after interruption, skip unchanged items, and retry only unresolved failures.
AWS S3 Artifact Backbone
S3 stores raw media, embeddings, fused shards, manifests, model snapshots, and state databases. This enables the pipeline to scale beyond local files while keeping artifacts discoverable and reproducible.
Schema-Locked Fusion
Fusion writes schema locks so vector dimensions and mask behavior cannot silently change between runs. This reduces train/serve skew risk and catches upstream drift early.
Leakage-Resistant Training
The supervised pipeline explicitly removes target and post-outcome columns, deduplicates videos, stores split manifests, and evaluates both aggregate performance and missing-text slices.
Model Breadth
The project compares neural networks, gated multimodal architectures, linear models, projected tree models, deterministic blends, and learned stacked ensembles.
Interpretability Workflows
The clustering and interpretation stages add qualitative insight into the embedding space by combining PCA/UMAP projections, KMeans clusters, URL joins, and human-auditable motion/audio/visual-density features.
Scale
The project was designed for daily incremental data growth.
Notable scale points:
- Daily ingestion target: approximately 1,000 videos per day.
- Collection scale: approximately 50,000 YouTube Shorts.
- Downstream fused dataset scale: approximately 10,000 fused multimodal records.
- Snapshot evaluation scale: 18,492 prediction rows across 7-day and 30-day horizons.
- Multimodal file scale: 150,000+ files across raw media, transcripts, embeddings, manifests, and derived artifacts.
Results Summary
The project improved from an initially weak supervised-learning setup into a reproducible multimodal forecasting system with clear labels, stronger model comparisons, and a working inference demo.
Most important outcomes:
- Introduced fixed 7-day and 30-day forecasting targets.
- Standardized metadata, video, audio, text, embedding, fusion, training, clustering, and interpretation stages.
- Built a schema-locked multimodal fusion pipeline with explicit missing-text handling.
- Benchmarked 24 model/fusion/horizon configurations in a March 6 snapshot.
- Identified GBDT with concat fusion as the strongest base model.
- Built OOF stacked ensembles that improved RMSE(log) at both horizons.
- Reduced the 7-day RMSE(log) by a raw-scale equivalent of about 42,780 views.
- Reduced the 30-day RMSE(log) by a raw-scale equivalent of about 130,973 views.
- Delivered a live web application for conference demonstration.
- Co-authored a paper presented at CUCAI.
Challenges
Heavy-Tailed Targets
YouTube view counts are extremely skewed. A few very large videos can dominate raw-scale metrics. The project addresses this by optimizing and ranking primarily in log space while still reporting raw-scale metrics for interpretability.
Missing Text
Not all Shorts have usable transcripts. Instead of dropping these videos or silently filling text with zeros, the pipeline uses terminal text-state statuses and an explicit text_present mask.
Data Alignment
The system must align metadata rows, video files, audio files, transcripts, embeddings, fused vectors, labels, and model predictions. The canonical video_id contract and manifest-based pointer system were central to making this reliable.
Train/Serve Consistency
The demo application needs to reproduce training-time preprocessing. Schema locks, explicit model snapshots, fusion strategy metadata, and stored preprocessing configuration reduce the risk of train/serve skew.
Reproducibility vs. External APIs
Some parts of the pipeline depend on external systems such as YouTube search results, video availability, model hub artifacts, and cloud storage timing. The pipeline controls what it can through deterministic naming, state databases, run IDs, stored configs, and artifact snapshots.
What I Learned
This project taught me how much real-world machine learning performance depends on data contracts, labeling quality, and evaluation design. The modeling architecture mattered, but the largest improvements came from making the target definition precise, preventing leakage, making modality alignment reliable, and evaluating failure cases explicitly.
I also learned how to design ML systems that are not just notebooks, but reproducible pipelines: restartable ingestion, schema-locked feature generation, model snapshotting, slice evaluation, and deployment-aware inference workflows.
Technologies Used
- Python
- PyTorch
- scikit-learn
- pandas / NumPy
- Transformers
- SentenceTransformers
- VideoMAE
- Wav2Vec2
- Whisper.cpp
- FFmpeg
- yt-dlp
- PyAV
- SQLite
- Parquet
- AWS S3
- GitHub Actions
- KMeans / PCA / UMAP
- HistGradientBoostingRegressor
- RidgeCV / ElasticNetCV / LassoCV
- Multimodal fusion
- Out-of-fold stacking
- Web application inference
Future Improvements
Potential next steps include:
- Replace random train/validation/test splits with strict chronological holdouts.
- Add bootstrap confidence intervals for model comparisons.
- Pin exact model artifact revisions for every embedding model.
- Add a formal model registry binding run_id, model checkpoint, fusion strategy, preprocessing version, and horizon.
- Improve raw-scale metrics with robust denominator clipping or median percentage-error variants.
- Add deeper content interpretation features such as scene changes, object density, speech rate, music/speech classification, and visual style embeddings.
- Expand the web app with explanations, confidence intervals, and cluster-based content insights.