Overview
CollectLite is a full-stack human-data collection and model-evaluation platform, designed as a lower-scale reconstruction of the kind of system Cohere's Collect team describes in public materials: internal tools and workflows for human data collection, annotation, curation, and model evaluation.
The system supports project owners who need to turn raw examples into high-quality model training or evaluation datasets. Users can create annotation projects, upload CSV or JSONL files, define task instructions, generate annotation queues, run model-in-the-loop pre-labeling, collect human feedback, measure agreement, resolve disagreements, and export clean datasets for downstream training or evaluation.
The project was intentionally scoped around public descriptions of Cohere's Collect and Human Data work. Public job postings describe Collect engineers as building critical internal tools with Next.js, TypeScript, React, and Python, while Human Data postings describe tools and workflows for annotation, data curation, model-in-the-loop labeling, and evaluation. CollectLite implements those ideas at personal-project scale.
Problem
High-quality human data is one of the hardest parts of building reliable AI systems. Raw datasets often lack clear labels, inconsistent instructions create noisy annotations, and model outputs need structured human review before they can be used for training or evaluation.
The core problems CollectLite addresses are:
- Annotation workflow fragmentation — teams often manage task instructions, raw data, assignments, review, and exports across disconnected spreadsheets and scripts.
- Inconsistent label quality — labels need consensus, reviewer override, gold examples, and disagreement tracking to become trustworthy training/evaluation data.
- Model-in-the-loop complexity — model suggestions can speed up annotation, but they must be stored with provenance and compared against human decisions.
- Export reliability — downstream training and evaluation pipelines need schema-consistent, auditable JSONL/CSV outputs.
- Internal-tool usability — annotators need a fast, clear UI, while project owners need dashboards for progress, quality, and bottlenecks.
My Role
I designed and implemented CollectLite as a full-stack SWE/AI systems project. My work included:
- Researching public evidence about Cohere's Collect and Human Data teams.
- Designing a lower-scale architecture that maps to Collect-like internal tooling.
- Building a Next.js + TypeScript annotator/admin interface.
- Implementing a Python FastAPI backend with project, task, assignment, annotation, review, and export APIs.
- Designing a relational Postgres schema for annotation lifecycle state, consensus labels, model suggestions, audit logs, and dataset exports.
- Adding model-in-the-loop pre-labeling using Cohere Rerank/Embed or local fallback models.
- Implementing consensus, disagreement queues, gold-task checks, and reviewer override workflows.
- Building operational dashboards for throughput, agreement, label distribution, model-human disagreement, and export readiness.
- Containerizing the stack with Docker Compose and documenting setup, API contracts, and system design.
System Architecture
At a high level, the system follows this flow:
Dataset upload
↓
Schema validation and row normalization
↓
Task generation
↓
Model-in-the-loop suggestions
↓
Annotator work queue
↓
Human annotation capture
↓
Consensus and quality-control engine
↓
Reviewer disagreement resolution
↓
Dataset export
↓
Training / evaluation-ready JSONL or CSV
The platform separates user-facing workflows from backend data contracts. The frontend handles project management, the annotation workbench, review queues, and dashboards. The backend owns lifecycle state, validation, assignment logic, model suggestion generation, consensus computation, audit logging, and export generation.
Core Product Surfaces
Project Owner Dashboard
Project owners can create annotation projects, upload raw examples, define task types, write labeling instructions, monitor task progress, inspect quality metrics, and export datasets.
Supported task types:
- Pairwise preference ranking
- Document relevance rating
- Classification
- Extraction/QA validation
- Freeform model response critique
Dataset Upload and Validation
Users upload CSV or JSONL files. The backend validates required fields, normalizes row IDs, detects duplicates, and creates immutable source examples.
Example dataset row:
{
  "query": "What documents are needed for mortgage renewal?",
  "candidate_a": "Income verification and property tax documents...",
  "candidate_b": "A branch address and appointment confirmation...",
  "metadata": {"domain": "banking", "source": "demo"}
}
Each uploaded row receives a canonical example_id, source hash, project ID, and ingestion timestamp.
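A minimal sketch of that ingestion step, assuming a pairwise row shape like the example above (the field names, SourceExample model, and duplicate-tracking approach are illustrative, not the exact implementation):

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

from pydantic import BaseModel

REQUIRED_FIELDS = {"query", "candidate_a", "candidate_b"}  # per-task-type schema

class SourceExample(BaseModel):
    example_id: str
    project_id: str
    payload: dict
    source_hash: str
    ingested_at: datetime

def ingest_row(project_id: str, row: dict, seen_hashes: set[str]) -> SourceExample | None:
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        raise ValueError(f"row missing required fields: {sorted(missing)}")
    # Hash the canonical JSON form so duplicates are caught regardless of key order.
    source_hash = hashlib.sha256(
        json.dumps(row, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if source_hash in seen_hashes:
        return None  # duplicate: skip rather than create a second immutable example
    seen_hashes.add(source_hash)
    return SourceExample(
        example_id=str(uuid.uuid4()),
        project_id=project_id,
        payload=row,
        source_hash=source_hash,
        ingested_at=datetime.now(timezone.utc),
    )
```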
Task Generation
The task generator converts source rows into annotation tasks. Each task stores:
- project ID
- example ID
- task type
- instruction version
- priority
- model suggestion status
- assignment state
- lifecycle status
Task states:
created → suggested → assigned → submitted → needs_review → resolved → exported
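Enforcing the lifecycle as an explicit transition table keeps illegal jumps (e.g., created directly to resolved) from silently corrupting state. A minimal sketch using the state names above:

```python
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "created": {"suggested", "assigned"},        # the suggestion step is optional
    "suggested": {"assigned"},
    "assigned": {"submitted"},
    "submitted": {"needs_review", "resolved"},   # clean consensus skips review
    "needs_review": {"resolved"},
    "resolved": {"exported"},
    "exported": set(),                           # terminal state
}

def transition(current: str, target: str) -> str:
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal task transition: {current} -> {target}")
    return target
```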
Model-in-the-Loop Suggestions
Before humans annotate, the system can call a model-assisted pre-labeling pipeline.
For relevance or RAG-style tasks, Cohere Rerank can score candidate documents against a query. For classification or critique tasks, an LLM can draft a suggested label and short rationale. These suggestions are shown to annotators as optional assistance, not treated as ground truth.
Stored model suggestion metadata includes:
- model provider
- model name
- prompt/template version
- raw score or label
- rationale
- confidence
- created timestamp
- latency and cost estimate
This allows the dashboard to compare model suggestions against human consensus later.
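For the relevance case, a minimal sketch of the suggestion step, assuming the cohere Python SDK's rerank endpoint (the model name and stored record shape are illustrative):

```python
import os
import time
import uuid
from datetime import datetime, timezone

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def suggest_relevance(task_id: str, query: str, documents: list[str]) -> list[dict]:
    """Score every candidate document against the query, keeping provenance."""
    start = time.perf_counter()
    response = co.rerank(
        model="rerank-english-v3.0",   # illustrative model choice
        query=query,
        documents=documents,
        top_n=len(documents),          # retain a score for every candidate
    )
    latency_ms = (time.perf_counter() - start) * 1000
    return [
        {
            "suggestion_id": str(uuid.uuid4()),
            "task_id": task_id,
            "provider": "cohere",
            "model": "rerank-english-v3.0",
            "document_index": result.index,
            "raw_score": result.relevance_score,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "latency_ms": latency_ms,
        }
        for result in response.results
    ]
```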
Annotator Workbench
Annotators receive tasks from an assignment queue. The UI emphasizes speed and clarity:
- task instructions pinned at the top
- item context in the center
- candidate responses/documents side by side
- model suggestion panel
- label controls
- confidence selector
- optional notes
- submit and skip actions
The system records annotation latency, selected labels, confidence, notes, model-suggestion visibility, and assignment metadata.
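A sketch of the submission payload the workbench posts back; field names are illustrative and mirror the list above:

```python
from datetime import datetime

from pydantic import BaseModel, Field

class AnnotationSubmission(BaseModel):
    task_id: str
    annotator_id: str
    label: str                              # e.g. "highly_relevant"
    confidence: int = Field(ge=1, le=5)     # the 1-5 confidence selector
    notes: str | None = None
    suggestion_visible: bool                # was the model suggestion panel shown?
    started_at: datetime
    submitted_at: datetime                  # latency = submitted_at - started_at
```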
Consensus and Quality Control
CollectLite supports multiple quality-control mechanisms:
- Multi-annotator consensus — assign the same task to multiple annotators and compute agreement.
- Gold tasks — insert known-answer tasks to estimate annotator reliability.
- Disagreement queue — send low-agreement tasks to reviewer resolution.
- Reviewer override — allow senior reviewers to set final labels.
- Model-human disagreement — flag cases where model suggestions disagree with consensus.
Consensus outputs are stored separately from raw annotations so the platform preserves both individual labels and final resolved labels.
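The MVP consensus rule is majority vote (see Challenges below). A minimal sketch that also emits the agreement rate used by the disagreement queue, with an illustrative threshold:

```python
from collections import Counter

def compute_consensus(labels: list[str], min_agreement: float = 0.67) -> dict:
    """Majority-vote consensus over one task's raw annotations.

    Returns the winning label, its agreement rate, and whether the task
    should route to the reviewer disagreement queue.
    """
    if not labels:
        raise ValueError("cannot compute consensus without annotations")
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return {
        "final_label": label,
        "agreement": agreement,
        "n_annotations": len(labels),
        "needs_review": agreement < min_agreement,
    }
```

At the demo scale of 2-3 annotations per task, unanimous tasks resolve automatically, while 2-of-3 splits fall below the threshold and route to review.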
Review Queue
Reviewers see tasks that require adjudication. A task enters review when:
- annotators disagree beyond a threshold
- annotator confidence is low
- model and human labels strongly disagree
- the task is a gold-task failure
- a project owner manually flags it
The reviewer can inspect all raw annotations, comments, model suggestions, and source metadata before resolving the final label.
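Those triggers compose into a single routing predicate; a sketch with illustrative thresholds:

```python
def needs_adjudication(
    agreement: float,
    mean_confidence: float,
    model_human_disagree: bool,
    gold_failed: bool,
    owner_flagged: bool,
    agreement_threshold: float = 0.67,
    confidence_threshold: float = 2.5,
) -> bool:
    # Any one trigger is enough to send the task to the review queue.
    return (
        agreement < agreement_threshold
        or mean_confidence < confidence_threshold
        or model_human_disagree
        or gold_failed
        or owner_flagged
    )
```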
Export Pipeline
The export service generates clean JSONL or CSV files for training and evaluation.
Export records include:
- example ID
- source fields
- final label
- consensus score
- number of annotations
- reviewer override flag
- model suggestion metadata
- instruction version
- split assignment: train, validation, or evaluation
- provenance timestamps
This makes the output usable for supervised fine-tuning, model evaluation, reranker evaluation, or RAG-quality benchmarking.
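A sketch of the JSONL writer, including a deterministic hash-based split so a given example never migrates between train and evaluation across re-exports (the schema, ratios, and helper names are illustrative):

```python
import hashlib
import json

def assign_split(example_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Stable split: hashing example_id keeps assignments fixed across exports."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    if bucket < ratios[0] * 100:
        return "train"
    if bucket < (ratios[0] + ratios[1]) * 100:
        return "validation"
    return "evaluation"

def write_export(records: list[dict], path: str) -> None:
    # One JSON object per line; each record already carries its provenance fields.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            record["split"] = assign_split(record["example_id"])
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```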
Data Model
Core tables:
users, projects, datasets, source_examples, task_templates, tasks, assignments, annotations, model_suggestions, consensus_results, review_decisions, gold_labels, exports, audit_events
The schema is designed around immutability where possible. Raw examples, annotation submissions, model suggestions, and audit events are append-only. Resolved consensus and export records are versioned.
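As one example of the append-only pattern, a sketch of the annotations table in SQLAlchemy 2.0 style (the column set is illustrative; there is deliberately no updated_at column, since submissions are never mutated):

```python
from datetime import datetime

from sqlalchemy import DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Annotation(Base):
    __tablename__ = "annotations"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    task_id: Mapped[str] = mapped_column(ForeignKey("tasks.id"), index=True)
    annotator_id: Mapped[str] = mapped_column(ForeignKey("users.id"), index=True)
    label: Mapped[str] = mapped_column(String(64))
    confidence: Mapped[int] = mapped_column(Integer)
    notes: Mapped[str | None] = mapped_column(Text, nullable=True)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True))
```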
API Design
Representative endpoints:
POST /api/projects
GET /api/projects
POST /api/projects/{project_id}/datasets
POST /api/projects/{project_id}/tasks/generate
POST /api/projects/{project_id}/tasks/suggest
GET /api/tasks/next
POST /api/tasks/{task_id}/annotations
GET /api/projects/{project_id}/review-queue
POST /api/reviews/{task_id}/resolve
GET /api/projects/{project_id}/metrics
POST /api/projects/{project_id}/exports
GET /api/exports/{export_id}/download
FastAPI exposes the backend as a typed OpenAPI schema, and the frontend consumes TypeScript API types generated from that schema, which reduces frontend/backend drift.
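A minimal sketch of the annotation-submission endpoint from the list above, with persistence stubbed out (the request model is illustrative):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="CollectLite API")

class AnnotationIn(BaseModel):
    annotator_id: str
    label: str
    confidence: int
    notes: str | None = None

@app.post("/api/tasks/{task_id}/annotations", status_code=201)
def submit_annotation(task_id: str, body: AnnotationIn) -> dict:
    # The real handler writes an append-only annotations row and advances
    # the task state (assigned -> submitted); both are stubbed here.
    if body.confidence not in range(1, 6):
        raise HTTPException(status_code=422, detail="confidence must be 1-5")
    return {"task_id": task_id, "status": "submitted"}
```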
Observability and Reliability
The project includes lightweight production-minded instrumentation:
- structured JSON logs
- request latency middleware
- background job status tracking
- task lifecycle counters
- failed model-call retries
- export audit records
- dashboard metrics
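A minimal sketch of the request-latency middleware emitting structured JSON logs, as listed above (log field names are illustrative):

```python
import json
import logging
import time

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("collectlite")

@app.middleware("http")
async def log_request_latency(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    # One JSON object per request keeps logs greppable and dashboard-friendly.
    logger.info(json.dumps({
        "event": "http_request",
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return response
```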
Main operational metrics:
- tasks created
- tasks completed
- average annotation latency
- consensus agreement rate
- gold-task accuracy
- model-human disagreement rate
- review backlog
- export readiness
Technology Stack
Frontend
- Next.js
- TypeScript
- React
- Tailwind CSS
- shadcn/ui
- TanStack Query
- Recharts
Backend
- Python
- FastAPI
- Pydantic
- SQLAlchemy
- Alembic
- PostgreSQL
- RQ (Redis Queue) or FastAPI background tasks
AI / Model-in-the-Loop
- Cohere Rerank for relevance-ranking tasks
- Cohere Embed for semantic similarity and clustering
- Optional local fallback with SentenceTransformers
- Optional LLM judge for draft labels and rationales
DevOps
- Docker Compose
- GitHub Actions
- pytest
- Playwright or React Testing Library
- Ruff / Black
- ESLint / Prettier
Lower-Scale Implementation Plan
Day 1: Core CRUD and upload
- Create repo and Docker Compose stack.
- Implement Postgres schema and Alembic migrations.
- Build project creation and dataset upload.
- Add row validation and source-example creation.
- Build simple admin dashboard.
Day 2: Annotation flow
- Generate tasks from uploaded rows.
- Build annotator queue and task workbench.
- Implement annotation submission.
- Add assignment and task-state transitions.
- Seed demo users and demo dataset.
Day 3: Model-in-the-loop and quality
- Add Cohere Rerank or local reranker suggestions.
- Store model suggestion metadata.
- Compute consensus from multiple annotations.
- Build disagreement review queue.
- Add gold-task reliability scoring.
Day 4: Dashboard and export
- Build metrics dashboard.
- Add JSONL/CSV export pipeline.
- Add audit logs.
- Polish README, architecture diagram, screenshots, and demo script.
Demo Scenario
The demo project uses a RAG relevance-labeling workflow.
- Upload a dataset of user queries and candidate retrieved documents.
- Generate relevance-labeling tasks.
- Run Cohere Rerank to pre-score each candidate document.
- Annotators label each candidate as irrelevant, partially relevant, or highly relevant.
- The system computes consensus and flags disagreements.
- Reviewers resolve ambiguous cases.
- Project owner exports a clean reranker-evaluation dataset.
- Dashboard shows throughput, agreement, and model-human disagreement.
Scale
CollectLite is intentionally lower-scale but production-shaped.
Target local/demo scale:
- 1-5 projects
- 100-5,000 source examples
- 500-25,000 generated tasks
- 2-10 annotator accounts
- 2-3 annotations per task
- JSONL/CSV exports up to tens of thousands of records
Results Summary
The intended result is a polished interview project that demonstrates full-stack engineering, AI systems thinking, and internal-tool product judgment.
Expected outcomes:
- End-to-end annotation workflow from dataset upload to export.
- Model-in-the-loop suggestions stored with provenance.
- Consensus and disagreement resolution workflow.
- Dashboard for data quality and project progress.
- Exportable training/evaluation dataset.
- Clear documentation and system design artifacts.
Technical Highlights
Internal Tooling UX
The platform is designed for two user types: project owners and annotators. Project owners need operational visibility; annotators need a fast and unambiguous labeling interface.
Model-in-the-Loop Annotation
Model suggestions are used to accelerate human labeling while preserving human decisions as the source of truth. The system explicitly tracks disagreement between model suggestions and human consensus.
Annotation Quality Control
Consensus, gold tasks, reviewer overrides, and annotator reliability scores turn raw labels into trusted data.
Export Contracts
Exports are schema-controlled and include provenance fields, making the generated data suitable for model evaluation or fine-tuning.
Production-Minded Design
Even at small scale, the system includes state machines, audit logs, background jobs, typed APIs, schema migrations, and observability metrics.
Challenges
Scope Management
The biggest challenge is keeping the system buildable before the interview. The MVP focuses on one strong workflow: RAG relevance labeling with optional pairwise preference ranking.
Annotation UX
Small UI choices affect label quality. Instructions, candidate layout, keyboard shortcuts, and confidence controls need to be simple and fast.
Model Suggestion Bias
Showing model suggestions can bias human annotators. CollectLite logs whether suggestions were visible and can support an ablation mode where some tasks hide suggestions.
Consensus Design
Consensus is straightforward for classification but harder for ranking and freeform critique. The MVP starts with majority vote and reviewer override.
Export Correctness
Training/evaluation exports must avoid schema drift. The system version-controls instruction templates and export schemas.
What I Learned
This project demonstrates that human-data platforms are not just labeling UIs. They require careful system design around workflow state, data quality, model assistance, provenance, auditability, and downstream training/evaluation contracts.
It also helped connect my existing AI/data engineering background to full-stack product engineering. I designed a system that combines frontend UX, backend APIs, data modeling, AI services, quality metrics, and production-style observability.
Technologies Used
- Next.js
- TypeScript
- React
- Tailwind CSS
- shadcn/ui
- TanStack Query
- Recharts
- Python
- FastAPI
- Pydantic
- SQLAlchemy
- Alembic
- PostgreSQL
- RQ (Redis Queue) / background jobs
- Cohere Rerank
- Cohere Embed
- SentenceTransformers
- Docker Compose
- GitHub Actions
- pytest
- Playwright / React Testing Library
- JSONL / CSV exports
Future Improvements
- Add organization-level role-based access control.
- Add keyboard-first annotator shortcuts.
- Add active-learning prioritization using model uncertainty.
- Add pairwise Bradley-Terry aggregation for preference ranking.
- Add prompt/version registry for model suggestions.
- Add blind annotation mode to reduce model-suggestion bias.
- Add webhook export to training pipelines.
- Add OpenTelemetry tracing and Prometheus metrics.
- Add dataset diffing between export versions.
- Add multi-tenant project isolation.