AI Extraction–Validation–Publication Pipeline v1
(Entity-first • Claim-based • Provenance-driven • Human-in-the-loop)
1) Objectives
The pipeline must reliably transform unstructured inputs into:
- Canonical Entities (Company / Person / Corridor)
- Claims (field-level truth units with evidence + confidence)
- Relationships (graph edges with provenance)
- Publishable Notes (human narrative rendering + embedded JSON-LD)
- Machine Feeds/APIs (entity updates, corridor briefs, NDJSON streams)
2) Inputs (Ingestion Channels)
A. Structured
- Company submission form (recommended primary)
- Founder/Person submission form
- Corridor opportunity intake form (trade/investment/RE/legal)
B. Semi-structured
- Interview transcript (Q&A)
- Email threads
- “Pitch deck + short answers” templates
C. Unstructured
- PDFs (deck, registry docs)
- Website content snapshots
- Public registry extracts (optional)
- Partner feeds (institutions)
All inputs are stored as immutable artifacts with IDs:
document_id, transcript_id, submission_id
3) Pipeline Stages (End-to-End)
Stage 0 — Intake & Preprocessing
Goal: Normalize inputs and create a traceable job.
Actions
- Assign job_id
- Store raw artifact in object storage
- Extract text (PDF → text)
- Chunk into logical sections
- Detect language + translate (optional)
- Create ingestion_manifest:
- source type
- timestamps
- submitter identity (if any)
- consent flags (interview / PII)
Outputs
raw_text, chunks[], manifest.json
Gates
- If no consent and contains PII → route to compliance review
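The intake gate above can be sketched as a small routing function. This is an illustrative sketch, not a fixed API: the names `route_intake` and `PII_PATTERNS` are assumptions, and a real system would use a proper PII detector rather than two regexes.

```python
import re
import uuid
from datetime import datetime, timezone

# Illustrative PII detectors only; production systems need a real classifier.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email addresses
    re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),     # phone-like numbers
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

def route_intake(raw_text: str, consent: bool) -> dict:
    """Create a traceable manifest and apply the consent/PII gate."""
    manifest = {
        "job_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "consent": consent,
    }
    if not consent and contains_pii(raw_text):
        manifest["route"] = "compliance_review"
    else:
        manifest["route"] = "entity_detection"
    return manifest
```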
Stage 1 — Entity Detection & Candidate Generation
Goal: Identify which entities appear in the input and whether they already exist in the store.
AI Tasks
- Named entity recognition (Company, Person, City, Institution, Product)
- Normalize names, domains, locations
- Propose entity candidates:
- existing match probability
- new entity proposal
Deterministic Tasks
- Domain normalization (strip tracking)
- Location normalization (country code ISO2)
- Slug suggestion
Outputs
entity_candidates[] with:
- entity_type
- canonical_name
- match_candidates[] (existing entity IDs + match score)
- proposed_new_entity (if no match)
Gates
- If match score ≥ threshold (e.g., 0.92) → auto-link to existing entity
- Else → human chooses (or AI asks minimal disambiguation)
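The linking gate can be sketched as follows; `resolve_entity` is a hypothetical name, and the 0.92 threshold simply mirrors the example value above. Real match scores would come from an embedding- or fuzzy-name-based matcher.

```python
AUTO_LINK_THRESHOLD = 0.92  # mirrors the example threshold above

def resolve_entity(candidate: dict) -> dict:
    """Auto-link above threshold; otherwise defer to a human or propose new."""
    matches = sorted(candidate.get("match_candidates", []),
                     key=lambda m: m["score"], reverse=True)
    if matches and matches[0]["score"] >= AUTO_LINK_THRESHOLD:
        return {"action": "auto_link", "entity_id": matches[0]["entity_id"]}
    if matches:
        return {"action": "human_disambiguation", "options": matches}
    return {"action": "propose_new_entity"}
```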
Stage 2 — Claim Extraction (Field-Level Truth Units)
Goal: Extract structured fields as claims with evidence.
AI Tasks
- Extract schema-aligned fields per entity type:
- Company: sector, stage, business model, products, markets, etc.
- Person: roles, affiliations, expertise
- Corridor: nodes, domains, operating model
Hard Rules
- If a field is not explicitly supported by the input evidence:
- mark as unknown, OR
- create claim with low confidence + “self_reported” level
Outputs (Claim objects)
Each claim:
- claim_id
- entity_id (or temporary entity key)
- field_path (e.g., funding.total_raised)
- value
- evidence_span (chunk_id + start/end offsets)
- source_ref (document_id / interview / registry)
- confidence (0–1 or 0–100)
- verification_level (default = self_reported)
- created_at
Gates
- Critical fields require evidence:
- legal_name, location, founded_date, funding, certifications, export markets
- If missing evidence → cannot move to “verified”
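A minimal sketch of the claim record and the evidence gate, assuming the field names listed above. `CRITICAL_FIELDS` mirrors the examples in the gate; the exact set would live in configuration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Mirrors the critical-field examples above; illustrative, not exhaustive.
CRITICAL_FIELDS = {"legal_name", "location", "founded_date",
                   "funding.total_raised", "certifications", "export_markets"}

@dataclass
class Claim:
    claim_id: str
    entity_id: str
    field_path: str
    value: object
    evidence_span: Optional[dict]   # {"chunk_id": ..., "start": ..., "end": ...}
    source_ref: str
    confidence: float               # 0–1
    verification_level: str = "self_reported"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def can_verify(claim: Claim) -> bool:
    """Critical fields cannot reach 'verified' without an evidence span."""
    if claim.field_path in CRITICAL_FIELDS:
        return claim.evidence_span is not None
    return True
```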
Stage 3 — Relationship Extraction (Graph Edges)
Goal: Turn implicit relationships into explicit edges.
AI Tasks
- Identify relationships:
- Company founded_by Person
- Company partners_with Institution
- Company exports_to Country
- Corridor includes Node City
- Company corridor_fit domains
Outputs
edges[]:
- from_entity_id
- to_entity_id
- relation_type
- evidence
- confidence
Gates
- No edge without evidence pointer
- If evidence is weak → “proposed edge” status
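Both edge gates can be sketched together; `gate_edge` and the 0.6 weak-evidence threshold are illustrative assumptions.

```python
def gate_edge(edge: dict, weak_threshold: float = 0.6) -> dict:
    """Reject evidence-free edges; mark weak-evidence edges as proposed.

    weak_threshold is an illustrative assumption, not a spec value.
    """
    if not edge.get("evidence"):
        raise ValueError("no edge without an evidence pointer")
    status = "proposed" if edge["confidence"] < weak_threshold else "review_ready"
    return {**edge, "status": status}
```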
Stage 4 — Normalization & Taxonomy Assignment
Goal: Convert free-text into controlled vocab references.
AI Tasks
- Map sector strings → sector taxonomy codes
- Map tech mentions → technology taxonomy codes
- Map topics → topic taxonomy codes
- Map corridor domains → corridor_domain codes
Deterministic Checks
- Only allow tags existing in the taxonomy registry
- Reject unknown codes or route to “taxonomy steward” queue
Outputs
- Normalized entity draft:
- primary_sector (tagRef)
- secondary_sectors[]
- technologies[]
- topics[]
Gates
- If AI suggests a non-existent tag → requires steward approval
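The registry check can be sketched as a split between accepted tags and a steward queue. The registry contents below are made-up examples; `assign_tags` is a hypothetical name.

```python
# Made-up registry contents for illustration only.
TAXONOMY_REGISTRY = {
    "sector": {"agritech", "fintech", "logistics"},
    "technology": {"ml", "iot", "blockchain"},
}

def assign_tags(kind: str, suggested: list) -> dict:
    """Accept only codes already in the registry; queue the rest for stewards."""
    known = TAXONOMY_REGISTRY.get(kind, set())
    accepted = [t for t in suggested if t in known]
    steward_queue = [t for t in suggested if t not in known]
    return {"accepted": accepted, "steward_queue": steward_queue}
```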
Stage 5 — Consistency & Quality Validation (Automated)
Goal: Detect contradictions and enforce completeness.
Checks
- Dates consistency (founded_date not in future)
- Country/city coherence (valid ISO2)
- Funding coherence (currency format, non-negative)
- Duplicates (domain collision, name collision)
- Required fields present for publishable draft
- Risk checks:
- prohibited content
- PII leaks
- defamation risk (accusations)
Outputs
validation_report.json:
- errors (blockers)
- warnings (non-blocking)
- completeness score
- confidence summary
Gates
- Errors block progression
- Warnings can pass but are logged
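A few of the checks above can be sketched as pure functions producing the blockers/warnings split. `ISO2` here is a tiny illustrative subset, not a full country table, and `validate_entity` is a hypothetical name.

```python
from datetime import date

# Tiny illustrative subset; a real system would load the full ISO 3166-1 table.
ISO2 = {"DE", "NG", "KE", "US", "FR"}

def validate_entity(draft: dict) -> dict:
    """Run date/country/funding/completeness checks; errors block progression."""
    errors, warnings = [], []
    founded = draft.get("founded_date")
    if founded and date.fromisoformat(founded) > date.today():
        errors.append("founded_date in the future")
    country = draft.get("country")
    if country and country not in ISO2:
        errors.append(f"unknown country code: {country}")
    funding = draft.get("funding_total")
    if funding is not None and funding < 0:
        errors.append("negative funding amount")
    if not draft.get("legal_name"):
        warnings.append("legal_name missing; draft not publishable")
    return {"errors": errors, "warnings": warnings, "blocked": bool(errors)}
```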
Stage 6 — Human Review (Editorial + Verification)
Goal: Convert “AI draft” into “publishable truth”.
Two distinct roles
- Editor Review
- readability, clarity, tone
- remove promotional language
- ensure template compliance
- Verifier Review
- confirm evidence for critical claims
- assign verification level:
- self_reported → partially_verified → verified → externally_verified
- approve or reject claims
UI Requirements
- Side-by-side view:
- extracted field → evidence highlight → approve/edit
- One-click demote “unsupported claims”
- Mark “needs more evidence” with checklist
Outputs
approved_claims[], rejected_claims[], edited_entity_draft
Gates
- Publishing requires:
- editorial_state ≥ verified, OR at least “published (self_reported)” with explicit label
- last_verified_at set
- sources listed
Stage 7 — Canonical Entity Build (Source of Truth)
Goal: Build the final entity record from approved claims.
Process
- Merge approved claims into entity
- Maintain:
- per-field provenance pointers
- entity-level provenance summary
- Write to:
- Postgres entity store
- Graph store (edges)
- Search index (keyword)
- Vector DB (embeddings)
Outputs
entity.json (canonical), graph_updates, search_doc, embedding_artifacts
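The merge step can be sketched as folding approved claims into a canonical record while keeping a per-field provenance pointer back to each claim. `build_entity` is an illustrative name; the superseding rule (later claim wins per field) is an assumption consistent with the claim states above.

```python
def build_entity(entity_id: str, approved_claims: list) -> dict:
    """Merge approved claims into an entity with per-field provenance."""
    entity = {"entity_id": entity_id, "fields": {}, "provenance": {}}
    # Assumption: later claims supersede earlier ones for the same field_path.
    for claim in sorted(approved_claims, key=lambda c: c["created_at"]):
        path = claim["field_path"]
        entity["fields"][path] = claim["value"]
        entity["provenance"][path] = {
            "claim_id": claim["claim_id"],
            "source_ref": claim["source_ref"],
            "verification_level": claim["verification_level"],
        }
    return entity
```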
Stage 8 — Article Rendering (Narrative View)
Goal: Generate the public note from the canonical entity.
AI Tasks
- Produce:
- Executive Summary (150–250 words)
- Core Activity & Tech Structure
- Market Positioning
- Ecosystem Context
- Corridor Analysis (if applicable)
- Verification Note
Hard Constraints
- No new facts may be introduced beyond approved claims
- All quantitative statements must reference approved fields
- Style: impersonal, technical, neutral
Outputs
article.md (or structured blocks), render_blocks.json (for CMS layout)
Stage 9 — Machine Layer Generation (JSON-LD + Feeds)
Goal: Publish machine-readable signals.
Generation
- JSON-LD (schema.org Organization/Person + additionalProperty)
- OpenGraph metadata
- RSS entry (human)
- AI feed entry:
- entity_updates.ndjson (diff-based)
- corridor_briefs.ndjson
Outputs
page.html (or CMS page), embedded_jsonld, feeds
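JSON-LD generation from the canonical entity can be sketched as below, using schema.org `Organization` plus `additionalProperty` for non-standard fields, as noted above. The field-to-property mapping is a simplifying assumption; real fields like sectors would map to richer schema.org properties where they exist.

```python
import json

def to_jsonld(entity: dict) -> str:
    """Render canonical fields as schema.org Organization JSON-LD."""
    fields = entity["fields"]
    doc = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": fields.get("legal_name"),
        # Simplification: everything else becomes a PropertyValue.
        "additionalProperty": [
            {"@type": "PropertyValue", "name": k, "value": v}
            for k, v in fields.items() if k != "legal_name"
        ],
    }
    return json.dumps(doc, indent=2)
```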
Stage 10 — Publication & Indexing
Goal: Release to public + trigger indexing.
Actions
- Publish page (CMS or static deploy)
- Update sitemaps + lastmod
- Ping search engines (optional)
- Notify subscribers / partners (webhooks)
- Log publishing event
Outputs
- Public URL
- API cache refresh
- Webhook events
4) Status Model (State Machine)
Entity state
- draft
- in_review
- verified
- published
- archived
Claim state
- proposed
- approved
- rejected
- superseded (replaced by newer claim)
Edge state
- proposed
- approved
- rejected
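The entity state machine can be encoded as an allowed-transition table. The specific transitions below are plausible assumptions (the document lists the states but not the arrows between them).

```python
# Assumed transitions; the source lists states only, so the arrows are
# an illustrative reading of the pipeline order.
ENTITY_TRANSITIONS = {
    "draft": {"in_review"},
    "in_review": {"draft", "verified"},
    "verified": {"published", "in_review"},
    "published": {"archived", "in_review"},
    "archived": set(),
}

def transition(state: str, new_state: str) -> str:
    """Move to new_state only if the transition table allows it."""
    if new_state not in ENTITY_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```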
5) Confidence & Verification Policy
Confidence (0–100) = AI confidence in extraction correctness.
Verification level = editorial evidence quality.
Rules:
- High confidence does NOT equal verified.
- Verified requires evidence review.
Default levels:
- Interview/submission → self_reported
- Public registry doc → verified
- Partner feed + doc → externally_verified
6) Anti-Hallucination Guards (Critical)
- No free-text generation without entity constraints
Generation must be constrained to approved fields only.
- Evidence pointer required for critical fields
No evidence → cannot be “verified”.
- Diff-based updates
Only changed fields are published as updates.
- Changelog required
Every published entity page includes an update log.
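The diff-based guard can be sketched as a function that emits an `entity_updates.ndjson` line only when something changed. `diff_update` and the record shape are illustrative assumptions.

```python
import json

def diff_update(entity_id: str, old: dict, new: dict):
    """Return one NDJSON line of changed/removed fields, or None if no change."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    if not changed and not removed:
        return None  # nothing to publish; consumers see no update
    return json.dumps({"entity_id": entity_id,
                       "changed": changed, "removed": removed})
```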
7) Outputs (What You Sell)
This pipeline creates multiple products:
Public media
- Human-readable notes
- Sector dossiers
- Corridor briefs
AI/Institutional products
- Verified entity feed
- Corridor pipeline feed
- Graph export
- Premium API endpoints
- Webhooks for updates
8) Minimal MVP Implementation (Practical)
If you want the leanest working version:
- Entity store (Postgres)
- Claims table (per-field)
- Taxonomy registry (simple table)
- Editor/Verifier UI (approve claims)
- Article renderer (template)
- JSON-LD embed + RSS + NDJSON feed
That’s enough to be “AI-native” for real.
9) Recommended “Jobs” & Queues
- intake_queue
- entity_resolution_queue
- claim_extraction_queue
- taxonomy_mapping_queue
- validation_queue
- editor_review_queue
- verifier_queue
- publish_queue
- monitor_update_queue
This makes it scalable to multiple cities and corridors.


