AI Extraction–Validation–Publication Pipeline v1
(Entity-first • Claim-based • Provenance-driven • Human-in-the-loop)
1) Objectives
The pipeline must reliably transform unstructured inputs into:
- Canonical Entities (Company / Person / Corridor)
- Claims (field-level truth units with evidence + confidence)
- Relationships (graph edges with provenance)
- Publishable Notes (human narrative rendering + embedded JSON-LD)
- Machine Feeds/APIs (entity updates, corridor briefs, NDJSON streams)
2) Inputs (Ingestion Channels)
A. Structured
- Company submission form (recommended primary)
- Founder/Person submission form
- Corridor opportunity intake form (trade/investment/RE/legal)
B. Semi-structured
- Interview transcript (Q&A)
- Email threads
- “Pitch deck + short answers” templates
C. Unstructured
- PDFs (deck, registry docs)
- Website content snapshots
- Public registry extracts (optional)
- Partner feeds (institutions)
All inputs are stored as immutable artifacts with IDs:
document_id, transcript_id, submission_id
3) Pipeline Stages (End-to-End)
Stage 0 — Intake & Preprocessing
Goal: Normalize inputs and create a traceable job.
Actions
- Assign job_id
- Store raw artifact in object storage
- Extract text (PDF → text)
- Chunk into logical sections
- Detect language + translate (optional)
- Create ingestion_manifest:
- source type
- timestamps
- submitter identity (if any)
- consent flags (interview / PII)
Outputs
raw_text, chunks[], manifest.json
Gates
- If no consent and contains PII → route to compliance review
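The intake gate above can be sketched as a small routing function. This is an illustrative sketch, not a fixed API: the names `route_intake` and `PII_PATTERNS` are assumptions, and a real system would use a proper PII detector rather than two regexes.

```python
import re
import uuid
from datetime import datetime, timezone

# Illustrative PII detectors only; production systems need a real classifier.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email addresses
    re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),     # phone-like numbers
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

def route_intake(raw_text: str, consent: bool) -> dict:
    """Create a traceable manifest and apply the consent/PII gate."""
    manifest = {
        "job_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "consent": consent,
    }
    if not consent and contains_pii(raw_text):
        manifest["route"] = "compliance_review"
    else:
        manifest["route"] = "entity_detection"
    return manifest
```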
Stage 1 — Entity Detection & Candidate Generation
Goal: Identify which entities appear in the input and whether they already exist in the store.
AI Tasks
- Named entity recognition (Company, Person, City, Institution, Product)
- Normalize names, domains, locations
- Propose entity candidates:
- existing match probability
- new entity proposal
Deterministic Tasks
- Domain normalization (strip tracking)
- Location normalization (country code ISO2)
- Slug suggestion
Outputs
entity_candidates[] with:
- entity_type
- canonical_name
- match_candidates[] (existing entity IDs + match score)
- proposed_new_entity (if no match)
Gates
- If match score ≥ threshold (e.g., 0.92) → auto-link to existing entity
- Else → human chooses (or AI asks minimal disambiguation)
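The linking gate can be sketched as follows; `resolve_entity` is a hypothetical name, and the 0.92 threshold simply mirrors the example value above. Real match scores would come from an embedding- or fuzzy-name-based matcher.

```python
AUTO_LINK_THRESHOLD = 0.92  # mirrors the example threshold above

def resolve_entity(candidate: dict) -> dict:
    """Auto-link above threshold; otherwise defer to a human or propose new."""
    matches = sorted(candidate.get("match_candidates", []),
                     key=lambda m: m["score"], reverse=True)
    if matches and matches[0]["score"] >= AUTO_LINK_THRESHOLD:
        return {"action": "auto_link", "entity_id": matches[0]["entity_id"]}
    if matches:
        return {"action": "human_disambiguation", "options": matches}
    return {"action": "propose_new_entity"}
```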
Stage 2 — Claim Extraction (Field-Level Truth Units)
Goal: Extract structured fields as claims with evidence.
AI Tasks
- Extract schema-aligned fields per entity type:
- Company: sector, stage, business model, products, markets, etc.
- Person: roles, affiliations, expertise
- Corridor: nodes, domains, operating model
Hard Rules
- If a field is not explicitly supported by the input evidence:
- mark as unknown, OR
- create claim with low confidence + “self_reported” level
Outputs (Claim objects)
Each claim:
- claim_id
- entity_id (or temporary entity key)
- field_path (e.g., funding.total_raised)
- value
- evidence_span (chunk_id + start/end offsets)
- source_ref (document_id / interview / registry)
- confidence (0–1 or 0–100)
- verification_level (default = self_reported)
- created_at
Gates
- Critical fields require evidence:
- legal_name, location, founded_date, funding, certifications, export markets
- If missing evidence → cannot move to “verified”
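A minimal sketch of the claim record and the evidence gate, assuming the field names listed above. `CRITICAL_FIELDS` mirrors the examples in the gate; the exact set would live in configuration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Mirrors the critical-field examples above; illustrative, not exhaustive.
CRITICAL_FIELDS = {"legal_name", "location", "founded_date",
                   "funding.total_raised", "certifications", "export_markets"}

@dataclass
class Claim:
    claim_id: str
    entity_id: str
    field_path: str
    value: object
    evidence_span: Optional[dict]   # {"chunk_id": ..., "start": ..., "end": ...}
    source_ref: str
    confidence: float               # 0–1
    verification_level: str = "self_reported"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def can_verify(claim: Claim) -> bool:
    """Critical fields cannot reach 'verified' without an evidence span."""
    if claim.field_path in CRITICAL_FIELDS:
        return claim.evidence_span is not None
    return True
```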
Stage 3 — Relationship Extraction (Graph Edges)
Goal: Turn implicit relationships into explicit edges.
AI Tasks
- Identify relationships:
- Company founded_by Person
- Company partners_with Institution
- Company exports_to Country
- Corridor includes Node City
- Company corridor_fit domains
Outputs
edges[]:
- from_entity_id
- to_entity_id
- relation_type
- evidence
- confidence
Gates
- No edge without evidence pointer
- If evidence is weak → “proposed edge” status
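Both edge gates can be sketched together; `gate_edge` and the 0.6 weak-evidence threshold are illustrative assumptions.

```python
def gate_edge(edge: dict, weak_threshold: float = 0.6) -> dict:
    """Reject evidence-free edges; mark weak-evidence edges as proposed.

    weak_threshold is an illustrative assumption, not a spec value.
    """
    if not edge.get("evidence"):
        raise ValueError("no edge without an evidence pointer")
    status = "proposed" if edge["confidence"] < weak_threshold else "review_ready"
    return {**edge, "status": status}
```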
Stage 4 — Normalization & Taxonomy Assignment
Goal: Convert free-text into controlled vocab references.
AI Tasks
- Map sector strings → sector taxonomy codes
- Map tech mentions → technology taxonomy codes
- Map topics → topic taxonomy codes
- Map corridor domains → corridor_domain codes
Deterministic Checks
- Only allow tags existing in the taxonomy registry
- Reject unknown codes or route to “taxonomy steward” queue
Outputs
- Normalized entity draft:
- primary_sector (tagRef)
- secondary_sectors[]
- technologies[]
- topics[]
Gates
- If AI suggests a non-existent tag → requires steward approval
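The registry check can be sketched as a split between accepted tags and a steward queue. The registry contents below are made-up examples; `assign_tags` is a hypothetical name.

```python
# Made-up registry contents for illustration only.
TAXONOMY_REGISTRY = {
    "sector": {"agritech", "fintech", "logistics"},
    "technology": {"ml", "iot", "blockchain"},
}

def assign_tags(kind: str, suggested: list) -> dict:
    """Accept only codes already in the registry; queue the rest for stewards."""
    known = TAXONOMY_REGISTRY.get(kind, set())
    accepted = [t for t in suggested if t in known]
    steward_queue = [t for t in suggested if t not in known]
    return {"accepted": accepted, "steward_queue": steward_queue}
```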
Stage 5 — Consistency & Quality Validation (Automated)
Goal: Detect contradictions and enforce completeness.
Checks
- Dates consistency (founded_date not in future)
- Country/city coherence (valid ISO2)
- Funding coherence (currency format, non-negative)
- Duplicates (domain collision, name collision)
- Required fields present for publishable draft
- Risk checks:
- prohibited content
- PII leaks
- defamation risk (accusations)
Outputs
validation_report.json:
- errors (blockers)
- warnings (non-blocking)
- completeness score
- confidence summary
Gates
- Errors block progression
- Warnings can pass but are logged
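A few of the checks above can be sketched as pure functions producing the blockers/warnings split. `ISO2` here is a tiny illustrative subset, not a full country table, and `validate_entity` is a hypothetical name.

```python
from datetime import date

# Tiny illustrative subset; a real system would load the full ISO 3166-1 table.
ISO2 = {"DE", "NG", "KE", "US", "FR"}

def validate_entity(draft: dict) -> dict:
    """Run date/country/funding/completeness checks; errors block progression."""
    errors, warnings = [], []
    founded = draft.get("founded_date")
    if founded and date.fromisoformat(founded) > date.today():
        errors.append("founded_date in the future")
    country = draft.get("country")
    if country and country not in ISO2:
        errors.append(f"unknown country code: {country}")
    funding = draft.get("funding_total")
    if funding is not None and funding < 0:
        errors.append("negative funding amount")
    if not draft.get("legal_name"):
        warnings.append("legal_name missing; draft not publishable")
    return {"errors": errors, "warnings": warnings, "blocked": bool(errors)}
```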
Stage 6 — Human Review (Editorial + Verification)
Goal: Convert “AI draft” into “publishable truth”.
Two distinct roles
- Editor Review
- readability, clarity, tone
- remove promotional language
- ensure template compliance
- Verifier Review
- confirm evidence for critical claims
- assign verification level:
- self_reported → partially_verified → verified → externally_verified
- approve or reject claims
UI Requirements
- Side-by-side view:
- extracted field → evidence highlight → approve/edit
- One-click demote “unsupported claims”
- Mark “needs more evidence” with checklist
Outputs
approved_claims[], rejected_claims[], edited_entity_draft
Gates
- Publishing requires:
- editorial_state ≥ verified, OR at least “published (self_reported)” with explicit label
- last_verified_at set
- sources listed
Stage 7 — Canonical Entity Build (Source of Truth)
Goal: Build the final entity record from approved claims.
Process
- Merge approved claims into entity
- Maintain:
- per-field provenance pointers
- entity-level provenance summary
- Write to:
- Postgres entity store
- Graph store (edges)
- Search index (keyword)
- Vector DB (embeddings)
Outputs
entity.json (canonical), graph_updates, search_doc, embedding_artifacts
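The merge step can be sketched as folding approved claims into a canonical record while keeping a per-field provenance pointer back to each claim. `build_entity` is an illustrative name; the superseding rule (later claim wins per field) is an assumption consistent with the claim states above.

```python
def build_entity(entity_id: str, approved_claims: list) -> dict:
    """Merge approved claims into an entity with per-field provenance."""
    entity = {"entity_id": entity_id, "fields": {}, "provenance": {}}
    # Assumption: later claims supersede earlier ones for the same field_path.
    for claim in sorted(approved_claims, key=lambda c: c["created_at"]):
        path = claim["field_path"]
        entity["fields"][path] = claim["value"]
        entity["provenance"][path] = {
            "claim_id": claim["claim_id"],
            "source_ref": claim["source_ref"],
            "verification_level": claim["verification_level"],
        }
    return entity
```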
Stage 8 — Article Rendering (Narrative View)
Goal: Generate the public note from the canonical entity.
AI Tasks
- Produce:
- Executive Summary (150–250 words)
- Core Activity & Tech Structure
- Market Positioning
- Ecosystem Context
- Corridor Analysis (if applicable)
- Verification Note
Hard Constraints
- No new facts may be introduced beyond approved claims
- All quantitative statements must reference approved fields
- Style: impersonal, technical, neutral
Outputs
article.md (or structured blocks), render_blocks.json (for CMS layout)
Stage 9 — Machine Layer Generation (JSON-LD + Feeds)
Goal: Publish machine-readable signals.
Generation
- JSON-LD (schema.org Organization/Person + additionalProperty)
- OpenGraph metadata
- RSS entry (human)
- AI feed entry:
- entity_updates.ndjson (diff-based)
- corridor_briefs.ndjson
Outputs
page.html (or CMS page), embedded_jsonld, feeds
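JSON-LD generation from the canonical entity can be sketched as below, using schema.org `Organization` plus `additionalProperty` for non-standard fields, as noted above. The field-to-property mapping is a simplifying assumption; real fields like sectors would map to richer schema.org properties where they exist.

```python
import json

def to_jsonld(entity: dict) -> str:
    """Render canonical fields as schema.org Organization JSON-LD."""
    fields = entity["fields"]
    doc = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": fields.get("legal_name"),
        # Simplification: everything else becomes a PropertyValue.
        "additionalProperty": [
            {"@type": "PropertyValue", "name": k, "value": v}
            for k, v in fields.items() if k != "legal_name"
        ],
    }
    return json.dumps(doc, indent=2)
```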
Stage 10 — Publication & Indexing
Goal: Release to public + trigger indexing.
Actions
- Publish page (CMS or static deploy)
- Update sitemaps + lastmod
- Ping search engines (optional)
- Notify subscribers / partners (webhooks)
- Log publishing event
Outputs
- Public URL
- API cache refresh
- Webhook events
4) Status Model (State Machine)
Entity state
- draft
- in_review
- verified
- published
- archived
Claim state
- proposed
- approved
- rejected
- superseded (replaced by newer claim)
Edge state
- proposed
- approved
- rejected
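The entity state machine can be encoded as an allowed-transition table. The specific transitions below are plausible assumptions (the document lists the states but not the arrows between them).

```python
# Assumed transitions; the source lists states only, so the arrows are
# an illustrative reading of the pipeline order.
ENTITY_TRANSITIONS = {
    "draft": {"in_review"},
    "in_review": {"draft", "verified"},
    "verified": {"published", "in_review"},
    "published": {"archived", "in_review"},
    "archived": set(),
}

def transition(state: str, new_state: str) -> str:
    """Move to new_state only if the transition table allows it."""
    if new_state not in ENTITY_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```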
5) Confidence & Verification Policy
Confidence (0–100) = AI confidence in extraction correctness.
Verification level = editorial evidence quality.
Rules:
- High confidence does NOT equal verified.
- Verified requires evidence review.
Default levels:
- Interview/submission → self_reported
- Public registry doc → verified
- Partner feed + doc → externally_verified
6) Anti-Hallucination Guards (Critical)
- No free-text generation without entity constraints
Generation must be constrained to approved fields only.
- Evidence pointer required for critical fields
No evidence → cannot be “verified”.
- Diff-based updates
Only changed fields are published as updates.
- Changelog required
Every published entity page includes an update log.
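The diff-based guard can be sketched as a function that emits an `entity_updates.ndjson` line only when something changed. `diff_update` and the record shape are illustrative assumptions.

```python
import json

def diff_update(entity_id: str, old: dict, new: dict):
    """Return one NDJSON line of changed/removed fields, or None if no change."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    if not changed and not removed:
        return None  # nothing to publish; consumers see no update
    return json.dumps({"entity_id": entity_id,
                       "changed": changed, "removed": removed})
```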
7) Outputs (What You Sell)
This pipeline creates multiple products:
Public media
- Human-readable notes
- Sector dossiers
- Corridor briefs
AI/Institutional products
- Verified entity feed
- Corridor pipeline feed
- Graph export
- Premium API endpoints
- Webhooks for updates
8) Minimal MVP Implementation (Practical)
If you want the leanest working version:
- Entity store (Postgres)
- Claims table (per-field)
- Taxonomy registry (simple table)
- Editor/Verifier UI (approve claims)
- Article renderer (template)
- JSON-LD embed + RSS + NDJSON feed
That’s enough to be “AI-native” for real.
9) Recommended “Jobs” & Queues
- intake_queue
- entity_resolution_queue
- claim_extraction_queue
- taxonomy_mapping_queue
- validation_queue
- editor_review_queue
- verifier_queue
- publish_queue
- monitor_update_queue
This makes it scalable to multiple cities and corridors.


