Methodology
Trustworthy by default.
Nothing hidden.
How Djeed makes data trustworthy by default — and what we won't hide from you. The same discipline applies across the dataset catalog, DjeedX workspaces, and the API.
Principle 01
Provenance on every record.
Every record in every Djeed dataset carries a full audit trail: who created it, when, and every diff since. The history endpoint is read-only and exportable.
AI-extracted records also store the upstream extraction payload — the raw response from the model, not just our parsed interpretation — so any disputed value can be re-checked against what the model actually returned.
Principle 02
Berkeley Protocol-ready outputs.
The Berkeley Protocol on Digital Open Source Investigations (2020, co-developed by the Human Rights Center at UC Berkeley and the UN Human Rights Office, OHCHR) is the international standard for collecting, preserving, verifying, and presenting open-source information in investigations of international crimes and human rights violations. It is used by the ICC, UN fact-finding missions, accountability NGOs, and tribunal researchers.
Djeed does not apply the Berkeley Protocol — you do. What Djeed delivers is data that is ready to be used under it.
Every Silver-tier record carries the chain-of-custody attributes the Protocol expects: original source URL, capture timestamp, content hash, raw extraction payload, deduplication trail, and version history. The analyst — you — runs the verification, source-assessment, corroboration, and ethical-review steps the Protocol requires.
We power the workflow with data and tools — DjeedX, the API, the per-dataset evidence trail — that make Berkeley-Protocol-compliant investigation possible at the speed and scale the work actually demands.
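The chain-of-custody bundle described above can be pictured as a small structure. This is a sketch only: every field name here is an illustrative assumption, not Djeed's actual schema. The integrity check mirrors the point about re-checking disputed values against the stored hash and raw payload.

```python
import hashlib
import json

# Hypothetical shape of a Silver-tier provenance bundle.
# Field names are illustrative assumptions, not Djeed's actual schema.
raw_payload = '{"event": "checkpoint incident", "date": "2024-03-02"}'

provenance = {
    "source_url": "https://example.org/report/123",     # original source URL
    "captured_at": "2024-03-02T14:05:00Z",              # capture timestamp
    "content_hash": hashlib.sha256(raw_payload.encode()).hexdigest(),
    "raw_extraction_payload": json.loads(raw_payload),  # the model's raw output
    "dedup_trail": ["rec_0041", "rec_0057"],            # records merged into this one
    "version_history": [
        {"version": 1, "edited_by": "pipeline", "at": "2024-03-02T14:06:00Z"},
    ],
}

def verify_integrity(payload: str, bundle: dict) -> bool:
    """Re-check a disputed record: its stored hash must match the payload."""
    return hashlib.sha256(payload.encode()).hexdigest() == bundle["content_hash"]
```

Because the hash is computed over the raw extraction payload rather than the parsed record, any downstream parsing dispute can be settled against what the model actually returned.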
Principle 03
We deliver Silver. Gold is yours to build.
The factory processes data through three layers:
- Bronze — direct extraction from the source, with the original URL preserved. Internal layer, never published.
- Silver — deduplicated, entity-resolved, cross-source corroborated. Every dataset on djeed.com ships at Silver, ready to use.
- Gold — verified in your context. Combine Silver with internal records, partner reports, and expert judgement, built inside DjeedX or via the API in your own systems.
The handoff is intentional. Djeed earns trust by being the rigorous external party that does the layered extraction at scale. You earn judgement by combining that base with what only you know. Neither side does the other's job.
Principle 04
Bundled datasets are sourced, not opinions.
The base datasets we bundle into DjeedX (admin boundaries, conflict events, partner-org references, baseline indicators) come from the factory's Bronze→Silver pipeline, with explicit source URLs on every record. Each bundled record carries a citation edge to the originating source. Edit a bundled record in your DjeedX workspace and your edits stay in your workspace — the upstream isn't mutated. That's the path from bundled Silver to your own Gold.
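The copy-on-write behaviour of bundled records can be sketched as follows. All identifiers and field names are hypothetical; the point is that the edit lands on a fork in your workspace while the upstream record stays untouched.

```python
import copy

# Copy-on-write sketch: editing a bundled record forks it into the workspace;
# the upstream Silver record is never mutated. All identifiers are hypothetical.
upstream = {"record_id": "adm_001", "name": "Gondar Zone",
            "source_url": "https://example.org/gazetteer"}
workspace: dict = {}

def edit_bundled(record: dict, field: str, value: str) -> dict:
    """Fork the bundled record into the local workspace and edit the fork."""
    fork = copy.deepcopy(record)
    fork[field] = value
    fork["forked_from"] = record["record_id"]  # citation edge back upstream
    workspace[record["record_id"]] = fork
    return fork

edited = edit_bundled(upstream, "name", "North Gondar Zone")
```

The `forked_from` pointer plays the role of the citation edge: your Gold edit always traces back to the bundled Silver record it started from.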
Principle 05
Privacy on cross-workspace entity matches.
DjeedX can tell you that a partner organisation in your workspace also appears in N other workspaces — but never which ones. The other workspace owner has to opt in to disclose a contact. This is the only way cross-workspace insight is shared.
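The count-only disclosure can be sketched like this. Workspace names and entity data are invented; the sketch shows only the shape of the guarantee: the caller gets a number, never the identities of the other workspaces.

```python
# Count-only cross-workspace matching: the caller learns how many *other*
# workspaces contain an entity, never which ones. All names are invented.
workspace_entities = {
    "ws_a": {"acme_relief", "globex_aid"},
    "ws_b": {"acme_relief"},
    "ws_c": {"acme_relief", "initech_fund"},
}

def cross_workspace_count(entity: str, my_workspace: str) -> int:
    """Return only the count of other workspaces holding the entity."""
    return sum(
        1
        for ws, entities in workspace_entities.items()
        if ws != my_workspace and entity in entities
    )
```

Anything beyond the count, such as a contact, requires the other workspace owner's opt-in, so the disclosure surface stays a single integer.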
Principle 06
What ships with every published dataset.
Each dataset on djeed.com comes with a signed methodology page that documents exactly what is inside and how it got there. The page is readable on the public catalog and exportable alongside any record export — it is the dataset's receipt.
Coring (unit of observation)
Every dataset declares its primary unit — what one row represents. Djeed datasets are cored at one of:
- Event-centric — one row = one event (incident, decision, action), with linked actors, locations, dates, and sources.
- Claim-centric — one row = one assertion by one source about a fact in the world. Atomic, never merged across sources.
- Act-centric — one row = one action (often nested inside an event); granular for accountability work.
- Entity-centric — one row = one organisation, person, group, or asset, aggregated across all its appearances.
- Indicator-centric — one row = one measure at one place at one time (statistical surface).
The choice depends on the buyer's use case. The methodology page of each dataset declares the coring up front and ties every record back to that decision.
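A toy illustration of how the same reporting looks under two of the corings above. All values are invented: claim-centric keeps one row per source assertion, while event-centric keeps one row per event and links the claims rather than flattening them.

```python
# Claim-centric coring: one row per assertion by one source, never merged.
claims = [
    {"claim_id": "c1", "source": "outlet_a",
     "assertion": "strike hit depot", "date": "2024-05-01"},
    {"claim_id": "c2", "source": "outlet_b",
     "assertion": "strike hit depot", "date": "2024-05-01"},
]

# Event-centric coring: one row per event, with the claims linked, not lost.
event = {
    "event_id": "e1",
    "date": "2024-05-01",
    "linked_claims": [c["claim_id"] for c in claims],
}
```

Both corings describe the same incident; they differ only in what one row means, which is exactly the decision the methodology page declares up front.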
Bronze → Silver migration
Records start at Bronze (raw extraction with the source URL preserved). They become Silver when they pass the migration gate. The pipeline runs:
- URL dedup — already-processed sources are skipped before any extraction work happens.
- Spatial enrichment — lat / lon is reverse-geocoded to authoritative country / admin1 / admin2 codes via shapefile join, plus an H3 hex cell (resolution 7) for fast spatial blocking.
- Entity resolution — fuzzy name matching collapses spelling variants of the same actor across sources.
- Cross-source corroboration — multiple claims about the same event are linked together, with confidence scored from source reliability + claim agreement.
- Promotion gate — a record is promoted to Silver when corroboration confidence ≥ 0.7 AND at least three independent sources back it. Below the gate it stays Bronze and is not published.
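The promotion gate at the end of the pipeline can be sketched directly from the thresholds stated above. The record shape is a hypothetical assumption; the two conditions are the ones in the text.

```python
# Minimal sketch of the Silver promotion gate described above.
# The record shape is a hypothetical assumption; thresholds come from the text.
def migration_tier(record: dict) -> str:
    """Silver requires corroboration confidence >= 0.7 AND >= 3 independent sources."""
    if (record["corroboration_confidence"] >= 0.7
            and record["independent_sources"] >= 3):
        return "silver"
    return "bronze"  # below the gate: stays internal, not published
```

Note that both conditions must hold: a highly confident single-source record stays Bronze just as surely as a weakly corroborated multi-source one.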
Deduplication scoring
At the dedup stage, every candidate pair is scored across five dimensions:
- Spatial proximity — same H3 cell, or within a kilometre distance threshold.
- Temporal proximity — same day, or within N days when date precision is fuzzy.
- Semantic similarity — cosine similarity of claim-text embeddings above a threshold.
- Category match — primary and secondary action categories must align.
- Entity match — actor and victim names resolve to the same canonical entity.
Pairs above the auto-merge threshold (typically same-day, within 5 km, same category) merge automatically. Borderline pairs go to AI review; hard cases escalate to human review. Every merge keeps a trail back to the source records — provenance is never lost.
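The auto-merge shortcut can be sketched as below. The field names and the exact rule composition are assumptions; only the three dimensions named in the shortcut (same day, within 5 km, same category) appear here, with the other dimensions left to the AI and human review tiers.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (standard haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def auto_merge(a: dict, b: dict) -> bool:
    """Sketch of the auto-merge shortcut: same day, within 5 km, same category.
    Field names are assumptions; borderline pairs fall through to review."""
    return (
        a["date"] == b["date"]
        and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= 5.0
        and a["category"] == b["category"]
    )
```

A pair failing this shortcut is not discarded; it simply loses the automatic path and is scored across all five dimensions before review.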
Final dataset output
A subscriber sees a Silver-tier dataset with:
- Typed rows following the coring (one event / claim / act / entity / indicator per row).
- Per-record provenance bundle — source URLs, capture timestamp, content hash, raw extraction payload, full edit history.
- Spatial fields — lat / lon, country, admin1, admin2, H3 cell.
- Temporal fields — event date, date precision, week / month / year buckets.
- Confidence score per record.
- Verification status field (unverified / verified / disputed).
- Linked entities, events, sources — explorable in the per-dataset graph view, mappable in the map view, sortable in the table view.
Output formats: CSV, JSON, GeoJSON, Excel. Live API + OData endpoints for Power BI and Tableau on higher-tier subscriptions.
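Put together, a single exported row might look like the sketch below. Every field name and value is invented for illustration; the fields mirror the list above (coring, provenance, spatial, temporal, confidence, verification status, links).

```python
# One illustrative Silver-tier row carrying the fields listed above.
# Every name and value here is invented for illustration.
silver_row = {
    "event_id": "e_2024_000184",
    "event_date": "2024-05-01",
    "date_precision": "day",
    "lat": 9.05, "lon": 38.74,
    "country": "ETH", "admin1": "ET14", "admin2": "ET1401",
    "h3_cell": "873e60a4dffffff",          # resolution-7 H3 index (example string)
    "confidence": 0.82,
    "verification_status": "unverified",   # unverified / verified / disputed
    "linked_entities": ["ent_0031"],
    "sources": ["https://example.org/report/123"],
}
```

A dict like this serializes cleanly to JSON, flattens to a CSV row, and maps to a GeoJSON Feature with the lat/lon pair as its geometry.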
Roadmap
Open questions we're still working on.
- Per-record confidence decay over time for AI-extracted records.
- Public methodology versioning — every breaking schema change tagged with a migration note and diff endpoint.
- Independent third-party methodology audits on the highest-evidence datasets.
- A formal Silver-to-Gold playbook — patterns for how teams cook Djeed Silver into decision-ready Gold inside their own workflows.
- Per-dataset Berkeley Protocol coverage matrix — for each Silver dataset, which Protocol attributes are guaranteed, which are best-effort, and which are out of scope.
Built to be inspected. Trusted by default.
Browse a live dataset to see the methodology in action, or talk to us about a custom one.