/gaia-feed
user-facing
What it does
/gaia-feed ingests an external document into the brain knowledge layer in a single gesture. Hand it a URL, a local file path, or pipe content via stdin, and it writes a provenance-stamped markdown file under .gaia/knowledge/ingested/ and registers an ingested entry in brain-index.yaml.
The pipeline runs five stages: classify source, fetch content, strip HTML (for URLs), write the ingested file with provenance frontmatter, and register the brain-index entry. Both writes are atomic -- a temporary file is written first and renamed into place only on success.
For URL sources, the ingested file contains the WebFetch-rendered markdown of the page -- clean text with script, style, and head content stripped out -- not the byte-raw HTML.
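The write-then-rename pattern described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the pipeline's actual code; the function name `atomic_write_text` and the temporary-file naming are assumptions.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write text to a sibling temporary file, then rename it over target.

    Hypothetical helper: readers see either the old file or the new one,
    never a partially written file.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is created in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing target in a single step.
        os.replace(tmp_name, target)
    except BaseException:
        # On any failure the prior target is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Calling the helper twice with the same target leaves only the second version on disk, which matches the overwrite-on-success behaviour the section describes.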
When to use it
- You want to pull external reference material (API docs, library guides, specifications) into your queryable knowledge layer.
- You want provenance tracking (source URL, fetch timestamp, content hash, expiry) on ingested documents.
- You want the ingested document to appear alongside project artifacts when querying the Brain via /gaia-brain-query.
- You want to update a previously ingested document to a new version -- re-run the command with the same --slug to overwrite it cleanly.
Prerequisites
- No strict prerequisites. The ingested directory and brain index are created automatically if they do not exist. A seed brain-index.yaml must be present (run /gaia-brain-reindex first if the knowledge store has not been initialized).
How to invoke
/gaia-feed https://example.com/api-docs # ingest a URL
/gaia-feed ./specs/openapi.yaml # ingest a local file
/gaia-feed - # read from stdin (paste content)
/gaia-feed --slug my-api-docs https://example.com/docs # explicit slug
/gaia-feed --ttl 60 https://example.com/docs # 60-day time-to-live
/gaia-feed --tags api,reference ./specs/api.md # explicit tags
/gaia-feed --kind llms_txt https://example.com # force llms_txt source kind
Flags and options
| Flag | Default | Description |
|---|---|---|
| --slug SLUG | Auto-inferred from URL hostname or filename | URL-safe identifier for the ingested file. Determines the filename (<slug>.md) and the brain-index entry key. If you need two different versions of the same source to coexist, give each a distinct slug. |
| --tags TAG1,TAG2 | Auto-inferred from source kind and domain signals | Comma-separated list of tags stored in the provenance frontmatter. Tags help filter results when querying the brain. |
| --ttl DAYS | 30 | Time-to-live in days. The ingested file's expires_at is set to fetched_at + ttl_days. After the TTL elapses without a successful refresh, the entry is marked stale. |
| --kind url\|file\|llms_txt\|stdin | Auto-detected from the source argument | Overrides the auto-detected source kind. Normally set automatically by the orchestration layer (e.g., llms_txt when the llms-full.txt probe succeeds). You rarely need to set this manually. |
Source kinds
The pipeline classifies every ingestion into one of four source kinds, which determines the fetch method and the confidence score assigned to the brain-index entry:
| Source kind | Trigger | Fetch method | Confidence |
|---|---|---|---|
| url | An http:// or https:// URL | WebFetch (orchestration layer) | 0.7 |
| llms_txt | A URL where the llms-full.txt probe succeeds | WebFetch for the llms-full.txt endpoint | 0.9 |
| file | Path to an existing local file | Direct read | 0.8 |
| stdin | - as the source argument | Read from stdin | 0.8 |
llms-full.txt probe
When the source is a URL, the pipeline first probes for a conventional llms-full.txt endpoint at the base of the URL. If the probe returns non-empty content, the pipeline ingests that content directly (with source kind llms_txt and the higher 0.9 confidence tier) instead of fetching and stripping the original page. This provides cleaner, LLM-optimized content when the site publishes it.
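The classification rules and confidence tiers above can be sketched as a small lookup. This is a minimal illustration, not the pipeline's code: the names `classify_source`, `llms_probe_url`, and `CONFIDENCE` are hypothetical, and the probe location (the site root) is an assumption about what "the base of the URL" means.

```python
import os
from urllib.parse import urlsplit, urlunsplit

# Confidence per source kind, copied from the table above.
CONFIDENCE = {"url": 0.7, "llms_txt": 0.9, "file": 0.8, "stdin": 0.8}


def llms_probe_url(url: str) -> str:
    """Return the assumed llms-full.txt location at the site root."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/llms-full.txt", "", ""))


def classify_source(source: str, llms_probe_ok: bool = False) -> str:
    """Map a source argument to one of the four source kinds.

    llms_probe_ok stands in for the result of the real network probe,
    which this sketch does not perform.
    """
    if source == "-":
        return "stdin"
    if source.startswith(("http://", "https://")):
        return "llms_txt" if llms_probe_ok else "url"
    if os.path.isfile(source):
        return "file"
    raise ValueError(f"unrecognised source: {source!r}")
```

For example, `classify_source("https://example.com/docs", llms_probe_ok=True)` yields `llms_txt`, which maps to the 0.9 tier in `CONFIDENCE`.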
What it does step by step
- Classify the source. Determines whether the input is a URL, a local file, or stdin. For URLs, probes for a conventional llms-full.txt endpoint and uses it when available (cleaner, LLM-optimized content).
- Fetch the content. Reads the file directly, reads stdin, or (for URLs) delegates to WebFetch in the orchestration layer. A 30-second fetch timeout and 10 MB size cap are enforced.
- Strip HTML. For URL sources, removes HTML tags, decodes common entities, and strips script/style/head content to produce clean markdown. File and stdin sources pass through unchanged.
- Write the ingested file. Writes the content under .gaia/knowledge/ingested/<slug>.md with exactly 11 provenance frontmatter fields. The write is atomic via a sibling temporary file and rename. If a file with the same slug already exists, it is replaced.
- Register the brain-index entry. Appends (or replaces) an ingested entry in brain-index.yaml with a populated trust block carrying the content hash, source URL, timestamps, and a confidence score tiered by source kind. The index is validated against its schema before the rename; on failure the prior index is preserved.
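The strip step can be approximated with the Python standard library. The sketch below only drops tags, decodes entities, and skips script/style/head content; it produces plain text rather than the rendered markdown that WebFetch returns, and the names `strip_html` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside script/style/head."""

    SKIP = {"script", "style", "head"}

    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML document (simplified)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real renderer would also preserve headings, links, and lists as markdown; this version is only meant to show which content is discarded.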
Same-slug overwrite behavior
Re-running /gaia-feed with the same --slug (or with a source that auto-infers the same slug) replaces the existing entry cleanly. Both the ingested file and the brain-index entry are overwritten atomically -- the old content is not duplicated or versioned.
This is the supported way to update an ingested source to a new version. The provenance frontmatter is refreshed with the new fetch timestamp, content hash, and expiry.
If you want a different version to coexist alongside the existing one (rather than replace it), use a different --slug for each version.
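The replace-or-append behaviour can be pictured as an upsert keyed by slug. The sketch below assumes index entries are dictionaries with a `slug` key held in a list; the real brain-index.yaml schema is not shown in this page, so treat both the structure and the name `upsert_entry` as hypothetical.

```python
def upsert_entry(entries: list, new_entry: dict) -> list:
    """Return a new entry list in which new_entry replaces any same-slug entry.

    Entries with other slugs are kept as-is, so distinct slugs coexist.
    The input list is not modified.
    """
    kept = [entry for entry in entries if entry.get("slug") != new_entry["slug"]]
    return kept + [new_entry]
```

Because the function builds a new list instead of mutating the old one, a caller can validate the result and discard it on failure while the prior index stays intact.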
Provenance frontmatter
Every ingested file carries exactly 11 frontmatter fields:
| Field | Type | Description |
|---|---|---|
| title | string | Document title, auto-inferred from the first heading or filename. |
| slug | string | URL-safe identifier (auto-inferred or explicit via --slug). |
| ingest_source_kind | enum | One of url, file, llms_txt, stdin. |
| source_url | string or null | Origin path or URL; null for stdin. |
| fetched_at | ISO 8601 | UTC timestamp of the fetch. |
| expires_at | ISO 8601 | fetched_at + ttl_days. After this time the entry is considered stale if it has not been successfully refreshed. |
| content_hash | string | sha256 of the post-strip markdown body. |
| ttl_days | integer | Time-to-live in days (default 30). |
| token_estimate | integer | Rough token count derived from word count. |
| tags | list | Auto-inferred tags (source kind, domain signals), or explicit via --tags. |
| status | enum | One of current, stale, failed. New ingestions start as current. |
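To make the derived fields concrete, the sketch below builds the 11 fields for a given body. It is an illustration under stated assumptions: the function name `build_provenance` is hypothetical, the hash is emitted as a bare hex digest, and the words-to-tokens ratio of 4/3 is a guess, since the page only says the estimate is derived from word count.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def build_provenance(body, *, title, slug, kind, source_url, tags,
                     ttl_days=30, now=None):
    """Return the 11 provenance fields for an ingested body (sketch)."""
    fetched_at = now or datetime.now(timezone.utc)
    # expires_at is fetched_at plus the time-to-live in days.
    expires_at = fetched_at + timedelta(days=ttl_days)
    word_count = len(body.split())
    return {
        "title": title,
        "slug": slug,
        "ingest_source_kind": kind,
        "source_url": source_url,
        "fetched_at": fetched_at.isoformat(),
        "expires_at": expires_at.isoformat(),
        # sha256 over the post-strip body; hex encoding is an assumption.
        "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "ttl_days": ttl_days,
        # Assumed ratio of roughly 4 tokens per 3 words.
        "token_estimate": round(word_count * 4 / 3),
        "tags": list(tags),
        "status": "current",
    }
```

With the default 30-day TTL, a document fetched at midnight UTC on 1 January gets an expires_at of midnight UTC on 31 January.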
Security controls
The ingestion pipeline enforces three layers of protection:
- SSRF pre-check. Before any network read, the safe-fetch guard resolves the host and rejects URLs pointing to private (RFC 1918), link-local, loopback, carrier-grade NAT (RFC 6598), or cloud-metadata addresses. Only http and https schemes are permitted.
- Size cap and fetch timeout. Fetched content is capped at 10 MB; a 30-second fetch timeout prevents resource exhaustion.
- Slug write-boundary containment. The slug is sanitised (path separators and traversal sequences are stripped) and a realpath containment check verifies the resolved write path is a child of .gaia/knowledge/ingested/ before any file is created.
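Two of these checks are easy to illustrate with the standard library. The sketch below is not the safe-fetch guard itself: it classifies an already-resolved IP address (DNS resolution is omitted) and shows one possible slug sanitiser plus a realpath containment check. All function names and the exact sanitising rule are assumptions.

```python
import ipaddress
import os
import re
from pathlib import Path

# RFC 6598 carrier-grade NAT range, checked explicitly.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked_address(addr: str) -> bool:
    """Return True for private, loopback, link-local, or CGNAT addresses.

    The common cloud-metadata address 169.254.169.254 is link-local,
    so it is rejected by the same test.
    """
    ip = ipaddress.ip_address(addr)
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return True
    return ip.version == 4 and ip in CGNAT


def sanitise_slug(raw: str) -> str:
    """Reduce raw input to lowercase letters, digits, and hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    if not slug:
        raise ValueError("slug is empty after sanitising")
    return slug


def contained_path(root: Path, slug: str) -> Path:
    """Resolve <root>/<slug>.md and verify it is a direct child of root."""
    root_real = Path(os.path.realpath(root))
    candidate = Path(os.path.realpath(root_real / f"{slug}.md"))
    if candidate.parent != root_real:
        raise ValueError(f"write path escapes {root_real}")
    return candidate
```

Sanitising first and checking containment second gives two independent barriers: even if a traversal sequence survived the first step, the resolved path comparison would still reject it.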
Outputs
| Output | Location | Description |
|---|---|---|
| Ingested file | .gaia/knowledge/ingested/<slug>.md | The ingested document with provenance frontmatter. |
| Brain index entry | .gaia/knowledge/brain-index.yaml | An ingested entry with a trust block carrying content hash, source URL, timestamps, and confidence. |
What to run next
- /gaia-brain-query -- query the brain to see the ingested document alongside project artifacts.
- /gaia-knowledge-refresh -- re-fetch all ingested sources and update any that have changed upstream.
- /gaia-brain-reindex -- the reindex sweep preserves ingested entries; run it any time to refresh project-artifact entries without losing ingested content.
Troubleshooting
The slug already exists
This is expected behavior. If a file with the same slug already exists, the pipeline overwrites it atomically. The brain-index entry is replaced with fresh provenance. See Same-slug overwrite behavior.
Brain-index validation failed
The pipeline validates the index before committing. On failure, the prior index is preserved. Check the error message for schema violations and ensure the index is well-formed.
URL fetch failed
URL fetching is delegated to WebFetch in the orchestration layer. Ensure the URL is reachable and returns content. Paywalled, SPA-rendered, and authenticated sources are out of scope.
How do I update an ingested source?
Re-run /gaia-feed with the same --slug. The existing entry is overwritten cleanly. See Same-slug overwrite behavior.
How do I remove an ingested source?
Use /gaia-unfeed <slug>. It deletes the ingested file and de-registers the index entry atomically.