/gaia-feed

user-facing
Category: Sprint Management
Lifecycle phase: 4 -- Implementation
Arguments: <url-or-path | - for stdin> [--slug SLUG] [--tags TAG1,TAG2] [--ttl DAYS] [--kind url|file|llms_txt|stdin]

What it does

/gaia-feed ingests an external document into the brain knowledge layer with a single command. Hand it a URL, a local file path, or content piped via stdin, and it writes a provenance-stamped markdown file under .gaia/knowledge/ingested/ and registers an ingested entry in brain-index.yaml.

The pipeline runs five stages: classify source, fetch content, strip HTML (for URLs), write the ingested file with provenance frontmatter, and register the brain-index entry. Both writes are atomic -- a temporary file is written first and renamed into place only on success.
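
The temporary-file-then-rename pattern described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function name atomic_write_text is hypothetical, and the sketch assumes the temporary file lives in the same directory as the target so the rename never crosses filesystems.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write text to target via a sibling temp file, then rename into place.

    Hypothetical helper for illustration; not part of /gaia-feed itself.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is a sibling of the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the new file in, overwriting any existing target.
        os.replace(tmp_name, target)
    except BaseException:
        # On any failure, drop the temp file and leave the prior target untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the rename happens only after the write succeeds, a reader sees either the old file or the new one, never a partial file.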

For URL sources, the ingested file contains the WebFetch-rendered markdown of the page -- clean text with script, style, and head content stripped out -- not the byte-raw HTML.

When to use it

  • You want to pull external reference material (API docs, library guides, specifications) into your queryable knowledge layer.
  • You want provenance tracking (source URL, fetch timestamp, content hash, expiry) on ingested documents.
  • You want the ingested document to appear alongside project artifacts when querying the Brain via /gaia-brain-query.
  • You want to update a previously ingested document to a new version -- re-run the command with the same --slug to overwrite it cleanly.

Prerequisites

  • A seed brain-index.yaml must be present; run /gaia-brain-reindex first if the knowledge store has not been initialized. Beyond that there are no strict prerequisites: the ingested directory is created automatically if it does not exist.

How to invoke

/gaia-feed https://example.com/api-docs         # ingest a URL
/gaia-feed ./specs/openapi.yaml                  # ingest a local file
/gaia-feed -                                     # read from stdin (paste content)
/gaia-feed --slug my-api-docs https://example.com/docs   # explicit slug
/gaia-feed --ttl 60 https://example.com/docs     # 60-day time-to-live
/gaia-feed --tags api,reference ./specs/api.md   # explicit tags
/gaia-feed --kind llms_txt https://example.com   # force llms_txt source kind

Flags and options

| Flag | Default | Description |
| --- | --- | --- |
| --slug SLUG | Auto-inferred from URL hostname or filename | URL-safe identifier for the ingested file. Determines the filename (<slug>.md) and the brain-index entry key. If you need two different versions of the same source to coexist, give each a distinct slug. |
| --tags TAG1,TAG2 | Auto-inferred from source kind and domain signals | Comma-separated list of tags stored in the provenance frontmatter. Tags help filter results when querying the brain. |
| --ttl DAYS | 30 | Time-to-live in days. The ingested file's expires_at is set to fetched_at + ttl_days. After the TTL elapses without a successful refresh, the entry is marked stale. |
| --kind url\|file\|llms_txt\|stdin | Auto-detected from the source argument | Overrides the auto-detected source kind. Normally set automatically by the orchestration layer (e.g., llms_txt when the llms-full.txt probe succeeds). You rarely need to set this manually. |

Source kinds

The pipeline classifies every ingestion into one of four source kinds, which determines the fetch method and the confidence score assigned to the brain-index entry:

| Source kind | Trigger | Fetch method | Confidence |
| --- | --- | --- | --- |
| url | An http:// or https:// URL | WebFetch (orchestration layer) | 0.7 |
| llms_txt | A URL where the llms-full.txt probe succeeds | WebFetch for the llms-full.txt endpoint | 0.9 |
| file | Path to an existing local file | Direct read | 0.8 |
| stdin | - as the source argument | Read from stdin | 0.8 |
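
As a rough illustration of the classification rules in the table above, the sketch below maps a raw source argument to a source kind and looks up its confidence. The names classify_source, CONFIDENCE, and llms_probe_ok are hypothetical; only the four kinds, their triggers, and the confidence values come from this page.

```python
import os
from urllib.parse import urlparse

# Confidence tiers as listed in the source-kind table.
CONFIDENCE = {"url": 0.7, "llms_txt": 0.9, "file": 0.8, "stdin": 0.8}


def classify_source(source: str, llms_probe_ok: bool = False) -> str:
    """Return the source kind for a raw source argument (illustrative only)."""
    if source == "-":
        return "stdin"
    if urlparse(source).scheme in ("http", "https"):
        # A successful llms-full.txt probe upgrades a URL to the llms_txt kind.
        return "llms_txt" if llms_probe_ok else "url"
    if os.path.isfile(source):
        return "file"
    raise ValueError(f"unrecognized source: {source!r}")
```

For example, classify_source("https://example.com/docs") yields "url", and CONFIDENCE["url"] gives the 0.7 tier.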

llms-full.txt probe

When the source is a URL, the pipeline first probes for a conventional llms-full.txt endpoint at the base of the URL. If the probe returns non-empty content, the pipeline ingests that content directly (with source kind llms_txt and the higher 0.9 confidence tier) instead of fetching and stripping the original page. This provides cleaner, LLM-optimized content when the site publishes it.
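
The probe logic can be sketched as below. This page does not spell out the exact probe URL, so the sketch assumes "the base of the URL" means the site origin (scheme and host) with /llms-full.txt appended, and it treats whitespace-only content as empty. Both function names are hypothetical, and the actual fetch (WebFetch in the orchestration layer) is left out.

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit


def llms_full_probe_url(url: str) -> str:
    """Derive the probe URL, assuming the endpoint sits at the site origin."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    # Drop the original path, query, and fragment; keep scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/llms-full.txt", "", ""))


def choose_content(probe_body: str, page_body: str) -> Tuple[str, str]:
    """Prefer non-empty probe content; otherwise fall back to the stripped page."""
    if probe_body and probe_body.strip():
        return "llms_txt", probe_body
    return "url", page_body
```

Under these assumptions, a source of https://example.com/docs/api would be probed at https://example.com/llms-full.txt.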

What it does step by step

  1. Classify the source. Determines whether the input is a URL, a local file, or stdin. For URLs, probes for a conventional llms-full.txt endpoint and uses it when available (cleaner, LLM-optimized content).
  2. Fetch the content. Reads the file directly, reads stdin, or (for URLs) delegates to WebFetch in the orchestration layer. A 30-second fetch timeout and 10 MB size cap are enforced.
  3. Strip HTML. For URL sources, removes HTML tags, decodes common entities, and strips script/style/head content to produce clean markdown. File and stdin sources pass through unchanged.
  4. Write the ingested file. Writes the content under .gaia/knowledge/ingested/<slug>.md with exactly 11 provenance frontmatter fields. The write is atomic via a sibling temporary file and rename. If a file with the same slug already exists, it is replaced.
  5. Register the brain-index entry. Appends (or replaces) an ingested entry in brain-index.yaml with a populated trust block carrying the content hash, source URL, timestamps, and a confidence score tiered by source kind. The index is validated against its schema before the rename; on failure the prior index is preserved.
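
Step 3 can be approximated with Python's standard html.parser module. The sketch below is a simplification under stated assumptions: it produces plain text rather than WebFetch-rendered markdown, the set of tags treated as line breaks is an arbitrary choice, and the names _TextExtractor and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and head content."""

    SKIPPED = {"script", "style", "head"}
    # Assumed set of block-level tags that should start a new line.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def strip_html(html):
    """Return the visible text of an HTML document (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

File and stdin sources would bypass this function entirely, matching the pass-through behavior described in step 3.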

Same-slug overwrite behavior

Re-running /gaia-feed with the same --slug (or with a source that auto-infers the same slug) replaces the existing entry cleanly. Both the ingested file and the brain-index entry are overwritten atomically -- the old content is not duplicated or versioned.

This is the supported way to update an ingested source to a new version. The provenance frontmatter is refreshed with the new fetch timestamp, content hash, and expiry.

If you want a different version to coexist alongside the existing one (rather than replace it), use a different --slug for each version.
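
The replace-versus-coexist behavior amounts to an upsert keyed by slug. The sketch below assumes, purely for illustration, that ingested entries can be modeled as a list of dictionaries with a "slug" key; the real layout of brain-index.yaml is not described on this page, and upsert_entry is a hypothetical name.

```python
from typing import Dict, List


def upsert_entry(entries: List[Dict], new_entry: Dict) -> List[Dict]:
    """Return a new list where new_entry replaces any entry with the same slug."""
    # Entries with a different slug are kept, so distinct slugs coexist.
    kept = [entry for entry in entries if entry.get("slug") != new_entry["slug"]]
    return kept + [new_entry]
```

Calling it twice with the same slug leaves one entry carrying the newer provenance, while a different slug adds a second entry alongside the first.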

Provenance frontmatter

Every ingested file carries exactly 11 frontmatter fields:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Document title, auto-inferred from the first heading or filename. |
| slug | string | URL-safe identifier (auto-inferred or explicit via --slug). |
| ingest_source_kind | enum | One of url, file, llms_txt, stdin. |
| source_url | string or null | Origin path or URL; null for stdin. |
| fetched_at | ISO 8601 | UTC timestamp of the fetch. |
| expires_at | ISO 8601 | fetched_at + ttl_days. After this time the entry is considered stale if it has not been successfully refreshed. |
| content_hash | string | sha256 of the post-strip markdown body. |
| ttl_days | integer | Time-to-live in days (default 30). |
| token_estimate | integer | Rough token count derived from word count. |
| tags | list | Auto-inferred tags (source kind, domain signals), or explicit via --tags. |
| status | enum | One of current, stale, failed. New ingestions start as current. |
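
The derived fields (expires_at, content_hash, token_estimate) can be computed as in the sketch below. The field names and the expiry formula come from the table above; the function name build_provenance is hypothetical, and the words-to-tokens ratio is an assumed placeholder, since this page only says the estimate is derived from word count.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Assumed ratio for illustration; the real conversion factor is not documented here.
TOKENS_PER_WORD = 1.3


def build_provenance(title, slug, kind, source_url, body,
                     ttl_days=30, tags=(), fetched_at=None):
    """Assemble the 11 provenance fields for an ingested body (illustrative only)."""
    fetched = fetched_at or datetime.now(timezone.utc)
    return {
        "title": title,
        "slug": slug,
        "ingest_source_kind": kind,
        "source_url": source_url,  # None for stdin
        "fetched_at": fetched.isoformat(),
        # expires_at = fetched_at + ttl_days
        "expires_at": (fetched + timedelta(days=ttl_days)).isoformat(),
        # sha256 over the post-strip body, encoded as UTF-8
        "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "ttl_days": ttl_days,
        "token_estimate": round(len(body.split()) * TOKENS_PER_WORD),
        "tags": list(tags),
        "status": "current",  # new ingestions start as current
    }
```

With the default 30-day TTL, a document fetched at 2025-01-01T00:00:00+00:00 would expire at 2025-01-31T00:00:00+00:00.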

Security controls

The ingestion pipeline enforces three layers of protection:

  • SSRF pre-check. Before any network read, the safe-fetch guard resolves the host and rejects URLs pointing to private (RFC 1918), link-local, loopback, carrier-grade NAT (RFC 6598), or cloud-metadata addresses. Only http and https schemes are permitted.
  • Size cap and fetch timeout. Fetched content is capped at 10 MB; a 30-second fetch timeout prevents resource exhaustion.
  • Slug write-boundary containment. The slug is sanitised (path separators and traversal sequences are stripped) and a realpath containment check verifies the resolved write path is a child of .gaia/knowledge/ingested/ before any file is created.
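
The SSRF pre-check and the slug containment check can be sketched with the standard library as follows. This is an illustration under assumptions, not the safe-fetch guard itself: the function names are hypothetical, the slug character set (lowercase letters, digits, hyphens) is an assumed sanitising rule, and cloud-metadata addresses are covered here only through the link-local range (for example 169.254.169.254).

```python
import ipaddress
import re
import socket
from pathlib import Path
from urllib.parse import urlsplit

# RFC 6598 shared address space (carrier-grade NAT); not flagged by is_private.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked_ip(raw: str) -> bool:
    """True for private, loopback, link-local, or carrier-grade NAT addresses."""
    ip = ipaddress.ip_address(raw.split("%")[0])  # drop any IPv6 scope id
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip in CGNAT


def check_url(url: str) -> None:
    """Resolve the host and reject non-http(s) schemes or blocked addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only http and https URLs are permitted")
    # Vet every address the host resolves to before any network read.
    for info in socket.getaddrinfo(parts.hostname, None):
        if is_blocked_ip(info[4][0]):
            raise ValueError(f"blocked address for host {parts.hostname!r}")


def safe_target(root: Path, slug: str) -> Path:
    """Sanitise the slug and confirm the write path stays inside root."""
    # Assumed rule: anything outside a-z, 0-9, and hyphen collapses to a hyphen.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", slug.lower()).strip("-")
    if not cleaned:
        raise ValueError("slug is empty after sanitising")
    root_real = root.resolve()
    target = (root_real / f"{cleaned}.md").resolve()
    # Containment check on the fully resolved path (symlinks followed).
    if root_real not in target.parents:
        raise ValueError("resolved path escapes the ingested directory")
    return target
```

Resolving the path before comparing it means a symlink planted inside the ingested directory cannot redirect the write elsewhere.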

Outputs

| Output | Location | Description |
| --- | --- | --- |
| Ingested file | .gaia/knowledge/ingested/<slug>.md | The ingested document with provenance frontmatter. |
| Brain index entry | .gaia/knowledge/brain-index.yaml | An ingested entry with a trust block carrying content hash, source URL, timestamps, and confidence. |

What to run next

  • /gaia-brain-query -- query the brain to see the ingested document alongside project artifacts.
  • /gaia-knowledge-refresh -- re-fetch all ingested sources and update any that have changed upstream.
  • /gaia-brain-reindex -- the reindex sweep preserves ingested entries; run it any time to refresh project-artifact entries without losing ingested content.

Troubleshooting

The slug already exists

This is expected behavior. If a file with the same slug already exists, the pipeline overwrites it atomically. The brain-index entry is replaced with fresh provenance. See Same-slug overwrite behavior.

Brain-index validation failed

The pipeline validates the index before committing. On failure, the prior index is preserved. Check the error message for schema violations and ensure the index is well-formed.

URL fetch failed

URL fetching is delegated to WebFetch in the orchestration layer. Ensure the URL is reachable and returns content. Paywalled, SPA-rendered, and authenticated sources are out of scope.

How do I update an ingested source?

Re-run /gaia-feed with the same --slug. The existing entry is overwritten cleanly. See Same-slug overwrite behavior.

How do I remove an ingested source?

Use /gaia-unfeed <slug>. It deletes the ingested file and de-registers the index entry atomically.