aphant.org

Methods

This document describes how the analyses under reddit/ were produced. It is intentionally specific so a reader can re-run any step or audit any citation.

Corpus

Chunking

Thread-aware, four kinds (spec §4.3):

Kind Primary Context included
post post top-3 scored top-level comment bodies
top_comment top-level comment parent post title + top child reply
deeper_reply non-top-level comment parent comment body
orphan_comment comment whose link_id is not in posts none

Token budget per chunk: 512 (truncated by dropping lowest-priority context first). All chunks indexed: 381,592 total (19,962 post + 179,367 top_comment + 182,184 deeper_reply + 79 orphan_comment).

Chunk IDs are deterministic SHA-1 truncated to 20 hex chars over (chunk_version, chunk_kind, primary_kind, primary_id, sorted role-tagged secondary refs, content_hash).

Embeddings & retrieval

Theme tagging

The taxonomy at pipeline/config/taxonomy.yaml defines 15 top-level themes broken into 33 sub-themes. For each sub-theme, each seed query is run through hybrid retrieval at k=200, results are unioned, and the highest-scoring (chunk, sub-theme) pair wins. No per-chunk LLM classification — retrieval ranking is the tag. The working table at reddit/data/theme_tags.parquet is the source of truth (15,368 rows); markdown is regenerated from it.

Per-theme writing

For each sub-theme, an autonomous agent received a self-contained JSON bundle: the sub-theme metadata, seed queries, year counts, and the top 30 retrieved chunks (full content + permalinks). The agent picked 5-14 representative excerpts, grouped them into 2-4 sub-patterns, wrote the markdown, and ran the linter to verify every citation. Files only land here once aphantasia-lint returns exit 0.

Citation policy

Every Reddit-sourced excerpt across reddit/ cites with:

[(YEAR, FULLNAME, chunk CHUNK_ID)](PERMALINK)

Where FULLNAME is t3_<post_id> or t1_<comment_id>. The output linter (aphantasia-lint) verifies for every citation:

  1. chunk_id exists in chunks.
  2. fullname appears in that chunk's chunk_sources.source_fullname.
  3. permalink matches the same row's permalink.
  4. year matches the year derived from the chunk source's created_utc.
  5. The quoted excerpt (preceding blockquote) is a whitespace/punctuation-normalised substring of chunks.content.

Any failure rejects the file. The full reddit/ tree passed linting before commit.

Limitations

Reproducibility

cd /data/space/aphantasia/pipeline
uv sync --extra dev
uv run aphantasia-doctor --db data/corpus.db
uv run aphantasia-lint /data/space/aphantasia/reddit/ --db data/corpus.db
uv run pytest -v -m "not slow"

Most recent doctor run before this commit: green; all 24 checks pass. Most recent lint run: clean across all generated markdown.