Methods

This document describes how the analyses under reddit/ were produced. It is intentionally specific so a reader can re-run any step or audit any citation.

Corpus

Source: r/Aphantasia subreddit dump as JSONL (r_aphantasia_posts.jsonl, r_aphantasia_comments.jsonl) downloaded by the user; not re-scraped.
Date span: 2015-08-31 → 2026-05-06.
Row counts: 23,884 posts, 392,558 comments.
Deletion handling: rows where selftext / body is [deleted], [removed], or empty are kept in the relational tables (with is_deleted / is_removed / is_empty_* flags) but excluded from semantic chunks. Counts: 2,128 deleted posts, 1,794 removed posts, 2,841 empty selftexts; 9,936 deleted comments, 2,339 removed comments.
130 comments reference posts not in the dump ("orphans"); they surface as orphan_comment chunks if they meet the 30-token floor.
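
The deletion handling above can be sketched as a small flagging helper. This is a minimal illustration, not the pipeline's actual code: the function names and the single `is_empty` key (standing in for the `is_empty_*` columns) are assumptions.

```python
from typing import Optional

DELETED_MARKER = "[deleted]"
REMOVED_MARKER = "[removed]"


def deletion_flags(text: Optional[str]) -> dict:
    """Flag a selftext/body as deleted, removed, or empty (hypothetical helper)."""
    stripped = (text or "").strip()
    return {
        "is_deleted": stripped == DELETED_MARKER,
        "is_removed": stripped == REMOVED_MARKER,
        "is_empty": stripped == "",
    }


def eligible_for_chunking(text: Optional[str]) -> bool:
    """Rows stay in the relational tables either way; only unflagged text is chunked."""
    return not any(deletion_flags(text).values())
```

Keeping the flags separate from the eligibility check mirrors the split described above: the row is always stored, and only the semantic chunker consults the flags.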

Chunking

Chunking is thread-aware and produces four kinds of chunk (spec §4.3):

Kind	Primary	Context included
`post`	post	top-3 scored top-level comment bodies
`top_comment`	top-level comment	parent post title + top child reply
`deeper_reply`	non-top-level comment	parent comment body
`orphan_comment`	comment whose `link_id` is not in posts	none
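
As a rough sketch of how a comment could be routed to one of the three comment kinds in the table: the function name is hypothetical, and it assumes `link_id` and `parent_id` are stored as Reddit fullnames (`t3_...` for posts, `t1_...` for comments) and that the orphan check takes precedence.

```python
def comment_chunk_kind(link_id: str, parent_id: str, known_post_fullnames: set) -> str:
    """Pick the chunk kind for a comment (illustrative, not the pipeline's code)."""
    # Assumption: a comment whose post is missing is an orphan regardless of depth.
    if link_id not in known_post_fullnames:
        return "orphan_comment"
    # A top-level comment's parent is the post itself.
    if parent_id == link_id:
        return "top_comment"
    # Anything else replies to another comment.
    return "deeper_reply"
```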

Token budget per chunk: 512 tokens; over-budget chunks are truncated by dropping the lowest-priority context first. All chunks indexed: 381,592 total (19,962 post + 179,367 top_comment + 182,184 deeper_reply + 79 orphan_comment).
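
The budget rule can be illustrated as follows. This is a sketch under stated assumptions: the whitespace tokenizer is a stand-in (the pipeline's real tokenizer is not described here), `assemble_chunk` is a hypothetical name, and what happens when the primary text alone exceeds the budget is left out.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def assemble_chunk(primary: str, context: list, budget: int = 512) -> str:
    """Join primary text with context, dropping lowest-priority context to fit.

    `context` is ordered from highest to lowest priority.
    """
    kept = list(context)
    used = count_tokens(primary) + sum(count_tokens(c) for c in kept)
    while kept and used > budget:
        # Drop the lowest-priority piece (the last one) and reclaim its tokens.
        used -= count_tokens(kept.pop())
    # Note: a primary text longer than the budget is returned as-is in this sketch.
    return "\n\n".join([primary, *kept])
```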

Chunk IDs are deterministic SHA-1 truncated to 20 hex chars over (chunk_version, chunk_kind, primary_kind, primary_id, sorted role-tagged secondary refs, content_hash).
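
One way such an ID could be derived is shown below. The field order follows the tuple above, but the JSON serialisation, the `role:ref` tagging format, and the function name are assumptions; IDs produced by this sketch will not necessarily match those in the database.

```python
import hashlib
import json


def make_chunk_id(chunk_version, chunk_kind, primary_kind, primary_id,
                  secondary_refs, content_hash) -> str:
    """Deterministic 20-hex-char ID (illustrative serialisation, not the real one).

    `secondary_refs` is an iterable of (role, ref) pairs, e.g. ("parent", "t1_abc").
    """
    payload = json.dumps(
        [
            chunk_version,
            chunk_kind,
            primary_kind,
            primary_id,
            # Sorting makes the ID independent of the order refs were collected in.
            sorted(f"{role}:{ref}" for role, ref in secondary_refs),
            content_hash,
        ],
        separators=(",", ":"),
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:20]
```

Including `content_hash` in the input means an edit to the chunk text yields a new ID, so stale citations fail lookup instead of silently pointing at changed content.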

Embeddings & retrieval

Embedding model: ollama:nomic-embed-text, 768-dim. The substrate was originally specified to use intfloat/e5-small-v2 (384-dim) on local CPU, but the pre-build benchmark projected ~11 hours to embed the full corpus. The substrate builder therefore switched to a network-local Ollama endpoint at 10.10.10.44:11434 (recorded in embedding_metadata.device), which finished in minutes. The switch is consistent with the "no API keys, no inference server stood up alongside the pipeline" rule, because the Ollama endpoint already existed on the local network. Queries reuse the same endpoint.
Vector store: sqlite-vec virtual table with cosine similarity.
Lexical store: SQLite FTS5 with the porter unicode61 remove_diacritics 2 tokenizer.
Hybrid retrieval: reciprocal rank fusion (k=60) of BM25 + vector rankings.
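
Reciprocal rank fusion scores each chunk as the sum of 1 / (k + rank) over the rankings it appears in. The sketch below shows the technique with k=60; the function name and the tie-breaking by chunk ID are assumptions, not details taken from the pipeline.

```python
def rrf_fuse(rankings, k: int = 60):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each ranking is ordered best-first; ranks start at 1.
    Returns (chunk_id, score) pairs sorted by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Tie-break on chunk ID so the output is deterministic (assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["a", "b", "c"]    # toy lexical ranking
vector_ranking = ["b", "c", "d"]  # toy vector ranking
fused = rrf_fuse([bm25_ranking, vector_ranking])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'c', 'a', 'd']
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale.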

Theme tagging

The taxonomy at pipeline/config/taxonomy.yaml defines 15 top-level themes broken into 33 sub-themes. For each sub-theme, each seed query is run through hybrid retrieval at k=200, results are unioned, and the highest-scoring (chunk, sub-theme) pair wins. There is no per-chunk LLM classification; the retrieval ranking is the tag. The working table at reddit/data/theme_tags.parquet is the source of truth (15,368 rows); markdown is regenerated from it.
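
The tagging loop might look like the sketch below. It reflects one reading of "the highest-scoring pair wins", namely that each chunk keeps the sub-theme under which it scored highest. The `retrieve` callable, its return shape, and the function name are hypothetical.

```python
def tag_chunks(seed_queries_by_subtheme: dict, retrieve, k: int = 200) -> dict:
    """Assign each retrieved chunk to its best-scoring sub-theme.

    `retrieve(query, k)` is assumed to return (chunk_id, fused_score) pairs.
    Returns {chunk_id: (sub_theme, score)}.
    """
    best = {}
    for sub_theme, queries in seed_queries_by_subtheme.items():
        for query in queries:
            for chunk_id, score in retrieve(query, k):
                current = best.get(chunk_id)
                # Union across queries; keep only the highest-scoring assignment.
                if current is None or score > current[1]:
                    best[chunk_id] = (sub_theme, score)
    return best
```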

Per-theme writing

For each sub-theme, an autonomous agent received a self-contained JSON bundle: the sub-theme metadata, seed queries, year counts, and the top 30 retrieved chunks (full content + permalinks). The agent picked 5-14 representative excerpts, grouped them into 2-4 sub-patterns, wrote the markdown, and ran the linter to verify every citation. Files only land here once aphantasia-lint returns exit 0.

Citation policy

Every Reddit-sourced excerpt across reddit/ is cited in this form:

[(YEAR, FULLNAME, chunk CHUNK_ID)](PERMALINK)

Here FULLNAME is t3_<post_id> for a post or t1_<comment_id> for a comment. The output linter (aphantasia-lint) verifies, for every citation, that:

chunk_id exists in chunks.
fullname appears in that chunk's chunk_sources.source_fullname.
permalink matches the same row's permalink.
year matches the year derived from the chunk source's created_utc.
The quoted excerpt (preceding blockquote) is a whitespace/punctuation-normalised substring of chunks.content.

Any failure rejects the file. The full reddit/ tree passed linting before commit.
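
The parsing and text-matching parts of these checks can be sketched without the database. This is not the code of aphantasia-lint: the regular expression, the helper names, and the exact normalisation (strip punctuation, collapse whitespace, keep case) are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone

# Matches [(YEAR, FULLNAME, chunk CHUNK_ID)](PERMALINK) with a 20-hex-char chunk ID.
CITATION_RE = re.compile(
    r"\[\((?P<year>\d{4}), (?P<fullname>t[13]_[0-9a-z]+), "
    r"chunk (?P<chunk_id>[0-9a-f]{20})\)\]\((?P<permalink>[^)\s]+)\)"
)


def normalise(text: str) -> str:
    """Drop punctuation and collapse whitespace (assumed normalisation)."""
    without_punct = re.sub(r"[^\w\s]", "", text)
    return " ".join(without_punct.split())


def excerpt_matches(excerpt: str, chunk_content: str) -> bool:
    """True if the quoted excerpt is a normalised substring of the chunk content."""
    return normalise(excerpt) in normalise(chunk_content)


def year_of(created_utc: int) -> int:
    """Year of a Unix timestamp, interpreted in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year
```

A linter built on these helpers would still need the lookups against chunks and chunk_sources described above; those are omitted here because their schemas are not spelled out in this document.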

Limitations

Self-selection. This is one English-language subreddit; users who joined had already discovered they have aphantasia (or are curious about it). Generalisations to the wider population should be made cautiously.
Deletion gaps. ~12,000 comments and ~4,000 posts have content removed or deleted. Their metadata is preserved but their voice is lost.
Single sub-community. Aphantasia discussions also live on Twitter/X, Discord, the Aphantasia Network forum, and elsewhere; this corpus does not capture them.
No demographics. Authors are not surveyed for age, gender, geography, or co-occurring conditions, so any patterns reported are associational at best.
Retrieval limits. Embedding-based retrieval surfaces what's semantically near a query; rare or unusually-phrased experiences may not appear in any of the seeded query buckets.
Author anonymity. Per the spec's privacy default, Reddit usernames are recorded in the relational tables for audit but never foregrounded in the analyses. Quotes are attributed by Reddit fullname (e.g. t1_xxx) and chunk_id.

Reproducibility

cd /data/space/aphantasia/pipeline
uv sync --extra dev
uv run aphantasia-doctor --db data/corpus.db
uv run aphantasia-lint /data/space/aphantasia/reddit/ --db data/corpus.db
uv run pytest -v -m "not slow"

Most recent doctor run before this commit: green; all 24 checks pass. Most recent lint run: clean across all generated markdown.