Methods
This document describes how the analyses under reddit/ were
produced. It is intentionally specific so a reader can re-run any
step or audit any citation.
Corpus
- Source: r/Aphantasia subreddit dump as JSONL (`r_aphantasia_posts.jsonl`, `r_aphantasia_comments.jsonl`) downloaded by the user; not re-scraped.
- Date span: 2015-08-31 → 2026-05-06.
- Row counts: 23,884 posts, 392,558 comments.
- Deletion handling: rows where `selftext`/`body` is `[deleted]`, `[removed]`, or empty are kept in the relational tables (with `is_deleted`/`is_removed`/`is_empty_*` flags) but excluded from semantic chunks. Counts: 2,128 deleted posts, 1,794 removed posts, 2,841 empty selftexts; 9,936 deleted comments, 2,339 removed comments.
- 130 comments reference posts not in the dump ("orphans"); they surface as `orphan_comment` chunks if they meet the 30-token floor.
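A minimal sketch of the flagging rule described above, assuming the sentinel strings are exactly `[deleted]` and `[removed]` and that whitespace-only text counts as empty. The helper name and the `is_empty` and `chunkable` keys are illustrative; the real tables use `is_empty_*` columns whose exact names are not listed here.

```python
def deletion_flags(text):
    """Classify a post selftext or comment body (hypothetical helper)."""
    stripped = (text or "").strip()
    flags = {
        "is_deleted": stripped == "[deleted]",
        "is_removed": stripped == "[removed]",
        "is_empty": stripped == "",  # stands in for the is_empty_* columns
    }
    # Flagged rows stay in the relational tables but are skipped for chunking.
    flags["chunkable"] = not any(flags.values())
    return flags
```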
Chunking
Thread-aware, four kinds (spec §4.3):
| Kind | Primary | Context included |
|---|---|---|
| `post` | post | top-3 scored top-level comment bodies |
| `top_comment` | top-level comment | parent post title + top child reply |
| `deeper_reply` | non-top-level comment | parent comment body |
| `orphan_comment` | comment whose `link_id` is not in `posts` | none |
Token budget per chunk: 512 (truncated by dropping lowest-priority context first). All chunks indexed: 381,592 total (19,962 post + 179,367 top_comment + 182,184 deeper_reply + 79 orphan_comment).
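The truncation rule can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: whitespace-separated words stand in for the real tokenizer, `context` is assumed to be ordered highest priority first, and hard-truncating an over-long primary text is an assumed fallback. The function name is hypothetical.

```python
def fit_to_budget(primary, context, budget=512):
    """Assemble a chunk within a token budget (hypothetical helper).

    `context` is ordered highest priority first (an assumption); whitespace
    tokens stand in for the real tokenizer.
    """
    def n_tokens(text):
        return len(text.split())

    kept = list(context)
    # Drop the lowest-priority context pieces until the chunk fits.
    while kept and n_tokens(primary) + sum(n_tokens(c) for c in kept) > budget:
        kept.pop()
    # If the primary alone is over budget, hard-truncate it (assumed fallback).
    if n_tokens(primary) > budget:
        primary = " ".join(primary.split()[:budget])
    return primary, kept
```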
Chunk IDs are deterministic SHA-1 truncated to 20 hex chars over
(chunk_version, chunk_kind, primary_kind, primary_id,
sorted role-tagged secondary refs, content_hash).
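As a rough illustration of how such an ID can be derived: the sketch below hashes the six fields with SHA-1 and keeps the first 20 hex characters. The JSON serialisation, the `(role, fullname)` tuple shape for secondary refs, and the function name are assumptions; the actual byte layout used by the pipeline is not specified here.

```python
import hashlib
import json

def chunk_id(chunk_version, chunk_kind, primary_kind, primary_id,
             secondary_refs, content_hash):
    """Deterministic 20-hex-char ID; the JSON serialisation is an assumption."""
    payload = json.dumps(
        [chunk_version, chunk_kind, primary_kind, primary_id,
         sorted(secondary_refs),  # (role, fullname) pairs, order-independent
         content_hash],
        separators=(",", ":"),
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:20]
```

Sorting the secondary refs before hashing means the same chunk gets the same ID regardless of the order in which its context was collected, while any change to the content hash or chunk version yields a new ID.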
Embeddings & retrieval
- Embedding model: `ollama:nomic-embed-text`, 768-dim. Originally the substrate was specified to use `intfloat/e5-small-v2` (384-dim) on local CPU; the pre-build benchmark projected ~11 hours to embed the full corpus. The substrate builder switched to a network-local Ollama endpoint at `10.10.10.44:11434` (recorded in `embedding_metadata.device`), which finished in minutes. The switch is consistent with the "no API keys, no inference server stood up alongside the pipeline" rule: the Ollama endpoint pre-existed on the local network. Querying re-uses the same endpoint.
- Vector store: `sqlite-vec` virtual table with cosine similarity.
- Lexical store: SQLite FTS5 with the `porter unicode61 remove_diacritics 2` tokenizer.
- Hybrid retrieval: reciprocal rank fusion (k=60) of BM25 + vector rankings.
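For readers unfamiliar with reciprocal rank fusion, here is a minimal sketch of the standard formula, where each ranked list contributes `1 / (k + rank)` to an item's fused score. The function name and the tie-breaking by ID are illustrative choices, not taken from the pipeline.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale.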
Theme tagging
The taxonomy at pipeline/config/taxonomy.yaml defines 15
top-level themes broken into 33 sub-themes. For each sub-theme,
each seed query is run through hybrid retrieval at k=200, results
are unioned, and the highest-scoring (chunk, sub-theme) pair wins.
No per-chunk LLM classification — retrieval ranking is the
tag. The working table at reddit/data/theme_tags.parquet is the
source of truth (15,368 rows); markdown is regenerated from it.
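One possible reading of the tagging rule, sketched below: results from every seed query are pooled, and each chunk keeps only the sub-theme under which it scored highest. Treating "wins" as one sub-theme per chunk is an interpretation, and `retrieve` is a hypothetical callable standing in for the hybrid retriever.

```python
def tag_chunks(seed_queries, retrieve, k=200):
    """Assign each retrieved chunk to its best-scoring sub-theme.

    seed_queries: {sub_theme: [query, ...]}
    retrieve(query, k) -> [(chunk_id, score), ...]  (hypothetical callable)
    """
    best = {}  # chunk_id -> (sub_theme, score)
    for sub_theme, queries in seed_queries.items():
        for query in queries:
            for chunk_id, score in retrieve(query, k):
                # Union across queries; keep only the strongest pairing.
                if chunk_id not in best or score > best[chunk_id][1]:
                    best[chunk_id] = (sub_theme, score)
    return best
```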
Per-theme writing
For each sub-theme, an autonomous agent received a self-contained
JSON bundle: the sub-theme metadata, seed queries, year counts,
and the top 30 retrieved chunks (full content + permalinks). The
agent picked 5-14 representative excerpts, grouped them into 2-4
sub-patterns, wrote the markdown, and ran the linter to verify
every citation. Files only land here once aphantasia-lint returns
exit 0.
Citation policy
Every Reddit-sourced excerpt across reddit/ cites with:
[(YEAR, FULLNAME, chunk CHUNK_ID)](PERMALINK)
Where FULLNAME is `t3_<post_id>` or `t1_<comment_id>`. The output linter (`aphantasia-lint`) verifies for every citation:
- `chunk_id` exists in `chunks`.
- `fullname` appears in that chunk's `chunk_sources.source_fullname`.
- `permalink` matches the same row's `permalink`.
- `year` matches the year derived from the chunk source's `created_utc`.
- The quoted excerpt (preceding blockquote) is a whitespace/punctuation-normalised substring of `chunks.content`.
Any failure rejects the file. The full reddit/ tree passed
linting before commit.
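To make the citation checks concrete, the sketch below parses the citation format and tests the normalised-substring condition. It is not the `aphantasia-lint` implementation: the regular expression, the base-36 assumption for Reddit IDs, and the normalisation rules (strip ASCII punctuation, collapse whitespace) are all assumptions.

```python
import re
import string

# Matches [(YEAR, FULLNAME, chunk CHUNK_ID)](PERMALINK); pattern is illustrative.
CITATION = re.compile(
    r"\[\((?P<year>\d{4}), (?P<fullname>t[13]_[0-9a-z]+), "
    r"chunk (?P<chunk_id>[0-9a-f]{20})\)\]\((?P<permalink>[^)\s]+)\)"
)

def normalise(text):
    """Drop ASCII punctuation and collapse whitespace (assumed rules)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

def excerpt_matches(excerpt, content):
    """True if the normalised excerpt occurs inside the normalised content."""
    return normalise(excerpt) in normalise(content)
```

The remaining checks (chunk existence, fullname, permalink, and year) would be lookups against the `chunks` and `chunk_sources` tables using the parsed fields.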
Limitations
- Self-selection. This is one English-language subreddit; users who joined had already discovered they have aphantasia (or are curious about it). Generalisations to the wider population should be made cautiously.
- Deletion gaps. ~12,000 comments and ~4,000 posts have content removed or deleted. Their metadata is preserved but their voice is lost.
- Single sub-community. Aphantasia discussions also live on Twitter/X, Discord, the Aphantasia Network forum, and elsewhere; this corpus does not capture them.
- No demographics. Authors are not surveyed for age, gender, geography, or co-occurring conditions; correlations are associational at best.
- Retrieval limits. Embedding-based retrieval surfaces what's semantically near a query; rare or unusually-phrased experiences may not appear in any of the seeded query buckets.
- Author anonymity. Per the spec's privacy default, Reddit usernames are recorded in the relational tables for audit but never foregrounded in the analyses. Quotes are attributed by Reddit fullname (e.g. `t1_xxx`) and `chunk_id`.
Reproducibility
cd /data/space/aphantasia/pipeline
uv sync --extra dev
uv run aphantasia-doctor --db data/corpus.db
uv run aphantasia-lint /data/space/aphantasia/reddit/ --db data/corpus.db
uv run pytest -v -m "not slow"
Most recent doctor run before this commit: green; all 24 checks pass. Most recent lint run: clean across all generated markdown.