I analysed 20 years of my chats

The author, Vadim Drobinin, was inspired by Tim Urban's "Your Life in Weeks" to fill the empty squares of his life's timeline with meaningful data beyond mere events. Dissatisfied with traditional journaling, he embarked on a "personal CRM" project, leveraging 20 years of his digital communication history to gain objective insights into his relationships and personal patterns, all to answer the question: "Am I a bad friend?"

Data Collection & Challenges: The project began with archiving messages from a wide array of platforms, including ICQ, IRC, VK, Twitter, Facebook, Instagram, and Telegram. Parsing these diverse data sources presented significant technical hurdles, such as platform-specific encodings, differing internal message IDs across exports, and varying export structures that required extensive cleaning and normalization into a uniform format.
Noise Reduction: A substantial portion of the raw chat data was identified as conversational noise (e.g., emojis, links, short fillers like "hahaha"). The author developed a systematic approach using frequency counting of short tokens and manual review to create denylists, while simultaneously protecting meaningful short messages (e.g., "he died").
Identity Resolution & Classification: A core technical challenge was mapping each person's identity across platforms while accounting for nicknames and diminutives (for example, "Sasha" can refer to different people). For classifying messages into meaningful categories (e.g., life events, banter, emotional temperature), traditional methods such as keyword matching and BERT models proved insufficient because of their high false-positive rates.
LLM Application: Large Language Models were ultimately employed for both name resolution and message classification, achieving a low false-positive rate (under 1%) when processing messages in smaller chunks. The LLM's output, a structured JSON manifest, was then deterministically processed to populate the author's personal knowledge vault, ensuring traceability to original messages.
Prompt Engineering & Validation: The LLM's prompt evolved significantly through iterative refinement to prevent issues like confabulation, with specific rules added (e.g., requiring explicit first-person markers for life event classification). A "closure gate" validation script and manual sampling of outputs were crucial for maintaining data quality.
Directional Sentiment Analysis: Recognizing the limitations of standard sentiment analysis for conversational data, the author developed a directional approach using 18 tags and prefixes to capture asymmetric emotional states between speakers. This method allowed for understanding a relationship's emotional baseline and detecting significant departures from it.
Key Discoveries: The analysis yielded numerous personal insights, including how vocabulary overlap indicates relationship dynamics, how message volume and length change over time, and how question rates inversely correlate with relationship bandwidth in thinning friendships. He also observed the lifecycle of endearments in his partner chat, the significant impact of geographical moves on his social network, and how his total "conversation-days" remained constant despite a shrinking contact list. The data also revealed that his self-perception as "the supportive friend" was equally balanced by his tendency to be "the advice friend," highlighting a reflex to explain rather than listen.
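
The noise-reduction step described above (frequency counting of short tokens, a manually reviewed denylist, and protection for meaningful short messages) might look roughly like the following Python sketch. It is an illustration only, not the author's code: the `MAX_TOKENS` threshold, the `PROTECTED` set, the function names, and the sample messages are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical values: the article does not publish its thresholds or lists.
MAX_TOKENS = 3                # messages this short are candidates for noise
PROTECTED = {"he died"}       # short but meaningful, never treated as noise


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so variants count as one message."""
    return " ".join(text.lower().split())


def short_message_counts(messages, max_tokens=MAX_TOKENS):
    """Count how often each short, unprotected message occurs."""
    counts = Counter()
    for text in messages:
        norm = normalise(text)
        if norm and len(norm.split()) <= max_tokens and norm not in PROTECTED:
            counts[norm] += 1
    return counts


def drop_noise(messages, denylist):
    """Keep protected messages and anything not on the denylist."""
    kept = []
    for text in messages:
        norm = normalise(text)
        if norm in PROTECTED or norm not in denylist:
            kept.append(text)
    return kept


messages = [
    "hahaha", "Hahaha", "ok", "he died", "ok",
    "I got the job in London next month",
]

# Step 1: surface frequent short messages for manual review.
candidates = short_message_counts(messages)

# Step 2: a person reviews the candidates and picks the denylist by hand.
denylist = {"hahaha", "ok"}

# Step 3: filter, with protected messages surviving regardless.
kept = drop_noise(messages, denylist)
```

Keeping the review step manual mirrors the summary's description: the counter only proposes candidates, and the protected set is checked before the denylist so that a short message such as "he died" can never be filtered out by accident.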

Ultimately, the project became a profound exercise in self-discovery, revealing subtle shifts in his relationships and challenging his self-perceptions. While not changing his immediate communication habits, the objective data provided a comprehensive understanding of his social patterns, enriching his personal timeline with the nuanced human connections that traditional metrics often miss.
