
Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions

This post describes a fine-tuning strategy for a tiny local LLM (Qwen 3:0.6B) that categorizes household questions with 92% accuracy. It shows how small models, when constrained to a fixed label set and trained to emit opaque output IDs, can be effective and practical for specific RAG pre-processing tasks. The author's progression from a 10% baseline to 92% accuracy will interest engineers looking for efficient, local AI solutions.

Score: 74
Comments: 15
Highest Rank: #9
Time on Front Page: 16h
First Seen: Jun 22, 1:00 AM
Last Seen: Jun 22, 4:00 PM
[Chart: Rank Over Time]

The Lowdown

The author started a personal project to build a chatbot for household questions, aiming to improve its Retrieval-Augmented Generation (RAG) results through question categorization. The core idea was a pre-processing step that classifies each question into a specific metadata category (e.g., "pool," "hvac") to narrow the search space for vector retrieval.

  • Initial Approach & Baseline: The experiment started with Qwen 3:0.6B, a very small local LLM. A baseline test using prompting alone yielded a dismal 10% accuracy, with the model frequently overusing broad categories or inventing new ones.
  • First Fine-tuning Attempt: Using Unsloth with QLoRA, the author fine-tuned the model on a dataset of ~850 household-related questions. This raised accuracy to 79%, demonstrating the potential of fine-tuning, but the model still emitted partial category names and confused semantically similar categories.
  • Second Fine-tuning Attempt with Opaque IDs: The most impactful improvement came from a subtle change: asking the model to output fixed, two-character opaque IDs instead of descriptive category names. This simple modification boosted accuracy to 92%, largely by eliminating semantic ambiguities and ensuring precise outputs.
  • Practical Application: The highly accurate categorization now serves as a reliable pre-processor in the author's chatbot, efficiently directing RAG queries to relevant data subsets.
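The opaque-ID idea from the bullets above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the IDs, category names, prompt wording, and helper names are all assumptions made up for the example.

```python
from typing import Optional

# Hypothetical label table: these IDs and category names are illustrative only.
ID_TO_CATEGORY = {"q1": "pool", "k7": "hvac", "z4": "appliances"}
CATEGORY_TO_ID = {name: cid for cid, name in ID_TO_CATEGORY.items()}

def to_training_example(question: str, category: str) -> dict:
    """Pair a question with its opaque ID as the completion target."""
    return {
        "prompt": f"Question: {question}\nID:",
        "completion": CATEGORY_TO_ID[category],
    }

def parse_output(raw: str) -> Optional[str]:
    """Map model output back to a category; anything that is not a known ID is rejected."""
    return ID_TO_CATEGORY.get(raw.strip())

def accuracy(predicted: list, expected: list) -> float:
    """Fraction of questions whose parsed category matches the label."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

Because the model only ever emits a short ID, a partial or invented category name simply fails the lookup and counts as a miss instead of being silently accepted.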

This project effectively illustrates that even extremely small LLMs can be fine-tuned to perform highly accurate, niche classification tasks, especially when guided by well-structured prompts and output formats. The success hinges on the strategic design of the training data and the model's output, making these tiny models viable for local, resource-constrained applications.

The Gossip

Classical Classification Critiques

Many commenters debated whether an LLM was necessary for a simple classification task, suggesting that traditional machine learning methods, such as scikit-learn with n-grams or BERT-based models with classifier heads, might be more efficient, lighter, and faster to train for this specific problem. They questioned whether an LLM was 'overkill' given the problem's scope, though some accepted the LLM approach as reasonable if it met the author's goals.
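To make the "classical" alternative concrete, here is a rough sketch of an n-gram text classifier. It deliberately uses a small hand-rolled naive Bayes instead of scikit-learn so that it runs with no dependencies; the class name and the toy questions are assumptions for illustration, and nothing here comes from the thread itself.

```python
import math
from collections import Counter, defaultdict

def ngrams(text):
    """Lowercased word unigrams and bigrams."""
    words = text.lower().split()
    return words + [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

class NaiveBayes:
    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training examples
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        def score(label):
            total = sum(self.counts[label].values())
            log_p = math.log(self.docs[label] / sum(self.docs.values()))
            for g in ngrams(text):
                # Laplace smoothing keeps unseen n-grams from zeroing the score.
                log_p += math.log((self.counts[label][g] + 1) / (total + len(self.vocab)))
            return log_p
        return max(self.docs, key=score)

# Toy training set (hypothetical questions, not the author's data).
model = NaiveBayes().fit(
    ["how do i clean the pool filter", "why is the pool water green",
     "how do i change the furnace filter", "why is the thermostat blinking"],
    ["pool", "pool", "hvac", "hvac"],
)
```

A model like this can only ever return one of the training labels, which is one reason commenters saw it as a natural fit for a closed category set.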

Small LLM Significance

The discussion turned to the utility and performance characteristics of tiny LLMs like Qwen 3:0.6B. Commenters praised the model's speed, identified its niche (particularly when fine-tuned), and wondered about its use for distillation. Others asked whether such small models need fine-tuning to be useful at all, and what hardware they require (CPU vs. GPU).

Advanced LLM Avenues

Several users suggested more advanced LLM techniques for classification or related tasks, such as zero-shot encoders, natural language inference, various tuning methods (GRPO, GEPA, DPO), and using larger LLMs for synthetic data generation. One commenter asked specifically about grammar constraints to prevent LLMs from 'inventing' categories; the reply confirmed this is possible but depends on the inference runtime.
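As a rough illustration of the grammar-constraint idea, the sketch below builds a GBNF-style grammar (the format accepted by llama.cpp) whose only valid outputs are a fixed set of IDs. Whether and how such a grammar can be supplied depends on the runtime, as the thread notes; the function name and the example IDs are assumptions.

```python
def ids_to_gbnf(ids):
    """Build a GBNF-style grammar whose only valid outputs are the given IDs."""
    for cid in ids:
        # Restrict to plain alphanumerics so no quoting or escaping is needed.
        if not cid.isalnum():
            raise ValueError(f"unsupported ID: {cid!r}")
    alternatives = " | ".join(f'"{cid}"' for cid in ids)
    return f"root ::= {alternatives}"

# Hypothetical two-character IDs.
grammar = ids_to_gbnf(["q1", "k7", "z4"])
```

With a grammar like this enforced during decoding, the model cannot produce a category outside the list, so invented labels are ruled out at generation time instead of being filtered afterwards.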

RAG's Refinement Rationale

A core question from the community was how question categorization actually improves Retrieval-Augmented Generation (RAG) results. The author's stated intention was to narrow the search space before vector ranking, so that only chunks from the predicted category compete for retrieval.
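The narrowing step described above might look something like this in outline: filter stored chunks by the predicted category first, then rank only the survivors by vector similarity. The chunk layout, field names, and toy two-dimensional vectors are assumptions for illustration, not the author's retrieval code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, category, chunks, top_k=2):
    """Keep only chunks tagged with the predicted category, then rank by similarity."""
    candidates = [c for c in chunks if c["category"] == category]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]

# Toy index (hypothetical chunks with made-up embeddings).
chunks = [
    {"id": "pool-1", "category": "pool", "vec": [1.0, 0.0]},
    {"id": "hvac-1", "category": "hvac", "vec": [0.9, 0.1]},
    {"id": "pool-2", "category": "pool", "vec": [0.0, 1.0]},
]
```

In this toy index, the "hvac" chunk is close to the query vector [1.0, 0.0] but never appears in results for a "pool" question, which is the kind of cross-category noise the pre-processing step is meant to remove.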