one of my side projects, called Doktor (think of it what you will :P), is a telegram chatbot. i ask it details about my lab tests, prescriptions and diagnostics. the source data is a folder with about 60-70 files from the last 10 years. a pipeline parses these pdf or handwritten prescriptions and pushes them to a sql db and a vector db, and it uses an llm. when i ask the bot something, an llm classifies my query into one of 5 values, and a switch case then figures out which subsequent query to fire.

today's task was to make this classifier faster. my intent was to always run everything locally. since its health data, i dont want to host it anywhere.

i had a 5-way classification scheme:

- TREND -- "how has my HbA1c changed over time"
- LATEST_VALUE -- "what is my latest cholesterol"
- FETCH_DOCUMENT -- "fetch my fibroscan report"
- GENERAL_QA -- "what medicines did Dr. Sharma prescribe"
- UNCLEAR -- "hello", "asdf", "thanks"

despite how simple this task is, i was running llama3.1:8b. a 4.7GB model!! and each classification call takes more than a second, sometimes even 5-6 seconds. the first call in a session -- cold start -- could take upwards of 5 to 12 seconds. the chatbot doesnt feel snappy. it was (and is) a pain to use. it kind of works, but each query takes a good 30 seconds. to be fair, quite a lot of that time is taken by the parts after the classifier -- but thats a topic for some other day. so today was a day to find out: can i make the classifier take less than 1 second?

for a while, i couldnt help but chuckle inside. having worked in systems where microseconds or nanoseconds matter, building a component that is now allowed to take up to 1s is comical. proper backsliding. or, the likelier scenario, i am dumb and not knowledgeable enough yet in this domain. i digress.

before testing any models, i needed a way to measure classification quality. the project already had unit tests for the intent classifier, but they mocked the LLM entirely. i created a new integration test suite with 35 test cases across all five intent categories. these tests hit the real Ollama instance with no mocks. each test sends a natural language query and asserts the expected intent label. they cover clear-cut queries, shorter ambiguous ones, queries without explicit keywords, and edge cases for UNCLEAR. a key decision here -- tests define correctness, not the model's current behavior. when a test fails, it means the model got it wrong, not that the test needs updating.

and surprisingly, i found my current 8b model was failing 3 of them. edge cases -- but still failures. with 32 of 35 as my baseline, i got to work. the failures were: "track my triglycerides" and "cholesterol history" returned LATEST_VALUE instead of TREND, and "do I have any abnormal results" returned LATEST_VALUE instead of GENERAL_QA. these are genuinely borderline. the model's wrong, but not unreasonably so. total runtime was about 37 seconds for 35 tests. important to note the time -- 37 seconds.

first candidate to go small was llama3.2:1b. a 1.3GB model, roughly a quarter the size of the 8b. and spoiler warning -- it was shit. 8 out of 35 passed. 23 percent. it defaulted to GENERAL_QA or UNCLEAR for almost everything. the classifications looked almost random. i was damn surprised, because classification had appeared to me to be pretty straightforward.

before giving up on the 1b model, i tried a different approach. Ollama supports custom Modelfiles, where you can bake a system prompt into the model config. the idea was that if the model cant figure out the categories from descriptions alone, maybe showing it 200 labeled examples would help. 40 per category, intentionally different from the test cases. i also set temperature to 0 for deterministic output.
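to make that concrete, here's a minimal sketch of what such a Modelfile looks like. the FROM / PARAMETER / SYSTEM directives are standard Ollama Modelfile syntax, but the few-shot lines below are made-up stand-ins, not my actual 200 examples:

```
# minimal sketch of a few-shot Modelfile -- the real one had 200 examples, 40 per category
FROM llama3.2:1b

# greedy decoding so the label is deterministic
PARAMETER temperature 0

SYSTEM """
classify the user's message into exactly one of:
TREND, LATEST_VALUE, FETCH_DOCUMENT, GENERAL_QA, UNCLEAR.
respond with the label only, nothing else.

examples (made-up stand-ins here):
"how did my vitamin d move over the years" -> TREND
"latest fasting glucose?" -> LATEST_VALUE
"pull up my mri scan" -> FETCH_DOCUMENT
"which doctor changed my dosage last year" -> GENERAL_QA
"thanks!" -> UNCLEAR
"""
```

you register it once with `ollama create doktor-intent-1b -f Modelfile` (the model name is made up) and then call it like any other local model.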
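and since the system prompt is baked in, the calling side -- the classifier i described at the top -- is conceptually just a one-shot request. a minimal sketch against Ollama's standard REST api; `classify_intent` and the model name are hypothetical, not my actual code:

```python
# sketch: one-shot intent classification against a local Ollama instance.
# /api/generate with stream=False is Ollama's standard non-streaming endpoint;
# "doktor-intent-1b" is the hypothetical model built from the Modelfile above.
import requests

VALID_INTENTS = {"TREND", "LATEST_VALUE", "FETCH_DOCUMENT", "GENERAL_QA", "UNCLEAR"}

def classify_intent(query: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "doktor-intent-1b", "prompt": query, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    label = resp.json()["response"].strip().upper()
    # anything mangled or off-list falls back to UNCLEAR
    return label if label in VALID_INTENTS else "UNCLEAR"

# the downstream switch case then dispatches on the label, e.g.
# if intent == "TREND": run the time-series query against the sql db
```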
the result with the 1b plus the Modelfile was 10 out of 35. 29 percent. i was hoping this would be the silver bullet; it turned out to be a lead one :D the model classified everything as TREND. including "hello", "ok", and "fetch my prescription." the 200 examples overwhelmed the 1b model's limited reasoning capacity. it latched onto TREND because it appeared first and just repeated it for every single input.

next up was llama3.2:3b. 2.0GB, less than half the size of the 8b. 32 out of 35 passed. 91 percent. identical accuracy to the 8b model. same failure patterns on ambiguous queries. noticeably faster response times.

i also tested the 3b model with the 200-example Modelfile. 31 out of 35 passed. 89 percent. slightly worse than bare. the few-shot examples actually confused the 3b model on some GENERAL_QA queries. the bare model with just category descriptions performed better. so the Modelfile actually made it worse!

i also tried qwen3.5:0.8b, a tiny 0.8B parameter model from Alibaba (this was before i even tried the 1b). i dont know what i was doing wrong, but it just kept thinking and thinking for eternity on everything. 30-60 seconds for each test case. i gave up after running 4-5 tests out of 35 (which still took minutes). no matter the accuracy, that speed is not acceptable.

the final scorecard was clear:

- 8b: 91% at 4.7GB. the suite took 37s.
- 3b: 91% at 2.0GB. the suite took <12s.
- everything else was either useless or actively hurt by few-shot examples.

the Modelfiles didnt help at all. this was the biggest letdown. the promise of few-shot prompting is teaching by example. in practice, the 1b got worse with examples and fixated on one category, and the 3b got slightly worse too. more examples equaled worse results on small models.

the win was modest. 4.7GB down to 2.0GB with the same accuracy. i did meet my 1s target, but it still needs a 2GB+ model. i had hoped for a 90-95% reduction in latency alongside 100-200MB models. a fool's paradise. the floor for this task appears to be somewhere between 1B and 3B parameters for off-the-shelf basic prompting. fine-tuning with LoRA is probably the real answer. if i want a sub-1B model that nails this, id need to actually train it on labeled examples. thats a project for another day. or, probably, dont use an llm at all.

its clear to me that llms are memory machines, not reasoning machines. they are stupid memory savants. all the intelligence they purport to present is basically memory feigning intelligence. it was so obvious even from such a small example. llms are not the one-ring-to-rule-them-all. all said, i still dont know what ring will rule my classification problem. need to read up a lot more. it will be so sad if ultimately i just end up regex-ing it. (although, that is the likely outcome given the simplicity of the prompts and the narrow scope.)

Date: 2026-04-11