wiradikusuma 2 days ago [-]
Quick question: for the average Joe, do we still need to "train" an LLM, or can we just use an off-the-shelf model ("inference"?) for normal use cases like business process augmentation (e.g. helping read paper receipts, or generating cat videos)?
magicalhippo 2 days ago [-]
Modern smaller LLMs like Qwen3.6 27B are quite good at visual tasks like describing images. I wouldn't trust them on receipts unless you're fine with a bit less than 100% accuracy, say 90-ish%. For descriptions of images and such I've found they do quite well indeed. A key change was the introduction of more, or even dynamic, visual tokens, which really helped the model "see" more details.
Generating cat videos is the domain of diffusion models. If you have at least a 16GB GPU and a fair bit of patience you can get quite good results; check out the ComfyUI subreddit, for example.
magicalhippo 2 days ago [-]
Just as an example, here's what Qwen3.6 27B Q5_K_XL can do given this[1] image. I didn't do any prompt engineering here, just a dead simple prompt: "Transcribe the following receipt. Put line items in a separate section, each line item separated by a double newline". Temperature set to 0.5.
Here's the output:
Publix.
Bradenton Commons Shopping Center
4651 Cortez Rd. W.
Bradenton, FL 34210
Store Manager: Joe Galati
941-792-7195
N/O LF WHEAT BREAD 3.99 F
PBX THCK L/S BACON 7.82 F
PUBLIX BROWN GRAVY 0.83 F
TOP SIRLOIN STEAK 11.74 F
You Saved 3.92
VITA PRTY SNK WINE 6.99 F
You Saved 3.00
ORGANIC CARROTS 1.69 F
BRC FLRT EAT SMART 3.34 F
1 @ 3 FOR 10.00
You Saved 0.15
GINGER ROOT 0.65 F
0.13 lb @ 4.99/ lb
POTATOES RUSSET 0.84 F
0.65 lb @ 1.29/ lb
POTATOES SWEET 0.49 F
0.49 lb @ 0.99/ lb
DELECT BSQUE CK/TN 10.99 T
FS OUTSTRETCH UNSC 15.99 T
Order Total 65.36
Sales Tax 1.89
Grand Total 67.25
Credit Payment 67.25
Change 0.00
Savings Summary
Special Price Savings 7.07
************************************************************
* Your Savings at Publix *
* 7.07 *
************************************************************
Receipt ID: 5957 6249 2191 1277 712
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PRESTO!
Trace #: 766630
Reference #: 0098440513
Acct #: XXXXXXXXXXXX2034
Purchase VISA
[1]: https://i.pinimg.com/originals/41/08/dc/4108dcf51f15af464bb6...
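For reference, here's a minimal sketch of how one might send that same prompt to a locally served vision model through an OpenAI-compatible endpoint (e.g. llama.cpp's llama-server); the base URL, model name, and image path below are placeholder assumptions, not the exact setup used above:

    # Minimal sketch: the receipt prompt above, sent to a locally served vision
    # model via an OpenAI-compatible API. base_url, model name and image path
    # are assumptions, not the actual setup from the comment above.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    with open("receipt.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = ("Transcribe the following receipt. Put line items in a separate "
              "section, each line item separated by a double newline")

    response = client.chat.completions.create(
        model="qwen3.6-27b",  # whatever name the local server exposes
        temperature=0.5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)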
What is the difference between this and using normal OCR and then running that output through an LLM? It seems to me such a bazooka way to kill a fly, using a model like Qwen.
Doxin 20 hours ago [-]
For this example, perhaps not much, but a similar query also works. Given this image[0] I queried qwen3.5-35b-a3b "where is waldo", and after some thinking it got this back:
---
Waldo is located in the *center* of the image.
Here is how to spot him:
1. Look towards the middle of the beach scene.
2. Find the *red and white striped lifeguard stand/booth* on the right side of the center area.
3. Just to the left of that booth, there is a man standing up wearing his signature *red and white striped shirt*, blue jeans, and red hat with a pom-pom. He also has a camera around his neck.
---
Which is not generally something OCR can do. If you instead ask "how many horses are in this image?" you get this response:
---
Based on a careful look at the image, there are *3* equine animals (which appear to be donkeys or ponies used for beach rides).
1. One is near the top left, being ridden by someone in yellow shorts.
2. Right next to it is another one, being ridden by someone in blue and yellow.
3. On the far right edge of the image (near the water), there is a third grey animal being ridden by someone wearing a hat.
---
Now, is this all anything you couldn't do with more boring machine learning? Sure, but there's something incredibly convenient about how generic LLMs are. You don't need to train anything; just point the LLM at an image and ask.
[0]: https://i.pinimg.com/originals/18/64/44/1864444c819a7adae742...
For most tasks I agree. However, once you've done your OCR you've already lost a lot of positional and context information, so for some tasks it might not be good enough.
If you have scanned PDFs that follow a template, like an invoice from a repeat supplier, then yeah OCR is definitely the way to go.
minimaxir 2 days ago [-]
You can use modern off-the-shelf models for those types of tasks; however, a smaller-but-bespoke model will usually be more cost-efficient if used at scale.
najarvg 2 days ago [-]
And smaller bespoke models running locally are better for regulated workflows (healthcare, banking etc) as well
jiehong 2 days ago [-]
I think nowadays a lot of models are trained more for doing tasks like this than for knowing things, while being smaller. So I’d say yes!
At least that’s my impression.
electroglyph 2 days ago [-]
nice writeup! looking forward to doing some more training as soon as i get some more data sorted. it'll be a custom arch, but i'll probably shoehorn it into unsloth for a speed boost.
danielhanchen 2 days ago [-]
Thank you!
stared 2 days ago [-]
While I do admire Unsloth (especially their https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF quantizations), the linked blog post looks like it was written by AI from notes (unless a human author acquired this taste from interactions with chatbots).
danielhanchen 2 days ago [-]
Oh thanks :) We're also going to add MTP support soon for Qwen3.6!
95% of it is fully human done - the maths, algos, code snippets, screenshots & benchmarks are done / conducted by us and NVIDIA :)
We did use AI to fix spelling errors + made some nice plots using Chat (ours would look horrible lol)
Update - Just got rid of the spiced up intro
stared 2 days ago [-]
Thanks!
To be clear, I use AI for editing all the time.
Actually, diagrams are nice.
Just some pieces, like the one below, look like copy-paste (I mean, empty lines before it, the code gets no special typography, etc.):
If we write the boundary information for a packed batch as:
B = { lengths, cu_seqlens, max_seqlen, mask structure }
then every transformer layer in that forward pass consumes the same B.
If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again.
In other words, the useful work is:
build B once, use it L times.
The wasteful version is:
build B + build B + ⋯ + build B (L times)
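For what it's worth, the point in that excerpt is simple enough to sketch; here's an illustrative PyTorch fragment (names invented, not Unsloth's actual code) showing B derived once from the packed sequence lengths and handed to every layer, versus being rebuilt inside each of the L layers:

    # Illustrative sketch only; names are invented and the mask structure is
    # omitted for brevity.
    import torch

    def build_boundaries(lengths: torch.Tensor) -> dict:
        # B: what every layer needs to know about packed-sequence boundaries.
        cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
        cu_seqlens[1:] = torch.cumsum(lengths, dim=0)  # prefix sums: [0, l0, l0+l1, ...]
        return {"lengths": lengths,
                "cu_seqlens": cu_seqlens,
                "max_seqlen": int(lengths.max())}

    def forward(layers, hidden, lengths):
        B = build_boundaries(lengths)   # useful version: build B once...
        for layer in layers:
            hidden = layer(hidden, B)   # ...and reuse it in all L layers
        return hidden

    # Wasteful version: rebuilding B inside every layer,
    #     hidden = layer(hidden, build_boundaries(lengths))  # L redundant builds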
giancarlostoro 2 days ago [-]
> Actually, diagrams are nice.
I especially use AI to generate code for things like Mermaid[0]. It's just easier to describe the flow I want to outline than to remember all the nuances of Mermaid or similar code -> graph / diagram tooling. The output still looks nice too.
[0]: https://mermaid.js.org/
What’s with all the hate for AI assisted writing on HackerNews? It’s a tool and people use tools all the time. It saves TIME and helps in improving coherence of one’s articles.
embedding-shape 2 days ago [-]
> What’s with all the hate for AI assisted writing on HackerNews?
I don't think it's specifically for "AI assisted writing"; any lazy writing gets hate on HN, and the bar for quality just sits higher, for better or worse.
> It saves TIME and helps in improving coherence of one’s articles.
I agree that it saves time for the author, but for the reader it has the opposite effect, and if you're unable to write coherent articles without the use of LLMs, maybe solve that first instead of patching over the problem.
saberience 2 days ago [-]
Because AI writing is lazy, and moreover, I don’t want to know the AI’s opinion on something; I can get that myself. If I want to read someone’s article, I want to hear that person’s words and that person’s opinions.
If someone has no opinions or unique insight then why would I listen to them or read their content.
Again, if I want the AI’s view on something I can open up Claude and ask it myself, so why bother reading generated articles that took 10 seconds for someone else to prompt?
danielhanchen 2 days ago [-]
Update - Just got rid of the spiced up intro
stingraycharles 2 days ago [-]
It destroys the previously implicit contract that the writer actually put a decent amount of thought and time into the writing, and that the ideas expressed are theirs and original.
I don’t mind good usage of LLM assisted writing, but if the author can’t even be bothered to identify the most obvious AI tells, I use it as a proxy that the author probably put very little effort into the article.
It’s also often a horribly verbose style, where the same ideas could be presented with 20% of the prose.
It’s also ruining the entire experience on web communities (although here on HN the moderation team seems to have a handle on keeping them at bay at this point, much appreciated).
All in all, it’s objectively a net negative for the readers, and serves only the author.
I prefer original, less coherent articles that are genuine and where I know the ideas expressed are really the author’s and not the LLM’s inference.
Last but not least, I don’t think the grandparent you’re replying to was particularly hateful in the grand scheme of things.
vardalab 2 days ago [-]
Why would you prefer a less coherent article? If an article has utility, I will read it, no matter what the source is.
toraway 1 day ago [-]
The problem with AI-written articles is that you still feel uncertain whether there's actually any utility: after reading 2000 words you realize it's been 90% filler so far, but you think maybe it will lead somewhere soon? It doesn't, and you've wasted ten minutes reading glorified blog spam that was micro-targeted at whatever niche you were researching.
After a while you pick up on the warning signs and just bail early without any guilt about false positives. It's really the only sustainable strategy in a world where it takes 5 seconds to absorb 5 minutes of your attention span.
stingraycharles 2 days ago [-]
For the same reason people prefer authenticity over mass-produced, generic stuff.
But authentic writing takes a lot of effort, and nobody wants to do that anymore in 2026, so the status quo is more mass-produced, generic content, which is frustrating and (to me) a regression.
SwellJoe 2 days ago [-]
LLM prose is unfocused and extremely verbose. It wastes the reader's time and is insulting. If you don't care enough about something to write about it, I certainly can't be expected to care enough to read it.
LLMs don't want anything. Thus, they have no taste. It's not merely a style question; it wastes readers' time trying to find the point the author was trying to make: a fruitless search, as the LLM wasn't trying to make a point, it was completing one probable sentence after another.
wat10000 2 days ago [-]
When used well, it's not noticeable and nobody complains. The problem is only when it's used badly.