

I just checked how much my 4x32GB costs. Guys, I’m focking rich


I once witnessed the funniest thread of my life, so I made it into a meme. Feels like it fits here.

Is there a general term for the setting that offloads the model into RAM? I’d love to be able to load larger models.
Ollama does that by default, but it prioritizes GPU above regular RAM and CPU. In fact, it’s another feature that often doesn’t work, because they can’t fix the damn bug we reported a year ago: mmap. That feature lets you load and use a model directly from disk (although it’s incredibly slow, it at least lets you run something like DeepSeek, which weighs ~700GB, at 1-3 tokens/s).
num_gpu lets you specify how many layers to load into GPU VRAM; the rest is kept in regular RAM.
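For reference, here’s a minimal sketch of passing that option per request through the official ollama Python client (the model name and the layer count are just placeholders):

```python
import ollama

# Keep only the first 20 transformer layers in VRAM; whatever doesn't fit
# is served from regular RAM. "llama3" is a placeholder model name.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "How much of you fits on my GPU?"}],
    options={"num_gpu": 20},
)
print(response["message"]["content"])
```

If I remember right, the same key can also go into a Modelfile as a PARAMETER, so you don’t have to pass it on every call.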
You’d need Ollama (running locally) and custom models from Hugging Face.
Half the charm of using Ollama is the ability to install models with one command, instead of hunting for the correct file format and settings on Hugging Face.
For example:
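Something like this; a rough sketch using the official ollama Python client, with a placeholder model name (on the CLI it’s literally one ollama pull / ollama run command):

```python
import ollama

# One call pulls the model from the Ollama registry: it downloads the
# weights and picks the format for you. "llama3" is a placeholder name.
ollama.pull("llama3")

# ...and it's immediately usable for chat.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(response["message"]["content"])
```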
Isn’t that one also pretty censored? Truly uncensored ones are usually either built from scratch (behemoth or midnight-miqu, for example) or named accordingly: mixtral-uncensored or llama3-abliterated.
Two years ago, when I found out that you need a damn subscription to watch YOUR stuff with transcoding, on your own device, on your local network, from your own local server, I complained on Reddit, and a lot of people disagreed with my harsh position.
They_got_what_they_focking_deserve.png