

I just checked how much my 4x32GB costs. Guys, I’m focking rich


I once witnessed the funniest thread of my life, so I made it into a meme. Feels like it fits here.

Is there a general term for the setting that offloads the model into RAM? I’d love to be able to load larger models.
Ollama does that by default, but it prioritizes GPU above regular RAM and CPU. In fact, it’s another feature that often doesn’t work, because they can’t fix the damn bug we reported a year ago: mmap. That feature lets you load and use a model directly from disk (although it’s incredibly slow, it at least lets you run something like DeepSeek, which weighs ~700GB, at 1-3 tokens/s).
num_gpu lets you specify how many layers to load into GPU VRAM; the rest is kept in regular RAM.
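For reference, here’s a minimal sketch of passing that option per request through the official ollama Python client (the model name and the layer count are just placeholders):

```python
import ollama

# Keep only the first 20 transformer layers in VRAM; whatever doesn't fit
# is served from regular RAM. "llama3" is a placeholder model name.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "How much of you fits on my GPU?"}],
    options={"num_gpu": 20},
)
print(response["message"]["content"])
```

If I remember right, the same key can also go into a Modelfile as a PARAMETER, so you don’t have to pass it on every call.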
You’d need Ollama (running locally) and custom models from Hugging Face.
Half the charm of using Ollama is the ability to install models with one command, instead of hunting for the correct file format and settings on Hugging Face.
For example:
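Something like this; a rough sketch using the official ollama Python client, with a placeholder model name (on the CLI it’s literally one ollama pull / ollama run command):

```python
import ollama

# One call pulls the model from the Ollama registry: it downloads the
# weights and picks the format for you. "llama3" is a placeholder name.
ollama.pull("llama3")

# ...and it's immediately usable for chat.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(response["message"]["content"])
```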
Isn’t that one also pretty censored? Truly uncensored ones are usually either built from scratch (behemoth or midnight-miqu, for example) or named accordingly: mixtral-uncensored or llama3-abliterated.
Two years ago, when I found out that you need a damn subscription to watch YOUR stuff with transcoding, on your own device, on your local network, from your own local server, I complained on Reddit, and a lot of people disagreed with my harsh position.
They_got_what_they_focking_deserve.png