@waifu tbh you can run stable diffusion on a mid-range card pretty easy
@waifu There's text-generation-webui, which is pretty good. I use it as a backend for SillyTavern, which has a much nicer interface. There's a bunch of uncensored models on huggingface that'll let you do whatever. They're great. One of the reasons I started running them locally. Each one has like its own flavour, I guess.
IDK what card you have, but if you have ~6GB of VRAM, you should be able to run a quantized 7B parameter model pretty comfortably (a 4-bit quant of a 7B is roughly 4GB, which leaves room for context). 7B is the ground floor in terms of model size, but it's more than sufficient to get you going. Silicon Maid is a pretty good lewd 7B in my tests. You might find some success with it.
TheBloke also makes quants for every model under the sun. Quants are like compressed versions of a model: much smaller and easier to load and move around than the raw fp16 weights, and they usually don't get borked much in the process.
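If you want to grab just one quant file instead of cloning the whole repo, huggingface's CLI can do it. This is just a sketch; the filename follows TheBloke's usual naming, so double-check the actual file list on the model page and pick whichever quant level fits your VRAM:

```shell
# Install the huggingface CLI if you don't have it
pip install -U "huggingface_hub[cli]"

# Download a single ~4GB Q4_K_M quant instead of the full fp16 weights (~14GB)
huggingface-cli download TheBloke/Silicon-Maid-7B-GGUF \
    silicon-maid-7b.Q4_K_M.gguf --local-dir ./models
```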
@waifu @VD15 llama.cpp can split the model between your GPU and CPU, so you can run a few layers on your GPU and the rest on your CPU. With a quantized (https://huggingface.co/TheBloke/Silicon-Maid-7B-GGUF) small model it shouldn't be super slow.
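Something like this once you've built llama.cpp (rough sketch, check `--help` for your build; the binary is `llama-cli` in newer builds, `main` in older ones, and the model path here assumes the Q4_K_M quant from the repo above):

```shell
# -ngl = number of layers to offload to the GPU; a 7B has 32 transformer layers,
# so bump it up until you run out of VRAM. Needs a CUDA/Metal/Vulkan build of
# llama.cpp for the GPU offload to actually do anything.
./llama-cli -m ./models/silicon-maid-7b.Q4_K_M.gguf \
    -ngl 8 -c 4096 -p "Hello" -n 128
```

More layers on the GPU = faster, so just experiment with `-ngl` until it stops fitting.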
@matrix @waifu Pure CPU inference speed with that model isn't bad on my 16-thread Ryzen, actually. That's like a comfortable reading speed for me.