@waifu tbh you can run stable diffusion on a mid-range card pretty easy
@waifu There's text-generation-webui, which is pretty good. I use it as a backend for SillyTavern, which has a much nicer interface. There's a bunch of uncensored models on huggingface that'll let you do whatever. They're great. One of the reasons I started running them locally. Each one has like its own flavour, I guess.
IDK what card you have, but if you have ~6GB of VRAM, you should be able to run a quantized 7B parameter model pretty comfortably (a 4-bit quant of a 7B is roughly 4GB, which leaves room for context). 7B is the ground floor in terms of model size, but it's more than sufficient to get you going. Silicon Maid is a pretty good lewd 7B in my tests. You might find some success with it.
TheBloke also makes quants for every model under the sun. Quants are like compressed versions of a model: much smaller and easier to load and move around than the raw fp16 weights, and they usually don't get borked much in the process.
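If you want to grab just one quant file instead of cloning the whole repo, huggingface's CLI can do it. This is just a sketch; the filename follows TheBloke's usual naming, so double-check the actual file list on the model page and pick whichever quant level fits your VRAM:

```shell
# Install the huggingface CLI if you don't have it
pip install -U "huggingface_hub[cli]"

# Download a single ~4GB Q4_K_M quant instead of the full fp16 weights (~14GB)
huggingface-cli download TheBloke/Silicon-Maid-7B-GGUF \
    silicon-maid-7b.Q4_K_M.gguf --local-dir ./models
```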
@waifu @VD15 llama.cpp can split the model between your GPU and CPU, so you can run a few layers on your GPU and the rest on your CPU. With a quantized (https://huggingface.co/TheBloke/Silicon-Maid-7B-GGUF) small model it shouldn't be super slow.
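Something like this once you've built llama.cpp (rough sketch, check `--help` for your build; the binary is `llama-cli` in newer builds, `main` in older ones, and the model path here assumes the Q4_K_M quant from the repo above):

```shell
# -ngl = number of layers to offload to the GPU; a 7B has 32 transformer layers,
# so bump it up until you run out of VRAM. Needs a CUDA/Metal/Vulkan build of
# llama.cpp for the GPU offload to actually do anything.
./llama-cli -m ./models/silicon-maid-7b.Q4_K_M.gguf \
    -ngl 8 -c 4096 -p "Hello" -n 128
```

More layers on the GPU = faster, so just experiment with `-ngl` until it stops fitting.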
@matrix @waifu Pure CPU inference speed with that model isn't bad on my 16-thread Ryzen, actually. That's like a comfortable reading speed for me.