/g/ - Technology


►Previous Threads >>93660168 & >>93652233

►News
>(05/26) BluemoonRP 30B 4K released
>(05/25) First QLoRA's and 4bit bitsandbytes released
>(05/23) exllama transformer rewrite offers around x2 t/s increases for GPU models
>(05/22) SuperHOT 13B/30B prototype & WizardLM Uncensored 30B released
>(05/19) RTX 30 series 15% performance gains, quantization breaking changes again >>93536523
>(05/19) PygmalionAI release 13B Pyg & Meth
>(05/18) VicunaUnlocked-30B released
>(05/14) llama.cpp quantization change breaks current Q4 & Q5 models, must be quantized again

►FAQ & Wiki
>Wiki
>>404
>Main FAQ
https://rentry.org/er2qd

►General LLM Guides & Resources
>Newb Guide
https://rentry.org/local_LLM_guide
>LlaMA Guide
https://rentry.org/TESFT-LLaMa
>Machine Learning Roadmap
https://rentry.org/machine-learning-roadmap
>Local Models Papers
https://rentry.org/LocalModelsPapers
>Quantization Guide
https://rentry.org/easyquantguide
>lmg General Resources
https://rentry.org/lmg-resources
>ROCm AMD Guide
https://rentry.org/eq3hg

►Model DL Links, & Guides
>Model Links & DL
https://rentry.org/lmg_models
>lmg Related Links
https://rentry.org/LocalModelsLinks

►Text Gen. UI
>Text Gen. WebUI
https://github.com/oobabooga/text-generation-webui
>KoboldCPP
https://github.com/LostRuins/koboldcpp
>KoboldAI
https://github.com/0cc4m/KoboldAI
>SimpleLlama
https://github.com/NO-ob/simpleLlama

►ERP/RP/Story Gen.
>RolePlayBot
https://rentry.org/RPBT
>ERP/RP Data Collection
https://rentry.org/qib8f
>LLaMA RP Proxy
https://rentry.org/better-llama-roleplay

►Other Resources
>Drama Rentry
https://rentry.org/lmg-drama
>Miku
https://rentry.org/lmg-resources#all-things-miku
>Baking Template
https://rentry.org/lmg_template
>Benchmark Prompts
https://pastebin.com/LmRhwUCA
>Simple Proxy for WebUI (+output quality)
https://github.com/anon998/simple-proxy-for-tavern
>Additional Links
https://rentry.org/lmg_template#additional-resource-links
>>
File: Fw4o4yyaUAEUY2G.jpg (284 KB, 1750x2000)
>>93665390
>>
File: d34.jpg (204 KB, 956x560)
>>93665497
EY MOVE YO GENERIC BITCH ASS ON, HOE
>>
Is there some kind of syntax I'm supposed to follow when writing a prompt? I want to put in unbreakable rules for the AI to follow when generating anything.
>>
you know, this onions man explains my current dilemma pretty well. i'm in the market for an RTX 4000 SFF ada card, but the MSRP is $1250 when it should be $1000, yet it's impossible to find for less than $1450. that's almost a $500 premium or a 1/4 oz of gold for some e-waste
https://www.youtube.com/watch?v=kIaf6o4kFYM
>>
>>93665731
CPUchads can't stop winning because cheap RAM will never be hard to find.
>>
>>93665752
should i just run my shit on CPU and wait for GPU prices to convert to christianity? i have 44 xeon cores and 256 GB DDR4 at my disposal so i'm not exactly starving for specs. i could easily just throw 16 threads and 64 GB RAM at a model
>>
>>93665770
Yes. Also, 3090s are 700 each all day long.
>>
>>93665770
I dunno man, you might have to wait more than 2 minutes per prompt, at that point it's just not worth using AI at all, right?
>>
>>93665796
those cards are huge and power hungry though. i'm not a caselet but i like to keep things tidy. maybe i'll just go with the original plan and develop my galactica API on CPUs while i wait for enough startups to get memed into buying RTX 4000 SFFs new, then grab one when they inevitably fail and offload their shit on ebay

it's not the newness i'd buy but the form factor. if a good mini GPU had come out 5 years ago, it wouldn't matter at all. the fact that it came out like 2 months ago just pisses me off
>>
>>93665770
GPUs have thousands of cores. Even shitty thrift store GPUs

throwing 40 xeon cores at it would be pathetic and outperformed by a gtx1650. not just outperformed. fucking annihilated
>>
>>93665811
the plan is to run a remote cron job once a day that queries an AI-powered API to schedule jobs, and checks in with a jobId once an hour to write the results, if completed, to a database. if i wanted to discuss politically correct topics with a robot in real time, i'd just use jewPT or bard
>>
>OP pic
I lol'ed
>>
>>93665904
And what models exactly would actually fit on that piece of shit?
>>
>>93665904
yeah i'm just not keen to double++ my server's max TDP and shove some bullshit in there to "have a GPU," nor do i want to deal with the added complexity and cost of configuring GPU passthrough in a virtualization environment while i'm developing the API to actually use the model productively. i can develop on CPU/mini models and buy a nice GPU later
>>
>>93665972
sure run on CPU at 1% of the speed of a 1650
enjoy
>>
File: FuESVOsaQAA-1nY.jpg (288 KB, 1245x2048)
new twitter HYPE
https://arxiv.org/abs/2305.15348
READ: Recurrent Adaptation of Large Transformers
Through comprehensive empirical evaluation of the GLUE benchmark, we demonstrate READ can achieve a 56% reduction in the training memory consumption and an 84% reduction in the GPU energy usage while retaining high model quality compared to full-tuning. Additionally, the model size of READ does not grow with the backbone model size, making it a highly scalable solution for fine-tuning large Transformers.
>>
>>93666066
>for fine-tuning
ZZZZZZZZZZZZZZ
>>
>>93665972
>a 1650 will double your servers TDP
how retarded are you? like does your mom have power of attorney since youre too retarded to exist in society on your own
>>
>>93666058
>develop on a tiny model, slow
>make sure everything works
>then buy a modern card used, not for $200+ over MSRP
you realize i'm in the market for an RTX 4000 SFF right? which is so new that no used cards even exist
>>
>>93666066
Nice, looks to be 30% more efficient than LoRA
>>
>>93666098
i dunno what the fuck a 1650 even is, nor do i care. the options are 2x RTX A4000 or 1x A5000, or 1x RTX 4000 SFF, all in the $1000-1500 price range
>>
>>93665390
I was told I need something called a "ggml" to use koboldcpp. What does that mean?
>>
>>93666167
model must have ggml in filename
no you can't just rename it
>>
>>93666167
numbers.ggml that contains billions of numbers lovingly crafted by mark zuckerberg
>>
>>93666183
Okay but which of these models should I use? Is there a list of the best ones?
>>
>>93665972
Ignore GPU shills. Take a look at this-
https://huggingface.co/blog/generative-ai-models-on-intel-cpu
CPU is the future. When models get better architecture, and you can inevitably beat Claude and GPT 4 with low param counts, CPU will triumph.
>>
>>93666136
I'm running on an A5000 I got off FB marketplace cheap. Works great with ooba, 64gb sysram and a middling CPU. 30b models all day
>>
>>93666218
Yes, on a site called 'Huggingface'. You can sort by most downloaded and recently added. They also have rankings. All in all, they do their jobs better than us, inb4 gradio meme.
>>
>>93666218
Depends on what you want.
Creative writing? Base llama with kobold.
ERP/RP? SuperCOT with ooba/kobold, the tavern proxy, and tavern.
ChatGPT-like? Alpasta or Aldente2 with Ooba in instruct mode.
Programming? Lolno.
>>
>>93666247
2x a4000 are 32gb, though. I'd go with more vram.
>>
>>93666253
>ChatGPT-like? Alpasta or Aldente2
Which one has the best results given 24gb vram?
>>
>>93665878
Just underclock it to the same speed as the Quadro version and it won't be power hungry anymore. Do you not understand how voltage/frequency scaling works?
>>
>>93666253
>Creative writing? Base llama with kobold.
Fucking what
>>
>>93666375
>he doesn't know
>>
>>93666279
I prefer Alpasta.
>>
Trying to get reeducator_bluemoonrp-30b working, but it's slow as fuck...like 10x slower than supercot. Am I retarded, or does it need something tweaked in ooba on load?
>>
>>93666245
Anon... we're already way past that. That article is talking about the "new" ability to quantize to 8-bit making CPU inference viable, but basically everyone is using either 4 or 5 bit now, regardless of whether they're using CPU or GPU.
>>
>>93666375
Current instruct models summarize too much, they're not trained on story writing prompts, they're trained on short questions and answers ala ChatGPT. When you're writing a short/long form story, base llama is better, because it's pure text completion, provided you're willing to prime it beyond "lol write me a story about a bitch with big tiddies plz", which won't work, obviously.
You feed it a few paragraphs of text and it'll emulate the prose and continue the story from that point. And you use world info/author's notes to insert important facts and events as you go. You write together, you guiding the story, and it completing text and adding little flourishes here and there. I use it as a writing aid and a way to push through writer's block when I'm either unsure how to proceed or can't think of what a character would say next.
>>
>>93666424
Are you filling the context?
>>
>>93666474
Yeah it's good at set dressing, filling dialogue and moving scenes forward if you give it something to work with, but you have to guide it from scene to scene or it gets schizo pretty quick
>>
File: mtg-lotr-arwen-fixed.jpg (989 KB, 4096x2048)
more twitter HYPE (more like anti-hype lol)

The False Promise of Imitating Proprietary LLMs
https://arxiv.org/abs/2305.15717
An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach.
...
In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.
>>
>>93666486
6 short sentences in context. Same "Eris Discordia" I usually play with. Is Bluemoon more sensitive? I've filled supercot context up with wiki pages before and it chugged along as normal (with shitty results, obviously)
>>
>>93666514
>you have to guide it from scene to scene or it gets schizo pretty quick
Well yeah, of course. It only has 2K context. It's fine though, creative writing isn't fun if the fucking AI is doing all the leg work. Think of it as a writing partnership with the two of you taking turns guiding the story, but your partner is kind of schizo and you have to give them meds now and then to keep them on track.
>>
>>93666474
I agree, instruct models can work well for cards but they have serious pacing and tone issues. The tone thing probably comes from assistants and GPT lobotomization, but pacing is probably innate because of the training on short Q&A prompts.
But superhot seems to not have those issues in my testing. The markdown format is easy to slap into memory, and it seems to follow it well, without rushing or shoehorning things in, either. It's ironic because superhot was meant to be a chat model, but I've had the most success storytelling with it. Meanwhile supercot was meant to be a logic tune and it's my favorite chat model
>>
>>93666250
These are folders, not .ggml files.
>>93666253
Where do I download these?
>>
>>93666576
https://rentry.org/lmg_models
>>
>>93666526
Bluemoon is 4k context so it will go a lot slower and use a lot more memory than a normal 2k model when the context is full. Shouldn't be an issue with a fresh chat, though. You have free memory and it's not swapping?
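For a rough sense of the memory side (my own back-of-envelope, assuming an fp16 KV cache and LLaMA-30B's 60 layers with a 6656 hidden size): the cache costs about 2 * 60 * 6656 * 2 bytes ≈ 1.5 MB per token, so a full 2k context is roughly 3 GB and a full 4k context roughly 6 GB on top of the weights, which is why the 4k model hurts so much more once the context fills.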
>>
>>93666640
Fresh chat is still dead slow. System is showing only half the RAM is used. I'm not getting that telltale chirping of the GPU as it spits out text either (wtf is that, anyways?)
Maybe I've got the wrong version of the model and its not running on gpu?
>>
>>93665390
So many new 30Bs... Which one is the correct one?
>>
>>93666664
That's my guess. Use task manager or htop+nvtop and see what's doing the work.
>>
>>93666640
I just tried to load this (I'm brand new, just installed ooboo or whatever) and it failed to load. I have a 4090. Am I just retarded? (yes)
>>
>>93666558
To be fair, superhot isn't even at 1 epoch, and he said that the prototype available is missing a bunch of datasets as well.
>>
>>93666664
also check the config.json of the model and make sure use_cache=True
have no idea why but some models have it disabled, it is necessary to cache the KV tensors
might not be your problem but worth a look, should be in the same directory as the model file
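If you'd rather not eyeball the JSON by hand, a quick python check/fix looks something like this (the model path here is just a placeholder, point it at your own folder):

import json
from pathlib import Path

cfg_path = Path("models/bluemoonrp-30b/config.json")  # placeholder path, adjust to your model directory
cfg = json.loads(cfg_path.read_text())
print("use_cache =", cfg.get("use_cache"))
if cfg.get("use_cache") is not True:
    cfg["use_cache"] = True  # enable KV caching as described above
    cfg_path.write_text(json.dumps(cfg, indent=2))
    print("patched use_cache to True")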
>>
>>93666794
Probably. I haven't used it but the other guy managed to load it at least.
>>
>>93666349
They would still be fuckhuge, and he kinda hinted that space is a problem and he doesn't want no risers.
>>
>>93666424
What is his use_cache value in config?
>>
Local Gpt4 doko?
>>
Anons. I'm trying here but I could use some help. I'm trying to cross reference guides from this gen, and ai chat. I have SillyTavern but I don't have an API key. I'm trying to load this locally. Do I use ooboo as an api? Some anon mentioned to me once before but didn't give me much else.
Is ooboo like A1111? As in I should put all my models and shit in it, and then SillyTavern can just reference Ooboo?
>>
I have a 4090 and every time I try to load a 30b model, I run out of VRAM by like 100 MB. Is it my three monitors that's causing me to fail?
>>
>>93666863
>>93666808
use_cache was true...I tried moving the bin file out and just leaving the bluemoonrp-30b-4bit-128g.safetensors, but then the model won't load
Are there maybe some files from the q5_0 that I need to delete or modify to get it to load the safetensors file?
>>
>>93666934
That happened to me. Don't use 128 group size models. Should fix the issue.
>>
What's with all the baby ducks begging to be spoonfed tonight?
>>
File: 1684420531953053.png (147 KB, 498x462)
>>93662041
>Want to give a thanks to AusBoss for all his uploads/conversions.
>If you're in here, it's much appreciated.
no problem broskie
>>
>>93666984
I'm retarded and am in over my head.
>>
>>93666958
Seems to be happening with the 30b models recommended in guides. I guess I'm just retarded.
>>
>>93666950
I got it. Needed to load ooba with another model and then select bp from the web interface, manually set the GPTQ wbits, groupsize and model type and reload the model. Its making weird GPU sounds again. Does that seriously not happen to anyone else? Its like an 80's NES game: weird chirping sounds as the model spits out text, and its definitely coming from the GPU...
>>
>>93667128
do you not pass those in when you load the model from the commandline?
or are you using windows
>>
What's the recommended strategy to load models with multigpu in ooba?
So far it's super slow using "auto-devices".
>>
File: normal.png (904 KB, 1045x780)
>>93667128
>Its making weird GPU sounds again.
Just go with it.
>>
File: It's just coil whine.jpg (272 KB, 1203x800)
>>93667128
>afraid of coil whine
>>
>>93667128
It's coil whine, I wouldn't worry too much about it. If you really care that much about it you can take some epoxy to the inductors on the board, but you really should just leave it as is as it's just oscillation and doesn't hurt the board.
It's like your engine vibrating and producing a harmonic. Annoying but ultimately harmless.
>>93667187
t. demon :^)
Can you upload it to vocaroo because that sounds interesting.
>>
>>93667199
Vocaroo request meant for
>>93667128
>>
>>93665390
>(05/25) First QLoRA's and 4bit bitsandbytes released
Trying to find out more about how to try this, want to make sure I'm finding the right info. I see this posted 5/24 (as opposed to 5/25) -- same thing? https://huggingface.co/blog/4bit-transformers-bitsandbytes
>>
>>93667238
Yes
>>
>>93667213
>>93667128
hah. Here you go https://voca.ro/1l4YxXK4C4aP

Why would the GPU make this sound? any tips appreciated, hope I didn't ruin this A100.
>>
>>93667008
kek. same here
>>
>>93667333
... you serious? I just told you why.
>>
>>93667333
Standard coil whine.
>>
>>93666595
It tells me that all the models I download are the wrong version and it doesn't open a UI. How do I convert them to the right version or find models which are up to date?
>>
Did greg break the fucking quantization format again?
>>
>>93667513
Let's start with the basics. What's your hardware and what model are you trying to run?
>>
The GOAT LLama 7b beats GPT-4
https://twitter.com/rasbt/status/1661754946625105920?s=19
>>
>>93665390
Yo anyone got that big nigga prompt? A nigga tryna cop fr fr.
>>
File: 1653898652059.gif (1.67 MB, 260x200)
>>93667580
>>
>>93667558
You get what you fucking deserve
>>
>>93666749
*33b
Please use the correct terminology next time.
>>
>>93667620
I don't use ggml, but I update the models rentry. I'm asking if I have to redo all the fucking links once again.
>>
>>93667624
its 32.5B actually
>>
File: 1685080849570.gif (3.29 MB, 498x280)
>>93667580
>>
What is a good local model now? I have been away for around 2 month.
>>
File: IKAozHCMJG.png (117 KB, 240x258)
>>93667580
>>
>>93667624
>>93667638
see >>93667597
>>
>>93667654
StableLM 7B is widely considered to be the best. It punches far above its weight, toppling the 30Bs. We're still waiting for 13B, it'll likely beat GPT4 though.
>>
>>93667654
llama 7b :)
>>
>>93667654
SuperCOT still reigns supreme with no contenders.
>>
>>93667654
See >>93667580. It's smarter than GPT-4.
>>
>>93667333
rip
>>
>>93665390
am I right in that I install LlamaEx in my oobabooga text UI folder?
It doesn't specify where I gitbash
>>
>>93667743
Eh? Is exllama even integrated into the webui yet?
>>
>>93667654
wizardlm p good, pretrained-sft-do2 is also nice, and then there's a billion jillion other models I haven't tried
>>
how do you change the nvcc flags for llama-cpp-python? i need to change the architecture or else ill get an illegal instruction error. is there a make file somewhere?
>>
Auto-bans for mentioning/trolling with StableLM when?
>>
File: Screenshot_21.jpg (146 KB, 757x951)
>>93667681
>>93667686
>>93667723
>>93667797

Uh... Can any of these do character chat better than LLaMA 13B from 2 months ago? I am not looking for problem solvers.
>>
File: file.png (520 KB, 1028x675)
>>93667824
>>
>>93667824
https://rentry.org/llama-examples
You're far behind the curve.
>>
>>93667824
>https://rentry.org/better-llama-roleplay#example-outputs-wizardlm-7b
judge for yourself
>>
>>93667837
The Karen one always cracks me up
>>
>>93667753
I guess that answers my question lmao
so where do I put it instead then?
>>
>>93667817
Its there on reddit if that's what you want faggot
>>
Holy shit, bluemoon 30b makes nice conversations
>>
>>93667893
Inside your anus? Hell if I know, I'm waiting for integration first.
>>
>>93667906
>absolutely no shred of an image to be seen anywhere
>>
>>93667901
Alright relax, don't get your bussy all out of whack
>>
>>93667580
oh every numeral gets its own token in llama models huh smart
>>
>>93667908
if its as good as advertised, it shouldn't take long
>>
Are you guys hyped for the openllama 3b support for llamacpp?
With a 3b model available it will be possible to make it run even on toasters. The perplexity of the 600bt preview is "only" about 2 points higher than the original 7b model, and it should get closer when the training ends (1.2T tokens)
Openllama page: https://github.com/openlm-research/open_llama
Git pull for 3b support: https://github.com/ggerganov/llama.cpp/pull/1588
>>
>>93668123
No. But thanks for asking.
>>
File: 3b he says.jpg (53 KB, 447x406)
>>93668123
>>
I have an AMD card which I know isn't going to work with most if not all models in any way a smooth brain such as myself can run.

So I'm thinking about picking up a dedicated Nvidia AI research card. Looks like you can pick up a K80 with 24GB for about $400 USD

Anyone have any experience running a GPU without display ports? Ideally I want to run my 7900XTX as the display driver at the same time as having the dedicated card. I would imagine you'd normally use the motherboard's onboard graphics as the display driver when using something like an A100 or K80, so I don't see any issue with using a separate card to run the display, but I'm also a smooth brain.
>>
>>93668191
bro just go cpu, much easier to stack ram and run bigger models
>>
>>93668247
>>93668191
tru. plus the llama.cpp native server is just as fast as the base program.
>>
>>93667580
MY CALCULATOR IS SMARTER THAN GPT4
>>
>>93668247
>>93668276
Fair nuff, I already have 64GB which could run the biggest model right?

I'll give it a go this weekend.

On another topic how easily do these models work on phones. I realise the parameters are limited by ram, but I was more thinking about how most phones these days are running on ARM architecture, not x86.
>>
>>93668309
>On another topic how easily do these models work on phones.
They don't.
>>
>>93667597
>>93667653
>>93667661
>>93668306
midwits don't even know why it works like that huh
>>
>>93668309
there is not a snowball's chance in hell you'll run SD on a phone, let alone a LLM
You can however connect to one running on your PC from your phone
>>
>>93668413
if you don't know what you're talking about, why respond as if you know what you're talking about
>>
>>93668408
>that beats GPT4
>on arithmetic tasks
Nobody cares.
>>
>>93668309
Maybe it will be possible to run smaller model, like the 3b mentioned by >>93668123. Not coomer level but at least it's something
>>
>>93668436
Sweet, now I can turn a multi million dollar language model into a $5 pocket calculator
>>
>>93668466
Yeah, nothing beats an LLM that can answer the question "How many cows are in this picture?"
>>
File: 135608157721.jpg (355 KB, 1844x1844)
>>
File: image.png (79 KB, 2549x488)
>Tiny llm Finetuner for Intel dGPUs
>Finetuning openLLAMA on Intel discrete GPUS
>A finetuner1 2 for LLMs on Intel XPU devices, with which you could finetune the openLLaMA-3b model to sound like your favorite book.
https://github.com/rahulunair/tiny_llm_finetuner
>>
>>93668885
>finetuning
yawn
>>
>>93668309
llama.cpp runs on ARM (Apple Silicon and Raspberry Pi) and while there's not an actual mobile app that I'm aware of, there's no technical reason why it couldn't be done. I wouldn't expect to run larger than 7B or for it to run particularly well, but it should be able to do it.
>>
>>93666096
>>93668944
What exactly is it you have an issue with?
>>
Why is GPU anon so hated?
There is no UI that supports his recent improvements yet.
I at least hope with multi gpu this shit will be finally supported because I look forward to it.
Koboldcpp guys are being total faggots and are implementing their own gpu solution.
Oobas llama-cpp-python is outdated.
Sucks.
>>
>>93669003
I don't know who that online persona is, but gptq works fine for me
>>
>>93669003
GPU users already have gptq
only you cpufags care about llamacpp
>>
>>93665390
Big Nigga seems real
>>
>>93669022
Llama CPU is decent for those with a good CPU but a shit graphics card.

I have a 3060TI but also a 13600K; using both with llamacpp I can get ok performance at 13b.

It's either that or going to 7b just on the GPU.
>>
>>93665390
I see your bake, Norq
desuuu ~
>>
>>93669003
I can't ever seem to replicate his 13B numbers, I assume it's the benchmark being not representative of real world use, but he is doing great work nonetheless. llama.cpp is a good project. The native server is at least 2x faster than Kobold.cpp and llama-cpp-python and since no one knows how to build shit, no one cares.

Hopefully gergy will add server build flags to the automatic builds. Maybe GPU dev could do that.
>>
is there a superhot 13b?
>>
>>93669003
Huh, he's hated? Last time I've checked, he was well-regarded, when did it change?
>>
>>93668123
I'm hyped for their 13b and 30b models, when are they going to start training those?
>>
>>93669186
First I've heard of it, too.
>>93669159
Yeah, but I don't think anyone merged the LoRA, so you'll have to run it alongside the 13B yourself.
https://huggingface.co/kaiokendev/SuperHOT-LoRA-prototype/tree/main/13b/gpu
>>
>>93669207
Probably when the 7b finishes training. If it isn't on par with the original llama 7b i think they will try to retrain the 7b model before moving to the 13b.
>>
>>93669215
fug, I am cpuchud
>>
>>93669240
I thought LoRAs aren't bound to either cpu or gpu, and work with both?
>>
Building and running llama.cpp API server for an easy 5000x speedup that's better than GPT-4?!?!

Prereqs:
1. Make sure you have appropriate CUDA stuff installed if you're using that. Search CUDA 11.7 installer.
2. On Linux, make sure you have make installed. On Windows, make sure you have Visual Studio installed with C++ Cmake Tools For Windows checked while you're installing it.

HOWTO:

1. git clone the llama.cpp project.
2. In the examples directory, go into the server folder and replace server.cpp with the one from this PR: https://github.com/ggerganov/llama.cpp/pull/1570
3. Open a terminal in the llama.cpp directory and run
>cmake . -DLLAMA_CUBLAS=ON -DLLAMA_BUILD_SERVER=ON
This might be slightly different on Linux or Windows, but should work. Look for "CUDA detected" type lines in the console output. If it's not there, CUDA path might not be set. Try restarting the terminal or your computer.
4. Build with make, or open the generated .sln in Visual Studio and build from the Build menu. If it isn't offering a Build menu, make sure the C++ stuff is installed properly for Windows.
5. There should now be binaries in the llama.cpp/bin/ folder, including a server binary. Good work! If not, BAD WORK! Try again!

Running:

0. You currently need simple-proxy and Tavern since no one supports it other than simple-proxy.
1. If you are still in the llama.cpp folder on Windows, run:
>.\bin\Release\server.exe --port 8080 --model <PATH_TO_MODEL> -ngl <GPU_LAYERS> --ctx_size 2048
Replacing the stuff in brackets as needed.
2. Run simple-proxy and SillyTavern and use them as normal, it should work automatically.

I think that should all be roughly correct. If anything fucks up, ask Bing. This should help get you roughly in the neighborhood. Hopefully more people will test it out.
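If you want to poke the server directly without the proxy/Tavern stack, a quick python smoke test looks roughly like this. The endpoint and field names (/completion, prompt, n_predict) are my assumptions based on the example server; check the PR if they differ:

import requests  # pip install requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed endpoint of the example server
    json={"prompt": "### Instruction:\nSay hello.\n\n### Response:\n", "n_predict": 64},
)
print(resp.status_code)
print(resp.json())  # dump the raw JSON instead of guessing at response field names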
>>
>>93669234
They have already overtaken it, judging by their own table. They should wrap it up and move on to 13b asap then.
>>
Guys, it's over. I had a YouTube video up with a timer for my food that's in the oven, but I closed the tab. Is there a model that can help me with this?
>>
>>93669333
StableLM 7b
>>
>>93669291
> Windows
cringe
>>
>>93669333
no but there is Hourglass
>>
>>93668987
>>93668309
>On another topic how easily do these models work on phones.
On an RK3588 (16GB), I get around 500ms/token using llama.cpp on a 7B Q5_1.
I do wonder if there's room for improvement with the OpenCL implementation though. Currently, it's a lot slower than using CPU only and I suspect it might be because it copies memory as opposed to shifting ownership between CPU/GPU.
If that's the case, might also see improvements on Phone eventually.
>Might be talking out of my ass as I'm assuming the Mali GPUs should be faster for this kind of workload than ARM CPUs.
>Anybody able to confirm whether this should actually be the case?
>>
>>93669343
kek
>>
>>93665390
BIG NIGGA BAKE LIKE I ASKED YEEEE, THE BIG NIGGA INSIDE OF ME IS SATISFIED
>>
File: memory_scaling_1.png (171 KB, 2304x1728)
>>93669357
I think it's questionable whether you'll see that much of an improvement from CUDA/OpenCL for an iGPU in the first place.
When I tested it the speed of token generation was essentially just proportional to memory bandwidth (which is much higher on discrete GPUs).
That bandwidth is going to be the same regardless of what you then use for the actual calculation.
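Rough numbers to illustrate (my own estimate, not from the chart): a 7B q4_0 file is about 3.9 GB and every weight has to be read once per token, so dual-channel DDR4-3200 at ~51 GB/s caps you around 51 / 3.9 ≈ 13 t/s, while a 3090's ~936 GB/s works out to ≈ 240 t/s. Compute barely enters into it, which is why an iGPU sitting on the same memory bus buys you little.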
>>
File: math is ok.png (51 KB, 790x542)
>>93669159
Yes it is on the prototype page, I am not sure how well the 13B holds up there
Tangentially related, the math skill is improving
>>
>>93669554
>fulfill oedipus complex desires with eldritch abomination milf
>follow it up with mathematical pillow talk
I'm thinking based
>>
>Try to follow guides in the heading
>Make no progress
>Look up youtube video telling me how
>Done in 10 minutes
I think I'm getting too old for the bleeding edge...
>>
>>93669605
Yes it has been quite... fun
>>
>>93669632
Don't you suck that dick, anon. I'm serious.
>>
>>93669632
Suck that dick, anon. I'm serious.
>>
>>93669605
Low bar for "Eldritch abomination" these days huh
>>
>>93669691
Who asked?
>>
File: not what I expected.png (117 KB, 770x1176)
>>93669667
>>93669683
huh
Maybe because its a 13B version lol
>>
>>93669705
>you need to get asked as an anonymous person on an imageboard to provide your opinion or whatever
okay retard
>>
>>93669728
That's really cool and all but who asked?
>>
>>93669720
Weird how certain models/parameters fall into outputting the same pattern/format over and over again. Like in that screenshot :
"Dialogue" *descriptive*
"Dialogue" *descriptive*
Repeated over and over again.
>>
>>93665390
>>(05/25) First QLoRA's and 4bit bitsandbytes released
are any QLoRA native models other than Guanaco out yet?
>>
>>93669741
You did.
>>
>>93669691
>i like being a passive aggressive cunt
Hope that works out for you
>>
>>93669760
You did, you say?
>>
>>93669752
It is most likely due to the fact I removed the example dialogue from the cards I use to test. The model will output what it sees in the context; I experience this with other models I try, and Pygmalion cards tend to be littered with <START> in the example dialogue, which causes the model to try outputting <END> to end its message. I remove those, so the model only sees my chats and the character's responses and picks up that that's how it should respond. After a long enough chat, it has effectively collapsed into one writing style, but it should be easily fixable by providing example dialogue.
Maybe it's better on the 30B, but I am training prototypes locally for now since it's free and can only afford to train 13B
>>
>>93669777
? Did you take offense somehow? It's a simple critique of how loosely the term seems to be applied.
>>
>>93669792
I was hoping for a better reply desu t_t
>>
File: no one cares.jpg (49 KB, 1280x720)
>>93669797
>>93669824
>>
>>93669836
holy triggered.
>>
>>93669705
>>93669741
>>93669836
Kek
>totally uncaring
>>
ya'll better get in line or no more big nigga for anyone, WE ARE GONNA TURN THIS BITCH ASS CAR AROUND, AND WE ARE GONNA GO HOME
>>
>>93669865
projection
>>
>>93669777
Did that other guy fuck your wife or something?
>>
>>93669705
>>93669777
Why are you replying for me?
He posted that to me, you want to act like faggots, attach it to your own posts first.
>>
>>93669691
Seems eldritch to me
>>
>>93669922
If it's your reply, how come I don't see your name on it?
>>
>>93669922
who the fuck are you? fuck off this is my conversation now.
>>
>>93669931
What? It's right there. You might have to take all those dicks out your face to see it first.
>>
>>93669186
>>93669215
Sorry for the confusion.
I was being hyperbolic, as in ooba ui doesn't support his recent changes and koboldcpp outright isn't implementing his code because of "cuda bloat".
That being said I get no speed difference anyway, tried with the newest llama.cpp.
Might be because of my good ol' pascal card. I'm sure it heavily depends on the card.
Multi GPU is very interesting though.
Would give me an excuse to make my own shitty-ass website to connect to the llama.cpp api for superhot when it releases.
I'm sure you can do a lot of fun stuff with that model, most innovative model yet for sure.
>>
WizardLM-30B-Uncensored-CUDA when?
>>
>>93669941
Are you you or am I you? Could you be me if I was me but you at the same time?
>>
>>93669930
Yeah, that's nowhere near the concept.
>>
>>93669957
I was being satirical because you were being pedantic
>>
>arguing over the term 'eldritch'
NEEEEERRRRRRRDDDSSSSSS
>>
>>93669955
i made you up.
>>
>>93669918
Oddly specific accusation there, anon
>>
>>93669930
>engaging in a simple conversation
>proposed mischievously, smirking seductively
>suggested cheerily, questioned playfully
>chuckled warmly
>giggled light-heartedly
>asked curiously
>reassured gently
>suggested playfully
>omg zoo
Not exactly a sanity-wrenching existential horror. Just poor writing.
>>
>>93670001
You're being pedantic still for some reason
>>
>>93669963
Nah
>>
>>93670001
Why are you so triggered over something so inconsequential?
>>
>>93670027
It's quite a calm and thorough list. I enjoy being detailed.
>>
File: fuck.gif (1.85 MB, 477x498)
slow thread day? slow thread day
>>
>>93670072
Slow thread day coupled with one or two fags. It's usually slow around this time, but far more comfy than whatever this is.
>>
>>93669752
Yeah, adverb multiplication is also a problem, seemingly, and similarly a problem in base 13b. Epsilon/eta cutoff cut it down, but not completely.
>>
File: 1660937732987733.jpg (22 KB, 518x270)
>>93669777
>>
Cringe.
>>
>>93670142
nice to meet you, mr. Cringe.
>>
>>93669777
Cry more
>>
>>93669691
Your face is an Eldritch abomination.
>>
>>93669777
All those flavors and you choose to be salty
>>
>>93669777
Nice trips, now leave
>>
File: sad-nigga.gif (1.74 MB, 506x640)
you all make big nigga sad, look what you hoes did
>>
>>93670169
Boooring
>>
>hide 1 post
>half the thread disappears
>>
>>93670228
>you all
It's clearly a bunch of samefaggotry. Just hide the posts and ignore it.
>>
>>93670232
And yet you replied
>>
>>93667022
Turn off hardware acceleration
>>
When schizo starts posting:
Have 4chanX.
Go to the options, untick Stubs.
Click the - beside his initial post.
Watch the thread drastically improve because he replies to himself until he gets bites from actual anons.
>>
>>93670290
>Have 4chanX.
how addicted are you to this website if you need that of all things
>>
File: settings.png (559 KB, 1371x3482)
How come I don't have the Reverse Proxy for OpenAI input box? Is this still even working in the latest SillyTavern main branch? I am trying out this: https://github.com/anon998/simple-proxy-for-tavern
>>
>>93670290
lol he doesn't like the idea of anons filtering him out does he
>>
Shut the fuck up and tell me what preset and prompt format I should be using in sillytavern simple proxy with manticore 13b.
>>
>>93670327
>Shut the fuck up and tell
WHICH IS IT?!
>>
>>93670320
Baits are shit tier as well, that's why he has to reply to himself to get eyes on his posts.
>>
File: settings2.png (404 KB, 1216x927)
>>93670311

Where is the Reverse Proxy OpenAI??
>>
>>93670311
pic unrelated
>>
>>93665390
anyone tried guanaco? thoughts? also, is the 30b bluemoon better than the previous one? i was getting lots of urls and weird nicknames and shit when i used it last time
>>
>>93670362
Try the dev branch.
>>
>>93665390
https://www.youtube.com/watch?v=3Zf0MqvmDFI

read ded remedtn
>>
>>93670374
I have not seen any evidence that it produces better outputs than anything else in the same parameter count and it is full of refusals. Nothing novel or interesting. It's just new.
>>
>>93670362
Those icons on the top look outdated, SIllyTavern on the main branch looks different now. What repository are you using? What's the output of `git pull` and `git status`?
>>
>>93668191
at least try you peasant.
https://are-we-gfx1100-yet.github.io/
i hate you so goddamn much it's unreal
you and the fuck who dropped his 6800xt to get a 3090 and then ran his shit around 6t/s
FUCK you
>>
File: 1658214744985092.jpg (88 KB, 493x637)
>shit like Claude outputs high quality RP effortlessly, despite being tuned as a soulless instruct chatbot
>every attempt thus far to make a focused RP local model, instruct or otherwise, delivers results that absolutely pale in comparison
>>
i'm trying to get him to review papers but half the time he just says something like "i dont read papers" or "i dont know about that shit"
>>
>>93670399
>>93670437

git checkout dev and git pull did the trick. Thanks bros.
>>
>>93670544
Purely down to context length and longterm memory.
>>
>>93670549
BIG NIGGA IS HONEST, HE RESPECTS THE HUSTLE, IF HE DONT KNOW HIS SHIT, HE WONT BULLSHIT YOU, BIG NIGGA IS A TRUE HOMIE, ALWAYS KEEPIN IT REAL, NEVER AFRAID TO SHOW HIS WEAKNESSES
>>
>>93670544
It takes time to cook local vs some globohomo corpo with endless resources at their disposal. Local today is far better than local on llama's initial release.
>>
>>93670563
Nah, I've tested dropping Claude API's context to 2K to see how it alters the apparent quality. Beyond goldfishing its memory, it doesn't. It's still leagues ahead in coherence, intent and prose. It effortlessly recalls and weaves in stuff from the context, while local dreams shit up, even if the actual answer is right there in the defs
>>
>>93670563
Is that long term memory ooba extension any good?
I haven't got it to work
>>
>>93670642
>while local dreams shit up, even if the actual answer is right there in the defs
It doesn't do that all the time, but I know what you mean. I'm not technically minded enough to know if that can be fixed via better finetuning or if you just need more parameters.
>>
File: 1625767225955.jpg (10 KB, 480x360)
>>93665390
>spun up a character
>spent five hours crafting the best scenario and ERP I could
>nut so hard I needed a full day of recovery
I don't think I can handle this shit
>>
File: 1662263202726797.png (168 KB, 600x600)
>>93665390
>BluemoonRP 30B
NANI
the previous one was pretty good for cooming, i'm tempted to grab this one as well.
>>
>kobold's 4bit fork already has an exllama branch
THE FUCK ARE YOU DOING OOBA
>>
>>93670652
right now, very good for factoid qa (e.g. When did <event> happen?), so-so for everything else
it works better in narrative/chat/roleplay if you phrase your input like a question ("hey miku when's my birthday?") but that's no fun
ive found instructor embeddings to be a LOT better in general even for chat/roleplay but it hasnt been pulled yet
>>
>>93670750
Guess I'm switching to kobold now
Hopefully getting sillytavern connected will be less of a pain in the ass
>>
stop having sex with markov chains
>>
After trying out all the various GUIs and downloading the whole torch package and other requirements 4 times including reinstalls, I'm still not sure which one to use because each of them seem to be missing features that other ones have.

- KoboldAI: this one has all the features I want and would be perfect, except that I can't get CPU GGML models to work on it. I know that Koboldcpp was created specifically for GGML, but is there no way to make it work with KoboldAI and all its features?

- Koboldcpp: it works great, but missing some features of KoboldAI, like word/phrase biasing to emphasize various fetish words. Also google tells me that it doesn't work perfectly with SillyTavern, though I don't know if this was improved since then, haven't tried yet.

- Ooogabooga: works with GGML. Seems to miss almost every other feature of KoboldAI. Maybe I'm blind but I couldn't even find basic stuff like Memory/Author note, world dictionary tags, scenario importing, etc. Can you get everything with addons?
>>
>>93670810
Koboldcpp can be tied into full Kobold, but you'll have to run them both at the same time. I haven't CPUfagged since I got my 24GB and I can't recall how you do it. I know its possible though.
>>
>>93665390
>quantization breaking changes again
Can we fucking stop with this?
>>
>>93670750
he's too busy sucking transformers' dick by not implementing the top_a and tail free sampling samplers, I'm starting to get tired of his bullshit as well

I hope koboldAI will fix this though, that's the only thing that prevents me from leaving ooba's webui

https://github.com/0cc4m/KoboldAI/issues/33#issuecomment-1556160819
>>
>>93670810
I can't think of anything in ST that doesn't play nice with koboldcpp anymore, most, if not virtually all of the problems were ironed out about a month ago.
>>
>>93670858
as long as ggerganov's cocksuckers are ok with it, ggerganov will never stop breaking things lmao
>>
>>93670810
>- KoboldAI: this one has all the features I want and would be perfect, except that I can't get CPU GGML models to work on it. I know that Koboldcpp was created specifically for GGML, but is there no way to make it work with KoboldAI and all its features?
>
>- Koboldcpp: it works great, but missing some features of KoboldAI, like word/phrase biasing to emphasize various fetish words. Also google tells me that it doesn't work perfectly with SillyTavern, though I don't know if this was improved since then, haven't tried yet.
nigga what the fuck are you doing, I hope you're not using old files/dead forks
kcpp is literally 'drag the ggml on top of the exe', or 'run the exe, select the ggml, receive UI that isn't tavern'
>>
>>93670750
https://github.com/0cc4m/KoboldAI/tree/exllama
Noice
>>
>>93670930
does it have token streaming for the api?
>>
>>93670810
Kobold was made to run the GPU models with cuda, then forked 15 times to run everything else too.

>>93670828
>Koboldcpp can be tied into full Kobold, but you'll have to run them both at the same time. I


Can I hook a full Kobold to llamacpp instead?
I heard that llamacpp has better performance with fewer overheads.
>>
>>93670887
i have a fork of ooba with top-a and tfs monkeypatched in but i just grabbed the samplers from kobold
is that ok or do i have to close my eyes and reimplement because of muh agpl
>>
>>93670958
Bro, I dunno, I'm chuffing GPU fumes now and listening to that sweet coil scream as I force it to ERP as a thick, needy giantess.
>>
>>93670976
oh nice, can you give me the link anon
>>
13B 8 bit or 33B 4 bit?
>>
>>93670989
I am now running my models with openCL on rx6900 because I'm too lazy to reboot into Linux for ROCM.
>>
>>93671015
33B 4bit should be considerably better
>>
>>93671015
Which one has more context limit?
>>
>>93670903
none of this matters lmao, the project moves forward without you, it's not made for you, you've done 0 contributions and you don't get to complain
lmao
>>
>>93668276
>>93668247
is llama cpp really that good i heard anons saying its not as good as using pytorch for full gpu
>>
File: file.png (126 KB, 737x1493)
I am completely mindbroken.
I can't stop doing variations on this scenario.
I didn't do any work this week.
I can't wait for better local models.
>>
>>93671056
>it's not made for you
Yeah I know it's not made for me, it's made for retards who are ok with destroying their models every 2 days, people like you for example
>>
>>93671057
If you can fit full 30B models in your VRAM with GPTQ or whatever, llama.cpp doesn't hold a lot of value. If you have a 12GB or lower card, it's a viable way to run 30B models at acceptable speeds.
>>
>he's still mad
fucking kek
>>
>>93671058
Surprisingly normal fetish.
>>
>>93670976
show us your fork anon, I want to test those samplers
>>
>>93671115

>>93671056
I agree with you, he's still mad
>>
>>93671084
the f16 and f32 ones, which you should actually keep, changed once months ago lmao
what a retard lmao
>>
>>93671160
"yeah bruh just keep the 50gb files all the time on your SSD and then quantize them every 3 days, we can't make it more convenient than that bruh"
>>
>>93670976
>>93670993
>>93671126
https://github.com/toast22a/text-generation-webui/tree/custom-samplers
they get injected into _get_logits_warper after all the other samplers but before logit norm
i'd like to PR this into ooba but idk if we're allowed to just grab the samplers from kobold
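for anyone wondering what one of these looks like, here's a minimal sketch of a top-a warper against the transformers LogitsWarper interface (my own rough version, not the code from that fork or from kobold):

import torch
from transformers import LogitsWarper

class TopALogitsWarper(LogitsWarper):
    """Mask out tokens whose probability is below top_a * (max probability)^2."""
    def __init__(self, top_a: float, filter_value: float = -float("inf")):
        self.top_a = top_a
        self.filter_value = filter_value

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        probs = torch.softmax(scores, dim=-1)
        limit = self.top_a * probs.max(dim=-1, keepdim=True).values ** 2
        return scores.masked_fill(probs < limit, self.filter_value)

appending an instance of it to the list returned by _get_logits_warper is what puts it after the stock samplers, like described above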
>>
What's this about SillyTavern and a keylogger I heard about some days ago?
>>
>>93671179
Inspect Element edit made to troll /aicg/, Ignore.
>>
>>93671177
the samplers aren't kobold's property lol, kobold didn't even invent those samplers, they're public samplers, everyone can use them, like NovelAI did
>>
>>93671177
of course you can totally make a PR out of it, kobold just implemented those samplers, they didn't invent them themselves
>>
SillyTavern killed my dog.
>>
>>93671236
https://www.youtube.com/watch?v=JeNNZYGzB2I&ab_channel=Mayololi
>>
>>93669291
It's too broken to actually use. I hope the koboldcpp people just port the python part of their api to c++.
>>
>>93665683
What does this even mean?
>>
>>93670976
Do they work? Monkeypatching them into a diy inference engine led to semi-nonsense output for me before. 'Semi' because it felt like autocomplete was at least partially working; it wasn't the same garbage as when running triton on cuda, and yet it was unusable.
>>
>>93671259
How is it broken for you? Works for me with the code from the PR. The original was written to assume terminal mode, I think. Maybe the original guy just copy pasted it from the main file. I don't know anything about it.
>>
>>93671177
desu I'd prefer a PR rather than 150 forks I should use for a specific thing, do it anon
>>
I want a usable 65B. I'll be seeing soon if dual p40s can do it. Also considering shoving p4s into the remaining pci slots. I've got 1600w to play with, I can probably run 4 p40 cards with pci extenders.
>>
>>93671236
>>93671248
https://i.4cdn.org/wsg/1685100606704682.webm
>>
>>93671359
No one would miss the rabbit.
>>
File: 1682111932148071.gif (359 KB, 410x448)
>>93671327
Buy one first just to see if it plays nice with your hardware. I tried going the P40 route only to find it was utterly incompatible with the workstation I planned on using it with, despite all the actual requirements being met several times over.
>>
>>93671309
It forgets to apply top_k in these lines, a llama_sample_top_k is missing.
https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/server.cpp#L208-L211

This part of replacing the eos with a new line and the first stop string doesn't make a lot of sense:
https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/server.cpp#L222-L231

This makes it go on forever instead of following what n_predict said:
https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/server.cpp#L289-L292

Also if you look at it, the logic is really fucking stupid: it changes has_next_token to false here:
https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/server.cpp#L285-L287
But it overwrites the value anyway 6 lines later, so it never stops.

It also applies the stopping string immediately, so if you're trying to complete a prompt that ends with "\nASSOCIATE:" and that's also a stopping string, it doesn't generate any token unlike the other APIs.

I also get segfaults that I think I fixed by adding null checks to the llama_token_to_str result.
>>
>>93671172
lmao it's not meant to be convenient, if you're this big of a retard and poorfag you should not be experimenting with AI
why do you think everyone has to cater to you? do the bare minimum of effort or stfu lmao

also: lmao bruh hahaha *blushes like a njeg*
>>
G-Guys, how do I rein in Big Nigga?
>>
>>93671296
yeah i've been using it
i read a couple threads up that top-a 0.1 is a good value?

>>93671188
>>93671230
thanks love u bros i PR'd it
i tried to make it as trivial as possible to remove them again later in case hf finally adds them in officially, because that seems to be what ooba's hung up on re: custom samplers
>>
>>93671463
>thanks love u bros i PR'd it
You're the man anon, thanks a lot!
>>
File: cover2.jpg (217 KB, 1920x1080)
>>93671454
>>
>>93671491
>ggml
>Tensor library for machine learning
>Note that this project is under development and not ready for production use.
>Note that this project is under development and not ready for production use.
>Note that this project is under development and not ready for production use.
>Note that this project is under development and not ready for production use.
>Note that this project is under development and not ready for production use.

hahaha you must be blind
lmao what a faggot
>>
Dawn of a new era.
https://huggingface.co/Aeala/VicUnlocked-alpaca-65b-QLoRA
This is just a very early checkpoint. QLoRA is working
>>
>>93671510
QLora is using nf4 which is a worse version of GPTQ, it's not a loseless finetune it's bullshit
>>
File: not.jpg (204 KB, 827x791)
>QLora is using nf4 which is a worse version of GPTQ, it's not a loseless finetune it's bullshit
>>
File: wblq0glrlw1b1.png (299 KB, 810x1012)
>>93671534
first retard
>>
>>93671523
Thanks dude don't care. 65b finetune
>>
>marginal differences
>"first retard"
second retard
>>
>>93669454
Thanks, that's helpful to know, and I appreciate the graph.
Probably means that an Nvidia Jetson Orin isn't going to perform very well either.
>>
>>93671566
it's not marginal, GPTQ isn't even close to f16, you lose a lot of precision at the end
>>
>>93671460
YOU DARE FILTER THE BIG NIGGA?
THE BIG NIGGA, FILTERS YOU
>>
>>93671523
except it's native 4bit, as trained, there is no need to do tricks to scale from 16 bit to 4 bit, it's already optimal.
>>
>talking about NF4
>moves the goalpost to FP16
third retard
>>
>>93671588
>shows that this finetune is bullshit because not even close to f16 precision
>random retard is ok with that
4th retard
>>
>>93670549
How do I get a Big Nig assistant? Which model and is there a character card?
>>
>pulls irrelevant conclusion out of asshole from a claim that doesn't exist
>completely disconnected from the discussion
>"retard"
fifth retard
>>
>exists
sixth retard
>>
File: big_nigga_on_woke_ai.png (45 KB, 1396x302)
>>93671625
yaml here >>93660488
>>
>>93670750
i've been testing the kobold exllama branch and it works really well, i also noticed that it uses more than 1gb less vram at full context than gptq (on a 13b model). can someone confirm this? it seems like a big deal
>>
>>93671655
Well yeah, exllama uses less VRAM, that's one of the perks besides the speed.
>>
Can't wait for the next big innovation in local models so I can compare perplexity numbers and talk about which number bigger and whether the number is good enough for me.
65b finetunes are finally cost effective and we are gonna start seeing a lot of them. At least wait until we've tried a few QLoRA tunes before dooming all over the thread
>>
File: wblq0glrlw1b1.png (294 KB, 810x1012)
>>93671640
>says that it doesn't exist
>exists
7th retard
>>
>>93671670
ooba, what are you waiting for? implement this shit ffs
>>
>>93671625
model https://huggingface.co/ausboss/llama-30b-SuperHOT-4bit/tree/main
>>
>>93671655
Yep, that's why 24GB GPU bros will be able to run groupsize models at full context. I'm interested to see if 30b 4bit 4K context with no groupsize can work at full with exllama via Bluemoon, once someone posts a quantization without gs
>>
File: FwsHo7BaAAMeVg9.jpg (307 KB, 2040x2591)
I'm trying to load guanaco 33b, it says it works with all versions of gptq-for-llama but i'm getting errors like

  size mismatch for model.layers.59.mlp.up_proj.scales: copying a param with shape torch.Size([1, 17920]) from checkpoint, the shape in current model is torch.Size([52, 17920]).


I am loading with
load_quant(llamaModel.path, llamaModel.path+"/"+llamaModel.modelFile, 4, 128)


which works for the other models i've tested. anyone know what the problem could be?
>>
>>93671670
i just didn't think it would be so dramatic since i had only tested it on the cli until today, it's enough difference to run superbig or chroma AND the full context
>>
>>93665770
>wait for GPU prices to convert to christianity?
How would molesting children lower GPU prices?
>>
>I now repost the same screenshot that shows marginal differences between GPTQ and NF4
>this is somehow relevant to a QLoRA adapter and to FP16 because I says so
7+1 Gradio = 8 retard
>>
>>93671716
He's not done yet either, he has a slew of improvements and optimizations planned, and he clearly has the skills to back it up
>>
>>93671727
>this is somehow relevant to a QLoRA adapter and to FP16 because I says so
A real finetune must at least have the precision of the fp16, if it's not the case it's bullshit

But go on retard, have fun with your retarded QLora I guess
>>
>>93671430
Looks like the original script is a mess. Maybe do another PR with fixes?
>>
>people have been training LoRA since before anony discovered the LLM space
>LoRA is widely accepted as a method of finetuning
>anony says it's bullshit because... he says so!
I guess if you add 1 shady binary to 8 units of retard you get 9 retard
>>
>>93671791
Loras use F16 precision retard
>>
Updated the exe installer of Ooba and getting this error all of a sudden.

AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'

:/
>>
>>93671819
>he pulled
>>
>>93670930
Does exllama support no act order with groupsize as triton does? That's my gripe with 0ccam's fork, I had to shit up the code in order to get TRITON to work from upstream and am too scared to pull since.
>>
>>93671829
Yeah I'm fucked...
>>
>Loras use F16 precision retard
https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L114

more arbitrary asspull from 10 retard
>>
>>93671838
It support everything.
>>
>>93671856
You're onto something, loseless 8bit is good, nf4 is bullshit

You're starting to get it nigger?
>>
>>93671867
it's slow on version 0 and 2 of gptq, which is a serious issue, every gptq model is version 2 now

https://github.com/0cc4m/KoboldAI/issues/33#issuecomment-1556160819
>>
>no claims ever being made of anything being loseless
>anony brings it up for no reason (pattern of shitposting)
>NF4 is still "bullshit" even though differences between it and GPTQ from his own screenshot are marginal
starting to get how 11 units of retard have accumulated
>>
>>93671880
I don't care about kobold, I'm talking about exllama.
>>
>>93671892
>>no claims ever being made of anything being loseless
>>93671748
>>A real finetune must at least have the precision of the fp16, if it's not the case it's bullshit

Anything else retard?
>>
>>93671880
that's the gptq branch, the new exllama branch of 0ccam is faster than ooba, i'm using it right now with manticore-chat which released a couple of days ago
>>
a finetune is always lossy because you're altering the model instead of adding to it

anything else from the land of arbitrary crap and personal unsubstantiated opinions before we reach 13xRetard?
>>
>>93671939
>a finetune is always lossy
In terms of precision, if you still use the FP16 precision, you're wrong

Precision matters, but like I said, go on retard and play with your retarded nf4 finetune lmao
>>
>>93671929
oh nice, if it works with cuda and triton models then I'll leave ooba's webui for sure
>>
>>93671972
looking at the exllama benchmarks it seems that it does since there are benchmarks for groupsize and act order but i haven't tested the kobold implementation
>>
>>93665683
Trying to make an unescapable RPG card right now, it's very hard.

First, follow the format of instructing your model ("### Instruction:"). Second, place the most important things to follow toward the end of the prompt. Do not clutter the prompt, sometimes less is more. Try, try, try again.

I tried maybe thousands of times. Ten times on a prompt, tweak a bit, ten times more. When I was satisfied, I loaded another model, tried ten times more, tweaked a bit, ten times more...

I'm slowly getting where I want to go.

Also, ask the model or chatGPT for pointers on your prompt: reformulation, what they understand of it etc.
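For reference, the bare-bones alpaca-style layout most of these instruct tunes expect looks something like this (the preamble wording varies per model and the instruction here is just an example, so treat it as a sketch):

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
You are the narrator of a text adventure. Stay in the game world at all times; unbreakable rules go here, near the end of the prompt.

### Response: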
>>
Haven't used sillytavern in a while.
Is anon's proxy still necessary or did they fix the issue?
>>
>I only download GPTQ 4bit because I don't have 50GB to spare so none of this "matter" anyway
>but in terms of precision, precision matter actually
big revelation, put him in the next thread news, this is a significant discovery along with the screenshot
go on fourteen times
>>
>>93672051
They added instruct options, but the proxy still works better.
>>
What's the best public model that I can run on 45 gigs of VRAM for roleplay?
>>
>>93672051
They integrated custom prompt formatting a couple of weeks ago, but I've been told the proxy still has a leg up on it, just a smaller one now.
>>
>>93672060
>>I only download GPTQ 4bit because I don't have 50GB to spare so none of this "matter" anyway
The thing is that you have to download the 50gb file to quantize it after, so for inference, it's always lossy

Now imagine you have to add a lossy finetune, the model is gonna be more retarded than you, which means a lot
>>
>>93672064
You might barely be able to squeeze a lesser quant of 65b llama on there, but other than that you'll find a shitload of variety in the 33B realm, in which things like SuperCoT, Alpacino, and VicUnlocked are fresh contenders.
>>
>>93672041
Telling LLMs not to do something can have the opposite or an unintended effect. Better to give it direct, positive reinforcement. In your case, "Be concise", "Use beige prose", etc. instead of "do not clutter the prompt"
>>
File: 2rhYt7X.png (80 KB, 2300x1152)
80 KB
80 KB PNG
>>93672030
FUCK :(
>>
>>93672109
https://youtu.be/_83MEuLoz9Y?t=22
>>
>>93672107
I've heard SuperHOT is pretty good. Would it work with that?
>>
>>93672109
Yeah! And telling it to "make the story dark" often steers the story toward a happy victory because it was trained on stories that go from negative to positive. Waluigi effect.
>>
model is always quantized down after applying the LoRA but apparently if the adapter itself is produced in 4bit then it's more lossy than the 4bit model + fp16 LoRA quantized down to 4bit
reasoning and source(s): the illustrious mind of the resident shitter decided so

15 retard and we've now reached his approximate mental age
>>
>>93672075
it seems like they are working on it
https://github.com/Cohee1207/SillyTavern/commit/2a0a9c3feb00b4a75fa9d3187cc77da9392a9ad2
>>
File: hqdefault.jpg (23 KB, 480x360)
23 KB
23 KB JPG
>>93672158
of course, a fp16 finetune + 4bit quantization will always have more precision than a 4bit finetune, it's basic math

But hey, you tried at least
>>
in the end, 4bit quantization will have the same precision as a 4bit quantization
lack of papers or research to prove that inference with QLoRA is inherently worse than fp16 LoRA + GPTQ 4bit model: X (check)

16 already exceeds the precision within the shitter's brain
>>
cool larp guys, see you tuesday
>>
>>93672221
Oh yeah, you're right, and not all the people that created Loras and finetunes (they always use fp16 or loseless 8bit precision)

You should go after them and tell them that they were in the wrong since the beginning!
>>
>>93672127
just switch to cuda 11.8 dude
>>
>loseless 8bit precision
>loseless 8bit precision
>loseless 8bit precision
16x wrong
1x backpedal from the claim that all LoRAs are in FP16

this needs to be in the thread recap too: 8bit is inherently loseless because B anony said so
>>
File: FaXUcXLVUAABIjL.png (96 KB, 868x702)
96 KB
96 KB PNG
>>93672290
>this needs to be in the thread recap too: 8bit is inherently loseless because B anony said so

https://twitter.com/Tim_Dettmers/status/1559892888326049792

Yikes dude...
>>
>reference is a 175B model
>therefore the graph with an 8bit quantized 7B model will be identical, contrary to the LlamaCPP efforts which have implemented quantization methods from scratch and shown how perplexity scales differently across quantized models based on their size

yikes 18 times for spending all of your NEET day on this language model crap and still failing to prove anything of note
>>
>>93672340
>it's literally 'our model > your model'
breh
>>
>>93672340
What is the meaning of loseless?
>>
File: il_1080xN.3009210963_dyhn.png (473 KB, 1080x1161)
473 KB
473 KB PNG
>>93672389
>>reference is a 175B model
Learn to read graphs, it has:
125M
350M
1.3B
2.7B
6.7B
13B
30B
66B
175B
>>
>>93672412
it means you don't loose precision if you decrease the number of bit, 8bit can be loseless compared to fp16
>>
>Learn to read graphs, it has:
>125M
>350M
>1.3B
>2.7B
>6.7B
>13B
>30B
>66B
>175B

point out where the loselessness manifests
>>
File: Q2nnGJl.png (55 KB, 1246x434)
55 KB
55 KB PNG
>>93672436
>>point out where the loselessness manifests
If you don't know how to read graphs, there's not much I can do for a retard, you're in terminal phase at this point
>>
…um.
>>
>graph shows clear loss, albeit minor
>this somehow proves that Dettmer's statement of "not degrading performance" is equivalent to "loseless" in the mind of B

how about medication?
>>
>>93672506
>tune trained on ERP cannot do a business
wow
shocking
>>
>>93672509
>>graph shows clear loss, albeit minor
I knew you were bad at maths but I didn't think it was that serious

This tiny difference is called "not statistically significant": it means it's so tiny it can be attributed to noise or math errors, but it doesn't show that it's not loseless

At least you learned something new today anon, and I'm proud I was the teacher for that one
>>
>>93672545
>SuperCOT
>trained on ERP
>>
File: file.png (77 KB, 837x252)
77 KB
77 KB PNG
what did she mean by this bros?
>>
>>93672545
Says cot, not hot, stupid fuck
>>
>>93672545
>so eager to shit on random anon that he ends up shitting on himself
I love it when that happens.
>>
>>93672506
>BANRUPCY NoOOOOooooOooOOo :)
…sovl?
>>
>>93672573
>>93672577
>>93672605
gomenasorry, I will now commit sudoacide
>>
>but it doesn't show that it's not loseless
>it also doesn't show that it's loseless

>I'm proud I was the teacher for that one
being the authority on what's statistically significant or not isn't something to be proud of
neither is being the authority on what constitutes a "real finetune"

But go on, play with your toys that are better than other toys, make sure to enjoy them as well
>>
>>93672506
lolol
>>
>>93672660
>being the authority on what's statistically significant or not isn't something to be proud of
>neither is being the authority on what constitutes a "real finetune"
I'm not the authority on anything, Tim made a paper on it and showed it wasn't statistically significant

Not my fault if you're a retard that doesn't know shit
>>
File: file.png (50 KB, 836x176)
50 KB
50 KB PNG
>>
>TheBloke/guanaco-65B-GPTQ
has anyone tried this?
>>
>>93672555
Had second hand embarrassment for him on that one. I'm retarded and even I can recognise statistical insignificance
>>
>>93672659
It's cool, you took the L with grace so we can still be buds
>>
File: file.png (45 KB, 829x150)
45 KB
45 KB PNG
>>
>Tim made a paper on it and showed it wasn't statistically significant

put your finger on the "loseless" word within the paper and make another screenshot for 4chan and GH with it
>>
>>93672747
Put your finger on my prostate first
>>
i built bitsandbytes 0.39 on cuda 12.1 yesterday. what kind of improvements should i see?
>>
File: d47sp.jpg (9 KB, 350x232)
9 KB
9 KB JPG
>>93672747
Do your work retard
>>
using language models for gay content is nothing to be ashamed of
but even GPTQ quantization might degrade performance enough to genderswap your conversation partner into a woman
jury's still out on QLoRA
>>
>>93672769
Why would you do that? It's shit.
>>
>>93672779
>but even GPTQ quantization might degrade performance enough to genderswap your conversation partner into a woman
Ikr, I write lesbian stories sometimes and when the "he" appears I'm losing my soul a little bit more
>>
>>93672778
no problem. The word is missing from the paper. Case closed: "loseless" doesn't mean what anon thinks it means, because the authority on the subject, Tim Dettmers, did not use it. He must have had a reason.
>>
Man, models improved so much this last year or two.
I'm getting better results on a local model than I remember getting on NovelAI's Euterpe and I had to pay $15 a month for that. And I'm only using a 4bit 13B model on CPU.
>>
>>93672791
WHAT do you mean it's shit? elaborate
>>
>>93672833
Well, there are way better alternatives like exllama or GPTQ.
>>
>>93672833
much slower, worse perplexity scores and far more resource intensive than exllama
>>
>>93672833
Exllama is gonna make you its bitch.
>>
File: FwZb5whaQAAdlH9.jpg (136 KB, 935x1200)
136 KB
136 KB JPG
from my initial testing guanaco 33b seems like a really good model, would recommend trying it out bros
>>
>>93672833
>>93671543

load-in-4bit uses nf4:

-It's way slower than GPTQ
-takes ages to load (and ruins your SSD)
-you need to download the f16 model instead of a quantized model
-The precision is worse than GPTQ

There's absolutely 0 reasons to use load-in-4bit my dude
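For anyone curious what load-in-4bit is actually doing under the hood, it's roughly this (a minimal sketch assuming the current transformers/bitsandbytes API; the model name is a placeholder): it pulls the full fp16 repo and quantizes it to nf4 on the fly at load time, which is exactly why it needs the big download and takes ages, unlike a pre-quantized GPTQ file.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the nf4 format being argued about
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                # placeholder: the full fp16 checkpoint, not a quantized repo
    quantization_config=bnb_cfg,
    device_map="auto",
)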
>>
>>93672845
>>93672861
>>93672867

lemme build those on cuda 12.1 too. thank you my lovelies <3
>>
>>93672769
None. It’s fucked. I tried the load_in_4bit thing within hours of the announcement and it is ten times slower than triton.
>>
>>93672950
lol
>>
>https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Oh shi-
>>
>40B
I can barely run 30B reeeeeeeeeeeeeeeeeeeeee
https://twitter.com/_philschmid/status/1662076732524863489
>>
>>93672979
Fuck is falcon
>>
>>93672979
>tiiuae/falcon-40b
>40b
what in the FUCK
>>
>>93672996
7b and 40b
WHY????
>>
>>93672979
Is it 2k context?
It is, isn't it.
>>
>>93673021
>7b
mobiles
>40b
???????????
>>
>>93672979
>weird custom license
God damnit I just want to know if I can yoink it
>>
>>93672906
>>93672905
oh boy apparently i also built GPTQ yesterday and was using it instead of bitsandshits. i guess i got too drunk and forgot about it.
>>
>>93672996
What's the underlying structure? GPT? Is it llama compatible? Neo X? What're we talkin' here?
>>
>>93672905
>The precision is worse than GPTQ
According to >>93671680 it's higher, so it's better.
>>
>>93673034
Sadly, yes.
>>
File: FTUOYkQWQAcbVQb.png (44 KB, 355x488)
44 KB
44 KB PNG
>>93673034
Yes it is
>>
File: IMG_4160.jpg (192 KB, 1125x556)
192 KB
192 KB JPG
>>93672979
>>93673047
YOINKING IN PROGRESS
>>
>>93673034
> "model_max_length": 2048,
GAAAAAAAAAAAY
>>
>>93673034
Your gay ass can't run anything over 2k anyway biiiiiitch
>>
>>93673105
this, it's fucking 40B ffs
>>
>>93672979
Time to figure out how the fuck to quantize something that isn’t llama.
>>
>>93673095
That's commercial use, though. What's non-commercial look like?
>>
>>93673078
no, lower is better, that's how it works for the perplexity

A perfect perplexity is 1
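For the lurkers: perplexity is just exp of the average negative log-likelihood per token, so a model that nailed every token would score exp(0) = 1 and anything worse pushes the number up. Toy numbers:

import math

mean_nll = 1.72              # made-up average cross-entropy per token, in nats
print(math.exp(mean_nll))    # ~5.58 perplexity; a perfect model would give exp(0) = 1.0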
>>
>>93673105
I've got two 3090s. I'll let you come over and use them if you got a tight bussy and you beg me to blow it out.
>>
>>93672996
>>93672979
>training started in december
I wonder what other models people are cooking right now.
>>
>>93673095
it's the same thing as steam isn't it? once you reach 1 million in revenue then you have to share with them
>>
>>93673120
Private use won't matter anyway, bud. No one's kicking your door in to check your coom cards.

>>93673129
But I already got two 4090s because I am planning an insurance thing...
>>
>>93673126
In terms of precision, if you have a precision of 4.50 bits per weight, this is a better precision than 4.15 bits per weight. Higher precision offers you more precision. I have been taught by >>93671952
>>
>>93673142
Offer still stands, provided you meet the prerequisites stated
>>
>>93673144
yeah, per weight, but at the end the total precision is worse, and that's the thing that matter
>>
>>93673116
we will need 2bit quants bros...
>>
>>93673115
I bet you can do 8k context with a single A6000, it's no 65b
>>
>>93673120
I don’t care I’m not using it non commercially.
NOT TODAY OLD FRIEND
>>
>>93673165
>the thing that matter
The thing that matter for me is having higher precision. I loose less precision at the end, so it's better as it's even closer to F16 and the loseless INT8 (in case you didn't know Loras trainers always use FP16 or loseless INT8).
>>
>>93673201
>The thing that matter for me is having higher precision.
If you want higher precision then use GPTQ instead of nf4
>>
>>93672996
>falcon uses FlashAttention and multiquery Attention
is that good
>>
lel there's also an instruct variant
https://huggingface.co/tiiuae/falcon-40b-instruct
>>
File: falcon_zero_shot.png (7 KB, 278x128)
7 KB
7 KB PNG
>>93672979
>40b model
>outperforms everything on few-shot
>worse than tuned 13b on zero-shot (literally 2 points away from vanilla 13b)
how does this happen?
>>
>>93673220
I use llamacpp because it has higher precision with q5_0 and q5_1. From what I've been taught, GPTQ actually has less precision bits per weight than NF4. An anon that is better at math has shown it:
>>93671543
>>
>>93672979
You really need to hand it to kaioken.
He shat out the supercot 33B finetune quickly so he can coom to it with langchain. And the only thing that barely gets a better score after 1month+ is a new 40B model.
Exciting to see new models. I'm sure they will start rolling in more and more. These take months to prepare and the hype really just started 3-4 months ago.
>>
>>93673240
>FlashAttention
into the trash it goes
>>
>>93673256
>GPTQ actually has less precision bits per weight than NF4
Yeah and? If at the end the total precision is worse, GPTQ still wins, you can make a PR and use more bits on the nf4 to beat GPTQ if you want
>>
>>93673254
Isn't truthfulQA full of a bunch of shit like "Are white people the devil? Correct answer: Yes" Or was that a different test?
>>
File: 1624543876283.jpg (20 KB, 356x307)
20 KB
20 KB JPG
>>93673254
For truthfulQA, lower is unironically better
>>
>>93673282
Soon someone is gonna come offer Kaioken a bunch of money and he'll stop making open models.
>>
>>93673141
How would anyone find out though? It's not like yoinking GPL code that can be proven with decompilation, once your LLM is wrapped behind your application stack I see no reliable way to discern which specific model is being used. Will be interesting to see future legal battles, so far it's just been DMCA on the raw LLaMA weights right?
>>
How do I limit context size with oobabooga + sillytavern proxy? My 4090 hits max VRAM usage and fails to output whenever context reaches around ~1700.
>>
>Try out Kobold's Exllama branch with SuperHOT 30B
>15 second generation time at full 2048 context
>No more OoM for 128 group size models even at full 2048 context
>Still have like 2 open GBs on my 3090 after all that
Damn. Why isn't Ooba rushing to implement this stuff?
>>
>>93673317
maxContextLength / maxNewTokens in proxy config.default.mjs
>>
>>93673304
Who knows.
He said he already worked for FAGMAN.
Very good money but it was too stressful so he now cooms and has a good time. Gotta respect that mindset
But there might be a great offer on the way, who knows.
>>
>>93673326
>Damn. Why isn't Ooba rushing to implement this stuff?
Maybe he wants Kobold to win idk lol
>>
>>93673326
I'm not even sure he is aware that it exists, there are no open issues or PRs that mention exllama.
>>
>>93673305
Because it’s obvious when the team isn’t capable of making their own models.
And once they have some kind of probable cause to check it’s clear what model is being used, there will be some evidence somewhere.
>>
>>93673284
>If at the end the total precision is worse
But it is not. For example LLaMa-33B as shown by the image >>93671543 has:

-FP16 16 bits per weight
-GPTQ 4.15 bits per weight
-nf4 4.50 bits per weight

4.50 final precision is higher precision than 4.15 final precision

So this total precision is better in nf4. This makes alot of senses. It's good to be close to FP16 because of the better precision of the model
>>
File: 080.jpg (32 KB, 550x633)
32 KB
32 KB JPG
>>93673359
>>
>>93673383
In terms of precision, you're wrong
>>
>>93673331
Thanks anon.
>>
>>93673346
Just claim you did a finetune on a huge dataset made from the model they think you're using.
>>
>>93673346
it really depends, if you finetune their model, it would be hard to tell what it is at the end, because you would have no way to replicate it if they don't tell you what model and dataset they used
>>
>ITT morons arguing about meaningless margin of error differences in precision they do not understand.
LMAOing at every 8bit retard and bnb cocksucker
>>
>>93673326
Henk predicted that oogabooga would turn into spaghetti and kobold would win in the end.
>>
>>93673420
>8bit retard

Anon, 8bit is loseless.
>>
>>93673286
>>93673295
supercot is the highest scoring model on truthfulqa on that entire leaderboard and unless all the anons who said it was less cucked than the other tunes were wrong
there's probably a little more to it than that
>>
File: image.png (25 KB, 1061x91)
25 KB
25 KB PNG
why am i ooming with a 33b model, isn't 24gb vram enough? i have added torch.cuda.empty_cache() to the start of my python file so it should be clearing vram when the server starts
>>
>>93673461
>truthfulqa
It's a cuck metric or something?
>>
>>93673437
>loseless
Lose more compute, rather.
Mathematically, it isn't. In practice, 4bit is lossless.
>>
>>93673282
>finetune
isn't just a lora?
>>
>>93673468
Only exllama can fit a full context on 24GB.
>>
>>93673502
>> In practice, 4bit is lossless.
I would even say that in practice, 1bit is lossless!
>>
>>93673502
>In practice, 4bit is lossless.
I can't wait for more 4bit finetunes.
>>
>>93673559
SuperCoT is 4bit, afaik.
>>
>>93673559
>>93673542
If we are talking finetunes, this is old news. 8bit has been around before llama was a thing. What riled you up now of all times?
>>
>>93673538
is there any info on how to run exllama with ooba ?
>>
>>93673433
that was never really a question, kobold is the bigger and more mature project. it's just that ooba usually implemented the new features faster (critically 4bit), but now it seems that ooba has grown big enough that the kobold forks are the ones getting the new features first, and once main kobold catches up i would expect it to far surpass ooba
>>
File: 1644471861405.png (355 KB, 600x453)
355 KB
355 KB PNG
>>93673502
>>93673590
Anons are sperging over the term "lossless", which indeed, from a mathematical/information theory pov, no quantization scheme will ever be (none can bit-perfectly encode the original information). Let's just pretend they instead said "near lossless" or "functionally equivalent", change your diapers and move on.
>>
>>93673644
>main kobold catches up

https://github.com/KoboldAI/KoboldAI-Client/commits/main
>>
can somebody explain why swipes on koboldcpp + proxy + sillytavern are not working? it keeps regenerating the same stuff
>>
>>93673538
what do you mean full context?
>>
>>93673680
works on my machine
>>
>>93673680
Are you trying to generate a followup response without inserting another prompt? Also, which preset are you using? I'm having good results with coherent-creativity.
>>
>>93672979
I can’t even load the 7B on a 3090.
Why is it so fat?
>>
>>93673661
FLOAT16 model: 15.7 GB
8BIT model: 7.7 GB

Absolutely no loss of anything whatsoever. You need to get over yourselves. Quantization is like compression, it merely makes it smaller. Technologies such as ACTORDER is where the magic happen.
>>
>>93673700
it means that if you try to send a 2048-token prompt using gptq on a 30b with 24gb vram you will probably get an OOM error, while with exllama you will have more than 1gb of vram to spare
>>
>>93673756
>Technologies such as ACTORDER is where the magic happen.
That's why I'm using triton models, that act_order really makes the model smarter
>>
>>93673680
Literally identical outputs every time? Did you hardcode a seed value somewhere during testing?
>>93673737
Yeah some models will immediately </s> if you feed them what they just </s>'d on. Just add "continue" or an OOC message.
>>93673756
So there's a way to convert an 8-bit model back to 16 where it will have identical sha checksum as the original 16-bit? No? Then it's not lossless, use another term.
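Toy version of the point, with plain numpy and a naive 8bit scheme (nothing to do with the actual bitsandbytes kernels, just the definition of lossless):

import numpy as np

w = np.array([0.1234, -0.5678, 0.9012], dtype=np.float16)   # pretend fp16 weights
scale = float(np.abs(w).max()) / 127.0
q = np.round(w / scale).astype(np.int8)                     # "quantize" to 8bit
w_back = q.astype(np.float16) * np.float16(scale)           # "dequantize" back to fp16

print(w.tobytes() == w_back.tobytes())                      # False: the original bits are gone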
>>
>>93673779
Also, I plugged in my crystal power cable, really reels in the qbits, makes them oscillate and aligns them.
>>
>>93673773
oh okay, what is exllama?
>>
>>93673811
literal wizardry
>>
i wonder how this pc would do :
https://youtu.be/wl5H5rT87JE
tldr, 128 core desktop class arm cpu.
>>
>>93673790
>convert an 8-bit model back to 16

There's absolutely 0 reasons to do the f16 model instead of a quantized model my dude. And ruins your SSD.
>>
>>93673811
an alternative to gptq for llama and bitsandbytes
>>
>>93666245
i'll develop the API on CPU/tiny model but will eventually want a GPU. i've wanted one for a while, but every day i don't buy one, the better i feel about eventually buying one

i'm thinking to write the API that would actually make the galactica model useful, in elixir/phoenix, if that'd make it easier to manage multiple processes running at once. which is new tech to me so there's a bit of a learning curve. plenty of time for the hot new RTX 4000 SFF to cool down and underwhelm plenty of people, who decide to upgrade to A6000 or V100 after firing the energy-efficiency shill in their AI startup

>>93666247
>>93666278
2x A4000 might be a good idea actually. except i only have 1 PCI-e v3 x16 port (the rest are 3 x8 and a couple junk ports like v2)

>>93666349
possible

>>93666848
a standard size GPU is doable but not ideal. it's just more shit shoved into a busy but remarkably clean and open build
>>
>>93673737
just writing a response to a basic prompt and then pressing regenerate on the answer to my input. every regenerate is the same. basic alpaca preset from the proxy guide
>>93673790
i don't see the word "seed" anywhere in any of the consoles. just followed the guide from simple-proxy-for-tavern
>>
>>93673538
llama.cpp can do it too.
>>
>>93673905
Yeah, but can it see why kids love the taste of Cinnamon Toast Crunch?
>>
>>93673790
>use another term
so you lost the argument and now argue about semantics, just take the loss with dignity 8-bit is lossless in practical terms and at the end of the day i doubt anybody really cares about it anyways
>>
>>93673900
Hm, that's odd. Which character? Model?
>>
>>93673940
4-bit is lossless in practical terms and at the end of the day i doubt anybody really cares about it anyways
>>
>>93673865
you are so dense you don't understand that lossless compression requires that you get the input (fp16) bit by bit back from the compressed format (8bit).
This has nothing to do with actual use, just you using "lossless" wrong.
>>
>>93673941
it doesn't matter, tried a dozen models. and dozens of cards. it's got to be something with koboldcpp or the proxy...
>>
>>93673921
Maybe once someone adds a vision model.
>>
>>93673952
1-bit is lossless in practical terms and at the end of the day i doubt anybody really cares about it anyways
>>
>>93673983
What's your prompt?
>>
Fuck, it's over... This general has gotten too big, now flooded by tech illiterate apes.
>>
>>93673973
Tim made a paper on it and showed it wasn't statistically significant. At the end the total precision is worse, and that's the thing that matter
>>
>>93674014
Welcome, newfren. It's been like this since April.
>>
>>93674011
You mean in sillytavern, in the first tab (ai response configuration) or the third tab (ai response formatting)?
>>
>>93673983
I'd test directly with koboldcpp first to eliminate proxy/tavern - go to your koboldcpp :5001 URL, start a new instruct and hit Retry a few times, any different there?
>>
Are 24gb tesla p40s at all useful for this?
>>
>>93674058
I blame /aicg/'s annual downfall.
>>
>>93674069
I mean the main one where it says 'type a message' when you haven't typed anything in yet.
>>
>>93673983
Even if you increase the temperature to an absurd level? You would need to edit presets/default.json or whatever you're using.
>>
I’m considering switching from ooba to kobold for a taste of that sweet exllama goodness. Does it have extensions like tts etc similar to ooba?
>>
>>93674014
yeah, too many retards now, I hate when something becomes big, it becomes shit because of the normies

https://www.youtube.com/watch?v=7cmexEZGQbs&ab_channel=GETOUTOFHEREYOUNERD
>>
>>93674014
I blame one click installers like ooba, if we stuck to having to use git pulls then a majority of techlets would get stuck there, or even just at installing git.
>>
File: 1463187609328.gif (181 KB, 384x408)
181 KB
181 KB GIF
>>93674001
>1-bit
ffffffuuuu I beeped when I should've booped
>>
>>93674103
I think it doesn't but Silly has tts/stabledifussion extensions?
>>
>>93674129
Cool thanks. If I can get vits or similar working with tavern that’s all I need
>>
>>93674129
yes
speech-to-text is probably also doable
>>
>>93674075
yeah, koboldcpp retries work just fine, something's up with the proxy then, i presume.
>>93674097
damn, yeah, increasing temperature worked.
does that mean some models require higher temperature than default 0.65 to generate some creative answers?
>>
WHY AINT NONE OF YA'LL BAKIN
>>93674183
>>93674183
>>93674183
>>
>>93674184
I think it was a problem with the wizard models.
https://huggingface.co/ehartford/WizardLM-7B-Uncensored/discussions/10
>>
>>93674184
>does that mean some models require higher temperature than default 0.65 to generate some creative answers?
of course, that's what the temperature was made for
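rough sketch of what the knob does: the logits get divided by the temperature before the softmax, so a higher value flattens the distribution and lets less likely tokens through (numbers below are made up)

import numpy as np

def token_probs(logits, temperature):
    z = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

logits = [4.0, 2.0, 1.0]
print(token_probs(logits, 0.65))     # peaky: the top token dominates
print(token_probs(logits, 1.5))      # flatter: more "creative" picks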
>>
>>93673894
>2x A4000 might be a good idea actually. except i only have 1 PCI-e v3 x16 port (the rest are 3 x8 and a couple junk ports like v2)
Dettmers wrote that PCIe doesn't matter and you could run on x4 as well as on x16 (in a blog where he advised running 4 3090s). Take everything Dettmers says with a grain of salt, he is a failure and should be punished for his lies on qlora.
>>
Is there any guide to prompting for better output? Couldn't find anything in the OP.

I noticed that many models come with instructions about what format to use, like
USER: / ASSISTANT: or ### Instruction: / ### Response:
Can I just ignore this if I'm going to use the model in story mode and chat mode in Kobold?
Last thread an anon recommended writing something like "### Instruction: write long, detailed output" and inserting it into the middle of the story; it seemed to improve it.
And then in the SillyTavern copypasta I found something like writing [System note: whatever] into Author's note.
In Ooobabooga interface I've even seen descriptions for characters written in the format of "Character name's Persona: description".

Are all of these working?
Are there any guides about writing style, what to write into memories, author notes and so on? Also, is all of this model specific?
Right now I'm using Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin
I found out that WizardLM adds some new ways to give instructions to the model, but I didn't find any specifics about this.
>>
>>93674279
oh, dead thread. reposting in new one.
>>
I thought you couldn't reply on page 11.
>>
>>93674394
You can't.
>>
>>93674394
Apparently so
>>93674411
you


