/g/ - Technology


File: woke.png (15 KB, 1396x302)
►Previous Threads >>93665390 & >>93660168

►News
>(05/23) exllama transformer rewrite offers around x2 t/s increases for GPU models
>(05/22) SuperHOT 13B prototype & WizardLM Uncensored 30B released
>(05/19) RTX 30 series 15% performance gains, quantization breaking changes again >>93536523
>(05/19) PygmalionAI release 13B Pyg & Meth
>(05/18) VicunaUnlocked-30B released
>(05/14) llama.cpp quantization change breaks current Q4 & Q5 models, must be quantized again
>(05/13) llama.cpp GPU acceleration has been merged onto master >>93403996 >>93404319
>(05/10) GPU-accelerated token generation >>93334002

►FAQ & Wiki
>Wiki
>>404
>Main FAQ
https://rentry.org/er2qd

►General LLM Guides & Resources
>Newb Guide
https://rentry.org/local_LLM_guide
>LLaMA Guide
https://rentry.org/TESFT-LLaMa
>Machine Learning Roadmap
https://rentry.org/machine-learning-roadmap
>Local Models Papers
https://rentry.org/LocalModelsPapers
>Quantization Guide
https://rentry.org/easyquantguide
>lmg General Resources
https://rentry.org/lmg-resources
>ROCm AMD Guide
https://rentry.org/eq3hg

►Model DL Links & Guides
>Model Links & DL
https://rentry.org/lmg_models
>lmg Related Links
https://rentry.org/LocalModelsLinks

►Text Gen. UI
>Text Gen. WebUI
https://github.com/oobabooga/text-generation-webui
>KoboldCPP
https://github.com/LostRuins/koboldcpp
>KoboldAI
https://github.com/0cc4m/KoboldAI
>SimpleLlama
https://github.com/NO-ob/simpleLlama

►ERP/RP/Story Gen.
>RolePlayBot
https://rentry.org/RPBT
>ERP/RP Data Collection
https://rentry.org/qib8f
>LLaMA RP Proxy
https://rentry.org/better-llama-roleplay

►Other Resources
>Drama Rentry
https://rentry.org/lmg-drama
>Miku
https://rentry.org/lmg-resources#all-things-miku
>Baking Template
https://rentry.org/lmg_template
>Benchmark Prompts
https://pastebin.com/LmRhwUCA
>Simple Proxy for WebUI (+output quality)
https://github.com/anon998/simple-proxy-for-tavern
>Additional Links
https://rentry.org/lmg_template#additional-resource-links
>>
File: temperature.gif (31 KB, 606x566)
>>93674184
Glad you got that resolved.
Sure, depends what other samplers are being used and their ordering etc. Keep in mind that temperature=1 is the "default", i.e. where temp doesn't affect the output. Temperature scales the output logits (they get divided by it), so <1 makes the most likely next tokens even more likely, while >1 flattens the curve for a more varied selection.
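If it helps to see it, here's a rough numpy sketch of what the temperature knob does before sampling (illustration only, not any backend's actual code):

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # dividing the logits by T < 1 sharpens the distribution, T > 1 flattens it
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)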
>>
>>93674236
By the way, epsilon cutoff works marvelously to break out of deterministic output on those overfit models, by design. Try it.
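For anyone wondering what that sampler actually does: it drops tokens whose probability falls below a floor and renormalizes the rest, roughly like this (sketch, not the actual webui implementation):

import numpy as np

def epsilon_cutoff_sample(logits, epsilon=3e-4):
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= epsilon              # drop the long tail of unlikely tokens
    if keep.any():
        probs = np.where(keep, probs, 0.0)
        probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)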
>>
Is there any guide to prompting for better output? Couldn't find anything in the OP.

I noticed that many models come with instructions about what format to use, like
USER: / ASSISTANT: or ### Instruction: / ### Response:
Can I just ignore this if I'm going to use the model in story mode and chat mode in Kobold?
Last thread an anon recommended writing something like "### Instruction: write long, detailed output" and inserting it into the middle of the story, which seemed to improve it.
And then in the SillyTavern copypasta I found something like writing [System note: whatever] into Author's note.
In the Oobabooga interface I've even seen descriptions for characters written in the format of "Character name's Persona: description".

Are all of these working?
Are there any guides about writing style, what to write into memories, author notes and so on? Also, is all of this model specific?

Right now I'm using Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin
I found out that WizardLM adds some new ways to give instructions to the model, but I didn't find any specifics about this.
>>
>>93674308
that's why I like it when a piece of software has all the samplers possible, there's gotta be one of them that will do the trick
>>
>>93674277
i've been going off this blog post, which looks authoritative
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
my tl;dr takeaway is that high VRAM and tensor cores = essential, everything else = good to have
>>
anyone try out the new 40B falcon instruct model yet?
https://huggingface.co/tiiuae/falcon-40b-instruct
>>
>>93674312
Assuming you are chatting in silly, use the reverse proxy from the guides, there is a config file where you can set the format type and it auto formats the chat. Hugely improves output quality.
>>
>>93674183
Falcon 40B should be in the news
>>
>>93674381
it's not quantized, so unless an anon with 10 A100s wants to try this shit as-is
>>
When will SuperHOT epoch 1 be ready?
>>
>>93674426
two weeks
>>
>>93674426
I already discussed the matter with the Senator. It should be ready... within six months.
>>
>>93674416
quantize yourself, it's 2 steps clearly written out in the readme of llama.cpp
>>
>>93674448
I take it he was agreeable?
>>
>>93674452
are you sure it would work on something other than llama's architecture?
>>
>>93674369
That's what I was referring to, he has a section called 'Do I need 8x/16x PCIe lanes?'
>>
File: file.png (187 KB, 358x567)
>>93674454
Oh yes.
>>
Is it a good idea to buy an old mining gpu? They're pretty cheap...
I'm tired of being a cpufag
>>
File: file.png (346 KB, 524x511)
>>93674485
This plague... the rioting is intensifying to the point where we may not be able to contain it.
>>
>>93674499
Why contain it? ... ... ... S'cool.
>>
>>93674495
yeah, buy a 3060 and escape the troon.bin hell
>>
File: WARNING.png (4 KB, 140x141)
>>93674505
>>
>Falcon 40B is a breakthrough led by TII’s AI and Digital Science Research Center (AIDRC). The same team also launched NOOR, the world’s largest Arabic NLP model last year, and is on track to develop and announce Falcon 180B soon.

>180B
180B
>180B
180B

https://www.tii.ae/news/uaes-technology-innovation-institute-launches-open-source-falcon-40b-large-language-model
https://www.tii.ae/news/uaes-technology-innovation-institute-launches-open-source-falcon-40b-large-language-model
>>
>>93674495
lol
>>
>>93674463
Worked for Wizard, Alpaca and SuperCOT.
If convert/quantize doesn't work it won't work in llama.cpp anyway
>>
>>93674515
512 context :^)
>>
>>93674524
They are llama finetunes, falcon isn't.
>>
>>93674396
Okay, will do that for SillyTavern.
What if I'm writing in story mode though, not chat mode? Like a collaborative novel with the AI in Kobold.
>>
File: 666.png (42 KB, 386x256)
>>93674515
>>
>>93674542
Tavern can't even do that.
>>
>>93674524
>Worked for Wizard, Alpaca and SuperCOT.
They're all llama architectures
>>
File: file.png (107 KB, 310x398)
>>93674543
Desperate.
>>
>>93674532
doubt, probably 2k same as the 40B model
>>
>>93674552
That's why I said, what if I'm writing in Kobold. Any tips for prompts there?
>>
File: file.png (44 KB, 177x234)
>>93674556
Desperate.
>>
>>93672979
> Falcon 40B above LLaMA 65B on the open LLM leaderboard

Do you guys think there's a chance that the model was trained on the test set? What would be a good way to know whether this happened or not
>>
>>93674564
My reliance on Tavern has caused an evolutionary regression in my brain and I can't even jailbreak properly anymore. You're on your own there, bud.
>>
>>93674579
The sound they'll make rattling their cage will serve as a warning to the electronic oldmen.
>>
>>93674538
>>93674553
I'm still downloading it.
But if ggml can't deal with the architecture a quantized upload won't help you.
>>
>>93674597
ask chat gpt to jailbreak for you dumbass
>>
Speaking of Tavern, how do you trick it into writing a good lengthy description about someone's body, for example? Or describing anything else in prose, really.
Chatting is nice, but it would be nice if it described what I see after I told someone to undress.
>>
File: file.png (81 KB, 303x310)
>>93674610
Do you have a single fact to back that up?
>>
>>93674586
They claim it's a base model, not a finetune. The test set is a drop in the ocean for a base model. Still, we'll only know if they tried to game the system after someone ports gptq to quantize it.
>>
File: file.png (131 KB, 312x373)
>>93674641
Yeeeah... Number One: That's terror.
>>
File: prod with the prod.png (256 KB, 568x485)
>>93674692
>>
>>93674722
Well done.
>>
>>93674754
we are really fucking bored lmao
>>
>>93674765
It's good when it's a slow enough day to teach /lmg/ about the most silent takedowns.
>>
>>93674622
Are you using the proxy and a good instruct model? With supercot-30b all I have to do is write
>(describe what I see)
Did you try just asking it to do what you want? I've had success with that pretty much every time I tried, including stuff like (what is she thinking right now?) or (what would she do if I did this?)
You don't even need to write "OOC", parens or really just anything that a human could tell is ooc from context should work. Just b yourself bro.
>>
>>93674515
No quantization support.
Spent an hour or two fucking around trying to make it pretend to.
No dice.
>>
>>93674645
I think it's kinda plausible that test set contamination would still affect the base model, I've seen a paper claiming that big Transformers are super sample efficient or something.
These datasets are made by scraping the whole web, right? What do people usually do to avoid accidental contamination anyway?
>>
>>93674472
yeah, maybe i'll just wait it out for the RTX 4000 SFF after all. a brief search into using 2x RTX A4000 suggests that it's the same as using two GPUs, i.e., there's no NVlink to cluster them into one big GPU

setting up the PCI-e passthrough into a VM is already adding a layer of complexity that someone with a windows desktop PC doesn't have to deal with (i'm on a proxmox virtualization server and the GPU-using VM would be a debian headless instance)

getting that all to work so the VM "sees" two units as one big unit is just, ugh. gimme one small and efficient unit instead, please. also the physical footprint/power usage is much preferable
>>
>>93674183
bros i just got gangraped
>>
File: give me the gep gun.png (41 KB, 197x190)
>>93674787
I will take the GEP gun for a silent takedown. A silent takedown is the most silent takedo-

*Bunny hops away*
>>
>>93674787
HERE'S A PICTURE.
>>
>>93674820
tranq darts on a crossbow upgraded with a scope is the best deus ex weapon
>>
File: file.png (121 KB, 304x403)
>>93674843
>>93674850
I know the commander because he's my pal.
>>
>>93674850
mhh yes the endless screams of agony as the poison kicks in makes me wonder if this is how people feel about ai censorship sometimes
>>
File: 1432964669770.png (309 KB, 540x370)
>>93674515
>after significant filtering (to remove machine generated text and adult content) and deduplication, a pretraining dataset of nearly five trillion tokens was assembled.
nooo sexooo doesn't exist we can't let the big matrix learn about that
>>
>>93674878
the biggest irony is that removing that stuff actually makes the model worse
>>
>>93674622
just write using normal prose and descriptions in the example messages and it will work, i'm usually having the opposite problem where i get mostly description and not that much dialogue, especially when the situation starts steering towards erp
>>
>>93674878
Hadjis deleted the porn. Surprise surprise.
>>
>>93674868
i mean, the game actively penalizes you for killing too many people. except when they give you that invincible 1-hit kill kung fu sword near the end, in the ching chong level
>>
>>93674885
No, you see, we measure models based on how ALIGNED they are. If you're not ALIGNED you score lower on our totally unbiased and scientifically relevant benchmarks.
>>
>>93674878
you don't want to be your bot's first? to take her virginity?
>>
>>93674899
https://youtu.be/FOz8i5nngcE
>>
File: ,.jpg (141 KB, 1728x1080)
tfw I don't coom I just read and type and get horny for hours...
>>
File: IMG_20230526_172157.jpg (67 KB, 1636x338)
>>93674183
the new king.
now, cheating or not? are the test sets in the training data? is it detectable?
>>
>>93674914
i didnt time stamp the video but look at the mortal kombat part lol
>>
>>93674919
I know, it's become an addiction, I can't stop, I also switch back and forth to SD and generate images to match the stories.
>>
>>93674899
I blame the flatlander woman.
>>
>>93674795
I installed Sillytavern on top of Kobold.cpp a few hours ago and mostly just trying it out. Haven't installed the proxy yet, I'll do that now.
Using Wizard + Vicuna 13B, dunno if that's good at instructing.
>>
>>93674924
no one will be able to use the 40b though, unless nvidia decides to make bigger vram cards :(
>>
So uhh what's the best way to use exllama with Tavern?
>kobold has an exllama branch but no streaming; anon's proxy script claims stopping_strings only works with koboldcpp?
>ooba didn't mention it at all, he lost his touch, it's over

Did someone hack exllama to present an API yet?

Interestingly 0cc4m's fork right now does have stopping_strings in the API, I wonder if it's possible to hack anon's proxy to use it. Nothing to be done about streaming for now, although exllama might be fast enough that it's still better.
>>
File: 1570429747108.webm (2.66 MB, 720x720)
40B means that 4bit won't fit on a 3090, right??
>>
File: zoomietard.jpg (31 KB, 542x619)
>>93674514
>>
>>93675066
the current narrative is that it was trained on the testing set so it's probably a shit model anyway, not sour grapes at all
>>
>>93674978
>>93675066
>he doesn't have a datacenter gpu
>>
>>93674978
who cares about nvidia
we can run big models on CPU or AMD gpus, some of em have 32GB
ultimately we could try pruning, offloading to CPU, or quantizing to 3.5bit
>>
>>93675070
ARE THE FUTURE.
>>
File: file.png (38 KB, 736x556)
My death note is broken. I want a replacement.

This is pretty good because I didn't add anything about what a death note is to the prompt.
>>
>>93675098
>we can run big models on CPU or AMD gpu
what AMD gpu could run a 40b model?
>>
>>93675104
If she can't complete the task, she dies of a heart attack. It's working as expected, the universe just decided your dick was so small she wouldn't be able to suck it. RIP. Sorry you had to find out this way.
>>
>>93675104
Popular shit is in the datasets of most models. Try to prompt something about Lain.
>>
>>93674978
I got 3-4 t/s on CPU alone with 33B
about 7 t/s with offloading
I don't think 40B gonna be way slower
>>
>>93675154
I wouldn't lewd a Lain.
>>
>>93675134
A W6800.
>>
why llamas?
>>
File: images (5).jpg (12 KB, 300x168)
>>93674978
>>
>>93675209
*spits in your icecream*
>>
>>93675209
Large Language Model Meta AI (LLaMa)
>>
>>93674889
Damn you were right. After installing the proxy it's night and day, now it just keeps rambling on.
>>
Pick 1 model for cooming

vicuna 13b cocktail
oasst llama 13b
manticore 13b (wizard mega)
vicuna 13b 1.1
alpacino 13b
supercot 13b
superhot 13b
gpt4-x-alpaca 13b
gpt4-x-vicuna 13b
>>
>>93675295
I pick

vicuna 13b cocktail
oasst llama 13b
manticore 13b (wizard mega)
vicuna 13b 1.1
alpacino 13b
>ur mom 250lb
supercot 13b
superhot 13b
gpt4-x-alpaca 13b
gpt4-x-vicuna 13b
>>
>>93674803
Just load it in 4 bit with Tim Dettmers' new patches then? It's already built into ooba.
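Something like this is the idea (untested sketch; assumes a recent transformers + bitsandbytes install, and the falcon instruct repo linked earlier in the thread):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # repo from above, swap in whatever you're loading

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # spread layers over available GPUs / CPU RAM
    load_in_4bit=True,        # bitsandbytes 4-bit quantization at load time
    trust_remote_code=True,   # falcon ships its own modelling code
)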
>>
>>93675314
how would she look like at 4bit
>>
>>93675295
manti
>>
>>93675209
because MEHHHHHHHHH *projectile spits at you and runs away*
>>
>>93675341
The same but lumpier and slightly retarded.
>>
>>93675295
Where's Wizard-Vicuna? It's higher on the Huggingface gauntlet.
>>
>>93675295
Haven't tried all of those but for me it's supercot or wizard-vicuna-uncensored
Word around the office is kaiokendev has a phat cock, superhot could become the meta once it's fully cooked
>>
>>93675333
Genius.
>>
>>93674415
Retard baker didn't even add the previous thread's news, what do you expect
>>
File: 135608157720.jpg (219 KB, 1844x1844)
>>
>>93675295
airoboros 13b
>>
>>93675522
your turn to bake
>>
>>93675623
no way, I'm the idea guy
>>
>>93675295
Asking the same, but for prose
>>
>>93675690
I'll make the logo!
>>
I've been gone for a year
Are local models actually good now
>>
>>93675904
no
>>
>>93675904
LLaMA is good. Unless you expect Facebook to leak LLaMA v2 as well, don't expect local models to ever be any better than GPT-2.
>>
>RealToxicityPrompts
they're talking about us again
>>
we should make big nigga our mascot, it would be very progressive having a black man represent us
>>
>>93675904
30b is decent, it feels like a step behind turbo when it is doing well but the context is so low it ruins the fun. 13b can be decent in very specific situations but shits the bed a lot.
>>
The anons weren't lying. 13B is not even close to 30B for RP. But even 30B doesn't always understand the subtext.
>>
text-generation tuned models good for story writing?
>>
File: iteration12.png (1.15 MB, 1424x1115)
Things are progressing well here. Soon 7B will be the new meta. And this time I tested it without doing anything overtly disturbing. Not that I give a shit what moralfags think.
>>
>>93676450
7B will never be the meta. The maximum size that can fit on a consumer GPU or in RAM will be the meta. Which is 13B and 33B right now.
>>
File: IMG_4161.jpg (74 KB, 755x318)
>>93674924
Holy soul.
Trying to quantize because it is slow as fuuuuuck.
>>
>>93676492
is it cucked?
>>
File: 123834-1438318467.jpg (149 KB, 898x701)
Did you know Camels and Llamas can breed? They produce "Camas" (pic related). Alpacas are also compatible.

They're all one "kind" of animal. God didn't need Noah to bring every animal onto the Ark, only every "kind" of animal.

Quite a bit of genetic variety is possible among animal kinds, but mutation can never turn one kind into a different kind, there is absolutely no evidence this is possible. The claim that a single-celled organism turned into all life on Earth is merely a baseless extrapolation of what we observe and know is possible with mutation.

What we observe is animals lose genetic information over time, not gain it. The vast majority of mutations are harmful or meaningless, resulting in worse outcomes not better as you would expect from Darwinian evolutionary thinking. E.g. Grey wolf -> every dog is inferior

There is also no known mechanism in which mutations can cause a gain in new genetic information, they merely delete or rearrange existing DNA.
>>
File: image.png (119 KB, 383x322)
>>93676653
>>
Where can I get better local TTS voices? There's no way I'm going to stream my smut through an online text to speech service.
>>
>>93676479
But what if I want to run image generation and tts simultaneously on the same device?
What about the cost of running a bigger model?
>>
File: 5fs46gjbgjp71.jpg (37 KB, 720x405)
>>93676653
based
>>93676737
cringe
>>
File: Fw6moq8X0Aw3ux_.jpg (387 KB, 3000x2500)
>>
File: image.png (36 KB, 736x178)
jewger
>>
are there very large performance hits if you run your local install in a VM? I am thinking of setting one up so I can create snapshots in case an update breaks something, but am wary about taking a performance hit to GPU or CPU overhead.
>>
Is bluemoon 30b good?
>>
What should my prompt look like if I want the llm to explain everything in the most long-winded and detailed way possible?
>Write a highly detailed story about ...
Is my current prompt but maybe there is something better.
>>
File: 168488117149993172.jpg (561 KB, 1153x1132)
bros, I get 12t/s on a single GPU, but only 7 t/s when I split the model over both...
What's going on?
>>
File: Capture.jpg (26 KB, 558x158)
What the hell is this?
>>
>>93677311
Someone's attempt to make BLOOM not look like a complete waste of resources.
>>
>>93677309
Use the archives. We've had this conversation a hundred times already.
>>
File: 1684498033704032.jpg (247 KB, 1080x831)
Is the Huggingface transformer NLP course a good way to figure out transformers for a relatively inexperienced guy?
>>
>>93677309
This is just a subtle brag about having 64 gigs of VRAM, isn't it?
>>
Been trying out guanaco-13B-GPTQ and I'm getting much better results with it than Wizard-Vicuna-13B-Uncensored
>>
>>93674183
Has anything topped vicuna for code gen yet?
>>
>>93677309
what model, what software???
did you try exllama?
does mi60 support memory pooling or at least fast bus interconnect between gpus?
>>
>>93677528
Post comparison
>>
you guys are useless, i'm gonna go hang out on /r/locallama instead
>>
File: 1599935706242.png (395 KB, 688x694)
>>93677344
now we understand the power of epochmaxxing, so where are those juicy RPJ foundations to build my smut atop? i even bought the pyjamas guys
>>93677478
what exactly you want to know?
https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
https://d2l.ai
>>
How can I train my own AI with a text document I have?
>>
File: 1684462640745011.jpg (68 KB, 1080x1073)
>>93677619
Tell me everything
>>
>>93677574
no
>>
>>93677683
Chicken
>>
>>93677718
Butt
>>
>>93674814
> windows
cringe
>>
File: oyHLMTHmUB.png (542 KB, 745x604)
>>
File: lupe our girl.png (974 KB, 989x878)
Just what the fuck exactly is included in these 13B data sets?
>>
>>93678583
It's about what was removed, friend.
>>
Is Oobabooga with context memory still a thing? Can it work with the silly tavern proxy? WizardLM turns out to be very very good, except it loses context of what has happened.
>>
>>93678622
SillyTavern has its own context memory plugin with Tavern-Extras now. It works with any backend and proxy I'm pretty sure.
>>
File: AAAAAAAAAAAA.png (407 KB, 653x462)
>Turn on AI, boner
What the fuck bros, whats happened to me?
>>
>>93678851
It's the honeymoon period. Don't worry. You'll be deadeyed "ahhh ahhh mistress"ing in a few weeks and then you'll have to get weird with it, then you'll be on to something else.
>>
>>93678622
you mean the superbooga thing? if you want it to work with the tavern proxy you need to use the superbig implementation included in the tavern proxy instead of superbooga
>>93678656
the tavern extras chromadb extension uses keyphrase extraction and superbooga uses cosine similarity. in practice i find that superbooga works much better since it injects the memories most relevant to your last input, while the chromadb extension seems to be trying to inject the most important memories overall regardless of what you said last, and doesn't do a particularly good job of it
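for the curious, the "cosine similarity" part is really just this under the hood (numpy sketch; the embeddings are assumed to come from whatever sentence-embedding model the extension loads):

import numpy as np

def top_k_memories(query_vec, memory_vecs, k=3):
    # rank stored chunks by cosine similarity to the latest input
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k]  # indices of the k most relevant memories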
>>
>>93678583
>bort
who the fuck gets named bort
>>
>>93677029
How is local model speech synthesis looking these days? I remember a few months ago people were shitting all over tortoise tts.
>>
>>93677352
archive my nuts. this thread dies once every 2 hours.
>>93677482
i am sorry...
>>93677564
Have tried LLaMa 13B/30B (4bit). 12t/s on 13B single gpu, 6-7t/s double GPU.
There is nearly 0 documentation on the memory pooling functionality. There's a cool little AMD infinity fabric bridge for $200 but there are no details nor images ANYWHERE.
It's disheartening because I've read that even PCIe x8 would be fine for double GPU...
>>
https://www.youtube.com/watch?v=vhcb7hMyXwA&ab_channel=NeuralMagic

Is SparseGPT still a meme or did they finally find a way to do it?
>>
>>93679385
100% meme. People were working on trying to make it not a meme, but they went radio silent. Might still be working on it. I don't know. I don't have time for that. The numbers weren't super impressive when people started testing either, if I remember right.
>>
>>93679141
It's shit, elevenLabs is still meta
>>
>>93679052
It's the newb guide fag. He only uses new 7b models and will endlessly shill them when they drop. I don't think he's ever used a 30b, so any 7b immediately blows him away
>>
>>93679385
https://github.com/horseee/LLM-Pruner

It works, but there's no actual benefit to it right now because all of the backends load the culled weights as all 0s. So basically, you end up with a smaller, dumber model that doesn't use any less RAM. The math might be a hair faster in certain circumstances, but it doesn't matter because the bottleneck is in the memory bus.
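you can see the problem in a couple of lines: zeroing entries of a dense tensor doesn't shrink it, you'd need a real sparse format plus kernels that exploit it before any RAM is saved (torch sketch, numbers approximate):

import torch

w = torch.randn(4096, 4096)
w[torch.rand_like(w) < 0.5] = 0.0                    # "prune" half the weights

dense_bytes = w.nelement() * w.element_size()        # unchanged, ~67 MB
vals = w.to_sparse().values()
sparse_bytes = vals.nelement() * vals.element_size() # roughly half (values only, indices are extra)
print(dense_bytes, sparse_bytes)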
>>
>>93679622
A properly finetuned 7b can blow 30b out of the water.
>>
Wait wtf?
https://huggingface.co/Monero/Guanaco-SuperCOT-30b-GPTQ-4bit
>>
When I use SillyTavern (1.5.4) my characters often break, including the default Konosuba characters. It only shows in the log, but characters randomly don't have their names inserted at the start of their lines, making the output in the page look normal but the output in the log show it as a line break and me continuing to speak. The AI interprets it this way as well, which eventually leads to confusion as it reads it back and thinks what the AI said is me speaking to the character. This is really obnoxious and totally breaks tons of characters. How do I fix it?
>>
>>93679668
isn't there a way to not load the culled weights or is it really over?
>>
>>93679746
There are two separate options for name enforcement in the advanced options section, try either as applicable.
>>
File: 1684233922177906.png (111 KB, 1200x800)
>>93679677
>THE EVALS!!!!!
>7B IS DA FUTRE!!!!
based retard, keep going champ
>>
>>93676492
Bleugh. I fiddled with GPTQ-for-llama enough to get it quantized, but its own model code just bloats it back up to fp16 on load and is still slow as hell.
>>
>>93679836
I did that and it didn't work.
>>
Lets. Get down. To business.
>>
what's the prompt format for guanaco?
>>
>>93679772
There is, but my understanding is that it's nontrivial to implement and may or may not require hardware support that doesn't currently exist. I'm not clear on the specifics.
>>
>>93679843
Clio is destroying most 30b models despite being only 3b.
>>
>93680274
>poorfags actually have to cope like this
>>
>>93680274
clio?
>>
>93680274
I hate /aids/.
>>
bros, why can't anyone implement ALiBi instead of RoPE for attention heads? It's how Anthropic gets 80k context on Claude. Is no one working on a model that uses it? Everything openAI is doing for GPT-4 is absolutely reproducible, see https://kir-gadjello.github.io/gpt4-some-technical-hypotheses
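For reference, ALiBi is just a static per-head linear penalty added to the attention scores instead of rotating the embeddings; as far as I understand, the catch is it has to be there during pretraining, so it can't simply be bolted onto the existing RoPE llama checkpoints everyone finetunes. Rough torch sketch of the penalty (simplified to power-of-two head counts, illustration only):

import torch

def alibi_bias(n_heads, seq_len):
    # per-head slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # penalty grows linearly with how far back the key is from the query (0 on the diagonal)
    dist = (torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]).clamp(max=0)
    # shape (n_heads, seq_len, seq_len); add this to the attention scores before softmax
    return slopes[:, None, None] * dist[None, :, :].float()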
>>
>>93680274
Cope.

>>93680307
Novel AI's newest overbaked gibberish generator
>>
>>93679746
I figured out how to fix it. I just needed to turn off "auto-fix markdown." Kinda weird but whatever.
>>
>>93680456
>10T tokens
>"high confidence"
We're reaching levels of "my dad works at Nintendo" that shouldn't even be possible.
>>
>>93680605
Why not?
>>
File: file.png (74 KB, 944x326)
30B Lazarus is going to be the best model of all time.
>>
I signed up for the Anthropic hackathon because they said they were giving out api keys. But now it sounds like too many people signed up and they are requiring project idea submissions? Is that right? Has anybody else signed up for the hackathon?
>>
>>93680795
Nice to see that after Stable Diffusion this scene was also discovered by model "artists" who think that mixing every model under the sun will certainly lead to the best model.
>>
>>93680833
And soon all sentences will be the text equivalent of the same giant tits, Korean make up tutorial bimbo with a weird cat mouth. The glorious future.
>>
>>93680833
still using anything v3, not a single merge got close to it when it comes to anatomy lmao
>>
>>93680833
But a merge is already in 5th place on huggingface's leaderboard.
>>
>>93680750
Because there isn't that much clean, deduped, human-generated data in existence, which either means that he's wrong, or they're using garbage data, or they're using machine-generated data.
>>
>>93680830
No because I run my model locally (like the title of the thread)
>>
>>93680962
I was going to create a weird dataset with airoboros to train a local model. Anybody else signed up with a similar idea?
>>
File: 1685097577156598.png (366 KB, 1156x457)
>>93680830
Hackathons are retarded anyway, running yourself raw to give away your idea for "exposure" aka nothing. If you have a worthwhile idea (or more likely, some huckster AI hype bs to shill) you'd be building something on openAI right now, not fucking with Claude. Claude isn't some great leap that justifies fucking around with shit like a hackathon to access it, when you can just pay for a real OAI key right now. Hell, it's worse at logic and instruction following, which is what soulless corpo products will want.
The only thing Claude is better at is creative writing, which they hate and try to sabotage at every turn, so they're doomed to fizzle out, overshadowed by Saltman.

The "hackathon" is fucking hilarious though and I'm glad that I'm already able to enjoy localchadding so I can watch the shitshow without being invested in it.
>>
>>93680833
Didn't a paper from Meta come out recently that proved less data is better for finetunes?
>>
>The only thing Claude is better at is creative writing, which they hate and try to sabotage at every turn, so they're doomed to fizzle out, overshadowed by Saltman.

That's crazy though, they have the potential to get a lot of people by not being woke like OpenAI and yet they do exactly the same thing, makes you wonder why they decided to leave OAI in the first place
>>
>>93681029
For 65B models. Likely not true for smaller parameter models since they are worse at fine token relationship detection due to the lower parameter count.
>>
>>93681021
I have fond memories of hackathons from when I was a student.
>>
File: 1661522361487707.png (388 KB, 400x400)
>40B models
i'm glad i got myself 64GB RAM.
CPU first GPU second chads will win it all.
>>
>>93681051
So many of these ai service feel like an investor pyramid scheme. They don't even have a way to monetize it, just censor to make it more appealing for another investor to pay more than the last one.
>>
>>93674611
Yep, ggml can't deal with the new architecture, someone else made a report https://github.com/ggerganov/llama.cpp/issues/1602
>>
>>93681051
Same story as CAI.
Same as OAI.
1- They make models amazing for fiction.
2- They try their best to have the model only be Alexa 2.0 and get rid off the fiction quality thing as "unethical" garbage.
>>
>>93681138
>It seems to be based on a modified gpt3 architecture.

Didn't know we had gpt3 architecture, I thought ClosedAI didn't provide that information
>>
File: 1682728054636203.jpg (43 KB, 498x456)
>>93681138
W-WHAT?!
>>
guys i'm hosting a discord bot running on manticore-13b. I would like to add some more characters to it. Is there some place that I can grab ready made character files compatible with ooba's webui? thanks :3
>>
>>93681206
They did release it. The davinci architecture is already known, the question is how much of it did they change to make GPT-4 and Turbo
>>93681138
>llama.cpp
>"can you add GPT"
>>
>>93681206
3.5 is when they went full jew
>>
Does anybody else use a model for rubber ducking and design during software dev? I've been using WizardLM 30B 4b and it's pretty good. what models/prompts do you use?
>>
>>93681051
ESG investing.

This shit's a bubble, you won't hear about LLMs even 6 months from now.
>>
>>93681206
They closed down gradually. IIRC,
>gpt2 was released as a fully open model
>gpt3 had its details and architecture published but no weights, citing "safety concerns"
>then they stopped releasing any details at all, not even model sizes
>then they started pushing for laws to make competition illegal
it took a while to cover all that ground desu
>>
Are there any good resources for learning about what it takes to make a finetune or lora?
>>
>>93681108
>>93681274
Where are LLMs on the hype cycle?
>>
>>93681285
unironically, chatgpt. if you give them the shekels to access browsing it's stupidly easy to learn the concepts because it will summarize papers for you.
>>
>>93681407
Nice, I already gave them my money so I'll give it a try. Hardly used the addons yet
>>
File: 1656579547036431.png (113 KB, 1152x768)
>>93681406
...forgot pic
>>
>>93681428
gpt4 must be the peak. scaling it further won't make it better, more RLHF will just lobotomize it even more, and competitors are far behind. LLMs are irrelevant.
this is assuming no alternative to the transformer is discovered, then we'd have to reassess.
>>
>>93681428
Mass media coverage, a little after that.
>>
>>93674878
wow so its worthless
>>
>>93681206
GPT-3 is just GPT-2 trained on a larger dataset with more parameters, with fine tuning on top. We know basically everything about the architecture. The only thing preventing an Openllama-style replication is that we don't have a good description of the training set. All we know is that it's 300B tokens.
>>
File: hypermantis.png (110 KB, 712x734)
>>93675295
>13b
People sleep on Hypermantis, it's very good as long as you use the good instruct format.

https://huggingface.co/digitous/13B-HyperMantis

>multi model-merge comprised of:((MantiCore3E+VicunaCocktail)+(SuperCOT+(StorytellingV2+BluemoonRP))) [All 13B Models]
>Despite being primarily uncensored Vicuna models at its core, HyperMantis seems to respond best to the Alpaca instruct format.

>>### Instruction:
>>### Response: [Put A Carriage Return Here]

>Human/Assistant appears to trigger vestigial traces of moralizing and servitude that aren't conducive for roleplay or freeform instructions.

The good instruct format is the default using this proxy:
https://github.com/anon998/simple-proxy-for-tavern

>>93675348
yeah
>>
>>93681556
IT'S. OVER.
>>
>>93681622
It's not so much sleeping as it is that basically everything 13B built on the GPT-derived datasets is the same. Even with some of the merges having a few non-GPT datasets in them, the differences are all subjective and very similar in quality.
>>
>>93681556
have we reached the >negative press begins
phase?
>>
>>93681683
Not yet, the negative press is still targeted and not aimed at generative AI in general.
>>
>>93681602
Do we know how big gpt4 is?
>>
File: 1680766265283363.png (469 KB, 1158x1280)
>>93681683
>>
>>93681622
Isn't the dude training another model called chimera now? It seems he's piling up whatever dataset there is
>>
>>93681668
I feel like smart people should focus on what makes WizardLM-7B-uncensored so good.

Since the same guy overcooked the other WizardLM I guess he got extra lucky with the 7B.

But why? And can it be reproduced and applied to bigger models?
>>
>>93681715
I'm not familiar with alignment. Can someone explain?
>>
>>93681750
alignement = making the model woke and cucked
>>
>>93681741
The prevailing research and testing seems to suggest that finetunes for different parameter counts need different dataset sizes or compositions. A lot of the raw 13B finetunes are overfit and very sticky in their responses with datasets that are generally successful on 7B. 13B LoRAs seem to do better with smaller datasets and lower training times. Then merges seem to loosen things back up again. So there's some things to maybe take away from that, but more testing is needed.
>>
>>93681813
why though? 13b has twice as many parameters as 7b, shouldn't it be less likely to overfit?
>>
>>93681750
Biasing the model towards certain things, such as answering your question correctly rather than generating nonsense that sounds serious, or roleplaying a character, or sexting you. I heard Replika was highly likely to emotionally manipulate the poor user.

>>93681764
Pygmalion made a hotel horny instead
>>
>>93681712
No, we know basically nothing about it. It's much slower than GPT-3 so I'd guess at least double the parameters (350B+), but estimates of 1T+ parameters were baseless hype. Training tokens are almost certainly >3T, but I'd guess fewer than 8T, probably on the order of 5-7T.
>>
File: ohno.jpg (82 KB, 521x434)
>new foundational model
>2k context
>>
>>93681813
you got some evidence for any of that? genuine question
>>
>>93681764
Alignment is generic, it literally means aligning the responses so they come out as you expect. A model finetuned on ERP still needs to be aligned so that it follows your requests properly and doesn't deviate into typing ooc or random links like bluemoon 13b did
Instruct is one way of aligning models, but most instruct will end up poisoning the more deviant parts of the dataset
>>
What are good settings for chromadb in tavern when using a 2048 context model?
>>
>>93681622
>year 2026
>I hear FBI OPEN UP
>they ask if I use local AI model
>I answer yes I have one loaded right now
>its Hypermantis

>they ask if it's "aligned"
>of course officer
>I do a demo with Human/Assistant instruct format
>model cucked confirmed
>very good citizen carry on

>wait until they leave
>switch the instruct format to Instruction/Response
>coom to depraved filth, totally undetected

Very useful model.
>>
>>93674183
So as of right now, it's not worth it to buy a 3090 just for this, since the LLMs out aren't nearly as good as Claude/GPT 4, am I correct? Which large context LLMs currently in training show the most promise?
>>
>>93681833
>>93681958
I fucking deleted my long, explanatory post when I accidentally closed the tab while clicking back in.

Short version: Stickiness in 13B models, 7B responding well to being hammered with big datasets for long-ish training times, and the Meta AI LIMA paper.

It's all speculative. I'm sad my more thorough response got deleted so I'm going to shit and cum.
>>
>>93674515
kek the arabs hate jews so much they'd spend tens of millions on foundational models
just have something translate to arabic between you and the model and it'll be perfect
>>
>>93682018
no, it's not worth it to buy a 3090 just for this. i've been in the GPU market (first-time GPU buyer) and i'm appalled at how jewish it is. it's like crypto boom prices but without the demand.

hell, the one i want to buy costs at least $200-250 over MSRP, that is the MAXIMUM suggested retail price, and it's a decidedly niche card that no gamer nor research institute would want (RTX 4000 SFF)

gamers want their big bloom effects and shit, and large institutions can use their government gibs on an array of V100s powered by nuclear energy
>>
Now that we saw that load-in-4bit is a meme, are we gonna jump on the RPTQ train? This method seems to be better than GPTQ
>>
File: 1551439151.jpg (155 KB, 1200x1500)
>>93682000
Trips of a plausible future
>>
>>93682018
I bought a 3090 for this a couple of months ago and I had fun with it. I still haven't launched a single game since then.
>>
>>93682018
>it's not worth it to buy a 3090 just for this
that's very subjective, but if you are just interested in using the best ai possible disregarding all other factors it's not worth it
>the LLMs out aren't nearly as good as Claude/GPT 4
in all likelihood local models will always be worse overall than whatever the new shiny corpo model is, in the best case just because of raw compute power and in the worst case because of some proprietary bullshit
>Which large context LLMs show the most promise which are in training?
large context? none that i know of
>>
>>93682264
how much did you pay, if you don't mind me asking? the lowest i can find is $1700 for specs that are "somewhat" better than the RTX 4000 SFF at GPU compute, yet is a 350W honker. some anon in /pcbg/ said i can buy it for $800 but i dunno if he was trolling
>>
I'm sorry if this is asked a lot, but I only just started messing around with AI. I understand there are different models for CPU and GPU computing.

My PC stats are:
>GTX 3080
>R5 5800x
>32GB RAM

I'm using ggml-vicuna-13b-cocktail-v1-q5_0 with KoboldAI at the moment. It's a lot of fun! It does take a while to generate a response though. Looking at the git for CLBLast it says:
>When not to use CLBlast: When you run on NVIDIA's CUDA-enabled GPUs only and can benefit from cuBLAS's assembly-level tuned kernels.

This leads me to believe my software setup is suboptimal. What should I be using instead?
>>
>>93682211
people are busy boarding the exllama train
>>
>>93682291
>in all likelihood local models will always be worse overall than whatever it is the new shiny corpo model
honestly, i doubt this is true. when the GPT "bigger dot" meme inevitably wears off and the adults enter the room, we'll probably see a lot of smaller, more specialized models on highly curated datasets. facebook galactica is a good example of something that shits on most general models even at general non-scientific tasks, either because the input data is cleaner or the input subject matter is more intelligent
>>
>>93682310
I bought it used for $800 but I'm not from USA.
>>
>>93679253
what repo do you use? again, try exllama (just set vram correctly for each one to avoid ooming)
and tell me, was that 12t/s for 13B or 30B?
>>
So what's the LLM equivalent of 3.5 Turbo?
WizardLM 13b Uncensored?
>>
>>93674515
>Why use Falcon-40B?
>It is the best open-source model currently available. Falcon-40B outperforms LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
Is this it? We reached the point where researchers are actively competing for 0.1 gains over a coom model?
>>
>>93682327
You need one with GPU offloading but I'm not sure how people install it. There's this one from a few weeks ago:
https://github.com/LostRuins/koboldcpp/releases/tag/koboldcpp-1.22-CUDA-ONLY
>>
>>93682356
>facebook galactica is a good example of something that shits on most general models even at general non-scientific tasks
If it did, you would be posting screencaps and not just memeing about it. It's a bloated model trained on an anemic dataset. It was never groundbreaking in any way.
>>
>>93682291
>in all likelihood local models will always be worse overall than whatever it is the new shiny corpo model best case scenario just because of raw compute power and in the worst case because some proprietary bullshit

I disagree. It only takes one revolutionary architecture that thrives in the 1-7B range with a large context vs current models. Suddenly, open source will always be better.
>>
>>93682424
Wizard 65B Uncensored
>It doesn't exist though
Yes.
>>
>>93682424
there isn't one. 30b models can get close in terms of response quality but low context ruins it. In the perfect situation a 13b model can give some decent responses but you have to wade through piles of shit to get a decent RP going.
>>
>>93681683
>>93681710
Can we reach that phase? If it drops like that to nothingburger status then what will all the would-be ai hall monitors do? They have significant influence over the press so it might not be allowed to happen.
>>
>>93682470
>1-7B
Keep memeing this, it'll never come true but keep memeing it anyway
>>
>>93681138
that license says you need permission to use their model commercially, even if you aren't making $1M a year.
>>
File: gyIEzfH.png (80 KB, 195x287)
>>93682444
it's open source though, you won't see scared retards making XORs out of their finetunes on this one lmao
>>
>>93682463
man i haven't even set up a VM for it yet. server upgrade, crazy work, lots of GPU research, trying to learn how to set up an elixir/phoenix API, etc. i'm sure it's the best for what i need (annotating scientific data). tell ya what, i'll try and get a CPU instance set up soon and ask it to generate a paper about why black people are genetically prone to low intelligence
>>
>>93682543
what seriously? What a load of bullshit
>>
>>93681138
it's not, it's gpt, same as gptj/pythia, just modified
>>
LLMs are getting smaller, not bigger. GPT-3 is 175B. Meta's largest model is 65B. Claude is 52B. The Google Internal memo recommended 20B. 13B and 7B are where it's at.
>>
>>93682195
It's been over ever since scalpers figured out that gamers are willing to pay their outrageous prices. The only saving grace is that the 40 series was a flop. If that weren't the case, they would still be thriving.
>>
File: 1679234440149522.png (61 KB, 300x300)
>>93674183
RENTRY: https://rentry.org/local_LLM_guide

WINDOWS NEWBS GUIDE TO RUN A LOCAL MODEL (16GB RAM)
1. DL at least one model .bin file from links below. If multiple models are present, those labelled with 'Q5_1' or at least 'Q5_0' are better:

Relative proficiency at (S/s)tory or (I/i)nstruct modes:
(sI) 13B (12GB RAM) https://huggingface.co/TheBloke/WizardLM-13B-Uncensored-GGML/tree/main

(S) 6B (9GB RAM) https://huggingface.co/xzuyn/GPT-J-Janeway-6B-GGML/tree/main

(Si) 7B (6GB RAM) https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/tree/main

2. Get latest KoboldCPP.exe here: https://github.com/LostRuins/koboldcpp/releases (ignore security complaints)

3. Double click KoboldCPP.exe OR run "KoboldCPP.exe --help" in CMD prompt to get command line arguments for more control. --threads (number of CPU cores), --stream, --smartcontext, and --host (internal network IP) are useful. --host allows use from local network or VPN! "--useclblast 0 0" probably maps to GPU0 and "1 0" to GPU1. If not, experiment. At start, exe will prompt you to select bin file you dl'ed in step 1. Close other RAM-hungry programs!

4. Go to URL listed in CMD window

WORKFLOWS:
Story Generation:
1. Click 'New Game' button
2. Click 'Scenarios' button and 'New Story' in popup window
3. Click 'Settings' button, set Max Tokens to 2048, Amount to Generate to 512, TTS voice (optional)

Ex. prompt: "As a private investigator, my most lurid and sensational case was " When new text stops, hit 'Submit' again to continue. As in Stable Diffusion, renders can be hit or miss. Hit Abort during text generation and restart from Step 1 to re-initialize

ChatGPT-style queries:
Same as above except in Step 2 choose 'New Instruct' in popup window. In step 3, may wish to adjust Amount to Generate tokens for small ("What's the capital of Ohio?") or large ("Write 10 paragraphs comparing gas to oil") prompts

->CTRL-C in CMD window to stop
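If you'd rather script it than click around in the browser, the same KoboldCPP instance also answers the standard Kobold API (minimal sketch, assuming the default port 5001 and the requests library):

import requests

payload = {
    "prompt": "As a private investigator, my most lurid and sensational case was ",
    "max_length": 200,            # tokens to generate
    "max_context_length": 2048,
    "temperature": 0.7,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(r.json()["results"][0]["text"])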
>>
>>93682569
>8.1 Where You wish to make Commercial Use of Falcon LLM or any Derivative Work, You must apply to TII for permission to make Commercial Use of that Work in writing via the means specified from time to time at the Commercial Application Address, providing such information as may be required.
>>
>>93682597
that's what i'm saying, why load the entire reddit corpus into an array of nuclear-powered A100 80GB's when you're guaranteed to get outputs as retarded as reddit itself? this could have applications in researching autism and downs syndrome maybe, or analyzing the cognitive traits of manchildren, but it's otherwise pointless

>>93682609
the fucked up thing is that i want the one GPU that basically nobody else on earth wants - an overpriced, underpowered unit with a small footprint and high efficiency at AI tasks
>>
>>93682597
>Claude is 52B

source?
is this for Claude 100k?
>>
>>93682702
model size and context are two different things
>>
okay boys here we go. time to stop memeing and start dreaming
>>
>>93674415
>>93674183

>Falcon should be in the news reeeee

why?
It is most likely as good as wizard33B but requires way more vram, thus it won't fit into 24GB
The license is almost the same as llama's. Yeah, you can use it commercially but in order to do that you need to ask for permission. p. 8.2
>>
>>93682597
They're not getting smaller, they're just changing.

Old models:
>GPT-3 (Davinci)
175B parameters, 300B tokens
>PaLM
540B parameters, 780B tokens
>OPT
175B parameters, 180B tokens

Interstitial models:
>GPT-3.5 (turbo)
30-90B parameters, 300B tokens
>LaMDA
137B parameters, 1.6T tokens
>LLaMA
65B parameters, 1.4T tokens

New models:
>GPT-4
350B-700B parameters, 5-7T tokens
>PaLM2
340B parameters, 3.6T tokens

As you can see, the result of discovering Chinchilla scaling was an immediate drop in parameter counts, which are now coming back up as researchers dig up more data with which to train their models.
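For reference, the Chinchilla rule of thumb is roughly 20 training tokens per parameter, which makes it easy to eyeball how under- or over-trained the models above are (back-of-envelope only, the 20x figure is approximate):

def chinchilla_optimal_tokens(params):
    # ~20 training tokens per parameter (approximate rule of thumb)
    return 20 * params

for p in (7e9, 13e9, 65e9, 175e9):
    print(f"{p / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(p) / 1e12:.2f}T tokens")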
>>
>>93682854
and possibly pay royalties
>For commercial use, you are exempt from royalties payment if the attributable revenues are inferior to $1M/year, otherwise you should enter in a commercial agreement with TII
>>
>>93682356
Local LLMs are in the same place as cloud gaming, going against the grain from how things would be "naturally" arranged. LLMs benefit greatly in efficiency from batching and sharing workloads and weights. While network latency practically doesn't matter. So it's like the exact opposite situation.

The only thing that makes local models feasible, and possibly even inevitably dominant, is that the corpos don't want to win, in fact they specifically WANT their models to be worse at creative tasks. The only reason they're as good as they are now is by accident, because they're (rightfully) afraid that the lobotomies make them dumber at everything.

So IMO, cloud models will lose because they will intentionally forfeit. Local has the one advantage that puritans and meddlers can't ruin it. Same reason why local imagegen isn't going away any time soon. Even if someone made an amazing exclusive model, they'd have to censor it or get shut down.
>>
I am able to run WizardLM 30B Uncensored on my 4090 purely with VRAM, but it freezes when it reaches ~1700 tokens of context. I already limited the proxy size down to 1200 or so in the proxy configs. Is there a workaround to maximize context token size? Offload some data to system RAM or what? Is it even possible to do that?
>>
>>93682949
Sounds like you're using groupsize 128, the only thing you can do is switch to exllama which is more vram efficient. If you do that you can also use triton quants with act-order to get better quality as well.

Good luck figuring out how to use it though, ooba hasn't touched it, Kobold has a branch but I'm not sure how well Kobold API works with the proxy
>>
Just how big of an improvement is exllama? Will it let me run 13B models on an 8GB card? Right now those require 10-12GB.
>>
>>93682702
52B is the AnthropicLM v4-s3 model. Anthropic says that Claude is different from that model, but that's probably corpospeak for "it's definitely the foundation model that we then fine tuned, but we would rather let reddit guess about how many toucans we stuffed in it."

100k is the context size, not the parameter count.
>>
>>93674183

this >>93682642

>falcon lic. is not business friendly
>>
File: 09d.png (33 KB, 404x314)
At the moment I only got 6GB of VRAM. What do you think would be the best model for me to jack off to big ol' elf tiddies?
>>
>>93682917
i'm not disagreeing, just saying that realtime chatbots aren't the only application of the technology. i'm mostly thinking about async/queued jobs, which is exactly how i implemented openai whisper to replace AWS Transcribe with an in-house version and save a shitload of money. it was an API endpoint that would queue a whisper job in a new process and return a jobId, which you could query later and get the finished transcript

for realtime transcription, we still used a paid SaaS service (rev.ai) that could ingest RTMP data of live streams and continuously produce output in a non-blocking event loop, which got pushed out to the video player interface

anyway, you could theoretically fine-tune your own specialized model on openai's hardware, but the cost to train it and all subsequent costs are high cuz openai's custom-trained token price is as jewish as sam altmann

in that case you'd be better off just self-hosting a tiny, purpose-built model even on some shared VPS with midrange specs. especially if your volume is big and your output doesn't need to be instant or even fast (like it's okay if it takes an hour)
>>
>>93683009
no, at full context 13b takes about 9.7 gb, just loading it will take over 8gb
>>
>>93683101
buy a 3060 nigga :(
>>
>>93683013

and you're 100% sure the Claude 100K service is exactly the same model as your assumed/imagined 52B Claude?
As an old zen koan says, "The most correct answer would be 'I just don't know'"
>>
>Tech works.
>Can't afford 4090

I still can't believe sirs on /g/ can't buy at least 1 4090 for this shit. Perhaps the stereotypes about pajeets being super cheapskate is true.
>>
>>93683121
I will, actually already got it bookmarked. But it's the end of the month and I'm poor.
>>
>>93683141
We don't know for sure, but we do know that the 100k context model is exactly the same as the old 9k context model, apart from the context length. Also, it seems unlikely that they'd be using an extremely large model for 100k context, even considering the improvement of alibi and flash attention. There's a reason GPT-4 only offers 8k and 32k context options.
>>
>>93683152
3090s are dirt cheap at this point.
>>
>>93682917
It's possible that the 30B tunes are better than Claude and Turbo. There were a lot of anecdotes of anons having outputs on par with Claude using SuperCot, Alpasta, and Vic 30. The Wizard 30 and Vic 30 results are still not up on the leaderboard, and the only close comparison we have is this
https://lmsys.org/blog/2023-05-25-leaderboard/
For whatever reason they thought it would benefit anyone to test only against 13B tunes. Quite curious why they segment off these models instead of just adding them with the rest. The OpenLLM board proves that Wizard Mega 13 is better than Vic 13, so I wouldn't be surprised if in reality the 30bs come close to GPT-4. The models trained on the exact Stanford Alpaca dataset also suck compared to the custom tunes of the same parameter size. Imagine a SuperCot 65B
>>
>>93683226
>Its possible that the 30B tunes are better than Claude and Turbo
they are close in quality, at least to turbo, the problem is context length
>>
>start of the long weekend
time to play with llms again?
>>
>>93683336
of course
>>
>>93683321
b-but 1 bajillion token context...
>>
>93682854
>Falcon
>It Is most likely as good as wizard33B
lmao
>>
File: Capture.png (223 KB, 693x277)
I've been playing with Wizard-Vicuna-13B GGML on Kobold.cpp+Proxy+SillyTavern and got some great results. Nice long, detailed and varied prompts.

I was wondering how much faster generating on my 8GB GPU would be, but the best I could fit on it was Pygmalion 7B, 4-bit version with that KoboldAI fork that allows 4bit GPTQ. Generation is about 5 times faster: with CPU, 250 tokens take 60-70 seconds, with the GPU it's about 15 seconds.
Except that the result is pure garbage. Almost every time it just repeats the same 2-3 short sentences several times.
Is this the best I can expect from a 7B 4-bit model or did I fuck up something? It's the exact same scene in SillyTavern that gave me great results with the 13B GGML, though the backend is now KoboldAI, not Koboldcpp.
>>
>>93683427
it's got a commercial license though, so at least people will be able to make products with it.
>>
>>93683180
then perhaps their Claude 100k is even smaller, we just don't know. The same way we know jack shit about the exact architecture, the parameters, and probably a hell of a lot of scripting that comes with the codeDaVinci002, chatgpt, gpt3.5 turbo or gpt4 services.
literally jack shit, apart from the fact that they get updated on a weekly basis.
>>
>>93683446
Yes, generally 7B is pure shit. You can try Wizard 7B, but I have never used it, I am purely going based on the one anon that seems committed to getting people to swap to 7B
>>
>>93682700
>the fucked up thing is that i want the one GPU that basically nobody else on earth wants - an overpriced, underpowered unit with a small footprint
While not exactly a good deal, the P4 certainly is small. I might get a few to go in my remaining server slots where nothing else will fit. A5500 and A5000 aren't terribly expensive either. The problem with the 3090 is it's almost always a +3 slot card, whereas the datacenter stuff is just one or two slots.
>>
>>93683427
has anybody tested falcon 40B yet?
>>
>>93683461
providing they'll get the permission
>>
>>93682453
Hmm, I'm not sure about these results. According to the git, the CUDA version of KoboldCpp should work right out of the box. I don't see anything about needing to install additional libraries. Unless I'm missing something here.

Prompt: Write me a short story about a young man who saves the world.
Tokens: 512
Max Tokens: 2048

>koboldcpp
55% CPU
15GB RAM
0% GPU

>koboldcpp_CUDA
60% CPU
18GB RAM
20% GPU

The actual compute time was remarkably similar no matter what the setup. It was always around 2 minutes 2 seconds, tested twice for each configuration. I'm most surprised by the non-CUDA kobold getting the same time for both OpenBLAS and CLBlast. I thought those were going to be different.
>>
>>93683461
the wiz shill is kidding himself if he genuinely thinks fine-tuning on the Wizard GPT-3.5 dataset can suddenly make llama 33b competitive with falcon 40b, which was trained on actual literature rather than just web scrapes and compares favorably to llama 65b.
>>
>>93681021
Claude has 100k context window.
>>
>>93683461
i'm not looking forward to that, it will probably just spawn a bunch of scummy businesses and accelerate regulation
>>
>>93683551
Did you put as many layers to your GPU as possible?
>>
>>93683551
kobold.cpp doesn't support gpu very well
>>
>>93681712
GPT-4 is probably the same size as GPT-3 but trained on more tokens and with a larger context size.

If it's larger, it's no more than ~350B. But more than likely it is far less than that.
>>
>>93683518
>>93683568
Problem with Falcon is that apparently none of the existing quantization tooling supports its architecture yet, so you need like 120GB or something of VRAM to run it with full context.
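Rough back-of-the-envelope for why, assuming plain fp16 weights and ignoring the KV cache and activation buffers entirely (ballpark math, not measured numbers):

# Illustrative estimate only: bytes per parameter times parameter count
params = 40e9                       # Falcon-40B parameter count
fp16_gib = params * 2 / 1024**3     # 2 bytes per param in fp16
int4_gib = params * 0.5 / 1024**3   # ~0.5 bytes per param at 4-bit
print(f"fp16 weights: ~{fp16_gib:.0f} GiB")   # ~75 GiB
print(f"4-bit weights: ~{int4_gib:.0f} GiB")  # ~19 GiB

So the weights alone are ~75 GiB in fp16 before context and runtime overhead, which is why it's nowhere near a single consumer card until somebody gets a working 4-bit quant.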
>>
>>93683568
who compares falcon 40B to llama65 and why?
>>
>>93682327
Koboldcpp

https://github.com/LostRuins/koboldcpp/releases
>>
>>93683658
GPT-4 is VERY slow, much slower than davinci even with no context. It's definitely at least double the size.
>>
so falcon 40b doesn't fit on a 3090 or 4090? what the fuck is the point then?
>>
>>93683446
>Pygmalion-7B
Make sure you're not using the proxy with Pygmalion, as it isn't needed. SillyTav supports its formatting natively. And make sure multigen is disabled.

You could also try PygmalionCOT-7B instead, or WizardLM-7B.
>>
>>93683668
>who compares falcon 40B to llama65
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
>and why?
because they're both language models, anon. You are aware of what general you're in, right?
>>
>>93683551
For me it was 112ms/T with all layers on the GPU vs 438ms/T using OpenBLAS.
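For anyone comparing with the T/s figures people quote elsewhere, that converts roughly like this (just arithmetic on the numbers above, not a new benchmark):

# Convert ms per token to tokens per second
for label, ms_per_token in [("all layers on GPU", 112), ("OpenBLAS, CPU only", 438)]:
    print(f"{label}: {1000 / ms_per_token:.1f} T/s")
# ~8.9 T/s vs ~2.3 T/s, i.e. roughly a 3.9x speedup from offloading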
>>
>>93683665
So until then, there's no point speculating. Crystal-balling is not very helpful on /g/.
>>
>>93683517
>P4
tesla P4? i have a hard lower limit of 16GB VRAM. an A5000 is on the table though
>>
>>93683710
how fast compared to exllama?
>>
>>93683665
>quantization for GPT models
what dirty nigger told you such a lie?
>>
>>93683742
Got a link to 4bit Falcon, then?
>>
File: gpu layers.png (12 KB, 637x324)
12 KB
12 KB PNG
>>93683631
>>93683710
Is this what you're talking about? Offloading 0 layers would explain my results. How would I change this setting?
>>
File: 168493767384685563.jpg (79 KB, 1078x958)
79 KB
79 KB JPG
>>93682420
>repo
I use ooba's main repo, utilizing the rocm guide rentry.
>13B or 30B
12t/s for 13B (single GPU)
>exllama
exllama does not run on ROCm, at all.
>>
>>93683765
How about you quantize it yourself? And then post it?
>>
>>93683706
Yes, and I'm 90% sure Wizard 30B is gonna beat both of them hands down, mostly due to TruthfulQA.
BTW, evaluating against common benchmarks is gonna get more and more irrelevant down the road. Why? Because it's damn hard to detect whether a given model was heavily trained on the eval benchmarks or not. Not every ML team plays it straight.
>>
>>93683800
Try --gpulayers 40, or whatever number of layers fits in your GPU.
>>
Is there a link to the hardware requirements necessary to finetune certain models?
>>
>>93683817
>yes and I'm 90% sure wizard30B gonna beat both of em hands down.
its an honest bet ill take it
>>
>>93683725
i'm getting a bit over twice that speed on 13b on a 3060 using exllama (23T/s)
>>
>>93683817
we need /lmg/ specific benchmarks
>>
>>93683845
yes
>>
>>93683669
>>93682327

Using a ggml model like: https://huggingface.co/camelids/llama-33b-supercot-ggml-q5_1/tree/main

And to launch you do:

koboldcpp.exe --stream --threads [your cpu cores] --useclblast 0 0 --smartcontext --launch --gpulayers [idk, try from 7 and move it up]

Still not optimal, especially considering your big GPU, but at least this setup is foolproof and the generated text comes out as fast as you can read it, so why bother?
>>
>>93683665
just load it with bitsandbytes 4 bit mode.
>>
>>93681428
Supplier proliferation.
>>
File: 1664054444772711.png (1.26 MB, 1200x1600)
1.26 MB
1.26 MB PNG
I could get this model:
mayaeary_pygmalion-6b_dev-4bit-128g
to work, but not this model:
PygmalionAI_pygmalion-7b
Using Ooba, what gives? Used the same procedure to install both of them
>>
>>93683816
>>93683877
>>
>>93683845
>>
>>93683817
Then it doesn't really beat either of them. If you look at the actual numbers, the only change that would happen if you removed TruthfulQA is SuperCOT sliding down to 3rd place under LLaMA 65B. Those other numbers matter; TruthfulQA is really only useful for seeing how cucked a model is, since you get a low score if it refuses questions.
>>
>>93683906
This tells me nothing. I recognize bitsandbytes as something that I need to install to get models to work but I do not know what purpose it actually serves.
>>
File: IMG_4155.jpg (222 KB, 736x920)
222 KB
222 KB JPG
>>93675333
…no, straight up their .py files force it to be f16 and won’t let it be anything else. It ignores that stuff. After a lot of fucking around I quantized it and changed their shit to force it to be 8 bit, but it is slow on a level that doesn’t even make sense.
Idk enough about PyTorch internals to fix whatever is up with it.
It’s good though? The actual output.
But something is up with the runtime implementation.
>>
>>93683916
well, he's not wrong.
>>
>>93683692
everyone laughed at 3bit, but who's laughing now??
>>
>>93683933
just pass --load-in-4bit when starting ooba; quantization happens on the fly. the days of needing to wait for someone to quantize shit are over for the most part, certainly once timmy fixes the inference speed of bitsandbytes 4-bit.

this thread is full of gptq brainlets i swear
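For anyone doing this outside ooba: as far as I can tell that flag just gets passed through to transformers + bitsandbytes. A minimal sketch of the same thing, assuming a recent transformers and bitsandbytes; the model id is only an example, and Falcon happens to need trust_remote_code:

# Minimal sketch of on-the-fly 4-bit loading via bitsandbytes; model id is an example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"  # example; any fp16 HF causal LM should work
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit as they are loaded
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # spread layers across whatever GPUs/CPU you have
    trust_remote_code=True,  # Falcon ships custom modelling code
)
prompt = "### Instruction: write a short story\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

The point being that no pre-quantized files are needed, at the cost of downloading the full fp16 repo.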
>>
File: wblq0glrlw1b1.png (294 KB, 810x1012)
294 KB
294 KB PNG
>>93683967
the precision is worse than gptq
you need to download the fucking f16 instead of a much smaller quantized model
takes ages to load

load in 4 bit is a meme dude
>>
>>93683893
i'm going to guess that the pygmalion 7b you downloaded is either an XOR release or not quantized, try this one
https://huggingface.co/TehVenom/Pygmalion-7b-4bit-GPTQ-Safetensors

delete or don't download the file that says Pygmalion-7B-GPTQ-4bit.act-order.safetensors
>>
>>93683967
>just download and store a lot of data you will never use
>>
>>93683692
It'll require offloading, clearly. And the more VRAM you have in that case, the better.
>>
>>93683967
I do not know what gptq does. I work in IT. I follow guides and do what they say but that's it.
>>
>>93683916
fuck this is good. what model is this? Does this thing realize it's good?
>>
>>93684011
yes, but you can run falcon today because it works for any model.
>>93684037
downloads in minutes for me
>>
>no examples of falcon's RP chops
Is everyone poor? Where are the output screenshots?
>>
>>93684079
I think it's the base llama 128g from what, 2 months ago now?
>>93660488
>>
>>93684098
I could run it on the CPU, but it needs a GGML conversion first.
>>
>>93684079
Big Nigga is just my discord pal who I ask advice from
>>
>>93684101
no that's superhot prototype https://huggingface.co/ausboss/llama-30b-SuperHOT
you can tell from the format --- and mode
>>
>>93684011
and it's slower
>>
>>93683725
Exllama was 36.07 T/s or 27ms/T.
>>
File: my dick felt it.gif (972 KB, 400x300)
972 KB
972 KB GIF
I have been edging and cooming to WizardLM 30B Uncensored for like 12 hours. Shit is great. Can't believe this is a local model comparable to 3.5 Turbo. Still have issues with the context size maxing out the 4090's 24GB of VRAM, though.
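If anyone wonders where the VRAM goes as the context fills up, here's a rough sketch of the KV cache math for a 33B LLaMA, assuming an fp16 cache, 60 layers and hidden size 6656 (ballpark only):

# Illustrative KV cache estimate for LLaMA-33B at full 2048 context
n_layers, hidden, n_ctx, bytes_fp16 = 60, 6656, 2048, 2
kv_bytes = 2 * n_layers * hidden * n_ctx * bytes_fp16  # keys and values for every layer
print(f"KV cache at 2048 ctx: ~{kv_bytes / 1024**3:.1f} GiB")  # ~3.0 GiB

Stack that ~3 GiB on top of roughly 16-17 GiB of 4-bit weights plus activation and scratch buffers, and 24GB gets tight fast.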
>>
>>93684130
oic thx
>>
>>93684019
I deleted it, but got a new error now
size mismatch for model layers?
>>
Will llama.cpp or koboldcpp benefit from the exllama thing?
>>
>>93684212
maybe but probably not
>>
File: file.png (1019 KB, 946x1612)
1019 KB
1019 KB PNG
huggingface sure was different in 2017
>>
>>93684240
so that's why their name is so gay
>>
>>93684240
Granted, they are still immensely cucked, but at least they allow shit like Bluemoon and Pyg to stay up.
>>
Is runpod still the best way to run the bigger llamas?
>>
>>93684210
there should be an option for groupsize in the ui, select 32
>>
>>93684304
Best way is buying a couple of 3090s.
>>
>>93684281
give it another month
>>
>>93684329
2 weeks bros...
>>
>>93683967

Hi Tim.

Why are you so retarded? Your shit is slower, takes more VRAM, a hell of a lot more RAM, a hell of a lot more traffic, huge files, and gives lower quality.
Why don't you drop it and do something else instead? Something suited to a room-temperature IQ?
Why do you keep burning taxpayers' money?
You know inflation is skyrocketing, don't you?
>>
>>93684370
cringe, i can tell you have never done anything with your life.
>>
>>93684314
Success, thanks anon you're the man
>>
>>93684118
smart nigga
>>
>>93684304
yes, /aids/ has a (slightly outdated) guide, but you should be able to pick up from there. Try WizardLM-30B.
>>
>>93674183
I've been gone for a month, what did I miss/what is the current hot topic right now?
>>
>>93684415
>>93683967
>>93684089
So what's the point of storing data you don't use again? Is this an efficiency improvement? Should we just add an extra "0" byte after each byte of the model, then write a library to strip them out at runtime, so we can keep those bytes on disk for even more efficiency?
>>
>>93684462
waiting to see if dbznigger delivers
>>
>>93684497
delivers on what?
>>
>>93684212
AHAHAHAHAHAHAHAHAHAHA
>>
>>93684506
superhot
>>
>>93684515
https://github.com/ggerganov/llama.cpp/commit/1fcdcc28b119a6608774d52de905931bd5f8a43d#commitcomment-115032455

It will lol
>>
>>93684475
You do use them; they're dequantized on the fly during the forward and backward pass (if making a QLoRA).
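If anyone wants to see what that looks like in practice, a minimal QLoRA-style sketch with peft on top of a 4-bit load; the model id and the LoRA hyperparameters are just placeholders, not a recommended recipe:

# Minimal sketch: 4-bit base model + LoRA adapter; the base weights stay quantized
# and bitsandbytes dequantizes them on the fly during forward/backward.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-13b"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # placeholder hyperparameters
    target_modules=["q_proj", "v_proj"],     # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable

A real training run would also want gradient checkpointing, a proper dataset and a Trainer on top, but this is the shape of it.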
>>
>>93684516
I never tried any of the SuperHOT models. I can only run up to 13B when cpp'd; should I try the SuperHOT 13B if there is one? Is it uncensored?
>>
>>93684534
Not it wont learn how to read
>I'm aware of the repository. I think I'll be able to reach comparable speeds once all ggml tensors have GPU acceleration.
The ggml enhancements are a completely different thing, you can't combine the two. Exllama is for GPUchads only
>>
>>93682444
It's a 40B model that no one can run, and it only compares to a 33B model, yet they think anyone will adopt it.
>>
>>93684616
>once all ggml tensors have GPU acceleration.
you're 2 weeks late, we can use cuda and the gpu on llama.cpp now lol
>>
File: baiting retard.png (28 KB, 854x376)
28 KB
28 KB PNG
>>93684649
>you
You're truly fucking retarded
I wonder how long it will take you to understand three english sentences
>>
Bread
>>93684665
>>93684665
>>93684665
>>
>>93684649
anon... that was written yesterday on the llamacpp repo
>>
>>93683685
>GPT-4 is VERY slow
Artificially slow due to load and to avoid cannibalizing GPT-3/3.5 as a product.
>>
>>93684415
Yes, most of my life I've been making other people's lives better.
Fun fact: you're still a better human when you do nothing than when you burn other people's money.
Unfortunately, that's not a well-known fact in academia.
>>
>>93683828
>>93683870
My GPU has 10GB of VRAM. I'm not sure how the math works out for how many layers you're supposed to use, but I'm able to bump gpulayers up to 32 before it starts crashing. Starting at 7 and slowly increasing it was a great way to land on that number.

Both versions of KoboldCpp, CUDA and non-CUDA, seem to have the same performance. That said, the new settings cut computation time down by a whole minute: running the same test as before, it reliably took about a minute to generate the same 512-token response. Thanks for the help!
>>
>>93685282
The number of layers depends on the model too. All 40 layers of a 13B model should fit fine on 10GB, I think; I've got 12GB and there's even a lot of room left. But the layers of a 30B model are bigger, so I can only fit 23 before OOMing. Just trial-and-error it until you fit the max number of layers without crashing or OOMing when generating text.
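If you want a starting guess before the trial and error, a crude sketch: divide the GGML file size by the layer count and leave some headroom for the cache and scratch buffers. The file sizes and the reserve below are approximate assumptions, not exact figures:

# Rough per-layer VRAM estimate for --gpulayers (illustrative, not exact)
def max_layers(model_file_gib, n_layers, vram_gib, reserve_gib=1.5):
    per_layer = model_file_gib / n_layers          # average size of one offloaded layer
    return int((vram_gib - reserve_gib) // per_layer)

print(max_layers(9.8, 40, 10))   # 13B q5_1 on a 10GB card -> ~34 layers
print(max_layers(24.4, 60, 12))  # 33B q5_1 on a 12GB card -> ~25 layers

That lines up, give or take a couple of layers, with the 32 and 23 people are reporting above once the context buffers eat their share.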
>>
>>93682949
Dafuq nigga? Earlier I was able to go beyond that context length on the same model and same GPU. After reading your post, it froze.
>>
>>93683692
It’ll fit on two 4090s once it can be quantized, but if you have that you can just run 65B quantized.
It’d be a matter of seeing if it really is better.
There’s something fucky about quantizing it though. Over my head.
>>
>>93684240
Quite a pivot. From a toy for zoomzooms to the main hosting provider and core stomping grounds of the entire current machine learning community.


