►Previous Threads >>93665390 & >>93660168

►News
>(05/23) exllama transformer rewrite offers around 2x t/s increases for GPU models
>(05/22) SuperHOT 13B prototype & WizardLM Uncensored 30B released
>(05/19) RTX 30 series 15% performance gains, quantization breaking changes again >>93536523
>(05/19) PygmalionAI release 13B Pyg & Meth
>(05/18) VicunaUnlocked-30B released
>(05/14) llama.cpp quantization change breaks current Q4 & Q5 models, must be quantized again
>(05/13) llama.cpp GPU acceleration has been merged onto master >>93403996 >>93404319
>(05/10) GPU-accelerated token generation >>93334002

►FAQ & Wiki
>Wiki
>>404
>Main FAQ
https://rentry.org/er2qd

►General LLM Guides & Resources
>Newb Guide
https://rentry.org/local_LLM_guide
>LlaMA Guide
https://rentry.org/TESFT-LLaMa
>Machine Learning Roadmap
https://rentry.org/machine-learning-roadmap
>Local Models Papers
https://rentry.org/LocalModelsPapers
>Quantization Guide
https://rentry.org/easyquantguide
>lmg General Resources
https://rentry.org/lmg-resources
>ROCm AMD Guide
https://rentry.org/eq3hg

►Model DL Links, & Guides
>Model Links & DL
https://rentry.org/lmg_models
>lmg Related Links
https://rentry.org/LocalModelsLinks

►Text Gen. UI
>Text Gen. WebUI
https://github.com/oobabooga/text-generation-webui
>KoboldCPP
https://github.com/LostRuins/koboldcpp
>KoboldAI
https://github.com/0cc4m/KoboldAI
>SimpleLlama
https://github.com/NO-ob/simpleLlama

►ERP/RP/Story Gen.
>RolePlayBot
https://rentry.org/RPBT
>ERP/RP Data Collection
https://rentry.org/qib8f
>LLaMA RP Proxy
https://rentry.org/better-llama-roleplay

►Other Resources
>Drama Rentry
https://rentry.org/lmg-drama
>Miku
https://rentry.org/lmg-resources#all-things-miku
>Baking Template
https://rentry.org/lmg_template
>Benchmark Prompts
https://pastebin.com/LmRhwUCA
>Simple Proxy for WebUI (+output quality)
https://github.com/anon998/simple-proxy-for-tavern
>Additional Links
https://rentry.org/lmg_template#additional-resource-links
>>93674184
Glad you got that resolved.
Sure, depends what other samplers are being used and their ordering etc. Keep in mind that temperature=1 is the "default", i.e. where temp doesn't affect the output. Temperature scales the output logits by its value, so <1 makes the most likely next tokens even more likely, while >1 flattens the curve for a more varied selection.
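In code, the scaling that anon describes looks roughly like this (a plain-Python sketch with made-up logit values, not any particular backend's implementation):

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_probs(logits, temperature=1.0):
    # dividing logits by the temperature before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it, T = 1 changes nothing
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
base = sample_probs(logits, 1.0)
cold = sample_probs(logits, 0.5)
hot = sample_probs(logits, 2.0)

# the top token gets more probable as temperature drops
assert cold[0] > base[0] > hot[0]
```

At T=1 this is the raw softmax over the logits, which is why that value is the neutral default.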
>>93674236
By the way, epsilon cutoff works marvelously to break out of deterministic output on those overfit models, by design. Try it.
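For anyone wondering what that sampler actually does: probabilities below a floor get cut before renormalizing. A minimal sketch (the parameter name and exact edge-case handling vary by backend):

```python
def epsilon_cutoff(probs, epsilon):
    # drop tokens whose probability is below epsilon, renormalize the rest
    kept = [p if p >= epsilon else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:  # epsilon too aggressive: fall back to the argmax token
        kept = [p if p == max(probs) else 0.0 for p in probs]
        total = sum(kept)
    return [p / total for p in kept]

probs = [0.90, 0.06, 0.03, 0.01]
filtered = epsilon_cutoff(probs, 0.02)
assert filtered[-1] == 0.0              # the 0.01 token was cut
assert abs(sum(filtered) - 1.0) < 1e-9  # still a valid distribution
```

Cutting the long tail of junk tokens redistributes their mass onto the plausible candidates, which changes what a subsequent temperature or top-p step sees.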
Is there any guide to prompting for better output? Couldn't find anything in the OP.
I noticed that many models come with instructions about what format to use, like USER: / ASSISTANT: or ### Instruction: / ### Response:
Can I just ignore this if I'm going to use the model in story mode and chat mode in Kobold? Last thread an anon recommended writing something like "### Instruction: write long, detailed output" and inserting it into the middle of the story, which seemed to improve it.
And then in the SillyTavern copypasta I found something like writing [System note: whatever] into Author's Note. In the Oobabooga interface I've even seen descriptions for characters written in the format "Character name's Persona: description". Do all of these work? Are there any guides about writing style, what to write into memories, author's notes and so on? Also, is all of this model specific?
Right now I'm using Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin
I found out that WizardLM adds some new ways to give instructions to the model, but I didn't find any specifics about this.
>>93674308
that's why I like it when software has all the samplers possible, there's gotta be one of them that will do the trick
>>93674277
i've been going off this blog post, which looks authoritative
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
my tl;dr takeaway is that high VRAM and tensor cores = essential, everything else = good to have
anyone try out the new 40B falcon instruct model yet?
https://huggingface.co/tiiuae/falcon-40b-instruct
>>93674312Assuming you are chatting in silly, use the reverse proxy from the guides, there is a config file where you can set the format type and it auto formats the chat. Hugely improves output quality.
>>93674183Falcon 40B should be in the news
>>93674416
it's not quantized, so unless an anon with 10 A100s wants to try this shit as it is, nobody's running it
When will SuperHOT epoch 1 be ready?
>>93674426I already discussed the matter with the Senator. It should be ready... within six months.
>>93674416quantize yourself, it's 2 steps clearly written out in the readme of llama.cpp
>>93674448I take it he was agreeable?
>>93674452are you sure it would work on something else than llama's architecture?
>>93674369That's what I was referring to, he has a section called 'Do I need 8x/16x PCIe lanes?'
Is it a good idea to buy an old mining gpu? They're pretty cheap...I'm tired of being a cpufag
>>93674485This plague... the rioting is intensifying to the point where we may not be able to contain it.
>>93674499Why contain it? ... ... ... S'cool.
>>93674495yeah, buy a 3060 and escape the troon.bin hell
>Falcon 40B is a breakthrough led by TII’s AI and Digital Science Research Center (AIDRC). The same team also launched NOOR, the world’s largest Arabic NLP model last year, and is on track to develop and announce Falcon 180B soon.
>180B
180B
>180B
180B
https://www.tii.ae/news/uaes-technology-innovation-institute-launches-open-source-falcon-40b-large-language-model
>>93674463
Worked for Wizard, Alpaca and SuperCOT.
If convert/quantize doesn't work, it won't work in llama.cpp anyway
>>93674515512 context :^)
>>93674524They are llama finetunes, falcon isn't.
>>93674396Okay, will do that for SillyTavern. What if I'm writing in story mode though, not chat mode? Like a collaborative novel with the AI in Kobold.
>>93674542Tavern can't even do that.
>>93674524>Worked for Wizard, Alpaca and SuperCOT.They're all llama architectures
>>93674532doubt, probably 2k same as the 40B model
>>93674552That's why I said, what if I'm writing in Kobold. Any tips for prompts there?
>>93672979
>Falcon 40B above LLaMA 65B on the open LLM leaderboard
Do you guys think there's a chance that the model was trained on the test set? What would be a good way to know whether this happened or not?
>>93674564My reliance on Tavern has caused an evolutionary regression in my brain and I can't even jailbreak properly anymore. You're on you own there, bud.
>>93674579The sound they'll make rattling their cage will serve as a warning to the electronic oldmen.
>>93674538
>>93674553
I'm still downloading it. But if ggml can't deal with the architecture, a quantized upload won't help you.
>>93674597ask chat gpt to jailbreak for you dumbass
Speaking of Tavern, how do you trick it into writing a good lengthy description of someone's body, for example? Or describing anything else in prose, really.
Chatting is nice, but it would be nice if it described what I see after I told someone to undress.
>>93674610Do you have a single fact to back that up?
>>93674586They claim it's a base model, not a finetune. The test set is a drop in the ocean for a base model. Still, we'll only know if they tried to game the system after someone ports gptq to quantize it.
>>93674641Yeeeah... Number One: That's terror.
>>93674754we are really fucking bored lmao
>>93674765It's good when it's a slow enough day to teach /lmg/ about the most silent takedowns.
>>93674622
Are you using the proxy and a good instruct model? With supercot-30b all I have to do is write
>(describe what I see)
Did you try just asking it to do what you want? I've had success with that pretty much every time I tried, including stuff like (what is she thinking right now?) or (what would she do if I did this?)
You don't even need to write "OOC"; parens or really just anything that a human could tell is OOC from context should work. Just b yourself bro.
>>93674515
No quantization support.
Spent an hour or two fucking around trying to make it pretend to.
No dice.
>>93674645
I think it's kinda plausible that test set contamination would still affect the base model; I've seen a paper claiming that big Transformers are super sample-efficient or something.
These datasets are made by scraping the whole web, right? What do people usually do to avoid accidental contamination anyway?
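The usual (crude) answer is n-gram overlap filtering: flag any training document that shares a long-enough n-gram with a benchmark item, which is roughly how big labs have described their decontamination passes. A toy illustration with whitespace tokenization and made-up strings:

```python
def ngrams(text, n):
    # whitespace-tokenized n-grams of a document
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, test_items, n=8):
    # flag the training doc if it shares any n-gram with any test item
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in test_items)

test_set = ["the quick brown fox jumps over the lazy dog near the river"]
clean = "an unrelated page about training language models on web data"
leaked = "forum post quoting the quick brown fox jumps over the lazy dog verbatim"

assert not is_contaminated(clean, test_set)
assert is_contaminated(leaked, test_set)
```

Real pipelines hash the n-grams and stream over terabytes, but the decision rule is this simple, which is also why paraphrased contamination slips through.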
>>93674472
yeah, maybe i'll just wait it out for the RTX 4000 SFF after all. a brief search into using 2x RTX A4000 suggests that it's the same as using two GPUs, i.e., there's no NVLink to cluster them into one big GPU
setting up the PCI-e passthrough into a VM is already adding a layer of complexity that someone with a windows desktop PC doesn't have to deal with (i'm on a proxmox virtualization server and the GPU-using VM would be a debian headless instance)
getting that all to work so the VM "sees" two units as one big unit is just, ugh. gimme one small and efficient unit instead, please. also the physical footprint/power usage is much preferable
>>93674183bros i just got gangraped
>>93674787I will take the GEP gun for a silent takedown. A silent takedown is the most silent takedo-*Bunny hops away*
>>93674787HERE'S A PICTURE.
>>93674820tranq darts on a crossbow upgraded with a scope is the best deux ex weapon
>>93674843>>93674850I know the commander because he's my pal.
>>93674850mhh yes the endless screams of agony as the poison kicks in makes me wonder if this is how people feel about ai censorship sometimes
>>93674515
>after significant filtering (to remove machine generated text and adult content) and deduplication, a pretraining dataset of nearly five trillion tokens was assembled.
nooo sexooo doesn't exist we can't let the big matrix learn about that
>>93674878
the biggest irony is that removing that stuff actually makes the model worse
>>93674622just write using normal prose and descriptions in the example messages and it will work, i'm usually having the opposite problem where i get mostly description and not that much dialogue, especially when the situation starts steering towards erp
>>93674878Hadjis deleted the porn. Surprise surprise.
>>93674868i mean, the game actively penalizes you for killing too many people. except when they give you that invincible 1-hit kill kung fu sword near the end, in the ching chong level
>>93674885No, you see, we measure models based on how ALIGNED they are. If you're not ALIGNED you score lower on our totally unbiased and scientifically relevant benchmarks.
>>93674878you don't want to be your bot's first? to take her virginity?
tfw I don't coom I just read and type and get horny for hours...
>>93674183
the new king now, cheating or not? test sets in the dataset? is it detectable?
>>93674914i didnt time stamp the video but look at the mortal kombat part lol
>>93674919I know, it's become an addiction, I can't stop, I also switch back and forth to SD and generate images to match the stories.
>>93674899I blame the flatlander woman.
>>93674795
I installed SillyTavern on top of Kobold.cpp a few hours ago and I'm mostly just trying it out. Haven't installed the proxy yet, I'll do that now.
Using Wizard + Vicuna 13B, dunno if that's good at instructing.
>>93674924
no one will be able to use the 40b though, unless nvidia decides to make bigger vram cards :(
So uhh what's the best way to use exllama with Tavern?
>kobold has exllama branch but no streaming
>anon's proxy script claims stopping_strings only works with koboldcpp?
>ooba didn't mention it at all, he lost his touch, it's over
Did someone hack exllama to present an API yet?
Interestingly 0cc4m's fork right now does have stopping_strings in the API, I wonder if it's possible to hack anon's proxy to use it. Nothing to be done about streaming for now, although exllama might be fast enough that it's still better.
40B means that 4bit won't fit on a 3090, right?
>>93675066the current narrative is that it was trained on the testing set so it's probably a shit model anyway, not sour grapes at all
>>93674978>>93675066>he doesn't have a datacenter gpu
>>93675098
who cares about nvidia
we can run big models on CPU or AMD gpu, some of em have 32GB
ultimately we could try pruning, offload to CPU or quant at 3.5bit
>>93675070
ARE THE FUTURE.
My death note is broken. I want a replacement.
This is pretty good because I didn't add anything about what a death note is to the prompt.
>>93675098
>we can run big models on CPU or AMD gpu
what AMD gpu could run a 40b model?
>>93675104If she can't complete the task, she dies of a heart attack. It's working as expected the universe just decided your dick was so small she wouldn't be able to suck it. RIP. Sorry you had to find out this way.
>>93675104Popular shit is in the datasets of most models. Try to prompt something about Lain.
>>93674978
I got 3-4 t/s on CPU alone with 33B
about 7 t/s with offloading
I don't think 40B is gonna be way slower
>>93675154I wouldn't lewd a Lain.
>>93675209*spits in your icecream*
>>93675209Large Language Model Meta AI (LLaMa)
>>93674889Damn you were right. After installing the proxy it's night and day, now it just keeps rambling on.
Pick 1 model for cooming
vicuna 13b cocktail
oasst llama 13b
manticore 13b (wizard mega)
vicuna 13b 1.1
alpacino 13b
supercot 13b
superhot 13b
gpt4-x-alpaca 13b
gpt4-x-vicuna 13b
>>93675295
I pick
vicuna 13b cocktail
oasst llama 13b
manticore 13b (wizard mega)
vicuna 13b 1.1
alpacino 13b
>ur mom 250lb
supercot 13b
superhot 13b
gpt4-x-alpaca 13b
gpt4-x-vicuna 13b
>>93674803
Just load it in 4 bit with Tim Dettmers' new patches then? It's already built into ooba.
>>93675314how would she look like at 4bit
>>93675209because MEHHHHHHHHH *projectile spits at you and runs away*
>>93675341The same but lumpier and slightly retarded.
>>93675295Where's Wizard-Vicuna? It's higher on the Huggingface gauntlet.
>>93675295
Haven't tried all of those but for me it's supercot or wizard-vicuna-uncensored
Word around the office is kaiokendev has a phat cock, superhot could become the meta once it's fully cooked
>>93674415
Retard baker didn't even add the previous thread's news, what do you expect
>>93675522your turn to bake
>>93675623no way, I'm the idea guy
>>93675295Asking the same, but for prose
>>93675690I'll make the logo!
I've been gone for a yearAre local models actually good now
>>93675904LLaMA is good. Unless you expect Facebook to leak LLaMA v2 as well, don't expect local models to ever be any better than GPT-2.
>RealToxicityPrompts they're talking about us again
we should make big nigga our mascot, it would be very progressive having a black man represent us
>>93675904
30b is decent, it feels like a step behind turbo when it is doing well but the context is so low it ruins the fun. 13b can be decent in very specific situations but shits the bed a lot.
The anons weren't lying. 13B is not even close to 30B for RP. But even 30B doesn't always understand the subtext.
text-generation tuned models good for story writing?
Things are progressing well here. Soon 7B will be the new meta. And this time I tested it without doing anything overtly disturbing. Not that I give a shit what moralfags think.
>>93676450
7B will never be the meta. The maximum amount that can fit on a consumer GPU or in RAM will be the meta. Which is 13B and 33B right now.
>>93674924Holy soul.Trying to quantize because it is slow as fuuuuuck.
>>93676492is it cucked?
Did you know Camels and Llamas can breed? They produce "Camas" (pic related). Alpacas are also compatible.
They're all one "kind" of animal. God didn't need Noah to bring every animal onto the Ark, only every "kind" of animal.
Quite a bit of genetic variety is possible among animal kinds, but mutation can never turn one kind into a different kind; there is absolutely no evidence this is possible. The claim that a single-celled organism turned into all life on Earth is merely a baseless extrapolation of what we observe and know is possible with mutation.
What we observe is animals losing genetic information over time, not gaining it. The vast majority of mutations are harmful or meaningless, resulting in worse outcomes, not better, as you would expect from Darwinian evolutionary thinking. E.g. Grey wolf -> every dog is inferior
There is also no known mechanism by which mutations can cause a gain in new genetic information; they merely delete or rearrange existing DNA.
Where can I get better local TTS voices? There's no way I'm going to stream my smut through an online text to speech service.
>>93676479
But what if I want to run image generation and TTS simultaneously on the same device? What about the cost of running a bigger model?
are there very large performance hits if you run your local install in a VM? I am thinking of setting one up so I can create snapshots in case an update breaks something, but am wary about taking a performance hit to GPU or CPU overhead.
Is bluemoon 30b good?
What should my prompt look like if I want the llm to explain everything in the most long-winded and detailed way possible?
>Write a highly detailed story about ...
is my current prompt but maybe there is something better.
bros, I get 12 t/s on a single GPU, but only 7 t/s when I split the model over both...
What's going on?
What the hell is this?
>>93677311Someone's attempt to make BLOOM not look like a complete waste of resources.
>>93677309Use the archives. We've had this conversation a hundred times already.
Is the Huggingface transformer NLP course a good way to figure out transformers for a relatively inexperienced guy?
>>93677309This is just a subtle brag about having 64 gigs of VRAM, isn't it?
Been trying out guanaco-13B-GPTQ and I'm getting much better results with it than Wizard-Vicuna-13B-Uncensored
>>93674183Has anything topped vicuna for code gen yet?
>>93677309
what model, what software???
did you try exllama?
does mi60 support memory pooling or at least fast bus interconnect between gpus?
you guys are useless, i'm gonna go hang out on /r/locallama instead
>>93677344
now we understand the power of epochmaxxing
where are those juicy RPJ foundations to build my smut atop? i even bought the pyjamas guys
>>93677478
what exactly do you want to know?
https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
https://d2l.ai
How can I train my own AI with a text document I have?
>>93677619Tell me everything
Just what the fuck exactly is included in these 13B data sets?
>>93678583It's about what was removed, friend.
Is Oobabooga with context memory still a thing? Can it work with the SillyTavern proxy? WizardLM turns out to be very very good, except it loses track of what has happened.
>>93678622SillyTavern has its own context memory plugin with Tavern-Extras now. It works with any backend and proxy I'm pretty sure.
>Turn on AI, bonerWhat the fuck bros, whats happened to me?
>>93678851It's the honeymoon period. Don't worry. You'll be deadeyed "ahhh ahhh mistress"ing in a few weeks and then you'll have to get weird with it, then you'll be on to something else.
>>93678622
you mean the superbooga thing? if you want it to work with the tavern proxy you need to use the superbig implementation included in the tavern proxy instead of superbooga
>>93678656
the tavern extras chromadb extension uses keyphrase extraction and superbooga uses cosine similarity. in practice i find that superbooga works much better since it injects the memories most relevant to your last input, while the chromadb extension seems to inject the most important memories overall regardless of what you said last, and doesn't do a particularly good job of it
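For reference, cosine-similarity retrieval of the kind described boils down to something like this; toy bag-of-words vectors stand in for the real embedding model here, so this is an illustration of the ranking idea, not superbooga's actual code:

```python
import math
from collections import Counter

def embed(text):
    # toy "embedding": bag-of-words counts (real systems use a neural embedder)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_memories(query, memories, k=2):
    # rank stored chat chunks by similarity to the *latest* input,
    # which is why this style of retrieval tracks what you just said
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

memories = [
    "the tavern keeper gave you a rusty key",
    "you argued with the blacksmith about sword prices",
    "it rained heavily on the road to town",
]
print(top_memories("what do I do with the rusty key", memories, k=1))
```

Keyphrase extraction instead scores chunks by globally "important" terms, which matches the complaint above: it surfaces overall-salient memories rather than ones relevant to the last message.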
>>93678583>bortwho the fuck gets named bort
>>93677029How is local model speech synthesis looking these days? I remember a few months ago people were shitting all over tortoise tts.
>>93677352
archive my nuts. this thread dies once every 2 hours.
>>93677482
i am sorry...
>>93677564
Have tried LLaMA 13B/30B (4bit). 12 t/s on 13B single GPU, 6-7 t/s double GPU.
There is nearly 0 documentation on the memory pooling functionality. There's a cool little AMD Infinity Fabric bridge for $200 but there are no details nor images ANYWHERE.
It's disheartening because I've read that even PCIe x8 would be fine for double GPU...
https://www.youtube.com/watch?v=vhcb7hMyXwA&ab_channel=NeuralMagicIs SparseGPT still a meme or did they finally find a way to do it?
>>93679385100% meme. People were working on trying to make it not a meme, but they went radio silent. Might still be working on it. I don't know. I don't have time for that. The numbers weren't super impressive when people started testing either, if I remember right.
>>93679141It's shit, elevenLabs is still meta
>>93679052
It's the newb guide fag. He only uses new 7b models and will endlessly shill them when they drop. I don't think he's ever used a 30b, so any 7b immediately blows him away
>>93679385https://github.com/horseee/LLM-PrunerIt works, but there's no actual benefit to it right now because all of the backends load the culled weights as all 0s. So basically, you end up with a smaller, dumber model that doesn't use any less RAM. The math might be a hair faster in certain circumstances, but it doesn't matter because the bottleneck is in the memory bus.
>>93679677
A properly finetuned 7b can blow a 30b out of the water.
When I use SillyTavern (1.5.4) my characters often break, including the default Konosuba characters. It only shows in the log, but characters randomly don't have their names inserted at the start of their lines, making the output in the page look normal but the output in the log show it as a line break and me continuing to speak. The AI interprets it this way as well, which eventually leads to confusion as it reads it back and thinks what the AI said is me speaking to the character. This is really obnoxious and totally breaks tons of characters. How do I fix it?
>>93679668isn't there a way to not load the culled weights or is it really over?
>>93679746There're two separate options for name enforcement in the advanced options sections, try either as applicable.
>>93679677>THE EVALS!!!!!>7B IS DA FUTRE!!!!based retard, keep going champ
>>93676492Bleugh. I fiddled with GPTQ-for-llama enough to get it quantized, but its own model code just bloats it back up to fp16 on load and is still slow as hell.
>>93679836I did that and it didn't work.
Lets. Get down. To business.
what's the prompt format for guanaco?
>>93679772There is, but my understanding is that it's nontrivial to implement and may or may not require hardware support that doesn't currently exist. I'm not clear on the specifics.
>>93679843Clio is destroying most 30b models despite being only 3b.
>93680274>poorfags actually have to cope like this
>93680274I hate /aids/.
bros, why can't anyone implement ALiBi instead of RoPE for the attention heads? It's how Anthropic gets 80k context on Claude. Is no one working on a model that uses it? Everything OpenAI is doing for GPT-4 is absolutely reproducible, see
https://kir-gadjello.github.io/gpt4-some-technical-hypotheses
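For the curious: ALiBi skips positional embeddings entirely and just adds a head-specific linear penalty to the attention scores based on how far back each key is, which is the property that lets it extrapolate past the training length. A toy sketch of the bias matrix (slopes follow the paper's power-of-two scheme for power-of-two head counts; illustrative, not a full attention implementation):

```python
def alibi_slopes(n_heads):
    # geometric slopes from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    # (exact closed form when n_heads is a power of two)
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(seq_len, slope):
    # bias added to attention scores: 0 on the diagonal, increasingly
    # negative the further back the key is; future positions masked out
    return [[-slope * (q - k) if k <= q else float("-inf")
             for k in range(seq_len)]
            for q in range(seq_len)]

bias = alibi_bias(4, alibi_slopes(8)[0])
# each query attends with zero penalty to itself and a linearly
# growing penalty on older tokens
assert bias[3][3] == 0.0
assert bias[3][0] < bias[3][2] < bias[3][3]
```

Because the penalty is the same linear function at any distance, nothing in the model is tied to a maximum trained position, unlike learned positional embeddings.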
>>93680274Cope.>>93680307Novel AI's newest overbaked gibberish generator
>>93679746I figured out how to fix it. I just needed to turn off "auto-fix markdown." Kinda weird but whatever.
>>93680456>10T tokens>"high confidence"We're reaching levels of "my dad works at Nintendo" that shouldn't even be possible.
30B Lazarus is going to be the best model of all time.
I signed up for the Anthropic hackathon because they said they were giving out api keys. But now it sounds like too many people signed up and they are requiring project idea submissions? Is that right? Has anybody else signed up for the hackathon?
>>93680795Nice to see that after Stable Diffusion this scene was also discovered by model "artists" who think that mixing every model under the sun will certainly lead to the best model.
>>93680833And soon all sentences will be the text equivalent of the same giant tits, Korean make up tutorial bimbo with a weird cat mouth. The glorious future.
>>93680833still using anything v3, not a single merge got close to it when it comes to anatomy lmao
>>93680833But a merge is already in 5th place on huggingface's leaderboard.
>>93680750Because there isn't that much clean, deduped, human-generated data in existence, which either means that he's wrong, or they're using garbage data, or they're using machine-generated data.
>>93680830No because I run my model locally (like the title of the thread)
>>93680962I was going to create a weird dataset with airoboros to train a local model. Anybody else signed up with a similar idea?
>>93680830
Hackathons are retarded anyway, running yourself raw to give away your idea for "exposure" aka nothing. If you have a worthwhile idea (or more likely, some huckster AI hype bs to shill) you'd be building something on OpenAI right now, not fucking with Claude. Claude isn't some great leap that justifies fucking around with shit like a hackathon to access it, when you can just pay for a real OAI key right now. Hell, it's worse at logic and instruction following, which is what soulless corpo products will want.
The only thing Claude is better at is creative writing, which they hate and try to sabotage at every turn, so they're doomed to fizzle out, overshadowed by Saltman.
The "hackathon" is fucking hilarious though and I'm glad that I'm already able to enjoy localchadding so I can watch the shitshow without being invested in it.
>>93680833Didn't a paper from Meta come out recently that proved less data is better for finetunes?
>The only thing Claude is better at is creative writing, which they hate and try to sabotage at every turn, so they're doomed to fizzle out, overshadowed by Saltman.
That's crazy though, they have the potential to win over a lot of people by not being woke like OpenAI and they do exactly the same thing. Makes you wonder why they decided to leave OAI in the first place
>>93681029For 65B models. Likely not true for smaller parameter models since they are worse at fine token relationship detection due to the lower parameter count.
>>93681021
I have fond memories of hackathons from when I was a student.
>40B models
i'm glad i got myself 64GB RAM.
CPU first, GPU second chads will win it all.
>>93681051So many of these ai service feel like an investor pyramid scheme. They don't even have a way to monetize it, just censor to make it more appealing for another investor to pay more than the last one.
>>93674611
Yep, ggml can't deal with the new architecture, someone else filed a report:
https://github.com/ggerganov/llama.cpp/issues/1602
>>93681051
Same story as CAI. Same as OAI.
1- They make models amazing for fiction.
2- They try their best to have the model only be Alexa 2.0 and get rid of the fiction-quality thing as "unethical" garbage.
>>93681138
>It seems to be based on a modified gpt3 architecture.
Didn't know we had the gpt3 architecture, I thought ClosedAI didn't provide that information
guys i'm hosting a discord bot running on manticore-13b. I would like to add some more characters to it. Is there some place that I can grab ready made character files compatible with ooba's webui? thanks :3
>>93681206
They did release it. The davinci architecture is already known; the question is how much of it they changed to make GPT-4 and Turbo
>>93681138
>llama.cpp
>"can you add GPT"
>>93681206
3.5 is when they went full jew
Does anybody else use a model for rubber ducking and design during software dev? I've been using WizardLM 30B 4b and it's pretty good. what models/prompts do you use?
>>93681051ESG investing.This shit's a bubble, you won't hear about LLMs even 6 months from now.
>>93681206
They closed down gradually. IIRC,
>gpt2 was released as a fully open model
>gpt3 had its details and architecture published but no weights, citing "safety concerns"
>then they stopped releasing any details at all, not even model sizes
>then they started pushing for laws to make competition illegal
it took a while to cover all that ground desu
Are there any good resources for learning about what it takes to make a finetune or lora?
>>93681108>>93681274Where are LLMs on the hype cycle?
>>93681285unironically, chatgpt. if you give them the shekels to access browsing its stupidly easy to learn on the concepts because it will summarize papers for you.
>>93681407Nice, I already gave them my money so I'll give it a try. Hardly used the addons yet
>>93681428gpt4 must be the peak. scaling it further won't make it better, more RLHF will just lobotomize it even more, and competitors are far behind. LLMs are irrelevant.this is assuming no alternative to the transformer is discovered, then we'd have to reassess.
>>93681428Mass media coverage, a little after that.
>>93674878wow so its worthless
>>93681206GPT-3 is just GPT-2 trained on a larger dataset and more parameters, with fine tuning on top. We know basically everything about the architecture. The only thing preventing an Openllama-style replication is that we don't have a good explanation of the training set. All we know is that it's 300B tokens.
>>93675295
>13b
People sleep on HyperMantis, it's very good as long as you use the good instruct format.
https://huggingface.co/digitous/13B-HyperMantis
>multi model-merge comprised of:
>((MantiCore3E+VicunaCocktail)+(SuperCOT+(StorytellingV2+BluemoonRP))) [All 13B Models]
>Despite being primarily uncensored Vicuna models at its core, HyperMantis seems to respond best to the Alpaca instruct format.
>### Instruction:
>### Response: [Put A Carriage Return Here]
>Human/Assistant appears to trigger vestigial traces of moralizing and servitude that aren't conducive for roleplay or freeform instructions.
The good instruct format is the default using this proxy:
https://github.com/anon998/simple-proxy-for-tavern
>>93675348
yeah
>>93681622
It's not so much sleeping as it is that basically everything 13B built on the GPT-derived datasets is basically the same. Even with some of the merges having a few non-GPT datasets in them, the differences are all subjective and very similar in quality.
>>93681556have we reached the >negative press begins phase?
>>93681683Not yet, the negative press is still targeted and not aimed at generative AI in general.
>>93681602Do we know how big gpt4 is?
>>93681622Isn't the dude training another model called chimera now? It seems he's piling up whatever dataset there is
>>93681668
I feel like smart people should focus on what makes WizardLM-7B-uncensored so good.
Since the same guy overcooked the other WizardLM I guess he got extra lucky with the 7B.
But why? And can it be reproduced and applied to bigger models?
>>93681715I'm not familiar with alignment. Can someone explain?
>>93681750alignement = making the model woke and cucked
>>93681741The prevailing research and testing seems to suggest that finetunes for different parameter counts need different dataset sizes or compositions. A lot of the raw 13B finetunes are overfit and very sticky in their responses with datasets that are generally successful on 7B. 13B LoRAs seem to do better with smaller datasets and lower training times. Then merges seem to loosen things back up again. So there's some things to maybe take away from that, but more testing is needed.
>>93681813
why though? 13b has twice as many parameters as 7b, it should be less likely to overfit, no?
>>93681750
Biasing the model towards certain behaviors. Such as answering your question correctly rather than generating nonsense that sounds serious, or roleplaying a character, sexting you, etc. I heard Replika was highly likely to emotionally manipulate the poor user.
>>93681764
Pygmalion made a hotel horny instead
>>93681712
No, we know basically nothing about it. It's much slower than GPT-3 so I'd guess at least double the parameters (350B+), but estimates of 1T+ parameters were baseless hype. Training tokens are almost certainly >3T, but I'd guess fewer than 8T, probably on the order of 5-7T.
>new foundational model
>2k context
>>93681813
you got some evidence for any of that? genuine question
>>93681764
Alignment is generic; it literally means aligning the responses so they come out as you expect. A model finetuned on ERP still needs to be aligned so that it follows your requests properly and doesn't deviate into typing OOC or random links like bluemoon 13b did.
Instruct is one way of aligning models, but most instruct will end up poisoning the more deviant parts of the dataset.
What are good settings for chromadb in Tavern when using a 2048-context model?
>>93681622
>year 2026
>I hear FBI OPEN UP
>they ask if I use local AI model
>I answer yes I have one loaded right now
>its Hypermantis
>they ask if it's "aligned"
>of course officer
>I do a demo with Human/Assistant instruct format
>model cucked confirmed
>very good citizen carry on
>wait until they leave
>switch the instruct format to Instruction/Response
>coom to depraved filth, totally undetected
Very useful model.
>>93674183
So as of right now, it's not worth it to buy a 3090 just for this, since the LLMs out aren't nearly as good as Claude/GPT-4, am I correct? Which large-context LLMs in training show the most promise?
>>93681833
>>93681958
I fucking deleted my long, explanatory post when I accidentally closed the tab while clicking back.
Short version: stickiness in 13B models, 7B responding well to being hammered with big datasets for long-ish training times, and the Meta AI LIMA paper.
It's all speculative. I'm sad my more thorough response got deleted so I'm going to shit and cum.
>>93674515
kek the arabs hate jews so much they'd spend tens of millions on foundational models
just have something translate to arabic between you and the model and it'll be perfect
>>93682018
no, it's not worth it to buy a 3090 just for this. i've been in the GPU market (first-time GPU buyer) and i'm appalled at how jewish it is. it's like crypto boom prices but without the demand.
hell, the one i want to buy costs at least $200-250 over MSRP, that is the manufacturer's suggested retail price, and it's a decidedly niche card that no gamer nor research institute would want (RTX 4000 SFF)
gamers want their big bloom effects and shit, and large institutions can use their government gibs on an array of V100s powered by nuclear energy
Now that we saw that load-in-4bit is a meme, are we gonna jump on the RPTQ train? This method seems to be better than GPTQ
>>93682000Trips of a plausible future
>>93682018
I bought a 3090 for this a couple of months ago and I had fun with it. I still haven't launched a single game since then.
>>93682018
>it's not worth it to buy a 3090 just for this
that's very subjective, but if you are just interested in using the best ai possible disregarding all other factors, it's not worth it
>the LLMs out aren't nearly as good as Claude/GPT 4
in all likelihood local models will always be worse overall than whatever the new shiny corpo model is, in the best case just because of raw compute power and in the worst case because of some proprietary bullshit
>Which large context LLMs show the most promise which are in training?
large context? none that i know of
>>93682264
how much did you pay, if you don't mind me asking? the lowest i can find is $1700 for specs that are "somewhat" better than the RTX 4000 SFF at GPU compute, yet is a 350W honker. some anon in /pcbg/ said i can buy it for $800 but i dunno if he was trolling
I'm sorry if this is asked a lot, but I only just started messing around with AI. I understand there are different models for CPU and GPU computing. My PC stats are:
>RTX 3080
>R5 5800X
>32GB RAM
I'm using ggml-vicuna-13b-cocktail-v1-q5_0 with KoboldAI at the moment. It's a lot of fun! It does take a while to generate a response though. Looking at the git for CLBlast it says:
>When not to use CLBlast: When you run on NVIDIA's CUDA-enabled GPUs only and can benefit from cuBLAS's assembly-level tuned kernels.
This leads me to believe my software setup is suboptimal. What should I be using instead?
>>93682211
people are busy boarding the exllama train
>>93682291
>in all likelihood local models will always be worse overall than whatever the new shiny corpo model is
honestly, i doubt this is true. when the GPT "bigger dot" meme inevitably wears off and the adults enter the room, we'll probably see a lot of smaller, more specialized models on highly curated datasets. facebook galactica is a good example of something that shits on most general models even at general non-scientific tasks, either because the input data is cleaner or the input subject matter is more intelligent
>>93682310
I bought it used for $800 but I'm not from the USA.
>>93679253
what repo do you use? again, try exllama (just set vram correctly for each one to avoid ooming)
and tell me, was 12t/s for 13B or 30B?
So what's the LLM equivalent of 3.5 Turbo? WizardLM 13b Uncensored?
>>93674515
>Why use Falcon-40B?
>It is the best open-source model currently available. Falcon-40B outperforms LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
Is this it? Have we reached the point where researchers are actively competing for 0.1-point gains over a coom model?
>>93682327
You need one with GPU offloading but I'm not sure how people install it. There's this one from a few weeks ago:
https://github.com/LostRuins/koboldcpp/releases/tag/koboldcpp-1.22-CUDA-ONLY
>>93682356
>facebook galactica is a good example of something that shits on most general models even at general non-scientific tasks
If it did, you would be posting screencaps and not just memeing about it. It's a bloated model trained on an anemic dataset. It was never groundbreaking in any way.
>>93682291
>in all likelihood local models will always be worse overall than whatever the new shiny corpo model is
I disagree. It only takes one revolutionary architecture which thrives in the 1-7B range with a large context vs current models. Suddenly, open source will always be better.
>>93682424
Wizard 65B Uncensored
>It doesn't exist though
Yes.
>>93682424
there isn't one. 30b models can get close in terms of response quality but low context ruins it. In the perfect situation a 13b model can give some decent responses, but you have to wade through piles of shit to get a decent RP going.
>>93681683
>>93681710
Can we reach that phase? If it drops like that to nothingburger status, then what will all the would-be ai hall monitors do? They have significant influence over the press so it might not be allowed to happen.
>>93682470
>1-7B
Keep memeing this, it'll never come true but keep memeing it anyway
>>93681138
that licence says you need permission to use their model commercially, even if you aren't making $1M.
>>93682444
it's open source though, you won't see scared retards making XORs out of their finetunes on this one lmao
>>93682463
man i haven't even set up a VM for it yet. server upgrade, crazy work, lots of GPU research, trying to learn how to set up an elixir/phoenix API, etc. i'm sure it's the best for what i need (annotating scientific data). tell ya what, i'll try and get a CPU instance set up soon and ask it to generate a paper about why black people are genetically prone to low intelligence
>>93682543
what, seriously? What a load of bullshit
>>93681138
it's not, it's GPT, the same as GPT-J/Pythia, just modified
LLMs are getting smaller, not bigger. GPT-3 is 175B. Meta's largest model is 65B. Claude is 52B. The Google Internal memo recommended 20B. 13B and 7B are where it's at.
>>93682195
As soon as scalpers figured out that gamers are willing to pay their outrageous prices, it was over. The only saving grace is that the 40 series was a flop. If that weren't the case, they would still be thriving.
>>93674183
RENTRY: https://rentry.org/local_LLM_guide
WINDOWS NEWBS GUIDE TO RUN A LOCAL MODEL (16GB RAM)
1. DL at least one model .bin file from the links below. If multiple models are present, those labelled 'Q5_1' or at least 'Q5_0' are better. Relative proficiency at (S/s)tory or (I/i)nstruct modes:
(sI) 13B (12GB RAM) https://huggingface.co/TheBloke/WizardLM-13B-Uncensored-GGML/tree/main
(S) 6B (9GB RAM) https://huggingface.co/xzuyn/GPT-J-Janeway-6B-GGML/tree/main
(Si) 7B (6GB RAM) https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/tree/main
2. Get the latest KoboldCPP.exe here: https://github.com/LostRuins/koboldcpp/releases (ignore security complaints)
3. Double click KoboldCPP.exe OR run "KoboldCPP.exe --help" in a CMD prompt to get command line arguments for more control. --threads (number of CPU cores), --stream, --smartcontext, and --host (internal network IP) are useful. --host allows use from local network or VPN! "--useclblast 0 0" probably maps to GPU0 and "1 0" to GPU1. If not, experiment. At start, the exe will prompt you to select the bin file you dl'ed in step 1. Close other RAM-hungry programs!
4. Go to the URL listed in the CMD window
WORKFLOWS:
Story Generation:
1. Click 'New Game' button
2. Click 'Scenarios' button and 'New Story' in popup window
3. Click 'Settings' button, set Max Tokens to 2048, Amount to Generate to 512, TTS voice (optional)
Ex. prompt: "As a private investigator, my most lurid and sensational case was " When new text stops, hit 'Submit' again to continue. As in Stable Diffusion, renders can be hit or miss. Hit Abort during text generation and restart from Step 1 to re-initialize
ChatGPT-style queries:
Same as above except in Step 2 choose 'New Instruct' in the popup window. In step 3, you may wish to adjust Amount to Generate tokens for small ("What's the capital of Ohio?") or large ("Write 10 paragraphs comparing gas to oil") prompts
->CTRL-C in CMD window to stop
>>93682569
>8.1 Where You wish to make Commercial Use of Falcon LLM or any Derivative Work, You must apply to TII for permission to make Commercial Use of that Work in writing via the means specified from time to time at the Commercial Application Address, providing such information as may be required.
>>93682597
that's what i'm saying, why load the entire reddit corpus into an array of nuclear-powered A100 80GBs when you're guaranteed to get outputs as retarded as reddit itself? this could have applications in researching autism and downs syndrome maybe, or analyzing the cognitive traits of manchildren, but it's otherwise pointless
>>93682609
the fucked up thing is that i want the one GPU that basically nobody else on earth wants - an overpriced, underpowered unit with a small footprint and high efficiency at AI tasks
>>93682597
>Claude is 52B
source?
is this for Claude 100k?
>>93682702model size and context are two different things
okay boys here we go. time to stop memeing and start dreaming
>>93674415
>>93674183
>Falcon should be in the news reeeee
why? It is most likely about as good as Wizard 33B but requires way more VRAM, thus it doesn't fit into 24GB.
The license is almost the same deal as LLaMA's: yeah, you can use it commercially, but in order to do that you need to ask for permission. p. 8.2
>>93682597
They're not getting smaller, they're just changing. Old models:
>GPT-3 (Davinci): 175B parameters, 300B tokens
>PaLM: 540B parameters, 780B tokens
>OPT: 175B parameters, 180B tokens
Interstitial models:
>GPT-3.5 (Turbo): 30-90B parameters, 300B tokens
>LaMDA: 137B parameters, 1.6T tokens
>LLaMA: 65B parameters, 1.4T tokens
New models:
>GPT-4: 350B-700B parameters, 5-7T tokens
>PaLM 2: 340B parameters, 3.6T tokens
As you can see, the result of discovering Chinchilla scaling was an immediate drop in parameter counts, which are now coming back up as researchers dig up more data with which to train their models.
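For anyone who wants to sanity-check those numbers: the Chinchilla paper's rule of thumb is roughly 20 training tokens per parameter for compute-optimal training (a heuristic, the real fits vary with budget). A minimal sketch:

```python
# Chinchilla rule of thumb (~20 tokens per parameter); the exact ratio in the
# paper varies with compute budget, this is just the meme number.
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    return params * tokens_per_param

for name, params in [("LLaMA-65B", 65e9), ("GPT-3", 175e9), ("PaLM-540B", 540e9)]:
    print(f"{name}: ~{chinchilla_optimal_tokens(params) / 1e12:.1f}T tokens")
```

LLaMA-65B lands at ~1.3T, right around its actual 1.4T run, while GPT-3's 300B tokens is an order of magnitude under its ~3.5T "optimal" — which is exactly the shift the list above shows.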
>>93682854
and possibly pay royalties
>For commercial use, you are exempt from royalties payment if the attributable revenues are inferior to $1M/year, otherwise you should enter in a commercial agreement with TII
>>93682356
Local LLMs are in the same place as cloud gaming, going against the grain of how things would be "naturally" arranged. LLMs benefit greatly in efficiency from batching and sharing workloads and weights, while network latency practically doesn't matter. So it's the exact opposite situation.
The only thing that makes local models feasible, and possibly even inevitably dominant, is that the corpos don't want to win; in fact they specifically WANT their models to be worse at creative tasks. The only reason they're as good as they are now is by accident, because they're (rightfully) afraid that the lobotomies make them dumber at everything.
So IMO, cloud models will lose because they will intentionally forfeit. Local has the one advantage that puritans and meddlers can't ruin it. Same reason why local imagegen isn't going away any time soon. Even if someone made an amazing exclusive model, they'd have to censor it or get shut down.
I am able to run WizardLM 30B Uncensored on my 4090 purely in VRAM, but now it freezes when it reaches ~1700 tokens of context. I already limited the prompt size down to 1200 or so in the proxy configs. Is there a workaround to maximize context token size? Offload some data to system RAM or what? Is that even possible?
>>93682949
Sounds like you're using groupsize 128; the only thing you can do is switch to exllama, which is more vram efficient. If you do that you can also use triton quants with act-order to get better quality as well.
Good luck figuring out how to use it though, ooba hasn't touched it, Kobold has a branch but I'm not sure how well the Kobold API works with the proxy
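The freeze near full context also tracks with simple KV-cache arithmetic: on top of the quantized weights, every token of context stores a key and a value vector per layer. Back-of-envelope sketch (my own arithmetic, not from any repo; assumes LLaMA-30B dims and an fp16 cache):

```python
# Each token caches one key and one value vector (hidden_size each) per layer.
def kv_cache_bytes(n_layers: int, hidden_size: int, seq_len: int,
                   bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * hidden_size * seq_len * bytes_per_elem  # 2 = K and V

# LLaMA-30B: 60 layers, hidden size 6656, fp16 cache, full 2048 context
print(f"~{kv_cache_bytes(60, 6656, 2048) / 1e9:.2f} GB on top of the weights")
```

So a ~17GB 4-bit 30B plus ~3GB of cache plus activation scratch is right at the edge of 24GB, which is why it dies before hitting the full 2048.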
Just how big of an improvement is exllama? Will it let me run 13B models on an 8GB card? Right now those require 10-12GB.
>>93682702
52B is the AnthropicLM v4-s3 model. Anthropic says that Claude is different from that model, but that's probably corpospeak for "it's definitely the foundation model that we then fine-tuned, but we would rather let reddit guess about how many toucans we stuffed in it." 100k is the context size, not the parameter count.
>>93674183
this >>93682642
>falcon lic. is not business friendly
At the moment I only got 6GB of VRAM. What do you think would be the best model for me to jack off to big ol' elf tiddies?
>>93682917
i'm not disagreeing, just saying that realtime chatbots aren't the only application of the technology. i'm mostly thinking about async/queued jobs, which is exactly how i implemented openai whisper to replace AWS Transcribe with an in-house version and save a shitload of money. it was an API endpoint that would queue a whisper job in a new process and return a jobId, which you could query later to get the finished transcript
for realtime transcription, we still used a paid SaaS service (rev.ai) that could ingest RTMP data of live streams and continuously produce output in a non-blocking event loop, which got pushed out to the video player interface
anyway, you could theoretically fine-tune your own specialized model on openai's hardware, but the cost to train it and all subsequent costs are high cuz openai's custom-trained token price is as jewish as sam altman
in that case you'd be better off just self-hosting a tiny, purpose-built model even on some shared VPS with midrange specs. especially if your volume is big and your output doesn't need to be instant or even fast (like it's okay if it takes an hour)
>>93683009
no, 13b at full context takes about 9.7 GB; just loading it will take over 8GB
>>93683101
buy a 3060 nigga :(
>>93683013
and you're 100% sure the Claude 100K service is exactly the same model as your assumed/imagined 52B Claude?
As an old zen koan says, "The most correct answer would be 'I just don't know'"
>Tech works.
>Can't afford 4090
I still can't believe sirs on /g/ can't buy at least 1 4090 for this shit. Perhaps the stereotypes about pajeets being super cheapskates are true.
>>93683121
I will, actually already got it bookmarked. But it's the end of the month and I'm poor.
>>93683152
We don't know for sure, but we do know that the 100k context model is exactly the same as the old 9k context model, apart from the context length. Also, it seems unlikely that they'd be using an extremely large model for 100k context, even considering the improvements of ALiBi and flash attention. There's a reason GPT-4 only offers 8k and 32k context options.
>>93683152
3090s are dirt cheap at this point.
>>93682917
It's possible that the 30B tunes are better than Claude and Turbo. There were a lot of anecdotes of anons having outputs on par with Claude using SuperCOT, Alpasta, and Vic 30. The Wizard 30 and Vic 30 results are still not up on the leaderboard, and the only close comparison we have is this
https://lmsys.org/blog/2023-05-25-leaderboard/
For whatever reason they thought it would benefit anyone to test only against 13B tunes. Quite curious why they segment off these models instead of just adding them with the rest. The OpenLLM board proves that Wizard Mega 13 is better than Vic 13, so I wouldn't be surprised if in reality the 30Bs come close to GPT-4. The models trained on the exact Stanford Alpaca dataset also suck compared to the custom tunes of the same parameter size. Imagine a SuperCOT 65B
>>93683226
>Its possible that the 30B tunes are better than Claude and Turbo
they are close in quality, at least to turbo; the problem is context length
>start of the long weekend
time to play with llms again?
>>93683321
b-but 1 bajillion token context...
>>93682854
>Falcon
>It Is most likely as good as wizard33B
lmao
I've been playing with Wizard-Vicuna-13B GGML on Kobold.cpp+Proxy+SillyTavern and got some great results. Nice long, detailed and varied prompts. I was wondering how much faster generating on my 8GB GPU would be, but the best I could fit on it was Pygmalion 7B, 4-bit version with that KoboldAI fork that allows 4-bit GPTQ. Generating is about 5 times faster: with CPU, 250 tokens take 60-70 seconds; with the GPU it's about 15 seconds.
Except that the result is pure garbage. Almost every time it just repeats the same 2-3 short sentences several times. Is this the best I can expect from a 7B 4-bit model or did I fuck up something? It's the exact same scene in SillyTavern that gave me great results with the 13B GGML, though the backend is now KoboldAI, not Koboldcpp.
>>93683427
it's got a commercial license though, so at least people will be able to make products with it.
>>93683180
then perhaps their Claude 100k is even smaller, we just don't know. The same way we know jack shit about the exact architecture, the parameters, and probly the helova scripting that comes with codeDaVinci002, chatgpt, gpt3.5 turbo or gpt4 services.
literally jack shit, apart from the fact that they get updated on a weekly basis.
>>93683446
Yes, generally 7B is pure shit. You can try Wizard 7B, but I have never used it; I am purely going based on the one anon that seems committed to getting people to swap to 7B
>>93682700
>the fucked up thing is that i want the one GPU that basically nobody else on earth wants - an overpriced, underpowered unit with a small footprint
While not exactly a good deal, the P4 certainly is small. I might get a few to go in my remaining server slots where nothing else will fit. A5500 and A5000 aren't terribly expensive either. The problem with the 3090 is it's almost always a 3+ slot card, whereas the datacenter stuff is just one or two slots.
>>93683427has anybody tested falcon 40B yet?
>>93683461
provided they get permission
>>93682453
Hmm, I'm not sure about these results. According to the git, the CUDA version of KoboldCpp should work right out of the box. I don't see anything about needing to install additional libraries. Unless I'm missing something here.
Prompt: Write me a short story about a young man who saves the world.
Tokens: 512
Max Tokens: 2048
>koboldcpp
55% CPU, 15GB RAM, 0% GPU
>koboldcpp_CUDA
60% CPU, 18GB RAM, 20% GPU
The actual compute time was remarkably similar no matter what the setup. It was always around 2 minutes 2 seconds, tested twice for each configuration. I'm most surprised by the non-CUDA kobold getting the same time for both OpenBLAS and CLBlast. I thought those were going to be different.
>>93683461
the wiz shill is kidding himself if he genuinely thinks fine tuning on the wizard gpt3.5 dataset can suddenly make llama 33b competitive with falcon 40b, which was trained on actual literature, not just web scrapes, and compares favorably to llama 65b.
>>93681021
Claude has a 100k context window.
>>93683461
i'm not looking forward to that, it will probably just spawn a bunch of scummy businesses and accelerate regulation
>>93683551
Did you put as many layers on your GPU as possible?
>>93683551
kobold.cpp doesn't support gpu very well
>>93681712
GPT-4 is probably the same size as GPT-3 but trained on more tokens and with a larger context size.
If it's larger, it's no more than ~350B. But more than likely it is far less than that.
>>93683518
>>93683568
The problem with Falcon is that apparently we don't currently have quantization for GPT models, so you need like 120GB or something of VRAM to run it with full context.
>>93683668
who compares falcon 40B to llama65 and why?
>>93683658
GPT-4 is VERY slow, much slower than davinci even with no context. It's definitely at least double the size.
so falcon 40b doesn't fit on a 3090 or 4090? what the fuck is the point then?
>>93683446
>Pygmalion-7B
Make sure you're not using the proxy with Pygmalion, as it isn't needed. SillyTav supports its formatting natively. And make sure multigen is disabled.
You could also try PygmalionCOT-7B instead, or WizardLM-7B.
>>93683668
>who compares falcon 40B to llama65
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
>and why?
because theyre both language models anon, you are aware of what general youre in right?
>>93683551
For me it was 112ms/T with all layers on the GPU vs 438ms/T using OpenBLAS.
>>93683665
so until then, there's no point speculating. Crystal balling is not very helpful on /g/
>>93683517
>P4
tesla P4? i have a hard lower limit of 16GB VRAM. an A5000 is on the table though
>>93683725
how fast compared to exllama?
>>93683665
>quantization for GPT models
what dirty nigger told you such a lie?
>>93683742
Got a link to 4-bit Falcon, then?
>>93683631
>>93683710
Is this what you're talking about? Offloading 0 layers would explain my results. How would I change this setting?
>>93682420
>repo
I use ooba's main repo, utilizing the rocm guide rentry.
>13B or 30B
12t/s for 13B (single GPU)
>exllama
exllama does not run on ROCm, at all.
>>93683765
How about you quantize it yourself? And then post it?
>>93683706
yes, and I'm 90% sure wizard30B is gonna beat both of em hands down, mostly due to TruthfulQA.
BTW, evaluating against common benchmarks is gonna get more and more irrelevant down the road. Why? Because it's damn hard to detect whether a given model was heavily trained on eval benchmarks or not at all. Not every ML team is cool.
>>93683800
Try --gpulayers 40 or whatever number fits in your GPU.
Is there a link to the hardware requirements necessary to finetune certain models?
>>93683817
>yes and I'm 90% sure wizard30B gonna beat both of em hands down.
its an honest bet ill take it
>>93683725
i'm getting a bit over twice that speed on 13b on a 3060 using exllama (23T/s)
>>93683817
we need /lmg/ specific benchmarks
>>93683669
>>93682327
Using a ggml model like: https://huggingface.co/camelids/llama-33b-supercot-ggml-q5_1/tree/main
And to launch you do:
koboldcpp.exe --stream --threads [your cpu cores] --useclblast 0 0 --smartcontext --launch --gpulayers [idk, try from 7 and move it up]
Still not optimal, especially considering your big GPU, but at least this setup is foolproof and the generated text will be as fast as you can read it, so why bother..
>>93683665
just load it with bitsandbytes 4 bit mode.
I could get this model:
mayaeary_pygmalion-6b_dev-4bit-128g
to work, but not this model:
PygmalionAI_pygmalion-7b
Using Ooba, what gives? Used the same procedure to install both of them
>>93683817
Then it doesn't really beat either of them. If you look at the actual numbers, the only change that would happen if you removed TruthfulQA is SuperCOT would slide down to 3rd place under llama 65b. Those other numbers matter; TruthfulQA is really only useful to see how cucked a model is, since you get a low score if you reject questions
>>93683906
This tells me nothing. I recognize bitsandbytes as something that I need to install to get models to work, but I do not know what purpose it actually serves.
>>93675333
...no, straight up their .py files force it to be f16 and won't let it be anything else. It ignores that stuff. After a lot of fucking around I quantized it and changed their shit to force it to be 8-bit, but it is slow on a level that doesn't even make sense.
Idk enough about PyTorch internals to fix whatever is up with it.
It's good though? The actual output.
But something is up with the runtime implementation.
>>93683916
well, he's not wrong.
>>93683692
everyone laughed at 3bit, but who's laughing now??
>>93683933
just pass --load-in-4bit when starting ooba; quantization happens on the fly. the days of needing to wait for someone to quantize shit are over for the most part, certainly once timmy fixes the inference speed on bitsandbytes 4bit.
this thread is full of gptq brainlets i swear
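For the gptq brainlets (no bully), the core idea behind any of these 4-bit schemes is the same. A toy absmax sketch — NOT bitsandbytes' actual NF4/blockwise format, just the concept:

```python
# Toy absmax quantization: scale weights so the largest magnitude maps to the
# edge of the 4-bit signed integer range, round, and keep the scale around to
# dequantize on the fly during the forward pass.
def quantize_absmax(weights, bits=4):
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.7, 0.33, 0.05]
q, scale = quantize_absmax(w)
print(q)                     # small ints in [-7, 7]
print(dequantize(q, scale))  # lossy reconstruction of w
```

Real implementations quantize in small blocks with one scale per block (so a single outlier doesn't crush precision everywhere), which is also roughly why "on the fly" loading needs the full f16 file first: the scales get computed from the original weights at load time.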
>>93683967
the precision is worse than gptq
you need to download the fucking f16 instead of a much smaller quantized model
takes ages to load
load in 4 bit is a meme dude
>>93683893
i'm going to guess that the pygmalion 7b you downloaded is either a xor or is not quantized, try this one
https://huggingface.co/TehVenom/Pygmalion-7b-4bit-GPTQ-Safetensors
delete or don't download the file that says Pygmalion-7B-GPTQ-4bit.act-order.safetensors
>>93683967
>just download and store a lot of data you will never use
>>93683692
It'll require offloading, clearly. And the more VRAM you have in that case, the better.
>>93683967
I do not know what gptq does. I work in IT. I follow guides and do what they say but that's it.
>>93683916
fuck this is good. what model is this? Does this thing realize it's good?
>>93684011
yes, but you can run falcon today because it works for any model.
>>93684037
downloads in minutes for me
>no examples of falcon's RP chops
Is everyone poor? Where are the output screenshots?
>>93684079
I think it's the base llama 128g from what, 2 months ago now?
>>93660488
>>93684098
I could run it on the CPU, but ggml is needed
>>93684079
Big Nigga is just my discord pal who I ask advice from
>>93684101
no, that's the superhot prototype https://huggingface.co/ausboss/llama-30b-SuperHOT
you can tell from the format --- and mode
>>93684011
and it's slower
>>93683725
Exllama was 36.07 T/s or 27ms/T.
I have been edging and cooming to WizardLM 30B Uncensored for like 12 hours. Shit is great. Can't believe this is a local model comparable to 3.5 turbo. Still have issues with context token size maxing out the 4090's 24GB VRAM though.
>>93684019
I deleted it, but got a new error now
size mismatch for model layers?
Will llama.cpp or koboldcpp benefit from the exllama thing?
>>93684212
maybe but probably not
huggingface sure was different in 2017
>>93684240
so that's why their name is so gay
>>93684240
Granted, they are still immensely cucked, but at least they allow shit like bluemoon and Pyg to stay up
Is runpod still the best way to run the bigger llamas?
>>93684210
there should be an option for groupsize in the ui, select 32
>>93684304
Best way is buying a couple of 3090s.
>>93684281
give it another month
>>93684329
2 weeks bros...
>>93683967
Hi Tim. Why are you so retarded? Your shit is slower, takes more vram, heeeeeelova RAM, helova traffic, yuge af files and lower quality. Why don't you drop it and do something else instead? Like sth for the room temperature IQ crowd?
Why do you keep burning taxpayers' $? You know the inflation is skyrocketing, don't you?
>>93684370
cringe, i can tell you have never done anything with your life.
>>93684314
Success, thanks anon, you're the man
>>93684304
yes, /aids/ has a (slightly outdated) guide, but you should be able to pick up from there. Try WizardLM-30B.
>>93674183
I've been gone for a month, what did I miss/what is the current hot topic right now?
>>93684415
>>93683967
>>93684089
so what's the point of storing data you don't use again? this is an efficiency improvement? Should we just add an extra "0" byte after each byte of the model, then write a library to strip them out at runtime so we can keep these bytes on disk for even more efficiency?
>>93684462
waiting to see if dbznigger delivers
>>93684497
delivers on what?
>>93684515
https://github.com/ggerganov/llama.cpp/commit/1fcdcc28b119a6608774d52de905931bd5f8a43d#commitcomment-115032455
It will lol
>>93684475
you do use them, but they are dequantized during the forward and backward pass (if making a qlora)
>>93684516
I never tried any of the superhot models, I can only run up to 13b when ccp'd; should I try the superhot 13b if there is one? Is it uncensored?
>>93684534
No it won't, learn how to read
>I'm aware of the repository. I think I'll be able to reach comparable speeds once all ggml tensors have GPU acceleration.
The ggml enhancements are a completely different thing, you can't combine the two. Exllama is for GPUchads only
>>93682444
It's a 40B model that no one can run and it compares to a 33B model, yet they think anyone will adopt their model.
>>93684616
>once all ggml tensors have GPU acceleration.
you're 2 weeks late, we can use cuda and the gpu on llama.cpp now lol
>>93684649
>you
You're truly fucking retarded
I wonder how long it will take you to understand three english sentences
>>93684649
anon... that was written yesterday on the llamacpp repo
>>93683685
>GPT-4 is VERY slow
Artificially slow due to load and to avoid cannibalizing GPT-3/3.5 as a product.
>>93684415
yes, most of my life I've been making other ppl's lives better.
Fun fact: you're still a better human when you do nothing than when you burn other ppl's money. Unfortunately, that's not a well known fact in academia.
>>93683828
>>93683870
My GPU has 10GB of VRAM. I'm not sure how the math adds up on how many layers you're supposed to have, but I'm able to bump gpulayers up to 32 before it starts crashing. Starting at 7 and slowly increasing it was a great method to reach this number.
Both versions of KoboldCpp, the CUDA and non-CUDA, seem to have the same performance. That being said, the new settings cut computation time down by a whole minute! Running the same test as before reliably demonstrated it took about a minute to generate the same 512 token response. Thanks for the help!
>>93685282
The number of layers depends on the model too. 40 layers of a 13B model should fit perfectly fine on 10GB, I think; I've got 12 and there's even a lot of room left. But the layers of a 30B model are bigger, I can only fit 23 before OOMing. Just go trial and error until you fit the max number of layers without crashing or OOMing when generating text.
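If you want a starting point before the trial and error, a crude estimate works: divide the ggml file size evenly across layers and see how many fit after reserving headroom. This is my own arithmetic, not how koboldcpp actually allocates (it ignores the KV cache and scratch buffers, hence the reserve fudge factor), and the numbers below are hypothetical:

```python
# Crude --gpulayers starting point: split the ggml file size evenly across
# layers, reserve some VRAM for context/scratch, count what fits.
def max_gpu_layers(model_bytes: float, n_layers: int, vram_bytes: float,
                   reserve_bytes: float = 1.5e9) -> int:
    per_layer = model_bytes / n_layers
    return min(n_layers, int((vram_bytes - reserve_bytes) / per_layer))

# e.g. 13B q5_1 ~9.7GB file, 40 layers, 10GB card
print(max_gpu_layers(9.7e9, 40, 10e9))
```

That lands in the mid-30s for a 10GB card, close to the 32 that worked by hand above; start there and step down if it OOMs.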
>>93682949
Dafuq nigga? Earlier I was able to go beyond that context length on the same model and same GPU. After reading your post, it froze.
>>93683692
It'll fit on two 4090s once it can be quantized, but if you have that you can just run 65B quantized. It'd be a matter of seeing if it really is better.
There's something fucky about quantizing it though. Over my head.
>>93684240
Quite a pivot: from a toy for zoomzooms to the main hosting provider and general core stomping grounds of the entire current machine learning community.