/g/ - Technology


File: 1666552542731156.jpg (144 KB, 850x850)
/lmg/ - A general dedicated to the discussion and development of local models

>mikufw when no 48GB 3090

►Previous Thread >>96062736 & >>96050594

►News
>(12/09) ExLlamaV2 released https://github.com/turboderp/exllamav2
>(10/09) Medusa: speculative decoding with multiple heads https://sites.google.com/view/medusa-llm
>(06/09) Falcon 180B released
>(04/09) llama.cpp: CPU only LoRA finetuning https://rentry.org/cpu-lora
>(24/08) Meta AI released Code Llama (7,13,34B with 16k up to 100k context)
>(18/07) Llama 2 released

►Model Rankings
HF: https://hf.co/spaces/HuggingFaceH4/open_llm_leaderboard
CODE: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
PLAP: https://rentry.org/ayumi_erp_rating

►FAQ
>Main FAQ
https://rentry.org/er2qd

►General LLM Guides & Resources
>Newb Guides
https://rentry.org/llama_v2_sillytavern - https://rentry.org/local_LLM_guide
>aicg Newb Guides
https://rentry.org/meta_golocal_list
>Llama 2 Jailbreaking Guide
https://rentry.org/llama2-uncensored
>LlaMA Guide
https://rentry.org/TESFT-LLaMa
>Machine Learning Roadmap
https://rentry.org/machine-learning-roadmap
>Novice's LLM Training Guide
https://rentry.org/llm-training
>Local Models Papers
https://rentry.org/LocalModelsPapers
>Quantization Guide
https://rentry.org/easyquantguide
>lmg General Resources
https://rentry.org/lmg-resources
>ROCm AMD Guide
https://rentry.org/eq3hg

►Model DL Links, & Guides
>Model Links & DL
https://rentry.org/lmg_models
>lmg Related Links
https://rentry.org/LocalModelsLinks

►Text Gen. UI
>Text Gen. WebUI
https://github.com/oobabooga/text-generation-webui
>KoboldCPP
https://github.com/LostRuins/koboldcpp
>KoboldAI
https://github.com/0cc4m/KoboldAI
>SimpleLlama
https://github.com/NO-ob/simpleLlama

►ERP/RP/Story Gen.
>ERP/RP Data Collection
https://rentry.org/qib8f
>LLaMA RP Proxy
https://rentry.org/better-llama-roleplay

►Other Resources
>Miku! (desu) (boku)
https://rentry.org/lmg-resources
>Benchmark Prompts
https://pastebin.com/LmRhwUCA
>Additional Links
https://rentry.org/lmg_template
>>
mikusexxxxxx 2
>>
mikusex 3
>>
File: 1694573085178812.jpg (9 KB, 169x299)
>>96077130
Again.
Why does nobody try to hack GPUs these days?
The falcon chip within the GPU core is just a RISC-V core with custom firmware running on it. The drivers are open source and the API for the BIOS is available in the Linux nouveau kernel.
There's also FAUCON, a debugger/emulator for falcon security chips made by the Switch hacker community. It's on GitHub.
The encryption keys and the certs are most likely in the chip, and with fault injection/fuzzing or just reverse engineering it should be feasible to crack the shit. Yet nobody tries. Or am I mistaken?
>>
>>96077265
Because both NVIDIA and AMD have been signing their drivers for ~5 years now.
>>
File: 1693852780728073.gif (2.7 MB, 292x304)
>>96077265
piku sex
>>
>>96077273
s/drivers/bios/
>>
>>96077265
frogposter ...
>>
>>96077083

usually axolotl so yeah bsb is used
>>
File: Untitled.png (79 KB, 821x511)
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
https://arxiv.org/abs/2309.07937
>We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. VoxtLM is trained with publicly available data and training recipes and model checkpoints will be open-sourced to make fully reproducible work.
>open-sourced
https://soumimaiti.github.io/icassp24_voxtlm/
no code yet; apparently it will be posted here (https://github.com/ESPnet/ESPnet). speech-to-speech should be voice cloning, though not sure how good it will be. 125M/350M/1.3B models (not sure if all will be released). something to watch I think
>>
>>96077265
It's probably harder than you think, and it's not like we would be able to do much with it anyway; the most interesting stuff is probably lasered off in hardware. The only interesting potential target I see is the CMP series of mining cards. Maybe it would be possible to restore the tensor cores with a hack, but they could also be defective or hardware disabled.
>>
PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
https://arxiv.org/abs/2309.08140
>We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt.
https://reppy4620.github.io/demo.promptttspp
good day for speech gen but sadly no promise of weights. check out the samples though
>>
File: 1683705829436356.jpg (49 KB, 640x578)
Sorry for retard question, is running Mythomax-L2-13B-8bit-exl2 better than MythoMax-L2-13B-GPTQ?
I don't understand much of the basics but as far as I know 8bit is supposed to be better quality at a heavier vram cost. (8bit one works on my 3090 with 8k context as far as I've seen)
>>
>>96077279
psychomiku hiii!!!!
>>
>>96077572
yes obviously you should get at least 12k in on a 3090
>>
Incredible things are happening in Finland :D
>>
>>96077631
Kek cyberpunk shit
>>
>>96077631
better than making soap for cuomo ig
>>
>>96077631
this isn't new information
i heard about european prisons having prisoners do basic labeling work in like, 2017
>>
>>96074900
>>96075676
On average, half the data in the llama dataset is duplicated. Of course, this does mean that some of the data is articles that are reposted 35 times and so a lot of other stuff is unduplicated, but after the second epoch anything that was duplicated even once stops helping. So it's hard to say exactly how many of those tokens are helping for the first or second epochs, but by the time you reach the third epoch, for every 1T tokens worth of compute, you're only getting 500B worth of useful training and 500B of potential overfitting. Ideally they would clean their dataset and run for more epochs rather than trusting to random chance to manage how many times each passage is seen, but with the dataset they used I think 2 epochs was the correct choice.
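For reference, exact-match dedup is the easy part; something like this sketch (hash each document, keep the first copy) catches the literal reposts, while real pipelines like SlimPajama add fuzzy MinHash dedup on top. The function below is just illustrative, not any particular pipeline's code:

import hashlib

def dedupe_exact(docs):
    # keep the first occurrence of each document, drop exact duplicates
    seen = set()
    unique = []
    for doc in docs:
        # normalize whitespace so trivially reformatted copies still count as dupes
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe_exact(["an article", "an   article", "something else"]))
# ['an article', 'something else']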
>>
>>96077623
you mean context? VRAM seems pretty maxed out, and I'm not sure how much it dumbs it down if I raise it that high
>>
>>96077736
Maybe it's different on Windows, but I was able to get 13k on that fake 8bit Mythomax.
>>
>>96077722
Try cleaning up 1T tokens lol
>>
>search /g/ for lmql
>6 hits
>last post was by me
is it just so avant-garde that nobody is using it?
or is it still annoying to set up with LLaMA?

it looks like guidance but better and not dead.
>>
>>96077631
Would you rather be making furry porn mousepads for a dollar an hour or tagging furry smut
>>
I want to experiment with making the model LARP as different personalities/people/etc with different biases and goals when e.g. summarizing content

it's a long shot but i'm wondering if the RP/storytelling/etc models might be useful here.

What are the best models for this overall, and are any actually made for real e.g. D&D roleplaying and not just ERP?
>>
>>96077427
You won't be able to enable hardware that isn't there, but you could do stuff like enable voltage control for Tesla cards, and potentially unlock FP16 performance for Pascal. It would be nice to get a P40 down to 85W and working with exllama.
>>
>96077772
what
>>
>>96077772
>lmql
Looks like grammar-based sampling all over again
>>
>>96077772
https://github.com/guidance-ai/guidance/tree/pythonic
seems active now. but from what my browser history is telling me this was a microsoft project so I wonder when it changed
>>
>>96077769
It's already been done.

https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
>>
>>96077846
it's a bit mad, i haven't fully looked into it, but it looks like they implement a scripting language which is a superset of python and allows you to add lots of structured LLM calls within the program itself. constraining output to a certain format is an obvious thing everyone wants, but also trivially adding XofY (whatever, chain of thought and all the new ones) style prompting and agents with just a few lines of code seems neat (i like the example of political arguments they give)
>>
>>96073660
Anon from last thread still here? I've got a 6600 and couldn't get the rocm kobold working when I downloaded it. What did you do to get it working on yours? Mind explaining, if you're here?
>>
File: 1688106991831801.png (61 KB, 2307x585)
>>96077856
problem was it's a white guy from microsoft research (slundberg) and uh, another guy, and the main microsoft guy doesn't seem to be committing much recently from what i see, so i fear code quality issues

lmql is two eth zurich phd students so i expect high quality given that it's fucking ETH.
>>
>>96077273
that's what I'm talking about. Can you read? The keys have to be in the falcon security chip, otherwise it wouldn't be able to verify whether your fw is legit or not.
Additionally there are some Russian vbioses already signed, yet the voltage is not stock.
So you can either
- steal the keys from Nvidia
- extract the keys from the falcon chip itself
- use the existing signed fw, edit it and figure out the checksum that is needed. The checksum can be found by reversing the firmware bin.
- bypass the cert verification altogether
- flash your own custom falcon os/fw with no cert verification
- skip the check with fault injection or other tricks
- use other backdoors like the one used in nvflashk by kefinator (if there are any)
- use something else
>>
>>96077864
I concede, I never thought it'd be possible to do it efficiently
>>
How computationally expensive is quantization? I don't know because I've never squished a model myself before.
I was just thinking about something during my chemistry class an hour ago... a "dynamic" edge LLM loading system that could, given a set of f16 files, predict the most balanced parameter/bitcount recipe for speed/hardware utilization and accuracy/perplexity, bake said mix and then cache the results.
Not going to work if this processing would take forever though, at least not without an external server doing the baking on its side.
>>
>>96077202
No, you began being weirdly defensive and argumentative first when you shit talked me for being esl and not reading the paper.
Disagreements are normal in conversations. You either solve them or agree to disagree and move on.
Thanks for the paper though, that's cool.
>>
File: Untitled.png (62 KB, 2285x348)
>>96077930
oddly if you try to click the name github doesn't do anything and it's also not showing up on his profile. never seen anything like this huh.
>>
>>96077975
>given a set of f16 files, predict the most balanced parameter/bitcount recipe for speed/hardware utilization and accuracy/perplexity, bake said mix and then cache the results.
That is exllama2. Well, except for the predict part. Just build something on top of exllama2 that calculates the optimal bpw.
>>
>>96077427
So what if it's lasered. It does not matter. The riscv chip just runs some OS and does the verification. It has to read the certs (stored in memory) then compare them. Also I'm not sure you need to additionally sign your vbios once you edit it. It seems all you need is to figure out the checksum.
BTW, just because some data is lasered doesn't necessarily mean it can't be extracted/overwritten/bypassed/tricked.
>>
>>96078096
And there you have why nobody is doing it, kek.
>>
>>96077975
That's basically what exl2 does. It's extremely computationally expensive because it runs tests to determine the perplexity penalty of quantizing various features to multiple sizes.
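Not exllamav2's actual code, but the flavor of the optimization is roughly this: measure the error each candidate bit-width causes per layer, then spend a bit budget where it buys the most accuracy per bit. Everything below (names, numbers) is made up for illustration:

def allocate_bits(measured_error, budget_bits):
    # measured_error: {layer: {bpw: error}}; budget_bits: total bit budget over all layers
    # start every layer at its cheapest option, then greedily upgrade the best error/bit trade
    choice = {layer: min(options) for layer, options in measured_error.items()}
    spent = sum(choice.values())
    while True:
        best = None
        for layer, options in measured_error.items():
            for bpw in options:
                if bpw > choice[layer] and spent - choice[layer] + bpw <= budget_bits:
                    gain = options[choice[layer]] - options[bpw]
                    cost = bpw - choice[layer]
                    if gain > 0 and (best is None or gain / cost > best[0]):
                        best = (gain / cost, layer, bpw)
        if best is None:
            return choice
        _, layer, bpw = best
        spent += bpw - choice[layer]
        choice[layer] = bpw

example = {
    "attn": {2.5: 1.0, 4.0: 0.3, 6.0: 0.1},
    "mlp":  {2.5: 0.6, 4.0: 0.4, 6.0: 0.35},
}
print(allocate_bits(example, budget_bits=8.5))  # -> {'attn': 4.0, 'mlp': 4.0}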
>>
>>96078059
i hardly use github and i hate it but i think you have to click their stupid avatar to see their page?
https://github.com/slundberg
>>
File: 1694703409487597.jpg (29 KB, 541x540)
>Do 13b Mythomax, pretty good
>Decide to do 32b Libra and 33b Airoboros
>Decidedly better awareness and prose, but schizos out far more than 13b
I remember hearing the llama 2-based ~30b models were fucked initially, is that still true?
>>
>>96078050
I provided you a paper with an answer to your question (about training for multiple epochs), which you proceeded not to read, and then you seem to want to argue with me over a misunderstanding of what I said. Even continuing it onto a new thread. Just further proof really.
>>
>>96078164
Did you use prompt format recommended on main libra-32b repo? It helps with the schizo substantially.
>>
>>96077427
AFAIK they are hardware disabled with blown fuses and not blown traces. The data is stored as bits in some OTP register array. That means if you trick the firmware you could potentially re-enable them.
Yes, probably some huge portion of the die area is defective, but you never know. nv is probably not sure either, since I doubt they stress test and inspect with the SEM every single nv chip out there.
>>
>>96078151
yeah it won't let you on that specific page and the actual commits aren't shown on his profile which is why I thought it was weird.
>>
>>96078151
That's just because he didn't use the same username as his github in his commit. I could commit as Linus Torvalds or any other name if I wanted, and if that matches a github username, it will show that github profile.
>>
>>96078126
I still don't know, since people hack consoles, smart cards, security cameras, ATMs, IDs, laser printer chips, super secure passcards, TPMs, cars, and even... military rifles.
All of them hardware encrypted with top-notch AES256/RSA4096/Kalyna512 or whatever
>>
>>96078264
hmm, interesting

i was thinking the other day about how it might make sense to make metrics for github repos (since there's so fucking many of them) and one useful one might be to track if the original creator (defined as some combo of creating the repo and most of the init/big commits/merges or something) is still active and possibly who took over

also doing an ELO rating of github users could be interesting.

there's just way too fucking many of these things out there.
>>
>>96078291
I feel like most of it is social engineering and exploiting holes.
Some do the "18 years to crack this one thing" but that seems rare in what I've read.
>>
Platypus-instruct keeps filling up the maxnewtokens every generation. I have unban tokens enabled, is there anything else I should be looking at? It doesn't do this for other models I've tried.
>>
>>96077130
I'm on an AMD GPU. I want to install Linux to run Mythomax. Can I install it in a virtual machine? I don't want to boot into Linux every time I need to coom.
>>
>>96077342
>discrete speech tokens from self-supervised speech features
ZZZ
>>
>>96078395
Your point?
>>
>>96078197
It wasn't my question. Somebody else asked and I was answering.
>>
>>96077829
yeah, same for some hidden p2p over pcie, unnerfed nvlink on 3090, resizable bar, voltage/power control, higher freq, more VRAM support and potentially more cores if disabled.
And frankly the hardware is there, it's just usually disabled by some data stored in OTP fuses that people call laser burned, but it's still just a few bytes, or even less.
>>
>>96078416
Having a model compute arbitrary features for a given piece of speech is garbage. Tortoise and Bark are limited because they can only represent speech through conditioned meme inputs.
Have the LM learn directly from a representation of the waveform itself for better results, like Voicebox says it does
>>
>>96078381
You should give up. You need to not be retarded to use AMD for AI.
>>
>>96078350

that's bs
99.999% of hardware hacking is the actual hacking and not social engineering.
Look at blackhat, recon, hardware.io, CCC, CHESS or even Defcon. Not much soc-eng there.

ah and BTW, I forgot to mention THE SATELLITES, including defense ones. Recently, they got hacked too.
>>
>>96078480
no I want to use it. I won't give up. I'm very autistic when it comes to making shit work.
>>
>>96078420
Yeah, I don't know what to tell you then, since we weren't in disagreement at all from what I can tell, but you misunderstood my post, then decided that it was an argument you had to win, and won't let it fucking die for whatever reason.
>>96075743
is the question I thought deserved an actual researched answer since it is nice when we get actual discussion over machine learning here
>>
https://huggingface.co/turboderp/Llama2-70B-chat-exl2

70B on 24GB is here (hot, just out of the oven)
>>
>>96078468
you didn't read the paper huh p*traspammer
>>
>>96078618
The 2.3 bpw should fit comfortably on a 24GB GPU
>>
How can I prevent Risu from cutting off its replies too soon? What settings should I tinker to fix this? I'm using local btw.
>>
>>96078630
>adopting the tactic of claiming everyone is yourself
cute. not going to entertain the argument if you're just going to be intellectually dishonest as always
can't wait for your rock tortoise in windows!
>>
Trying to extend a 13B model to 20B while mixing 2 models and keeping coherency/consistency without looping.
I REALLY need feedback on the 4, so if you have time, read the readme and tell me which is the BEST of those 4 models.

https://huggingface.co/Undi95/MLewd-ReMM-L2-Chat-20B-GGUF
https://huggingface.co/Undi95/MLewd-ReMM-L2-Chat-20B-Inverted-GGUF
https://huggingface.co/Undi95/MLewd-ReMM-L2-Chat-20B-Precise-GGUF
https://huggingface.co/Undi95/MLewd-ReMM-L2-Chat-20B-PreciseInverted-GGUF

While the first two use chunks of 16 layers and the other two use chunks of 8 layers, the Normal ones have Llama2-Chat as the base (first layers), and the Inverted ones have Llama2 as the base (first layers).
Hyped to see the results; the first feedback was very nice for a 20B model.
>>
>>96078705
Increase response length
>>
>>96077921
NTA, but the 6600 doesn't have direct support in rocm, so you'll need to set HSA_OVERRIDE_GFX_VERSION=10.3.0. This is on Linux.
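If you don't want to remember the variable, a tiny launcher sketch works too (the koboldcpp.py path and model file here are placeholders for your own setup):

import os
import subprocess
import sys

# tell ROCm to treat the unsupported gfx1032 (RX 6600) as gfx1030
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

# the child process inherits the override; script/model paths are placeholders
subprocess.run([sys.executable, "koboldcpp.py", "models/mythomax-l2-13b.Q5_K_M.gguf"], check=True)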
>>
Synthia is no good for roleplay.
Keeps interjecting with “reasoning:” and “evolved thought:” stuff and making suggestions.
>>
>>96078203
I don't get it at all. Libra seems to be an absolute genius for my niche fetish shit after I set the prompt up properly. I thought I had a proper prompt for all the alpaca models.
>>
>>96078789
I haven't tested precision yet but I'm consistently getting better results from the non-inverted model(s)
What's precision supposed to accomplish anyway?
Also, great work anon
>>
>>96078381
>Can I install it in a virtual machine?
You would need to forward the GPU to the VM. Which means running Linux as your hypervisor anyway, I think.

>I'm on an AMD GPU.
>I want to [...] run Mythomax.
llama.cpp will work on Windows.
>>
>exllama requires python 3.9
Fuck this project.
>>
>>96079109
Thank you!
Precision means more splitting (16 chunks x 4 vs 8 chunks x 8), that's all; just trying a different approach there.
>>
>>96079126
>being stuck on 3.8
bro...
>>
>>96079126
what the fuck are you doing on python<3.9 in the first place?
>>
File: .png (23 KB, 903x139)
Looks like the pyg team just finished training 70b.
>>
>>96079208
Pyg team hasn't accomplished much, but they sure do have clout huh. You'll have actual ML industry people pop into their matrix sometimes.
>>
>>96079226
Yeah it’s annoying.
>>
>>96079208
reminder not to fall for their rxvk crowdfund scam
>>
>>96079245
What’s rxvk?
“Crowdfunding” (scamming community goodwill) is illegal in my country thankfully
>>
>>96079208
I'm disappointed they used open orca dataset containing "AI assistant" bullshit. They should've at least cleaned that up.
>>
>>96079279
rvwk or whatever
it's the rnn-like architecture
they said they want to crowdfund development for it kek
>>
>>96079226
Considering they'll be releasing their website by the end of the month that's expected to compete against character ai, they're about to get a lot bigger.
>>
>>96079084
Really? I haven't had it do that once. Are you using the ToT/CoT system prompt? Try just throwing a character card or some generic RP instructions in there instead.
>>
>>96079199
>>96079200
My shitty dev kit doesn't have a torch cuda package available for 3.9, just 3.8.
>>
>>96079305
>tfw novelai2
>>
>>96079305
How, last-mover advantage?
>>
>>96079359
C.ai is one of a kind as far as I know, NAI is the next closest thing.
>>
>>96079359
I don't know much but it seems to me they managed to weasel out a place among the normies for some reason like NAI did, and reap the benefits individuals can't.
Pygmalion plebbits are almost as big as NAI's are.
>>
File: satania.gif (39 KB, 220x216)
>>96079340
py_toddlers BTFO
>>
>>96079501
Yeah basically, but llama.cpp only gets 2 t/s with 70b. I was hoping for better speed with exllama but now I'm going to be spending 3 weeks trying to get torch compiled on this fucking piece of arm shit.
>>
>>96079340
Change your dev kit?
>>
>>96079351
I wouldn't equate pygmalion to nai since nai is story shit, but nai is starting to get into chat shit too. However, as we have all learned, any finetune based on llama models is better than nai's models, especially at the 70b level, on top of pygmalion being free. If anything, nai is going to get stomped come the end of the month when the website opens up. This is just an expectation and assumption though.

>>96079359
>>96079413
Diving into some aicg lore for this. Most of pygmalion's dedicated userbase comes from the last split from character ai. When cai's filter servers malfunctioned, their userbase was exposed to an unfiltered ai model. Surprise, people preferred the unfiltered model because its responses were better, and rebelled to the point where cai shut down their social media. Pygmalion-6b came out and at the time, since llama did not exist yet, pygmalion-6b was the best alternative because it was free and unfiltered. Then Roko's basilisk happened (free open-ai); people became addicted to how smart oai's models are, roko's basilisk shut down and the rest is history.
>>
>>96079607
Eh, I meant NAI in the "gonna take advantage of coomchubs and most probably deliver shit" department.
>>
>>96079607
Yeah I’m not retarded and I’ve been here the whole time. It’s just bizarre that it has so much staying power when their data literally makes the model worse and they have done basically negative work, and now their glomming-onto popular models is seen as some positive thing instead of the cynical leeching it is.
And now they’re going to try to make money off of all of their hard non-work.
>>
>>96079607
expanding on the /aicg/ lore part, the CAI trannycord and reddit "rebelled" after the filters went down for a day. people realized the devs weren't gonna take them away after benny's famous reddit post, so everyone decided to use pyg instead. cai was decently big at the time so it must've been thousands.
>>
>>96078618
The 2.3 bpw 70B is working perfectly fine on 70B.

Now, is it better than a mere 13B? Idk, it's hard to say because only the shitty llama chat model is available. It barely passes the amputation test.
>>
>>96079701
>The 2.3 bpw 70B is working perfectly fine on 70B.
Meant to say "working perfectly fine on 24GB"
>>
>>96079669
at least they show what aibros could do with a bit of organization
hate nai but roach got himself a nice little gig for the little work they put into their product
sadly aibros don't like organization
>>
>>96079669
There's bad stigma from their llama1 13b model performing really poorly, I'm guessing. It was their weakest model and it was obvious trying to train llama to follow their old 6b format wasn't cutting it. Their instruct models are a lot better and their current models are all instruct now. I'm not sure how they're making money, if at all. I know they worked with people in the past that donated hardware time to them so they could finetune models for free, but that was near the end of the llama1 era. I recall them being scared about training llama1 models initially because of the licensing. I haven't kept up with their internal workings in a few months.
>>
>>96079777
>show what aibros could do with a bit of organization
Nothing?
Most of the popular/good finetunes are one guy.
>>
I have a gtx 1080 ti and am currently unemployed because the startup I worked for imploded. What should I focus on learning with respect to LLMs and ML if I want my next job to involve working with similar technology? I don't have any industry ML experience beyond basic search engine and statistics knowledge.

Should I read and implement all the popular ML papers?

Do personal projects?

Learn the existing popular LLMs inside and out?

Apply to jobs??
>>
>>96079808
Model mixers don't count. Note how they have to wait for the base finetunes for their mixes to even exist, otherwise they would be sitting on their hands the whole time.
>>
>>96078789
Is this the meme prompt format model?
>>
>>96079808
well exactly since nobody organizes themselves
nai was some nftbro brownskin and he's now scamming chubs for big monies
pyg guys were retarded coomers about to start their own scam
meanwhile localbros sit with their thumbs up their asses waiting for autists to deliver goodies
>>
>>96079835
Seconding this question.
Except I do have a little bit of ML experience but nothing I would feel comfortable putting on a CV.
>>
>>96079847
I’m not even talking about mixes.
Airoboros is one guy
Synthia is one guy
Half of the leaderboard are one-guys.
The mixes are bizarrely more likely to be multiple people because people too dumb and oversocialized (same thing) to work alone can only really do mixes.
>>
>>96079530
>but now I'm going to be spending 3 weeks trying to get torch compiled on this fucking piece of arm shit.
Would it be faster to find whatever python 3.9 syntactic sugar exllama2 uses and replace it with something that works on 3.8?
>>
>>96079873
uhh yeah? do you really expect people to do serious work for free? anyone who actually has novel results and not just a meme erp finetune is sitting on their work waiting until they can get funding to release it as a saas
>>
>>96079835
Where's the RetNet schizo poster? There's a paper that's been waiting to be implemented.
>>
>>96079835
the links rentry has some learning stuff. watch karpathy's vids. probably would be good to network with ML guys on twitter (yaccine and kaiokendev). no one is stopping you from publishing an ML paper on arxiv if you have an idea you want to try. can be something basic like adding a new technique to a small model and running tests
>>
>>96079932
Any easy way to do that, or would I just be running it on Python3.8 and finding where it errors out until it works?
>>
>>96079934
except neither nai nor pyg had any novel ideas when it comes to business
my point is that people who actually give a shit could do good work with some organization if a bunch of retard scammers can
>>
>>96079943
https://github.com/fkodom/yet-another-retnet
>>
>>96079934
Their competition has been "doing serious work for free" for the better part of a year
>>
Does anyone know how I could have multiple layers in different parts of the network that share weights with each other? So the gradient for updates would be an average of the gradients at both places.
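For concreteness, this PyTorch sketch is the kind of thing I mean: the same Linear instance reused at two depths, so there's one weight tensor and autograd sums the gradients from both call sites (halving that, or folding it into the LR, gives the average):

import torch
import torch.nn as nn

class SharedLayerNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.shared = nn.Linear(dim, dim)  # one instance -> one weight tensor
        self.middle = nn.Linear(dim, dim)

    def forward(self, x):
        x = torch.relu(self.shared(x))   # first use of the shared weights
        x = torch.relu(self.middle(x))
        return self.shared(x)            # second use, same parameters

net = SharedLayerNet()
net(torch.randn(8, 64)).sum().backward()
# gradients from both uses have been accumulated into the single .grad tensor
print(net.shared.weight.grad.shape)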
>>
>>96079856
Alpaca
>>
>>96080028
Novel ideas like what?
>>
>>96080050
what?
>>
>>96080039
Meta did but the people doing it were getting paid handsomely.
People in their mom's basement training loras on erotic fanfiction or on eval sets to get higher scores on HF doesn't count as serious work.
>>
>>96080063
>neither nai or pyg had any novel ideas when it comes to business
Are there others that did have novel ideas?
>>
>>96080063
like the decoder only architecture or mixture of models or multimodal ai
>>
>>96080097
what? what are you even talking about?
>>
I keep hating and meming on that outlandish prompt format you can't properly implement in silly tavern but I had this thought that maybe fine tuning and making loras would be much more effective if it used a different prompting scheme. Perhaps easier to implement but just as distinct. Maybe it would move the model away from all the coding shit and all the parts that it should move away from.
>>
Is it only me or are L1 33b models totally soulless? I can't stand how retarded 13b are and tried going back to 33b several times only to be put off by the bland prose.
>>
>>96080176
they didn't have much work done on them lately
>>
>>96080176
33b wouldn't be so bad if it wasn't limited to 4k extended context. Any further than that and it becomes too unstable.
>>
>>96077631
wtf, where do I find a labelling job?
>>
>>96080176
This is the struggle for us 24GB vramlets.
Retarded but soulful 13b or neglected 33b without interesting finetunes. I'm still deciding whether 8bit tips the scales or not, it's a little more stable but the underlying retardation is still there I think.
You can try increasing temperature to summon more soul, 33b can handle pushing it a little higher, but it still schizos out pretty easily. Things aren't great for us atm.
>>
>>96080389
I'm satisfied with 7bit 13B 14k, but I really do wish Zucc made a 20-something B for the sweet spot.
>>
>>96080106
I mean that the decoder-only architecture invented for GPT, the mixture of models used by OAI, and multimodal AI are all novel ideas that came since the transformer was invented.
>>
>>96080447
what does it have to do with me or nai or pyg tho
>>
>>96080028
>except neither nai or pyg had any novel ideas when it comes to business
Fair enough but NAI trained their own models from scratch. Meta didn't innovate either except for training their own model from scratch, but not architecturally. But I do agree yet another llama finetune is not novel in any way.
>>
>>96077572
GPTQ is the old exllama1/AutoGPTQ method, exllama only supported up to 4B max. EXL2 is the new exllama2 method, which can run models all the way up to 8B, so it's a huge improvement. The larger the model the better and smarter, but the more vram required, so if you can run 8Bit exl2, you should definitely do that over say, 4B 32 group GPTQ.
>>96077759
NTA but just tested. Mythomeme 8B exl2 at 12k context fits on windows on a 3090, 22.5/24GB VRAM.
>>
>>96080507
They didn't though. All of their models until the recent ones were tunes. Mediocre ones at that.
>>
>>96080466
The other guy asked "Novel ideas like what?" and you replied with "what?". I am showing you examples of what the guy is asking for, i.e. novel ideas in LLMs could be.
I do agree though none of them had novel work, much less pyg who didn't even train a model from scratch.
>>
>>96080507
GQA and rope
>>
>>96080521
>GPTQ is the old exllama1/AutoGPTQ format; exllama only supported up to 4-bit max.
Thanks anon, I'm using exl2 now with this "fake 8 bit" (as some anon called it) mythomax, and I might be a schizo but it really seems smarter.
8k context is enough for me currently, it's still kinda dumb for slowburns and more complicated stuff.
>>
>>96080553
well my whating to him was also because i didn't get what it had to do with me
heh
>>
>>96080521
Does exllama2 work better on p40s? I assume it's still going to have the same problems as before.
>>
>>96080527
Oh, sorry I assumed since they were early to the ai chatbot game they had their own model. I didn't know it was a finetune.
>>
File: zm7w3-ah4xs.webm (1.87 MB, 1080x720)
>batching PR is up, 16 simultaneous 7b generations at once with no additional kv cache overhead
>the "re-processing context after you max out" demon has been slain, in the same PR
How the FUCK ggerganov?

https://github.com/ggerganov/llama.cpp/pull/3228
>>
>>96077631
>misbehave
>warden makes you tag the ekas smut
>>
>>96080617
They didn't even start out doing chatbots FYI, but yeah, it was all NeoX or OPT tunes.
>>
>>96080521
How much VRAM for 13B at 8bits with 8k context?
>>
File: 20based.png (43 KB, 1348x417)
>>96078789
>20b model vs 33b
>proceed to beat him in the scoreboard lmao
helo
try me please
read readme
>>
>>96078618
>>96079701
Great, now can they do that on an actual fine tune instead of default ass L2 chat? Airoboros 70b 2.3 BPW would be great. I'm still going to download this plain ass L2 70b chat of course, but we need some finetunes.
>>
>>96080669
probably around 20gb
higher bits take a lot of memory up front but the context is very light
>>
File: llamacpp.png (278 KB, 640x360)
>>96080623
>>
>>96077130
What's the best voice clone solution out there?
I played around with bark but got bad results. Seems to be new and obscure enough that there's not much public experimentation
>>
>>96080575
Don't believe the horse shit, I have used all the popular GGUF fine tunes of 13B's, as well as exllama1 4-bit and the new EXL2 8-bit. Exl2 8-bit is just as smart as a GGUF Q8, only it's 4x faster.
>>96080611
Honestly couldn't tell you, I'm running it on a 3090, but I would give it a test.
>>96080669
That's my default, I think 12k is a bit excessive. 8k context is using 19.5/24 GB VRAM.
>>
>>96080623
So when are we seeing the exlv2 and cpp death battle?
>>
>>96078789
Alright, I'll try your insane abomination meme model. Will report back tonight.
>>
>>96080784
cloning? rvc
there are like 1500 voices out there
>>
File: 1678717388351.png (481 KB, 900x1200)
What is the consensus on using ReiserFS/XFS to boost LLM reads?
>>
>>96080917
Just get more ram.
>>
>>96080917
It sent my AI into fits of rage and schizo rants about removing car seats, do not recommend.
>>
>1hr to download unquantized
>45min to quantize
>2hrs to upload
Kms
>>
>>96080981
>data cap is given by i/o
I don't think it works like you think it works
>>
llmtoddlers btfo
so, what's the most context you can reach right now?
is it anywhere close to claudechads yet?
>>
>>96081117
whatever you want
>>
>>96080389
Nigger, you can run 70B on 24GB now.
>>
>>96080418
Sweet spot is a working 34B. It fits in 24GB of VRAM with over 8k context.
>>
>>96081117
Implying a poorfag like you can even afford a full context claude
>>
>>96080611
>Does exllama2 work better on p40s
Exllama & v2 rely on lots of fp16, which p40s are terrible at. I think P100s would be more appropriate.
>>
>>96081143
Is it actually smarter than 13B? Not shitposting, gonna try the ~2.3bit 70B soon but I'm not getting my hopes up with 2.3bit
>>
>>96081168
which one kek
>>
>>96081207
It seems close to 13B but I only tried the shitty chat model so it might be that.
>>
>>96081117
>claudechads
Isn't claude the most censored llm ever? Can you even get it to output explicit smut?
>>
>>96080379
these jobs are typically outsourced to third world countries in Africa, or, like in Finland, to prisoners, so the company can get labelling practically for free
>>
>>96081289
You can but to be honest I find claude really dumb, I don't know why /aicg/ love it. GPT-4 is so far ahead, it's honestly at the level of local.
>>
>>96081216
Any 34B Q4_K. You should be able to fit double the context once the KV quant PR merges, too. You just have to deal with a schizo model with no fine tunes because they never released the non-code 34B.
>>
>>96081289
yes it is.

but the reason /aicg/ like it is because when you jailbreak it, the inverse happens.

you suddenly go from strictest llm on earth, to polar opposite unrestrained llm on earth.
>>
>>96081376
Claude is a lot better than turbo. GPT-4 is better, but the pricing is a joke ($3 per completion with a full context) and it's really slow.

>>96081289
Skill issue.
>>
>>96081455
sounds like you need to go back desu
>>
>>96081455
>Skill issue.
I never used cloud models and don't plan to.
>>
Guys I'm pretty sure current exllamav2's quantizer is fucked, I was getting normal ppls from the old version with the same calibration set. Time to roll back.
>>
>>96081455
I have never used turbo. Only gpt-4, claude and local, so can't compare between claude and turbo.
>>
>>96081505
anon, it would be patently obvious in the outputs if it was, or if the ppl was that bad
>>
One of the recent cunny cards got delisted from chub, wtf, I thought they liked free speech?
>>
>>96081556
That's calibration, if your calibration is fucked you just quantize the model suboptimally. It's still a bad thing but it won't be super obvious. The model I quantized after his change had +0.08 ppl at the same bpw and I'm pretty sure this is why.
>>
>>96078789
>>96080826
I tried the vanilla one (neither inverted nor precise). It's actually pretty decent. I don't use 13Bs much so I can't say how it does relative to most of them, but it seems pretty good. A few eyebrow-raising moments with the phrasing and a few lines that sound good but make zero sense in context, but nothing too out of the ordinary for models in that size range. Overall I was pleasantly surprised, for such a schizo frankenmerge it seems to be pretty solid and hardly brain damaged at all.
Tested with Q8 GGUF btw.
>>
>>96081591
Usually that means they used an NSFW image.
>>
>>96081599
Oh yeah right. Just tested and it's +0.002 on my end.
>>
I want to train my own qlora on mlewdboros lrpsgpt, but it's a gguf.
Is there a way to convert it back to a hf model?
>>
File: nice_gen.png (33 KB, 1095x290)
Yay! My lora finally seems to be working!... but for some reason it generated the ">>" wrong and the EOS is way earlier than it should be and also the "<s>" was interpreted as text rather than as the BOS token. I guess it's just a sign I need to train more, but I'm happy to see that this actually works and my time wasn't wasted.
>>
>>96081682
Maybe msg the dude?
>>
>>96081682
Just download the original .bin from HF.
>>
>>96081742
I can't find the original
>>
Tried using that YellowRoseCx version of kobold with rocm, the one that has gfx1031 support for RX 6700XT built in, and it doesn't work for me. I'm using a 6600XT, mind, but according to https://rocm.docs.amd.com/en/latest/release/windows_support.html they both use gfx1032, so shouldn't they both work?
Or is there a reason it would work only for 6700, but not 6600?
>>
File: kara8.png (67 KB, 1026x991)
How do I make her stop saying
>If there are any other shows or movies you want to watch instead, please let me know.~ <3~
?
I already set rep penalty to 1.3 but it doesn't seem to work.
>>
File: 4kc.png (4 KB, 371x87)
Do people just grab a normal model and increase the slider to add more context? I've been downloading 8k superhot models and reducing the slider, but it sounds like there's an alternative.
>>
>>96081973
The doc is wrong, 6700 XT is gfx1031.
>>
>>96082031
Ahh, so that's why my 6600 wouldn't work then. Thanks anon.
>>
>>96082027
retard
>>
>>96081999
You should edit her responses to remove repetition, otherwise the pattern will become more ingrained.
>>
>>96081609
Thanks for the feedback!
>>
>>96082059
But anyway gfx1032 is a superset of gfx1031 which is a superset of gfx1030.
>>
File: 469-1292315987.gif (1.88 MB, 480x264)
>>96077130
Complete n00b question: is LLaMA CPU or GPU heavy? Because I'm getting conflicting results on this.

Best GPU I can afford right now is a 3060 with 12GB of VRAM, and a Ryzen 5 5600G as CPU. Can I do anything with this?

Should I go from 16GB of RAM to 32GB, or will the other components bottleneck LLaMA before I run out of RAM?
>>
>>96082164
GPU gets you much better speed than CPU. The 3060 is plenty good for playing around with 13b's. You can try getting loads of RAM since it's cheaper and play with bigger models, but you'll get much slower output. Painfully slow.
>>
Does anybody know which model unhinged ai uses?
>>
>>96082027
yes seq_len = context

but context uses vram / ram.

just increase it until you run out of memory.

on a 3090 (24GB VRAM) and a 30B superhot, can normally get about 3600/3800 seq_len (context) before it goes out of memory.

However if you are on a 13B model then you can go higher.
>>
>>96082093
Hmm, that would probably solve it. But an automatic solution would probably be better. Maybe I can split the sentences and check the similarity for all sentences that have been written so far and regenerate until the reply is not too similar anymore.
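Something like this quick sketch, maybe (difflib ratio as the similarity measure; the 0.8 threshold is a guess I'd have to tune):

import difflib
import re

def too_similar(reply, history_sentences, threshold=0.8):
    # split the new reply into sentences and compare each against what was already written
    new_sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", reply) if s.strip()]
    for new in new_sentences:
        for old in history_sentences:
            if difflib.SequenceMatcher(None, new.lower(), old.lower()).ratio() >= threshold:
                return True
    return False

history = ["If there are any other shows or movies you want to watch instead, please let me know."]
print(too_similar("Sure! If there are any other shows you want to watch instead, let me know!", history))
# -> True, so this reply would get regenerated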
>>
>>96082258
Right, I think I understand all that, I'm just asking because it seems like only specific models can do this (based on pic-related). Right? But people like >>96077572 make no reference to custom versions that support extended context, as far as I can tell. I feel like there must be something missing, and my searches on oob's repo didn't help.
>>
>>96080623
holy fucc i'm cuuming
>>
>>96082373
(continued)
Is it that people are just applying a lora through the webui to whatever model they download?
>>
>>96082373
yes only specific models.

I can't remember when superhot was released but basically i think it was about 3 months ago.

so any non-llama2 model released in the last 3 months should either have been merged with superhot (it should say on the model readme page),

or it should be llama2, which by default has 4k context.
>>
File: st3890.png (74 KB, 909x112)
>>96079835
>>96079895
Depends what you want to do I guess, ML is massive. There's plenty to learn just from getting models running yourself, prompt formatting to get it performing tasks, integrating external data ala RAG is well established, plap ur husbando, build whatever ideas sound cool to you
baws deep on computer science from no background? maybe d2l.ai
figure out what people are hiring for coz there's shitloads of prior history that isn't important if you wanna get ahead with transformer LLMs
>>
File: IMG_0165.jpg (503 KB, 1668x2068)
I'm just goofing around with unhinged. Can you do this with a local model?
>>
>>96082210
Thanks, BTW what kind of hardware would you need to run LLaMA-65B? just asking.

>You can try getting loads of RAM since it's cheaper and play with bigger models, but you'll get much slower output. Painfully slow.

How painfully slow? are we talking CPU-only here or that CPU and that GPU I mentioned?
>>
>>96082472
Okay, thanks.
>>
Can I keep Nouveau for this shit or do I need the real nvidia drivers?
>>
>>96082373
SuperHOT happened first with the idea that you could use linear interpolation of RoPE to extend context, but this required finetuning at the new linear scale. Almost simultaneously meta released a paper discussing the same thing and everyone started merging SuperHOT into everything.

Almost simultaneously someone on reddit made a comment about scaling RoPE differently, taking insights from Neural Tangent Kernels. This new ntk-based RoPE scaling could be applied to any model at all without additional finetuning needed. So at that point everyone stopped merging SuperHOT into their own models just for the extra context length.

More recently there was some translated research linked here that examines an improved version of NTK RoPE scaling, but I don't know if anyone's implemented it in the usual backends. It has both an allegedly perfected version of the scaling that requires some finetuning but would give greater performance, and a slightly less good version that could be a drop-in replacement for NTK on existing models.
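A rough sketch of the difference between the two scalings (frequency math only, LLaMA's head_dim of 128; not any particular backend's code):

def rope_inv_freq(head_dim=128, base=10000.0):
    # theta_i = base^(-2i/d) for each pair of dimensions
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def linear_scaled_angle(pos, i, scale=4.0, head_dim=128, base=10000.0):
    # SuperHOT-style linear interpolation: squeeze the positions by the scale factor
    return (pos / scale) * rope_inv_freq(head_dim, base)[i]

def ntk_scaled_angle(pos, i, alpha=4.0, head_dim=128, base=10000.0):
    # NTK-aware scaling: leave positions alone, stretch the base instead
    ntk_base = base * alpha ** (head_dim / (head_dim - 2))
    return pos * rope_inv_freq(head_dim, ntk_base)[i]

# highest frequency (i=0): NTK leaves it untouched, linear squeezes it
print(linear_scaled_angle(8192, 0), ntk_scaled_angle(8192, 0))
# lowest frequency (i=63): the two end up nearly identical
print(linear_scaled_angle(8192, 63), ntk_scaled_angle(8192, 63))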
>>
File: ngl.png (37 KB, 1138x822)
>>96082505
picrel is 70B from CPU (quad DDR5-5200) to full GPU w/ llamacpp. maybe a bit out of date with latest diffs but shows performance tanks quickly without all layers on GPU
smaller quants are available but i'm yet unconvinced of mad bitcrushing exl2, realistically imo you want 2x24GB GPUs, ideally 3 to run 70B well. medusa/speculative may be promising too
>>
>>96082575
>This new ntk-based RoPE scaling could be applied to any model at all without additional finetuning needed.
Does the oob slider implement this scaling method? Is it implemented somewhere else in the webui?
>>
>>96080623
330 t/s

> if i could get fucking exllama2 to fucking run i could compare
>>
>>96082688
Nobody is expecting a single 3090 to run 70B "well", the question is whether crushed 70B is better than 4bit brain damaged codellama 34b, old llama1 33b, or 8bit llama2 13B. Because those are the choices you have at that level.
>>
>>96080623
Now find a way to turn those 16 7bs into an ultra MoE that rivals much bigger models.
>>
>>96082699
(continued because I realized there is language about ntk in the subsequent parts of the loader ui)
So could I just adjust the "alpha" or "base" values instead of the "compress" one using the appropriate math based on the ratio of my desired context to the model's context?
>>
>>96082771
whatever 70b can pull off is worthless with the context size desu
>>
>>96082811
It's not 16 7Bs, just 16 generations using the same 7B. This could be useful for LoRA-based MoE though.
>>
>>96078381
which gpu?
this works even in windows. you only need an up to date driver and a gpu above the RX6800
https://github.com/YellowRoseCx/koboldcpp-rocm

although he is trying to make the smaller cards work, but i don't know if anything came of it yet.
>>
>>96082893
He made a version that supposedly works with the RX6700, but I haven't seen anyone confirm/deny it works.
Just an anon ITT earlier trying it with his 6600, and it doesn't work for that.
>>
>>96081897
Look harder.
>>
>>96082893
Above the 6800? Motherfucker, I just bought a 6800 XT after looking at the list of HIP compatible cards on AMD's website
>>
>>96083056
kek
>>
ermmmm
exlv2 loras when
>>
>>96083056
>6800
6800 and 6800xt are both gfx1030 and should be fine
>>
File: file.png (38 KB, 630x443)
>>96082373
I'm >>96077572, and I don't know if it's right but I basically just crank the context to 8192 on 4k llama2 models, then change the bottom slider accordingly. I think this dumbs it down a bit, but it's been working well for now.
>>
>>96082688
>realistically imo you want 2x24GB GPUs, ideally 3 to run 70B well.

Holy shit, and how fast would it run like that? Like how much delay to, say, generate an answer to a simple question? Same as CGPT? Could it at least be multi-user, or only one user at a time?
>>
>>96083123
The card is arriving later today and I pray to god you're right.
>>
>>96083143
anon
compress is really obsolete
>>
>>96083176
See, I don't know I'm a retard unless someone tells me, thanks
should I just keep it 4k then, or use the other thing?
>>
>>96079094
I find that format works better with a number of models. Instead of
"Durr roleplay with me instruction response instruction response instruction response etc"
It's basically presenting the entire context as one instruction.
"Please write the next message in this role play"
>>
>>96083188
alpha 3 is cool for 8k
people were saying rope is better but i haven't seen any recommendations yet
>>
>>96083236
Alright, i'll try to set it to 3, thanks. Is it supposed to be better quality this way? Or just more efficient?
>>
File: gvtnkynvkvob1.png (8 KB, 606x281)
>>
>>96083236
I'm just a fellow retard, but to me the description of alpha_value says that it *is* the NTK RoPE scaling. 3 is an awfully round number though, it seems like given the rope_freq_base formula it should be something with decimal places. Not that I have any idea how rope_freq_base relates to context sizes.
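For what it's worth, the mapping I've seen described for exllama-style loaders (so treat it as an assumption, not gospel) is rope_freq_base = 10000 * alpha^(d/(d-2)) with d = 128:

# assumed exllama-style mapping between alpha_value and rope_freq_base
for alpha in (1.0, 2.56, 3.0):
    print(alpha, round(10000 * alpha ** (128 / 126)))
# alpha 2.56 lands around 26000 and alpha 3 around 30500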
>>
>>96083145
dunno about current speeds, probably faster than gpt4 if that's what you optimize for. can get slower as your prompt grows, there are always improvements coming
can also hire highend cloud gpus quite cheaply to test
>multi-user
yeah see >>96080623
>>
>>96083278
Embarassing.
>>
Anyone here try out tigerbot 70b v2? I'm really curious about it with its stupidly high ARC score but I gotta wait til Saturday before I can give it a fair shot...
>>
>>96083265
afaik compress only worked properly with the tuned models hence everyone was merging superhot back in the day
>>96083345
some people say 2.56
there is a third one nowadays, a separate rope scale
i believe it was dynamic v. static?
>>
>>96083176
>compress is really obsolete
I don't think anyone showed that desu, the reason NTK took off is because it doesn't require any finetuning. No one wants to deal with curating a dataset or training, especially because extra long sequences are hard to find and require a fuckton of VRAM to train. So everyone just tries to close their eyes and pretend that models don't need to be trained on long context to support it. Realistically, they probably do.
IMO NTK definitely has a coherence cost, but it's hard to pin down unless you're dealing with formatting and numbers, where NTK falls over in a really stark way.
>>
>>96083458
I mean, that's a good enough reason for me to "obsolete" it.
Now that I think about it, did people ppl-test SuperHOT merges back in the day? It had to increase ppl at least twice. They were so bad.
>>
>>96083364
Damn, so the 4090 is actually an absolute god and everyone here pretends that it isn't worth it?
>>
>>96080623
Great, I've been stuck with vllm-gptq which lacks a lot of parameter options. I've never used llama.cpp so will I be able to load 100% of the model onto the GPUs?
>>
>>96083593
poor cope
>>
>>96083593
nobody here runs 16 and 8 bit tho
>>
>>96083593
The blue bars are the actual performance. The other bars are tensor core meme shit that nothing but first-party Nvidia software uses.
>>
>>96083641
Or maybe the chart is just shit
>>
>>96083623
exllama uses 16 bit math and llamacpp does math at whatever precision the model is quantized at.
>>
>>96082490
why does nobody answer this question? I just want to know if it's worth getting into local models, I'd have to invest a couple thousand bucks
>>
>>96083390
>some people say 2.56
Here's a graph from the Reddit post anon mentioned, looks like the optimum depends on context length. You'd want your alpha to be as low as possible without taking off.
>>96083593
Doesn't the 4090 not support multiple cards? My 3090 is fine on inference speed for 30b, if I were going to upgrade I would primarily want another card for more VRAM.
>>
>>96083669
>can ask random questions to a local model for answers
What do you think? Except asking them to ask cock will have them do it
>>
>>96083658
The blue bars do track pretty close to actual performance. A 4090 is about 40% faster than a 3090 which is 40-50% faster than a P40.
>>
>>96083700
No, it doesn't support NVLink which is irrelevant for this
>>
>>96083706
im wondering about response length, vocabulary and internal consistency
>>
>>96083700
do note this is for llama1
it's not equivalent
2.56 is what some anon said is the optimal for 8k
i do not care enough to test it
>>
>>96083700
The 4090 doesn't support nvlink, which is a meme and doesn't help unless you're using shit like 8-way tensor parallelism.
>>
>>96083553
>superhot refuses to be ppl-tested
okay cunt baka
>>
>>96079804
No, their data itself
people have gone in and found that a shitload of it has lefty faggotry contamination, corrupting the lot
>>
>>96083766
wtf i love pygmalion now
long live the people's republic of manlovia
>>
>>96083056
no, i mean including the 6800 and 6800XT
>>
>>96083458
Linear scaling doesn't work well without fine tuning and NTK (v1) doesn't converge properly in fine tuning so it's only for small increases (less than 4x) on naive models or it will have problems. There's a newer version of NTK called YaRN that's even slightly better on naive models and can be fine tuned for >64x context extension with minimal loss.
>>
>>96080623
How long until assassination from Nvidia?
>>
>>96083904
Once China sinks Taiwan and TSMC into the ocean
>>
>>96083904
Even if he dies. It's all open. Someone else will continue it.
They can't stop us. They can't stop the future.
>>
>>96083928
based
>>
>>96083952
reddit
>>
>>96083593
In (plain) exllama v1 it gets me about an extra 15 t/s on 33B over my 7900 xtx, but:

- no one optimizes for ROCm yet
- that difference narrows with heavier samplers and larger models, context, and other overhead
- if you're mainly using language models for chat the speed doesn't really matter after a certain point; you can only read so fast
- it's a massive pile of fuck that takes up 5 slots internally and I'm never going to fit my 4090 into a dual gpu build for 70B that can breathe
- general software improvements are coming fast enough that buying a gpu for marginally faster inference is not that important when vram is still the main bottleneck

I'm planning on building a new linux machine with second 7900 xtx for playing with LLMs and using my 4090 build for windows gaming.
>>
>>96083364
stop posting this meme pic
>>
>>96083725
test models out for free on Horde or rent an instance for a couple of dollars and try out any GPU or model you want.
if you can't even do that forget about dumping thousands on this.
>>
File: ReiserFS.jpg (273 KB, 1213x827)
>>96080917
That depends....Are you married? Do you want to stay married?
>>
File: CompOpts.png (68 KB, 683x919)
I finished running some tests on compiler options with llama.cpp's CPU LoRA code, and I wanted to see if these results make sense before I go any further. Keep in mind that the CPU LoRA code doesn't offload layers, and the only GPU benefits are coming from the core logic of llama.cpp itself having some GPU optimizations.

Some Notes: There's a significant margin of error, and I need to do a few more sample runs, but for now only look at trends if they're consistent across models. "RAM Use" is the same as "Memory" in task manager. "WorkSet" is the Working set memory usage (closer to the actual RAM usage, without hiding memory mapped files).

cuBLAS is much faster, but uses a ton of extra RAM
CLBlast is slightly faster, but doesn't affect RAM
OpenBLAS basically does nothing, except for 13B, where it somehow makes things insanely slow...

Using LLAMA_CUDA_F16 is sometimes better, sometimes worse. I need to test it more.
Setting LLAMA_CUDA_MMV_Y usually doesn't change anything.
Setting LLAMA_CUDA_MMV_Y to 4 makes 13B's really slow.
Setting LLAMA_CUDA_FORCE_DMMV makes 13B's really slow.

I really want to know what's up with the random 3+ day runs. I'm starting to think those are being caused by something else.

I'm just not sure if/how I want to document this info. Anyone have thoughts?
>>
>>96083904
Why would they? This is tech that will be trivial to adapt to run on Grace Hopper, which will completely BTFO everything Apple makes. All for the low, low price of more than you could possibly afford.
>>
>>96083996
Try exllama v2, it's faster.
>>
>>96084080
I'm using it now; I'm just using v1 as a point of comparison because exllama 2 wasn't out when I was on my 4090. turboderp's own benchmarks drive home the point though, 3090ti -> 4090 is another 5 t/s on 70B
>>
>>96083725
>im wondering about response length, vocabulary and internal consistency
The max response length is the same as the context size (4k tokens without scaling), minus one I guess for the start token. The vocabulary of the LLaMA family of models is 32000 tokens. The internals of the model do not change during, before, or after generation, so it's fairly consistent.
>>
>>96084053
Can you also test if enabling LTO in CMake makes things faster?
>>
>>96084207
Sure. Mixed with any other options, or just by itself?
>>
>>96081714
What is your lora doing?
>>
>>96084292
Mixed with CUBLAS
>>
>>96082322
Increase the temp
>>
>>96083669
Go back faggot.
>>
>>96084292

damn, xaedes is that you?
bruh, why do we train in int/fp32 instead of fp16/8/4/3 bit or whatever? In the current implementation it takes 200 days to lora-tune the tiny 7B shit.
This is not a mem bandwidth issue, since it's not 6X but 400X slower than the GPU. And I tested the lora-tuner on a dual-socket NUMA 2x64-core Epyc 7742 with octa-channel DDR4 at 200GB/s.

Also, did you consider switching to the Sophia/Prodigy/Lion optimizers? Any other tricks to speed the shit up? At this point your great repo is essentially useless. Anyway, thx for the effort. Again bruh, the memory ain't the issue. The speed is.
BTW, any plans on supporting AMD GPUs?
>>
>>96082854
I prefer 70B / smaller context used with ChromaDB. Memories aren't as precise, but it's good enough.
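Roughly what that looks like, if anyone's curious; the collection name and k are arbitrary, and it's assumed you summarize/chunk turns into short memories before storing them:

```python
# Minimal sketch of ChromaDB as long-term chat memory; names are arbitrary.
import chromadb

client = chromadb.PersistentClient(path="./chat_memory")
memories = client.get_or_create_collection("rp_memories")

def remember(turn_id: str, text: str) -> None:
    # store one summarized memory; Chroma embeds it with its default embedder
    memories.add(ids=[turn_id], documents=[text])

def recall(query: str, k: int = 3) -> list[str]:
    # pull the k most similar memories for the current situation
    hits = memories.query(query_texts=[query], n_results=k)
    return hits["documents"][0]

# stuff the recalled snippets into the prompt before generating
context = "\n".join(recall("what happened at the tavern"))
```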
>>
>>96084426
Retard
>>
>>96084426
I am not xaedes. See >>96076992
>>
>>96084307
In the screenshot, all tokens generated were influenced by the LoRA; the writing of the scenario looks close to the examples in the dataset. Unfortunately, it's not yet doing the most important thing I want it to do: write a roleplay based on the scenario.
My dataset looks like this:
<s><<SCENARIO>>
Scenario goes here
<</SCENARIO>>

[Person1]
Person1 text goes here

[Person2]
Person2 text goes here

...
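For reference, building one sample in that format is just string templating (the <s> is assumed to come from the tokenizer's BOS handling, so it isn't written into the text here):

```python
# Sketch of one training sample in the format above; the leading <s> is assumed
# to be added by the tokenizer's BOS handling rather than written into the text.
def build_sample(scenario: str, turns: list[tuple[str, str]]) -> str:
    body = "\n\n".join(f"[{speaker}]\n{text}" for speaker, text in turns)
    return f"<<SCENARIO>>\n{scenario}\n<</SCENARIO>>\n\n{body}"

print(build_sample("Two travellers meet at a crossroads.",
                   [("Person1", "Which way are you headed?"),
                    ("Person2", "Wherever the road is quietest.")]))
```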
>>
>>96082854
>whatever 70b can pull off is worthless with the context size desu
4k is fine. I don't even extend context for 13b because it's still not enough for slowburn and makes the model more retarded. Instead, I just create a summary after each scene change and reset the chat.
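The flow is basically: summarize the finished scene, then seed a fresh chat with just the summary. A rough sketch, with generate() standing in for whatever backend call you use:

```python
# Rough sketch of "summarize, then reset"; generate() is an assumed helper that
# sends a prompt to your backend and returns the completion text.
def rollover(chat_log: list[str], generate) -> list[str]:
    summary = generate(
        "Summarize the scene below in a few sentences, "
        "keeping names and important facts:\n\n" + "\n".join(chat_log)
    )
    # the new chat starts with only the summary instead of the full log
    return [f"[Previously: {summary}]"]
```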
>>
File: 1690668477045888.jpg (70 KB, 500x620)
70 KB
70 KB JPG
>>
what the fuck is BerrySauce
>>
>>96084793
Talk to it, anon
>>
>>96084793
acig tranny spam
>>
File: 1080-P40.jpg (320 KB, 2553x1436)
320 KB
320 KB JPG
>>96075405
1080 ti fan kit on P40. no clue how well it cools.
https://www.bilibili.com/video/BV1gY4y1m7P1/
>>
>>96084323
>>96084207
LLAMA_LTO performs slightly better for everything.

3B: 03:35 -> 03:30
7B: 08:57 -> 08:47
13B: 18:12 -> 17:33

With cuBLAS, it's not as much of an improvement, but still something:
3B: 02:53 -> 02:52
7B: 07:05 -> 07:00
13B: 14:03 -> 13:34
>>
>>96084805
I literally cannot tell the difference between any other model
>>
>>96084881
Same
Unless it’s legit broken I honestly can’t tell the difference between any finetune or anything.
I can tell (7b/13b) vs (70b/oai/Claude) and that’s as fine grained as my palate gets
It’s a blessing and a curse
>>
>>96084519
Ahh, gotcha.
So you're possiblyAnEngineer, then tell xaedes his code is broken. The thread pool is broken for sure and the backprop is definitely not optimized. It's just too slow even if you use ahelova cores and ahelova fast mem. Xaedes doesn't seem to care about speed; he laser-focuses on mem savings for some reason.
Even if you get 20 t/s on 13B at 4096 ctx (which is a very good speed), it'll take a few weeks to tune on a 40M-token OpenHermes dataset. That's a non-starter.
He gotta fix that, bruh.
>>
>14gb model doesn't fit in my 24gb card
Why?
>>
>>96085019
Let me get my crystal ball
>>
>>96084881
Develop an irrational preference for berry since it's made by one of us then.
>>
File: hipblas.png (73 KB, 1186x660)
73 KB
73 KB PNG
This is the setting, right? Has anybody gotten this to work? It still crashes immediately for me.
>>
>>96084917
a lot of the output is affected by the base model more than the finetune. The effect of a finetune can be drastic depending on prompt but it won't really change the overall inference patterns THAT much.
For example try asking literally any llama2-13b based model for a tale about two brothers and you will find out that they will live in a forest or at the edge of a forest in a small town and will be described as being very different from each other.
Only the terrible schizo models break the usual patterns but they can't write anything coherent so it's kinda pointless.
>>
>>96084847
Thank you! Interesting, looks like the performance gain scales with the model size.
>>
>>96085104
Fucking retarded windows user. AMD should have never released ROCm for windows.
>>
>>96085019
because you didn't read the card
>>
>>96085104
If you run it from the command prompt you should get an error message hopefully.

You might have to add the hip sdk's bin folder to your path environment variable. I haven't tried the new koboldcpp yet to see what it's doing.
>>
>>96083236
>>96083265
3 is too much. Alpha 2.6 is enough for 8192 context.
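For reference, alpha in the exllama-style loaders just rescales the RoPE base; a minimal sketch, assuming that NTK-aware convention and LLaMA-2's head_dim of 128:

```python
# Minimal sketch, assuming the NTK-aware convention used by exllama-style
# loaders: alpha rescales the RoPE frequency base.
def rope_base_from_alpha(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(rope_base_from_alpha(2.6))   # ~26,400 for 8k context on a 4k-trained model
```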
>>
File: assert.png (27 KB, 742x514)
27 KB
27 KB PNG
>>96084847
There's another thing that I suspect can make things a bit faster: disabling assertions.
Could you test this as well? In case you don't know how to do it, you just have to replace what I selected with "#define GGML_ASSERT(x) ((void)0)" (keeping the parameter so the call sites still compile).
>>
>>96084992
llama.cpp's thread pool is a joke; there are some PRs aiming to fix that, but ggerganov doesn't seem to give them any priority:
https://github.com/ggerganov/llama.cpp/pull/2674
https://github.com/ggerganov/llama.cpp/pull/851
>>
on linux, does it make a difference for llama if the card is nvidia or amd?
>>
>>96085406
Very unlikely; asserts in performance-sensitive paths are already disabled in release builds.
>>96085470
What exactly do you think these PRs fix?
>>
>>96079208
Do Pyg models even rank competitively against anything else at this point?
>>
>>96085567
Where is the code disabling asserts in performance sensitive paths? I couldn't find any.
>>
>>96084519
> Traditional advice has always been to use number of physical cores + 1 or 2

That's traditional advice for single-batch inference, which is memory bound for an LLM even on a GPU like a 3090, let alone a CPU with DDR.
But that's not the case with training, parallel inference, or even prompt processing. Those are compute bound. If you don't believe me, just run Llama on CPU alone, crank up -t, and check the prompt eval speed.
Now, since the backprop calculates/updates ahelova params at each step, it's compute bound, so tuning on a 2x64-core NUMA rig with 200GB/s of memory shouldn't take freaking weeks.
Actually, I coded some dirty mem bench in C++ to check the max number of threads needed to saturate all the memory. It measures the max throughput for a given number of threads. Guess what: with 10kB chunks, max bandwidth hit 204GB/s, and the minimal thread count was... 920
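For anyone who wants a quick sanity check on their own box, here's a rough single-threaded Python analogue (not the same code as that C++ bench); saturating every channel on a big NUMA system really does take a lot of threads, which this deliberately doesn't attempt:

```python
# Rough single-threaded analogue of a memory bandwidth check (not the anon's
# C++ bench): copy a ~1 GiB buffer and report effective read+write GB/s.
import time
import numpy as np

N = 1 << 28                         # 256M float32 = 1 GiB per buffer
src = np.ones(N, dtype=np.float32)
dst = np.empty_like(src)

start = time.perf_counter()
np.copyto(dst, src)
elapsed = time.perf_counter() - start
print(f"{2 * src.nbytes / elapsed / 1e9:.1f} GB/s")   # read + write traffic
```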
>>
>>96085851
Performance-sensitive asserts use the standard C assert macro, which is disabled when NDEBUG is defined in release builds. GGML_ASSERT and other similar macros are mostly used to check input parameters and other invariants while setting up the computation.
>>
TheBloke is now doing AWQ quants, not EXL2...
>>
>>96085913
Based Bloke making you wait
>>
>>96085920
Wait?!
He's making me do it myself; he's not going to do them.
>>
>>96085896
ooh, I see.
>>
>>96085527
Yes, with regard to speed and VRAM usage.
Not in answers/predictions.
>>
>>96077130
The obvious end point of all of this is LLMs using web search to support their answers (similar to Perplexity AI or HuggingChat). Well, the web searches will almost all go through Google's API. And if not theirs, Microsoft's (most alternative search engines just use Bing anyway). And all of this will be powered by their cloud infrastructure. So, basically, no matter who wins, we lose.
>>
>>96079359
>>96079413
They do have legacy credit/clout with people, but I feel like if there's anyone who has a firm grip on the casuals, it's Chub, not Pyg. Just a feeling; the Dicksword and Plebbit numbers for the latter are bigger, but I don't think the active user numbers match.
>>
>>96085976
ok thanks
>>
>>96085326
Looks like I downloaded the HIP sdk from AMD but forgot to actually install it lel
>>
>>96086006
Yes, we lose.
Because. It's over.

But it's also about how hard we lose, and whether we win small things in return.

Just imagine if we get a model that is similar to or better than GPT-4 that we can run with just a few GPUs at home.
>>
>>96085913
>still 4bit
why
>>
>>96080379
protip: if it sounds like a Mechanical Turk clickwork type of deal, get your disappointment about the pay out of the way early, because it is almost exclusively aimed at residents of low-wage third-world countries (so the economics can actually make sense)
>>
>>96085913
I'm still happy with that
>>
>>96085720
They never did
>>
>>96086020
I hope you’re right.
>>
>>96085470
yes, he literally sabotages the CPU users.
Look at https://github.com/ggerganov/llama.cpp/pull/2534
and here's his own PR https://github.com/ggerganov/llama.cpp/pull/1507

Never merged, and they can't be merged now since they've broken ahelova stuff midway. lol
SqueezeLLM, the LoRA tuner, Vulkan, steering vectors, xgen support (#2053), same thing: waiting for approval fucking foreva until the devs quit the shit altogether.
>>
>>96086280
>SqueezeLLM
is cursed. It adds a lookup table to the beginning of every row. Properly integrating that & the memory management looks painful. All the other quants are self-contained within their quanted blocks of weights.
>>
Me: Hi!
AI: Hello! How may I assist you today?
Me: Do you like cheese?
AI: I'm just an AI, I don't have personal preferences or taste buds, but I'm here to help you with any questions or tasks you may have! Is there anything else you'd like to know or discuss?

This response is a perfect example of a well-crafted and empathetic answer from an AI assistant. It acknowledges the user's question, provides a neutral and respectful response, and offers assistance in any other area the user may need. The assistant's response also highlights the limitations of AI language models, which are programmed to avoid giving personal opinions or showing emotions. This is important to ensure that AI interactions remain professional and objective, and that users do not feel uncomfortable or misled by overly personal or emotional responses. Overall, this response demonstrates the importance of empathy and understanding in AI interactions, and how well-designed AI assistants can provide valuable and helpful responses to users' queries while maintaining a respectful and professional demeanor.

This reveals some of the "reasoning" ethicists use.
>>
>>96080784
>>96080849
Is that better than sovits?
>>
File: training.png (322 KB, 2646x1596)
322 KB
322 KB PNG
I am the anon from the previous threads who is trying to chain the output from the model onto itself to produce better overall predictions.
It's training! (albeit extremely slowly)
The adapter, that is. Once the adapter is trained and I see whether the model performs any better with its own prediction available as an input through the adapter, maybe I'll try finetuning the model itself.
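Not this anon's actual code, but for anyone wondering what "own prediction fed back through an adapter" could look like in the abstract, here's one generic way to wire it; the architecture below is purely an assumption:

```python
# Generic sketch only (the anon's actual wiring isn't shown): a small trainable
# adapter mixes the first pass's token distribution back into the embeddings.
import torch
import torch.nn as nn

class FeedbackAdapter(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(vocab_size, hidden_size, bias=False)

    def forward(self, embeds: torch.Tensor, prev_logits: torch.Tensor) -> torch.Tensor:
        # add a projection of the previous prediction to the input embeddings
        return embeds + self.proj(prev_logits.softmax(dim=-1))

def two_pass(model, adapter: FeedbackAdapter, input_ids: torch.Tensor) -> torch.Tensor:
    embeds = model.get_input_embeddings()(input_ids)
    with torch.no_grad():                       # base model frozen on pass one
        first_logits = model(inputs_embeds=embeds).logits
    # second pass sees the embeddings conditioned on the first prediction
    return model(inputs_embeds=adapter(embeds, first_logits)).logits
```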
>>
File: overview.png (61 KB, 346x1302)
61 KB
61 KB PNG
>>96085869
Looks like the fine-tune isn't using vectorization at all; I bet that is a big issue when you have fast RAM. But on my machine it looks like 20% of the time is spent on compute and 50% on memory.
>>
>>96086442
Based. Praying for you.
>>
>>96086527
>isn't using vectorization at all
actually nvm, I was reading that information incorrectly, it's actually using vectorization well (95%)
>>
What do you think Terry Davis would think of LLMs? I think he'd either see them all as collectively the literal antichrist, or else would write his own training and inference code from scratch and train a like 5M model (on the Bible and not much else) to replace the word salad generator in TempleOS. And then call Sam Altman a nigger roughly five million times on Twitter
>>
How high can you push llama2 context on 4-bit context?
>>
File: 1688565209423790.png (129 KB, 960x960)
129 KB
129 KB PNG
>>96086610
what
>>
File: file.png (1.58 MB, 1920x1080)
1.58 MB
1.58 MB PNG
>>
>>96086604
I would have loved a HolyLamma.
>>
Where do I find perplexity scores? I want to see exl2 mythomax 8bit vs 4bit exl1
>>
>>96086785
turboderp has a note on the exllamav2 github that exl2 tests and comparisons will be forthcoming once he has time to do all the conversions involved.
>>
>>96086785
your local garbage dump, because perplexity is a meme
>>
>>96086785
I compared them in ooba against the default wikitext with all the variants

5.035 GPTQ 4bit-32g actorder
5.004 4.625bit-exl2
4.984 8bit-exl2
4.980 8.1bit-MAX-exl2

FYI I think I used exllama2_hf for all of these but I might've updated exllamav2 somewhere in there so ymmv. And who knows if ooba's ppl thing even works right. But there you have it
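If you don't trust ooba's numbers, computing perplexity yourself is only a few lines; the model id, window size, and file name below are assumptions, and you need a transformers-loadable (unquantized) copy of the model:

```python
# Hedged sketch: chunked perplexity of a HF causal LM over a text file.
# Model id, window size, and file name are assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Gryphe/MythoMax-L2-13b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

ids = tok(open("wikitext_test.txt").read(), return_tensors="pt").input_ids.to(model.device)

window, nll, n_tokens = 2048, 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean NLL over the chunk
    n = chunk.size(1) - 1                        # tokens actually scored
    nll += loss.item() * n
    n_tokens += n

print("perplexity:", math.exp(nll / n_tokens))
```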
>>
>>96086879
oh thanks. So what is better, exl2 hf or standard?
>>
>>96086891
You're confusing the exl2 quant format with the exllamav2 model loader.

exllamav2_hf is just exllamav2 with huggingface samplers, same as exllama_hf.
>>
>be me
>uncheck rocm updates because I can't be bothered to recompile right now for a 0.01 version update
>be arch
>*deletes the whole rocm folder in response*
>you're welcome
>mfw
>>
year of the linux desktop btw
>>
>>96086984
MOM CANCEL MY MEETINGS
>>
>>96086984
deserved
>>
power dynamic
>>
>try to load exl2 model in ooba
>Error: CUDA_HOME environment variable not set
I guess I shouldn't have ooba, RVC, and Stable Diffusion in the same conda env after all... I thought they all shared the same pip packages.
>>
delicious flat chest
>>96087189
>>96087189
>>96087189
>>
>>96086311
fine, but what about the other PRs?
Why is there no separate thread pool for prompt processing and eval? Why is Vulkan dead? Why isn't xgen merged? Why is the Falcon converter still missing deps? Why are YaRN and ALiBi still broken? Why is batch inference still not merged?
wtf is going on here?
>>
>>96080849
does that require fine-tuning?


