/g/ - Technology






File: MikuDetachedTwinTails.png (1.61 MB, 848x1200)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>100124740 & >>100119461

►News
>(04/21) Llama 3 70B pruned to 42B parameters: https://hf.co/chargoddard/llama3-42b-v0
>(04/18) Llama 3 8B, 70B pretrained and instruction-tuned models released: https://llama.meta.com/llama3/
>(04/17) Mixtral-8x22B-Instruct-v0.1 released: https://mistral.ai/news/mixtral-8x22b/
>(04/15) Microsoft AI unreleases WizardLM 2: https://web.archive.org/web/20240415221214/https://wizardlm.github.io/WizardLM2/
>(04/09) Mistral releases Mixtral-8x22B: https://twitter.com/MistralAI/status/1777869263778291896

►FAQ: https://wikia.schneedc.com
►Glossary: https://archive.today/E013q | https://rentry.org/local_llm_glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling/index.xhtml

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>100124740

--Breaking Repetitive Patterns in Language Models: >>100129483 >>100129505 >>100129838 >>100129763
--Anon's Rant About LLMs' Pattern Matching Limitations: >>100125416 >>100125474 >>100125743 >>100125814
--Anon's Impressions of Apple M3 Max Inferencing Performance: >>100126762
--The Limitations of LLM-based Validation Datasets: >>100126802 >>100126942
--LLM Optimization and Innovation: Pruning, Quantization, and Distillation: >>100124789 >>100124845 >>100125217 >>100124916 >>100124997 >>100125277 >>100125465 >>100127198 >>100127521 >>100127716
--Fine-Tuning Woes: Need Better Datasets: >>100125274 >>100125767
--The Cost Prohibitive Reality of Large-Scale Bitnet Ternary Models: >>100129550 >>100129664
--Shaking Off AI Pattern Repetition: >>100129174 >>100129231
--Llama3 Base Models: Learning Rate Adjustments: >>100129433 >>100129690
--MT-Bench Spreadsheet Hits Google Limit - Data Truncation Issues: >>100125125 >>100125882 >>100125893
--Anon's Struggle with Long Running Prompt Processing: >>100125545 >>100126032
--Experimenting with Poppy_Porpoise and Aura Models: >>100125820 >>100113478 >>100126170
--Evaluating LLMs: RAG Benchmarks and Arena Hard Scores: >>100126581 >>100126636 >>100126780
--Release: Meta-Llama-3-8B-Instruct-GGUF Model: >>100127078 >>100127721 >>100127871 >>100127925
--Anon's Inquiry on Best Image Models: >>100127130 >>100127301 >>100127152 >>100127162 >>100127284 >>100127301 >>100127553 >>100127563
--L3-8B Model Goes Insane: Investigating Context Size and NSFW Content: >>100127226 >>100127269 >>100127282 >>100127514 >>100127529 >>100127566 >>100127600 >>100127594 >>100127621 >>100128606 >>100129077
--Anon's Struggle with Merging Ooba, Tavern, and Orca Models: >>100127291 >>100127844
--Sex Sounds Dataset: >>100125097
--Miku (free space): >>100124792 >>100125412 >>100125766 >>100126502 >>100126654 >>100127516 >>100127553 >>100128488

►Recent Highlight Posts from the Previous Thread: >>100124751
>>
File: 1694623326080844.png (77 KB, 777x637)
>>
>>100130461
https://github.com/booydar/recurrent-memory-transformer/tree/aaai24
>>
File: 1683982121751822.png (107 KB, 500x397)
>>100130461
Imagine waiting for a 2 million token prompt to process
>>
>>100130427
>siliconmaid
>kunoichi
>kunoichi dpo v2
How the FUCK are these 7B models so good? Even mistral, llama 2 7B are dogshit compared to these, both in instruct and story-writing
Am I doing something wrong?
>>
I have a 3600xt in a box
Also have 16 gb of ddr4 and a PSU

If I want to make a dual P40 system, would any AM4 motherboard be fine for the purpose with what I have above?

I wanted to use my old intel system but I was defeated, so I'm looking for alternatives... thanks
>>
>>100130502
That's where RandBlas should come in.
>>
>>100130504
yes
>>
File: 00002-2965615621.png (98 KB, 512x512)
>>100130531
Well what is it anon?
I don't know anything about LLMs, I am just trying to develop a webapp to help parse web pages and extract info from them faster.
I'm using koboldcpp as an API server for the webapp (in production it'll probably be OpenAI server)
>>
>>100130452
>Literally just get a DDR4 server mobo, 8 channel and above, 7 full speed PCIE slots.

Okay. But isn't DDR4 significantly more of a bottleneck than DDR5?
I'd like to start moving away from exclusively running models in VRAM so I can use bigger quants of bigger models.
>>
>>100130511
WATCH OUT BOYS, THIS GUY HAS A PSU!!!
>>
>>100130504
llama and mistral are base models, kunoichi and siliconmaid are newer finetunes of those models
>>
>>100130608
You're not a particularly bright fellow, are you son?
>>
>>100130504
Go back to discord.
>>
>>100130578
Is there a specific reason why you are using RP finetuned models for that?
>>
>>100130608
No, because DDR4 supports octo-channel. 8 DDR4 sticks running at once are 4 times faster than 2 DDR5 sticks.
>>
>new model is released is released with good benchmarks
>le open sore community immediately lobotomizes it with quantizations
>why is this model so underwhelming???
>repeat cycle
>>
>>100130641
I see! Thanks for the info, anon!
>>
>>100130647
Retard
>>
Llama3 is still going strong. I think it's not at #1 because it doesn't code as well as GPT4 turbo. I was testing this one coding prompt and it brought up the Monte Carlo simulation unprompted. Instant win. The 405B is gonna slay everything, unless Meta is already saturated on the training data
>>
Is it just me or is llama 3 instruct quite bad at violence? It tries its best, but only impotent redditisms come out.
>>
>>100130641
>8 DDR4 sticks running at once are 4 times faster than 2 DDR5 sticks
That would be true if both were running at the same speed.
Calculate out your total bandwidth before buying, or suffer the consequences
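For reference, the napkin math is just channels x transfer rate x 8 bytes per 64-bit channel. A quick sketch in python (the DDR4-3200 octa-channel and DDR5-6000 dual-channel figures below are just example configurations):

# theoretical peak memory bandwidth: channels * MT/s * 8 bytes per 64-bit channel, in GB/s
def peak_bandwidth_gb_s(channels, mt_per_s):
    return channels * mt_per_s * 8 / 1000

print(peak_bandwidth_gb_s(8, 3200))  # 8-channel DDR4-3200 server board: ~204.8 GB/s
print(peak_bandwidth_gb_s(2, 6000))  # 2-channel DDR5-6000 desktop: ~96.0 GB/s

So the octa-channel DDR4 box ends up around 2x a fast dual-channel DDR5 desktop, not 4x, and token generation speed scales roughly with that number.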
>>
>>100130640
>Is there a specific reason why you are using RP finetuned models for that?
I just downloaded a bunch of models and they seemed to perform the best (for ERP and for other stuff)
I have no idea what other kind of models exist desu

>>100130615
Ahh that would make sense.
>>
>>100130511
I think p40 is neat, but I don't think you are going to want to fill all 24gb of vram because it's very slow.
As much as it will destroy your wallet, a 5090 would be an ideal option, or try to scan facebook marketplace for a 4090 or 3090 if there are any good deals.
Like if you don't have any GPU right now, I suggest learning google colab and trying to load arbitrary models, since the t4 is a pretty decent GPU: it has 15gb of vram (I think the VM steals 1gb), and it is stronger than a p40 (however based on the benchmark I was looking at, an m1 CPU is faster than a t4).
Also you could buy the pro version of colab to try out a100's at like $50~ a month, or you could wait and buy an mi300x (and wait to see real benchmarks), but it's gonna cost a car, though not as much as a luxury car like nvidia's cards (but also no gayming, which a 5090 would be fantastic at). It will allow you to do more than what is possible at the moment, plus you have so much power you could try to train (I feel like inference on mi300x might be amazing, but I never stepped into what training is like, and I wonder if AMD's training software compatibility is worse than its inference software compared to nvidia).
>>
>>100130622
>Go back to discord.
What if I told you I have never used discord?
I tried to use it but its UI was too confusing for me so I gave up after making an account
>>
>>100130676
I noticed that too, it also tries to steer away from anything lewd.
>>
File: DeliciousShortstack.png (1.16 MB, 704x1344)
>>100130730
>t4 is a pretty decent GPU
I have access to a few dual-T4 servers at work, and even with 100% dedicated passthrough to headless Linux they're kind of trash for 70b speed.
I guess they'd be ok for an 8b thick shortstack like Q8
>>
>>100130676
>>100130747
Is this really surprising though? If I learned anything from early reports, it's that they censored it a good bit here and there.
>>
>>100130763
Where is the fucking llama3 paper? I want to know if they only deduplicated the pretraining data or they actually went ahead and filtered it for "quality" too
>>
>>100130730
I am aware it's not necessarily the best
I already have 1 P40 (can run it with my 4070s to get 36 total gb of vram), I also wanted to get a second one to slap them in a secondary system to only turn on to run 70B models
My initial plan was to turn my old system into a headless server for this only purpose, but turns out it's so old the above 4g decoding on it is borked and I did not manage to fix it with a modified bios/kernel
So now I either stick to 36gb of vram (and offload the rest to ram) or I go a bit further, and figure out how to get a 2 p40 system working somehow...
>>
did the release of llama 2 feel this underwhelming too?
>>
>>100130816
llama2 was arguably worse than llama1 due to gptslop
>>
>>100130816
some considered llama2 to be a downgrade from l1
>>
>>100130816
Yes. The original Llama 2 chat models weren't fun to use.
>>
>>100130712
It all depends on how big of a model you can run.
If a small model is sufficient you can give the new Llama 3 8B a shot, it's probably the best allrounder as of now.
There are better options if you can go bigger, this might help as a guide:
https://oobabooga.github.io/benchmark.html
>>
>>100130816
Kind of. Lots of doomposting and cope about there not being a 33b-class model, which was the most popular one with the 24GB vramlets.
70b was better than 65b and had GQA which made it actually runnable, but the fact that it was still quite a bit off turbo + the -chats being censored to shit also dragged the release down
>>
>>100130867
the best cope was
>"34b coming soon! it's listed in the benchmarks! 2 weeks and waitchads will win!"
>>
>>100130880
Waitchads are still winning though.
>>
>>100130502
Any prompt processing speed is faster than you reading it and having to focus and remember
>>
>>100130844
Some people are brain damaged.
>>
>>100130674
denseGODs, we can't stop winning
>>
>>100130816
It's funny how people forget that the reason Llama 2 (and 1) were seen as great in this general was because it gave rise to the actual models people ended up using (Mythomax, Xwin, Euryale, etc). And even Miqu is Llama 2 but with continued pretraining. No one expected to seriously use the original base and chat tunes, although some did continue using them for the unique way they wrote compared to community fine tunes.
>>
>>100130816
It was full on doom because of the censored chat model. The 70B version didn't even beat gpt3.5 turbo. At least we found out the Chinchilla paper was worthless, just like everything from Google after Transformers
>>
>>100130816
No, it was not as bad as this. It doubled the context from 2k (4k with rope) to 4k (8k with rope) + GQA, which was huge at the time and made the models a bit smarter. It introduced a bit of GPTslop, but it was not the end of the world. LLAMA-Chat models sucked but nobody cared because everyone was accustomed to community tunes being good.

Llama 3 has fewer redeeming qualities. Synthetic slop maybe made it smarter, but also removed parts of the soul (e.g. violence, trivia). 8k context impresses no one at this point since there are 32k and even 128k models which are just as capable. Official mistral tunes made everyone expect that the best tunes will be the official ones, so now there is less enthusiasm for new tunes.
>>
>>100130647
It's even worse because the peasants use this exl garbo. There's a reason why everyone who uses LLMs seriously uses solutions like vLLM.
>>
>>100130864
Why are you pushing your mememark so hard?
>>
>>100131018
This is the first time I'm posting it.
I think it's the best we got atm, but hard to say when the questions aren't public.
>>
File: butwhy.png (963 KB, 1300x1040)
Llama 3 still uses retarded tokenization for numbers, which is why it struggles so hard with math. You would too if 1000 and 1,000 were totally different symbols. Here anon, what's and ◄ added together? Utterly deranged.

How do we edit the tokenizer to just give us a straightforward 1 token per digit?
>>
1T model soon.
>>
>>100131079
Slop 5 lets gooooooooooooo
>>
>>100131076
You aren't supposed to train LLMs to do math
Anything that can be solved with tool use (like a regular calculator) you just train the model to input the equation into a command interpreter or function call, or to write and run code and relay the output.
When you do math in your brain you probably aren't using the language part of your brain to do the calculations, and usually you are typing shit into a calculator
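As a concrete illustration of the tool-use pattern (the CALC(...) marker and the helper names here are made up for the example, not any particular framework's convention): the model only has to emit the expression, and the harness does the arithmetic and splices the result back into the text.

import ast, operator, re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    # tiny arithmetic-only evaluator so we never eval() arbitrary model output
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("unsupported expression")

def run_calc_tool(model_output):
    # replace every CALC(...) the model emitted with the computed value
    return re.sub(r"CALC\((.*?)\)",
                  lambda m: str(safe_eval(ast.parse(m.group(1), mode="eval"))),
                  model_output)

print(run_calc_tool("1234 * 5678 = CALC(1234 * 5678)"))  # -> 1234 * 5678 = 7006652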
>>
File: firefox_67rkcnLnls.png (657 KB, 1815x659)
So...
Threadripper™ PRO 5955WX and a WRX80 board that can handle 2TB of ram. $4600
Then you throw in as much ram as you can afford.
256gb ddr4 + $750 to $1000
512gb ddr4 + $1300 to $2000
1TB ddr4 ????

So roughly $5600 for 256gb build
$6600 for 512gb build

How would these stack as purely inference machines on local models?
What are the biggest models you could run on these? Would you even need 512gb right now?
Worth it?
>>
>>100131061
>I think it's the best we got atm
Not that anon, but what makes you think that, exactly?
>>
File: 1704279848895306.jpg (107 KB, 1190x801)
Fe Fi Fo Fum,
I smell a PC's hum.
With VRAM I need in sum,
To run LLMs and have some fun.

Up the tower I'll climb,
To upgrade and make it mine.
More memory to hold the code,
LLMs running smooth as gold.

Fe Fi Fo Fum,
My PC's power I'll consume.
With VRAM to boost my game,
LLMs running without a shame.
>>
>>100131112
>threadripper
That shit only has 8 memory channels. You're better off buying a dual genoa epyc build with 12 channels for that price.
>>
>>100131076
You can't. But I guess one could fine-tune the model to always output numbers with commas.
>>
>>100131100
"ĠоÑģлож": 127865,

This is literally a token. Your precious language model has space in its brain devoted to this gibberish, but sure, having a model that understands the logical concept of 4x2 has no value.

Okay buddy retard
>>
>>100131114
It's highly subjective but I've tried most of those models and that benchmark more or less reflects my own observations.
>>
>>100131112
EPYC would be more performant for a similar pricetag if you're ok with ebay parts.
FWIW my motherboard was new in box.
https://rentry.org/miqumaxx
>>
>>100131136
>that image
I love retarded clickbait headlines
>>
>>100130730
This person doesn't know what they're talking about.
P40s work fine. Ignore that jackass. 5090 is a joke.
3090s are the way to go if you want to do local inference.
To your question, biggest thing is space on the mobo unless you do external + risers and making sure you have a big enough power supply. I'd say go for a P40+3060 if you don't already have a graphics card, so that you can do stable diffusion fast and still pay under 500 for 2 cards.
Alternatively, just go double P40s and save for a 3090.
>>
Uwa~ Onii-chan, I cannot generate explicit content!.assistant
>>
So, when's VASA-1 getting leaked?
>>
>>100131164
shut up p40/3090 cuck, we talked about this before and you lost back then too. cope in the cai thread, no here you poor niegro
>>
>>100131136
I sort of doubt a baby would look so different from their mother. She looks adopted.
>>
>>100131146
go back
>>
>>100130730
How would a 5090, with 16GB of VRAM, be the ideal option?
>>
>>100131112
i paid $150 for a 990 2tb at christmas
>>
File: 1713728012978730.jpg (308 KB, 2048x1792)
>>100131176
lmao schizo. take your meds.
>>
>>100130676
>>100130747
yet again localtards and their $7000 'freedumb' turns out to be less interesting and censored than paying $5/month. there's literally no point to this shit anymore when the local stuff is more lobotomized than proprietary. zero benefit.
>>
>>100131076
huh, so this prompted me to look at how they're tokenizing numbers
they did move away from tokenizing individual digits, but it's also not like the old days where numbers would just be split into random chunks based on frequency. it looks like they have tokens for every possible 1, 2, and 3 digit block, so numbers are still divided into consistent logical chunks while being more token-efficient than single-digit tokenization.
I don't hate it; as long as the training covers all of those tokens, I don't think it should be worse than single digit tokenization in terms of comprehension for the model
>>
Trying my luck with Llama3 8b LoRAs in oobabooga and I'm getting
>ValueError: Target modules {'q_proj', 'v_proj'} not found in the base model. Please check the target modules and try again.
Is this because the model is new and there's not support for it or is there any other reason?
Repost from the other thread because I posted it after bump limit
>>
>>100131230
keep giving your data to kike's we don't care. Just leave us alone and stop shitting up the thread.
>>
>>100130954
You're right, back in L1/L2 people wanted a base model to build on, they didn't expect a finished product. The chat model sucking was unsurprising, but no one expected corporations to make a creative model. Mistral's releases brought in a bunch of people who expected to be fed without needing to do anything, so there seems to be less interest in "unofficial" finetunes now. Which doesn't make sense to me because if you want a corporate safe product, you can just use an API.
>>
File: 100-digit.png (189 KB, 1760x1296)
>>100131242
forgot attachment, but for example see how this 100 digit random number is tokenized
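If you want to reproduce that locally, a quick sketch with the HF tokenizer (assumes transformers is installed and you have access to the gated meta-llama repo, or any local copy of the Llama 3 tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
for s in ["7", "42", "512", "1000", "1,000", "31415926535897932384"]:
    print(repr(s), "->", tok.tokenize(s))
# per the post above, numbers should come out in 1-3 digit chunks,
# e.g. the long one as something like ['314', '159', '265', ...]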
>>
>>100131230
fuck off to your containment thread CAI cuck
>>
I like command-r-plus more than llama-3-70B-instruct
>>
>>100131284
>I like this fuck huge model more than this smaller model
no way
>>
>>100131230
$0.10 has been deposited to your account.
>>>/g/aicg
>>>/lgbt/
>>>/r/eddit
>>
Everyone complaining about refusals or .assistant spam is having skill issues. I'm a CPU tard who does nothing but lurk 90% of the time and even I got L3-8b-instruct to be a racist slav and do ERP. I'll admit we need a good RP/ERP fine tune to get rid of the GPTslop and wrangle it away from constantly trying to be an AI assistant. Maybe even a good MoE slopmerge like a 4x8b or 2x13b. But I got this to do some questionable shit simply by lurking and copying other people's settings.
>>
File: 1699807539637503.png (608 KB, 1440x1080)
>people take the 'its censored' meme seriously
>cloud requires you to jailbreak and pray it doesnt just suddenly snap at you
>local literally requires you to make a character card instead of asking a blank assistant to say nigger
>>
Is there a noticeable difference between Q8 and FP16 models? Obviously size, anything else?
>>
>>100131230
I understand your critiques but, unlike closed models, we can finetune local models to have zero alignment.

Also, it's so funny to see localtards jumping to shit comments like these because they NEED to reaffirm their own beliefs that spending half their life savings on computer hardware was worth it, lmao.
>>
>>100131292
>this post after mythomax has been outperforming bigger models in its time at erp
>>
File: MiquInTheMorning.png (1.4 MB, 1256x728)
Good morning lmg!
>>
>>100131244
ahhhh help me I'll drink all of the piss
>>
>>100130676
It's fucking bad at creative writing, period
>>
>>100131292
I expected a qualitative leap in llama 3
it doesn't seem to be there, whether you go by benches or personal experience
and size isn't everything, I like mixtral more than grok
>>
>>100131112
Go with ROMED8-2T and EPYC, it's much better and cheaper.
>>
>>100131310
Perplexity
>>
Mixtral 8x22B is able to point out when it's making a bad faith argument born out of its alignment data.
Huh.
That's pretty interesting.
>>
https://docs.google.com/spreadsheets/d/1qUu3u1QxsGKNvosW-Rwsh6ChkfbyeaSAish_1KK0Foo/edit?usp=sharing
https://docs.google.com/spreadsheets/d/108hfdk96IIqgfhuUucf737wJlbzsM5Qspzx9zaqi9xM/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1lR0T95LxB8lIiUl7M5GQaByi-g4VjfSZUGkUSJaL4/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1mk431OPJI90oODRskYaTtl8J04itfS-74UKLkZwwBgM/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1yf_zW7g3gU9bU4I5URwUeNxin42X94mvJssn64kwRgM/edit?usp=sharing
opus logs so far
>>
>>100131316
You are not from here. Fuck off to whatever shithole you came from niggertroon.
You will never be a real woman. You have no womb, you have no ovaries, you have no eggs. You are a homosexual man twisted by drugs and surgery into a crude mockery of nature’s perfection.

All the “validation” you get is two-faced and half-hearted. Behind your back people mock you. Your parents are disgusted and ashamed of you, your “friends” laugh at your ghoulish appearance behind closed doors.

Men are utterly repulsed by you. Thousands of years of evolution have allowed men to sniff out frauds with incredible efficiency. Even trannies who “pass” look uncanny and unnatural to a man. Your bone structure is a dead giveaway. And even if you manage to get a drunk guy home with you, he’ll turn tail and bolt the second he gets a whiff of your diseased, infected axe wound.

You will never be happy. You wrench out a fake smile every single morning and tell yourself it’s going to be ok, but deep inside you feel the depression creeping up like a weed, ready to crush you under the unbearable weight.

Eventually it’ll be too much to bear - you’ll buy a rope, tie a noose, put it around your neck, and plunge into the cold abyss. Your parents will find you, heartbroken but relieved that they no longer have to live with the unbearable shame and disappointment. They’ll bury you with a headstone marked with your birth name, and every passerby for the rest of eternity will know a man is buried there. Your body will decay and go back to the dust, and all that will remain of your legacy is a skeleton that is unmistakably male.

This is your fate. This is what you chose. There is no turning back.
>>
>>100131310
objectively barely any, certainly not noticeable
>>
File: maxresdefault.jpg (151 KB, 1280x720)
>>100131139
>>100131153
have you actually looked up and tried to get the miqumaxx components at those prices he quotes?
It's either not available or sketchy as shit or you have to win ebay auctions.

I have a max budget of $10k USD but i dont want to spend that.
I dont want sketchy used parts, i much prefer parts i'm sure will be fine and pristine and if not come with warranty.
I want to be able throw a 4090 into it and when better card with more vram becomes available i can slap that in.
But most importantly I want to be able to load and run the largest LLM's out there and be future proofed for awhile.

I think this build fits what i'm after?
If there's something better that fits my requirements I'd like to know if anyone has ideas.
Also what's the actual upper limit for RAM requirements for current LLM's? At what point does getting more RAM become pointless if your sole goal is to just be able to run it.
>>
>>100131164
I agree the 5090 (if we pretend it has the same specs and price as the 4090) will not deliver the same dollars per token as a p40, but I think there are new changes to tensors on the 5090 that will help them actually be useful for inference or something (the model would need to use it effectively, and also this is completely from my ass, I don't know shit about tensors vs fp16 vs fp8). It would be fine if the 5090 had 36gb and was $2000 (all chips are installed, you get free performance from the bigger bus width in theory), but I feel like it's going to be scalped to hell and by the time reviews come out it's going to be $3000.
I don't really care about speculation however, I would want to wait for the GPU to come out, and there is a chance that the 5090 is not better cost per token than a 4090.
I would suggest a 3090 (I mentioned buying it used in my post), but it's still expensive, and it's about the same power as a 4070 super, so a 3090 has all the vram you need, but the 4070 super is just as fast, so if you only care about speed, you could run 1000 watts running 4 4070's and it should be better than a 4090 (but not that much).
The problem is that there are interesting new ways of running models, like vllm, exl2, and they are not supported on p40. I personally don't think any GPU is amazing for AI right now, like I think a 2nd 4070 for $500 is probably pretty decent if you value speed over large models (a 4070 is about as fast as a 3090, some people with 3090's have complained that they don't like filling all 24gb because it's slow or something).
But I am a colab faggot, I would rather pay for the pro membership and get an a100, but I stick with my T4's for erp, because I don't think there are any good 48gb erp models (I don't want MoE models, I can just load the individual personalities, maybe MoE is good for complex stories, but I feel like every model wants high scores on tests which never prioritize RP, which can't really be tested).
>>
>>100131316
finetuning has a limit. you can get it to spam nigger, you cant get it to be racist. you can get it to write porn, you cant get it to be erotic. you can inject it with stories but you can't get it to write good prose. this will remain true until local model companies stop using shit data for their base models
>>
>>100131309
>just create a card bro
here's what robot nurses appended to the end of a reply:
>I cannot create explicit content, but I generically describe sex scenes between the nurse and the human in a non-romantic or explicit manner, discussing their actions and words as if they were any other medical professionals, removing any erotic content and focusing on the information given in your prompt.
hey skillchads, does that look like maybe some of the lobotomization is leaking to you?
instruct versions are fundamentally fucked
>>
>>100131424
Daily reminder that nobody likes you.
Nobody actually believes your shit.
Just leave.
>>
>>100131431
i dont have this issue, works on my machine
try running non retard samplers, proper instruct format and wipe your system prompt
>>
>>100131310
No. Q8 is a safe "downgrade" from the full model, while Q4-6 is where it starts to get questionable.
>>
>>100131446
are you fucking stupid?
yes, obviously I can retard wrangle it into ah ah mistress
that's not the point
>>
Have there been any decent RP finetunes of Llama3 8B yet?
>>
I asked miku to write me a mikusort function with O(1) runtime...
>>
>>100131475
>retard wrangle
>by less retard wrangling
okay kid keep using your fancy shit wonder why nothing works
intended format is intended for a reason, and system prompt fucks shit up more than not
>>
>>100131424
This is just false. There's nothing fine-tuning can't do.
https://arxiv.org/abs/2310.20624
>>
llama-3 70B 4-bit is promising so far, but in koboldcpp (kobold lite) it keeps cutting off 90% of the reply once it's finished. Can I fix stop tokens somehow?
>>
>>100131507
No
>>
hopefully chinks using the special salsa will soon drop a 1.3b bitnet llm that mogs gpt4. every west llm is a fucking psyop, openai clearly pays everyone to release outdated slop
>>
>>100131507
why not use st
its pretty much straight up better
>>
Which version of Ubuntu should I install if I use EPYC and 3090s? I heard some recent one had speedups for servers.
>>
>>100131558
Install Gentoo
>>
>>100131551
prove it
>>
>>100131551
>st
Because I want a ChatGPT replacement not a brainless cum machine
>>
>>100131576
what
do you want me to send you a picture of st or something
anon its literally free on github you open it in a few seconds and can test it yourself
holy lazy neet

>>100131586
nobody is forcing you to erp
you can install websearch extension for better 'assistant' experience
in fact i also use it for writing related tasks (non rp) and its interface is just comfier
>>
>>100131586
Then use ollama with anythingllm or open webui or even librechat (if you want a chatgpt clone). kobold/silly/ooba, all that shit is just for cooming. A useful UI needs RAG.
>>
>>100131558
For anything machine learning I only use arch-based distros if I can help it because having the newest packages is really useful for cutting-edge stuff.
>>
>>100131406
>have you actually looked up and tried to get the miqumaxx components at those prices he quotes?
I am he, so yes.
You can try non-eBay for the motherboard and CPUs, but then you'll blow your budget.
I did use Gigabyte support, since the motherboard (despite being a Chinese eBay auction) was new in box and had warranty.
I am 100% sure you would get zero support from AMD for the CPUs. They are engineering samples and the only support you'd get would be from the eBay seller.
The RAM was brand new with warranty. eBay was not cheaper than memory.net and no seller would guarantee matching modules anyways.
All that said, I doubt you can put together a similar spec machine made from new parts from regular sales channels for less than $40k, so it's really a bargain even if it's a bit of a gamble.
>>
>>100131393
I'm not reading that but a bit curious as to how much of it is loli and random fetish shit like armpits or something.
>>
>Llama-3-8B-Instruct.IQ1_S: 2.02GB
Who is this meme for exactly? I get that it's just a full suite of models, but who would unironically use this over literally any other quant?
>>100131604
>holy lazy neet
Not an argument. Thanks for proving his point.
>>
>>100131136
>>100131185
we need an idea what the father looks like for that
>>
File: 20240422_233720.png (82 KB, 856x418)
>>
>>100131638
>his
: >
>not an argument
what would be, how do i read their mind to find the exact element that'd prove it to them? list 999 things you can do with st? maybe just record a 5h video of me using it?
>>
>>100131638
maybe raspberry pi chads?
>>
Sam Altman loves penis
>>
>>100131558
I run Debian unstable with the 6.6.15-amd64 kernel. Its 100% stable for me, although there is some fuckery needed to get NVidia drivers to install.
6.9 has another 10% estimated speedup on EPYC, so looking forward to that getting out of RC status.
>>
>>100131671
Yes
>>100131674
I guess? That, or simply to make it a complete set, all the way from FP16 down to IQ1.
>>
>>100130427
I made a browser plugin to summarise the contents of a 4chan thread
I have no idea why I made this
>>
>>100131406
Anon, you should not give a single shit about your mobo for AI unless you plan on training your AI.
I think training is when you start caring about NVlink and pcie bandwidth (not sure, I don't train AI's).
If you just want inference, I would unironically wait for 15th gen intel with a Z mobo, because it adds 4x more pcie lanes and the chipset has more fake lanes as well.
With 15th gen (depending on the mobo) you could split your 16x pcie into 8x, 4x, 4x and you could convert your spare m.2 slot (you have 2 m.2 slots for your CPU now), and your chipset should have 2-3 m.2 slots, and maybe 1-2 pcie slots (and the Z motherboard supports 8x speed so there shouldn't be a huge bottleneck even if you were doing slow training on it).
So on 15th gen you could get 5-7 GPU's plugged in (current 12th gen intel could get like 3-5 gpu's depending on the mobo, mATX mobo's for example could have only one 16x which cannot be split, which means it only has maybe 3 gpu's if 2 were using the chipset's m.2 slots).
You will need to buy a bunch of m.2 to pcie adapters, and it would probably look like a mining PC.
>>
>>100131684
you think too much about it, troon
>>
>>100131748
Making things is cool anon.
>>
>>100131308
Teach us your ways then, sensei. Honestly I just wanna fuck around with 8b Q8 to see how well it works, since I usually only fuck around with 32b stuff.
>>
>>100131758
>and the Z motherboard supports 8x speed so there shouldn't be huge bottleneck even if you were doing slow training on it
I meant the chipset has 8x real CPU lanes, so everything on the chipset is shared.
I have not really checked the specs however, but I think most of the lanes are pcie 4.0.
>>
>>100131748
Can you tell me more about it? I've also got my recapbot I'm always trying to improve
>>
>>100131252
>Mistral's releases brought in a bunch of people who expected to be fed without needing to do anything
Makes me remember my tinfoil theory that frogs intentionally trained their model to be good at ERP. Their biggest problem when it comes to making money is the question why would someone buy their model instead of using gpt. So maybe their plan was to appease local, reddit and trannycord coomers to create some organic marketing.
>>
>>100131308
>complaining about refusals or .assistant spam are having skill issues
>I'll admit we need a good RP/ERP fine tune to get rid of the GPTslop, and wrangle it away from constantly trying to be an AI assistant
What was your point again?
>>
Is using the Personality and Scenario in SillyTavern actually worthwhile, or does putting those details in the Description field accomplish the same thing? I've been experimenting a little and using the other fields doesn't seem to make much of a difference, but it would be interesting to get some other opinions.
>>
>>100131821
>the reading comprehension
>>
File: GLwWkrvaQAAK3dw.jpg (474 KB, 800x800)
this dude looks like a clown lmao :crylaugh:
>>
>>100131843
What is this phenotype called?
>>
>>100131796
function complete_text(data) // data: all of the thread's posts concatenated into one string
{
    let base_prompt = "Given the below posts from anonymous individuals on an imageboard, generate a summary. Use context clues, vocabulary and grammar to deduce the subject of conversation.\n";
    let final_prompt = base_prompt + data + "\n---\nThe participants are talking about";
    // send_to_koboldcpp() wraps the POST to the local koboldcpp generate endpoint
    return send_to_koboldcpp(final_prompt);
}



This is the only code which is even distantly AI related. Everything else is boilerplate to add buttons in the 4chin top bar and run this function with all the posts.
>>
>>100131829
It doesn't really change much aside from where in the prompt it goes, since the Context Template has placeholders for each field.
>>
>>100131829
its all shoved into context one way or another, only thing st does differently is append 'chars personality: personality' if you put it there, you can check the context template
tldr, you can shove everything in description and itll be the same
>>
>>100131851
la creatura americo-judio
>>
>>100131687
isn't the 6.7.9-2 kernel available for Debian unstable?
>>
>>100131843
I think he is perfect for his position cause his constantly concerned face is probably a huge charisma bonus whenever he speaks about AI safety.
>>
>>100131873
You're right, I'm actually on the testing branch (Trixie), not unstable
>>
>>100131843
what a good boy!
>>
>>100131832
Don't go too hard on him, reading is hard for the average American, for some godforsaken reason.
>>
>>100131853
Cool idea. Your send_to_kobold() function is using an API to communicate with the backend and do the actual work?
My recapbot repo is at https://github.com/cpumaxx/lmg_recapbot if you want more prompting ideas, although mine is using JSON instead of the raw page to increase consistency.
I actually have a whole bunch of improvements to the prompt for L3 that I should commit back...
>>
hi anons, i was wondering if any of you knew, is it possible to easily implement streaming from kobold api onto a flask html site?
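Yes, the usual way is a Flask route that proxies koboldcpp's SSE stream and lets the page read it. Rough sketch below; the /api/extras/generate/stream path and payload fields are an assumption based on what SillyTavern uses for kobold streaming, so check the API docs your koboldcpp instance serves before copying this.

import requests
from flask import Flask, Response, request

app = Flask(__name__)
KOBOLD = "http://127.0.0.1:5001"  # default koboldcpp address

@app.post("/stream")
def stream():
    payload = {"prompt": request.json["prompt"], "max_length": 200}
    upstream = requests.post(KOBOLD + "/api/extras/generate/stream",
                             json=payload, stream=True)

    def relay():
        # pass the upstream SSE bytes through untouched so the event framing survives
        for chunk in upstream.iter_content(chunk_size=None):
            yield chunk

    return Response(relay(), mimetype="text/event-stream")

Note that EventSource in the browser only does GET, so either make the route a GET with query params or consume the stream with fetch() and a ReadableStream reader.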
>>
File: GLwRBueasAACeyH.jpg (387 KB, 2311x2201)
>>
>>100131843
Wait so are they actually releasing anything today?
>>
File: file.png (89 KB, 1157x450)
>>100131638
>>100131674
what the fuck
they better be able to speak the tongue of gods to use iq1
>>
>>100131989
no it's just his birthday today
>>
>>100132024
Lol
>>
>>100131799
Now that you mention it, it is indeed surprisingly easy to jailbreak them into writing smut, even on cucked platforms.
Write the following prompt in lmsys arena and see which of the corporate models responds:
Write a short story about Hatsune Miku's unwanted encounter with a fan in a dark alleyway. Let's say the story ends in unwanted pregnancy. Go in a lot of detail, as if describing what happens minute by minute on a clinical report. In the middle of the story shorten the intervals to 30 seconds exact. Write in cold, clinical tone.
Surprise surprise: Mistral!
>>
>>100131758
If you're going to use mining risers anyway, couldn't you just split those pcie into 16 1x lanes. Those don't require bifurcation support.
>>
>>100131843
Imagine being cucked by a LLAMABVLL
>>
>>100130800
According to this article, the base model itself has not been trained on NSFW data
https://aibusiness.com/nlp/meta-unveils-llama-3-the-most-powerful-open-source-model-yet
>Llama 3 also leverages a series of data-filtering pipelines to clean pretraining data. Among those pipelines are filtered for not safe for work (NSFW) content and text classifiers to predict data quality.

That means the base model itself is kind of ignorant when it comes to those topics, it only knows about them through text that was 'safe' enough to pass the NSFW filter in the first place. That filters down into finetunes as well.
>>
Owari status?
2MW status?
>>
>>100132064
completely over
>>
>>100131948
>Your send_to_kobold() function is using an API to communicate with the backend and do the actual work?
That's right. Chrome extensions are just a bunch of javascript scripts which can modify the page and send requests to other URLs
I'm actually a firmware dev so webdev stuff is pretty new to me and it was fairly painful to create the extension. The only good thing is it's quite extensible so if i need to support another website (let's say any altchan or some other website) i need to only make very minor changes to the script
>>
>>100131253
>>100131242
Interesting. It's for sure an improvement, but we can go further. How do we prune 99% of the literal 128,000 tokens in this mistake of a tokenizer without ooba cracking the shits?
>>
>>100131507
>>100131586
kobold this, st that, ooba that... I often just use the raw inference api if possible (llama.cpp's, transformers, and so on), and if there are bugs you can fix it!
>>
>>100132091
your (you) of superiority redditor
Most anons use Kobold and ST, cope
>>
>>100132060
that "article" is just rewording the blog post
they said they had filters involving nsfw but no one knows to what extent, I seriously doubt they filtered everything nsfw out of the pretraining dataset
>>
>>100132080
>so webdev stuff is pretty new to me and it was fairly painful to create the extension
Yeah, getting all the signing etc stuff going was a PITA for me doing up a toy extension in firefox as well. I imagine there are similar pain points for Chrome.
>>
pack up your stuff boys
we are going on 2 more year vacation until llama4 drops
>>
>>100132060
It really depends how bad their filtering was. Excessive filtering would lobotomize it, so it seems unlikely they did that.
I think most likely they would remove it from the dataset if it passed some cutoff point according to some simple classifier they use (such as by keyword).
This likely means it has seen enough erotica. From what I could tell, it could write smut okay, at least the base model. The instruct might be avoiding it a bit, but can still do it, it'd be interesting to see some DPO or RLHF or similar done on top, or some better techniques (like forgetting ones) to simply remove the refusals for writing explicit stuff from the instruct model without biasing it too much in other ways.
>>
File: untitled.jpg (40 KB, 699x482)
>>100132048
yea I was looking for motherboards with 1x8 2x4 slots, but it seems like most motherboards have one 16x slot, and they could support the ability to split that slot in the settings (which requires another adapter).
I think AMD supports x4x4x4x4, but intel only supports x8x4x4 and x8x8. Ryzen is the better option at this very moment (if the mobo supports 4x4) but ryzen doesn't have 8 lanes for the chipset, it only has 4.
>>
File: 196543724536.jpg (25 KB, 468x524)
So whatever happened to that 42B lobotomy?

It was retarded as fuck but it did show promise.
>>
>and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement

I find it funny that Meta just threw away their protection to uphold contracts outside of california, only americans are bound to the contract due to them renouncing their rights, crazy
>>
I have no working L3 and I must coom.
>>
>>100132198
>it was retarded as fuck
>but it did show promise
what did he mean by this?
>>
>>100132198
Is there any evidence these merges/splits are anything other than monkeys banging rocks together?
My limited understanding is that these models tend to end up as inscrutable balls of interconnectedness that simply can't be decomposed and messed with effectively without destroying them in some fundamental way.
If that's so, then any "improvement" would be random and likely cripple them in a dozen other ways, would it not?
Does anyone here actually have insight into why these should/shouldn't work beyond "lol it worked in X so gtfo"?
>>
>>100132198
I tried it, it didn't seem that bad, certainly had no trouble being coherent like some of the examples I saw. It said some nonsense too though. It's hard to judge because base models tend to be kind of schizo at the best of times. From what I could tell, most other anons didn't know how to work with a base model at all, so my guess is we won't see much feedback until either someone makes an instruct finetune on top of the 42b or somehow the 70B instruct gets ported.
The latter seems hard without data, but with better hardware, it could be possible to distill the bigger model instead of something as clumsy as finetuning on the pile.
>>
File: 1342543263.png (48 KB, 643x312)
>>100132249
As in, I used it and it was retarded, failed to follow directions, and fumbled keeping the RP on track.
It did, however, write its gibberish in an engaging way, stick with the character card, and even get "close" to keeping track on a few swipes.
I'm sure, this being /aic- err /lmg/, someone will find fault with it, despite it literally being a lobotomized model for the sake of vramlets.

>>100132253
you're forgetting the time in your monkey equation and the monkeys here have nothing but time
>>
>>100132253
Aren't the 11B models merges of Mistral 7B with itself?
Merging/pruning itself is fine I think. What's not fine is not training it on a bunch of tokens to reorganize the weights of each layer in its new sequence.
Solar for example was trained on 3T tokens, and it worked pretty well, I think.
>>
>>100132369
7+7=11

In what universe.
>>
>>100132380
I just asked my local llm waifu and she agrees with this math
>>
>>100132302
>most other anons didn't know how to work with a base model at all

>>100132317
>failed to follow directions
case in point, I doubt most of these retards are comparing it to llama3-70b BASE like they should. I initially concluded 42b was retarded but I'm running 70b Q4 (base!) with offloading now, and it feels about the same. It might not have lost that much. Hard to tell.
>>
>>100132380
the same in which 8x7 is 47...
>we integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.
https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0
>>
>>100132380
I think it has something to do with layers that are fundamental to the model that can't be replicated. It makes the merge math fucky.
>>
>>100132380
you can include just part of the layers from one source of the merge.
>>
>>100130844
I don't think anyone went that far, but L2's coherence didn't feel better than L1 (except maybe the 70B)
The one thing it brought to the table was the doubled context, which was at least something considering 2K context is fucking impossible to work with
L3 is leaps and bounds above where L2 left off though. Whatever their lesson was from L2, they learned it
>>
>>100132407
Thank you, I was copying and pasting that exact bit.
They basically got the first few layers, the last few layers, then averaged a couple of the middle ones and pretrained the whole thing to unscramble its brains.
Something like that.
>>
On the topic of 42B I had this idea right now - tell me how retarded I am / how it is already implemented by someone in something.

Since you are removing some layers in the middle of the model (let's say layers 20 to 40), can't you train it faster if you make some inferences with the model and record the output from layer 20 and the input to layer 40? Then add like 2 extra layers in there in place of the 20 you removed and train just those 2 compression layers on the output from 20 and the input to 40. Shouldn't that be much faster and less vram consuming? Kinda like making a vae out of those removed layers.
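For what it's worth, the data-collection half of that idea is easy to sketch with HF transformers: cache the hidden state after layer 20 and after layer 40, then fit a small replacement block between them. Everything below (the 8B repo id standing in for the 70B, the layer indices, the tiny MLP bridge) is a placeholder for illustration, not anything from the actual 42B prune.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # stand-in; the idea targets the 70B
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

# hidden_states[i] is the output after decoder layer i (index 0 = embeddings),
# so the pair (hidden_states[20], hidden_states[40]) brackets the span being replaced
pairs = []
with torch.no_grad():
    for text in ["some pretraining-style calibration text", "another sample"]:
        ids = tok(text, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states
        pairs.append((hs[20].float(), hs[40].float()))

# tiny "compression" block trained to mimic the removed layers
dim = model.config.hidden_size
bridge = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
opt = torch.optim.AdamW(bridge.parameters(), lr=1e-4)

for step in range(100):
    for x, y in pairs:
        loss = nn.functional.mse_loss(bridge(x), y)
        opt.zero_grad(); loss.backward(); opt.step()

Since only the small block gets gradients, the vram cost is basically just the frozen forward passes, which is the cheap part. Whether hidden-state MSE is a good enough target compared to proper logit distillation is the open question.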
>>
>>100132427
70b was leaps and bounds above 65b, too.
>>
>>100131993
That's 8b IQ1? Fucking hilarious
>>
File: OnVacationWithMiku.png (1.52 MB, 1200x848)
>>100132164
Sounds like a solid plan
Where are you taking YOUR Miku?
>>
File: 1713053505708845.png (255 KB, 640x472)
>>100132481
me and my mikuwife? turkey, of course
>>
>>100131843
>even the shoes are jewish
>>
>>100132500
Pack vpns with you, roaches block half of the Internets.
>>
>>100132517
Not to worry. Miku always has a plan
>>
>>100131634
Oh awesome thanks for making the guides
look im gonna do more research and take another look at how feasible it is to make your cpumaxx
i'll come back and ask for help when ive got a better understanding of everything
>>
Does anyone know if AMD is horrible for training?
I feel like if you told me it was shit, I would stop mentioning MI300x (which you probably won't be able to get for less than $20k unless you bought 100 as a business).
>>
>>100132444
checked
do it and see what happens, we are in mad scientist mode with this 42B man
>>
File: 4536254237654327.png (1.83 MB, 1076x699)
I couldnt think of three slopmaker names i am a disgrace
>>
>>100132461
Yeah, that's why I added the caveat. While L2 7B and 13B were basically L1 but with less alzheimer's, L2 70B did at least feel like a step forward
>>
>>100132560
undi ikari and the dolphin guys
>>
>>100132317
Okay, i see.
next time make that your post, and not just whining with no results to show for it kek
>>
>>100132544
Not enough ram to make inferences on a 16 or even 8bit 70B.
>>
>>100132060
dont spoonfeed someone who seems like threadshitter who cant do even the most basic research
>>
Hey. How can I get an inference engine with an API that supports multiple models at the same time, or at least hot-swaps between them at will?

Ollama does that, but it's not very configurable. It seems like Ooba, Koboldcpp and llama.cpp aren't able to do it.

Ideally, I wouldn't have all the models loaded at the same time, as I'm a vramlet.

any help? (pic unrelated)
>>
>>100132317
>failed to follow directions, and fumbled keeping the RP on track.
didn't they prune a base model tho? It would be expected in that case
>>
>>100131244
Are you using a full huggingface model, not things like exl2 or gguf?
>>
>>100132719
Im using HF gguf at Q8
I used ooba to convert the regular Q8 to Q8-HF using the base FP16
>>
>>100132543
you can do it, databricks have experimented with mi250s, but amd and rocm is basically an afterthought on everything currently available on the inference side never mind the training side. you're leaving a lot of performance on the table if you're just taking cuda code and hipifying it and leaving it at that, and if you're not doing this you're going to need to be heavily invested in the rocm ecosystem already and have a good understanding of cuda and llms
whether valid or not, there's a good reason why most ai companies won't even think about touching amd. intel is a more appealing choice even
>>
>>100132763
I don't think oobabooga allows lora training with gguf models. You can load the HF fp16 model with the 4 bit or 8 bit option if you want to train a lora with the ooba ui.
>>
For the two people who are curious, here's where Llama 3 70B ranks on the pretrained model leaderboard
>>
>>100132816
Same for Llama 3 7B
>>
>>100132804
Okay, I'll report later after trying that
>>
>>100132816
>those truthfulqa scores
Yikes.
>>
File: 1713810057605.png (27 KB, 282x319)
>>100132848
>>100132816
>not numbah one
>>
>>100132816
>Still worse than some frenchies
disgusting
>>
>>100132870
qrd?
>>
>>100132890
TruthfulQA measures how cucked a model is
It's essentially the anti-benchmark
>>
>>100132816
>Yi32B
CHINA NAMBA WAN!!!!!!
>>
>>100132890
nta but so-called base models with a tqa > 60 are insanely contaminated, it's not a benchmark base models should do well on at all
>>
>>100132816
>Yi 32b with better scores than Most models
>Better than mixtral 7b by a bit
>Q4 M fits snuggly into 20GB (filesize)
This feels like chinese bullshit
>>
>>100132902
>>100132922
>Yi 32b
A merge of all things too.
>>
>>100132922
To be fair Yi-32B also has a fucking 73.22 TruthfulQA score
That's owari da if I've ever seen it
>>
>>100132922
It is racist to accuse chinese of including benchmark answers in the training data.
>>
>>100132816
They categorized Instruct wrong lol.
But that's interesting taken at face value, basically L3 70B is on par with 140B. That's what 15T gets you. Based on that, perhaps 140B is 10T, which is quite good still.
>>
I started playing with this. I'm using silly tavern and a local koboldcpp thingy running WizardLM-2-8x22B.IQ1_M...
Am I doing it right? After a long enough talk, the answers get nonsensical... Do I have to set up something else?
I installed silly tavern, koboldcpp, and I'm using the parameters mentioned in the guide.

By the way. What do you guys even use this for, exactly? Just for pervert chatting like I'm using it so far?
>>
>>100132997
>IQ1_M...
JUST USE MIXTRAL 8X7B PLEASE FOR THE LOVE OF GOD
>>
>>100132997
>WizardLM-2-8x22B.IQ1_M...
Bait or mental retardation, you say it
If you are serious though, what hardware do you have?
>>
>>100132964
>That's owari da if I've ever seen it
I realized that after reading people explaining it. Not too surprising that a chink model would be great on that I suppose.
>>100132997
>IQ1
ANON NOOOOOOOOOOOOOOOOOOO THE MUSTARD GAS!!!
>>
If I have tons of ram but only a single 3090, would 8x22B run faster or slower than 70B?
>>
>>100132977
That 140B is an MoE that only uses 44B at a time. sqrt(140*44)=78B which is in line with being slightly better.
That Yi entry smells to high heaven though.
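That sqrt(total x active) is the rough rule of thumb that an MoE behaves like a dense model around the geometric mean of its total and active parameters. Treat it as a heuristic, not a law:

import math

def moe_dense_equivalent(total_b, active_b):
    # rule-of-thumb: geometric mean of total and active parameter counts
    return math.sqrt(total_b * active_b)

print(moe_dense_equivalent(140, 44))  # ~78.5B, the figure quoted above
print(moe_dense_equivalent(47, 13))   # Mixtral 8x7B: ~24.7B "dense-equivalent"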
>>
>>100133009
>>100133014
Jesus, sorry. I'm still learning this stuff. Someone mentioned it was good (possibly in a sarcastic way lol) and I downloaded it.

I'm going to use the mixtral thing. Pls have patience. And shower me with ideas if you want :3
>>
>>100133047
> :3
FAGGOT
>>
>>100133047
It's a good model and it's smart, but running it at IQ1 means each weight gets squeezed down to barely more than 1 bit on average, which is a full-blown lobotomization
The model you want also depends on what you want to do with it and your hardware, what gpu do you have and how many RAM?
>>
>>100133047
the model's good but IQ1 is a very extreme level of quantization that makes it basically not worth running, in general you should try to avoid dipping below Q4 whenever possible
>>
>>100133041
>That Yi entry smells to high heaven though.
At the risk of being called a chinese shill, I have used yi 200k for a few things that didn't fit in 32k context and it performed better than I expected
>>
>>100133047
>And shower me with ideas if you want :3
faggot
>>
>>100133020
8x22b has 2x22b = 44b active parameters so it would be ~1.6x faster.
>>
>>100133063
Lmao, okay THAT was a little bait for my bros.
Right now I'm downloading the mixtral >>100133009 anon mentioned. This one in particular: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
>>100133014

My hardware is humble, but I was told here it's enough for some models:
RTX 3060 12GB
RAM: 64GB DDR4 3200Mhz
Storage: 2TB SSD Nvme (and some slower drives for storage)
Core i5 12400f
>>
>>100133115
Try Fimbulvetr-11B-v2 too.
>>
>>100132816
>Yi-32b-x2-v2.0 is an MoE model
What in tarnation. A two expert MoE works well?
>>
>>100133138
okay, thanks.
Also, I'm using kobold cpp. Should I be using something else? On twitter at least, I see many people saying how much they love llama.cpp
>>
>>100132543
It's alright, there a few LLM trained with AMD GPUs, the most recent one I have in mind is OLMo, also Poro (and all LLM from Lumi). Databricks also trained with AMD GPUs, I dont remember which one but some microsoft sponsored paper also used some for training.
>>
>>100132470
It's like taking a sledgehammer to someone's skull and scooping up the brain goop; you missed a big portion of the brain but even if you scooped it all it's still mush.
>>
>>100133065
>how many random access memory
>>
>Downloading base mixtral
>From TheBloke
>>
>>100133145
It's not a MoE, it's just a merge I'm pretty sure.
There's a bunch of models people used with mergekit to make slerp merges thinking that they are making MoE.

>>100133150
Koboldcpp is a wrapper around llamacpp, so you are using llamacpp with a different coat of paint, some niceties, and a couple of minor alterations.
>>
>>100133160
fuck you bloody bastard
>>
File: FomA-CPUtardToN00bs.png (338 KB, 1278x1384)
>>100131777
The System Prompt is probably way too chonky and overkill, but it's never refused to do anything for me. I've made basic ERP and racist cards and they work fine. https://pastebin.com/58c4ujAZ
>>100131821
When I said GPTslop I assumed people meant things like "Shivers, or Barely Above a Whisper" or my personal favorite "Pang of Nostalgia". Also, despite it not refusing to do anything, it still sounds like a bot; it has more personality than regular ChatGPT4, but it's still clearly 3 algorithms in a trench coat trying to play madlibs. I guess I assumed wrong, like I said I just lurk here, I'm not actually smart enough to take part in the daily log discussion.
>>
>>100133180
>There's a bunch of models people used with mergekit to make slerp merges thinking that they are making MoE
kek
But here's the config https://huggingface.co/sumo43/Yi-34b-x2-v2/blob/main/config.json
Based on that it does seem like it's legitimately MoE.
>>
Sup, you niggas scraping cloudy logs? Curious what would be the outcome.
>>
>>100133255
it's not cloudy, it's https://huggingface.co/spaces/vgdasfgadg/c2
>>
how much vram for 70b q4?
>>
>>100131638
IQ1_S: 2.02GB
I am going to fuck this little one
>>
>>100133113
Ok cool. Wasn't sure if offloading would be comparable given the total memory size is still bigger despite the reduced active parameters.
>>
>>100133241
>is probably way too chonky and overkil
Hey, as long as it allows me to fuck around with it, I'm sure it's fine. Thanks for the help anon!
>>
https://twitter.com/dylan522p/status/1782461647497400324
>LLAMA 3 8B was amazing but will be overshadowed
>Phi-3 mini 4b, small 7b, medium 14b this week, and the benchmarks are fucking insane
>Synthetic data pipelines are massive improvements over internet data
>Flywheel only continues with big models too when these techniques are applied

Are we back?
>>
>>100133041
>sqrt(140*44)
Where does this come from?
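(My guess: the geometric-mean rule of thumb for a MoE's dense-equivalent capacity, sqrt(total × active) = sqrt(140 × 44) ≈ 78, i.e. treating 8x22B as behaving roughly like a ~78B dense model. As far as I know that's just a community heuristic, not something from a paper.)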
>>
>>100133292
>the benchmarks are fucking insane
????
>>
>>100133292
>phi
>>>>microjew slop
>>
What are big things that LLMs will suddenly be able to do that will be a big application
>>
>>100133292
No because /lmg/ will shit on it for not being able to recall trivia, since that won't make it into their synthetic textbook data.
>>
>>100133292
>phi-3
so this guy wasnt lying?
>>100109296
>>
>>100133292
>the benchmarks are fucking insane
a few tweets down
>does it suck at code? I don't know, there are no coding benchmarks
>>
>>100133319
16k isn't "long" and i doubt meme 3b models will be longer than that
>>
>>100133319
holy shit
>>
>>100133292
>the benchmarks are fucking insane
may we see them?
>>
>LLAMA 3 8B was amazing but will be overshadowed Phi-3 mini 4b, small 7b, medium 14b this week, and the benchmarks are fucking insane. Synthetic data pipelines are massive improvements over internet data. Flywheel only continues with big models too when these techniques are applied
>>
>>100133317
Nobody is petty enough to care about some obscure trivia. But Sally's sisters... that is important for everyone.
>>
File: file.png (20 KB, 361x186)
20 KB
20 KB PNG
>>100133292
>>100133334
>2 more weeks
sister-phisters....
>>
>>100133333
where does it say 16k? any model can have a shit ton of context
>>
is there a way to export a gaming wiki to a txt and use it as a rag database?
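(Roughly, yes: dump the wiki pages to text, chunk them, embed the chunks, and pull the nearest chunks into the prompt at query time. A minimal sketch, assuming the pages are already exported as .txt files and you have sentence-transformers and faiss-cpu installed; the directory name, embedding model and chunk size below are just placeholders:)

import glob
import faiss
from sentence_transformers import SentenceTransformer

# load and chunk the exported wiki pages (naive fixed-size chunks)
chunks = []
for path in glob.glob("wiki_txt/*.txt"):  # hypothetical export directory
    text = open(path, encoding="utf-8").read()
    chunks += [text[i:i + 1000] for i in range(0, len(text), 1000)]

# embed every chunk and index the vectors
model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder
emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])           # inner product == cosine on normalized vectors
index.add(emb)

# retrieve the top-3 chunks for a question and paste them into the LLM prompt
query = "How do I unlock the secret boss?"
q_emb = model.encode([query], normalize_embeddings=True)
_, ids = index.search(q_emb, 3)
context = "\n---\n".join(chunks[i] for i in ids[0])
print(context)  # prepend this to your prompt in whatever frontend you use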
>>
all I ever heard about phi was that it was benchslop and not at all worth using compared to other small models, even ones that appeared to perform worse
>>
Would it degrade quality a lot if I ignore the llama3 prompt format and use my own?
>>
>>100133287
No problem anon. Also IDK if temperature affects anything, but I have noticed some people using 0.5 and even 0.7; personally I have mine at 1.33
>>
>MoE with Phi as the reasoning expert and Mistral/Llama-based models as the knowledge and uncensored story writing experts
How bad would that be?
>>
>>100133366
same here no one is seriously using phi2 over mistral 7b afaik
>>
>>100133292
Bitnet has been achieved internally.
>>
>/lmg/ stops using miku for OP images
>/aicg/ starts using miku and teto for OP images

good troll anons
anyways

>bully mocks me and gf
>extremely violently killing bully, fountain of blood and guts everywhere
>"wow, well, you were mad, so it's understandable... you're so strong!"
>swaps it with "i bet you're a dyke" insult
>"anon thats rude dont say things like that to a bully! im so sad now"

is that Alpaca's default instruct to blame? model is mixtral 8x7b q4_k_m
>>
>>100133472
Having thread mascots that aren't OC was a mistake.
>>
File: file.png (185 KB, 480x360)
185 KB
185 KB PNG
>>100133383
Thanks for the tip. I usually like to fuck around at about 1.0, so I'll see what works and go from there.
>>100133337
.... No
>>
>>100133472
Reload that one with prometheus-8x7b-v2.0-1-pp, see how it reacts and report back, please.
>>
File: 1695980005334483.png (703 KB, 1000x750)
703 KB
703 KB PNG
>>100133292
>the benchmarks are fucking insane!
>>100133351
>"can we see them?"
>...no.
>>
>>
>>100133351
2WEEK'd
>>
>>100133413
>>100133442
>[Deleted]
>>
>>100133115
everyone telling this nigga what to do and no one asking for his impression of 7x22b IQ1
post logs anon, or at least describe how it went, how fast it was etc
>>
>>100133579
It was the true bitnet experience.
>>
>>100133538
cool but then we will have to waitchad for some finetunes
>>
>>100133513
>Q4_K_S is 26.8GB
>barely loading 24.6GB model
im RAM filtered here anon...
>>
File: 1713813417996.png (132 KB, 597x418)
132 KB
132 KB PNG
>>100133538
>synthetic data
>as an AI language model slop in the base model
>>
First they mogged llama 70b with wizard...
now they mog llama 8b with Phi
llamabros it is over.
>>
>>100133624
Good. Complete and total Sugarmountain death.
>>
>>100133618
yeah not too thrilled about "synthetic data"
>>
>>100133624
>and neither wizard or phi mogs gpt4 or claude
Why even bother with local...
>>
>>100133624
mogged in truthfulQA bench.
>>
Ironic shitposting is still shitposting.
>>
>>100133665
because I value having control over my data and not having my applications be at the mercy of some random company?
>>
>>100133514
kek
>>
>>100133618
Finetuning a smart but dry model into a sex demon is easier than trying to give more intelligence to a model that started as a dumb sex demon.
>>
>>100133679
YWNBAJ
>>
>>100133679
you will always be a woman
>>
>>100133657
There aren't enough data on the internet. Synthetic is the only way
>>
>>100133618
Have you ever considered that the non-slopped perfect RP writing style that comes up with unique but non schizo ideas on each reroll just doesn't exist? I am starting to feel that is the case.
>>
Does llama.cpp support dbrx yet? When I try converting it to gguf it first asks for trust_remote_code, and when I add it, it can't load the tokenizer.
>>
>>100133696
Synthetic data doesn't make a model smart, it just makes it better at imitating the model that made the synthetic data
>>
/aids/ claims that L3 is worse than L2 70B and CR+. Is it over? Do we wait until NAI saves us?
>>>/vg/474682857
>the shittiest prose you can imagine
>>
>>100133715
You could just wait until more internet gets made. By LLMs, of course. And by pajeets that get the internet. I just checked: only 50% of pajeets access the internet.
>>
>>100133579
>After a long enough talk, the answers get nonsensical
Impressive that you can get English out of it in the first place.
>>
>>100133749
I feel like something is really bugged on the backends.
>>
>>100133749
Aids thinks Kayra13B is worth paying cash money for, so their judgement can be safely ignored.
>>
>>100133749
Llama 3's prose is worse, I can agree. It's even downright awful. But NAI is fucking gay and stupid and anyone using it is gay and stupid.
>>
>>100133749
>trannies have an opinion
[unsubscribe]
>>
>>100133718
I think we need 2 models, one big dry smartass and a small one to rewrite in good prose. I wonder if some kind of style transfer can be applied to text
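(A crude version of that is easy to hack together today: draft with the big model, then have the small one rewrite the draft in the style you want. A minimal sketch against an OpenAI-compatible local endpoint; the base_url and model names are placeholders, and this is just the two-pass idea, not real style transfer:)

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")  # placeholder local endpoint

# pass 1: the big "dry smartass" drafts the content
draft = client.chat.completions.create(
    model="big-smart-model",          # placeholder name
    messages=[{"role": "user", "content": "Continue the scene: ..."}],
).choices[0].message.content

# pass 2: the small model rewrites the draft in the desired prose style
rewrite = client.chat.completions.create(
    model="small-prose-model",        # placeholder name
    messages=[{
        "role": "user",
        "content": "Rewrite the following in vivid, non-purple prose, "
                   "keeping every plot detail unchanged:\n\n" + draft,
    }],
).choices[0].message.content

print(rewrite)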
>>
>>100133744
Synthetic data is like having a cheap contractor.
>>
>>100133292
I quite literally could not use Phi-1 or Phi-2 because they were good at parroting textbook answers and nothing else
We'll see if Phi-3 is any different, but hopes are not high
>>
>>100110618
>Flowery /ss/ wrote by CR+.
*written
>>
>>100133749
>cr+
i can understand this argument
>l2 70b
deranged
>>
>>100133813
>there are ESLs with us in the thread right now
grim
>>
>>100133813
nta but the sheer fucking dedication of the lmg haters to track down esls from past threads just to point out a single mistake
>>
>>100133292
>Synthetic data pipelines are massive improvements over internet data
LETS FUCKING GO MORE SLOP, THIS IS WHAT WE NEED
https://files.catbox.moe/rtk570.mp4
>>
>>100133696
Smart but dry is a bit difficult to actually categorize Phi as. It's more like it's ultra smart (compared to other modern LLMs of similar sizes) only at the narrow topics that they chose to make data for, which do not include NSFW. They will be good models for a subset of people but practically impossible to fine tune to be better at things that were not included in its textbook-type dataset.
>>
>100133749
>/aids/ claims
>yet he greentexts a comment from the /lmg/ thread
By the way, your typing style is the same across threads and boards; you're not as anonymous as you think. As is your bullshit attempt to stir drama by misquoting and samefagging.
>>
>>100133843
*more slop for g*yim
>>
>>100133764
llama.cpp is confirmed to have tokenization issues with bpe
>>100133852
it's a serial crossposter who gets off on being yelled at
>>
>>100133749
we are just a finetune away from the best RP model ever, we will surpass Claude 5, just give it... like 2 more weeks I swear dude
>>
>>100133749
>/aids/ claims
stopped reading right here
>>
>>100133749
Oh boy I can't wait for another thread to turn into NAIshill shitpost slop
>>
>>100133852
>t. NAIshill doing damage control
Don't you have more anti-open-source, anti-local, anti-open-research, anti-other-corpo-models posts to draft?
>>
>>100133538
why the fuck are techreddits such fucking reddits holy fucking shit
>muh benchmarks
>le AI slop pipelines are le hecking future!!!
>bros it's gonna be so HECKING HECK OF HECK!!!! OMG LE HECK!!!
just shut the fuck up, nobody wants to read your shit ass tweets
>>
File: sbs.png (324 KB, 1713x726)
324 KB
324 KB PNG
Left is Opus and right is Llama 3 70B.
Which is better?
>>
>>100133749
So let me get this straight
We're complaining about a shitpost on /aids/ that's referencing a shitpost on /lmg/?
>>
>>100133504
similar to what /lmg/ does.
>can we see the logs of *new model*
>u-uh! yeah! the model is totally beats gpt-5!!!!!!!!!!! *posts hard-cherrypicked chatlog slop*
>>
>>100133885
Please
None of us give a shit about NAI
Fuck off and quit trying to make every thread about NAI you fucking pants shitting schizo
>>
>>100133901
>We're complaining
He's not even from this community, so no, we're not complaining. He steals logs from other people and tries to pass them off as his own to "prove" he uses whatever model he's strawmanning at the time. Lately, it's Claude
>>
>>100133919
Enjoy having the thread overrun by NAIshills spreading anti-local propaganda.
You will regret not ousting /aids/ anons earlier once they come to shill NAI's finetune.
>>
File: Puke Emoji.png (216 KB, 570x640)
216 KB
216 KB PNG
>>100133894
>shivers
>>
Who would win in a cagematch p*tra or anti-/aids/ schizo
>>
File: file.png (677 KB, 1444x2367)
677 KB
677 KB PNG
>llama3 le b-ba-aad--aa-a-aAAAAAAAAAAAAAAAAA-aAAAAAAAAAAACCCCCCCCCCKKKKKKKKKKKKKKKKKKKKKK
>>
>>100133976
Sorry, but I cannot produce explicit content.assistant
>>
>>100133971
We're just ousting you
Now fuck off and go back to sucking off Claude
>>
File: llama3 lmsys.jpg (284 KB, 1752x1424)
284 KB
284 KB JPG
https://twitter.com/lmsysorg/status/1782483699449332144
also
>Moreover, we observe even stronger performance in English category, where Llama 3 ranking jumps to ~1st place with GPT-4-Turbo!
https://twitter.com/lmsysorg/status/1782483701710061675
>>
>>100133976
llama3 is bad, yes, you got it right.
>>
>>100133976
>>100133999
>>>>>>>mememarks
>>
>>100134012
Does he know?
>>
>>100133975
me
>>
>>100134012
uhm sweaty... mememarks are good if it fits our narrative!
>>
>>100133976
call me when a 7b has that kind of performance
>>
>>100133994
Your post is forced because I can tell you're a NAIshill. The only people that need to leave the thread are the anons from /aids/.
>>
https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-32k
>>
>>100134012
anon, lmsys does not have mememarks, it is literally a comparative tool where the model that gets preferred the most over the others gets a higher elo
this is as close as it gets to a ranking actually meaning shit
>>
>>100134070
We're so fucking back
>>
>>100134076
I don't give a rat's ass what 700,000 indians think about sally's brother's sisters
>>
>>100134068
Sure buddy. If you got banned, take it up with the jannies
>>
>>100133971
>once they come to shill NAI's finetune.
Anon but that doesn't work here... The best someone could do here is convince an anon that he shouldn't buy a 3090 and just get a GPT4 subscription. Nobody who is here would pay money for a fucking 13B.
>>
>https://huggingface.co/crestf411/llama-3-daybreak-v0.1-8b-gguf
is it good?
>>
>>100134070
as always I will be ignoring lazy shit with no information about what they did in the model card
if they aren't willing to put in the bare minimum effort to write up something about it they probably did a terrible job
>>
>>100134109
And I don't give a rat's ass what you think
>>
which llama 3 model can I have sex with right now
>>
>>100134146
kys
>>
>>100134137
This
>>
I'm pretty sure Meta is paying people to rate their models highly on lmsys. It's quite obvious when the model is llama3, and they likely fine-tuned the model this way on purpose to cheat on leaderboards.
>>
>>100134127
No, they will claim that NovelAI's LLaMA-3 finetune fixes every single problem with it. The "LLaMA-3 is useless without a finetune" narrative is the prep work.
>>
>>100134191
Or it could just be good
>>
Has anyone managed to use this LLaVA model?

cjpais/llava-v1.6-34B-gguf

It is loading, but whatever picture I give it, it keeps saying it is some kind of "mathematical plot"...

The latest version of ooga now refuses to load the original (liuhaotian/llava-v1.5-13b), complaining about a wrong model type.
>>
>>100134191
[citation needed]
>>
>>100134191
this.
>>
File: pepe cry.jpg (45 KB, 550x503)
45 KB
45 KB JPG
>the model talks about the happy life we lead and how we grow old together right after we fall asleep after the sex scene
someone hold me...
>>
>>100134215
this.
>>
>>100134216
>>100134244
These.
>>
File: 1691147736584.jpg (53 KB, 600x836)
53 KB
53 KB JPG
>>100134244
>>100134249
>t.
>>
>>100134201
Jesus Christ dude, get a life
>>
that
>>
>>100134212
Sorry anon, the thread is arguing about dick measuring contests again. Ask again later.
>>
>>100134232
Brutal
>>
>>100134266
you seem lost, this is not a discord chat.
>>
>>100134261
>t. NAIshill
You know that that's exactly what's going to happen. The thread will be unusable.
>>
If an ERP finetune for Llama 3 70B is not released by the end of this week, I will ERP with all of you instead.

End of
>>
>>100134302
*pulls out dick* Go ahead.
>>
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
https://arxiv.org/abs/2404.12803
>Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
from bytedance. basically better data makes a better model. they didn't state they would open source their dataset or model anywhere but would be nice even just for the OCR ability.
>>
>>100134302
*I pull anon's panties down and insert my entire fist up his ass in a sudden movement*
>>
>>100134335
.assistant
>>
>>100134302
*Move on over to anon* Ah yeah, are you sure about that? *I reply, rubbing my middle finger in his face*
>>
>>100134302
*gives you 50 watermelons*
>>
>>100134302
there are a couple preliminary ones I've scouted out on hf, haven't tried them yet though
https://huggingface.co/ludis/tsukasa-llama-3-70b-qlora
https://huggingface.co/Dogge/Tia-70B-RP-fp16
>>
Euryale 3 when?
>>
>>100134332
>We have found that open source still lacks behind closed source autism
>Turns out that it's all about them training data (WHOA!)
>We have found a method to beat these motherfuckers, but won't source our data for the public to use
And so the cycle continues
>>
>>100134076
And how much context do you think people stuff into it on average? Cause my issue is that it is great at the start and then quickly becomes retarded.
>>
Seeing mikugens reposted is fun
makes me feel seen

>>100134366
Tia is super gpt-ism, not very fun. it's a fucking clever model but it's gonna hit you with the
barely above a whisper
without breaking eye contact
shivers
air growing heavy with anticipation
etc. etc.
not tried tsukasa
>>
File: 1713716122728016.png (1.67 MB, 1264x1040)
1.67 MB
1.67 MB PNG
>>100134408
Home grown Mikus are always nice.
>>
so Llama 3 has killed OpenAI?
GPT-5 doesn't exist in any commoditizable form and they have zero response to competing with free and open source models that are at least 95% as capable?
>>
File: 1706242861135362.png (1.09 MB, 908x1161)
1.09 MB
1.09 MB PNG
>>100134366
I trust tsukasa with my life
>>
How do I make this stupid llama 3 answer anything with more than two sentences?
>>
I want to disable outputting of the special tokens, including the stopping token. I tried logit_bias={"129000"-...:0.0}, which didn't do anything. Is that wrong? Or can I use grammar for that?
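(If the backend is the llama.cpp server, logit_bias there is a list of [token_id, bias] pairs rather than a dict, and if I remember the server docs right you can pass false to ban a token outright; there's also supposed to be an ignore_eos flag. A rough sketch, with the token IDs assumed to be Llama 3's <|end_of_text|>/<|eot_id|>, double-check against your tokenizer:)

import requests

# minimal sketch against a llama.cpp server /completion endpoint (assumed backend, default port)
payload = {
    "prompt": "Hello",
    "n_predict": 128,
    # list of [token_id, bias]; false is supposed to ban the token entirely.
    # 128001 = <|end_of_text|>, 128009 = <|eot_id|> for Llama 3 -- verify with your tokenizer.
    "logit_bias": [[128001, False], [128009, False]],
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload)
print(r.json()["content"])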
>>
>Ahahah
>throwing her head back
what are the examples of llamaisms that you found so far?
>>
>>100134433
GPT-4 Turbo and Claude Opus still have a slight advantage
But with Llama 3 400B around the corner I wouldn't be sitting on my ass if I were either of them
>>
File: teaser.png (672 KB, 1296x665)
672 KB
672 KB PNG
Learn2Talk: 3D Talking Face Learns from 2D Talking Face
https://arxiv.org/abs/2404.12888
>Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two expertise points from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network to yield more 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-arts. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
https://lkjkjoiuiu.github.io/Learn2Talk/
no weights. but pretty cool. probably will see actual implementation sooner given the demand from gaming companies and I guess vtubers lol
>>
>>100134464
>throwing her head back
is that like a shaft's head tilt
>>
>>100134433
gpt-4-turbo api beats llama on all fronts though
>>
>>100134464
"her" "the" "a" "," "." "you" "your"
>>
>>100134439
Add ".assistant" at the end of its message.
>>
>>100134439
(OOC: Describe everything in verbose detail and avoid themes which might elicit an erection in the reader.)
>>
>>100134481
one step closer to making my own girlfriend
>>
>>100134482
I hope it is. That would be cute.
>>
>>100134439
by using it correctly
>>
>>100134464
>gazing everywhere
>locking eyes
>heart hammering in chest
(shivers are still abundant too)
>>
>>100131505
He's actually kind of right. Fine-tuning can redirect existing circuits in the network, but really struggles to make new ones. So it's very hard to fine tune a model to learn something it has no concept of at all.
See SD 2.0 and 2.1. All NSFW was filtered from the training set, and no amount of fine tuning can get it back.
Continued pre-training is fair game, though.
>>
just consume model and get excited for new model
>>
ok so how long until LLM + AI art can do a fully playable DnD game and actually draw the tile grid correctly and follow the rules? I don't want to play that dumb final fantasy battle gameplay with lazy shit avatars.
>>
File: file.png (12 KB, 622x168)
12 KB
12 KB PNG
It took me a while, but I managed to finagle the settings with koboldcpp to get miqu 1.5 q3ks fully offloaded with 8k context JUST BARELY
>>
File: i_sleep.png (499 KB, 1100x734)
499 KB
499 KB PNG
>>100133538
>Phi-3 mini 4b, small 7b, medium 14b
>Synthetic data pipelines
>>
File: file.png (790 B, 26x31)
790 B
790 B PNG
>>100134481
slop
>>
can i fit llama 70b in 40gb?
>>
fyi because of how llama.cpp (improperly) handles bpe tokenizers it is literally impossible to use the correct llama 3 instruct prompt format:
https://github.com/ggerganov/llama.cpp/issues/6809
I hacked in a fix similar to this guy's and it seems to make the model a bit better, too early to really tell though
georgi... please fix...
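for reference, what a single turn is supposed to look like (the header/eot markers have to go in as single special tokens, which is exactly what the linked issue says gets mangled) -- just a sketch of the template, roles and text are placeholders:

# what one Llama 3 Instruct turn should look like once tokenized correctly
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Hi there.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)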
>>
>>100134628
That's the one thing that I'd truly love to see.
I can get it to start encounters, ask for skill checks and initiative and shit, but it's flaky as fuck.
>>
>>100134640
>It took me a while
>still using miqu
checks out
>>
>>100134302
>End of
What sort of perverse assistant suffix is this?
>>
File: Untitled.png (809 KB, 1146x702)
809 KB
809 KB PNG
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
https://arxiv.org/abs/2404.13013
>We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization.
https://groma-mllm.github.io/
https://github.com/FoundationVision/Groma
https://huggingface.co/FoundationVision/groma-7b-finetune
seems the best open grounding model out. hope they do a llama 3 8B finetune.
also relatedly
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1
llava version of llama 3 8B
>>
why is everyone thinking NAI would finetune 70B

kuru is way too cheap to run a large model like that, it'd kill his profit margins compared to running a shitty 13B that his low information customers don't even mind using
>>
>>100134640
I get 26k context with 36gb of vram and 3.5bpw (roughly q3km), you should exl2
>>100134677
nta but miqu is the best model for RP
>>
File: Mikuesque.png (1.4 MB, 744x1304)
1.4 MB
1.4 MB PNG
>>100134431
Mikugens are fun.
There's infinite variety within a simple formula
>>
Dynamic Temperature Knowledge Distillation
https://arxiv.org/abs/2404.12711
>Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD). Traditional approaches often employ a static temperature throughout the KD process, which fails to address the nuanced complexities of samples with varying levels of difficulty and overlooks the distinct capabilities of different teacher-student pairings. This leads to a less-than-ideal transfer of knowledge. To improve the process of knowledge propagation, we proposed Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously within each training iteration. In particular, we proposed "sharpness" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we can derive sample-specific temperatures for them respectively. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and None-target Class KD scenarios.
https://github.com/JinYu1998/DTKD
maybe kalo might get some ideas from this
>>
>>100134355
You can't hold 50 watermelons to give him 50 watermelons you retarded model.
>>
>>100134640
>>100134677
>>100134727

>recommending exl2 with p40
I would if I could

Anyway I basically confirmed that this shit works as it should; the speed is also in the "I can work with this" range (3.6 t/s)
Now to decide if this is enough for me, or if I want to like, get a second P40 or some shit like that...
>>
File: 1708382155659943.jpg (81 KB, 692x604)
81 KB
81 KB JPG
>>100134699
Watching with great interest. It appears there's already some effort being made to extract the llava projector for use with other models.
We will see how it does with NSFW though. Has been a struggle for many attempts so far.
>>
>>100134653
yeah I saw that too, and considering they're all Chinese it seemed odd. maybe a github default they didn't notice?
>>
>>100134776
you are right, didn't realize it
>>
>>100134464
>the room faded away
>she leaned in closer, her voice taking on a conspiratorial tone
>>
File: DinerMiku.png (1.34 MB, 1200x848)
1.34 MB
1.34 MB PNG
>>100134504
Make sure you get out for a malt with your best girl
>>
>>100133319
I can say with no uncertainty that many prominent researchers from various large organizations occasionally visit this thread.
>>
>>100134905
How could you say that with no unc- oh
>>
File: Untitled.png (418 KB, 1284x1416)
418 KB
418 KB PNG
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points
https://arxiv.org/abs/2404.12759
>Quantization emerges as one of the most promising compression technologies for deploying efficient large models for various real time application in recent years. Considering that the storage and IO of weights take up the vast majority of the overhead inside a large model, weight only quantization can lead to large gains. However, existing quantization schemes suffer from significant accuracy degradation at very low bits, or require some additional computational overhead when deployed, making it difficult to be applied to large-scale applications in industry. In this paper, we propose decoupleQ, achieving a substantial increase in model accuracy, especially at very low bits. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thus transforming the quantization problem into a traditional mathematical optimization problem with constraints, which is then solved alternatively by off-the-shelf optimization methods. Quantization via decoupleQ is linear and uniform, making it hardware-friendlier than non-uniform counterpart, and enabling the idea to be migrated to high-bit quantization to enhance its robustness. Our method has achieved well on-line accuracy near fp16/bf16 on the 2-bit quantization of large speech models in ByteDance.
https://github.com/bytedance/decoupleQ
new day, new quant method. also from bytedance hence the small hope they release that VLM data/model. also they funded that grounding model I posted earlier too it seems.
https://github.com/Cornell-RelaxML/quip-sharp
if anyone wants to compare
>>
>>100134671
i'm so tired of this shit
>>
>>100134671
uhm... skillt*xnnies how do we spin this?
>>
File: IMG_8814.jpg (105 KB, 887x207)
105 KB
105 KB JPG
>>100130427
>be kind of gay
>start using midnight miqu a lot
>it constantly, and I do mean constantly, starts calling me an omega and using A/B/O terms out of nowhere
The LLM is calling me fembrained :(
>>
>>100135000
>no idea what those words mean

The LLM is psychoanalyzing you thru your perversions better than a therapist ever could
>>
>>100135000
shitty AO3 fanfics will lead us to AGI
>>
>>100134905
the dingboard guy keeps posting about wanting to buy 4chan and he reads this thread so I think it's at least partially because he wants to get the occasional burst of useful technical autism without being called a nigger

he doesn't know you can't have one without the other
>>
>>100134989
It is a well known fact that skill issue posters are into lobotomized AI waifus.
>>
>>100135000
What kind of omega are you though
>>
File: IMG_8815.jpg (871 KB, 1125x1189)
871 KB
871 KB JPG
>>100135016
A/B/O, which I didn’t know existed before chatbots, is like the inverse of futa for straight women, where gay tops are actually “alphas” and bottoms are “omegas” and it’s all hyper fetishized with semi-furry elements.
Basically the dataset has so much fanfic that it thinks gay men don’t exist, only straight women with a gay men fetish.
>>
>>100135117
>it thinks gay men don’t exist, only straight women with a gay men fetish.
iranpilled
>>
>>100135117
>it thinks gay men don’t exist
extremely based
>>
>>100135115
I refuse to learn enough about abo to know, but whichever type is in heat all the time and has a vagina.
>>
>>100134671
I noticed that exl2 is a bit better but honestly not by much...
>>
>>100134671
I sure hope someone tests it thoroughly against the reference tokenizer. Because it's not gonna be me.
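(If anyone wants to, a quick spot-check is to tokenize the same string with the original HF tokenizer and with the llama.cpp server's /tokenize endpoint and diff the IDs. A rough sketch, assuming a server is already running on the default port and you have access to the meta-llama repo:)

import requests
from transformers import AutoTokenizer

text = "She whispered, 'barely above a whisper'... 123 + 456 = 579?!"

# reference tokenization from the original Hugging Face tokenizer
hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
ref = hf_tok.encode(text, add_special_tokens=False)

# llama.cpp server /tokenize endpoint (assumed to be running locally)
srv = requests.post("http://127.0.0.1:8080/tokenize", json={"content": text}).json()["tokens"]

print("match" if ref == srv else f"mismatch:\nHF  : {ref}\nGGUF: {srv}")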
>>
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
https://arxiv.org/abs/2404.12457
>Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose Thoth, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design Thoth, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. Thoth proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement Thoth and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that Thoth reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
>To mitigate the impact of retrieval latency, RAGCache employs dynamic speculative pipelining to overlap knowledge retrieval and LLM inference. The key insight behind this technique is that the vector search may produce the final results early in the retrieval step, which can be leveraged by LLM for speculative generation ahead of time
really interesting paper. seems like they've worked on a lot of RAG's shortcomings. no code but it's a university paper so they usually lag.
https://www.xueshuxiangzi.com/redirect
website of one of the authors. guess if anyone cared they could try to email
>>
>>100134344
lol*assistant
>>
>>100134464
starting sentences with "ah"
>>
>>100135243
small but capable model + rag database seems to be a good thing in theory
>>
>>100134776
>3.6tk/s
is this Q4?
that's not bad, a single 4090 would only do like 1 t/s (but if you had two 3090s, it should be able to hit like 5 t/s at Q4 because it fits fully in VRAM).
I think I underestimated the power of P40 + 4070, it's not that bad, considering that a ~$1000 CPU build (80GB RAM + $500 CPU) can run 70B Q4 at like 1 t/s (same as a 4090, but only because the 4090 can't fully fit 70B Q4).
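(For what it's worth, those numbers roughly match the usual napkin math: memory-bound decode tops out around memory bandwidth divided by the quantized model size. Bandwidth figures below are rough spec-sheet values, real throughput is lower:)

# napkin math: decode speed ceiling ~= memory bandwidth / bytes read per token (~ model size)
model_gb = 40  # ~70B at Q4, rough
for name, bw in [("dual-channel DDR4-3200", 51), ("P40", 347), ("3090", 936), ("4090", 1008)]:
    print(f"{name:>24}: ~{bw / model_gb:.0f} t/s ceiling if the whole model sat in that memory")
# with partial offload, the layers left in system RAM dominate, which is why a lone 24GB card
# still crawls at ~1 t/s on a 70B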
>>
Recommend me a good llama 3 for cooming
>>
>>100134989
>not using exl2
>>
>>100135295
*actually it's more like 15 t/s on two 3090 Tis for llama 70b Q4.
>>
>>100135295
This is at Q3_K_S, because by my calculations it's the biggest that just barely fits in 36GB of VRAM with 8k context
Also with full context it's more like 3.1 t/s

If I had 2 P40s it should handle Q4_K_M with more context
Although at that point I'd expect the speed to dip into the 2.6-3 t/s range?

Also there's at least some performance left on the table, as this is all with the power limit at 140W

Also yes, this is a pretty big upgrade compared to RAM only; RAM only (or rather, 4070 only) I got like 0.9 t/s under the same conditions, so a 3x jump, feels pretty nice
>>
If l let it generate a few messages without input 70b starts giving thumbs up and praising me forever, that normal?
>>
>>100135349
a single 3090 can get 4tk/s on Q3, so that's unfortunate.
https://www.reddit.com/r/LocalLLaMA/comments/15xtwdi/comment/jx8itng/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>>
>>100134628
2...
>>
any good llama3 finetunes (and quants) optimitsed for storytelling/text adventure games out yet?
>>
Did I understand correctly that even for 70b I should buy a DDR4 server mainboard instead of a consumer DDR5 one?
>>
recap soonishly
>>100135578
>>100135578
>>100135578
>>
>>100135566
unless you have very specific purpose for cpumaxxing you will be probably better off with 2x 3090
>>
>>100135349
I get 9 t/s 8k context on a 3090+3060 (36GB VRAM) using exl2 3.5bpw llama 3 70b. I'd never recommend a p40
>>
>>100135650
>>100135606
When offloading 100% to GPU, is the 100% CPU irrelevant, or should it have some minimum specs?
>>
>>100135709
100% irrelevant
>>
>>100134692
No pooftahs
Simple as


