/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.

Datasets for Training AI Robowaifu Technician 04/09/2020 (Thu) 21:36:12 No.2300
Training AI and robowaifus requires immense amounts of data. It'd be useful to curate books and datasets to feed into our models, or possibly build our own corpora to train on. The quality of data is really important: garbage in, garbage out. The GPT-2 pre-trained models, for example, are riddled with 'Advertisement' after paragraphs. Perhaps we can also discuss and share scripts for cleaning and preparing data here, and anything else related to datasets. To start, here are some large datasets I've found useful for training chatbots:
>The Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/
>Amazon QA
http://jmcauley.ucsd.edu/data/amazon/qa/
>WikiText-103
https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
>Arxiv Data from 24,000+ papers
https://www.kaggle.com/neelshah18/arxivdataset
>NIPS papers
https://www.kaggle.com/benhamner/nips-papers
>Frontiers in Neuroscience Journal Articles
https://www.kaggle.com/markoarezina/frontiers-in-neuroscience-articles
>Ubuntu Dialogue Corpus
https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus
>4plebs.org data dump
https://archive.org/details/4plebs-org-data-dump-2020-01
>The Movie Dialog Corpus
https://www.kaggle.com/Cornell-University/movie-dialog-corpus
>Common Crawl
https://commoncrawl.org/the-data/
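Most of the corpora above can be pulled programmatically instead of downloaded by hand. Below is a minimal sketch using the HuggingFace datasets library for the SQuAD entry; the "squad" dataset id is the standard one on the Hub, but the other corpora listed may use different ids than the names given here, so check the Hub before relying on them.
[code]
from datasets import load_dataset

# Download the Stanford Question Answering Dataset and peek at a sample.
squad = load_dataset("squad")
print(squad)                      # shows the train/validation splits and their sizes
sample = squad["train"][0]
print(sample["question"])
print(sample["answers"])
[/code]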
>>21061 I must say I've been enjoying scanning through the raw data, Anon. I'm already finding myself wanting to whip up a little browser GUI project for this. Thank you Anon, your effort with this is very much appreciated! :^)
Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects.
https://huggingface.co/datasets/allenai/objaverse
>Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
https://arxiv.org/abs/2212.08051
Open file (370.29 KB 1280x800 0CY9s8mGbIs.jpg)
Creator of Waifu Diffusion just released an instruction dataset with 180k samples:
https://huggingface.co/datasets/hakurei/open-instruct-v1
Also some people made a clean Alpaca dataset:
https://huggingface.co/datasets/yahma/alpaca-cleaned
And there's a raw ShareGPT web scrape (still needs cleaning before use):
https://huggingface.co/datasets/jeffwan/sharegpt_vicuna
>>21850 That is really cool news Anon! The opensauce communities will prove to be the true heroes we all need, I suspect. Thanks for letting us all know! :^)
Open file (262.70 KB 1280x720 ow0aSZt.jpg)
>>21850 A high-quality instruction dataset just dropped. Has a lot more variety, including short responses for questions that only need a short answer.
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
>Dataset of 15K high-quality instruction/response pairs allows commercial use
>Built by over 5K employees, enables customized LLMs that talk without API fees or sharing data
>New open-source 12B parameter language model trains for less than $30 to handle human conversation
Dataset: https://github.com/databrickslabs/dolly/blob/master/data/databricks-dolly-15k.jsonl
Model: https://huggingface.co/databricks/dolly-v2-12b
>>21938 Excellent. Thanks Anon.
>>21938
>https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
<"Two weeks ago, we released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following). Today, we’re releasing Dolly 2.0, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use."
How in the world did they do the training with just US$30, Anon?
This is going to be big for VR and "Visual Waifus". Leonardo A.I. can already create basic 3D assets and PBR textures for game models. This will enable game designers and artists to create entire 3D worlds which previously would have been too labour-intensive, time-consuming and expensive. I suspect entire 3D environments or game levels will eventually be generated almost entirely by A.I. Then of course more effort can be put into programming the main game characters - again, perhaps with help from various A.I. 'coding co-pilot' tools.
>>21947 Outstanding stuff SophieDev, thanks! >I suspect entire 3D environments or game levels will eventually be generated almost entirely by A.I. I think you're right, and I suspect that in film at least, it's already being done in large measure. Cheers!
>>21944 Using much higher-quality data and starting from Pythia 12B, a GPT-NeoX model.
>>21947 Training on synthetic data is going to be huge. This stuff is going to improve orders of magnitude faster than anyone expects. I'm working on a Stable Diffusion model that removes the AI-generated look by training on AI-generated images, then training on the ones it generates to iteratively refine itself towards novelty and my aesthetic. I'm amazed it can spit out images that look just like my art without looking AI-made unless I inspect them closely; my hypothesis is that's due to not finetuning the VAE yet.
I hope someone out there takes up the task of generating a synthetic instruction dataset that outperforms human data. Language models just need to improve a little more before they become usable on low-end hardware to work with other generative models for illustrations, 3D models, co-pilot, sound effects, music, voice acting, etc. I predict in 1-2 years we'll be writing prompts for entire games, then refining them in a conversational interface.
We need to scrape more. The Vicuna model is trained on data from a site called ShareGPT, where users shared their conversations with ChatGPT. Now they don't share that data anymore and the Vicuna team won't share their scraped data.
>>21968 I'm currently fishing around for ideas for a 'final' project for our C++ classroom thread. This might be a really good choice, Noidodev. How can we all cooperatively work to score/tweak/combine the scraped data into a high-quality, fully-vetted, unified dataset, Anon?
>>21961
>I hope someone out there takes up the task of generating a synthetic instruction dataset that outperforms human data.
Can you make a post breaking this out in detail and explaining every step in layman programmer's terms, Anon? What would this look like as a complete system? How would the parts communicate & work together towards this functionality? TIA. :^)
Open file (16.98 KB 800x395 tooling.png)
>>21968 I've been saving my chats and thinking of creating a preference dataset by sampling multiple models and choosing the best responses. Out of curiosity I gave Sage a description of what I needed to make a dataset just now and it generated a working PyQt application. It just needs some buttons to add and remove chats and save the data. I'll see how it does on modifying it further.
>>21973 There are different ways to approach it. The easiest would be to create a preference model so it can judge the quality of responses itself. Human examples would be the gold standard from which it imagines new instructions and responses using few-shot learning. This worked quite well for Unnatural Instructions and Unsupervised Data Generation:
https://arxiv.org/abs/2212.09689
https://arxiv.org/abs/2109.09193
However, you don't want just a flood of synthetic data to train on. If you create a preference model, it can judge the quality of generated examples and do best-of-n sampling to choose the best. Constitutional AI could also be used to critique outputs and revise them toward given goals: https://arxiv.org/abs/2212.08073
Going a step further, a Generative Teacher Network (GTN) could be used, where the teacher learns to generate samples in an outer training loop and a student network learns on the data created by the teacher in an inner loop. With flash attention, LoRAs and gradient checkpointing we should be capable of doing this for enough steps that the teacher network learns something useful. This would be extremely compute-intensive, but I think it would be worth the endeavor: it reduces the best-of-n samples needed to get a good response and improves the generated dataset overall: https://arxiv.org/abs/1912.07768
Also, with GTN data a preference model could predict how good a sample will be for training a student. That prediction could then be used with best-of-n sampling during training. And if training a GTN is too difficult, I think a preference model could still learn from regular generated training data to predict the outcome of a student network's training.
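A minimal sketch of the best-of-n idea described above: sample several candidate responses from a generator and keep the one a preference model scores highest. The two model names below are placeholders (any causal LM and any scalar reward/preference model will do), and the way the instruction and response are concatenated for scoring is an assumption; match whatever format your preference model was actually trained on.
[code]
from transformers import pipeline

# Placeholder models -- swap in whatever generator and preference/reward model you actually use.
generator = pipeline("text-generation", model="EleutherAI/pythia-1.4b")
scorer = pipeline("text-classification", model="OpenAssistant/reward-model-deberta-v3-large-v2")

def best_of_n(instruction, n=8):
    # Sample n candidate responses from the generator.
    candidates = generator(instruction, do_sample=True, top_p=0.9,
                           max_new_tokens=128, num_return_sequences=n)
    responses = [c["generated_text"][len(instruction):] for c in candidates]
    # Score each (instruction, response) pair with the preference model and keep the best.
    # The plain concatenation here is an assumption; some reward models expect a specific template.
    scores = [scorer(instruction + "\n" + r)[0]["score"] for r in responses]
    return max(zip(scores, responses))[1]

print(best_of_n("Explain what a dataset card is in one paragraph."))
[/code]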
>>21977 Excellent response Anon, thanks! I'm wondering what scale of 'flood' we need? IBs are roughly by definition chock-full of back-and-forth dialog (and generally with easily-chainable conversations at that, given the format). And better-than-average boards have some quite good content; /robowaifu/ for example is one of them IMO. This is in fact one of the reasons we occasionally publish the board's full JSON archive: to provide this fodder to researchers. There are ofc boards much, much larger than ours, and a few of them are the kind where plenty of good dialog goes on (you'd need better pozz filters on the data in those big-sized cases ofc). Just an idea Anon. :^)
>Teacher network
<[outer:samples]
>Student network
<[[inner:samples]]
>Also with GTN data a preference model could predict how good a sample will be for training a student.
Really intriguing stuff. It seems, just on the surface and to my amateur eye, that this could almost be a self-sustaining source of dialogs on most any topic imaginable if managed properly... Thanks again Anon, cheers! :^)
>===
-prose edit
Edited last time by Chobitsu on 04/15/2023 (Sat) 04:58:09.
Open file (73.25 KB 645x627 B is for based.png)
>>21978 Yeah, there's really good data here. I've used the board's JSON archives in the past. The best data to have is data that requires expert knowledge; the more new information packed in, the better. In the Unnatural Instructions paper they found you need about an order of magnitude more synthetic data to get the same results as real data.
A problem with the way text is generated now is that it's randomly interpolating between things, not really searching for something new and interesting, although contrastive search improves this somewhat. I think as AutoGPT, YouChat and other systems evolve we will reach a point where they're capable of generating both original and valuable data and continue the exponential takeoff we're on as AI continuously improves itself. AI will begin to scan the literature, connect things together we never thought to connect, and then implement them autonomously.
I thought I posted it here already, but Stanford released a preference dataset and a model trained on it. It's an out-of-the-box solution for doing best-of-n sampling. It was trained on Reddit but you can trash asksocialscience, askhr, askanthropology and others from it if desired. I've tried it and it works pretty well.
Dataset: https://huggingface.co/datasets/stanfordnlp/SHP
Model: https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-large
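For reference, a sketch of how SteamSHP could be used to pick the better of two responses. It's a FLAN-T5 seq2seq model, so it loads with the standard transformers classes; the exact prompt template below is an assumption recalled from the model card, so verify it there before trusting the outputs.
[code]
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("stanfordnlp/SteamSHP-flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("stanfordnlp/SteamSHP-flan-t5-large")

def prefer(post, response_a, response_b):
    # Prompt format is assumed -- check the model card for the canonical template.
    prompt = (f"POST: {post}\n\n"
              f"RESPONSE A: {response_a}\n\n"
              f"RESPONSE B: {response_b}\n\n"
              "Which response is better? RESPONSE")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=1)
    return tok.decode(out[0], skip_special_tokens=True)  # expected to be "A" or "B"

print(prefer("What's the best way to clean a scraped dataset?",
             "Deduplicate it, strip markup, and filter low-quality or non-English samples.",
             "just train on it lol"))
[/code]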
>>21981 Thanks Anon, you're an invaluable encouragement to us all! I like seeing these GUI images used for scoring/ranking. My brain actually connects the dots behind the scenes better if I have something pertinent to look at (even if in hindsight it's apparently obvious). Moar! :^)
Noidodev said we need to scrape more. (>>21968) Using cURL to perform automated, parallel downloads (text, images, etc.) is a task we've already solved here (albeit with many refinements yet possible). I have little else to offer our AI side of the house, but possibly together we can all brainstorm our own DIY data harvesting operations? We should be able to easily spread that out amongst any willing anons here, if we can figure out some reasonable way to effectively agglomerate that data back to some central repository or other (probably one of our own devising)?
Also, this could be regularly updated at lower bandwidth cost (once the initial downloads are performed) for data that is itself incrementally updated. We built in the ability to optimize bandwidth by first downloading just the HTTP headers and comparing them against past saves by both date & size. By avoiding repetitious downloads within their volunteer data-harvesting 'sectors', this should allow anons to multiply the usefulness of their available bandwidth, both to and fro.
Just spitballing some ideas here Anon. Thanks again, cheers! :^)
>===
-prose edit
Edited last time by Chobitsu on 04/15/2023 (Sat) 08:32:26.
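One way to implement the 'check headers before re-downloading' idea: issue a HEAD request and compare Last-Modified and Content-Length against what was saved last time. The sketch below uses Python's requests library rather than cURL, and the URL and saved values are placeholders.
[code]
import requests

def needs_refresh(url, saved_last_modified=None, saved_length=None):
    # Fetch only the headers; the body is not downloaded.
    h = requests.head(url, allow_redirects=True, timeout=30).headers
    last_modified = h.get("Last-Modified")
    length = h.get("Content-Length")
    # Re-download only if the server reports a newer date or a different size.
    return (last_modified != saved_last_modified) or (length != str(saved_length))

# Placeholder URL and saved metadata for whatever an anon is mirroring.
if needs_refresh("https://example.com/board.json",
                 saved_last_modified="Sat, 15 Apr 2023 08:00:00 GMT",
                 saved_length=123456):
    data = requests.get("https://example.com/board.json", timeout=60).content
[/code]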
>>21972 I don't know, and also we don't have the scraped data.
>>21984 This data needs to be cleaned up and filtered if anyone is interested in working on it: https://huggingface.co/datasets/jeffwan/sharegpt_vicuna
I was thinking of running langdetect on it to get English responses only, plus filtering with SHP and OpenAssistant response ratings.
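A rough sketch of the langdetect pass mentioned above, run over a ShareGPT-style JSONL dump. The "conversations"/"value" field names are assumptions about the dump's schema (inspect a few lines first), and the SHP/OpenAssistant rating step isn't shown here.
[code]
import json
from langdetect import detect

# Keep only English conversations from a ShareGPT-style JSON Lines file.
kept = []
with open("sharegpt_raw.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        # Field names are assumed; adjust to the actual schema of the dump.
        text = " ".join(turn.get("value", "") for turn in sample.get("conversations", []))
        try:
            if detect(text[:1000]) == "en":
                kept.append(sample)
        except Exception:
            continue  # empty or undetectable text

with open("sharegpt_en.jsonl", "w") as f:
    for sample in kept:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
[/code]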
>>21984
>and also we don't have the scraped data.
That's not a difficult challenge, simply one of motivation and time spent. The tools are already easily available to us all. We just need to coordinate our efforts at it.
>>22006 The relevant argument in my posting in >>21968 was that it is gone. Not available anymore. Which indeed makes it hard to scrape. My point was that we would've needed to act before it was deleted. That said, >>21994 seems to imply that someone did download it in time and is sharing it now. Good that we're not the only ones interested in stuff like that.
Open file (115.35 KB 640x745 shortstack.png)
Open file (66.45 KB 650x488 lowmass.png)
Open file (61.10 KB 645x457 getblogged.png)
Open file (161.44 KB 688x857 story2chat.png)
Open file (86.28 KB 697x381 story2char.png)
Stumbled across an interesting model that generates instructions from text for synthesizing training data. When I have time I'll look into making a /robowaifu/ instruction dataset.
https://huggingface.co/pszemraj/bart-base-instructiongen
Also got an idea to do something similar with stories, but converting them into chat logs and character descriptions, since Pygmalion doesn't plan to ever release the CAI data they gathered because they're afraid of legal action. ChatGPT can already do this with this prompt:
>Can you extract the dialogue from this story and format it into an easy to read chat log?
>Can you write a character description for {character}, using only the given text?
But I have over 2 GB of stories to process. An in-house solution is needed.
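Loading that model through the standard transformers pipeline looks roughly like this; the generation settings and the example passage are arbitrary, so tune them against your own text.
[code]
from transformers import pipeline

# Generate an instruction that a given passage could be the answer to.
gen = pipeline("text2text-generation", model="pszemraj/bart-base-instructiongen")

passage = ("Deduplicating a corpus before training removes near-identical samples "
           "so the model doesn't overfit to repeated text.")
result = gen(passage, max_new_tokens=48)
print(result[0]["generated_text"])   # the synthesized instruction for this passage
[/code]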
>>22010
>My point was that we would've needed to act before it was deleted.
Ahh, got it Noidodev. Well, as you're aware I'm a yuge proponent of Save Everything. :^) Let's all solve this scraping need together, yeah?
>>22020
>An in-house solution is needed.
This. And for many different needs besides just chat.
>>21994 Thanks Anon! :^)
Open file (647.86 KB 1000x1000 RedPajama.png)
Open file (40.37 KB 424x395 LongForm.png)
RedPajama is reproducing the LLaMA pre-training dataset of over 1.2 trillion tokens so it can be used to train commercial models. The full dataset is ~5TB unzipped on disk and ~3TB to download compressed, about 5x larger than the Pile and deduplicated.
>CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages (2019 to 2023)
>C4: Standard C4 dataset
>GitHub: GitHub data, filtered by licenses and quality
>arXiv: Scientific articles removing boilerplate
>Books: A corpus of open books, deduplicated by content similarity (first book is KJV bible)
>Wikipedia: A subset of Wikipedia pages, removing boilerplate
>StackExchange: A subset of popular websites under StackExchange, removing boilerplate (appears to be questions only)
>For each data slice, we conduct careful data pre-processing and filtering, and tune our quality
Announcement: https://www.together.xyz/blog/redpajama
Dataset: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
I've skimmed through the categories and it looks good to me. Some early tinkering by a senior researcher has found it to be amazingly good, anecdotally.
>RedPajama + The Stack feels like an amazingly powerful dataset.
>3 epochs with a 4B parameter model brings us to GPT3 loss with a 120%+ compute overhead.
>Then instruction tune and RLHF with the new open conversations dataset. Wow
https://twitter.com/andrew_n_carr/status/1648125532175753216
Not exactly a fair comparison since he seems to be training it specifically on code generation, but still interesting nonetheless.
Also, a new instruction dataset just dropped that was created by generating instructions from diverse web documents. They release a 2.7B model that outperforms Alpaca-7B across several tasks.
https://github.com/akoksal/LongForm
They used GPT-3 text-davinci-003 to generate the instructions, but I found I can generate similar training data with ChatGPT using this prompt:
>What kind of instruction could this be the answer to? Answer starting with "Instruction:"
>Answer: {text}
Interestingly, the number of parameters doesn't seem to make a big difference. Acquiring higher-quality datasets could be the key to getting AI doing valuable work on consumer hardware.
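Since the full RedPajama download is multiple terabytes, streaming one subset at a time through the datasets library is the saner way to poke at it. The config name "arxiv" below is an assumption; check the dataset card for the exact subset names before running this.
[code]
from datasets import load_dataset

# Stream a single RedPajama subset instead of downloading multiple TB up front.
# The subset name "arxiv" is assumed -- see the dataset card for the real config names.
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                  streaming=True, split="train")

for i, sample in enumerate(ds):
    print(sample.keys())   # inspect the schema before writing any filtering code
    if i >= 2:
        break
[/code]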
>higher quality datasets could be the key to getting AI doing valuable work on consumer hardware.
That's incredibly good news, since this will cut down on the compute needed.
Open file (312.34 KB 1920x1080 dvm703m3baf01_fd6z.jpg)
Here is some of our collective effort over the past few years turned into an instruction dataset with 280 tasks. It's not complete yet and probably won't be anytime soon because I got b& :^) But I managed to cover most of the machine learning posts.
https://files.catbox.moe/irlitb.jsonl
Was going to add tasks for writing OP posts, chatting and searching arXiv, namely 'write', 'chat' and 'research', but only got a couple of examples done.
Prompt used to generate most of the instructions:
>What kind of instruction could the following be the answer to? Respond with a question, starting with "Instruction:"
Threads completed: >>22 >>85 >>1671 >>8958
>>22085 Banned from ChatGPT. I don't think they liked some of the posts on sexbots and angry roasties, because that's when they flipped the switch. I'm going to finetune my own model for making these datasets.
>>22085 Oh haha. Well anyway, this is great stuff; please keep it up one way or another! :^) BTW, I'll try to make time to push a new /robowaifu/ JSON archive to the board by this weekend. Cheers.
Open file (392.15 KB 1612x676 kek.png)
Open file (162.91 KB 1604x755 minigpt4.jpg)
The MiniGPT4 dataset is an interesting proof-of-concept. Despite only training on image-caption pairs it generalizes a bit to out-of-domain instructions, although it still heavily overfits to describing images. Creating a more varied dataset with multiple chat turns should yield much better results, as well as using better pretraining data since LAION is pretty trash without filtering.
https://minigpt-4.github.io/
I want to create a multimodal dataset. What should be included in it besides captions? Some things I can think of are comments, farming advice, car repair, visual reasoning, recipes, prompts, imageboard posts, tweets, memes, chess and other board game moves, video game input, factory debugging, graphs and weather maps.
>>22098 >comments, farming advice, car repair, visual reasoning, recipes, prompts, imageboard posts, tweets, memes, chess and other board game moves, video game input, factory debugging, graphs and weather maps. Nursing, childcare, education, cleaning, <list of household chores>, knowledge around food, nerd cultural knowledge, first aid, ... That's something to think about it for some time and taking notes from time to time.
>>22098 >>22177 True but until we unlock the complete polymeric falcighol derivation there will be a lot of good things a robowaifu still doesn't know about.
Open file (493.75 KB 1600x1200 dmoHZ4n.jpg)
It's possible to generate datasets better than human-crafted ones and these methods are only going to continue improving. Recursive self-improvement is imminent at least in the domain of data.
>In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM model are preferred to outputs from OpenAI ChatGPT. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing large language models.
Paper: https://arxiv.org/abs/2304.12244
Dataset: https://huggingface.co/datasets/victor123/evol_instruct_70k
Code: https://github.com/nlpxucan/WizardLM
Summary and prompts: https://robowaifu.tech/wiki/Evol-Instruct
No matter how many times the filters strike me down, I will get the data to bootstrap our robowaifus.
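A rough sketch of the Evol-Instruct loop, using a generic instruction-following model and a paraphrased evolution prompt. The paper and the wiki page linked above have the actual prompt templates, and the model named here is just a stand-in.
[code]
from transformers import pipeline

# Repeatedly rewrite an instruction to be more complex, Evol-Instruct style.
# The prompt is a paraphrase of the method, not the paper's exact template.
llm = pipeline("text2text-generation", model="google/flan-t5-large")

EVOLVE = ("Rewrite the following instruction so it is more complex and requires "
          "deeper reasoning, but keep it answerable:\n{instruction}")

def evolve(instruction, rounds=3):
    for _ in range(rounds):
        out = llm(EVOLVE.format(instruction=instruction), max_new_tokens=128)
        instruction = out[0]["generated_text"].strip()
    return instruction

print(evolve("Explain what a tokenizer does."))
[/code]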
>>22204 Good news, indeed.
>>22204 >We call the resulting model WizardLM Haha excellent! :^) Never give up, never surrender!
Open file (47.18 KB 600x375 Aegis.full.1946321.jpg)
A finetuned GPT2-1.5B model beats Alpaca-7B using a 700 MB dataset of generated instructions and responses from ChatGPT. Next to come will be metalearning via filtering training data to maximize performance. Combined with external memory we should be able to generate datasets even better than ChatGPT. Things are starting to get interesting.
Dataset: https://huggingface.co/datasets/MBZUAI/LaMini-instruction (parquet format, I recommend using pyarrow)
Models: https://github.com/mbzuai-nlp/LaMini-LM
Paper: https://arxiv.org/abs/2304.14402
tl;dr more high-quality data with greater variety is all you need
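Since the dataset ships as parquet, pyarrow reads it directly; a minimal sketch below, with the shard filename as a placeholder for whichever file you pull from the repo.
[code]
import pyarrow.parquet as pq

# Load one LaMini parquet shard and inspect its schema before doing anything else.
# The filename is a placeholder for whichever shard you downloaded from the dataset repo.
table = pq.read_table("lamini_instruction_shard.parquet")
print(table.schema)
print(table.num_rows)

# Convert to pandas only if it fits comfortably in memory.
df = table.to_pandas()
print(df.head())
[/code]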
Open file (107.56 KB 750x544 LdeYPwt.jpg)
>We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
>Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute.
Code to build dataset: https://github.com/mlfoundations/datacomp
Paper: https://arxiv.org/abs/2304.14108
Website: https://www.datacomp.ai/
They could do much better but this is a start.
>Synthetic Data from Diffusion Models Improves ImageNet Classification
Paper: https://arxiv.org/abs/2304.08466
Datasets are shrinking. Loss functions suddenly dropping. Better synthesized data piling. Are you ready to foom?
>>22215 >>22216 Very encouraging Anon, thanks! :^)
Even more instruction tuning data, this time with responses generated by GPT-4. A 7B LLaMA model finetuned with this dataset greatly outperforms Alpaca and is competitive with GPT-4 when rated by human evaluators on helpfulness. https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
>>22240 Wow. This could be a big deal sounds like. Please keep us up to date with your research on it Anon! :^)
Some guys generated 80GB of dialog data:
https://huggingface.co/datasets/conceptofmind/flan_dialog_submix
With conversation APIs now dark, training with high-quality generated data is more important than ever, so I'm working on filtering this dataset for logical correctness, for virtue (purging anything woke and normie trash), and for originality (semantically deduplicating it). I'm open to suggestions for further filtering. When I have time I'll make a second dataset based off the filtered one that focuses on difficult dialogs, since high-IQ data is much more beneficial for training.
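For the originality pass, one simple (if naive) approach is embedding-based near-duplicate filtering with sentence-transformers: drop any dialog whose embedding is too close to one already kept. The model choice and the 0.92 threshold below are arbitrary starting points, and this O(n^2) loop is only illustrative; an 80GB corpus would need an approximate nearest-neighbor index instead.
[code]
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Simple semantic near-duplicate filter over a list of dialog strings.
model = SentenceTransformer("all-MiniLM-L6-v2")

def dedupe(dialogs, threshold=0.92):
    kept, kept_embs = [], []
    for text in dialogs:
        emb = model.encode(text, convert_to_tensor=True)
        # Keep the dialog only if it isn't too similar to anything already kept.
        if all(cos_sim(emb, e).item() < threshold for e in kept_embs):
            kept.append(text)
            kept_embs.append(emb)
    return kept

print(dedupe(["How do I clean a dataset?",
              "How can I clean my dataset?",
              "What's the best way to cook rice?"]))
[/code]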
>>23169 This sounds like a tremendously-helpful target goal Anon. Godspeed. :^)
>SlimPajama cleans and deduplicates RedPajama-1T, reducing the total token count and file size by 50%. It's half the size and trains twice as fast!
>It’s the highest quality dataset when training to 600B tokens and, when upsampled, performs equal or better than RedPajama. It was no mean feat to deduplicate data on this scale – existing tools do not scale to a trillion tokens. We built a custom parallel data pre-processing pipeline and are sharing the code open source with the community.
>We’d like to thank our partner Opentensor for supporting this project. And credit goes to Together Compute and the entire team that created the RedPajama dataset!
SlimPajama dataset: https://huggingface.co/datasets/cerebras/SlimPajama-627B
Libraries for data pre-processing: https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama
Blog: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
>>23742 Those are excellent-looking proportions & shading, NoidoDev. If they screw up the next Alita, then some anon can use this to do things right! :^)
>>24766
>So just to link this for info. The Mother of all datasets. The beast, the omni-cron, super extravaganza deluxe of datasets, 3.6M files and ~53.13 TB. The libgen book torrent links.
How much of this is related to the Sci-Hub archive? There's really a lot of great data out there. I planned to get at least one 18TB HDD, it's already in my shopping cart, but that's not much compared to these sizes.
>>24865
>How much of this is related to the Sci-Hub archive?
It's one of their links when you search for sci articles. They have more than one link. If you go here http://libgen.rs/ and click on the scientific articles radio button, then search, you will see the files. Open one in a new tab and you will see several download locations, usually. On the front page there's a drop-down selection for Tor.
>bots from JanitorAI
https://janitorai.me (NSFW!)
Looks like prompts, descriptions for waifu bots, which might come in handy for modelling personalities.
>archive of ~70GB of card ... here
https://chub-archive.evulid.cc/#/janitorai
>You can also download the archive from here if you're interested
https://pixeldrain.com/l/Yo8V2uxh
> post-related : (tiny-textbooks, >>25725)
Open file (412.20 KB 826x333 Screenshot_257.png)
Not for downloading, but for gathering data (>>29928):
>Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies.
