/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.


Datasets for Training AI Robowaifu Technician 04/09/2020 (Thu) 21:36:12 No.2300
Training AI and robowaifus requires immense amounts of data. It'd be useful to curate books and datasets to feed into our models, or possibly build our own corpora to train on. The quality of data is really important: garbage in, garbage out. The GPT-2 pre-trained models, for example, are riddled with 'Advertisement' after paragraphs. Perhaps we can also discuss and share scripts for cleaning and preparing data here, and anything else related to datasets.

To start, here are some large datasets I've found useful for training chatbots:
>The Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/
>Amazon QA
http://jmcauley.ucsd.edu/data/amazon/qa/
>WikiText-103
https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
>Arxiv Data from 24,000+ papers
https://www.kaggle.com/neelshah18/arxivdataset
>NIPS papers
https://www.kaggle.com/benhamner/nips-papers
>Frontiers in Neuroscience Journal Articles
https://www.kaggle.com/markoarezina/frontiers-in-neuroscience-articles
>Ubuntu Dialogue Corpus
https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus
>4plebs.org data dump
https://archive.org/details/4plebs-org-data-dump-2020-01
>The Movie Dialog Corpus
https://www.kaggle.com/Cornell-University/movie-dialog-corpus
>Common Crawl
https://commoncrawl.org/the-data/
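Since garbage in is garbage out, even a crude filter pass before training helps. A minimal sketch of stripping those stray 'Advertisement' lines (the exact junk patterns in a real scrape will vary, so treat the regex as a starting point, not a complete cleaner):

```python
import re

def clean_text(text):
    """Remove stray 'Advertisement' lines and collapse the blank runs they leave."""
    lines = text.split("\n")
    # drop lines that are nothing but the word 'Advertisement' (optional punctuation)
    kept = [l for l in lines if not re.fullmatch(r"\s*Advertisement\s*\.?\s*", l)]
    cleaned = "\n".join(kept)
    # collapse 3+ consecutive newlines down to a single paragraph break
    return re.sub(r"\n{3,}", "\n\n", cleaned)
```

The same shape works for any line-level junk: add one regex per artifact you spot while eyeballing samples of the corpus.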
I prepared the Cornell Movie-Dialogs Corpus into a text file for training, with <|endoftext|> tokens between conversations: https://files.catbox.moe/pvi2ef.xz
Website: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
If I messed something up or the file is taken down, here's the Python script to regenerate it:

import json

# build a lookup from line ID to (speaker, text)
lines = open("movie_lines.txt", "rb").read().decode("utf-8", errors="ignore").strip().split("\n")
line_db = {}
for line in lines:
    fields = line.split(" +++$+++ ")
    line_db[fields[0]] = (fields[3], fields[4])

# walk each conversation and print its lines in order
conversations = open("movie_conversations.txt", "rb").read().decode("utf-8", errors="ignore").strip().split("\n")
for conversation in conversations:
    line_ids = json.loads(conversation.split(" +++$+++ ")[3].replace("'", '"'))
    for line_id in line_ids:
        speaker, text = line_db[line_id]
        print(f"{speaker.title()}: {text}")
    print("<|endoftext|>")
>>9500 Thanks, got it. Especially appreciate you showing us how it's done, too!
Open file (78.98 KB 965x631 human feedback.png)
Anyone here able to run GPT-2 on their GPU and wanna help build a dataset? I've made a script for creating data to train a reward model by using GPT-2 to generate conversations between two random characters (bot-to-bot). Basically all you have to do is pick the best responses it generates until reaching the max token length and it starts generating another conversation. You can also write in your own responses if the generation is particularly bad or stuck.
Open file (183.92 KB 850x650 1616562664087.png)
>>9503 Looks like an interesting project. Might even turn out to be a rudimentary beginning to anon's Robowaifu@home idea (>>8958), in that we start sharing our own efforts onto a common dataset. Do you have any plans to redistribute the results openly for everyone's benefit back here Anon? Also, this vaguely reminds me in a fashion of anon's >replacing rewards with examples post (>>9438 and following). I wonder if you might be able to kind of integrate that sort of approach into your project?
>>9512 The dataset will be maintained on GitLab. I think that's the easiest way for both contributing data and getting updates, and people can fork it to make different versions if they wish.

The recursive classification algorithm is similar to temporal difference learning and needs time steps. I can see this algorithm being useful for completing tasks in a conversation, but I don't think this dataset will be much help to it since there aren't any tasks being solved, unless the end results of conversations are reformulated into tasks somehow. Recursive classification still needs another paper or two to develop it into solving multiple tasks, including ones it has never seen before in training, for unsupervised learning.

There are other ways, though, to make this dataset useful beyond a human feedback reward model. MuZero's dynamics model, which predicts the next state given an action and the current state, could be modified into seeking a goal state rather than trying to win a board game. Given enough processing power and hindsight experience replay (HER) to learn from mistakes, it might be able to learn how to lead a conversation to a target state. The recursive classification algorithm's results aren't quite as impressive as HER, which can learn multiple different tasks, and there has been a significant improvement to HER by combining it with expectation maximization: https://arxiv.org/abs/2006.07549
>>9517 OK, count me in if I can manage to run it with you. My best box has an i7 in it. It's for school, but I can probably set up a dual-boot for it. Is Ubuntu good enough, distro-wise (Python versions, etc.)? I would appreciate detailed, tutorial-style setup, operation, and results pushes, etc., if you would please.

I would recommend you consider approaching this effort the way the Foldit guys did (>>9028). Namely, fashion it like a game that anons can 'play'. The fundamental premise seems to lend itself to this paradigm, and we're more likely to see ongoing participation by only vaguely interested anons that way IMO.

This also sounds like a project that should probably have its own thread. That way the long chain of (entirely unrelated) dialogue about development doesn't detract from this dataset thread, which seems more like it should be a 'library archive' type thread to me.
>>9520 >and we're more likely to see ongoing participation by only vaguely interested anons that way IMO. Also, on this same general tack, what about the idea of setting up a server online and letting many, many anons 'play' along with the waifu this way. After all in this scenario, it's not the raw horsepower and number-crunching that is the valuable thing. Rather obtaining the human-reasoning needed to assess and score the reasonableness of any particular output is the valuable bit. There's a lot more that could be said, but we should probably hold off on it until you make a decision about a new project thread or not.
>>9520 I don't recommend running GPT-2 on a CPU because it's so slow. As the conversation gets longer it will take over half a minute to generate responses, even with just the small model and a fast CPU. If there was already a crude T5 chat model it'd be a different story, but until then an Nvidia GPU with at least 3 GB and CUDA 7.0, or Google Colab, is needed.

The code should run on Linux, Windows or Mac just fine. I recommend using Python 3.8 or 3.7 with PyTorch 1.8.x and CUDA 11.1. CUDA installation will depend on your platform. I'm not familiar with other distros, but on Debian Linux buster-backports provides CUDA 11.1. Older cards from 2018 or earlier should be fine with just CUDA 10.2, which PyTorch also supports. Make sure when installing Python on Windows to check 'Add Python 3.x to PATH', otherwise Python won't run from the terminal. Then get PyTorch: https://pytorch.org/get-started/locally/ (it's easiest to install with pip). Once Python and PyTorch are installed, install the transformers library from a terminal with:
python -m pip install --user transformers
This should be all you need.

>>9523 If I had the money for a server I'd rent a GPU instance, train it on extra data and have it done in a few hours. The motivation of this dataset is to train the model with as little compute as possible. I just need a little help for now to push it into a usable state so it can be used instead of GPT-2. T5 takes less than half a second on the CPU to generate a response. That'd make it much more approachable for anons to participate without expensive hardware. The validation perplexity is at 25 now, so there's hope.

Another idea I have is to train the reward model with sentences taken from other samples in the dataset that don't match the conversation, or by discriminating its own generated responses as a GAN. I think the former would help it learn more sensible replies; a GAN might be too unstable to train.
I'll make a thread for it later so these posts can be moved there.
>>9526 Alright. I'll try to set up a dual-boot before the upcoming weekend is out. Debian, is it Anon? >The validation perplexity is at 25 now so there's hope. Good news. Please keep us up to date Anon. It would be absolutely marvelous to be able to run a reasonably responsive and competent chatbot on embedded hardware today. And I would say there's no need to make a new thread unless you yourself deem it reasonable.
>>9569 Yeah, personally I prefer it because they supported my ancient Pentium 4 for nearly two decades, but it tends to lag behind in updates because of this support for older systems. It took over a year to even use the full capabilities of the GPU I bought two years ago. Mint and Ubuntu are the easiest to install and use, and tend to have more recent updates if you don't need that long-term stability.

The chat model has been inching forward but has slowed to a crawl. I've been playing around with different hyperparameters but no luck. I think it has bottomed out unless I start doing 2-hour optimizer steps with a ridiculous number of gradient accumulation steps. The validation set perplexity is at 20 now, which isn't bad but not quite good enough either. Once I start throwing other training tasks at it there should be further gains, and we might be able to skip GPT-2 completely. Ideally about 10 GB of data is needed for training, but I only have 40 MB of high-quality chat data, so the only option right now is to train on other data. I've written a Wikipedia and Stack Exchange scraper to soak up a ton of data; I just need to process it into tasks and train. It all hinges on the reward model working well, really. Without a working reward model, creating this dataset won't make a big difference unless a team of novelists cranks out 100 books' worth of data.
>>9575 >but I only have 40 MB of high-quality chat data What data do you need? Could you use extracted dialogs from subtitles?
>>9595 >Could you use extracted dialogs from subtitles? I think he's already doing that Anon: >>9408
>>9600 Ah, I see, but that's limited to the source of a single waifu. Finding out how she would say something, then automatically rewriting other dialogues with a script to match, might help create more data.
>>9595 Any data that can be formulated into a query and response can be used. At the moment I'm using subtitles from anime and movies with character names but I could potentially use ones without names and just predict the next sentence or line. Some of what it learns on other data and tasks will transfer to learning chat dialog. I could train it on a variety of other tasks like reversing the order of words in a sentence, labeling parts of speech, translation, determining whether recipes are highly rated or not, or go Jeopardy mode and predict the query from a response. The only limit of what you can teach a text-to-text transformer is what you can fit into text, your processing power, and the amount of data you have. The question is what would be the best data to train on? I have a very tiny amount of compute and can't test a hundred different things. It's not obvious what skills are necessary to comprehend a sentence either and what tasks will improve those skills. Basketball training drills for instance don't look anything like basketball but they significantly improve someone's performance.
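One simple way to turn raw subtitles into query/response data is to treat each line as a response to the one before it. A rough sketch for .srt files; the max-gap heuristic for detecting scene breaks is my own assumption, not necessarily what's being used above:

```python
import re

def srt_to_pairs(srt_text, max_gap=5.0):
    """Parse .srt subtitles into (query, response) pairs of consecutive lines.
    Lines separated by more than max_gap seconds are treated as scene breaks."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
        r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*\n(.*?)(?=\n\n|\Z)", re.S)
    cues = []
    for m in pattern.finditer(srt_text):
        start = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3)) + int(m.group(4)) / 1000
        end = int(m.group(5)) * 3600 + int(m.group(6)) * 60 + int(m.group(7)) + int(m.group(8)) / 1000
        text = " ".join(m.group(9).split())  # join wrapped cue lines into one
        cues.append((start, end, text))
    pairs = []
    for (s1, e1, t1), (s2, e2, t2) in zip(cues, cues[1:]):
        if s2 - e1 <= max_gap:  # skip pairs spanning a long silence
            pairs.append((t1, t2))
    return pairs
```

Character names, when the subtitles have them, could be split off the front of each cue before pairing.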
>>9603 Well, one thing I have is a metric boatload of raw JSON of shitposting for roughly a year and a half or so. Probably 60+ boards if I dug around. Now this is, again, shitposting, so YMMV. But many of these are deterministically post/response. I could imagine automatically going through it all, finding the many thousands of post/reply pairs, and then maybe getting a human involved in checking that each is in fact a query/response pair? I've never tried to pull all the JSON out of these archives in their entirety, but I do it for /robowaifu/ here generally a few times a week. So it shouldn't be too difficult to pull those and push the archive file to catbox.
>>9605 BTW, I've already gotten a fairly well wrung-out mechanism for parsing the raw JSON into individual posts written (BUMP, Waifusearch), so it might be wise if you can clearly specify how you'd like the data parsed out, and I could do a lot of preprocessing data-massaging in advance for you, instead of just giving you a big dump of unprocessed raw JSON files.
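For what it's worth, a first pass at massaging thread JSON into query/response pairs is just following the >>No. quote links. A sketch, assuming a thread object with 'threadId'/'postId'/'message' fields; the actual field names in the dumps may differ, so adjust to taste:

```python
import json
import re

QUOTE = re.compile(r">>(\d+)")

def thread_to_pairs(thread_json):
    """Extract (quoted post, reply) text pairs from a thread JSON string."""
    thread = json.loads(thread_json)
    # index every post's text by its number (OP plus replies)
    by_id = {thread["threadId"]: thread["message"]}
    for post in thread.get("posts", []):
        by_id[post["postId"]] = post["message"]
    pairs = []
    for post in thread.get("posts", []):
        for ref in QUOTE.findall(post["message"]):
            if int(ref) in by_id:
                # strip the quote links themselves from the reply text
                reply = QUOTE.sub("", post["message"]).strip()
                pairs.append((by_id[int(ref)], reply))
    return pairs
```

Posts quoting several others would be emitted once per quote, which is probably where the human checking pass earns its keep.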
>>9607 Also BTW, if you haven't ever done so yet, you can examine exactly what the JSON I'm speaking of looks like for yourself Anon. For example: http://bhlnasxdkbaoxf4gtpbhavref7l2j3bwooes77hqcacxztkindztzrad.onion/robowaifu/res/2300.json https://alogs.theГунтretort.com/robowaifu/res/2300.json (Just replace the domain w/ the proper AlogSpace URI if you don't use Tor)
>>9603 That post reminded me that there's a fan-run website about the game show Jeopardy. Their archive has over 400,000 clue-question pairs: https://j-archive.com/ I think the more funny and intelligent the quiz show is, the less useful it is for building basic understanding. The gold standard of entertaining quiz games is You Don't Know Jack IMHO. That stuff is too witty and punny to easily build anything faintly resembling common sense from. The questions from YDKJ are too fragile; by that I mean slightly changing the wording is likely to completely screw with their meaning. Jeopardy isn't quite like that. Still, something more dull and easy than Jeopardy would be better. Maybe some ROMs from quiz games aimed at children have good, boring common-sense stuff in plain text.
Open file (15.55 KB 494x198 yummytea.png)
Open file (37.57 KB 985x318 simulation.png)
>>9671 Yeah, it just becomes a database lookup at that point. But I speculate it can learn some useful information from reversing common queries and responses. Like "I'm fine" is usually a response to "How are you?" We take this prior knowledge for granted, but unless a model learns it, it will fail to make any connection. An interesting future research project might be having the model generate its own tasks to learn and explore new ways of looking at data, unsupervised.

Update to >>9503
The past few days I've tried a dozen different experiments using the hidden state of the T5 encoder to discern whether a response matched a query, compared to a response taken from another random query. Nothing was able to learn anything, which is kinda depressing because it may mean it won't do well later with a recurrent network modulating the hidden states. I'm not really sure why, but I suspect it's because T5 was pretrained for 100 GPU-years or whatever amount of time on text-to-text, and trying to train it to be used in a completely different way with a pitiful GPU in a few hours is not happening.

So I started feeding those queries and responses into the T5 model, asking if the response makes sense for the query, and having it output the labels yes or no. Surprisingly it had no struggle learning this way, and the responses are becoming much more sensible, even though it's only capable of discerning the right answer 70% of the time so far. In the state it's in, it might even be usable in place of GPT-2 for generating a chat dataset, although the quality still has a long way to go. The first image shown is a conversation generated by picking from 3 responses by T5-chat as Dorothy while entering my own input for Jill; in the second image, T5-chat plays Haruhi. With this working now, the chat dataset project can be used to train T5-chat, so I'll be making a thread for it once everything is ready to go.
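The yes/no setup above amounts to building contrastive examples: each query paired with its real response ("yes") and with a response stolen from another random query ("no"). A sketch of the data side only; the exact prompt format here is invented for illustration, not the one actually used:

```python
import random

def build_yes_no_examples(dialog_pairs, seed=0):
    """From (query, response) pairs, emit text-to-text training examples:
    the true response labeled 'yes', a mismatched one labeled 'no'."""
    rng = random.Random(seed)
    examples = []
    for i, (query, response) in enumerate(dialog_pairs):
        # positive example: the real response
        examples.append((f"query: {query} response: {response} relevant:", "yes"))
        # negative example: a response from a different query
        j = rng.randrange(len(dialog_pairs) - 1)
        if j >= i:
            j += 1  # skip the current pair so the negative is always mismatched
        examples.append((f"query: {query} response: {dialog_pairs[j][1]} relevant:", "no"))
    return examples
```

Harder negatives (responses from nearby turns of the same conversation) would likely teach a sharper discriminator than fully random ones.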
>>9671
>Maybe some ROMs from quiz games aimed at children have good boring common-sense stuff in plain text.
Yeah, we're dealing with some odd juxtapositions in our endeavors here, and this is a fundamental one tbh. IMO, the only reasonable hope we have ATM of AI communications that will stand up to hours of engagement are ones that are necessarily 'dumbed down' to a child's level. (BTW, even that is already vastly beyond any of the animals, so a rather notable achievement.) However, we also plainly want to be able to engage with our waifus as adults. A bit of a conundrum for now, I'd say.
>sage for off-topic
>>9678
>chats
lol. Well, sounds like you made a nice breakthrough by using a wonderfully simplistic 'trick'. That's encouraging. No doubt we'll all be able to tick off 100 GPU-years of compute on our 'smart' watches in a few years, but for now this is very likely our best kind of approach: finding clever tricks that get right to the heart of the problem.
I just mentioned ChatterBot somewhere else; here's the corpus. The URL might not work as posted, thanks to the forum software's dumb word filter: https://github.com/gunthercox/chatterbot-corpus - this link https://chatterbot.readthedocs.io/en/stable/corpus.html might be better anyway, since it comes with some explanations.
>>9753 >conversations: >have you read the communist >yes, marx had made some interesting observations. >stock market >you can never really predict the stock market. >stock market >my lawyer said i shouldn't give stock tips online. >stock market >mutual funds might be better unless you are wealthy. >stock market >i'm not sure an individual alone can really beat the market. >56 KB Top-tier conversation quality
>>9765 I gave an answer on how to handle this. But, I put it in the thread about chatbots here >>9780
Open file (40.28 KB 1112x1075 Selection_003.jpg)
Not sure if this is the right thread OP, just let me know and I can delete it if not. On this video (>>10463), the author promotes Weights and Biases papers page. It now redirects to a community page that seems like it might be interesting to the ML practitioners here on /robowaifu/.
Open file (174.76 KB 1196x828 archive.moe.png)
Some archives of 4chan posts from 2008-2015
SQL Database: https://archive.org/download/archive-moe-database-201506
Files: https://archive.org/details/@archivemoe
Penfifteen Archive from 2004-2008: https://archive.org/details/studionyami-com_penfifteen-2012-03-05
And moar post archives: https://wiki.archiveteam.org/index.php/4chan

I'm working on some dataset-generating scripts for finetuning language models, including image-post pairs for multimodal training >>11731 It'll take a few months to download and process all the data. My plan is to compress the images to 384x384 WebP files so each dataset isn't 200+ GB per board (/v/ is over 2 TB). SqueezeNet's input size is 227, AlexNet's is 256 and VGG's is 224, so I think that is sufficient and leaves room for data augmentation. If someone has the hardware to train StyleGAN2 at 512 or 1024, I'm sure they can download the archives and regenerate the dataset with the scripts. I'll release the image datasets and each board separately so people can pick what they want. Also, if anyone wants to help, I'll post the scripts when they're ready.
>>11778 Bandwidth is a real issue for me currently. I'll try to help out later. >4chan posts from 2008-2015 Nice. Pretty classic era tbh.
>>11778 >My plan is to compress the images to 384x384 webp files so each dataset isn't 200+ GB per board (/v/ is over 2 TB). Good thinking. How are you planning to situate each selection frame for each image Anon? Seems quite impractical to do by hand, yet there's a need for accurate placement/pre-scaling to capture the vital essence of each image, humanly-speaking.
>>11782 It took me three days just to download the database, 57.5 GB compressed. >>11889 I will be resizing the largest dimension down to 384 or smallest dimension up to 256. That way models can select any 256x256 crop of that, perhaps using a spatial transformer network to position the crop. However, GIFs and Webms will pose a significant challenge. I will skip those for now.
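The resizing rule described above (largest dimension down to 384, or smallest dimension up to 256, preserving aspect ratio) fits in a small helper. A sketch of the arithmetic only, rounding to whole pixels:

```python
def target_size(width, height, max_dim=384, min_dim=256):
    """Scale so the largest dimension is at most max_dim,
    or the smallest dimension is at least min_dim, preserving aspect ratio."""
    largest, smallest = max(width, height), min(width, height)
    if largest > max_dim:
        scale = max_dim / largest      # shrink large images
    elif smallest < min_dim:
        scale = min_dim / smallest     # enlarge small images
    else:
        return width, height           # already in range
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Extreme aspect ratios can't satisfy both bounds at once; this version prioritizes the max_dim cap, which matches the storage-budget motivation.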
Wake the fuck up robofucker, we got an imageboard to burn: https://files.catbox.moe/6pslbq.xz These are /robowaifu/ posts up to July 2021 containing a post chain for each post. The script to regenerate it from the json files is included. The chains have been shuffled to avoid repetitions that would throw transformers into seizures. Now there's no excuse not to have a shitposting waifu while wasting time talking about making a robowaifu. Pray to God she motivates you into actually working.
>>12044 >It took me three days just to download the database, 57.5 GB compressed. Lel. Please go easy on us assistants, and try to figure out how to slice that up into MUCH smaller parts so we can help you out here, Anon. Think push the compute load out to the edges of the network, not saturate the wireline! :^) This is an exciting project idea, I sure hope you can pull it off. It should revolutionize the whole 'fun with waifus' paradigm!
>>12051 LOL. That was fast Anon, thanks!
>>12052 I can handle the database processing. The issue is the images. They're split up into 10 GB tar files which can be extracted with:
cpio -D output_path -ivd -H tar < images.tar.ab
However, some of the files will be lost doing it this way, since they're one tar file split into multiple parts.
Raiders of the Lost Kek
3.5 years of /pol/ posts, June 2016 - November 2019 (in JSON format)
Paper: https://deepai.org/publication/raiders-of-the-lost-kek-3-5-years-of-augmented-4chan-posts-from-the-politically-incorrect-board
Download: https://zenodo.org/record/3606810

sudo apt-get install zstd
unzstd pol_0616-1119_labeled.tar.zst
tar -xvf pol_0616-1119_labeled.tar
>>12194 LOL. I'm shocked they published this publicly. Seems likely it's not earning them any brownie points that way? Regardless, better get it while you still can Anon. Save.Everything.
>>12194 Interesting project Anon. Vaguely curious about how much work they put into it.

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
>abstract
>This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.
For speech recognition, Mozilla has the "Common Voice" corpus with multiple languages, if anyone is interested. English alone is over 2,000 hours of spoken phrases, about 65 GB. The other languages I looked at were around 20 GB each, but it varies. You can also add to the project yourself by recording or verifying phrases. https://commonvoice.mozilla.org/en
>>14321 Thanks, that might be useful. I wonder if we can also use subtitle files with their related shows and movies.
>>14325 waifusearch clipchan
>>14326 Lol, how could I forget. Blackout, because of insufficient sleep.
>Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. This is going to be an extremely important dataset towards blessing language models with sight. After training on this dataset it should be possible to train with human feedback to produce more detailed descriptions for improving the model's visio-linguistic understanding. Other modalities could be explored from there like problem solving from images, critiquing artwork, playing games, watching videos, webcam interaction, automated web search, and a lot more. It'll open up a lot of possibilities. Website: https://github.com/google-research-datasets/wit Download: https://github.com/google-research-datasets/wit/blob/main/DATA.md
>>15365 >This is going to be an extremely important dataset towards blessing language models with sight. Sounds like a real breakthrough may be around the corner Anon. Please keep us all up to date on things with this.
Open file (833.87 KB 1555x818 laion-400m.png)
Some incredibly based independent researchers put together an image-text-pair dataset to open-source OpenAI's work so people can replicate DALL-E and do other multimodal research.
Dataset: https://laion.ai/laion-400-open-dataset/
Direct download: https://www.kaggle.com/datasets/romainbeaumont/laion400m (50 GB total, or it can be downloaded in 1.8 GB parts according to necessity or hardware limits)
Paper: https://arxiv.org/pdf/2111.02114.pdf
Tool to search the dataset by text or image: https://rom1504.github.io/clip-retrieval/

To use the dataset you need something that can read parquet files. I recommend fastparquet, which uses a minimal amount of memory:

# python -m pip install fastparquet
from fastparquet import ParquetFile

DATA_PATH = "part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet"
pf = ParquetFile(DATA_PATH)
row_group_iter = iter(pf.iter_row_groups())  # each row group has about 1M rows
row_group = next(row_group_iter)
row_iter = row_group.iterrows()
i, row = next(row_iter)
row[1], row[2]  # (image_url, text)
row.keys()  # ('SAMPLE_ID', 'URL', 'TEXT', 'HEIGHT', 'WIDTH', 'LICENSE', 'NSFW', 'similarity')

Or you can use img2dataset, which will download the images locally and resize them: https://github.com/rom1504/img2dataset

The quality of the dataset isn't exactly as spectacular as >>15365 but probably as good as you can get from a raw scrape of the internet, and it has a much larger breadth of content. There's also LAION-5B, but it's so massive it's beyond our capabilities to really use right now: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
>>15834 >chobits hentai pictures wallpaper chobits
>>15834 >Some incredibly based independent researchers put together an image-text-pair dataset to open-source OpenAI's work so people can replicate DALL-E and do other multi-modal research. That is very exciting Anon. Thanks for the heads-up!
>>15834
>Or you can use img2dataset which will download the images locally and resize them: https://github.com/rom1504/img2dataset
I just wonder if we can somehow capitalize on something at least vaguely similar to the approach Nvidia is using for its proprietary DLSS: https://en.wikipedia.org/wiki/Deep_learning_super_sampling
Basically, have an image analysis pipeline that does the vast bulk of its work at lower resolution for higher 'frame' rates, and then does a DL, Waifu2x-style upscaling near the latter end of the pipe?
>>15851 For image generation certainly, but for image analysis not so much. However, a lot of work has gone into finding optimal models with neural architecture search. EfficientNetV2, for example, starts training at a lower resolution with weak data augmentation, then gradually increases the resolution and difficulty to minimize the amount of compute needed to train it. That last bit of high-resolution training is unavoidable, though, if you want to extract useful information from it. https://arxiv.org/pdf/2104.00298.pdf

>>15835 Kek, I think they said 1% of the dataset is NSFW, and it's only labelled so by image content. I have an idea though: create a reward model for good image labels and then use it to filter out the poorly captioned images. Finetuning on cleaner data should fix a lot of the weirdness CompVis/latent-diffusion generates and improve CLIP. Another possibility might be using the reward model to generate superhuman-quality captions for images. In the human feedback paper, the 1B-parameter model's generated summaries were preferred 60% of the time over the actual human-written summaries, and 70% with the 6B model. https://openai.com/blog/learning-to-summarize-with-human-feedback/ To go even further beyond, it might be possible to generate these superhuman captions, score them, finetune the reward model on the new ones, and train the caption generator to make even better captions in an iterative loop, creating extremely high-quality datasets that would take 10 million man-hours to make by hand.
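The filtering step is just thresholding the reward model's score over each caption. A sketch of that plumbing only; the score function below is a stand-in for the actual trained reward model, which doesn't exist yet:

```python
def filter_captions(samples, score_fn, threshold=0.5):
    """Split (image_url, caption) samples into kept/dropped
    by a caption-quality score. score_fn stands in for the reward model."""
    kept, dropped = [], []
    for sample in samples:
        (kept if score_fn(sample[1]) >= threshold else dropped).append(sample)
    return kept, dropped
```

The threshold would be tuned by eyeballing the kept/dropped split on a held-out sample rather than picked up front.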
Open file (152.95 KB 508x711 trivia_qa.png)
Open file (108.89 KB 412x516 hotpot_qa.png)
Stumbled across a high-quality dataset for reading comprehension. It provides aliases for answers so an answer like "the USA" can be accepted for "the United States". It also gives multiple documents from search results or Wikipedia for each question. >We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. Paper: https://arxiv.org/abs/1705.03551 Download: http://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz (2.7 GB) HotpotQA was briefly mentioned here already but it's also a high-quality reading comprehension and reasoning dataset that shouldn't be missed. >Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison. 
>We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
Paper: https://arxiv.org/abs/1809.09600
Download: https://hotpotqa.github.io/
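The alias feature means scoring can't be a plain string compare: "the USA" has to count as "the United States". A sketch of normalized alias matching; the normalization rules here are a common-sense guess, not TriviaQA's official evaluation script:

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(prediction, aliases):
    """A prediction counts as correct if it matches any alias after normalization."""
    norm = normalize(prediction)
    return any(norm == normalize(alias) for alias in aliases)
```

Used at evaluation time over each question's alias list, this gives credit for surface variants without letting unrelated answers through.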
>>17494 Thanks! DL'g it now.
Wonder if this can be used to compress on-board data, OP? https://github.com/mhx/dwarfs
