/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.





Open file (2.21 MB 1825x1229 chobit.png)
Robowaifu@home: Together We Are Powerful Robowaifu Technician 03/14/2021 (Sun) 09:30:29 No.8958
The biggest hurdle to making quick progress in AI is the lack of compute to train our own original models, yet there are millions of gamers with GPUs sitting around barely getting used, potentially an order of magnitude more compute than Google and Amazon combined. I've figured out a way, though, that we can connect hundreds of computers together to train AI models: gradient accumulation. It works by running several training steps and accumulating the gradients from each one (scaling the loss by the number of accumulation steps), then taking a single optimizer step. If you have a batch size of 4 and do 256 training steps before an optimizer step, it's like training with a batch size of 1024. The larger the batch size and the more gradient accumulation steps, the faster the model converges and the higher the final accuracy it reaches. It's the most effective way to use a limited computing budget: https://www.youtube.com/watch?v=YX8LLYdQ-cA
These training steps don't need to be calculated by a single computer but can be distributed across a network. A decent amount of bandwidth will be required to send the gradients each optimizer step, plus the training data. Deep gradient compression (DGC) achieves a gradient compression ratio from 270x to 600x without losing accuracy, but it's still going to take about 0.5 MB of download and upload to train something like GPT2-medium each optimizer step, or about 4-6 Mbps on a Tesla T4. However, we can reduce this bandwidth by doing several training steps before contributing gradients to the server. Taking 25 would reduce it to about 0.2 Mbps. Both slow and fast computers can contribute so long as they have the memory to hold the model. A slower computer might only send one training step whereas a fast one might contribute ten to the accumulated gradient. Some research needs to be done on whether a variable number of accumulation steps impacts training, but it could be adjusted as people join and leave the network.
All that's needed to do this is a VPS. Contributors wanting anonymity can use proxies or Tor, but project owners will need to use VPNs with sufficient bandwidth and dedicated IPs if they want that much anonymity. The VPS doesn't need an expensive GPU rental either. The fastest computer in the group could be chosen to calculate the optimizer steps. The server would just need to collect the gradients, decompress them, add them together, compress again and send the accumulated gradient to the computer calculating the optimizer step. Or, if the optimizing computer has sufficient bandwidth, it could download all the compressed gradients from the server and calculate the accumulated gradient itself. My internet has 200 Mbps download so it could potentially handle up to 1000 computers by keeping the bandwidth to 0.2 Mbps each. Attacks on the network could be mitigated by analyzing the gradients, discarding nonsensical ones and banning clients that send junk, or possibly by using PGP keys to create a pseudo-anonymous web of trust.
Libraries for distributed training implementing DGC already exist, although not as advanced as what I'm envisioning yet: https://github.com/synxlin/deep-gradient-compression
I think this will also be a good way to get more people involved. Most people don't know enough about AI or robotics to help directly, but if they can contribute their GPU to someone's robowaifu AI they like and watch her improve each day, they will feel good about it and get more involved.
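To make the gradient accumulation loop concrete, here is a minimal PyTorch sketch of the idea described above; the toy model, data and hyperparameters are placeholders for illustration, not part of any actual Robowaifu@home client:

import torch
import torch.nn as nn

# toy stand-ins for whatever model/data a real project would train
model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

per_step_batch = 4
accum_steps = 256          # effective batch size = 4 * 256 = 1024

optimizer.zero_grad()
for step in range(accum_steps * 4):
    x = torch.randn(per_step_batch, 32)
    y = torch.randn(per_step_batch, 1)
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # gradients keep adding up in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one optimizer step per 256 training steps
        optimizer.zero_grad()

Each of those inner backward passes is independent, which is what lets them be farmed out to different machines that only sync up at the optimizer step.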
At scale, though, some care will need to be taken that people don't end up agreeing to run dangerous code on their computers, whether through a library that constructs the models from declarative instructions or by some other means. And where the gradients are calculated does not matter: they could come from all kinds of hardware, platforms and software like PyTorch, TensorFlow or mlpack.
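The deep gradient compression mentioned above is what keeps the per-step traffic down. As a rough illustration (and only of the sparsification part; real DGC also does momentum correction, local clipping, and keeps the untransmitted residual for the next step), a top-k scheme that sends only the largest gradient values and their indices might look like this:

import torch

def compress_topk(grad: torch.Tensor, ratio: float = 0.01):
    # keep only the largest-magnitude ~1% of entries; send values + indices
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def decompress_topk(values, indices, shape):
    out = torch.zeros(shape, dtype=values.dtype)
    out.view(-1)[indices] = values
    return out

# example: a fake 10M-value gradient shrinks to ~1% of its entries on the wire
g = torch.randn(10_000_000)
values, indices, shape = compress_topk(g)
g_hat = decompress_topk(values, indices, shape)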
>>8958
I think this is both a good idea and one that has been suggested here on /robowaifu/ before. But the basic fact is that very few have the knowledge to contribute meaningfully to the actual development of such a distributed system, even if they favor the basic premise itself. Accordingly, these issues all come immediately to my mind:
a) It will need to be a push-button-simple setup for Anons who are willing to contribute their electricity and bandwidth to this endeavor. Just like Folding@home is, for example.
b) The majority of anons can't be expected to understand the nuances of AI techniques & technologies. They just want their robowaifus to talk to them effectively.
c) As you already clearly suggested, both security and anonymity are issues. If Anons don't trust the basic infrastructure itself, they are quite unlikely to participate in it.
Perhaps the model that SETI@home and Folding@home follow (including the basic work-distribution framework) can be utilized here successfully. Oddly enough, they don't seem to have many issues with people worrying about security & anonymity (I've contributed to the latter myself personally). OTOH, they aren't working towards a goal that has the potential to cause more than half the world's Western, (((pozzed))), population to scream RAEEEEEEEEEP!1111111 either. :^)
You seem to have some solid ideas and suggestions. Perhaps if you fleshed out specific details and approaches for the development and production roll-out of it, that would help further. I imagine there are probably a fair number of anons who casually roll in and out of /robowaifu/ from time to time who have skills in web development and systems administration, for example. If there was a clear-cut set of doable goals and methodologies spelled out clearly, then some of these visiting anons might actually decide to lend a hand to this project.
Personally, I can program in C++ a bit, but don't really have much else to contribute to this specific project afaict. Not that I'm unwilling to help out with it, but I can't take more on my plate at the moment to learn whole other sets of skills. I'm not sure I can see that being an application developer would bring much to the table here -- please correct me if I'm wrong. So, in that sense I'm basically just another anon who can contribute a modest amount of compute cycles from my computers. But again, there are plenty of others who do have the needed skills who come through occasionally. Getting them interested could be key here.
This is definitely worth investigating. And as an anon here pointed out, the availability of inexpensive used -- but powerful -- GPUs could be just around the corner for us all >>8954 . Maybe the timing for a Robowaifu@home project is fast becoming ripe.
>related >>5811 (and following)
One modest suggestion I'd make to the project is this: don't name it Robowaifu@home. Having to explain to your parents that this program you're running all hours of the day and night on their computers is actually intended to help give the world, well, robowaifus could be embarrassing at best. IMO it would be much better to call it RW@home. 'Robot World' could be the acronym's alleged meaning ("Hey, it's so I can grow up and work for SpaceX Dad! Please just think of Elon Musk for once, OK?"), but ofc we in the know realize it's RoboWaifu@home. That's it.
>>8958 >Or if the optimizing computer has sufficient bandwidth, it could download all the compressed gradients from the server I'd suggest you take the former tack to optimize bandwidth consumption. GPU power-growth curve is still well outperforming so-called Moore's Law. Bandwidth growth, however, isn't. Telecom infrastructure build-out is both slow and expensive and will likely remain that way for the foreseeable future. Add on top of that housing unit demand for (((Netflix))), et al, and you have a recipe for high-competition for available bandwidth that will only grow over time. GPU power growth OTOH is actually still on a steady projection for ever-better price/performance.
The way Hentai@Home created a giant distributed CDN built on top of BitTorrent was to give the people who run instances of it a local copy of the media they want, automatically updated with proper tags, instead of making them deal with downloading it manually themselves. There's also a reward structure that lets users download at high speeds, so it turns into an offsite backup service for their hentai collection. https://ehwiki.org/wiki/Hentai@Home Aside from the altruistic donation of bandwidth and GPU processing power to help speed development of an AI they might want to use, what would running Robowaifu@home provide as a benefit to the end user?
>>8963
>The majority of anons can't be expected to understand the nuances of AI techniques & technologies. They just want their robowaifus to talk to them effectively.
Yea, some thought is gonna have to go into how to utilize people's computers effectively and automatically. They might only have a 2 GB toaster GPU. While not ideal, it could still help prototype smaller models. I think feedback will be important, otherwise people will shut the program off one night and forget to turn it back on when they don't see any results. Larger models could be compressed for people to use on their computers so they can directly reap the benefits of their contributions. It'll need to be able to run on Windows, Mac and Linux to reach the most users. I imagine when they boot up this distributed training program it shows a list of projects their hardware is capable of contributing to, and the user selects which one they want to help. Part of the responsibility will be on project owners making their project pages look good enough that people want to lend their GPUs. Users could also dedicate their GPU to a project owner so it can be used for any project or prototype by them. I plan on making a simple version soon to utilize all my computers and friends' computers. I'm sure a proof of concept will eventually attract other developers.
The biggest issue will be securing it without nerfing what devs can do with it. The simplest solution would be to review code, manually approve projects and basically have package maintainers. And devs could choose to join untrusted projects that haven't been approved yet, since they can review the code themselves. It wouldn't be much different from the risk taken when installing open-source software. But there could also be a sandboxed version where people can prototype vanilla models by defining hyper-parameters and network structure from existing modules.
>>8964
>Would you guys contribute GPU cycles to create a GPT-3 clone?
The problem with trying to clone GPT-3 is that the model is too big to fit on anyone's GPU or in memory. The full-size GPT-3 requires around 16 x 48GB GPUs just to hold, and they likely have a few hundred or thousand GPUs, not just 16, doing gradient accumulation. The attention heads can be split up across devices in parallel, but the layers can't be split so easily and would incur a huge cost going back and forth between GPU and RAM, plus there's the bandwidth cost of sending all that data to the next computer to perform the next substep of the training step. It would be really inefficient, and the whole network would have to work together to do any inference on the model.
>>8965
Its purpose would be much more general. People could use the system for doing other projects unrelated to robowaifus and AI, such as finding twin primes or something else. It would be more like a crowd-sourced cloud computing platform. Adding a privacy mode is a good idea though, in case people do give embarrassing names to their projects, so other people using the computer only see 'Distributed Computing' or something like that.
>>8966
If necessary the bandwidth can be greatly reduced at the expense of accuracy. A little bit of noise from high compression doesn't seem to impact gradients too much since they're already quite noisy. We don't have to be too pessimistic about bandwidth growth though. Once Starlink finishes rolling out satellites it will have 1 Gbps connections. ISPs are already getting nervous that their cartel is threatened and have been doubling bandwidth to customers to keep them.
>>8982 >what would running robowaifu@home provide as a benefit to the end user? A virtual waifu and all her functions. Once basic chat is solved people are going to expand their virtual waifus to perform other functions such as playing video games, composing music, drawing, debating, summarizing research papers, searching the web, etc. People wanting these functions will contribute to those projects and receive a compressed version of the training results that their hardware can run or the full size version if they wish. Alternatively, someone could create a marketplace where people can pay crypto for compute, but I'm not familiar with how to do that. I think SingularityNET does something like that with AI services.
>>8990 > It would be more like a crowd-sourced cloud computing platform I see. Then all the more argument not to name the system Robowaifu@home. Some variant of CrowdCloud might be a more appropriate choice. Actually, a name like that could probably attract investment money, if you can secure it.
>>8991 Once you get investors you no longer own your projects. It's theirs to exploit. I think the whole point of this is to decentralize AI and avoid nobodies telling us what we can and can't compute and to give us an edge to compete with Big Tech. If Big Tech owns the platform they're not going to let that happen.
>>8990 >If necessary the bandwidth can be greatly reduced at the expense of accuracy I see. You did mention that before. That's actually rather convenient that you can make trade-offs and dial in functionality like that. >Once Starlink finishes rolling out satellites it will have 1 Gbps connections. I sure hope they pull it off, and then give it away practically for free. A man can dream. One point to mention here are goyphones *[shudders externally]*. If you could somehow quantify the 'total compute power potential' as a measure of which silicon die technology is being most heavily rolled out, I suspect the phones are already ahead of servers/desktops. Throw in Starlink etc. and mobile represents a sizable potential for raw compute power.
>>8990
>They might only have a 2 GB toaster GPU.
Even if they have a 16GB high-end GPU it might be incompatible with some tasks, as AMD uses a different memory structure that can't run many popular CUDA compute libraries. That's why my low-end 4GB Nvidia GPU almost tripled in price in the last year while similarly specced or slightly more powerful AMD cards haven't.
>But there could also be a sandboxed version
Nowadays, thanks to PCI passthrough with virtualization, this project could be 100% OS-independent with little to no performance penalty, on top of having built-in security. That requires very modern hardware to work properly, but virtualization is going to be a huge game changer in personal computing in the upcoming decade. Wendell from Level1Techs has been talking about the potential of this as it comes out of the server space for years now. He's also one of the few people that probably has 16 x 48GB GPUs in a server rack but is smart enough both to use them and to keep quiet about it.
>A virtual waifu and all her functions.
I can understand a collaborative effort at optimizing AI rather than having everyone do their own thing or replicate the same work, but what would be the benefits of running RW@H compared to just downloading the models it has already trained and running them on your own hardware, without using up extra bandwidth or electricity? That's the hard thing to come up with, and it would really sell this project.
>>8995
>That's the hard thing to come up with and would really sell this project.
Honestly, if the only pitch is 'results-driven' then it's not likely to even get off the ground (much). The altruism that has made all the X@home projects successful is White people with a sense of 'helping out for the greater good'. It's very culture-specific. If it simply boils down to nothing but a shekel-grubbing, 'what's in it for me?' mentality, then no, it's probably not going anywhere. Regardless, even things like BitTorrent have proven successful when only a very small number of us seed and 98% of only-self-interested exploiters don't. While that's quite a different model than this, it can at least be somewhat informative for the social dynamics of the thing.
>tl;dr
People will do it b/c they want to help. You know, give them bonus points for e-peen or jewel power-ups or something.
I think this is a brilliant idea, anon! I once joined a distributed computing group called "Mindmodelling@home" using BOINC (they were interested in curing neurological diseases, but of course I was there in the hopes of advancing A.I. for future robowaifus). But I left after a few months because they never seemed to post any updates and their project appeared to be dead. If something like this were to become reality, I'd upgrade my PC just to help crunch work units! We already have various options for robot bodies and synthetic voices, but it's the A.I. where we are severely behind. I also agree with >>8965 though, that we should name it something agreeable and generic like "robotworld@home" or "droidschool@home". So that the Western MSM has less to latch onto.
>>9000 >"droidschool@home" That's not bad IMO. What about something like "Mindschool@home" ? That seems like it could basically be construed to mean just about anything. Mommies might even approve of something named that! :^) >"How many roads must a man walk down?" >42
>>8993 This is a very good point. I think we should keep things BSD/MIT licensed so any anons can take our ideas and run with it. But you're correct about investors basically being sharks. In fact most of them have a fundamental MO of ousting the founders before long. E.G., Cisco Systems, and countless others.
>>8990 >The problem with trying to clone GPT-3 is the model is too big to fit on anyone's GPU or in memory. Hmm, I see. Well, my guess is that this system could be re-purposed relatively easily for different types of AI problems/solutions correct? What about something like Pattern-Exploiting Training (PET) ? (>>5793, >>5799 and following) Also, I think what Anon mentioned here >>9000 >BOINC Isn't that a generalizable type of flexible framework for this kind of thing? Do you think your project could utilize this?
>>8995
>but what would be the benefits of running RW@H compared to just downloading the models it has done and running it on your own hardware
It would be good if people prefer giving their hardware to train new models and functions that don't exist yet, instead of wasting power reinventing the wheel or trying to get a 1% improvement on old models. If someone has a good idea people will want to try it and help out, then move on to the next project once the model reaches an acceptable result.
>>9009
I'm not familiar with BOINC, but it appears capable of running Python with some headaches to deal with to make sure package versions are consistent across different platforms and systems. Using virtual environments should take care of that, but it's not really clear what their API is capable of doing, and the Python wrapper has limited functionality, which might be missing things necessary for distributed training. I couldn't find any Python machine learning projects on it, so I imagine it's lacking something.
>this system could be re-purposed relatively easily for different types of AI problems/solutions correct? What about something like Pattern-Exploiting Training (PET) ?
Yeah, any model that people want to create. The PET model is doable with 223M parameters. That's 2/3rds the size of GPT-2 medium. Beating GPT-3 in a small domain of few-shot learning with 0.1% of the parameters is remarkable, but it doesn't mean that PET excels in everything else. GPT-3 has other glaring flaws as well, like seeing everything as byte-pair encoded tokens, which gives it trouble with misspelled words and with discerning patterns in long strings like ABC.. etc. that a tiny character-level model can pick up on easily. Some informal research has found that GPT-3 seems to be just using the structure of the sentences and the parts of speech to predict text, rather than the actual meaning of the words. VAEs on the other hand can interpolate between the meanings of sentences, for example incrementally changing a bad review into a good one. They're notoriously difficult to train with GPUs but can still benefit from gradient accumulation and distributed training.
There are a lot of other cool models we could try out even with only seven computers contributing. That would reduce a week of training into one day. With 24, what would take two years could be done in only a month, assuming similar performance between them. With 100 or 200 we would have no problem rapidly iterating prototypes and advancing, and for most models we wouldn't reach diminishing returns until hitting around 1k, with benefits vanishing completely by 10k.
>>9018 >VAEs Just in case this guy has dug up something important to us here. https://github.com/matthewvowels1/Awesome-VAEs
>>9018 >BOINC >virtual environments <"Volunteer Computing and Virtualization - CERN Indico" > >There's a lot of other cool models we could try out even with only seven computers contributing. That would reduce a week of training into one day. With 24 what would take two years could be done in only a month, assuming similar performance between them. With 100 or 200 we would have no problem rapidly iterating prototypes and advancing, and for most models we wouldn't reach diminishing returns until hitting around 1k, with benefits vanishing completely by 10k. You don't have to convince me Anon, I'm already 'part of the choir' with you. OTOH, how do you convince this Anon >>8995 ? While I wish his point was entirely invalid, the simple fact is he's right. The vast majority of unused power out there resides on anon's computers who have grown accustomed to a 'gibbs me dat' mentality (not that I'm impugning him in any way in this regards, he's simply pointing the issue out). OTO-OH, many of these X@home projects do succeed at attracting numerous volunteer contributors -- even up to 100'000 of them. https://en.wikipedia.org/wiki/Folding_@Home#Patterns_of_participation > So, how do we successfully promote your idea far and wide? It will take a large exposure IMO to overcome the greed factor already mentioned and find sufficient altruism needed for good success. https://en.wikipedia.org/wiki/Distributed_computing https://en.wikipedia.org/wiki/Citizen_science
>>9021 >So, how do we successfully promote your idea far and wide? It will take a large exposure IMO to overcome the greed factor already mentioned and find sufficient altruism needed for good success. I was thinking of the game route, where gamers literally train their in-game waifus. I was thinking a "gacha" game originally, but maybe a PC option awards some sort of in-game cosmetic or reward for contributing processing power. Although, if it was to become a PC game, then my original plan of simplicity for mobile phone purposes means I have to actually attempt to think up an entertaining game that would attract gamers who need an excuse to donate such processing power. I am currently looking into certain legal aspects, and also the huge issue of making such "gacha" games addicting...the art. It seems that finding a good artist that wouldn't break the game on a rando niche start-up is going to be difficult. I rather program, but taking some time so I can learn how to draw might be my last resort just to get something started.
Open file (396.50 KB 1116x709 Selection_285.png)
Idea: What if we tried to make a game of some kind out of Robowaifu@home OP? Like the Foldit guys did. Can't we sort of interactively 'give grades' to the AI's work and help accelerate it towards actual semantic understanding that way? >
>>8990
>Yea, some thought is gonna have to go into how to utilize people's computers effectively and automatically.
That's going to take expertise to determine. The local client hardware can be probed for capabilities easily enough, but associating that with actual AI modeling potentials isn't something for a neophyte.
>They might only have a 2 GB toaster GPU. While not ideal, it could still help prototype smaller models.
True enough. Hopefully these 'smaller models' will become ever-more important in the future as our capabilities improve with time.
>I think feedback will be important, otherwise people will shut the program off one night and forget to turn it back on when they don't see any results.
Very true. Even if it's simply some kind of graphic related to the actual work going on, similar to Folding@home's approach.
>Larger models could be compressed for people to use on their computers so they can directly reap the benefits of their contributions.
Some kind of reduction pre-processing?
>It'll need to be able to run on Windows, Mac and Linux to reach the most users.
Obviously. I'd also suggest the potential of smartphones be investigated too.
>I imagine when they boot up this distributed training program it shows a list of projects their hardware is capable of contributing to, and the user selects which one they want to help.
Sounds like a good plan.
>Part of the responsibility will be on project owners making their project pages look good enough that people want to lend their GPUs. Users could also dedicate their GPU to a project owner so it can be used for any project or prototype by them.
Please tell us more about project managers and their roles?
>I plan on making a simple version soon to utilize all my computers and friends' computers. I'm sure a proof of concept will eventually attract other developers.
I'm sure we'd all be interested in seeing the specific progress you're making as you go along with that Anon.
>The biggest issue will be securing it without nerfing what devs can do with it.
It's easy to see why that could be a tension of interests. Some kind of sandboxing springs to mind just offhand.
>The simplest solution would be to review code, manually approve projects and basically have package maintainers. And devs could choose to join untrusted projects that haven't been approved yet, since they can review the code themselves. It wouldn't be much different from the risk taken when installing open-source software.
I like the basic idea of 'trustworthy' package maintainers. We're all basically dependent on them today, and that approach generally seems to work out OK.
>But there could also be a sandboxed version where people can prototype vanilla models by defining hyper-parameters and network structure from existing modules.
>"by defining hyper-parameters and network structure from existing modules"
Mind clarifying that for us with more detail. Not sure I understand what that really means Anon.
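On the hardware-probing point, the client could at least report VRAM and a rough throughput number automatically. A sketch of what such a probe might look like; the 4096-wide fp16 matmul benchmark here is an arbitrary choice for illustration, not a standard test:

import time
import torch

def probe_gpu():
    # report what this node can offer: device name, VRAM, rough matmul throughput
    if not torch.cuda.is_available():
        return {"device": "cpu", "vram_gb": 0.0, "fp16_tflops": None}
    props = torch.cuda.get_device_properties(0)
    n = 4096
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return {
        "device": props.name,
        "vram_gb": props.total_memory / 1024**3,
        "fp16_tflops": 10 * 2 * n**3 / elapsed / 1e12,  # 2*n^3 FLOPs per matmul
    }

print(probe_gpu())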
>>9028
It depends on the model, but people could do this for their projects.
>>9029
>The local client hardware can be probed for capabilities easily enough, but associating that with actual AI modeling potentials isn't something for a neophyte.
Performance tests can be done on various tasks like matrix multiplication, FFT, CNNs, RNNs, transformers, etc., and projects can be profiled and estimated with those numbers. It'll be a big task, but I don't think it'll be something to worry about in the beginning. We'll likely be throwing every CPU and GPU we have at one task at a time rather than distributing them across multiple. Even with 2-5 projects the biggest performance difference will come from allocating GPUs to CNNs and letting RNNs have the CPUs.
>Some kind of reduction pre-processing?
Most of the weights of models can be pruned without much accuracy loss due to the sparsity of useful parameters. This allows massive models to run on mobile devices. It will be necessary for the work produced to be useful to people. If you use unpruned models for speech recognition, speech synthesis, and text generation together, you're looking at needing 16+ GB to run all that. With pruning, though, that can be brought down to 2-4 GB or even less by accepting a lower accuracy.
>Please tell us more about project managers and their roles?
It'd be like managing and maintaining any open-source project. I imagine some work will be involved in checking that contributors are sending good data and not using incorrect training data, until such tests are automated. Some additional work will be needed later to profile a project and make the information available to others so they can see if their hardware is a good match to contribute. And of course pruning models according to baseline contributor performance and resolving issues so everyone can use them.
>"by defining hyper-parameters and network structure from existing modules"
>Mind clarifying that for us with more detail.
Rather than writing Python code for models, the sandboxed version could use its own model format that just specifies pre-built modules, the hyper-parameters defining how many parameters those modules use, and the connections between them. It goes hand in hand with the idea of an AI toolkit where you can drag and drop modules and connect them in a graphical interface without coding anything. These models would be generally safe to run from untrusted sources, far safer than stuff installed from Python's pip, so long as security exploits are minimized in the software itself.
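To make that last idea concrete, a declarative format could be as simple as a dict of module types and hyper-parameters that a whitelisted builder turns into PyTorch modules. The module names and spec keys below are invented purely for illustration, not an existing format:

import torch.nn as nn

# hypothetical project spec: only hyper-parameters and connections, no code
spec = {
    "layers": [
        {"type": "embedding", "vocab": 30000, "dim": 256},
        {"type": "gru", "dim": 256, "hidden": 512, "layers": 2},
        {"type": "linear", "in": 512, "out": 30000},
    ]
}

# whitelist of pre-built modules; anything not listed here simply can't be built
REGISTRY = {
    "embedding": lambda c: nn.Embedding(c["vocab"], c["dim"]),
    "gru": lambda c: nn.GRU(c["dim"], c["hidden"], c["layers"], batch_first=True),
    "linear": lambda c: nn.Linear(c["in"], c["out"]),
}

def build(spec):
    # a plain lookup instead of exec()/pickle keeps arbitrary code out of the sandbox
    return nn.ModuleList([REGISTRY[layer["type"]](layer) for layer in spec["layers"]])

model = build(spec)
print(sum(p.numel() for p in model.parameters()), "parameters")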
Open file (6.68 MB 640x360 soy_not_even_once.mp4)
>>9026
It will be interesting to see what you come up with exactly, Anon.
>>9031
>If you use unpruned models for speech recognition, speech synthesis, and text generation together
Yes, I suppose all 3 will be needed for suitable virtual waifu apps (including animation too ofc, but that should be pretty lightweight, computation-wise).
>It goes hand in hand with the idea of an AI toolkit where you can drag and drop modules and connect them in a graphical interface without coding anything.
That's actually a very nice idea Anon. And as far as the specified code/models being safer than using Python's pip, that would be very welcome. I hope someday we can just move away from the Python ecosystem entirely. It seems fraught with numerous difficulties to me. OTOH, I'm obviously in the minority here. 'Computer Scientists' from the most prestigious universities today obviously think it's the single best thing since anything ever. :^)
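For reference, the pruning being discussed can be tried in a few lines with PyTorch's built-in utilities. This is only magnitude pruning on a toy network; note that the dense tensors don't actually shrink on disk or in VRAM unless you also move to sparse or structured formats, or distill into a smaller model:

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # zero out the 90% smallest-magnitude weights in each layer
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

zeros = sum((m.weight == 0).sum().item() for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"{zeros / total:.0%} of weights pruned")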
>>8990
>Starlink
I don't really trust Elon with these things, look at Tesla's services. He definitely isn't /ourguy/ and will provide it at a cost of "freedom".
>Once basic chat is solved people are going to expand their virtual waifus to perform other functions such as playing video games
This has a lot of potential but it's prob not a good idea, since there doesn't exist just one game and it would become a Herculean task to try to appease everyone.
>composing music, drawing
Again, it's very vague and doesn't necessarily help the project in and of itself, nor necessarily appease our interests.
>debating
We already have many bots who can do that and we know for a fact it's not that good of an idea.
>summarizing research papers
This is actually very realistic. There has already been a paper around this and an AI which can do it, so it's very easy to implement (we just need codemonkeys for that).
>searching the web
Unless it can do it in a non-pozzed way, it's practically useless.
>Alternatively, someone could create a marketplace where people can pay crypto for compute, but I'm not familiar with how to do that.
Don't care about that too much, it's still not worth it to enough people.
My suggestion would be to first make an AI that, just like GPT-3, can code programs that we tell it to. Once we automate the coding part, the rest will become much easier and we won't have to worry too much about it.
>>9086
>My suggestion would be to first make an AI that, just like GPT-3, can code programs that we tell it to.
I'm quite skeptical of the fundamentals of that claim. I'm familiar with the article that made the statement, but he basically pinned the entire notion on a single addition operation. It just returned an exemplar from its working corpus that performed that exact operation already. I would argue the lexicographic problem of the AI digesting the command "Waifu, add 3 plus 4 for me" is much more difficult. We'll probably get there in the future with software programming waifus, but it certainly isn't here just now.
>>9000 https://www.mlcathome.org/mlcathome/ I've managed to find a new project that's just popped up using BOINC. It seems to be something for "Un-black-box-ifying" a lot of traditional machine learning algorithms.
>>9105 Thanks! Just in case anyone, like me, can't get to it via Tor: https://web.archive.org/web/20201225185201/https://www.mlcathome.org/mlcathome/
Seems like this could be somewhat related to your goals OP. https://github.com/tensorflow/mesh Please forgive my naivete if this isn't related.
>>8995
So does that mean even AMD GPUs with ROCm support and the appropriate motherboards would not be able to support this future crowd-computed robowaifu project?
>>10532 It would depend on the code libraries that are used. Right now most of the most popular ones for AI are designed for Nvidia thanks to CUDA being the industry and academic standard. It might be possible to port it over to run on AMD's GPU computing platform if there's enough demand. My comment was about how the memory was laid out on some AMD card, this isn't the first time this problem comes up in the GPU world. Last time it was with the GTX 970 which ended in a class action lawsuit. This time it's just an issue with cryptocurrencies that need large amounts to store the entire blockchain, some of them can't be mined on AMD cards even if the CUDA code was ported over. What I'm going to do in the future is get two cards, a powerful AMD one due to better drivers on GNU/Linux and a weaker CUDA capable one for use in a virtual machine so all my bases are covered.
>>10533 Not him, but thanks for the explanation Anon.
>>10533 So OpenCL is dead in the water or something?
>>10550
(Not the same anon) AMD uses ROCm, which PyTorch has supported since recently. AMD has professional cards, and ROCm seems to primarily support those: https://www.reddit.com/r/hardware/comments/ly6j3h/pytorch_18_supports_amd_rocm/gpr91eq and https://www.reddit.com/r/Amd/comments/nhpsnf/state_of_rocm_for_deep_learning Long story short, till AMD invests much more into the ecosystem: forget it.
>>10556
Welp, that is quite unfortunate that AMD doesn't bother to improve ROCm support. I find it a bit odd that AMD doesn't try to compete with Nvidia on this front too and instead allows Nvidia to gain a whopping market share in machine learning.
>>10563
What people tend to forget: AMD is much smaller than Nvidia. They didn't have the money to do this, thanks to their underdog status in the past. That market might also be less relevant to them. But then, ROCm and underlying technologies like OpenCL are open source, Intel will release their own discrete GPUs soon, and there might be other players, like Apple for example, or other companies working with ARM-based chips and a GPU.
>>10568 > or other companies working with Arm based chips and a GPU. Sadly, certain (((interests))) in the US Govt approved Nvidia's buyout of ARM, and the sale has been completed. Nvidia owns ARM now, lock, stock, and barrel.
Open file (62.11 KB 795x595 1612068100712.jpg)
>>10568
Hmm right, I recall there was a news article or blog post (what's the difference? I forget) which said that only the bri'ish people buy AMD because it is cheaper and everyone else is buying Nvidia only.
>>10577
>Nvidia owns ARM completely now
>US Gov' sees no problem with a giant getting even bigger
This shit just makes me sad.
>>10591 >This shit just makes me sad. It makes me angry, actually. Just follow the money, Anon, just follow the money.
>>10577
A month or so ago, the talk was that Nvidia buying ARM isn't finished yet bc of Europe and China, though I didn't look into it today. ARM also licenses its designs to others, and they certainly won't be allowed to just stop that, even if they wanted to. I also assume this would only be relevant for future designs, hypothetically. Apple might already be quite independent in designing their own stuff, and there's still POWER and RISC-V.
Is the OP still alive? Does he or any others have any takeaways from their research, especially those beyond the usual obvious?
Open file (53.04 KB 620x439 adapters go brrr.png)
Open file (111.94 KB 1048x489 kronecker adapter.png)
Open file (159.01 KB 522x786 hyperformer.png)
In August Hugging Face merged support for loading models in 8-bit:
https://github.com/huggingface/transformers/pull/17901
https://arxiv.org/abs/2208.07339
(Note: at the time of writing, bitsandbytes does not support CUDA 11.7+ yet and will fail to load, but support is coming)
https://github.com/TimDettmers/bitsandbytes/issues/52#issuecomment-1272699887
By training with Kronecker adapters or low-rank adapters, the trainable parameters can be reduced to 0.2%:
https://arxiv.org/abs/2203.16329
https://arxiv.org/abs/2106.09685
This should make it possible to finetune opt-2.7b with only 6 GB VRAM. However, large models are still slow to inference. Using opt-350m or opt-1.3b would be much more practical and would let people with smaller GPUs contribute. Some of the most popular GPUs on Steam are still only 4 GB: https://store.steampowered.com/hwsurvey/videocard/
From my experience working with large batch sizes, the LAMB optimizer is the way to go; other optimizers can't compete: https://arxiv.org/abs/1904.00962
>Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes
Using large batch sizes with gradient accumulation will also help reduce bandwidth because each node can take many accumulated steps before needing to send a gradient update for an optimizer step. A PyTorch implementation of LAMB is available here: https://github.com/jettify/pytorch-optimizer
DeepSpeed also provides training in PyTorch with 1-bit LAMB:
>To train large models (like BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with a limited bandwidth TCP network.
>On one side large batch-size optimization such as LAMB algorithm was proposed to reduce the frequency of communication. On the other side, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the techniques is not sufficient to solve the communication challenge, especially under low network bandwidth.
>Motivated by this we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression.
>In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance. For BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with NCCL-based backend is able to achieve up to 4.6x communication volume reduction, up to 2.8x end-to-end timewise speedup, and the same sample-wise convergence speed (and same fine-tuning task accuracy) compared to uncompressed LAMB.
https://arxiv.org/abs/2104.06069
https://www.deepspeed.ai/tutorials/onebit-lamb/
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/fp16/onebit/lamb.py
Sending gradient updates over a modest internet connection will be a breeze with adapters and 1-bit LAMB, so I'll start working on distributed training sometime in 2023 if nothing else comes along. Hypernetworks could also be added to the adapters so the model can be finetuned to do multiple tasks: https://arxiv.org/abs/2106.04489
I'm sure people will have disagreements over what to train on, but if training on two different tasks is better than training on just one then everyone wins. And lastly, Hugging Face is working on safetensors, which will be necessary so people joining robowaifu mining pools don't get RCE pickled: https://github.com/huggingface/safetensors
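For anyone who wants to see what 'adapter' means in code, here is a bare-bones low-rank adapter in plain PyTorch, in the spirit of the LoRA paper linked above. The rank, scaling and layer size are arbitrary, and a real setup would wrap selected projection layers inside a pretrained model rather than a standalone layer:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # only the adapter trains
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(2048, 2048), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} of {total} ({trainable / total:.2%})")  # roughly 0.8% for this layer

Only A and B (and their gradients) would ever need to cross the network, which is what makes the bandwidth numbers above plausible.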
>>17510 >the parameters can be reduced to 0.2% That sounds remarkable tbh. Thanks Anon.
>>17510 There's a lot of ambiguity over what gets called a "hypernet", and at least some variants end up being useless. From what I've heard, it isn't clear right now if hypernets are actually a good idea. >1-bit LAMB Very cool. I hadn't looked into distributed training algorithms before. Are 1-bit algorithms enough for Internet speeds? My understanding is that that would still require 1 bit per parameter per batch, which for a 1B model would be 125MB of parameter update data per batch. How long would a distributed reduce operation take on that much data with Internet latencies? Assuming there are 10-100 people training, I would guess it would be on the order of minutes, given that upload speeds tend be much slower than download speeds. Are there any studies on the relationship between batch size and number of batches required for convergence? If the batch sizes can be huge, then maybe those latencies are tolerable. I don't know how much efficiency gets lost with huge batch sizes though. There are so many ways to optimize this that I'm sure it can be done with a few more tricks. For example, I see no reason to stop at 1 bit per parameter. If you have massive batch sizes, you might as well go down to 1 bit per two or three parameters, with some per-batch pseudo-randomly selected coupling between parameters. >I'll start working on distributed training sometime in 2023 if nothing else comes along. I'll be working on distributed datasets, probably later this year. Once we're a bit further along, I can use your use cases as my test cases. >I'm sure people will have disagreements of what to train on, but if training on two different tasks is better than training on just one then everyone wins. Easy way to get consensus: fine-tune an established model on a specialized dataset. If that's possible with an (online learning?) algorithm that's less prone to catastrophic forgetting, you can let people decide for themselves what data they want to train on, so there's even less agreement required for getting good results. I can think of at least one community that would love to contribute gradients if it meant making large models work better with their data.
Open file (24.22 KB 379x462 attention.png)
Open file (74.86 KB 467x482 ANML.png)
>>17530 >There's a lot of ambiguity over what gets called a "hypernet", and at least some variants end up being useless. From what I've heard, it isn't clear right now if hypernets are actually a good idea. A hypernetwork is just a network that generates the weights of another. What makes transformers so effective is their generated weights in the queries, keys and values. This is all a hypernetwork is really doing. It's learning slow weights to program the fast weights of another network. NovelAI for example put adapters before the keys and values in the cross attention layers of Stable Diffusion and it had fantastic results. There are other similar methods that are quite interesting such as gating outputs with another network which has been shown to be super effective in reducing catastrophic forgetting after training sequentially on 600 different tasks: https://arxiv.org/abs/2002.09571 >Are 1-bit algorithms enough for Internet speeds? My understanding is that that would still require 1 bit per parameter per batch, which for a 1B model would be 125MB of parameter update data per batch. It's actually more than that initially before the compression can really begin. Overall through training it needs about 4 bits per parameter. Finetuning the full parameters directly over the net won't be possible. This is why adapters are needed to reduce the training parameters. >Are there any studies on the relationship between batch size and number of batches required for convergence? If the batch sizes can be huge, then maybe those latencies are tolerable. I don't know how much efficiency gets lost with huge batch sizes though. Not sure if there have been more recent studies but these are some I remember: >On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima https://arxiv.org/abs/1609.04836 (explains why Adam fails to generalize on large batch sizes) >Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates https://arxiv.org/abs/1708.07120 >Don't Decay the Learning Rate, Increase the Batch Size: https://arxiv.org/abs/1711.00489 >The Limit of the Batch Size https://arxiv.org/abs/2006.08517 (further investigation confirmed LAMB is the only optimizer that can generalize well at huge batch sizes but starts underperforming baselines at 812K) When training from scratch there isn't much advantage to using a large batch size with LAMB in the beginning but for finetuning there is a huge advantage and you can use much larger learning rates. There are still diminishing returns though once you get to larger batch sizes. In the LAMB paper they tested up to 131K in the first stage of training but found 65K worked best https://arxiv.org/abs/1904.00962 >Easy way to get consensus: fine-tune an established model on a specialized dataset. If that's possible with an (online learning?) algorithm that's less prone to catastrophic forgetting One of the benefits of using adapters is you can finetune on small datasets without degrading model performance, which the Compacter adapter explored and actually outperformed full fine-tuning on SuperGLUE: https://arxiv.org/abs/2106.04647 I'm looking forward to trying out the Hyperformer to finetune a model that can switch between casual chat, question answering and story writing. I envision the task for the Hyperformer being written in natural language so users can provide more general instructions like "respond with a joke" and it generates really good jokes in a way that can't be achieved with prompting. 
Then if someone wants to train it to generate Arxiv papers they can do that under a task name like "write a research paper". It might generalize at some point and understand "write a joke paper", which would be so much better than the ridiculous prompting models need now to do tasks. I think this could be done by using the language model itself, like taking the detached hidden state of the first 16 tokens or something like that and using those to generate the weights.
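Since 'hypernetwork' gets used loosely, here is a minimal concrete example of the weight-generation idea described above: a task embedding is fed through slow weights that emit the fast weights of a residual bottleneck adapter, so one generator can serve many tasks. The dimensions and names are invented for illustration only:

import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """A tiny hypernetwork: slow weights map a task embedding to the
    fast weights of a residual bottleneck adapter."""
    def __init__(self, d_model=768, bottleneck=16, task_dim=64):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        self.gen_down = nn.Linear(task_dim, d_model * bottleneck)
        self.gen_up = nn.Linear(task_dim, bottleneck * d_model)

    def forward(self, x, task_emb):
        # generate this task's adapter weights on the fly
        W_down = self.gen_down(task_emb).view(self.d_model, self.bottleneck)
        W_up = self.gen_up(task_emb).view(self.bottleneck, self.d_model)
        return x + torch.relu(x @ W_down) @ W_up

adapter = HyperAdapter()
x = torch.randn(4, 10, 768)    # (batch, sequence, hidden states)
task = torch.randn(64)         # e.g. an embedding of "respond with a joke"
print(adapter(x, task).shape)  # torch.Size([4, 10, 768])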
>>17533 >ANML Wow. That sounds great for foundation models, where the number of fine-tuning steps is much smaller than the number of pretraining steps, and where selectively ignoring information could be considered a good thing. On the point of catastrophic forgetting, I think several papers have found that if you train with a contrastive objective, then training a single linear output layer is as good as fine-tuning the whole network. That might be useful for decentralized training, where everyone can contribute both (1) gradients for a final layer based on a task-specific loss, and (2) gradients for the rest of the network based on a contrastive loss. For people that are only working on new tasks and not on new data, that along with the update compression tricks could reduce per-batch bandwidth requirements to just several hundred bytes. At that point, latency becomes a much bigger problem than bandwidth. >Finetuning the full parameters directly over the net won't be possible. This is why adapters are needed to reduce the training parameters. Ah, I got it. The adapters reduce the rank of changes to weight matrices, not the actual weight matrices, so they can support really aggressive levels of compression in pretty much all cases. Both adapters and 1-bit algorithms are forms of compression for parameter updates. LoRA does it by creating a parameter space related to the original through a low-rank linear transformation, whereas 1-bit algorithms do it by using a codebook. >[Batch sizes, learning rates] Very cool. I can't believe I hadn't looked into it before. I don't understand why per-layer learning rates seem to be so important for large batch sizes. I guess I'll have to read the LAMB paper for this. Random thought if per-layer learning rates are so important: it might make sense to model layer-layer interactions when setting the learning rate. Intuitively, if learning rates define a metric on the space of gradients, then it might help to use a non-diagonal metric tensor. Maybe LAMB already does this.
>>17510
The problem with adapter-based training is the obvious degradation in the complexity of what you can learn. Generally speaking, optimizing the very process of what makes deep learning work at all - that is, meddling with weight optimization via backpropagation - is very dangerous. Dangerous not in an interesting mad-science sense, but simply because it easily degrades the learning curve, to the point where you could reach the same loss (result) on a single machine and with a smaller model. We need to pretrain our own model, and adapters here are very likely not useful (and regarding low-rank decomposition, tbh I have seen only one paper - by Cohere - where low-rank worked for pretraining at all, and only for a humble parameter reduction).
1-bit LAMB is more interesting and useful, but again, with serious caveats:
1. Even with 1 bit per parameter, without some additional sparsity and/or compression you make training of moderately large models over the internet unrealistic due to data-waiting stalls exceeding your computation time by an order of magnitude and more.
2. Many 1-bit and sparse schemes incur some sort of loss curve degradation, though there are solid schemes which incur mostly none.
3. The paradigm of X-bit OptimizerName is likely misguided; fully cooperative optimization overcomplicates the engineering and requires the admin of the system to trust nodes more. Parameter-server based approaches (possibly with a distributed parameter server cluster) where nodes only compute gradients seem optimal.
>By training with Kronecker adapters or low-rank adapters, the trainable parameters can be reduced to 0.2%
Do I need to say that this is wishful thinking, anon, does it help... Task-specific (hypernetwork-like) embeddings have their place, but making them work in a general-purpose system that shuffles them automatically, task-appropriately and in a learned way is hard, and DeepMind's recent foray into this area ended with modest gains.
>>17530
>If you have massive batch sizes
Massive batch size is not a given - most known tasks have a critical batch size past which the scaling becomes detrimental to test loss curve performance.
>>17533
Nice papers on batch size scaling mentioned.
>I'm looking forward to trying out the Hyperformer to finetune a model that can switch between casual chat, question answering and story writing.
Tbh I don't see how it's different from knowledge engineering, which led us nowhere. Hard to beat scaling on generalist pretraining.
TLDR: What has not been validated at scale almost certainly does not work. Designing and executing distributed training of meaningfully-sized models is very hard; some very smart people have tried and failed at it. A conservative approach is needed, but even if it materializes I just don't see how to attract enough committed volunteers to make it happen.
Also have a nice paper and repo (finally something that kind of works lol):
https://www.semanticscholar.org/paper/Spartan%3A-Differentiable-Sparsity-via-Regularized-Tai-Tian/210c47fc0c16bf1cfc9beeb01faf70fcdbd3b978
https://github.com/facebookresearch/spartan
>>17543
Important note: in my models of this distributed training process, the maximum allowed batch size is the de-facto limiting factor (with the other limiting factor being the parameter server bandwidth, which should be alleviated with a distributed parameter server, aka trusted supernodes), given even modest success among volunteers. This mandates task and dataset design that makes large batch sizes beneficial - an R&D-heavy, open-ended problem.
This general problem of distributed training requires careful experiment-driven engineering and iteration, not stacking of meme paper upon meme paper (almost certainly not even applicable to the distributed training context) in one's head, only to fail once the initial implementation gets done (does it even, lol). Seriously, try to simulate even a simple distributed training run on your machine, see how the loss curve compares to normal dense training - and become blackpilled like myself and become a better engineer through the experience. I came to see training as a very fragile process which gets ruined quickly by our brazen approximation and compression.
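That simulation is cheap to try. Below is a toy version under heavy assumptions (a linear-regression stand-in for the model, workers simulated on one machine, and plain sign compression with majority vote rather than actual 1-bit LAMB), just to compare the final loss of dense aggregation against 1-bit aggregation:

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(4096, 32)
true_w = torch.randn(32, 1)
Y = X @ true_w + 0.1 * torch.randn(4096, 1)

def run(one_bit: bool, workers: int = 8, steps: int = 300, lr: float = 0.02):
    model = nn.Linear(32, 1)
    loss_fn = nn.MSELoss()
    x_shards, y_shards = X.chunk(workers), Y.chunk(workers)
    for _ in range(steps):
        grads = []
        for xs, ys in zip(x_shards, y_shards):      # each "worker" sees its own shard
            model.zero_grad()
            loss_fn(model(xs), ys).backward()
            g = torch.cat([p.grad.flatten() for p in model.parameters()])
            grads.append(torch.sign(g) if one_bit else g)
        agg = torch.stack(grads).mean(0)
        if one_bit:
            agg = torch.sign(agg)                   # majority vote: 1 bit per parameter
        with torch.no_grad():
            offset = 0
            for p in model.parameters():
                n = p.numel()
                p -= lr * agg[offset:offset + n].view_as(p)
                offset += n
    return loss_fn(model(X), Y).item()

print("dense aggregation loss:", run(False))
print("1-bit aggregation loss:", run(True))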
>>17543
>We need to pretrain our own model
I say from experience that this is very bad advice. Not only does it cost a lot to pretrain good models from scratch, but you're inevitably going to fall behind SOTA as the (much larger, much better-funded) ML community continues to publish increasingly powerful models. Getting cheap, flexible ways to adapt other people's models to custom use cases seems far more promising, even if those adaptations perform worse than normal fine-tuning.
>This general problem of distributed training requires careful experiment-driven engineering and iteration, not stacking of meme paper upon meme paper (almost certainly not even applicable to the distributed training context) in one's head, only to fail once the initial implementation gets done (does it even, lol).
Your attitude is completely asinine. You're making assumptions about a person that have already been demonstrated false, and you're using 'meme' language to degrade them.
Open file (98.28 KB 928x534 215.png)
Open file (63.19 KB 822x370 323.png)
Open file (542.01 KB 1049x819 745.png)
>>17549
>I say from experience that this is very bad advice. Not only does it cost a lot to pretrain good models from scratch, but you're inevitably going to fall behind SOTA as the (much larger, much better-funded) ML community continues to publish increasingly powerful models. Getting cheap, flexible ways to adapt other people's models to custom use cases seems far more promising, even if those adaptations perform worse than normal fine-tuning.
This assumes the industry will continue feeding the increasingly problematic (their words) and outdated (the future ISN'T opensource, hello anon) hobbyist freeriders their precious checkpoints, which is an increasingly tired and borderline crazy assumption to make as the political machinery of AI regulation and so-called "compute governance" gets ironed out and implemented in laws and industry norms. If you don't see this and think Stable Diffusion is a counter-example, you are myopic, dear anon. Simple question: have you seen an open-source Gato replication, and if not, care to elaborate why, given its apparently small cost? (And how are you even going to cope with the lack of a Gato-tier checkpoint, by training a handful of adapter layers over a run-of-the-mill language model trained on e-trash and deemed safe enough for release? It won't be able to behave and you know it, anon - no mental gymnastics will help it gain modes of behavior and concepts it didn't get in the pretraining phase - unless you throw it away and just re-train it almost from scratch.)
The large-scale reality on this issue in the coming years is likely to be shaped like this: an AI regulation package is passed by a legislative body of a major geopolitical bloc to little fanfare. Other countries follow suit, as usual. The legislation mostly centers around a concept of a "general-purpose AI system", "general-purpose" here being a wide and blurry enough legalese definition that it firmly encompasses Gato-like and even simpler systems. The legislation effectively forbids startups and individuals from experimenting on such systems by making the regulatory burden high enough (for example requiring costly bureaucratic project-level audits and fully transparent runtime audits - to the point of requiring you to hand your servers' SSH keys to the authorities), while also deterring the more powerful entities from the occasional release of such general-purpose checkpoints under fear of large fines and career damage. This is it, this kills your dream, unless you understand me and start thinking outside the cozy hugbox.
The funny thing is the industry is already going in this direction by virtue of self-censorship. Again, you surely understand why we only have the pussy VIMA instead of a full-on Gato, and why stability.ai did a lot of various things yet wasn't brave enough to tackle this? Smart people are exceedingly good at self-censorship, as you would know if you had studied how academia works. Maybe we will see an opensource Gato at some point, but this long delay and silent self-censorship on behalf of all opensource AI collectives regarding this issue is already telling. It's like you don't read twitter and don't see how the AI alignment meme slowly wins over the generally capable yet socially primitive autistic minds of researchers. Add 2 and 2 and stop expecting freeriding in an industry which is increasingly being compared to uranium experimentation by eccentric physicists before WWII.
>much larger, much better-funded
Yes, this is what defeat looks like. We (and who are we, lol - misfits, individualist tinkerers?) are being defeated year after year. Consumers are happily consooming mediocre products (yet still head and shoulders above the opensource demos, thanks to a modicum of product development and UX polish) and ask for more. The very creators abandon their creations when their internal motivation falters, lacking the attention span to polish them to even a trivial degree. It's all obvious and sad.
Open file (166.17 KB 822x693 415.png)
Open file (160.95 KB 1497x623 742.png)
>>17549
>Your attitude is completely asinine. You're making assumptions about a person that have already been demonstrated false, and you're using 'meme' language to degrade them.
I don't see how to motivate people to do diligent engineering tbh. If people can get praise and updoots for DL-flavored shower thoughts on an imageboard promising unrealistic order-of-magnitude gains, they will mostly do just that. It seems all remotely capable people from the first world are vacuumed up by startups and industry, and what remains, in most of the best cases, is just undergrads playing with ideas from meme papers - until they crash into reality, finally git gud and find a job that will look like salvation by that time (mind you, I don't like this outcome, but this is how it works in most cases). It would be cool if someone proved my observation false - and I can find a vanishingly small minority of human counterexamples - just not here.
I don't fear causing negative reactions at all. I have seen all too many projects go nowhere mostly via this incentive hijacking and an encroaching victory of talkers over builders. As you may see, I have more than enough of the wordcel quality as well (thankfully, not the only gift of mother Nature) - this environment selects for that property, which should be noticed and punished before it is too late. If you're still with me at this point, I can give you several valuable suggestions. If you really want to win, fulfill three requirements:
1. Design and implement a realistic deep learning system that is capable of training in self-supervised and reinforcement learning modes. The system should maximize parameter efficiency and should be able to run on high-powered consumer hardware. Low-powered hardware is a poor man's dead end.
2. Design and implement either a distributed training system, or a social project to gather cryptocurrency from people and the knowhow to channel it into normal centralized training of your system.
3. Amass stable and massive popular support among individuals loyal to your cause who own enough compute or cryptocurrency to participate in the training run.
Train your system on a general-purpose task distribution in Gato-like fashion. Release the checkpoint via a signed magnet link. The vast majority of approaches that don't tick these checkboxes are pitiful trash and self-delusion. On a general note, hugboxes are an order of magnitude more evil than casual negativity when taken long-term. A hugbox is a cemetery for budding talent.
Open file (60.74 KB 510x456 tay stonks.png)
>Do I need to say that this is wishful thinking, anon, does it help...
I've already played around with them on toy problems and saw potential in them, and I doubt someone at Google with 18k citations and someone else at Microsoft with 7k spend their time publishing meme papers for the lulz. Obviously there is a loss in complexity, but it gives the option to trade off accuracy for fewer parameters so they can be sent over the net. I really don't care about state-of-the-art results or being an outstanding engineer. It just needs to get the job done.
>>17544
>Seriously, try to simulate even a simple distributed training run on your machine, see how the loss curve compares to normal dense training
I'll run some tests on different parameter reductions. If you have your own, feel free to share your results. I think there was a paper that showed it's faster to train a model with more parameters and distil it into a smaller model. It's worth considering whether distributing training is even worthwhile, or whether it's better to just buy better hardware.
>and become blackpilled like myself and become a better engineer through the experience.
I was blackpilled until I started using LAMB instead of Adam. Being bitter and cynical isn't useful for getting stuff done. I neither believe nor disbelieve papers. I just see them as possibilities to be explored when appropriate. Once you start judging something as wishful thinking without actually testing it, you're cutting off the possibility of ever knowing because you've already decided it's no good.
>>17549
He's not really wrong, to be honest. It's hard not to be cynical. Most gradient compression methods like PowerSGD and others often make wild claims, but in practice they require perfect hyperparameters for the task at hand and can become unstable or fail midway without warning. The 1-bit LAMB paper seems a lot more reasonable since they're only claiming a 4x reduction, and LAMB has been pretty robust for me in all use cases except small batch sizes.
>>17551
>This assumes the industry will continue feeding the increasingly problematic (their words) and outdated (the future ISN'T opensource, hello anon) hobbyist freeriders their precious checkpoints
Hugging Face has already started removing 'problematic' checkpoints, but people are happily sharing them over torrents. I also don't think people should expect many useful pretrained models from the West in the future, except small nerfed ones verifiably 'unproblematic' and torrents created by individuals. Look at the heat Stability.AI is taking for releasing models publicly, and they don't even care a shred about open source. They happily banned open-source contributors they saw as problematic, and the rest is all rhetoric for free advertising.
>The large-scale reality on this issue in the coming years is likely to be shaped like this: AI regulation package is passed by a legislative body of a major geopolitical bloc to little fanfare
Not going to happen on a significant scale. And if it does, whichever countries do this will seal their fate of never having any geopolitical power, because they will have dropped 95% of their AI researchers by causing massive human capital flight to less regulated countries. If Stable Diffusion gets taken down, it implies taking down GPT-2, OPT and almost all other models trained on any copyrighted or private data. Again, not going to happen unless a country is suicidal, and if that's the case they have much bigger issues to worry about than playing around with AI models.
>And how are you even going to cope with lack of Gato-tier checkpoint, by training a handful of adapter layers over run of the mill language model trained on e-trash and deemed safe enough for release? It won't be able to behave and you know it
If someone wants a Gato model or a larger VIMA model then yeah, they're going to have to train their own. Adapters aren't going to do shit; they're only useful for fine-tuning. On the other hand, pretrained models are pretty robust in what you can do with them. I've ripped embedding layers out of models and retrained them with new tokenizers. I remember there was a paper that found pretraining on Wikipedia helped with reinforcement learning, both speeding up convergence and getting better results: https://arxiv.org/abs/2201.12122
Even if no more pretrained models get released, there's plenty to work with for the next 8 years to come up with a plan. Personally I'm not concerned with making models from scratch right now. I don't have 10 80GB A100s at my disposal, let alone 1000, and I doubt a ragtag team of 3090s and old GPUs will achieve anything useful from scratch. I just want to finetune language and vision models for conversational AI to a greater degree than can be done alone. The reason there isn't an open-source Gato is because no one wants to spend $10,000 of their own money on a toy that will be obsolete in 3 years.
>>17552
>I have seen all too many projects going nowhere mostly via this incentive hijacking and an encroaching victory of talkers over builders.
Then build your own? There have been plenty of shit talkers come by over the years telling us why we should do it their way but never posting their own work. I'm happy with the progress I'm making and others are making. I'm not making this to save the world or take on corporations or for anyone else. I'm just building an AI waifu at home on a budget and sharing what I know.
>>17553
>Being bitter and cynical isn't useful for getting stuff done.
I would go a step farther and say it's quite counter-productive. A basic belief that you can succeed at a given endeavor is, all else being equal, the most fundamental prerequisite for eventual actual success in the effort. History is rife with pertinent examples in the arts & sciences, politics & tech.
To wit, compare:
>Just give up, it's hopeless!111
>vs.
>Keep.moving.forward.
Which way would you prefer to live, Anon, whether success or failure is the ultimate outcome? Indeed, which is more likely to succeed?
This post is off topic. Feel free to delete it or move it if you don't want it here.
>>17551
>Yes this is what defeat looks like.
That's not a defeat scenario; that's what I expect is the inevitable scenario for all parties, even well-funded ones. Large models have gotten 10x bigger year-on-year since 2018, and I think training costs have risen faster than that. Companies are eventually going to be forced to specialize in their AI direction, at which point none of them will be "the best" at everything. At that point, if you want "the best" AI, you'll need to be able to plug into multiple models from multiple parties regardless of how well-funded you are. In the long run, no model performs better than a mixture of all the leading models.
I'm not concerned about new legislation hindering open source AI. The US is far too afraid of China taking the lead on tech to introduce legislation that hinders AI development in any meaningful way. AI deployment, maybe, but AI development, no. I would guess that any Five Eyes country will be the same. The EU is going to get screwed on legislation as usual. That sucks, but as far as I know, the majority of Western open source AI enthusiasm is in the US and UK, and EU legislation is largely irrelevant in this. (Sorry if you're in the EU. If you do get screwed on legislation, maybe some of us can help proxy your work.)
At least in the US, it's more likely that people will use current legislation against open source AI development, but I think even that is unlikely to succeed at scale. As far as taking advantage of open source code goes, it looks like Microsoft will be forced to take the lead on the defense thanks to GitHub and Copilot, and they are very familiar with large legal battles around software. As far as making sure AI has access to copyrighted data, Google has an enormous stake in this, and they have won at least one related battle (Authors Guild, Inc. v. Google) in the US Supreme Court with a ruling that's definitely broad enough to cover AI use cases. As far as open source development goes, open source code and published research papers fall under the First Amendment, and this has been tested at the federal level even for something as extreme as cryptography (Bernstein v. Department of State). These cases can be overturned by the Supreme Court, but public opinion does not sway the current Supreme Court, as was demonstrated recently with Roe v. Wade. From the defense to the precedents to the judges, everything seems to work in favor of open source software in the US.
>>17557 >This post is off topic. It certainly is, but one of rather high-quality. Thanks, actually. >Feel free to delete it or move it if you don't want it here. I may move the conversation to /meta or the news thread. As to your post, you're simply replying in-kind to the already-offtopic-poster who seems to do this frequently, but as he is an outsider and newcomer by his own admission no surprises tbh, you can get a hallpass this time Anon. :^)
Open file (3.31 MB 1580x1500 demo.gif)
There's now a PyTorch framework that makes parameter-efficient finetuning simple and painless for any model: https://github.com/thunlp/OpenDelta
They also published a paper earlier this year studying parameter-efficient finetuning in depth: https://arxiv.org/abs/2203.06904
I'm using it at the moment for finetuning CLIP and will report back with results when done. My goal is to make it possible to finetune Stable Diffusion models with only 4GB by finetuning an image encoder against the frozen text encoder, then finetuning the text encoder against the frozen image encoder. LAION reported that using a very large batch size (up to 159k) seems to help reach even higher performance, which is fertile ground for using LAMB. I've been thinking that since optimizer updates will take a long time at that scale, or even at the 32k batch size OpenAI used, the master node will have time to validate whether the gradients sent in actually improve the model (rough sketch at the end of this post). This will lower the trust required in a training group, since someone sending erroneous or malicious gradients will only be taking up bandwidth and can be b& if necessary. If this works then it could start a culture of distributed training groups by riding on Stable Diffusion's popularity. Then next year, when open-source RLHF takes off after ChatGPT goes behind a paywall and people want to implement their own assistants, they will have the necessary tools to collaborate on language models, voice models, and whatever else they see fit.
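Something like this is what I have in mind for the check, as an untested sketch (the helper name, the plain-SGD probe step and the tolerance are made up for illustration; the real optimizer step would still be whatever the group is using, e.g. LAMB):

import copy
import torch

def gradient_improves(model, grads, val_batch, loss_fn, lr=1e-3, tol=0.0):
    # Hypothetical check the master node could run while waiting on the next
    # optimizer step: apply a contributor's gradients to a throwaway copy of
    # the model and see whether loss on a hidden validation batch gets worse.
    inputs, targets = val_batch
    with torch.no_grad():
        base_loss = loss_fn(model(inputs), targets).item()
        probe = copy.deepcopy(model)
        for p, g in zip(probe.parameters(), grads):  # grads assumed ordered like parameters()
            p -= lr * g                              # crude SGD-style probe step
        new_loss = loss_fn(probe(inputs), targets).item()
    # Accept only gradients that don't degrade the hidden loss by more than `tol`.
    return (base_loss - new_loss) >= -tol

Contributors whose gradients keep failing the check just get b&, and spot-checking only a random fraction of contributions would keep the validation cost down.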
>>18163 This sounds very exciting Anon. If we can truly solve this in an effective and manageable way using only open source tools, I'd suggest this will be a big breakthrough for the world generally. Certainly it could make the Robowaifu@home dream a reality. Godspeed.
One general question about this whole idea here: Will it be possible for people to rent a GPU online somewhere and contribute that way? My thinking is that some people might have a little bit of money to spend but don't want to use their computer at home. Ideally we could get to a point where people with some cryptocurrency would pay other people for renting a GPU in a data center somewhere, and then wire this GPU into the robowaifu training cluster.
>>18211 I don't see why not, Anon. It's pretty common today to rent compute on the cloud. There are yuge evil vendors, and literally thousands of smaller vendors doing effectively this. GPU cycles may be a rather limited niche for the smaller guys in general. Crypto payment isn't something I can speak to one way or other though.
>>18211 One other thing I'd add: in the vidya & film industries it's not uncommon for animation studios to rent/lease 'wee clusters on wheels' for a production's post-production render period. They're expensive, but you can throw a lot of compute around pretty quickly if you're in the right locations. They deliver right to your door! :^)
>>18211
Yeah, something else I've thought of is creating a system where people can earn Monero by training, but it needs a lot more thought yet to make it viable. I doubt it would be worthwhile to train on GPU rentals then, since it would cost more than paying people directly for their GPUs; Vast.ai for example takes a 25% cut of all profit. If people want to contribute to a project but don't have compute, it would make more sense to donate straight to the project.
A training group creator would put up some crypto and automatically pay people by how much their contributions improved the model on a hidden test set. Gradient descent is really noisy though, and it might be too costly to evaluate a large enough batch size to get a good measurement, but I think it could work out just by taking an EMA, similar to the optimizer's beta1 parameter, to smooth out the noise (sketch at the end of this post).
In the future I imagine the end-user program automatically connecting to the highest-paying project or whatever preferences are set by the user, so they can just install it, forget it and get paid periodically. The whole thing would be decentralized, with trackers people can announce their training projects on, kind of like torrents. Realizing this will be difficult though, because custom models need custom code to run, which a malicious actor could use to compromise people's systems, but I'm confident there's a way to parse most models with Python's ast module into blueprints that can be used to safely construct models on other people's computers from within the program, without having to download any code.
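Roughly the bookkeeping I have in mind, as an untested sketch (the names and the proportional-payout rule are made up; beta would be tuned like an optimizer's momentum):

def update_contributor_score(scores, contributor_id, loss_before, loss_after, beta=0.9):
    # Hypothetical bookkeeping on the coordinator: the raw signal is how much a
    # contribution lowered the loss on the hidden test batch, which is very noisy,
    # so it gets smoothed with an EMA (same spirit as an optimizer's beta1).
    improvement = loss_before - loss_after
    scores[contributor_id] = beta * scores.get(contributor_id, 0.0) + (1 - beta) * improvement
    return scores[contributor_id]

def split_payout(scores, total_payout):
    # Pay out in proportion to positive smoothed scores; junk senders earn nothing.
    positive = {k: max(v, 0.0) for k, v in scores.items()}
    denom = sum(positive.values()) or 1.0
    return {k: total_payout * v / denom for k, v in positive.items()}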
>>18219 These are all very fascinating ideas, thanks.
>>18219
>EMA
Mind explaining this in a bit more detail for us, Anon? Is it anything similar to finance's use of the calculation?
>In the future I imagine the end-user program automatically connecting to the highest-paying project or whatever preferences are set by the user, so they can just install it, forget it and get paid periodically.
That would be pretty remarkable. Trust is key here, and for most men that will translate directly into:
>Does my open-source (robo)waifu perform better now?
As you're likely well aware, this is an uphill struggle given the Globohomo & others' instant-gratification dogma.
>but I'm confident there's a way to parse most models with Python's ast module into blueprints that can be used to safely construct models on other people's computers from within the program, without having to download any code.
I'll shortly be working on some minor ASTs with C++ that at the very least should be smoking fast, if nothing else.
Open file (120.17 KB 521x658 lo-fi.png)
Open file (193.70 KB 908x614 WiSE-FT.png)
Open file (109.94 KB 1077x472 lo-fi OPT.png)
It looks like bandwidth and parameter-efficient finetuning isn't needed after all.
>lo-fi: distributed fine-tuning without communication
https://arxiv.org/abs/2210.11948
>When fine-tuning large neural networks, it is common to use multiple nodes and to communicate gradients at each optimization step. By contrast, we investigate completely local fine-tuning, which we refer to as lo-fi. During lo-fi, each node is fine-tuned independently without any communication. Then, the weights are averaged across nodes at the conclusion of fine-tuning. When fine-tuning DeiT-base and DeiT-large on ImageNet, this procedure matches accuracy in-distribution and improves accuracy under distribution shift compared to the baseline, which observes the same amount of data but communicates gradients at each step. We also observe that lo-fi matches the baseline's performance when fine-tuning OPT language models (up to 1.3B parameters) on Common Crawl. By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.
Nodes can finetune from a common checkpoint independently without any communication and merge their weights at the end of training (quick sketch of the merge step below). This method was used to finetune GPT-JT (>>18241)
And very, very related:
>Robust fine-tuning of zero-shot models
https://github.com/mlfoundations/wise-ft
>Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning approaches substantially improve accuracy in-distribution, they often reduce out-of-distribution robustness. We address this tension by introducing a simple and effective method for improving robustness: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements out-of-distribution, while preserving high in-distribution accuracy. On ImageNet (in-distribution) and five derived distribution shifts, WiSE-FT improves out-of-distribution accuracy by 4 to 6 percentage points (pp) over prior work while increasing in-distribution accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness improvements (2 to 23 pp) on a diverse set of six further distribution shifts, and in-distribution accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
Basically, interpolating a finetuned CLIP checkpoint with its starting checkpoint improves its performance both in-distribution and out-of-distribution. A similar phenomenon shows up in lo-fi, where the individually lower-performing models outpace the baseline once merged together, although it performed slightly worse on language modelling. I believe this is due to how weights get updated in attention layers. I haven't investigated this thoroughly yet, but a single backward pass is usually enough for a transformer to remember something, and then those weights are rarely ever touched again. I've inspected the text encoders of various Stable Diffusion models and often 70-95% of the parameters are exactly the same. So what nodes learn from mutually exclusive data will effectively be washed out when merged together.
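For reference, the lo-fi merge step itself is just a key-by-key average of the checkpoints. An untested sketch (assumes every node finetuned from the same base model and saved a full state dict):

import torch

def lofi_merge(state_dicts):
    # lo-fi style merge: each node finetunes from the same base checkpoint with no
    # communication, then the weights are simply averaged key by key at the end.
    merged = {}
    for key, ref in state_dicts[0].items():
        if torch.is_floating_point(ref):
            merged[key] = sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
        else:
            merged[key] = ref.clone()  # integer buffers (e.g. step counters) aren't averaged
    return merged

# e.g. model.load_state_dict(lofi_merge([torch.load(p, map_location="cpu") for p in paths]))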
Something I've been experimenting with when merging Stable Diffusion models is only merging the weights that are significantly different, and smoothly blending them in with a sigmoid function over their standard deviation from the primary model. So secondary-model parameters close to the primary model get an alpha of 0 and ones 4 standard deviations away get an alpha of 1. Usually merging too many models causes detail loss, but this method preserves the detail along with the significant features of the other models mixed in (sketch at the end of this post). I think a similar method is worth exploring for transformers and may push lo-fi past the baseline on language modeling.
>>18225
>Mind explaining this in a bit more detail for us, Anon? Is this anything similar to finance's use of the calculation?
Yup, an exponential moving average. In PyTorch it would be done something like:

def ema_update(target, value, beta):
    # keep `beta` of the running average and blend in (1 - beta) of the new value
    target.data *= beta
    target.data += (1 - beta) * value.data
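And the gist of the selective merge described at the top of this post, as an untested sketch (the sigmoid knobs here are illustrative, not the exact values I use):

import torch

def sigmoid_std_merge(primary, secondary, steepness=2.0, center=2.0):
    # Per tensor, measure how far each secondary weight sits from the primary in
    # units of the difference's standard deviation, then blend with a sigmoid so
    # near-identical weights get alpha ~0 and weights around 4 standard deviations
    # away get alpha ~1. `steepness` and `center` control where the crossover is.
    merged = {}
    for key, p in primary.items():
        s = secondary[key]
        if not torch.is_floating_point(p):
            merged[key] = p.clone()
            continue
        diff = (s - p).abs()
        deviations = diff / (diff.std() + 1e-12)       # distance in std devs from the primary
        alpha = torch.sigmoid(steepness * (deviations - center))
        merged[key] = (1 - alpha) * p + alpha * s
    return merged

Any weight the secondary model barely changed gets left alone, which is why repeated merges don't wash out detail.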
>>18243 >It looks like bandwidth and parameter-efficient finetuning isn't needed after all. Big if true. This will be amazing if it turns out effective. Surely thousands of groups will quickly rebel against """OpenAI""" and their ilk and start a true AI@home ecosystem? If that happens then we are surely golden with Robowaifu@home I'd guess? See any real roadblocks to my simple prognostication Anon? Thanks for the explanation of EMA, BTW. :^) >=== -add 'EMA' cmnt
Edited last time by Chobitsu on 12/15/2022 (Thu) 06:04:05.
>>18245
Robowaifu training groups will probably be a small niche relative to everything else going on, due to the lack of immediate returns. Hell, I want robowaifus more than anything, but I'm working on Stable Diffusion stuff now because money. It's hard to foresee how things will play out because AI is accelerating everything so fast. One thing for certain is that robowaifus will have to become competitive with other projects to remain relevant.
Training groups will need a blue ocean strategy for their models: a feature or capability that no other model has. For example, combining ChatGPT with Stable Diffusion so you can iteratively generate images and make edits to them in natural language would draw tons of attention because there would be no competitors, and people would find new uses for it you didn't intend, like generating choose-your-own-adventure visual novels or getting commentary on the differences between two given pictures, such as feedback to an artist on their sketch against a reference image.
All kinds of projects are going to pop up. Some might start one for a model that lets people design and generate unique voices. Some might work on game AI. Some might want something that helps with video editing. Essentially people will be training new components for more and more complex systems of whatever tickles their fancy. And when training groups miss subsets of data, other groups will pop up to fill the gap.
I imagine many people will need to see results to stick around. If a week goes by and a model has only improved 0.1% and they see some new interesting project, they'll jump ship and contribute to whatever is trending. Active development and regular updates will be crucial to maintaining contributed computing power. There will definitely be a social aspect to it too, since people will be working together on common interests. Collaborating with YouTube creators will likely be a good way to boost projects.
The most crucial thing though will be being the first to create an easy-to-use program for doing federated learning, kind of like how Automatic1111 became the de facto web UI for Stable Diffusion and attracted other devs to contribute to it. By pioneering the right tools, people will come.
>>18254
Thanks Anon, you've given us something to chew on.
>blue ocean strategy
I would argue that the entire specific paradigm we've adopted here on /robowaifu/ -- something that's never been done before in human history, something that literally millions of men would instantly want for themselves the moment they see it, and finally something done so inexpensively that literally any motivated individual can build one in a few months' time as a hobby endeavor -- easily qualifies us as our own blue-ocean strategy. This is something truly revolutionary here. If we pull it off right it will change human history.
Open file (9.20 KB 200x127 1596826135558.png)
Well fuck, someone already made a LoRA finetuner for Stable Diffusion, so the entire network can be trained with only 6 GB of VRAM and about 6 MB of trainable parameters: https://github.com/cloneofsimo/lora
People are reporting the same things the papers have, such as getting better results than a full finetune in Dreambooth on small datasets and being able to use much higher learning rates. I've got an idea though for how to create a completely decentralized way of training models, where people can share parameters with similar projects in a way that benefits each other while training completely separate models. Getting ahead of the curve is now or never.
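For anyone who hasn't looked inside these finetuners, the trick is tiny. A generic sketch of the low-rank idea (not cloneofsimo's actual code; the init and scaling here are just illustrative):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Freeze the pretrained weight W and learn a small rank-r update B @ A, so only
    # a few MB of parameters ever need to be trained or shared between nodes.
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

Only lora_a and lora_b ever get trained or shared, which is why the checkpoints are a few MB instead of a few GB.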
>>18256 It's not just about doing something new though. An essential part of a blue ocean strategy is creating new demand by delivering value to people. At the moment all we're doing is research and creating things only people familiar with the tech can replicate or use. Being able to rapidly turn ideas into MVPs people want to use will be the most important skill to have in the coming years.
>>18287
This is a good thing.
>>18288
No debate. But the simple fact is we're at a watershed moment in human history. It behooves all of us to grab it with both hands. The consequences will nevar be the same! :^)
Does anyone here have experience with DC++? It may be a way to transfer work amongst contributors. https://sourceforge.net/projects/dcplusplus/
Open file (148.99 KB 743x700 adan.png)
New optimizer that outperforms LAMB and works well with similarly large batch sizes across a wide variety of models and tasks: https://github.com/lucidrains/Adan-pytorch I haven't tried it out yet but it looks promising if true, achieving similar results with half the training.
>>18596 Presuming you are suggesting this will assist with the effort at distributed training via the proposed Robowaifu@home, et al, would you clarify your ideas about the way that would work, Anon? TIA.
Open file (748.64 KB 1137x672 zero-eacc4.png)
e/acc may be our allies in this. I'll give more details as necessary, but their goals are in direct alignment with ours, though they aren't about the waifu angle.
>>18597
Yeah, by increasing the batch size, training nodes can process more data before having to send any updates, which are costly, especially when finetuning a full model. More importantly, they report getting the same results in half the time and ultimately a better test loss. From my own playing around with merging model weights, it doesn't seem like updates are even required until the end of training. Merged weights almost always perform better at each respective task, or better at one and almost equal on the other. Some experimenting will have to be done to see how well Adan-optimized models merge together.
>>18608
Once you have an AI assistant it's impossible to go back. I'm sure they will get hooked. Also, as I was saying about model merging: other groups can finetune on whatever they fancy and /robowaifu/ can finetune on waifu stuff, and we can merge that work together, so long as we both start from the same base model.
>>18608
Likely true. Seemingly ironically, I also care about the women being abused by the lies they've swallowed from the Globohomo. Second to my concerns about the abused men, ofc. Once the Robowaifu Age begins, there will be a rapid decline in the power of feminism (after a highly-tumultuous period), and a return to healthier, more trad lifestyles for them will ensue.
<Win-win-lose.
>Men-Women-Gl*bohomo
>>18609
>Merged weights almost always perform better at each respective task, or better at one and almost equal on the other
That sounds like a natural benefit directly in line with our goals ITT then?
An older paper that's relevant to training on a budget: https://arxiv.org/abs/2001.04063
Rather than just predicting the next token, they predict the next n tokens. They reported achieving state of the art using ~1/4 of the training epochs over the data, with each extra token prediction costing about +15% in training time. I think this paper is even more relevant today because multilingual tokenizers like OPT's don't tokenize whole words but rather pieces of words. For example, 'robowaifu' is [1001, 14271, 102, 1594, 257]. Text generation often gets mixed up and uses 'robot waifu' and 'robot wife' randomly with no consistency. Greedy search can improve this, but I think there's something fundamentally wrong with a probability distribution that spits out nonsense like Markov chains because it's myopically focused on the next token. I'm going to try predicting 2 tokens on my next finetune, then 3 and 5, and see if it improves consistency (simplified sketch at the end of this post). Something I also want to investigate is predicting the embedding of the next sentence. I think this could solve a lot of the issues with using sentence embeddings for external memory too.
>>18610
Yeah, it's getting a lot easier to leverage other people's work. For example you can combine powerful new models like GPT-JT-6B with GPT-4chan and get the best of both. Which reminds me, there is a new project for running (and even finetuning) 176B-parameter models from home by pooling resources with people online: https://github.com/bigscience-workshop/petals
>Inference runs at ≈ 1 sec per step (token) — 10x faster than possible with offloading, enough for chatbots and other interactive apps.
Not useful for real-time applications, but people could use this to generate high-quality training data for smaller models to learn from.
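The kind of thing I'm planning to bolt on for the extra token predictions, as a simplified, untested sketch (this is NOT ProphetNet's n-stream self-attention, just extra linear heads on the hidden states; the class and argument names are made up):

import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    # Head i predicts the token i steps ahead from the transformer's hidden states.
    def __init__(self, hidden_size, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_size, vocab_size) for _ in range(n_future)])

    def loss(self, hidden, tokens):
        # hidden: (batch, seq, hidden_size); tokens: (batch, seq) input ids
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i])                # positions that have a target i steps ahead
            targets = tokens[:, i:]
            total = total + F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return total / len(self.heads)

Head 1 is the ordinary next-token objective; the deeper heads force the hidden state to carry information about what comes after, which is hopefully what cuts down the 'robot waifu' / 'robot wife' flip-flopping.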
>>18617
Neat! It certainly seems to be following an architectural model that lines up pretty closely with the ones discussed ITT?
>Not useful for real-time applications, but people could use this to generate high-quality training data for smaller models to learn from.
That would be a serious benefit if we can use this data to train smaller systems to run on mobile-suited hardware!
BTW question: their (BigScience) loicense wouldn't be suited to anything that any sane individual might say or do. [1] AFAICT, the only """approved""" uses would be (to put it in Current Year vernacular) 'Triple the pozz, and quadruple-down with the pronouns'. Any chance their work (or at least their approach) could be reasonably mimicked in such a way as to avoid such evils?
Thanks Anon, encouraging stuff. Godspeed with your efforts! :^)
1. https://huggingface.co/spaces/bigscience/license (atch. A)
>=== -minor grmr, prose edit
Edited last time by Chobitsu on 01/09/2023 (Mon) 12:14:32.
>Apache Beam is a unified programming model that provides an easy way to implement batch and streaming data processing jobs and run them on any execution engine using a set of different IOs.
Anyone know anything about this? Could this be used for this project ITT?
https://dzone.com/articles/how-to-develop-a-data-processing-job-using-apache
Would robowaifu@home work out the same way other @homes do? I thought you couldn't train NNs on distributed systems.
>>19000 >digits Yes, that's the plan, and yes, it's a difficult proposition. I think if you scour this thread (and partially in other AI-oriented ones) you'll get some idea of the most promising tacks on the table ATM.
>>9029
>>8995
Regarding the need for broad hardware/device support: I have recently been made aware of a JS backend for TensorFlow. [ https://www.tensorflow.org/js/guide/platform_environment ]
It uses GPU acceleration by way of WebGL shaders, so in theory any device running a WebGL-enabled browser should be able to take advantage of that to squeeze out some extra compute. The browser is a pretty good lowest common denominator as far as availability goes, though I can't say how feasible it would be to integrate into distributed learning. Even so, I'm checking it out for possible local use since my machine has neither CUDA nor ROCm available as acceleration options.
>=== -fix hotlink
Edited last time by Chobitsu on 02/08/2023 (Wed) 23:51:04.
>>19669 Interesting, Anon. Please let us know what you discover.
> Distributed inference via MPI https://github.com/ggerganov/llama.cpp/pull/2099 > via (>>23819) >=== -add crosslink
Edited last time by Chobitsu on 07/04/2023 (Tue) 19:20:44.
>>23826
There's an update from ggerganov himself: https://github.com/ggerganov/llama.cpp/pull/2099#issuecomment-1627804506
> It would be a fun thing to try and potentially achieve world-first inference of 65B model on a cluster of Raspberries
Open file (242.69 KB 482x500 1682300966710.gif)
>>23928 >yfw it's real
>>23928
aand he merged it
https://github.com/ggerganov/llama.cpp/pull/2099
https://twitter.com/ggerganov/status/1678438186853203974
> ggerganov approved these changes 38 minutes ago
anyone who is good at C/C++ can test it right now, ofc if u have 10+ RPi's with at least 8 GB of RAM on them :/
>>23933 I mean to have a cluster of them inside my robowaifu anyway, so I'll just make a mid-range plan to begin stocking up on some. Any idea if other h/w platforms are supported by it, 01?
>>23935
anything that runs code and has some sort of decent CPU and plenty of RAM, with Linux on top of it. also llama.cpp is perfectly optimized for ARM NEON, but that's mostly a macOS / Apple silicon thing, or not, idk, so it should also be good on any capable ARM hardware :/
Open file (38.71 KB 599x357 Screenshot_6.png)
>>23937 it's pretty much confirmed for now.
>>23928 >>23933 >>23938 This is really nice news 01 ! That's a talented man to be sure. It's quite gratifying that his own goals for his project seem to align rather well with several of ours. Thanks Anon.
>>23975 You seem disappointed 01, but I consider this a serious breakthrough. Here we are, with performant, opensource solutions providing distributed LLM inferencing, currently working across a smol collection of tiny capacity, commodity processors (with not a GPU in sight). This is the 65B model isn't it? Compare this situation to just one year ago! :^) I'd suggest we all wait till RobowaifuDev and other waifu-targeting AI researchers begin their researches utilizing these same llama.cpp mechanisms. I predict eventual impressive price/performance leaps past this one (which is already quite remarkable IMO). Patience Anon, we're all gonna get there! :^) >=== -prose edit
Edited last time by Chobitsu on 07/16/2023 (Sun) 13:18:41.
>>23975 hmm weird https://github.com/ggerganov/llama.cpp/issues/2209 > Not sure about 65B, but I tried a 33B model that mmaps 26GB on a Mac mini with 24GB RAM. It swapped and worked at 46 seconds per token. Then I added a second Mac mini over MPI and together they worked at 450ms per token, which is 100x faster.
>>23987 AFAICT, neither Ethernet nor MPI will be particularly fast in this context. I imagine the speedup (if it is in fact accurate, and not some type of systemic measurement error) is largely due to doubling the RAM available to the combined system. >=== -sp edit
Edited last time by Chobitsu on 07/16/2023 (Sun) 22:53:00.
>Torrented Models
Not sure if this is what the last few comments were about or if it was already covered, but I ran into this:
Petals - https://petals.ml/
Research - https://research.yandex.com/blog/petals-decentralized-inference-and-finetuning-of-large-language-models
Petals Google Colab - https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8
>This notebook will guide you through the basics of Petals - a system for inference and fine-tuning 100B+ language models without the need to have high-end GPUs. With Petals, you can join compute resources with other people over the Internet and run large language models such as LLaMA, Guanaco, or BLOOM right from your desktop computer or Google Colab.
Found through: https://www.youtube.com/watch?v=8jEGVaRKmFc
>>24086 Nice, thanks Anon! No, I don't recall seeing this here IIRC.
Framework for distributing user defined Spiking Neural Networks and other algorithms. >"distributedArchitecture is a Tiny distributed computation framework for spiking ANNs and more! distributAr's primary purpose is to distribute Spiking Artificial Neural Networks among multiple hosts or CPUs. It provides mechanisms for sharing spike times among running threads. Threads running in a single process communicate through shared memory. Threads running in differrent processes or on different machines communicate via network multicast." https://github.com/rand3289/distributAr
> conceivably-related question (>>23971)
>NuNet is building a globally decentralized computing framework that combines latent computing power of independently owned compute devices across the globe into a dynamic ecosystem of compute resources, individually rewarded via tokenomic ecosystem based on NuNet Utility Token (NTX). https://www.nunet.io
Open file (33.58 KB 855x540 NuNet Tokenomics.png)
>>26510 While the basic idea behind the claims is sound (ours is much better however), the entire thing strikes me as yet another scam. If I'm correct, then it's an effort to sweep up any unencumbered compute resources not already controlled by the GH, into their already-obscenely-large hardware stable.
Related: >>30759 >I'm working on infrastructure that's friendly to distributed development of complex AI applications
> platform-related : ( >>34123 )
