>still zero rewards in Montezuma's Revenge and negative rewards in Pitfall in 2022
Montezuma is really hard for RL, I agree. If I were to name the hardest remaining problems in RL, the list would look like this:
1. Efficient and reliable exploration
2. Lifelong (& multitask & composable-skill) learning
3. Reliability in adversarial environments
4. Data efficiency. It is quite hard, but it has been improving fairly recently (e.g. via EfficientZero: https://github.com/YeWR/EfficientZero)
5. Specifically for transformer-based RL: limited context window.
In my project I don't have silver bullets for solving these, of course, but some likely good-enough solutions can be gleaned from a carefully selected subset of the current literature. I could list my current choices of arxiv & github crutches for each item on the list if you are interested, but I'm going to do that in my project thread soon anyway.
For example, exploration is IMO the hardest RL problem, and the decision-transformer line of models isn't good at it as it stands, but I expect the D-REX approach of generalizing over noisy trajectories to be useful here: https://evjang.com/2021/10/23/generalization.html https://arxiv.org/abs/1907.03976
Perhaps it will be enough, given transformers' native runtime few-shot learning and domain randomization in the training data.
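To make the D-REX idea concrete, here's a minimal sketch of its core trick: injecting increasing noise into a base behaviour yields automatically ranked trajectories, and a reward model is then trained with a pairwise Bradley-Terry ranking loss. Everything about the toy setup below (linear reward over feature sums, the noise schedule, dimensions) is my own illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy D-REX-style setup. Assumption (mine, not from the paper's code):
# more noise injected into a base behaviour => strictly worse trajectory,
# which gives us a free ranking to train a reward model on.
DIM, LENGTH = 4, 20

def make_trajectory(noise_level):
    ideal = np.ones(DIM)                       # "good" behaviour, toy stand-in
    return ideal + noise_level * rng.normal(size=(LENGTH, DIM))

noise_levels = [0.0, 0.3, 0.6, 1.0]            # traj i is ranked above traj i+1
trajs = [make_trajectory(n) for n in noise_levels]
feats = [t.sum(axis=0) for t in trajs]         # per-trajectory feature sums

w = rng.normal(size=DIM) * 0.01                # linear reward: R(traj) = w @ feats

def ranking_loss_and_grad(w):
    # Bradley-Terry pairwise loss: -log sigmoid(R_better - R_worse)
    loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            margin = w @ (feats[i] - feats[j])
            p_wrong = 1.0 / (1.0 + np.exp(margin))   # = sigmoid(-margin)
            loss += -np.log(1.0 - p_wrong + 1e-12)
            grad += -p_wrong * (feats[i] - feats[j])
    return loss, grad

loss0, _ = ranking_loss_and_grad(w)
for _ in range(500):                            # plain gradient descent
    _, grad = ranking_loss_and_grad(w)
    w -= 0.02 * grad
loss1, _ = ranking_loss_and_grad(w)

# The learned reward should now rank the clean trajectory above the noisiest.
R_best, R_worst = w @ feats[0], w @ feats[-1]
print(loss1 < loss0, R_best > R_worst)
```

The point is that no ground-truth reward is ever consulted: the ranking induced by the noise levels is the only supervision, which is exactly what makes the approach attractive for exploration-starved settings.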
We really need a good, lightweight baseline for benchmarking RL exploration. Montezuma per se doesn't cut it, as it pretty obviously requires more than a little world knowledge to be solved. As it happens, DeepMind has a codebase useful for this kind of RL-agent capability benchmarking, including exploration: https://github.com/deepmind/bsuite/
The problem with making the success of your project conditional on some R&D is, of course, the notable unreliability of any R&D. Realistically I have very few "R&D points" available. Looks like I'm going to spend most of those on 1) maxing out few-shot learning, 2) optimizing transformer training & parameter efficiency and 3) implementing good-enough RL exploration, while forgoing items 2 and 3 of the main list for now. Well, at least number 2 more or less solves itself with scale.
>only 50 papers on using curiosity for exploration in the past 4 years
When I see sparse experimentation in an obviously promising field, I have to conclude there is some nontrivial fundamental problem precluding good publications. It is likely that curiosity-driven models are hard to train & optimize, or simply involve engineering too hard for pure academics (and not too hard for DeepMind with its top-tier dedicated SWE teams).
DeepMind published a curiosity-driven exploration paper relatively recently, with promising results: https://arxiv.org/abs/2109.08603
but its contribution seems to be mostly good engineering, the curiosity reward design itself being fairly straightforward.
>Our work builds upon the curiosity learning approach utilising the forward prediction error of a dynamics model as a reward signal. However, in contrast to typical curiosity setups (Burda et al., 2018a) which are optimised on-policy we employ an off-policy method to train the agent. Our method is also set apart from prior art with regards to the utilisation of self-discovered behaviour. Instead of using model-predictive control (Sharma et al., 2020), we leverage emergent behaviour directly by employing policy snapshots as modular skills in a mixture policy (Wulfmeier et al., 2020a, 2021).
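The curiosity signal that quote builds on really is simple to write down: intrinsic reward = prediction error of a learned forward dynamics model, so familiar transitions stop paying out as the model improves. A self-contained toy sketch, where the linear dynamics model, the dimensions, and the learning rate are all my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

S_DIM, A_DIM = 3, 2

# Learned forward model: predicts next state from (state, action).
# A single linear layer here, purely for illustration.
W = rng.normal(size=(S_DIM + A_DIM, S_DIM)) * 0.1

def predict_next(state, action):
    return np.concatenate([state, action]) @ W

def intrinsic_reward(state, action, next_state):
    # Squared prediction error: high in poorly-modelled (novel) regions.
    err = predict_next(state, action) - next_state
    return float(err @ err)

def train_step(state, action, next_state, lr=0.05):
    # One SGD step on the forward model's squared error.
    global W
    x = np.concatenate([state, action])
    err = x @ W - next_state
    W -= lr * np.outer(x, err)

# A transition the agent sees repeatedly: curiosity about it should decay.
s, a = rng.normal(size=S_DIM), rng.normal(size=A_DIM)
s_next = 0.5 * s + 0.1          # toy environment dynamics

r_before = intrinsic_reward(s, a, s_next)
for _ in range(100):
    train_step(s, a, s_next)
r_after = intrinsic_reward(s, a, s_next)

print(r_after < r_before)       # familiar transitions become boring
```

The hard parts the paper spends its pages on sit around this core, not inside it: off-policy training of the agent against a nonstationary reward, and reusing policy snapshots as skills in a mixture policy.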
For now I find these surveys helpful, if inconclusive: https://arxiv.org/abs/2109.06668 https://arxiv.org/abs/2109.00157
and I'm open to your ideas.
>He doesn't stop winning, does he?
I like Schmidhuber more than some. Decade after decade he has delivered advanced research. His latest papers developing the transformer further and going beyond it look great ... and I can't help but wonder why he didn't lead teams scaling the most fruitful of his approaches. Where is NNAISENSE now? Maybe they are training large UDRL models to trade stocks, but I bet they'd publish a cool scaling paper if they ever had source material for one.
Why we aren't witnessing large-scale training of RFWPs >>11716
bothers me again. Either Schmidhuber has no talent for organizing large-scale engineering, no funding, or there is something wrong with the model. Maybe RFWPs aren't stable enough for web-scale training, who knows.
In any case, his website is a treasure trove of AI wisdom: https://people.idsia.ch/~juergen/
and his publication record is awesome: https://www.semanticscholar.org/author/J.-Schmidhuber/145341374?sort=pub-date