What is the simplest thing you can start with? It seems easier to first build something that doesn't create novel sentences but just does more listening than talking and occasionally picks from a few stock responses in a sensible way. ("Nice." "Sounds complicated." "That much?" "How sad.") More a listen-bot than a chatbot. And easier still is something that just highlights what seems unusual or important. Surely something like a highlighter is needed for more complex dialogue interactions as well. So, the roadmap is: mute text highlighter → taciturn listen-bot → proper chatbot.
What I'm thinking of for the highlighter isn't on/off but degrees, and I'm actually not thinking of a single highlighter but of three that work together:
1. The "old Codger" represents the generic expectation for parsing English text. It never updates itself; it just applies fixed scoring rules, like giving a word more points for rarity (using scores from a look-up table, with the Scrabble score as a fallback to further distinguish between words that have no dictionary entry) and giving points to sentences with formulations that are not rare or special themselves but imply something like that happening in the sentence (like the word "rare" itself, and "extreme", "the most", "surprising", etc.) or an extreme event ("birth", "death", "war"…). Another indicator of sentence complexity is the FOXDOG metric (copyright me, do not steal): how many different letters appear in a sentence. Since the bot is supposed to give a shit about you, saying "I"/"me"/"my"/"myself"/"we"/"our"/"us" also boosts a sentence's importance. And of course, there's also some conditional-probability stuff about expected adjacent words.
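A minimal sketch of how the "Codger" could combine these fixed rules. The rarity table, marker lists, and weights here are illustrative stand-ins, not the real database:

```python
# Toy "old Codger": fixed scoring rules, never updated.
RARITY = {"the": 0.1, "cat": 0.3, "war": 0.6}          # stand-in look-up table
SCRABBLE = {"q": 10, "z": 10, "j": 8, "x": 8, "k": 5}  # fallback letter values
IMPORTANCE_MARKERS = {"rare", "extreme", "surprising", "birth", "death", "war"}
SELF_WORDS = {"i", "me", "my", "myself", "we", "our", "us"}

def word_rarity(word):
    """Look up rarity; for unknown words fall back to a Scrabble-style score."""
    w = word.lower()
    if w in RARITY:
        return RARITY[w]
    return min(1.0, sum(SCRABBLE.get(c, 1) for c in w) / 30)

def foxdog(sentence):
    """FOXDOG metric: number of distinct letters in the sentence."""
    return len({c for c in sentence.lower() if c.isalpha()})

def codger_score(sentence):
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    score = sum(word_rarity(w) for w in words) / max(len(words), 1)
    score += 0.5 * sum(w in IMPORTANCE_MARKERS for w in words)
    score += 0.3 * sum(w in SELF_WORDS for w in words)  # the bot cares about you
    score += foxdog(sentence) / 26                      # letter-variety bonus
    return round(score, 3)
```

The relative weights (0.5, 0.3, …) are guesses to be tuned; the point is just that all the rules are absolute and frozen.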
2. The "young Baby" builds its expectation from recent inputs and scores relative to that expectation. It also uses a dictionary with synonyms and topical tags. If a word is unique within the window of consideration but synonymous with or fitting recent topics, it doesn't score that high. (A similar cooperation between "C." and "B." can happen inside a robot with an expressive face, stuffed with sensors for noise, temperature, and other things: "C." works with a fixed absolute scale, "B." with a scale calibrated around recent impressions, and together they work out how surprised/annoyed/etc. the robot's reaction to noise etc. is.)
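A minimal sketch of the "Baby": surprise measured relative to a sliding window of recent words, with synonyms merged so a "new" word that fits recent topics scores lower. The synonym table and window size are made-up placeholders:

```python
from collections import Counter, deque

SYNONYMS = {"big": "large", "large": "large", "huge": "large"}  # toy dictionary

class Baby:
    def __init__(self, window=20):
        self.recent = deque(maxlen=window)  # last N normalized words heard

    def normalize(self, word):
        """Map synonyms onto one representative form."""
        w = word.lower()
        return SYNONYMS.get(w, w)

    def surprise(self, word):
        """1.0 for a word whose synonym-merged form is unseen in the window,
        dropping toward 0 the more often it was heard recently."""
        counts = Counter(self.recent)
        seen = counts[self.normalize(word)]
        return 1.0 / (1 + seen)

    def hear(self, sentence):
        scores = {}
        for word in sentence.lower().split():
            scores[word] = self.surprise(word)
            self.recent.append(self.normalize(word))
        return scores
```

So after hearing "the big dog", the word "huge" is only half as surprising as a truly fresh word, because it collapses onto the same representative as "big".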
3. The "Middle" also builds expectations like "B.", but the time frame considered is much bigger. The expectation of hearing a word again goes down over time, as with "B.", but it isn't just a slower decay. In the big picture the expectation goes down, but with some bumps along the way, so it isn't monotonic. The idea here is that there are situations in life that repeat in rhythms, so what goes into the expectation of "M." are questions like: What words do I usually hear at… this time of the day / yesterday / the same day last week / the same day one year ago? (For a robot with the right sensors we can add: What words do I usually hear when I'm… seeing that the sky or ceiling has this color / measuring this temperature / facing this cardinal direction / sitting / etc.? All with non-monotonic decay. And we can also look for other correlations, like how common it is to have this sensor impression at this time of day / day of the week etc., with all of that adding a bit to how confused/bored/etc. the facial expression is.)
We can also call these three Caspar, Balthazar, Melchior.
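The non-monotonic decay of "M." could be sketched like this: an overall slow decay, multiplied by bumps that recur at rhythmic offsets (same time yesterday, same day last week). The decay constant, bump weights, and bump width are all invented numbers:

```python
import math

HOUR = 3600.0
DAY = 24 * HOUR
WEEK = 7 * DAY

def rhythm_bump(dt, period, width=HOUR):
    """A bump near every multiple of `period` (e.g. same time yesterday)."""
    phase = dt % period
    nearest = min(phase, period - phase)  # distance to the closest rhythm point
    return math.exp(-((nearest / width) ** 2))

def expectation(seconds_since_heard):
    """Expectation of hearing a word again: slow global decay,
    with non-monotonic bumps at daily and weekly rhythms."""
    base = math.exp(-seconds_since_heard / (3 * DAY))
    bumps = (0.3 * rhythm_bump(seconds_since_heard, DAY)
             + 0.2 * rhythm_bump(seconds_since_heard, WEEK))
    return base * (1 + bumps)
```

With these numbers the curve dips half a day out, then rises again around the 24-hour mark, which is exactly the non-monotonic shape described above.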
The common words are listed in a database and have tons of tags attached, like cute, gross, swear word (one word can have multiple tags), that get weighted together for an appropriate emotional response. Negation in the vicinity of a heard word strongly dampens its weight and the intensity of the reaction, as do words and formulations like "movie" or "just kidding", which suggest the word likely doesn't belong to a real-world situation. Talking in the past tense and using formulations like "long time ago" also dampen the reaction intensity.
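This tag-weighting with dampeners could look roughly like the following. The tags, weights, and cue lists are all illustrative placeholders for the real database:

```python
WORD_TAGS = {
    "puppy": {"cute": 0.9},
    "vomit": {"gross": 0.8},
    "damn": {"swear": 0.5, "gross": 0.1},
}
NEGATIONS = {"not", "no", "never", "didn't"}
FICTION_CUES = {"movie", "kidding", "dream", "book"}   # "not real" signals
PAST_CUES = {"was", "were", "ago", "once"}             # past-tense signals

def reaction(sentence):
    """Sum tag weights per word, dampened by nearby negation and by
    fiction / past-tense cues anywhere in the sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    fiction_damp = 0.3 if any(w in FICTION_CUES for w in words) else 1.0
    past_damp = 0.6 if any(w in PAST_CUES for w in words) else 1.0
    intensity = {}
    for i, w in enumerate(words):
        if w not in WORD_TAGS:
            continue
        # negation within the two preceding words strongly dampens the word
        neg = 0.2 if any(p in NEGATIONS for p in words[max(0, i - 2):i]) else 1.0
        for tag, weight in WORD_TAGS[w].items():
            damped = weight * neg * fiction_damp * past_damp
            intensity[tag] = intensity.get(tag, 0.0) + damped
    return intensity
```

So "Look at the puppy" triggers the full cute reaction, while "It was just a movie with a puppy" gets dampened twice over.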
When somebody says "X is long", what they actually mean is that X is longer than some implicit reference standard. Likewise with heavy, fast, old… So, for the bot to make a comment of that sort, it has to build up such a standard. The standard is strongly context-dependent: think of what "near" and "small" mean to you when talking about animals versus when talking about planets. Often, talking about a thing conjures up a collection of things it belongs to, and whether it appears heavy or big etc. strongly depends on how it looks within that set, and also relative to other recently mentioned things. So, for some very common nouns at least, the database needs information about typical size and so on. Since I'm lazy, the database will have lots of holes. One common reference point for size, weight, etc. will be an average human in the middle of life. Even recently used numbers can serve as reference values (not very logical, but humans do that too). Several reference points get put on a line, and then we look where on that line the mentioned thing falls. I'm more partial to the median than the mean as the main reference point.
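The median-of-reference-points idea could be sketched like this. The size table is a toy stand-in for the hole-ridden database; the always-present human anchor, the recently-mentioned-numbers trick, and the median rule are the ones described above:

```python
import statistics

# Toy stand-in for the database of typical sizes (in meters), full of holes.
TYPICAL_SIZE_M = {"ant": 0.005, "cat": 0.5, "human": 1.7, "elephant": 3.0}

def is_big(thing, context_things, recent_numbers=()):
    """Place `thing` among reference points: typical sizes of the things in
    the current context, the average human, and recently mentioned numbers.
    "Big" means above the median of those reference points."""
    refs = [TYPICAL_SIZE_M[t] for t in context_things if t in TYPICAL_SIZE_M]
    refs.append(TYPICAL_SIZE_M["human"])  # default anchor: an average human
    refs.extend(recent_numbers)           # not very logical, but humans do it
    standard = statistics.median(refs)
    return TYPICAL_SIZE_M[thing] > standard
```

The same cat comes out "big" in a conversation about ants and "not big" in one about elephants, which is exactly the context dependence the standard needs.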