> The current state seems to be you can’t give an LLM/AI full range of access to the internet (it might be just too much data that might better be served by an intelligent search query rather than AI interpretation)…
It seems a bit like you’re thinking of LLMs as primarily a search technology, when that’s not really a good way to think about them IMO.
Modern LLMs are trained on something close to “the entire internet” (GPT-2 was trained on pages linked from Reddit, GPT-3 on Common Crawl plus some other corpora, GPT-4 who the hell knows?), but they don’t have access to the “live” internet at runtime. Their ability to “memorize” their training data is both limited and, to some extent, in conflict with the generalization that’s core to the idea of LLMs as a general AI tool.
There are, of course, specific AI products that access the internet in some way (for example, by doing a web search and then adding the results into the LLM prompt behind the scenes), but that’s essentially an additional feature built on top of an LLM, not part of the LLM technology itself, and it’s always gonna be limited by the quality of the search tool, which won’t itself be an LLM.
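To make that concrete, here’s a rough sketch of the “search, then paste the results into the prompt” pattern. `web_search` and `call_llm` are hypothetical placeholders for whatever search API and LLM API a real product wires in; the point is just that the retrieval happens outside the model and its output gets stuffed into the prompt.

```python
# Sketch of search-augmented prompting. web_search() and call_llm() are
# hypothetical stand-ins for a real search API and a real LLM API.

def web_search(query: str, max_results: int = 3) -> list[str]:
    """Hypothetical wrapper around some search engine; returns text snippets."""
    raise NotImplementedError("plug in a real search API here")

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around some LLM completion endpoint."""
    raise NotImplementedError("plug in a real LLM API here")

def answer_with_search(question: str) -> str:
    # Retrieval step: happens entirely outside the LLM.
    snippets = web_search(question)
    context = "\n\n".join(snippets)
    # The retrieved text is simply pasted into the prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```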
> At my work, we have a Wiki resource to explain hundreds of our processes and specific information about insurance plans, and if I need information about a specific thing, that information is sometimes scattered and duplicated and perhaps updated or not across 10 different wiki entries.
The sort of problem you’re talking about - searching for relevant data within a corpus - is a specific field of research called Information Retrieval, which certainly involves plenty of machine learning and NLP techniques nowadays, but LLMs specifically are not really a tool intended for IR.
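For a sense of what plain IR looks like with no LLM anywhere, here’s a toy sketch assuming scikit-learn. The wiki pages and the query are made-up placeholders, and the ranking is just TF-IDF plus cosine similarity; real IR systems are a lot fancier, but the basic shape (index the corpus, score documents against the query) is the same.

```python
# Toy IR over a small internal corpus (e.g. wiki pages) using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus: page title -> page text.
wiki_pages = {
    "plan-a-enrollment": "How to enroll a member in Plan A ...",
    "plan-a-claims": "Claims processing steps for Plan A ...",
    "plan-b-enrollment": "How to enroll a member in Plan B ...",
}

titles = list(wiki_pages)
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(wiki_pages.values())

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank pages by cosine similarity between TF-IDF vectors."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

print(search("how do I enroll someone in Plan A?"))
```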
> The other problem is having an AI that “learns” from talking to other people who might troll it, as we’ve seen with some public chatbots that distressingly begin to believe racist things some people say to them on purpose. If the AI only had the dataset of what’s in the game and weren’t sharing learned knowledge/experience between players/users, that becomes less of a problem.
Unfortunately, the biases in an LLM don’t just come from talking to users. Even pretrained models (ones that don’t continue to learn after being deployed) consistently display all sorts of biases. This has been a problem in NLP since well before the current LLM hype began; see this article from 2017 showing how following standard-practice steps to train a sentiment analysis model still produces racially biased results.
You’re right that using a dataset restricted to facts about your game would probably mitigate that; unfortunately, the generalization / “few-shot learning” abilities of LLMs fundamentally rely on a truly massive amount of training data (“basically the entire internet,” as mentioned previously). You simply cannot train something with the same abilities on a small dataset. You can finetune an LLM on a smaller dataset to make it more focused on your particular task (sketched at the bottom of this post), but that doesn’t necessarily get rid of undesirable biases; conversely, there are ML/NLP things you can do with a small dataset that aren’t LLMs, but then you’ve given up the abilities that made you want an LLM in the first place.
(EDIT to elaborate on that last issue: in particular, the ability to interpret an arbitrary natural-language prompt and respond in natural language is something you can only expect from an LLM trained on an ungodly amount of data)
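If you did want to go the finetuning route, here’s a rough sketch of what that looks like, assuming the Hugging Face transformers and datasets libraries. The “gpt2” checkpoint and the game facts are placeholders, not a recommendation; the key point is that you’re nudging a model whose abilities (and biases) already come from the huge pretraining corpus, not training something from scratch on your small dataset.

```python
# Sketch of finetuning a small pretrained causal LM on a game-specific corpus.
# "gpt2" and the example texts are placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

game_facts = [
    "The Ember Blade is found in the Sunken Keep.",
    "Healing potions restore 50 HP and cost 20 gold.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the tiny corpus; the collator below builds the language-modeling labels.
dataset = Dataset.from_dict({"text": game_facts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="game-lm", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even after this, the model’s general knowledge and any biases it picked up in pretraining are still in there; finetuning shifts its behavior toward your data, it doesn’t replace what it learned.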