Vienna Research Groups for Young Investigators Call 2024 - Information and Communication Technology | VRG24-013

Knowledge Representation Learning for Large Language Models


VRG leader:
Institution: Amazon
Proponent:
Institution: WU - Vienna University of Economics and Business
Project title: Knowledge Representation Learning for Large Language Models
Status: Ongoing (01.06.2025 – 31.05.2033)
GrantID: 10.47379/VRG24013
Funding volume: € 1,597,758

The ability of Large Language Models (LLMs) to generate contextually relevant natural language responses is truly impressive, and a growing number of people use them on a regular basis to address their information needs. However, since LLMs are parametric models, unlike databases, they were not designed to reliably store data. While they are trained on massive amounts of textual data from the Web and are likely to pick up factual knowledge from it, their responses - while usually fluent and grammatically correct - are often not factually correct, which makes them appear plausible and effectively deceive their users. A common fix for this issue is to couple an LLM with an actual database or an information retrieval system, such as a Web search engine, so that the model can use its results as input. This approach works in some cases, such as answering simple factoid questions, provided the relevant information is at the top of the search results and the LLM is fine-tuned to use it when generating the response. In practice, however, it fails in many cases where information needs are more complex and search results are not optimal.

In this proposal, I advocate for addressing the aforementioned problem upfront: instead of "plugging in" an existing information retrieval system, a new solution specifically designed with an LLM in mind should be conceived. While embedding-based approaches, such as dense retrieval, were proposed to this end, they still rely on the same patterns as classic search systems, such as a document index, and do not address the fundamental problem of knowledge representation. Alternative symbolic approaches to modeling knowledge, such as knowledge graphs, also follow strict structural rules that define what information should be represented and how. The main novelty of my proposal is to give an LLM the tools necessary to find, by itself, the optimal structure for its internal symbolic knowledge representation - one that can best support and align with the subsymbolic knowledge of language accumulated in its parameters. To accomplish this, an objective function shall be defined in terms of the success criteria for the resulting knowledge representation: its efficacy in supporting faithful response generation, and structural properties grounded in database theory that are more likely to efficiently guide the model towards the global optimum.
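To make the intended objective more concrete, the minimal sketch below (in plain Python) shows one way such a composite training signal could be expressed: a term measuring how well the symbolic knowledge representation supports faithful response generation, plus a weighted penalty for structural irregularities inspired by database theory. All names used here (KnowledgeStore, faithfulness_loss, structure_penalty, lam) are hypothetical placeholders for illustration only and are not part of the proposal itself.

# Hypothetical illustration of the composite objective sketched above;
# every name here is a placeholder, not the proposal's actual design.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class KnowledgeStore:
    # A symbolic knowledge representation whose structure the LLM may adapt,
    # simplified here to a list of (subject, relation, object) triples.
    triples: List[Tuple[str, str, str]]

def composite_objective(
    store: KnowledgeStore,
    faithfulness_loss: Callable[[KnowledgeStore], float],
    structure_penalty: Callable[[KnowledgeStore], float],
    lam: float = 0.1,
) -> float:
    # faithfulness_loss: how poorly the store supports factually correct generation
    # structure_penalty: database-theory-inspired cost, e.g. redundancy or
    #                    violations of normal-form-like constraints
    # lam: trade-off weight between the two terms
    return faithfulness_loss(store) + lam * structure_penalty(store)

In this reading, minimising the combined objective searches for a representation structure that both supports faithful generation and retains the regularity that makes symbolic stores efficient to query and maintain.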

Designing an internal knowledge representation for an LLM is the major goal of the project. In addition, my aim is to ensure that the proposed architecture has a clear potential to scale up to incorporate all of the Web's knowledge, which is only feasible when embracing the decentralised nature of the Web as a network of interconnected data repositories. By establishing collaboration with other researchers working in the field of Decentralised Web technologies and privacy protection, namely the Solid project represented by Assoc. Prof. Ruben Verborgh from Ghent University and Dr. Sabrina Kirrane from WU Wien, I can ensure that the proposed approach will hold up to this challenge by planning for the enabling mechanisms necessary for integration with Solid pod technology early in the system's design stages.

Last but not least, embedding this research project within WU Wien's environment will also allow us to actively collaborate with the WU Legal Tech Center: although the regulation of AI technology is already well under way across the EU states, as manifested in the recent EU AI Act and the General Data Protection Regulation (GDPR), there is a clear need for closer collaboration between AI researchers and legal experts to ensure that new architectures are designed in compliance with legal requirements and that mechanisms are put in place to examine the system and enforce these requirements. It is not possible to halt the advance of modern technologies such as LLMs at this stage, since they clearly bring important benefits to society, but it is also of great importance to be aware of the shortcomings and limitations that can lead to undesirable outcomes. It is important to ensure that further development of this technology is geared towards making it safer, fairer and more transparent for end users, businesses and legislators, putting them in a position to co-create the future of modern AI technologies rather than forcing them to adapt to it post factum.

Scientific disciplines: Artificial intelligence (50%) | Machine learning (25%) | Knowledge engineering (25%)
