Two recent research papers were presented at the International Conference on Software Testing (ICST), one of the world’s leading venues for researchers, engineers, and practitioners working in software testing, verification, and validation. The studies were conducted as part of the T.A.R.G.E.T. project, in collaboration with Blekinge Institute of Technology and our partner Synteda.
These results now form the foundation for the next phase at QESTIT, where our consultants will begin evaluating the proposed solutions in real-world settings.
ICST 2025 was held in Naples, Italy, and serves as a global platform for cutting-edge research and applied innovation in the testing field.
STUDY 1
Evaluation of the choice of LLM in a multi-agent solution for GUI-test generation
In brief
This study investigates whether a multi-agent system using different large language models (LLMs) can outperform a system where all agents use the same model. The hypothesis was that different LLMs have different strengths and weaknesses when it comes to test generation, and that combining them might lead to better results.
PathFinder: A multi-agent prototype
To evaluate this idea, the team developed a prototype system called PathFinder, featuring four autonomous agents that collaborate on generating GUI tests.
Three different LLMs – Mistral, Gemma 2, and Llama 3 – were tested across four e-commerce websites to examine whether using a combination of models would improve the quality of testing.
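To give a feel for how such a setup can be wired together, the sketch below shows one way to back a chain of agents with either a single LLM or a mix of LLMs. It is an illustrative outline only, not the PathFinder source: the agent roles, prompts, and the query_llm() helper are assumptions made for this example, and the model names simply mirror those evaluated in the study.

```python
# Illustrative sketch, not the PathFinder implementation.
# Agent roles, prompts, and query_llm() are assumptions made for this example.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str           # role of the agent in the pipeline
    model: str          # LLM backing this agent, e.g. "mistral", "gemma2", "llama3"
    system_prompt: str  # instruction defining the agent's responsibility

def query_llm(model: str, system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a call to a locally hosted LLM endpoint."""
    return f"[{model} response to: {user_prompt[:40]}...]"

def generate_gui_test(agents: list[Agent], page_description: str) -> str:
    """Run the page description through the agent chain; each agent refines
    the previous agent's output into candidate GUI-test steps."""
    output = page_description
    for agent in agents:
        output = query_llm(agent.model, agent.system_prompt, output)
    return output

# Homogeneous setup: all four agents share one model.
homogeneous = [Agent(role, "llama3", prompt) for role, prompt in [
    ("analyzer",  "List the interactive web components on the page."),
    ("planner",   "Propose user actions that exercise those components."),
    ("generator", "Write GUI-test steps for the proposed actions."),
    ("reviewer",  "Check the steps for completeness and consistency."),
]]

# Mixed setup: different models for different agents.
mixed = [
    Agent("analyzer",  "mistral", "List the interactive web components on the page."),
    Agent("planner",   "gemma2",  "Propose user actions that exercise those components."),
    Agent("generator", "llama3",  "Write GUI-test steps for the proposed actions."),
    Agent("reviewer",  "mistral", "Check the steps for completeness and consistency."),
]

print(generate_gui_test(homogeneous, "Product page with search box, cart button, checkout link"))
```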
Results: One model per site works best...
The findings showed that the hypothesis did not hold true when testing a single website: having all agents use the same LLM actually produced better results than mixing them. A likely reason is that the models "speak different languages" and emphasize different aspects of the interface.
For example, Llama 3 focused more on web components, while Gemma 2 prioritized user actions such as clicks. These differences made communication and coordination between agents harder when using mixed models.
…But mixed models may be better across different websites
However, when the goal is to create a more generalizable solution for use across various sites or systems, combining models can be advantageous. Different LLMs can complement each other more effectively in varied environments or user scenarios.
STUDY 2
LLM-based reporting of recorded automated GUI-based test cases
In brief
This study explores how generative AI (LLMs) can be used to analyze model-based GUI tests and automatically collect contextual data to improve the usefulness and clarity of test reports.
How it works:
A demonstrator in Scout
The research team built a solution into Scout, an academic test tool that allows users to "record" tests by interacting with a web application, making the testing process more intuitive.
Using generative AI
The system analyzes the tested web application and gathers relevant contextual information (e.g., functional descriptions, flows, and potential user scenarios). This data is added to the test reports and also helps Scout update its internal test model, enhancing its understanding of the application’s behavior.
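As a rough illustration of that flow, the sketch below shows how recorded steps and page content might be passed to an LLM to produce the contextual description attached to the report. It is not Scout's actual code: the prompt wording, the report fields, and the ask_llm() helper are assumptions, and the update of Scout's internal test model is not shown.

```python
# Illustrative sketch, not Scout's implementation.
# The prompt, report fields, and ask_llm() helper are assumptions.
import json

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to whichever generative-AI backend is configured."""
    return "[LLM-generated functional description, flows, and user scenarios]"

def enrich_report(recorded_steps: list[dict], page_source: str) -> dict:
    """Ask the LLM for context about the tested page and attach it to the
    recorded test case before the report is produced."""
    prompt = (
        "Given the web page and the recorded interaction steps below, describe "
        "the functionality under test, the user flow the steps cover, and other "
        "user scenarios that involve the same elements.\n\n"
        f"PAGE SOURCE:\n{page_source}\n\n"
        f"RECORDED STEPS:\n{json.dumps(recorded_steps, indent=2)}"
    )
    return {
        "steps": recorded_steps,    # what the tester recorded in the session
        "context": ask_llm(prompt)  # LLM-provided contextual information
    }

report = enrich_report(
    [{"action": "click", "target": "Add to cart"},
     {"action": "click", "target": "Checkout"}],
    "<html>...simplified page source...</html>",
)
print(report["context"])
```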
Results
A quasi-experiment was conducted to evaluate user perception of the enriched reports. Participants found the LLM-enhanced reports more meaningful and easier to understand.
Potential applications
Although this solution was developed for Scout, the principles can be generalized to other tools and model-based testing environments. Automated context enrichment could support better analysis and reporting across the board.
More information:
T.A.R.G.E.T. Research Project – Blekinge Institute of Technology