Is Unicorn real magic?
Unicorn vs. other LLMs – a performance overview
Author: Paweł Wnuk, PhD
At Yosh.AI, we are relentlessly committed to a deep understanding of technology and making data-driven decisions. Following the launch of Google’s latest Large Language Model (LLM), text-unicorn, our R&D team embarked on a comprehensive study to quantitatively gauge the model’s capabilities and benchmark it against other existing LLMs.
In our initial assessment, we homed in on the task of factual summarization, which let us measure the model's propensity for generating "hallucinations", that is, misleading or unsupported content. For this purpose, we harnessed a specialized classifier designed to detect such occurrences (https://github.com/vectara/hallucination-leaderboard). Our findings are highlighted below:
The data clearly indicates that Google’s unicorn model has made significant strides in reducing “hallucinations,” yet it still trails behind OpenAI’s flagship models.
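For readers who want to try this kind of check themselves, here is a minimal sketch of scoring candidate summaries with the open-source classifier behind that leaderboard. It assumes the model can be loaded as a sentence-transformers cross-encoder, as its Hugging Face model card documented at the time; the source/summary pairs are invented for illustration.

```python
from sentence_transformers import CrossEncoder

# Vectara's hallucination evaluation model (HHEM), published alongside the
# leaderboard linked above. It scores a (source, summary) pair for factual
# consistency: scores near 1 mean the summary is supported by the source,
# scores near 0 mean it is likely hallucinated.
model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The quarterly report states that revenue grew 4% year over year."
faithful = "Revenue grew 4% compared to the previous year."
hallucinated = "Revenue doubled thanks to a new product line."  # unsupported claim

scores = model.predict([[source, faithful], [source, hallucinated]])
print(scores)  # expect a high score for the faithful summary, a low one for the other
```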
In the subsequent phase, we scrutinized how the unicorn model performs on specific tasks. For this evaluation we employed the LLM-as-a-judge framework from "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (https://arxiv.org/abs/2306.05685). Here are the results for a selection of tasks:
Our analysis reveals that while text-unicorn does not excel at coding tasks, Google has addressed this gap by providing a dedicated coding LLM (code-bison). On the remaining tasks, however, the unicorn model is highly competitive with, or nearly on par with, GPT-4, even leading the pack in the Extraction category. It is also noteworthy that text-unicorn shows markedly improved reasoning compared to its predecessors (such as text-bison and chat-bison).
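To make the method concrete, here is a minimal sketch of single-answer grading in the LLM-as-a-judge style: a strong judge model is shown a question and a candidate answer and asked to return a 1–10 rating in a fixed format that the harness then parses. The prompt follows the pattern described in the paper; `call_llm` is a hypothetical placeholder for whichever judge model and API client you use (the paper's experiments use GPT-4).

```python
import re

# Single-answer grading in the LLM-as-a-judge style. `call_llm` is a
# hypothetical stand-in for a call to a strong judge model; plug in your
# own API client.

JUDGE_PROMPT = """Please act as an impartial judge and evaluate the quality of \
the response provided by an AI assistant to the user question displayed below. \
Begin your evaluation with a brief explanation, then rate the response on a \
scale of 1 to 10 by strictly following this format: "[[rating]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""


def judge(question: str, answer: str, call_llm) -> float | None:
    """Grade one model answer with the judge model; return the 1-10 score."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None  # None if parsing fails
```

Averaging such scores per task category yields a per-task comparison like the one above.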
A critical factor for Europe and other non-English-speaking regions is how LLMs perform in languages other than English. Many open-source LLMs are trained predominantly on English-centric datasets, whereas the leading LLM providers invest considerable resources in broad language coverage. To measure LLM quality across languages, we at Yosh.AI have developed proprietary evaluation datasets. Below we present a comparative analysis of LLM proficiency using Polish as a test case:
Remarkably, the unicorn model's performance in Polish closely parallels its English benchmark results, and its Reasoning scores appear to surpass those of the other LLMs we compared. This is particularly significant for tasks that require drawing conclusions from textual data and planning subsequent actions.