LLM Summary

Author: Liz

    1. The Platform for LLM evals
    2. LLM Organization and Product

1. The Platform for LLM evals

1.1. LMSYS

Organization:
LMSYS and UC Berkeley SkyLab

Evaluation Method:
Chatbot Arena - a crowdsourced, randomized battle platform.
Evaluates LLMs by human preference in the real world.
Ask any question to two anonymous models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!

Evaluation Metric:
Arena Elo
Elo, short for Elo rating system, is named after its inventor, Hungarian-American physicist Arpad Elo. It was originally developed for ranking chess players in the 1960s.
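
To make the rating idea concrete, here is a minimal sketch of the classic Elo update for a single head-to-head vote. The K-factor of 32, the starting rating of 1000, and the function names are assumptions chosen for illustration; the way Chatbot Arena actually computes its leaderboard scores may differ from this textbook form.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two models start at the same (assumed) rating; model A wins the vote.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

With equal starting ratings the expected score is 0.5, so a win moves each model by half of K in opposite directions. An upset win against a much higher-rated model moves the ratings further.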

Website:
https://chat.lmsys.org/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

1.2. LiveBench

Organization:
Abacus.AI

Properties:

  • LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
  • Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
  • LiveBench currently contains a set of 18 diverse tasks across 6 categories, and its maintainers plan to release new, harder tasks over time.
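
The idea of scoring against objective ground truth, with no LLM judge, can be sketched as a simple exact-match check. This is only an illustration: the function names, the normalization rule, and the sample questions are hypothetical, and LiveBench's real scoring functions are task-specific and may look quite different.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(answer.strip().lower().split())


def score(predictions: dict, ground_truth: dict) -> float:
    """Fraction of questions whose predicted answer matches the ground truth exactly.

    Both arguments map a question id to an answer string.
    A missing prediction counts as wrong.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, truth in ground_truth.items()
        if normalize(predictions.get(qid, "")) == normalize(truth)
    )
    return correct / len(ground_truth)


# Hypothetical data: one correct answer (after normalization) and one wrong answer.
truth = {"q1": "42", "q2": "Paris"}
preds = {"q1": " 42 ", "q2": "London"}
print(score(preds, truth))
```

Because the check is deterministic, the same model output always receives the same score, which is the property that removes the need for a judge model.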

Website:
https://livebench.ai/

1.3. Fine-tuning Index

The Fine-tuning Leaderboard compares the performance of GPT-4 with popular open-source models that Predibase fine-tuned across a series of tasks.

Remarkably, most of the fine-tuned open-source models surpass GPT-4, with Llama-3, Phi-3, and Zephyr demonstrating the strongest performance.

Website:
https://predibase.com/fine-tuning-index

1.4. SuperCLUE

A leaderboard for domestic (Chinese) LLMs.

2. LLM Organization and Product

| Organization | Product | Open Source | Location |
|---|---|---|---|
| Foreign | | | |
| OpenAI | GPT | Closed | US, UK |
| Google | Gemini/Bard/Gemma/PaLM | Open | - |
| Anthropic | Claude | Closed | US, UK |
| Meta | Llama/Alpaca | Open | - |
| Microsoft | Phi/WizardLM/Bing | Open | - |
| Mistral | Mistral/Mixtral | Open | US, France |
| HuggingFace | Zephyr | Open | - |
| Cohere | Command R | Open | - |
| NousResearch | Nous/OpenHermes | Open | - |
| LMSYS | Vicuna/FastChat | - | - |
| Reka AI | Reka | Open | US, UK, Singapore |
| Nvidia | Nemotron/NV/ChipNeMo | Open | - |
| Nexusflow | Starling | Open | Palo Alto, CA |
| Databricks/MosaicML | DBRX/Dolly/MPT | Open | Many |
| OpenChat | OpenChat | - | - |
| Snowflake | Snowflake | Closed | - |
| UC Berkeley | Starling/Koala/Gorilla | Closed | - |
| Perplexity AI | pplx | Closed | - |
| Cognitive Computations | Dolphin | Open | Personal |
| Upstage AI | SOLAR | Open | Korea |
| TII | Falcon | Open | Abu Dhabi, UAE |
| Together AI | StripedHyena | Open | San Francisco |
| Allen AI | Tulu/OLMo | Open | Seattle, WA, United States |
| Nomic AI | GPT4All | Open | New York |
| RWKV | RWKV | Open | - |
| OpenAssistant | OpenAssistant | Open | - |
| Stability AI | StableLM | Open | Canada |
| Bloomberg | BloombergGPT | Closed | US, UK |
| inflection.ai | Inflection | Closed | San Francisco Bay Area |
| xAI (Elon Musk) | Grok | Closed | San Francisco Bay Area, California, U.S. |
| Scale | Scale | Closed | San Francisco |
| Character AI | Character | Closed | Menlo Park, CA |
| Domestic | | | |
| Alibaba | Qwen | Open | Hangzhou |
| Tsinghua/Zhipu AI | GLM/ChatGLM | Open | Beijing |
| Baichuan | Baichuan | Open | Beijing |
| ModelBest | CPM | Open | Beijing |
| 01 AI | Yi | Open | Beijing |
| DeepSeek AI | DeepSeek | Open | Hangzhou |
| Colossal AI | Colossal | Open | Beijing |
| Moonshot | Moonshot | Closed | Beijing |
| Step | Step | Closed | Shanghai |
| MiniMax | ABAB | Closed | Shanghai |
| Baidu | ERNIE | Closed | Beijing |
| SenseTime | SenseChat | Closed | Shanghai |
| Bytedance | Doubao/Coze | Closed | Beijing |
| Tencent | Hunyuan | Closed | Shenzhen |
| 360 | 360gpt | Closed | Beijing |
| XVERSE | XVERSE | Open | Shenzhen |