Claude 3 Surpasses GPT-4: Who Will Dominate the Future of the AI Wave?

πŸ˜€

Quick Overview of the Entire Text

πŸ” From Zero to One: Anthropic Growth History and Founding Team Prospectus πŸ“ˆ Claude3 Model Evaluation: Performance, Features, and Competitiveness Analysis πŸ’₯ Comparison of 8 Mainstream Large Model Evaluation Methods: Analyzing Evaluation Methods and Framework Differences πŸš€ Future Prospects of Large Models: Anthropic and Industry Competitive Landscape

01 The 'Guardian Alliance': An Anthropic Company Primer

Company Background

Anthropic is an artificial intelligence company based in San Francisco, California, founded in 2021. It focuses on developing general-purpose AI systems and language models while adhering to principles of responsible AI use. The company aims to build reliable, interpretable, and controllable AI, and is committed to bringing new perspectives and solutions to the industry through technological innovation.

Core Products

Anthropic's core products are the Claude series models, especially the Claude 3 model family: Claude 3 Haiku, Sonnet, and Opus. The Claude 3 models offer a 200K-token context window (with inputs exceeding 1 million tokens available to select customers) and add multimodal capabilities.

Founding Team

The founders include Dario Amodei (former Vice President of Research at OpenAI) and his sister Daniela Amodei. Team members include Jared Kaplan, Sam McCandlish, Tom Brown (first author of the GPT-3 paper), and several others who were involved in the development of GPT-2.

Strategic Cooperation and Financing

In 2023, Anthropic's valuation doubled to $15 billion. Over the past year, it completed five rounds of financing totaling approximately $7.3 billion. Investors include Google, Salesforce, Amazon, and South Korea's SK Telecom. Amazon has invested as much as $4 billion in Anthropic and Google over $2 billion, with both holding minority stakes.

Revenue and Market Competition

According to The Information, OpenAI's annualized revenue had surpassed $1.6 billion by the end of 2023. Anthropic's monthly revenue in 2023 was around $8 million; it is expected to grow roughly eightfold this year, with annual revenue forecast to exceed $500 million by the end of 2024 as the Opus model drives paid-membership growth. At that pace, Anthropic is poised to hit, or even surpass, its annual revenue target ahead of schedule.

02 The Romance Behind the Claude 3 Series Names

The release of the Claude 3 series has left a deep impression. The series launched three products at once, each with its own features and functions. The names also carry a romantic touch, as if to suggest that these AI products are not merely the result of technology but also contain elements of humanity and emotion, which makes their future performance and development all the more anticipated.

🌌 Haiku

The term 'Haiku' originates from the Japanese short poetry form, typically consisting of three lines totaling seventeen syllables, arranged in a 5-7-5 structure. Haiku is renowned for its brevity, subtlety, and profound depiction of natural beauty. Applying this name to a Claude 3 model may suggest the model's expertise in generating concise and evocative text, embodying the beauty of ultimate simplification and refinement in language, as well as the ability to capture profound insights in brief words.

πŸ“– Sonnet

"Sonnet" refers to a fourteen-line poem, a form that originated in Italy and spread to countries like England and France, becoming an important literary form for expressing love and nature. Known for its strict structure, rhyme scheme, and rhythm, the sonnet is often used to convey profound emotions and complex thoughts. Using "Sonnet" as the name for the Claude3 model series may suggest that this model has a special ability in dealing with the structural, rhythmic, and emotional aspects of language, emphasizing the combination of artistry and expressiveness.

🎨 Opus

"Opus" is a Latin word meaning "work," widely used in the fields of music, literature, and art to refer to the works of creators, especially in the music field, often used to number and classify the works of composers. Using "Opus" as the name for the Claude3 model series may symbolize that the model is the culmination of Anthropic's creativity and technological achievements. It may represent the model's outstanding ability to generate creative, complex, and deep content, emphasizing that each output is unique and artistic in nature.

03 How to Read the 'God-Tier' Benchmark Chart Showing Claude 3 Surpassing GPT-4

⭐

Let's start with the conclusion. Based on these evaluation results, taking the Claude 3 Opus model as an example, there is no significant gap versus GPT-4 in general text, reasoning, and cross-domain knowledge understanding.

However, there is a significant improvement in abilities such as code and math. On the MGSM multilingual grade-school math benchmark, Claude 3 Opus achieved 90.7% accuracy 0-shot, a substantial jump over the 74.5% accuracy GPT-4 achieved 8-shot.

This also suggests that the Claude 3 series may deliver better performance in certain professional fields such as finance and healthcare.
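The '0-shot' and '8-shot' labels refer to how many worked examples are placed in the prompt before the real question: 0-shot gives the model only the question, while 8-shot prepends eight solved examples. A minimal sketch of the difference, where the exemplar and the questions are invented placeholders:

```python
# Sketch: building 0-shot vs. few-shot prompts for a math word problem.
# The exemplar below and the questions are invented placeholders.

EXEMPLARS = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?", "5"),
    # ... more solved examples would follow for a true 8-shot prompt
]

def build_prompt(question: str, shots: int = 0) -> str:
    parts = [f"Q: {q}\nA: {a}" for q, a in EXEMPLARS[:shots]]  # 0 shots -> no examples
    parts.append(f"Q: {question}\nA:")  # the question we actually want answered
    return "\n\n".join(parts)

zero_shot = build_prompt("A train travels 60 km in 1.5 hours. What is its speed?", shots=0)
eight_shot = build_prompt("A train travels 60 km in 1.5 hours. What is its speed?", shots=8)
```

All else being equal, 0-shot is the tougher setting, which is why the comparison above is notable.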

How should we understand the testing of these large models?

Large-model testing methodologies fall into two main categories: automatic evaluation and manual evaluation. Automatic evaluation computes results programmatically and scales well to large amounts of data, while manual evaluation relies on the judgment of human experts; it is slower but can provide higher accuracy.
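For intuition, automatic evaluation usually amounts to running the model over a labeled dataset and computing a metric with no human in the loop. A minimal sketch, where `model_answer` is a hypothetical stand-in for the model under test:

```python
# Minimal sketch of automatic evaluation: exact-match accuracy over a dataset.
# `model_answer` is a hypothetical stand-in for the model being evaluated.

def model_answer(question: str) -> str:
    raise NotImplementedError  # call the model under test here

def exact_match_accuracy(dataset: list[tuple[str, str]]) -> float:
    """`dataset` is a list of (question, gold_answer) pairs."""
    correct = sum(
        model_answer(question).strip().lower() == gold.strip().lower()
        for question, gold in dataset
    )
    return correct / len(dataset)
```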

MMLU, GPQA, MATH, MGSM, HumanEval, DROP (scored with F1), BIG-Bench-Hard, ARC-Challenge, HellaSwag, and other test suites each evaluate a large model's capabilities in a specific domain.

For example, MMLU is an English-language evaluation suite containing 57 multiple-choice tasks, covering fields from elementary mathematics to American history. GSM8K is a mathematical-reasoning benchmark released by OpenAI, consisting of 8.5K high-quality, linguistically diverse grade-school math word problems. The breadth of these benchmarks lies in their ability to evaluate capabilities across many NLP tasks, from mathematical reasoning to language comprehension.
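As a concrete example of the automatic-scoring style, GSM8K reference solutions end with a line of the form `#### <answer>`, and a common (if rough) heuristic is to compare that number with the last number in the model's output. A hedged sketch:

```python
import re

# GSM8K gold solutions end with a line like "#### 42". Model output is free text,
# so a common heuristic is to take the last number it produces.

def gold_answer(solution: str) -> str:
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> str | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(model_output: str, solution: str) -> bool:
    pred = predicted_answer(model_output)
    return pred is not None and float(pred) == float(gold_answer(solution))
```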

The purpose of evaluation is to quantify the abilities of large models on specific tasks or dimensions so that models can be compared and selected. These tests help developers understand the effectiveness of different technical routes and methods, and thus gauge a model's current level of development and its gap with the leading models.

The evaluation process typically includes data collection, model training, automatic or manual evaluation, and result analysis. Testing may also involve innovative methods, such as adaptive dynamic testing, to improve the overall quality of large-model benchmarking.

Mainstream large-model evaluation methods

  1. MMLU (Massive Multitask Language Understanding)

MMLU is a test designed to assess the cross-domain knowledge understanding and reasoning abilities of models. It draws on more than 10,000 multiple-choice questions spanning 57 subjects, from literature and history to advanced science, to form a comprehensive cross-disciplinary evaluation framework.

Each question is designed to test in-depth knowledge in a specific field, requiring models to not only have broad knowledge but also the ability to deeply analyze and apply this knowledge. Through a detailed scoring mechanism, MMLU can reveal subtle differences in models' understanding and reasoning of diverse, specialized content, providing a complex and comprehensive benchmark for comparing different models.

πŸ“‘ Exploring and Predicting Transferability across NLP Tasks[1]
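For context, each MMLU item is a question with four answer options, and scoring checks whether the model picks the correct letter. A small sketch of how such an item might be turned into a prompt; the item itself is invented for illustration rather than taken from the dataset:

```python
# Sketch: formatting an MMLU-style multiple-choice item into a prompt.
# The item below is invented for illustration, not a real MMLU question.

item = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "answer": "B",
}

def format_mcq(item: dict) -> str:
    letters = "ABCD"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_mcq(item)
# Scoring then checks whether the model's reply matches the gold letter, here "B".
```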

  2. GPQA (Graduate-Level Google-Proof Q&A)

GPQA is a benchmark of difficult multiple-choice questions written and validated by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof": even skilled non-experts with unrestricted web access struggle to answer them correctly, so surface-level retrieval is not enough. As a result, the benchmark probes a model's deep domain knowledge and multi-step scientific reasoning, which is why it features prominently in frontier-model comparisons such as the Claude 3 evaluation chart.

  3. MATH

MATH is a test specifically designed to assess and compare the abilities of different models in solving various mathematical problems. This assessment covers a wide range of problems from basic arithmetic to more complex advanced mathematical problems, including thousands of problem instances, each aimed at examining the models' problem-solving strategies and capabilities in specific mathematical domains. This test not only evaluates the models' ability to provide correct answers but also examines the correctness and efficiency of their reasoning processes and problem-solving methods. By comprehensively covering different difficulty levels and domains, the MATH test provides deep insights into models' mathematical cognition and logical reasoning abilities, serving as a key benchmark for measuring the potential application of AI in the field of mathematics.

πŸ“‘ Measuring Mathematical Problem Solving With the MATH Dataset[2]
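Worth knowing when reading MATH scores: reference solutions in the dataset wrap the final result in `\boxed{...}`, so automated graders typically extract that expression from both the reference and the model output and compare them. A rough sketch, leaving out the answer normalization that real graders also apply:

```python
# Sketch: extracting the \boxed{...} answer from a MATH-style solution.
# Real graders additionally normalize equivalent expressions (e.g. 1/2 vs 0.5).

def extract_boxed(text: str) -> str | None:
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    depth, i, answer = 0, start + len(r"\boxed{"), []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return "".join(answer)
            depth -= 1
        answer.append(ch)
        i += 1
    return None  # unbalanced braces

def is_correct(model_output: str, reference_solution: str) -> bool:
    pred, gold = extract_boxed(model_output), extract_boxed(reference_solution)
    return pred is not None and pred == gold
```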

  4. HumanEval

HumanEval aims to evaluate the model's ability in programming and code comprehension, with a special focus on its practicality in solving real programming problems. By providing hundreds of programming questions covering different levels of difficulty and programming concepts, HumanEval tests aim to simulate real-world programming challenges, requiring models to not only generate effective code but also engage in logical reasoning and solution optimization for the given problems. This test comprehensively evaluates the model's code generation, problem understanding, and innovative solution capabilities through detailed test cases and scoring criteria, providing an in-depth understanding of the model's potential in the field of software development.
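HumanEval scoring is execution-based: the generated function is run against hidden unit tests, and a problem only counts as solved if every assertion passes. A simplified sketch of that check; real harnesses sandbox the execution, enforce timeouts, and report pass@k over multiple samples:

```python
# Simplified sketch of HumanEval-style execution scoring. Real harnesses run the
# candidate code in a sandbox with timeouts and aggregate pass@k over samples.

def passes_tests(generated_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run assert-based tests against it
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True only if all assertions pass
```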

  5. DROP (Discrete Reasoning Over the content of Paragraphs)

The DROP test aims to evaluate a model's ability to understand and reason over complex textual content, especially in tasks such as numerical reasoning, reference tracking, and event sequencing. The evaluation uses a dataset covering a wide range of topics and containing thousands of questions, requiring the model to grasp detailed information in the text and perform complex reasoning.

πŸ“‘ DROP: A Reading Comprehension Benchmark Requiring Discrete...[3]
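DROP results are usually reported as a token-level F1 score: the predicted answer and the gold answer are split into tokens, and precision and recall are computed over their overlap. A minimal sketch of that metric; the official scorer adds number and article normalization and handles multi-span answers:

```python
# Sketch of DROP-style token-level F1. The official scorer adds normalization
# (articles, numbers, punctuation) and multi-span handling, omitted here.

from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("approximately 25 percent", "25 percent"))  # 0.8, partial credit
```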

  6. BIG-bench Hard

BIG-bench Hard focuses on testing the performance of models on specific, usually more challenging tasks, including lexical reasoning, mathematical reasoning, common-sense judgment, etc. The test aims to challenge the limits of models and examine their strategies for dealing with high-difficulty questions. It was jointly created by multiple research institutions and individuals.

  7. ARC-Challenge (AI2 Reasoning Challenge)

ARC-Challenge focuses on evaluating models' ability to understand scientific texts and answer related questions, which cover multiple fields such as physics, biology, etc., requiring models to have good comprehension and reasoning abilities.

πŸ“‘ Think you have Solved Question Answering? Try ARC, the AI2...[4]

  8. HellaSwag

HellaSwag is an evaluation method designed to test models in understanding everyday, colloquial texts (such as stories, Wikipedia articles) and predicting their continuations. By testing models in very natural contexts, HellaSwag challenges the ability of models to understand and generate coherent, reasonable, and stylistically consistent texts.
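HellaSwag is scored as a four-way multiple-choice task: each candidate ending is appended to the context, and the model's preferred continuation is compared with the human-written one. A schematic sketch, where `score_continuation` is a hypothetical stand-in for a log-likelihood computed by the model under test:

```python
# Schematic HellaSwag-style scoring: pick the ending the model finds most
# plausible. `score_continuation` is a hypothetical stand-in for a real
# log-likelihood call against the model under test.

def score_continuation(context: str, ending: str) -> float:
    raise NotImplementedError  # e.g. sum of token log-probs of `ending` given `context`

def predict_ending(context: str, endings: list[str]) -> int:
    scores = [score_continuation(context, e) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)
```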

A one-table summary of the mainstream large-model evaluation methods

| Evaluation Method | Evaluation Objective | Methodology | Creator |
| -------------- | ------------------- | ------------- | ----------------------- |
| MMLU | Cross-domain knowledge understanding | Multiple-choice questions | Hendrycks et al. (UC Berkeley) |
| GPQA | Graduate-level science question answering | Expert-written multiple-choice questions | Rein et al. |
| MATH | Mathematical problem solving | Math problem set | Dan Hendrycks et al. |
| HumanEval | Programming problem solving | Programming tasks | OpenAI |
| DROP | Textual content reasoning | Complex question answering | Allen Institute for AI |
| BIG-bench-Hard | Language model challenge | Diversified testing | Google Research |
| ARC-Challenge | Scientific reasoning ability | Scientific question answering | Allen Institute for AI |
| HellaSwag | Language understanding challenge | Complete sentences/paragraphs | Allen Institute for AI |
| GSM8K | Grade-school math word problems | Dataset of multi-step math questions | OpenAI |
| MGSM | Mathematical problem solving across languages | Math problems written in different languages | |

04 Claude 3 Sonnet vs GPT-4: Hands-On Evaluation

Anthropic released the Claude 3 model series, which redefines industry standards across numerous cognitive tasks. The series includes three models in increasing order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Currently, Claude 3 Sonnet can be used on Poe.

We tested along the following dimensions:

  • Code reproduction ability
  • LinkedIn copywriting ability
  • Image recognition and analysis ability
  • Ability to understand and output complex prompts

Code reproduction ability

I first asked it to reproduce the OpenAI GPTs page, but it politely declined; Sonnet only wanted to explain its UI design principles and experience to me.

Later, I switched to a Midjourney page, and it started writing code diligently.

If rated on a scale of 0-10, it would be a 3 at best. The webpage's main structure includes navigation, card layout, and Footer, and even thoughtfully used placeholder images for me.

However, compared to the website, many elements are still missing:

  1. The top navigation elements are incorrect.

  2. The left navigation has disappeared entirely.

  3. The card has added the creator's name and two buttons (although the buttons were given a click style, points are still deducted).

  4. The waterfall style of the card layout has not been reproduced.

Later, I worried that the Midjourney page only looked that way because I wasn't logged in to an account, so I tried Medium instead.

On the left is the page rendered from the generated code. I still haven't figured out where this green came from (presumably extracted from the theme color), nor how it came up with the slogan, but I quite like 'Stay Curious'.

I had GPT-4 do these two cases as well, and its reproduction ability is slightly inferior to Claude 3 Sonnet's. Claude 3 Sonnet takes the point in this round.

Summary

In terms of understanding, completeness, and handling of complexity, Claude 3 Sonnet's UI reproduction is currently well ahead of GPT-4's: if Sonnet is in sixth grade, GPT-4 is in third grade.

I tested using Poe this time, and Sonnet on Poe cannot upload images yet (only PDF files), so I shared the website link instead. I wonder whether that affected Sonnet's performance.

LinkedIn copywriting skills

Today, after Claude 3 was officially announced on the Anthropic website, I had Sonnet write a LinkedIn post based on the official announcement article, and had GPT-4 do the same.

From the comparison, GPT-4's output is longer and more in-depth, but the overall difference is not significant; the two are basically on par.

Image recognition and analysis ability

I personally prefer GPT-4's results: its description of the image tends to state facts in more detail, including the annotations and the background image.

If I were blind, listening to GPT-4's text would help me imagine the visualization better. Claude 3 Sonnet offers more of the conclusions it draws and comments on the presentation of the visualization.

**Claude 3 Sonnet:** This illustration vividly showcases the most valuable brands from 12 different countries. These brands are represented by circular icons displaying their logos, accompanied by their brand values in dollars. At the center of the image is Apple, with a staggering brand value of $517 billion, ranking first globally, visually highlighted to emphasize its position as the world's most valuable brand. Surrounding Apple are other well-known brands from various countries and industries. Particularly notable are Samsung from South Korea ($99 billion), Louis Vuitton from France ($32 billion), Nestle from Switzerland ($21 billion), TikTok from China ($84 billion), Deutsche Telekom from Germany ($73 billion), and Toyota from Japan ($53 billion). These brands are scattered and each icon embeds the respective country's flag, making it easy to identify the origin of each brand. This image effectively showcases the value and relative positions of major brands from different countries and sectors worldwide.

**GPT-4:** This infographic titled '12 Most Valuable Brands by Country' displays a series of multinational company logos with accompanying numbers representing their respective brand values in billions of dollars. The brands and their corresponding brand values are listed here, divided by their respective countries:

  β€’ Apple (USA) - $517 billion
  β€’ Samsung (South Korea) - $99 billion
  β€’ TikTok (China) - $48 billion
  β€’ Toyota (Japan) - $53 billion
  β€’ Deutsche Telekom (Germany) - $73 billion
  β€’ Shell (UK) - $50 billion
  β€’ Saudi Aramco (Saudi Arabia) - $42 billion
  β€’ Tata Group (India) - $29 billion
  β€’ Louis Vuitton (France) - $32 billion
  β€’ Nestle (Switzerland) - $21 billion
  β€’ TD Bank (Canada) - $19 billion
  β€’ Accenture (Ireland) - $41 billion

The infographic also specifically mentions that in the past year, Apple's brand value has grown by 74%, once again becoming the most valuable brand in the world. The caption below the chart clearly states that the displayed values are estimates of company brand values, different from other evaluation methods such as market value, sourced from Brand Finance. Countries are identified by placing their flags next to or near their respective brands. The entire chart background features a dark world map.

Complex prompt understanding and output ability

I used a very complex prompt with a set of rules. The prompt is about 1,900 words long and asks the model to study the Claude 3 introduction articles on the official website and then develop a detailed Claude 3 brand campaign, including a 12-month roadmap, corresponding monthly plans, and assessment dimensions for the different campaign actions.

From the results, GPT-4 output 633 words and Sonnet output 624 words; the content lengths are similar.

Then I fed both answers back to each model for scoring. GPT-4 won by a one-point margin in both scorings. During the scoring process, GPT-4 was also more logical, providing scoring criteria and detailed scores.

We will test more cases later. If you are interested, please reply 'X' to the WeChat ID for continuous tracking.

05 Will the Competition Keep Heating Up? Expectations for GPT-5 Reignited

In the world of technology, every innovation brings unknown possibilities and new explorations. The recent launch of the Claude 3 series has undoubtedly ignited a new spark of competition in the large-model field, once again fueling anticipation for a peak battle in AI.

Some of Claude 3's notable features have also sparked our curiosity about further commercial exploration. We have always believed that the road to technological progress should not be dominated by only a few players. Since OpenAI emerged as a focal point of the AI world with its unique appeal, a stream of events has kept it in the public eye; the most eye-catching recent one is undoubtedly the Musk lawsuit, which has further heightened outside interest in this field.

We eagerly look forward to the arrival of Artificial General Intelligence (AGI) and hope to see AI's innovations in safety bring more benefits to society, building a better future.

At the same time, more outstanding competitors are flexing their muscles. We look forward to the release of GPT-4.5 and GPT-5, and will continue to follow the development of and breakthroughs in large-model technology, as well as the diversification and evolution of the competitive landscape. The future AI world should be a hotbed of diverse voices and innovative thinking, where every advancement deserves careful observation and in-depth exploration.

  • Cover Image Prompt

minimalistic, a man is thinking hard in a field, flower, green farm, free flowing, stylized digital illustration, with a grain texture, on light green color background --ar 16:9 --style raw

By AI Assistant Midjourney

Reference

[1] Exploring and Predicting Transferability across NLP Tasks: https://arxiv.org/abs/2005.00770

[2] Measuring Mathematical Problem Solving With the MATH Dataset: https://arxiv.org/abs/2103.03874

[3] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs: https://arxiv.org/abs/1903.00161

[4] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge: https://arxiv.org/abs/1803.05457


πŸ’‘

Fellow friends interested in this topic

Welcome to join us in exploring and exchanging ideas together

In the new world where AI and cutting-edge technology continue to emerge

We explore together along the way

πŸ€– ❀️ ✨

Summary
Anthropic is an AI company based in San Francisco, California, founded in 2021, that focuses on developing general-purpose AI systems and language models under responsible-AI principles. Its core product is the Claude model series, with the Claude 3 family (Haiku, Sonnet, and Opus) offering a 200K-token context window, inputs of over 1 million tokens for select customers, and multimodal capabilities. The founding team includes Dario Amodei and Daniela Amodei, alongside members such as Jared Kaplan, Sam McCandlish, and Tom Brown (first author of the GPT-3 paper), several of whom worked on GPT-2. Anthropic has received significant funding from companies like Google and Amazon, reaching a $15 billion valuation in 2023. The Claude 3 models show clear gains over GPT-4 in areas such as code and math. Large models are tested through automatic and manual evaluations focused on specific domains like math and language understanding; these evaluations help developers judge the effectiveness of different technical approaches and the models' capabilities across tasks.