The second half of the large-model era welcomes the Cambrian explosion of Agents.
Between roughly 540 and 360 million years ago, marine life erupted: compared with the earlier single-celled or simple multicellular organisms, far more complex creatures such as trilobites, sponges, and chordates were born. Much like that Cambrian explosion of life, experimental products such as BabyGPT, AutoGPT, and Generative Agents have appeared one after another.
From an evolutionary perspective, life develops mainly through two routes: enhancement of the individual unit and enhancement of the organization. The two complement each other, giving life ever more diverse and complex forms.
The same applies to the Agent: we hope it can think independently and interact with its environment within any system.
Now that it has a "brain" with enough IQ, how do we make the Agent think and act like a human, so that given any goal it can solve all kinds of problems on its own? Should we further raise the IQ of the "individual" unit to strengthen the Agent, or strengthen its "organizational" ability with the help of external modules?
Although today's Agents cannot complete general tasks, nor form a dynamically stable body like the socialized division of labor among cells, local modules of individual Agents, such as HuggingGPT, have already demonstrated the ability to use tools, and Plugins have become an important milestone toward practical deployment. The second half of the large model will be the Cambrian moment when Agents land.
Where is the bottleneck for landing Agents right now? Can they go from special-purpose to general-purpose? What will multimodality bring to Agents? How will the landscape evolve?
Just like the first cell born on Earth: even though today's Agent cannot replace us in practical work, everything grows from that first cell, which is the starting point of Agent evolution.
Even if you are confused about how Agents will land, you should keep "emerging".
Because the success or failure of the Agent will decide whether this GPT revolution becomes a new industrial revolution.
Below, we think through this in a structured way: where will the Agent go?
1. What exactly is an AI Agent?
A few days ago, the "AI town" with 25 Agents was officially open-sourced, and a "Westworld"-style AI town was built with it. The interactions among AI Agents play out the evolution of an entire civilization.
Andrej Karpathy, co-founder of OpenAI, also exclaimed: "AI Agents represent a crazy future."
So what is an Agent? The word comes from the Latin "agere", meaning "to do". In the context of LLMs, an Agent can be understood as an entity that can independently understand, plan, make decisions, and execute complex tasks.
An Agent is not an upgraded ChatGPT. It not only tells you how to do something, it does it for you. If Copilot is the co-pilot, the Agent is the one in the driver's seat.
A simplified Agent decision process, written as a flow:
Agent: Perception → Planning → Action
Much like how humans get things done, the core workings of an Agent can be summarized as a cycle of three steps: Perception, Planning, and Action.
Perception is the Agent's ability to collect information from the environment and extract relevant knowledge from it. Planning is the decision-making process the Agent carries out for a given goal. Action is the behavior it takes based on the environment and the plan.
Within this cycle, the Policy is the core by which the Agent decides on an Action, and the Observation of each Action becomes the premise and foundation of further Perception, forming a self-contained closed learning loop.
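To make this loop concrete, here is a minimal runnable sketch of the Perception, Planning, Action cycle. Everything in it is an illustrative assumption: the environment is a toy counter and `plan` stands in for an LLM call; none of these names come from any real framework.

```python
# Minimal sketch of the Perception -> Planning -> Action loop described above.
# The environment is a toy counter; `plan` stands in for an LLM call.

class ToyEnv:
    """Toy environment: the goal is to count up to a target number."""
    def __init__(self, target: int):
        self.target, self.state = target, 0

    def observe(self) -> int:            # Perception: expose the current state
        return self.state

    def step(self, action: int) -> int:  # Action: apply it, return the new observation
        self.state += action
        return self.state


def plan(goal: int, observation: int) -> int:
    # Planning: in a real Agent this would be an LLM call; here it is a fixed rule.
    return 1 if observation < goal else 0


def run_agent(env: ToyEnv, goal: int, max_steps: int = 10) -> list:
    memory = []                          # past (action, observation) pairs
    obs = env.observe()
    for _ in range(max_steps):
        action = plan(goal, obs)         # decide from goal + latest Observation
        obs = env.step(action)           # act and receive Feedback from the environment
        memory.append((action, obs))     # close the learning loop
        if obs >= goal:
            break
    return memory


print(run_agent(ToyEnv(target=3), goal=3))  # [(1, 1), (1, 2), (1, 3)]
```

In a real Agent, `plan` would be a prompted LLM and the environment could be anything from a browser to a robot; the shape of the loop stays the same.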
This process is like Marxism's "theory of practice": "Understanding begins with practice; through practice one gains theoretical understanding, which then returns to practice." The Agent likewise evolves through the unity of knowledge and action.
A more complete Agent must interact fully with its environment, which involves two parts: the Agent itself and the environment. In this picture the Agent is like a "human" in the physical world, and the physical world is the human's "external environment".
Imagine how humans interact with the external world: based on our full perception of it, we infer its hidden state, combine our memory and knowledge of the world to make plans and decisions, and act; our actions in turn affect the environment, which gives us new feedback, and we combine our observations of that feedback to decide again, round after round.
The most intuitive formula:
Agent = LLM + Planning + Feedback + Tool use
Within Planning, besides the current state, there are also memories, past experience, reflections and summaries, and world knowledge.
By this measure, today's ChatGPT is actually not an Agent but a body of universal world knowledge, that is, a knowledge source for Planning. It does not operate on specific environmental conditions, nor on Memory, Experience, or Reflection.
Of course, ChatGPT can do logical reasoning and a certain amount of planning from its own knowledge; one can also attach a vector database to help with reasoning and add Reflection to enrich the process. In this way the end-to-end black box of ChatGPT can be made more explicit: symbols form a very explicit system, and on that basis it can be corrected and improved in a directed way.
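As an illustration of bolting Memory and Reflection onto a black-box LLM, here is a toy sketch. The "vector database" is faked with keyword overlap rather than real embeddings, and `llm` is a placeholder for an actual API call; every name here is an assumption for illustration.

```python
# Toy sketch: wrapping a black-box LLM with retrieval-based Memory and a
# simple Reflection step. Keyword overlap stands in for a real vector store.

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


class Memory:
    def __init__(self):
        self.entries = []

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 2) -> list:
        # Return the k stored entries most similar to the query.
        return sorted(self.entries, key=lambda e: similarity(query, e), reverse=True)[:k]


def plan_with_memory(llm, memory: Memory, query: str) -> str:
    context = "\n".join(memory.retrieve(query))   # ground Planning in past experience
    answer = llm(f"Context:\n{context}\n\nQuestion: {query}")
    memory.add(f"Q: {query} -> A: {answer}")      # Reflection: record the exchange
    return answer
```

Retrieval grounds the Planning step in past experience, and the final `memory.add` is the crude Reflection step: each exchange becomes retrievable context for the next one.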
As for Feedback, the Agent receives positive or trial-and-error feedback, staged results, or rewards based on its Actions. Feedback takes many forms. If ChatGPT chatting with us is viewed as an Agent, then the reply we type into the text box is a kind of Feedback, merely Feedback in textual form; at that moment, we are the environment for ChatGPT. RLHF is also an environment, an extremely simple one.
"Man is human because he can use tools."
An Agent can likewise extend its capabilities with external tools to handle more complex tasks; for example, an LLM can call a weather API to obtain forecast information. Even without calling external tools, the Agent can still deal with the environment directly through Action and Feedback by learning a Policy.
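A hedged sketch of that weather example: the LLM (represented by a caller-supplied `llm_decide` stub) decides whether a tool is needed, and the controller executes it. The endpoint and the decision format are hypothetical, not any real provider's interface.

```python
# Sketch of LLM tool use: the model picks a tool, the controller runs it.
import json
import urllib.request


def get_weather(city: str) -> dict:
    # Hypothetical endpoint; a real deployment would use an actual weather API.
    with urllib.request.urlopen(f"https://api.example.com/weather?city={city}") as r:
        return json.loads(r.read())


TOOLS = {"get_weather": get_weather}


def handle(query: str, llm_decide) -> str:
    decision = llm_decide(query)  # e.g. {"tool": "get_weather", "args": {"city": "Shanghai"}}
    if decision.get("tool") in TOOLS:
        result = TOOLS[decision["tool"]](**decision["args"])
        return f"Tool result: {result}"   # Action grounded in the external tool
    return decision.get("answer", "")     # no tool needed: the LLM answers directly
```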
It is clear that the Agent is the key to actively unleashing the LLM's potential: with the LLM at its core, the Agent gives it subjective initiative.
So how do we land LLMs today? The LLM is an IQ engine, and peripheral tools can be wired in through Prompts. Will the LLM become an end-to-end system in the future? If the surrounding tools are not enough, will a more general adaptation framework emerge?
2. Is the bottleneck for landing Agents that their "IQ" is not enough?
An Agent draws on two kinds of ability. One is the LLM as its "IQ", the "brain" part. The other sits on top of the LLM: an external controller that completes the various Prompts, such as enhancing Memory through retrieval, obtaining Feedback from the environment, and carrying out Reflection.
The Agent needs both the brain and the external scaffolding.
So is the LLM's own "IQ" insufficient, or is the external systematization insufficient?
If the external systematization is insufficient, that is a long-term problem. If it is merely a matter of IQ, then when GPT-4 becomes GPT-5 and the IQ rises, the earlier shortcomings will be made up.
So what is the main bottleneck for Agents?
To truly understand the crux, we can start with error attribution: for each actual mistake, determine clearly whether it should be attributed to the LLM itself or to the way it was prompted.
For example, ask a voice assistant "What's the weather like?" The question itself is ambiguous: the weather where? On which day? Which aspect of the weather do you want to know? None of this can be resolved by the LLM alone; it requires calling an external tool system.
If we attribute strictly by "IQ": the LLM only needs to understand "what's the weather like". Given a specific context, such as "what's the weather like in Shanghai next month", whether the LLM can infer the precise intent is a question of "IQ"; but which tool gets called, and whether the execution parameters are accurate, should not be attributed to "IQ".
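The split can be made concrete in code. In this sketch, resolving the utterance into structured slots stands in for the "IQ" part (faked here with simple rules instead of an LLM), while validating parameters and executing the call is the external framework's job; all names are illustrative.

```python
# Sketch of the attribution split: slot filling is the "IQ" part,
# parameter validation and execution belong to the external framework.
from datetime import date


def parse_intent(query: str) -> dict:
    # Stand-in for the LLM's "IQ": infer place and time from the utterance.
    slots = {"place": None, "date": None}
    if "Shanghai" in query:
        slots["place"] = "Shanghai"
    if "today" in query:
        slots["date"] = date.today().isoformat()
    return slots


def call_weather_tool(slots: dict) -> str:
    # The framework's job: check the parameters before executing the tool.
    missing = [k for k, v in slots.items() if v is None]
    if missing:
        return f"Need clarification: {', '.join(missing)}"  # ambiguity, not an 'IQ' failure
    return f"(calling weather API for {slots['place']} on {slots['date']})"


print(call_weather_tool(parse_intent("What's the weather like?")))           # asks for clarification
print(call_weather_tool(parse_intent("What's the weather in Shanghai today?")))
```

When the first query fails, the right fix is to ask the user or improve the tool layer, not to blame the model's "IQ".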
3. Can a more general external framework for Agents be realized in the future?
Many people treat the LLM itself as the implementation of the Agent, which is too crude. For example, after setting the Agent a goal and defining some basic constraints, they expect it to complete the whole pipeline of self-planning, task decomposition, and self-prompting, even calling external tools and producing the answer. But the LLM was never trained to do this, so naturally it lacks the ability; and that is not an "IQ" problem.
From the perspective of landing Agents, an external logical framework is still needed.
Although there are many kinds of Agents at present, most are superficial and not general enough. Even the simplest Agent applications, voice assistants or intelligent outbound-call systems, have not effectively solved the problems of complexity and of how to introduce environmental Feedback.
So besides analyzing the errors in more detail, one question worth studying is: beyond the LLM itself being general enough, can a general external logical framework be realized to solve the problem of truly landing Agents?
If no general external logical framework can be found, then the so-called AGI revolution may be just a bubble, a huge bubble, with no essential difference from the previous generation of NLP.
At present, landing Agents is not only a question of "IQ" but also a question of how external tooling takes them from special-purpose to general-purpose; and that is the more important question.
4. How can the Agent adapt to environments in general? Is a learnable environment model needed?
What happens if an LLM is placed in a virtual world?
In Minecraft, NVIDIA's latest method, Voyager, lights up the tech tree 15.3 times faster, while obtaining 3.3 times as many unique items and exploring 2.3 times the range. The gains are attributed to GPT-4's deep understanding of the game's rules and its rich knowledge reserve, which come from pre-training rather than subsequent reinforcement learning.
From this perspective, when optimizing an Agent we should pay attention not only to Feedback but also to how the model perceives the environment. What, then, is the relationship between the general brain and the environment model, and how do they cooperate? How does the Agent get from special-purpose to general-purpose?
At present there are few good, general landing results for Agents; most solve specific problems in specific scenarios: the LLM serves as a general brain and is cast into different roles via Prompts to complete specialized tasks, rather than general applications.
One key problem, Feedback, will become a major constraint on implementing Agents, and this is especially evident in Tool use. For simple problems such as weather queries, designing a suitable Prompt is enough; for complex tool applications, the probability of success drops sharply.
You cannot make an Agent simply by slapping an LLM on the problem.
On one hand, that approach ignores the importance of Feedback; on the other, even if the LLM receives Feedback, its "IQ" may not suffice to fully understand every environment or every piece of Feedback, let alone adjust its behavior accordingly.
To land the Agent successfully, it must be given a more general way to adapt to environments. One possible solution is to create a small model dedicated to understanding and adapting to the environment, which then interacts with the LLM.
The reason is that the "IQ" part, the super-brain LLM (such as GPT-4), is too large in scale to be retrained for a specific Agent, whereas a small model can be trained over and over to adapt to environmental changes. In this picture the LLM is the cerebrum, and the small model is like the cerebellum: a middle layer that handles environmental Feedback and interacts with GPT-4.
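A sketch of that division of labor, with both models reduced to stub functions; no real API is implied. The small "cerebellum" digests raw environment Feedback into a compact signal, and the frozen "cerebrum" plans from the digest.

```python
# Sketch of the "cerebrum + cerebellum" division of labor described above.

def cerebellum(raw_feedback: dict) -> str:
    # Small model: retrainable often, compresses noisy environment feedback.
    if raw_feedback.get("error"):
        return f"Last action failed: {raw_feedback['error']}. Consider another tool."
    return f"Last action succeeded with result {raw_feedback.get('result')!r}."


def brain(goal: str, digest: str) -> str:
    # Frozen LLM (e.g., GPT-4): plans at a high level from the digest alone.
    return f"Goal: {goal}. Feedback: {digest} -> plan the next step."


print(brain("report tomorrow's weather",
            cerebellum({"error": "timeout calling weather API"})))
```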
So what will the path be for Agents to go from special-purpose to general-purpose?
Suppose Agents eventually land in 100 different environments. Given that even the simplest external application is hard to build today, can a framework model be abstracted that solves all the external generality problems?
One plausible route to a general Agent is to push the Agent in a single scenario to the extreme, stable and robust enough, and then gradually grow it into a universal framework.
5. How important is multimodality to the development of Agents?
Today, GPT-4 requires everything to be converted into text, which humans then feed to it as Prompts. That conversion can lose information or introduce errors, biasing the results.
If the next version of GPT achieves strong multimodal understanding, will it alleviate, to some extent, the unreliability, information loss, and bias of today's Agents? What will the relationship between multimodality and Agents be?
If the LLM does not need to interact with the real world and only performs a specific task in a virtual one, multimodality may not help much. But if the LLM must interact with the real world, multimodality is undoubtedly essential.
Multimodality, however, only solves the Agent's Perception problem, not its cognition problem.
In many cases, such as intelligent customer service, users may provide information through multiple channels. Multimodality has real perceptual value there, but it is still far from solving the core problems of logic and reasoning.
Multimodality is an inevitable trend: future large models must be multimodal large models, and future Agents must be Agents in a multimodal world.
For those developing text-based Agents today, when the multimodal watershed moment arrives, should these Agents keep evolving on their textual foundations and gradually integrate multimodal capabilities? Or must the original concepts and architectures be rebuilt entirely for a multimodal world?
Agent development does not need to be completely reinvented, but once Agents gain multimodal abilities they will look very different from today's. For example, the next generation of GPT may include much stronger multimodal understanding, such as images. We need not rush to build such a model ourselves; we can first call such modules externally, though ideally multimodal understanding would be built directly into the model.
Within half a year we will see multimodal large models arrive, and multimodal Agents may come even faster than we think.
First, many large companies are stockpiling resources to develop general multimodality, and this accumulation of quantitative change is very likely to produce a qualitative change: real products may launch soon. Second, the Agent people expect is a human-like assistant that can not only talk but also see, hear, and feel. In theory an excellent Agent should support multi-sensory, multimodal interaction; both Perception and Policy need multimodality.
With the release of RT-2, a new vision-language-action (VLA) model, stuffing a multimodal large model into a robotic arm has produced something like a ChatGPT for physical robots.
Judging by the trend, the future Agent must be multimodal; multimodality is a necessity for Agents to succeed.
In multimodal interaction, digital humans also provide a good example of the advantage of calling external tools. When a large model drives a digital human whose image has been fixed in advance, we need not worry that it will suddenly generate the likeness or voice of a political figure, let alone a hallucinated one.
Although we politely call it "generative AI", for the "generative" part it is often better to call external tools, to guarantee determinism and avoid the large model's hallucinations.
For example, in multimodal interaction, asking the LLM to play Trump and directly generate a congratulatory video would be risky. It is safer and more controllable for the LLM to generate only the script, and then call established digital-human and voice interfaces to synthesize the video.
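As a sketch of that controllable pipeline, assume two hypothetical interfaces, `write_script` (the LLM) and `render_avatar_video` (the digital-human service); neither is a real product's API.

```python
# Sketch of the controllable pipeline: the LLM writes only the script,
# while the pre-approved avatar and voice are fixed external assets.

def make_greeting_video(write_script, render_avatar_video, occasion: str) -> str:
    script = write_script(f"Write a short congratulatory script for {occasion}.")
    # The "generative" part stops at text; the visual identity stays deterministic.
    return render_avatar_video(avatar_id="presenter_01", voice_id="voice_01", text=script)
```

The LLM's generativity stops at the text; the face and voice come from pre-approved assets, so the risky part of the output is deterministic.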
6. Will multi-Agent systems really succeed?
Today's Agent is still a caveman, but the interaction of many AI Agents will change everything.
In the Generative Agents experiment, inspired by the game The Sims, each character is controlled by an AI Agent; they live and interact in a sandbox environment, fully embodying the process of turning feedback and environmental information into action and realizing the "socialization" of AI Agents.
While planning and responding, the AI Agents take full account of their relationships with one another, and one Agent's observation of and feedback from another informs its next step.
This entertaining simulation produced some dramatic social phenomena, such as the spread of "rumors" and relational memory. In the experiment, two AI Agents would often pick up earlier topics, and Agents threw parties, called on friends, and carried out other social activities.
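A toy sketch of this kind of socialization, drastically simplified from the Generative Agents architecture (it is not the paper's actual code): each agent keeps a relational memory of what others said and responds from those observations.

```python
# Toy sketch of Generative-Agents-style socialization via relational memory.

class SocialAgent:
    def __init__(self, name: str):
        self.name, self.memory = name, []

    def observe(self, speaker: str, utterance: str) -> None:
        self.memory.append(f"{speaker} said: {utterance}")   # relational memory

    def speak(self) -> str:
        last = self.memory[-1] if self.memory else "nothing yet"
        return f"About that ({last}), count me in!"


alice, bob = SocialAgent("Alice"), SocialAgent("Bob")
alice.observe("Bob", "There is a party tonight!")
bob.observe("Alice", alice.speak())   # the news spreads through memory
print(bob.memory)
```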
Clearly, truly landing Agents must rest on perception of the environment, dynamic learning, and continuous updating.
7. In which scenarios will Agents land first?
As early as February this year, some online-education companies were already active in discussions about large models. "If our industry doesn't act now, it will be the first to be disrupted." While most companies had not yet felt the impact of large models, people at a leading online-education company were the first to forecast and worry.
So which industries will be disrupted by Agents first, and which will take longer?
The capability of large models is well known, but "IQ" is only one part of landing an Agent. Even if OpenAI claims AGI has arrived, it is hard to build practical applications while knowing nothing about the industry.
It is like a Stanford PhD: without understanding a company's industry and product attributes, the initial work will be very difficult. So we need to look more closely at which industries are better suited for Agents to land in.
Take the online-education industry, which can be fully online and digitized. Over the past three years many offline industries were hit hard while online industries, with their digital advantages, developed rapidly; such industries will be disrupted by Agents first. By contrast, robotics and traditional industries are harder to disrupt in the short term.
At present, whether in China or the United States, a new consensus is gradually forming: first, Agents need to call external tools; second, the way to call tools is to output code. The LLM brain emits executable code, like a semantic parser: it understands the meaning of each sentence, converts it into machine instructions, and then calls external tools to execute them or generate answers.
Although the current form of Function Calling still needs improvement, this way of calling tools is essential, and it is the most thorough means of curing the hallucination problem.
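A minimal sketch of that consensus: the LLM emits a structured call instead of free text, and a controller dispatches it against a registry of real functions. The JSON shape loosely follows the function-calling pattern; the registry and names are illustrative.

```python
# Sketch of tool use via structured output: the LLM emits a call,
# the controller executes it deterministically.
import json


def dispatch(llm_output: str) -> str:
    call = json.loads(llm_output)   # e.g. {"name": "get_weather", "arguments": {...}}
    registry = {"get_weather": lambda city: f"Sunny in {city}"}
    fn = registry.get(call["name"])
    if fn is None:
        return "Unknown tool: fall back to a plain text answer."
    return fn(**call["arguments"])  # deterministic execution, no free generation


print(dispatch('{"name": "get_weather", "arguments": {"city": "Beijing"}}'))  # Sunny in Beijing
```

Because the answer comes from executing `get_weather` rather than from free generation, there is no room for the model to hallucinate the result.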
8. Will the future Agent landscape be a hundred flowers blooming, or winner-take-all?
The competitive landscape of large models is becoming increasingly clear: a few companies will dominate, alongside open source.
In the next year or two, what will the Agent market look like? Will it settle into the same pattern?
Because Agents cannot yet be general-purpose, the Agent market will not be a monopoly; it will form a very long-tailed supplier landscape. There will be many Agents on the market, each operated by a different company.
As with the previous generation of NLP, many AI companies offered intelligent customer service or automated outbound calls, but each could serve only a handful of customers and never achieved economies of scale. Judged by today's Agent technology, the situation is not much different.
Although the LLM's semantic understanding is general, the combination of Agent, environment, and domain discussed earlier is not, which will leave the market highly fragmented, with no dominant company emerging.
Of course, Agents also divide into two types: deep (specialized) and shallow (general).
Suppose we want to build a general Agent. In China's market environment, an Agent deeply integrated with an enterprise ultimately becomes "outsourcing", because it must be privately deployed and woven into the enterprise's workflow. Many companies will fight over the big customers in insurance, banking, and automobiles. The outcome will closely resemble that of the previous generation of AI companies: marginal costs stay high, and nothing generalizes.
The future belongs to Agents. But under today's Agent paradigm, yesterday's AI story is still repeating itself, and private deployment will face challenges.
This article comes from the WeChat official account "Brother Fei Says AI" (ID: FeigeandAI); author: Gao Jia.