From GPT-4 to ChemCrow: How AI is transforming chemical research

I recently came across an article in Nature: "Augmenting large language models with chemistry tools."

The article introduces ChemCrow, an LLM-powered chemistry agent designed for tasks in organic synthesis, drug discovery, and materials design. By integrating 18 expert-designed tools with GPT-4 as the base model, ChemCrow augments the LLM's performance in chemistry and exhibits new capabilities. ChemCrow autonomously planned and executed the syntheses of a common insect repellent and three organocatalysts, and guided the discovery of a novel chromophore. Evaluations by both LLMs and expert chemists demonstrate its effectiveness in automating a diverse range of chemical tasks. ChemCrow not only assists professional chemists but also lowers the barrier to entry for non-specialists in chemical research, while bridging the gap between experimental and computational chemistry.
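At its core the design is simple: each expert tool is a small, well-described function that the LLM can invoke by name. The sketch below is only an illustration of that pattern, not the paper's implementation; the `Tool` wrapper and the single RDKit-based molecular-weight tool are hypothetical stand-ins for ChemCrow's 18 tools.

```python
# Minimal sketch of an "expert-designed tool" an LLM agent could call.
# Illustrative only; ChemCrow's real tools wrap packages such as RDKit,
# reaction-prediction services, safety checks and web search.
from dataclasses import dataclass
from typing import Callable

from rdkit import Chem
from rdkit.Chem import Descriptors


@dataclass
class Tool:
    name: str          # short identifier the LLM uses to pick the tool
    description: str   # shown to the LLM so it knows when to use it
    run: Callable[[str], str]


def molecular_weight(smiles: str) -> str:
    """Return the molecular weight of a molecule given as a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return "Error: invalid SMILES string."
    return f"{Descriptors.MolWt(mol):.2f} g/mol"


TOOLS = [
    Tool(
        name="MolecularWeight",
        description="Input a SMILES string; returns the molecular weight.",
        run=molecular_weight,
    ),
    # ...the real system registers 18 such tools (reaction prediction,
    # synthesis planning, safety checks, literature/web search, etc.)
]
```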


a. Overview of the problem-solving process. A set of tools was assembled from various chemistry-related software packages. These tools, together with the user's prompt, are provided to a large language model (LLM). Through an automated, iterative chain-of-thought process, the LLM decides on its course of action, which tool to use, and what input to give it, ultimately arriving at an answer (a minimal sketch of this loop follows the caption). The figure illustrates the process for synthesizing the common insect repellent DEET.

b. The toolkit implemented in ChemCrow, comprising reaction tools, molecular tools, safety tools, search tools, and standard tools. Image source: the photo in (a) was taken by IBM Research and is used under a Creative Commons license CC BY-ND 2.0.
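Panel (a) describes an iterative thought-action-observation loop: the LLM decides which tool to call and with what input, reads the tool's output, and repeats until it can give a final answer. A stripped-down sketch of such a loop follows; it assumes the hypothetical `Tool` wrapper from the previous snippet and a placeholder `llm_complete` function standing in for a GPT-4 API call, and it is not ChemCrow's actual control flow (the real system is built on the LangChain framework).

```python
import re


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (e.g. GPT-4 via an API)."""
    raise NotImplementedError


def run_agent(question: str, tools: list, max_steps: int = 10) -> str:
    """Iterative thought/action/observation loop in the spirit of ReAct."""
    tool_index = {t.name: t for t in tools}
    tool_list = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    transcript = (
        f"You can use these tools:\n{tool_list}\n\n"
        f"Question: {question}\n"
        "Respond with either 'Action: <tool>[<input>]' or 'Final Answer: ...'.\n"
    )
    for _ in range(max_steps):
        reply = llm_complete(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", reply)
        if match is None:
            transcript += "Observation: could not parse an action.\n"
            continue
        name, tool_input = match.group(1), match.group(2)
        tool = tool_index.get(name)
        observation = tool.run(tool_input) if tool else f"Unknown tool {name}."
        transcript += f"Observation: {observation}\n"  # feed the result back to the LLM
    return "No answer within the step limit."
```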


a. An example of a user running a script to launch ChemCrow.

b. The query and synthesis process for thiourea organocatalysts.

c. The IBM Research RoboRXN synthesis platform used to carry out the experiments (image courtesy of International Business Machines Corporation).

d. Experimentally validated compounds. Image source: The photo in (c) was taken by IBM Research and is used under a Creative Commons license CC BY-ND 2.0.
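Panel (a) above shows ChemCrow being launched from a short script. The authors released ChemCrow as an open-source Python package; based on its public repository, a launch looks roughly like the sketch below (class and argument names may vary between package versions, and an OpenAI API key must be configured).

```python
# Rough sketch of launching ChemCrow from a script, following the pattern in
# the project's public repository; exact arguments may differ by version.
from chemcrow.agents import ChemCrow

chem_model = ChemCrow(model="gpt-4", temp=0.1)
result = chem_model.run(
    "Propose a synthesis route for a thiourea organocatalyst "
    "and assess the safety of the reagents involved."
)
print(result)
```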


Left: Human input, operations, and observations.

Right: ChemCrow's operations and final answer, along with suggestions for a new chromophore.


Comparison of GPT-4 and ChemCrow's performance on various tasks.

a. Preference evaluation for each task. For each task, evaluators (n=4) were asked which model's response they preferred. Tasks are divided into three categories: synthesis, molecular design, and chemical logic, ordered by difficulty within each category.

b. Average chemical accuracy (factuality) scores assigned by human evaluators (n=4) on organic synthesis tasks, ordered by the synthetic accessibility of the synthesis target.

c. Summary of all metrics, based on ratings from human evaluators (n=56 ratings) across all tasks, compared with ratings from EvaluatorGPT (n=14). Error bars show 95% confidence intervals.

d. Checkboxes highlight the strengths and weaknesses of each system. These pros and cons were determined based on observations left by evaluators.
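Panel (c) condenses many individual ratings into a mean score per metric with a 95% confidence interval. As a reminder of what that aggregation involves, here is a small self-contained sketch with made-up ratings (not the paper's data), using a normal-approximation interval.

```python
import math


def mean_and_ci95(ratings: list[float]) -> tuple[float, float]:
    """Return the mean and the half-width of a normal-approximation 95% CI."""
    n = len(ratings)
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(variance / n)  # z = 1.96 for a 95% interval
    return mean, half_width


# Made-up ratings on a 1-10 scale, purely for illustration:
human_ratings = [8, 9, 7, 8, 9, 6, 8, 7]
m, hw = mean_and_ci95(human_ratings)
print(f"mean = {m:.2f} +/- {hw:.2f} (95% CI)")
```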


Although the current results are limited by the number and quality of the selected tools, the space of possibilities is enormous, especially since candidate tools are not restricted to the chemistry domain. Integrating additional language-based tools, image-processing tools, and the like could significantly extend ChemCrow's capabilities. Likewise, although the set of evaluation tasks is limited, further research and development can expand and diversify these tasks and truly probe the limits of such systems.

Evaluations by expert chemists show that ChemCrow outperforms GPT-4 in chemical factuality, reasoning, and completeness of responses, particularly on more complex tasks. GPT-4 may do better on memorization-heavy tasks, such as proposing syntheses of well-known molecules like paracetamol and aspirin, whereas ChemCrow excels on novel or less familiar tasks, which tend to be both more useful and more challenging. In contrast, LLM-driven evaluation tends to favor GPT-4 because its responses read as smoother and seemingly more complete; such automated evaluation is therefore less reliable than human judgment when assessing a model's actual chemical reasoning. This discrepancy suggests that evaluation methods need further refinement to capture the distinctive capabilities of systems like ChemCrow on complex, real-world chemistry problems.

The evaluation process is not without challenges, and improved experimental design could strengthen the validity of the results. A major challenge with the current API-based LLM approach is that individual results are hard to reproduce, given the limited control closed-source models allow. Recent open-source models offer a potential remedy, though possibly at the cost of weaker reasoning ability. In addition, implicit biases in task selection and the inherent difficulty of assessing chemical logic when scoring task solutions at scale complicate the evaluation of such machine-learning systems.