APBench and benchmarking large language model performance in fundamental astrodynamics problems for space engineering



Related work

While LLMs have transformed the field of Natural Language Processing (NLP) and shown vast potential across many fields, rapid advances in LLM capabilities raise the possibility that superhuman AI systems could one day help advance "the final frontier": space engineering. To measure our ability to align LLMs for this purpose, we need evaluation datasets that characterize the trustworthiness of these models on typical classroom-level astrodynamics problems. Such tests should serve as a preliminary threshold before any trust is placed in LLM answers for practical applications to space systems. As LLMs are challenged with increasingly difficult questions, the problem of scalable oversight, in which humans can no longer generate solutions on their own to verify what the improving models can solve, begins to appear20,37. Despite well-known issues in LLMs such as hallucination38, sycophancy39, and textual reasoning limitations40, among others, it is therefore necessary to start evaluating LLM performance on the specialized tasks of space engineering, both to reveal improvement potential that supports the adoption of LLMs in space engineering and to return valuable directions for general AI development.

STEM benchmarks

LLM evaluation has made significant strides: the evolution of benchmarks has both documented the models' advancing abilities and prompted calls for new benchmarks41. Within STEM, a great number of benchmark datasets have emerged.

Benchmarks such as GSM8K15, designed for middle-school mathematics, and MATH16, focused on high-school math competitions, have become standard in the field. For university-level mathematics, datasets like ProofNet42 and OCWCourses43 are widely recognized. Furthermore, Olympiad-level problems are covered by datasets such as MiniF2F44 and AlphaGeometry45, while the SAT dataset46 includes problems from the College Board SAT examination. As LLM performance continues to improve, there is demand for even more challenging benchmarks, which has led to the creation of MathOdyssey47, OlympiadBench48, and the collective dataset MathInstruct49. Beyond mathematics, the evaluation of coding capabilities is also crucial40; as CS-Bench50 argues, Computer Science (CS) embodies the intricacies of human intelligence and has profoundly advanced artificial intelligence and modern society.

Building upon the progress in math-related benchmarking, other fields have begun to develop specialized benchmarks. In general physics and chemistry, benchmarks targeting college-level problems have been introduced. The recent SciBench51 dataset formulates open questions that incorporate both textual and image information. ScienceQA52 provides multimodal multiple-choice science questions together with answers, tasks, lecture notes, and detailed annotations. BIG-Bench53 offers a general-purpose test suite with 204 tasks, ranging from multiple-choice to exact-match questions, while its extension, BIG-Bench Hard54, introduces more complex Chain of Thought (CoT) prompts. SciEval55 evaluates LLMs' understanding and application of science through a mix of objective and subjective questions across various scientific disciplines. JEEBench56 draws on pre-engineering-level scientific problems from college entrance exams, and AGIEval57 assesses LLM performance on standardized exams such as college entrance and lawyer qualification tests.

More fields within STEM are recognizing the importance and promise of understanding LLMs' capabilities in their specific domains. In finance, the FinBench benchmark19 addresses the need to evaluate LLMs on more complex and specialized tasks; it is the first benchmark designed specifically for inherently complex financial problems, helping integrate finance with the rapidly advancing landscape of LLM development. Similarly, in requirements engineering (RE), Norheim et al.58 highlight the limitations that the absence of benchmarks places on advancing the understanding of LLMs in RE, and they propose developing benchmark datasets to enable systematic exploration and evaluation of LLM and NLP methods across various RE tasks. In addition, as they note, "the lack of benchmarks also makes it hard to benchmark the novel LLMs against past methods that have been applied (e.g., rule-based, feature-based ML)59". This underscores the critical role of benchmarks in evaluating the transformative capabilities LLMs bring to STEM disciplines, given the significant differences between LLMs and traditional machine learning.

The curated benchmark, APBench, is strongly motivated by the recent need to increase benchmark difficulty. This includes scaling up benchmarks, as seen in math-related datasets, and raising difficulty levels, as observed in physics-related benchmarks. Additionally, APBench addresses the demand for field-specific benchmarks, with space engineering raising unique concerns about LLMs' specialized capabilities.

Data curation process and principles

We collected each problem from source PDF documents and processed them locally using a combination of scripts and manual operations, dividing the source contents into two sections: Content and Questions. Within Questions, each individual set has five items: Question, Solution, Answer, Format, and Source, while sharing the same Content. To streamline this process, we wrote scripts that employ OpenAI's API to conduct batch processing. The scripts identify chunks of information, separate them into the above-mentioned sections, and extract the problems and solution processes while organizing the content and multiple problems accordingly. Final answers are classified as either numeric or message, and the source is indicated by an acronym, as shown in Tables 1 and 2. During model performance evaluation, numeric answers are converted to floating-point numbers, while message answers retain their original expressions. Additionally, step-by-step solutions are included for each problem, and the Content section is reused across multiple problems without duplication.
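For concreteness, a single problem set in the curated JSON might look like the following sketch; all field values here are illustrative placeholders rather than actual APBench entries.

```python
# Hypothetical example of one curated problem set (all values are placeholders).
problem_set = {
    "Content": "A spacecraft is in a circular low Earth orbit of radius 6700 km ...",
    "Questions": [
        {
            "Question": "Compute the orbital period of the spacecraft.",
            "Solution": "Apply T = 2*pi*sqrt(a**3 / mu) with mu = 398600 km^3/s^2 ...",
            "Answer": "5457.9",          # numeric answers are parsed as floats
            "Format": "numeric",
            "Source": "SRC",             # acronym of the source document (placeholder)
        },
        {
            "Question": "Explain how the period changes if the orbit radius is doubled.",
            "Solution": "Since T is proportional to a**1.5 ...",
            "Answer": "The period increases by a factor of 2*sqrt(2).",
            "Format": "message",         # message answers keep their original expression
            "Source": "SRC",
        },
    ],
}
```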

Progressive problems settings

A source file may contain multiple steps for a single problem set with shared background information. We therefore formulate a problem set as shared content with multiple problems, organized progressively. During the evaluation process illustrated in Fig. 1, we evaluate the correctness of the solution to a single problem and decide whether the answered problem should be returned as part of the prompt, based on the following condition: when a problem is solved correctly, the correctly answered problem and its solution are appended to the content information. This provides additional context to the model, since many problems under the same content are progressive. Figure 2 illustrates a case in which the answer to the second question can be used directly to solve the third question, because the middle question is a necessary step toward the final one.
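A minimal sketch of this progressive-evaluation logic is shown below; the helper functions query_llm and is_correct, as well as the prompt wording, are assumptions for illustration.

```python
# Sketch of the progressive evaluation loop: correctly solved problems (and
# their solutions) are appended to the shared content for later questions.
def evaluate_problem_set(problem_set, query_llm, is_correct):
    context = problem_set["Content"]            # shared background information
    results = []
    for item in problem_set["Questions"]:
        prompt = f"{context}\n\nQuestion: {item['Question']}"
        model_answer = query_llm(prompt)
        correct = is_correct(model_answer, item["Answer"], item["Format"])
        results.append(correct)
        if correct:
            # Only correct answers are carried forward as extra context.
            context += (f"\n\nPreviously solved question: {item['Question']}"
                        f"\nSolution: {item['Solution']}")
    return results
```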

We have also tested model performance without information from past problems, i.e., treating each question independently. Our experiments indicate similar performance whether correctly answered questions are included in the prompt or each question is answered on its own. The two cases of GPT-4 and \(\text{GPT-4}^{*}\) in Table 3 show that for the GPT-4 model (gpt-4-turbo-2024-04-09), the progressive prompt and the independent prompt yield similar performance. At first glance this seems counterintuitive, but according to Park et al.60, generative models possess "hidden capabilities" related to prompting: an LLM may answer a question correctly given the right way of posing it. A similar point was made by Zhu et al.11,14,61 regarding knowledge storage and extraction, suggesting that a question that can be answered correctly, and that forms an important step toward the next question, tends to be answered correctly regardless of how it is presented. This implies that whether a question is answered correctly is determined not by its formulation but by its content. In essence, LLMs know what they know, and explicitly including what a model could know in the prompt is not necessarily the best way to extend its capability.

Despite the similar test performance observed with and without progressive prompting, we chose to structure the benchmark with progressive problems. This approach provides a more concise formulation of the dataset and avoids unnecessary duplication of content information. Moreover, the structure reflects the inherent setup of the problems and offers additional information that could be explored in future studies.

Data source and ethical statement

To construct the dataset, we select a combination of online resources. We focus on including questions with more challenging, open-ended answers rather than the multiple-choice format commonly used in previous works17,51,52,62. The problem sets include questions requiring both numeric and message answers, which aligns with our goal of benchmarking astrodynamics in a general way, since problems requiring either type of answer are common in the field. Moreover, because a message answer can include analytical solutions and other intuitive insights, such solutions have a broader impact than numerical ones tailored to a specific scenario.

Models and experiment setup

We evaluate the benchmarking dataset on both open-source and closed-source LLMs on our local machine, employing a zero-shot prompting strategy. The evaluation workflow is depicted by an example case in Fig. 7. For each set of questions within a single problem set, the Content and the Question are combined in the prompt to request the LLM’s final solution and answer. The answer from the LLM is then evaluated by comparing it to the correct answer in the problem set. In the zero-shot prompt setting, no prior examples or specific training data are provided. The model is expected to generate accurate responses based solely on its pre-existing knowledge and its ability to generalize from the prompt. This setup tests the LLM’s capacity to solve problems without relying on previously seen examples, thereby evaluating its inherent astrodynamics knowledge and problem-solving capabilities in the context of the task at hand.
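A minimal sketch of the zero-shot query is shown below, assuming the OpenAI chat API is used; the prompt wording and default model identifier are illustrative, not the exact strings used in our experiments.

```python
# Sketch of a zero-shot prompt: Content and Question are concatenated and sent
# to the model with no examples or additional training data.
from openai import OpenAI

client = OpenAI()

def ask_zero_shot(content: str, question: str, model: str = "gpt-4o-2024-08-06") -> str:
    prompt = (
        f"{content}\n\n"
        f"Question: {question}\n"
        "Provide your step-by-step solution and clearly state the final answer."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```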

Recent developments in fine-tuning have made creating specialized LLMs much more feasible, through advances not only in programming techniques but also in how knowledge and methods are incorporated. OpenAI's recent o1-preview model represents the newest line of improvement, demonstrating remarkable advances in critical areas such as mathematics, physics, chemistry, and coding. These improvements highlight the ongoing refinement of LLM capabilities for solving intricate problems across STEM disciplines. Simultaneously, the rapid growth of open-source models has led to an explosion of foundational models, accompanied by diverse and innovative methodologies for creating and optimizing LLMs.

We selected a diverse set of leading LLMs to evaluate their performance on APBench. On the closed-source side, we chose several models from OpenAI, including GPT-4 (gpt-4-turbo-2024-04-09), GPT-4o (gpt-4o-2024-08-06), o1-mini (o1-mini-2024-09-12), and o1-preview (o1-preview-2024-09-12). These models represent a growing trend of confidence in their ability to solve a variety of problems efficiently. Additionally, we included Claude 3.5 Sonnet from Anthropic, released on 2024-06-14; Sonnet is reported to achieve similar or superior performance compared to GPT-4o on several math- and coding-related benchmarks63. On the open-source side, we selected multiple versions of Llama from Meta, one of the most popular open-source model families. Specifically, we included Llama 2 7B (Llama-2-7b-chat-hf), Llama 3.1 8B (Llama-3.1-8B-Instruct), Llama 3.1 70B (Llama-3.1-70B-Instruct), Llama 3.2 1B (Llama-3.2-1B-Instruct), and Llama 3.2 3B (Llama-3.2-3B-Instruct). Due to hardware limitations, the large Llama 3.1 70B model was loaded using 4-bit quantization. These models cover a wide range of sizes and evolutionary steps within the Llama family. We also included AstroLLaMA64, available on 2023-09-12, a model fine-tuned on arXiv astronomy abstracts with the aim of conducting "scholarly astronomy". Given the foundational connections between astronomy and space engineering, we selected this model to explore whether its training on astronomy content would aid in solving space engineering problems. Another significant inclusion was the Reflection Llama series. For the Reflection Llama 3.1 70B model (Reflection-Llama-3.1-70B), we evaluated both the full floating-point version and a 4-bit quantized version to test the impact of quantization. Additionally, we tested the Ollama Reflection model, another 4-bit quantized version of Reflection Llama under the Ollama framework. According to its developers, Reflection Llama was "trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes" and was claimed on Twitter to achieve state-of-the-art performance on multiple benchmarks. Last but not least, we included Qwen2.5-Math in sizes 1.5B (Qwen2.5-Math-1.5B-Instruct), 7B (Qwen2.5-Math-7B-Instruct), and 72B (Qwen2.5-Math-72B-Instruct); the 72B model was loaded with 4-bit quantization. Qwen2.5-Math is reported to outperform most contemporary models in the mathematics domain.
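As an illustration of the 4-bit loading mentioned above, the following sketch uses Hugging Face transformers with bitsandbytes; the model identifier and compute dtype are assumptions, not a record of our exact configuration.

```python
# Sketch of loading a large open-source model with 4-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"   # illustrative identifier
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",        # spread layers across available GPUs/CPU
)
```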

Automatic scoring pipeline and evaluation criteria

APBench is designed with the goal of providing adequate evaluation for deploying foundational models in real-world engineering scenarios. Each problem carries a prior Format attribute, either numeric or message, and an automatic scoring pipeline is developed accordingly.

The answer from an LLM is evaluated using predefined rules based on the Format of the reference answer in the problem set. The scoring schema is shown in Fig. 1. For numeric Format, the answer is converted from a text string to a floating-point number, and the numeric value is extracted for evaluation. An error margin, which adapts to the true answer of the problem, determines whether the answer is approximately correct. Similar error-margin criteria have been adopted in Wang et al.51, and the criterion can be tightened or loosened in future evaluations. We use an error margin of \((0.1 + 0.01 \log (|\text {Answer}|))\) to decide whether the difference between the LLM answer and the reference answer is acceptable; if this calculated margin falls outside the range between 0 and 1, a fixed margin of 0.1 is used instead. The known challenges LLMs face with arithmetic operations were the primary consideration in setting this error boundary.
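The numeric rule can be sketched as follows, assuming the margin is applied to the relative difference between the model answer and the reference; the treatment of zero-valued references is also an assumption.

```python
# Sketch of the adaptive numeric error margin and the pass/fail decision.
import math

def numeric_margin(reference: float) -> float:
    margin = 0.1 + 0.01 * math.log(abs(reference))
    # Fall back to a fixed margin when the adaptive value leaves [0, 1].
    if not 0.0 <= margin <= 1.0:
        margin = 0.1
    return margin

def numeric_correct(model_answer: float, reference: float) -> bool:
    if reference == 0.0:                          # assumed handling of zero references
        return abs(model_answer) <= 0.1
    relative_error = abs(model_answer - reference) / abs(reference)
    return relative_error <= numeric_margin(reference)
```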

If the Format of the expected Answer is message, a combination of LLM-as-a-judge evaluation and embedding-level evaluation is employed. The criterion for problems in message Format is the weighted score \(0.5(\text {LLM-Judge} + \text {Embedding-Similarity})\); a score of 0.6 or higher is considered accurate, otherwise the answer is counted as wrong. A similar weighted evaluation is employed in Wang et al.65, while Foutter et al.66 rely on LLM-as-a-judge alone. Our LLM-as-a-judge uses GPT-4o (gpt-4o-2024-08-06) to rate, on a scale of 0 to 10, whether the LLM Answer and the reference Answer agree, and the rating is then normalized to the range 0 to 1. The ability of an LLM to recognize different orderings, formulations, and expressions of the same statement gives it a unique role as a judge. The evaluation is augmented by an embedding-level check: we use the pretrained model all-MiniLM-L6-v267 to map the reference Answer and the model Answer to two 384-dimensional dense vectors. Because the expressions may use different symbols and orders, the embedding-level similarity check supports the decision from the LLM-as-a-judge process. Embedding similarity is computed with cosine similarity, which is naturally scaled between 0 and 1 and adds a degree of tolerance to the criterion.
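A sketch of the message-format scoring is shown below, assuming the judge rating is obtained separately and passed in on a 0-to-10 scale; the function names are illustrative.

```python
# Sketch of the message-format score: the normalized LLM-judge rating and the
# embedding cosine similarity are averaged and thresholded at 0.6.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings

def embedding_similarity(model_answer: str, reference: str) -> float:
    vectors = embedder.encode([model_answer, reference])
    return float(util.cos_sim(vectors[0], vectors[1]))

def message_correct(model_answer: str, reference: str, judge_score_0_to_10: float) -> bool:
    judge = judge_score_0_to_10 / 10.0               # normalize the judge score to [0, 1]
    score = 0.5 * (judge + embedding_similarity(model_answer, reference))
    return score >= 0.6
```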

Benchmark dataset curation process

The benchmark dataset curation process is depicted in Fig. 3. Source files of various types (PDF, PNG, JPEG, etc.) are fed into a block that inspects them. Prompt 1 is used to obtain the number of examples in the file. It is more efficient to include a human in the next step, editing the pattern found by the LLM, and to use that pattern with Prompt 2 to create the first JSON file, JSON Type-1. JSON Type-1 has only two attributes: Problem and Solution. We then use Prompt 3 to separate Problem into Content and Question, creating the JSON Type-2 file. However, the identified Question may combine multiple questions, so we split the Question and Solution simultaneously into multiple Questions and corresponding Solutions, generating the JSON Type-3 file.
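The evolution of these intermediate representations can be illustrated as follows; all values are placeholders, not dataset entries.

```python
# JSON Type-1 (from Prompt 2): raw Problem/Solution pairs.
type_1 = {"Problem": "A satellite in a 500 km circular orbit ... "
                     "(a) Find the period. (b) Find the speed.",
          "Solution": "..."}

# JSON Type-2 (from Prompt 3): the shared Content is separated from the Question.
type_2 = {"Content": "A satellite in a 500 km circular orbit ...",
          "Question": "(a) Find the period. (b) Find the speed.",
          "Solution": "..."}

# JSON Type-3: the combined Question and Solution are split into individual items.
type_3 = {"Content": "A satellite in a 500 km circular orbit ...",
          "Questions": [{"Question": "Find the period.", "Solution": "..."},
                        {"Question": "Find the speed.", "Solution": "..."}]}
```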

Fig. 5: Benchmark curation process workflow and the list of prompts, accompanying the section Benchmark dataset curation process.

Furthermore, we employ a two-step operation with an if condition to split Question and Solution further into more detailed Questions and Solutions; the corresponding prompts are Prompt 5 step 1 and Prompt 5 step 2. Human intervention is introduced to evaluate the completeness of the Question and Solution splitting. If further splitting is needed, we repeat the process; once the Question and Solution cannot be separated further, we move on to updating Format, condensing Answer, and updating the Question. The corresponding prompt is also a two-step prompt: Prompt 6 step 1 updates the Format attribute, and Prompt 6 step 2 uses the updated Format to condense the Answer and update the Question when the Format is numeric, while leaving message-Format items unchanged.
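The splitting-and-formatting step can be sketched as the loop below; the callables standing in for Prompt 5, Prompt 6, and the human review are hypothetical names used only for illustration.

```python
# Sketch of the human-in-the-loop refinement: repeat Prompt 5 until a reviewer
# confirms no further splitting is possible, then apply the two Prompt 6 steps.
def refine_item(item, split_with_prompt_5, human_says_split_further,
                apply_prompt_6_step_1, apply_prompt_6_step_2):
    while human_says_split_further(item):
        item = split_with_prompt_5(item)          # Prompt 5 step 1 + step 2
    item = apply_prompt_6_step_1(item)            # updates item["Format"]
    if item["Format"] == "numeric":
        item = apply_prompt_6_step_2(item)        # condenses Answer, updates Question
    return item
```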

Throughout the process, we are able to carry out a large portion of the benchmark dataset creation automatically. Human intervention is still needed in several places, such as deciding whether to further split Question and Solution, checking the quality of the Format determination, and, at the very start, determining the patterns in the source files. Finally, we obtain the updated JSON Type-3 file with the attributes shown in Fig. 2.

The list of prompts is provided in the accompanying figure.

Model performance examples

Fig. 6: o1-preview, Claude 3.5 Sonnet, and Qwen2.5-Math-7B-Instruct performance on the example question.

Fig. 7: Continuation of Fig. 6; Llama 3.1 70B performance on the example question.



