CodeEditorBench

Evaluating Code Editing Capability of Large Language Models

Jiawei Guo1*, Ziming Li3*, Xueling Liu1*, Kaijing Ma5*,
Tianyu Zheng1, Zhouliang Yu3, Ding Pan3, Yizhi LI4, Ruibo Liu1, Yue Wang1, Shuyue Guo1, Xingwei Qu3,4,
Xiang Yue1, Ge Zhang1,2,6†, Wenhu Chen1,2,6†, Jie Fu1,3

1Multimodal Art Projection Research Community, 2University of Waterloo,
3HKUST, 4University of Manchester, 5Tongji University, 6Vector Institute,

*These authors contributed equally
†Corresponding Authors


Abstract

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, a pioneering evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development.

We curated diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluating 17 LLMs revealed that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem type and prompt sensitivity. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs.


Figure 1: Overview of CodeEditorBench. CodeEditorBench covers multiple programming languages, selecting initial data from five sources and filtering it by code length. It enriches the dataset with LLM-generated test cases, which, along with all code, are verified by an Online Judge System (OJ). The benchmark is developed for four problem types using specific methodologies. Assessment of 17 LLMs involves crafting prompts for zero-shot, three-shot, and chain-of-thought settings. Outputs are filtered and integrated with templates for compilation. The OJ's batch judging determines the LLMs' scores, ensuring a rigorous evaluation process.

CodeEditorBench Dataset

Overview

Figure 2 presents a selection of exemplars from CodeEditorBench, delineating the spectrum of code editing tasks: Code Debugging, Code Translating, Code Polishing, and Code Requirement Switching. Each dataset entry shares common attributes such as title, difficulty, public and private test inputs and outputs, and code_language. Unique to Code Debugging tasks is an error_type attribute, denoting the specific issue present in the code. Translating tasks are distinguished by source_lang and target_lang attributes, indicating the original and target programming languages. Code Polishing tasks include avg_running_time and avg_memory attributes, quantifying the average execution time and memory usage, respectively. Finally, Code Requirement Switching tasks feature a distinct set of attributes, including public_similar_tests, public_target_tests, and private_target_tests inputs and outputs, which correspond to the test cases of the initial and target problems, as well as similar_title and target_title.
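For concreteness, two entries can be pictured roughly as the records below. The values and the code field name are invented for illustration; the other field names follow the attribute list above.

    # Hypothetical CodeEditorBench-style records illustrating the shared and
    # task-specific attributes described above. All values are invented.
    debug_entry = {
        "title": "Two Sum",
        "difficulty": "easy",
        "code_language": "python",
        "public_tests": {"input": ["2 7 11 15\n9"], "output": ["0 1"]},
        "private_tests": {"input": ["3 3\n6"], "output": ["0 1"]},
        "error_type": "logic error",   # unique to Code Debugging
        "code": "def two_sum(nums, t): ...",
    }

    switch_entry = {
        "similar_title": "Two Sum",    # problem the given code already solves
        "target_title": "3Sum",        # problem the code must be adapted to
        "difficulty": "medium",
        "code_language": "cpp",
        "code": "/* solution to the similar problem */",
        "public_similar_tests": {"input": ["..."], "output": ["..."]},
        "public_target_tests": {"input": ["..."], "output": ["..."]},
        "private_target_tests": {"input": ["..."], "output": ["..."]},
    }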


Figure 2: Data Samples of CodeEditorBench
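Given these attributes, and referring back to the prompt-crafting step in Figure 1, a zero-shot prompt for each task type might be assembled as in the sketch below. The template wording and the assumed code field are purely illustrative; they are not the released prompts.

    # Illustrative zero-shot prompt builder for the four task types.
    # The template wording and the `code` field name are assumptions;
    # the official prompts released with the benchmark may differ.
    TEMPLATES = {
        "debug": (
            "The following {code_language} code contains a bug of type "
            "'{error_type}'. Fix it and return the corrected program.\n\n{code}"
        ),
        "translate": (
            "Translate the following program from {source_lang} to "
            "{target_lang}.\n\n{code}"
        ),
        "polish": (
            "Optimize the following {code_language} code for running time and "
            "memory without changing its behavior.\n\n{code}"
        ),
        "switch": (
            "The code below solves '{similar_title}'. Modify it so that it "
            "solves '{target_title}' instead.\n\n{code}"
        ),
    }

    def build_prompt(task: str, entry: dict) -> str:
        """Fill the task template with fields taken from a dataset entry."""
        return TEMPLATES[task].format(**entry)

    if __name__ == "__main__":
        print(build_prompt("debug", {
            "code_language": "python",
            "error_type": "logic error",
            "code": "def add(a, b): return a - b",
        }))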

Statistics

To account for potential data contamination, we applied a timestamp-based filtering process, which yielded a refined dataset called CodeEditorBench_Plus; the original dataset was thereafter designated CodeEditorBench_Primary. CodeEditorBench_Primary (Figure 3) and CodeEditorBench_Plus (Figure 4) establish an evaluation framework that mirrors the complexities inherent in real-world software development scenarios. This framework is designed to gauge LLMs' code editing capabilities and thus presents a richer and more nuanced set of challenges than conventional code generation benchmarks. The datasets are extensive, categorizing tasks along several dimensions: programming language (C++, Java, Python), number of errors (one to four), difficulty level (easy, medium, hard), language transition (e.g., C++ to Java), and relation strength (strong, weak).
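A minimal sketch of such a contamination filter, assuming each entry carries a release_date field and using an arbitrary cutoff date (both are assumptions), could look like this:

    from datetime import date

    # Illustrative timestamp-based filter for deriving a Plus-style subset.
    # The `release_date` field name and the cutoff date are assumptions.
    CUTOFF = date(2023, 1, 1)

    def filter_recent(entries: list[dict], cutoff: date = CUTOFF) -> list[dict]:
        """Keep only problems released after the cutoff, reducing the chance
        that they appeared in the models' pre-training data."""
        return [e for e in entries
                if date.fromisoformat(e["release_date"]) > cutoff]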


Figure 3: Primary Dataset


Figure 4: Plus Dataset

Experiment Results

Leaderboard

Model  Size  Debug  Translate  Switch  Polish  Win Rate
Zero-shot
gpt-4-0613 - 0.316 0.465 0.264 1.12% 0.855
OpenCodeInterpreter-DS-33B 33B 0.236 0.368 0.141 6.02% 0.776
gemini-ultra - 0.304 0.378 0.041 5.31% 0.750
deepseek-coder-33B-instruct 33B 0.275 0.410 0.162 1.10% 0.737
gemini-pro - 0.286 0.344 0.076 5.86% 0.737
gpt-3.5-turbo-1106 - 0.290 0.475 0.177 0.09% 0.724
OpenCodeInterpreter-DS-6.7B 6.7B 0.233 0.357 0.126 4.45% 0.671
WizardCoder-33B-V1.1 33B 0.274 0.371 0.156 0.79% 0.632
glm-4 - 0.220 0.278 0.085 5.17% 0.526
Magicoder-S-DS-6.7B 6.7B 0.242 0.343 0.130 0.21% 0.513
Phind-CodeLlama-34B-v2 34B 0.230 0.279 0.074 2.84% 0.500
octocoder 15.5B 0.042 0.392 0.030 1.39% 0.434
CodeLlama-13B-Instruct-hf 13B 0.176 0.333 0.021 2.31% 0.421
CodeLlama-34B-hf 34B 0.163 0.310 0.052 1.10% 0.382
Magicoder-S-CL-7B 7B 0.174 0.272 0.039 1.31% 0.329
WizardCoder-15B-V1.0 15B 0.159 0.309 0.067 0.91% 0.329
CodeLlama-7B-Instruct-hf 7B 0.155 0.289 0.017 1.47% 0.289
CodeLlama-34B-Instruct-hf 34B 0.131 0.287 0.027 1.02% 0.211
CodeFuse-CodeLlama-34B 34B 0.166 0.218 0.028 0.33% 0.184
Three-shot
gemini-ultra - 0.286 0.443 0.152 5.62% 0.855
gpt-4-0613 - 0.345 0.517 0.303 1.13% 0.816
OpenCodeInterpreter-DS-6.7B 6.7B 0.233 0.372 0.165 6.47% 0.770
OpenCodeInterpreter-DS-33B 33B 0.230 0.371 0.229 5.75% 0.763
deepseek-coder-33B-instruct 33B 0.272 0.417 0.235 1.18% 0.737
gpt-3.5-turbo-1106 - 0.270 0.364 0.201 1.54% 0.684
gemini-pro - 0.229 0.392 0.139 5.23% 0.671
WizardCoder-33B-V1.1 33B 0.279 0.362 0.243 0.65% 0.645
Magicoder-S-DS-6.7B 6.7B 0.262 0.321 0.192 1.44% 0.605
glm-4 - 0.233 0.299 0.100 5.30% 0.572
CodeLlama-34B-hf 34B 0.133 0.307 0.113 1.75% 0.474
Phind-CodeLlama-34B-v2 34B 0.239 0.275 0.092 1.20% 0.421
CodeLlama-13B-Instruct-hf 13B 0.160 0.327 0.028 1.75% 0.414
Magicoder-S-CL-7B 7B 0.157 0.245 0.075 1.70% 0.329
WizardCoder-15B-V1.0 15B 0.114 0.271 0.099 1.65% 0.322
CodeFuse-CodeLlama-34B 34B 0.166 0.240 0.050 1.61% 0.289
octocoder 15.5B 0.050 0.290 0.054 1.09% 0.211
CodeLlama-7B-Instruct-hf 7B 0.167 0.271 0.028 1.00% 0.211
CodeLlama-34B-Instruct-hf 34B 0.143 0.303 0.032 0.32% 0.211

Table 1: Evaluating LLMs on CodeEditorBench. All model results are generated with greedy decoding. Code Debug, Code Translate, and Code Requirement Switch are evaluated with pass@1, while Code Polish is evaluated with Mean OptScore.
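Since decoding is greedy, there is exactly one candidate per problem, so pass@1 reduces to the fraction of problems whose candidate passes every test. The sketch below shows that reduction together with a deliberately simplified polish score; its weighting is an assumption for illustration, not the paper's Mean OptScore formula.

    # pass@1 under greedy decoding, plus a toy polish score. The opt_score
    # weighting is an illustrative assumption, not the Mean OptScore definition.
    def pass_at_1(verdicts: list[bool]) -> float:
        """verdicts[i] is True iff the single greedy sample for problem i passed."""
        return sum(verdicts) / len(verdicts)

    def opt_score(passed: bool, time_gain: float, mem_gain: float) -> float:
        """0 if the solution fails any test; otherwise reward relative
        runtime/memory improvement over the original submission."""
        if not passed:
            return 0.0
        return 0.5 * max(time_gain, 0.0) + 0.5 * max(mem_gain, 0.0)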


Figure 5: Pass rate distribution of models on CodeEditorBench_Plus

The pass rates vary significantly across problem types, as shown in Table 2 and Figure 5. Within the Plus dataset, Switch problems are the most challenging, with a pass rate of only 11.18%. Debug and Translate problems reach pass rates of roughly 20% and 30%, respectively. For Polish problems, even though the correct original code is provided, only 37.47% of solutions pass all test cases, and only 19.23% both pass all tests and achieve better average runtime or memory than the original code. Notably, a sizeable fraction of solutions simply replicate the original code without any alterations.


Problem Type  Pass  Wrong Answer  Runtime Error  Compile Error  Other
Debug 21.57% 53.41% 13.40% 7.51% 4.11%
Polish 19.23% 53.15% 5.46% 3.01% 19.15%
Switch 11.18% 64.74% 10.94% 8.09% 5.05%
Translate 33.15% 45.67% 3.69% 6.06% 11.43%
ALL 20.34% 55.26% 8.53% 6.34% 9.53%

Table 2: Judgment results across different problem types in CodeEditorBench_Plus.
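A Table-2-style breakdown can be reproduced by tallying the judge's verdict for every (problem type, solution) pair, along the lines of the sketch below; the verdict strings are assumptions about the OJ's labels.

    from collections import Counter, defaultdict

    # Aggregate per-solution judge verdicts into a per-problem-type distribution.
    # The verdict strings are assumptions about the Online Judge's labels.
    def verdict_distribution(results: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
        """results holds (problem_type, verdict) pairs, e.g. ("Debug", "Wrong Answer")."""
        counts: dict[str, Counter] = defaultdict(Counter)
        for ptype, verdict in results:
            counts[ptype][verdict] += 1
        return {
            ptype: {v: c / sum(ctr.values()) for v, c in ctr.items()}
            for ptype, ctr in counts.items()
        }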

We analyzed the aggregated solutions from all models on CodeEditorBench_Plus, as detailed in Table 2, and found that only 20.34% of solutions solve the problem, while a substantial 55.26% fail due to wrong answers. Other prevalent causes of failure are compilation and runtime errors, whereas timeouts and memory-limit violations are comparatively rare. Specifically, 6.34% of solutions trigger compilation errors, which may partly stem from post-processing losses incurred when extracting code blocks from responses that mix code with textual explanations. Models producing poorly formatted output, such as OctoCoder, are notably more susceptible to compilation errors. Interestingly, Polish problems show the lowest frequencies of both runtime and compilation errors, likely because the models make only minimal alterations to the original code. Conversely, Translate problems have a lower rate of wrong answers (45.67%) than other problem types, yet suffer the highest rate of timeout errors (10.21%).
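The post-processing step mentioned above, pulling a code block out of free-form model output, can be pictured as the simple extractor below (a sketch, not the benchmark's actual script); answers without well-formed fences fall through unchanged and are more likely to fail compilation.

    import re

    # Extract the first fenced code block from a model response that mixes code
    # with textual explanation; a sketch of the post-processing described above.
    FENCE = re.compile(r"```[a-zA-Z0-9+#_-]*\n(.*?)```", re.DOTALL)

    def extract_code(response: str) -> str:
        """Return the first fenced code block, or the raw response as a fallback."""
        match = FENCE.search(response)
        return match.group(1).strip() if match else response.strip()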

BibTeX


@misc{guo2024codeeditorbench,
      title={CodeEditorBench: Evaluating Code Editing Capability of Large Language Models},
      author={Jiawei Guo and Ziming Li and Xueling Liu and Kaijing Ma and Tianyu Zheng and Zhouliang Yu and Ding Pan and Yizhi LI and Ruibo Liu and Yue Wang and Shuyue Guo and Xingwei Qu and Xiang Yue and Ge Zhang and Wenhu Chen and Jie Fu},
      year={2024},
      eprint={2404.03543},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}