| Model | Size | Debug | Translation | Switch | Polish | Win Rate |
|---|---|---|---|---|---|---|
| **Zero-shot** | | | | | | |
| gpt-4-0613 | - | 0.316 | 0.465 | 0.264 | 1.12% | 0.855 |
| OpenCodeInterpreter-DS-33B | 33B | 0.236 | 0.368 | 0.141 | 6.02% | 0.776 |
| gemini-ultra | - | 0.304 | 0.378 | 0.041 | 5.31% | 0.750 |
| deepseek-coder-33B-instruct | 33B | 0.275 | 0.410 | 0.162 | 1.10% | 0.737 |
| gemini-pro | - | 0.286 | 0.344 | 0.076 | 5.86% | 0.737 |
| gpt-3.5-turbo-1106 | - | 0.290 | 0.475 | 0.177 | 0.09% | 0.724 |
| OpenCodeInterpreter-DS-6.7B | 6.7B | 0.233 | 0.357 | 0.126 | 4.45% | 0.671 |
| WizardCoder-33B-V1.1 | 33B | 0.274 | 0.371 | 0.156 | 0.79% | 0.632 |
| glm-4 | - | 0.220 | 0.278 | 0.085 | 5.17% | 0.526 |
| Magicoder-S-DS-6.7B | 6.7B | 0.242 | 0.343 | 0.130 | 0.21% | 0.513 |
| Phind-CodeLlama-34B-v2 | 34B | 0.230 | 0.279 | 0.074 | 2.84% | 0.500 |
| octocoder | 15.5B | 0.042 | 0.392 | 0.030 | 1.39% | 0.434 |
| CodeLlama-13B-Instruct-hf | 13B | 0.176 | 0.333 | 0.021 | 2.31% | 0.421 |
| CodeLlama-34B-hf | 34B | 0.163 | 0.310 | 0.052 | 1.10% | 0.382 |
| Magicoder-S-CL-7B | 7B | 0.174 | 0.272 | 0.039 | 1.31% | 0.329 |
| WizardCoder-15B-V1.0 | 15B | 0.159 | 0.309 | 0.067 | 0.91% | 0.329 |
| CodeLlama-7B-Instruct-hf | 7B | 0.155 | 0.289 | 0.017 | 1.47% | 0.289 |
| CodeLlama-34B-Instruct-hf | 34B | 0.131 | 0.287 | 0.027 | 1.02% | 0.211 |
| CodeFuse-CodeLlama-34B | 34B | 0.166 | 0.218 | 0.028 | 0.33% | 0.184 |
| **Three-shot** | | | | | | |
| gemini-ultra | - | 0.286 | 0.443 | 0.152 | 5.62% | 0.855 |
| gpt-4-0613 | - | 0.345 | 0.517 | 0.303 | 1.13% | 0.816 |
| OpenCodeInterpreter-DS-6.7B | 6.7B | 0.233 | 0.372 | 0.165 | 6.47% | 0.770 |
| OpenCodeInterpreter-DS-33B | 33B | 0.230 | 0.371 | 0.229 | 5.75% | 0.763 |
| deepseek-coder-33B-instruct | 33B | 0.272 | 0.417 | 0.235 | 1.18% | 0.737 |
| gpt-3.5-turbo-1106 | - | 0.270 | 0.364 | 0.201 | 1.54% | 0.684 |
| gemini-pro | - | 0.229 | 0.392 | 0.139 | 5.23% | 0.671 |
| WizardCoder-33B-V1.1 | 33B | 0.279 | 0.362 | 0.243 | 0.65% | 0.645 |
| Magicoder-S-DS-6.7B | 6.7B | 0.262 | 0.321 | 0.192 | 1.44% | 0.605 |
| glm-4 | - | 0.233 | 0.299 | 0.100 | 5.30% | 0.572 |
| CodeLlama-34B-hf | 34B | 0.133 | 0.307 | 0.113 | 1.75% | 0.474 |
| Phind-CodeLlama-34B-v2 | 34B | 0.239 | 0.275 | 0.092 | 1.20% | 0.421 |
| CodeLlama-13B-Instruct-hf | 13B | 0.160 | 0.327 | 0.028 | 1.75% | 0.414 |
| Magicoder-S-CL-7B | 7B | 0.157 | 0.245 | 0.075 | 1.70% | 0.329 |
| WizardCoder-15B-V1.0 | 15B | 0.114 | 0.271 | 0.099 | 1.65% | 0.322 |
| CodeFuse-CodeLlama-34B | 34B | 0.166 | 0.240 | 0.050 | 1.61% | 0.289 |
| octocoder | 15.5B | 0.050 | 0.290 | 0.054 | 1.09% | 0.211 |
| CodeLlama-7B-Instruct-hf | 7B | 0.167 | 0.271 | 0.028 | 1.00% | 0.211 |
| CodeLlama-34B-Instruct-hf | 34B | 0.143 | 0.303 | 0.032 | 0.32% | 0.211 |
Table 1: Evaluating LLMs on CodeEditorBench. All model results are generated with greedy decoding. Code Debug, Code Translate, and Code Requirement Switch are evaluated with pass@1, while Code Polish is evaluated with Mean OptScore.
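For reference, pass@k is commonly computed with the unbiased estimator sketched below; under greedy decoding (a single sample per problem), pass@1 reduces to the fraction of problems whose generated program passes all test cases. The Mean OptScore metric is specific to the Code Polish task and is defined separately in the benchmark.

```latex
% Standard unbiased pass@k estimator (Chen et al., 2021), shown for context.
% n: number of samples generated per problem; c: number of those samples
% that pass all test cases. With greedy decoding, n = 1, so pass@1 is simply
% the fraction of problems solved.
\[
  \text{pass@}k \;=\;
  \mathop{\mathbb{E}}_{\text{problems}}
  \left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]
```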