This repository contains the code and experiments to reproduce the results of the paper *Evaluating Small-Scale Code Models for Code Clone Detection*.
Detecting code clones is important for software maintenance and refactoring. This project evaluates six small transformer-based code models, assessing their ability to classify code pairs as clones or non-clones across five benchmark datasets: BigCloneBench, Karnalim, PoolC, POJ104, and CodeJam. The evaluated models are listed below, followed by a minimal usage sketch.
- CodeBERT (125M parameters)
- GraphCodeBERT (125M parameters)
- Salesforce CodeT5 (220M parameters)
- UniXCoder (~200M parameters)
- PLBART (140M parameters)
- PolyCoder (160M parameters)
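As a rough illustration of how one of these models can score a code pair (a sketch only, not necessarily the exact pipeline used by the scripts in this repository; the mean pooling and the 0.95 threshold are assumptions for demonstration), the snippet below embeds two fragments with CodeBERT and compares them by cosine similarity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load CodeBERT (one of the evaluated models); the repository's scripts
# may instead fine-tune a classification head on top of the encoder.
MODEL_NAME = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(code: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single snippet vector."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

snippet_a = "def add(a, b):\n    return a + b"
snippet_b = "def sum_two(x, y):\n    return x + y"

# Cosine similarity of the two snippet embeddings; the 0.95 cutoff is an
# arbitrary illustration, not a calibrated decision threshold.
similarity = torch.cosine_similarity(embed(snippet_a), embed(snippet_b), dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
print("Clone" if similarity.item() > 0.95 else "Not a clone")
```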
The evaluation covers five benchmark datasets (a loading example follows the list):
- BigCloneBench: Large, validated clone pairs from open-source projects.
- CodeJam: Google Code Jam competition submissions.
- Karnalim: Academic exercise-based code pairs.
- POJ104: Peking University student submissions.
- PoolC: Diverse clone types from open-source projects.
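Some of these benchmarks are available through the Hugging Face Hub; for example, a CodeXGLUE mirror of BigCloneBench can be loaded as shown below (the dataset identifier and field names are taken from that mirror and may differ from the data source actually used by the scripts):

```python
from datasets import load_dataset

# Assumption: the CodeXGLUE mirror of BigCloneBench on the Hugging Face
# Hub; each example is a pair of functions plus a clone/non-clone label.
ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench", split="train")

example = ds[0]
print(example["func1"][:80])   # first code fragment of the pair
print(example["func2"][:80])   # second code fragment of the pair
print("label:", example["label"])  # whether the pair is a clone
```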
The project requires:
- Python 3.8 or higher
- PyTorch
- Transformers (Hugging Face)
- Datasets (Hugging Face)
Install the dependencies with:

```bash
pip install torch transformers datasets pandas numpy scikit-learn
```
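To verify that the environment is set up correctly, a quick sanity check (assuming the packages above installed without errors) is:

```python
# Environment sanity check: confirm the core dependencies import and
# report their versions plus GPU availability.
import torch
import transformers
import datasets
import sklearn

print(f"PyTorch {torch.__version__} (CUDA available: {torch.cuda.is_available()})")
print(f"Transformers {transformers.__version__}")
print(f"Datasets {datasets.__version__}, scikit-learn {sklearn.__version__}")
```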
Clone the repository:

```bash
git clone https://github.com/jorge-martinez-gil/small-code-models.git
cd small-code-models
```
The scripts report performance using the following metrics (a scikit-learn sketch follows the list):
- Accuracy
- Precision
- Recall
- F1-score
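For reference, all four metrics can be computed with scikit-learn; the labels below are a made-up example (1 = clone, 0 = non-clone), not real results:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and model predictions for eight pairs;
# the actual values come from each evaluation script.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.3f}")
```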
Results for each model-dataset combination, including detailed tables and analysis, are presented in the associated paper.
If you find this work useful, please cite:
```bibtex
@article{martinezgil2025,
  author     = {Jorge Martinez-Gil},
  title      = {Evaluating Small-Scale Code Models for Code Clone Detection},
  journal    = {CoRR},
  volume     = {abs/2506.10995},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2506.10995},
  doi        = {10.48550/arXiv.2506.10995},
  eprinttype = {arXiv},
  eprint     = {2506.10995}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.