
Fine-Tuning Small Language Models to Optimize Code Review Accuracy

Generative AI is transforming enterprises by driving innovation and boosting efficiency across numerous applications. However, adopting large foundational models poses several challenges, including high costs, slow performance, and data privacy concerns. Many enterprises hesitate to share sensitive code or data with external LLM providers. Additionally, while foundational LLMs excel at general tasks, they often require extensive prompt engineering to achieve high accuracy on specific enterprise-focused use cases. 

Fine-tuning small language models (SLMs), often leveraging techniques like knowledge distillation, offers an attractive solution to these challenges. These smaller models can deliver performance close to that of larger models while being significantly faster and more cost-effective. Additionally, SLMs can be deployed on-premises or in virtual private clouds (VPCs), enabling enterprises to keep sensitive data secure. However, fine-tuning smaller models requires high-quality labeled data, which is time-consuming and expensive to create. 

This post introduces an automated fine-tuning approach that addresses these challenges by using the data flywheel strategy, a feedback-driven mechanism that iteratively enhances model performance. The approach incorporates curriculum learning, a technique inspired by human learning, where training data is introduced progressively based on complexity. By using large “teacher” models to generate and structure synthetic training data, this method optimizes the fine-tuning process, enabling smaller models to handle complex tasks more effectively while minimizing human intervention.

We’ll cover the following topics: 

  • Overview of the automated fine-tuning approach: A teacher-student paradigm for creating efficient training workflows.
  • Implementation steps: Key stages like exam generation, evaluation, and fine-tuning.
  • Applications in code-review automation: Real-world examples like severity rating and explanation generation, where the automated fine-tuned SLM (Llama 3 8B Instruct plus low-rank adaptation (LoRA), or llama3-8b+LoRA) improved accuracy by 18%, outperforming larger models, and delivered expert-aligned explanations—all with lower costs and latency.
  • Lessons learned: Best practices for scalable, cost-effective AI solutions.

By the end of this post, you’ll know how fine-tuned SLMs can enable enterprises to achieve competitive accuracy while addressing challenges related to cost, latency, and scalability. While the focus here is on code assistance, the methodology is applicable across diverse enterprise use cases.

This post is part of the NVIDIA Chat Labs series, which shares insights and best practices from internal generative AI projects to help others navigate AI adoption.

Overview of automated fine-tuning approach

Our automated fine-tuning approach draws inspiration from how teachers adapt lessons to address students’ specific areas for improvement. It incorporates the principles of knowledge distillation, using a teacher-student paradigm.

A process flow depicting the developed automated fine-tuning process. The teacher model generates an exam, the student model takes the exam, and the teacher evaluates the results. Based on the evaluation, the teacher generates a new curriculum for fine-tuning. The loop continues until the desired performance is achieved.
Figure 1. High-level overview of the automated fine-tuning architecture

In this process, a large LLM (the teacher) organizes and prepares training data (the curriculum) for the smaller LLM or SLM (the student), using five iterative steps:

1. Exam generation: The teacher LLM creates a test for the student SLM, based on prior performance, user feedback (data flywheel), and previous exam results. 

2. Taking the exam: The student takes the test generated by the teacher. 

3. Evaluation: The teacher evaluates the student’s performance, highlighting the student’s strengths and areas for improvement. 

4. Curriculum generation: The teacher customizes the training data based on evaluation results, tailoring it to address specific weaknesses.

5. Fine-tuning: The student is fine-tuned on the updated dataset using techniques such as LoRA. Unlike traditional fine-tuning, which adjusts all the model parameters and requires significant computational resources, LoRA optimizes a smaller set of parameters specific to the task, making the process cost- and memory-efficient.

This process is repeated until the student’s performance stabilizes or a computation budget is reached, ensuring cost-effective and efficient training. 
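To make the loop concrete, below is a minimal Python sketch of the iteration described above. The five helper functions are hypothetical stand-ins for the steps detailed in the next section, not the production implementation.

def automated_finetuning_loop(teacher, student, task_data,
                              target_proficiency=9, max_rounds=5):
    # Hypothetical helpers corresponding to steps 1-5 of the workflow.
    proficiency, feedback, curriculum = 0, "", []
    for _ in range(max_rounds):  # computation budget
        exam = generate_exam(teacher, task_data, proficiency, feedback)  # step 1
        results = take_exam(student, exam)                               # step 2
        proficiency, feedback, new_examples = evaluate_exam(
            teacher, results)                                            # step 3
        if proficiency >= target_proficiency:  # performance has stabilized
            break
        curriculum = build_curriculum(curriculum, new_examples)          # step 4
        student = finetune_student(student, curriculum)                  # step 5
    return student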

Implementing fine-tuning

The following sections dive deeper into each of the five iterative steps of the automated fine-tuning workflow shown in Figure 1. 

1. Exam generation

The teacher LLM generates the exam using the EXAM_PROMPT below. Inputs to the prompt include:

  1. Task data: Specific data related to the task, including user feedback or synthetic examples.
  2. Task prompt: A prompt that converts task data into a single training example for the student LLM.
  3. Current proficiency: The student LLM’s proficiency level from previous evaluations.
  4. Feedback: Teacher-generated insights highlighting the student’s weaknesses. 

To generate exam questions, the teacher LLM applies TASK_PROMPT to each entry in the DATA_SOURCE, tailoring questions to the student’s areas for improvement. 

An example EXAM_PROMPT is shown below.

EXAM_PROMPT = """
[TASK]
%s

[DATA SOURCE]
%s

[PREVIOUS_EXAM_RESULTS]
Proficiency: %s
Feedback: %s

Create an exam of %s questions based on the task after the [TASK] tag.
You can use the data after [DATA_SOURCE] for creating the dataset.
Modify the data in the data source appropriately to create questions for the exam (for example to balance the exam).  
If there is none, then create your own.
Results for the expected proficiency and feedback from the previous exam (if any)
are indicated after [PREVIOUS_EXAM_RESULTS]. You can use that information to create
a better exam (in terms of difficulty, questions etc).
The complete exam must appear as json after any thoughts you have
and there must be no other text after it. Think about your answer carefully.

Your output format must strictly be in the following format as a json.
"exam": A list of json objects in the format below
"question": A json with the information in the [INPUT_JSON] of the [TASK]
"answer": A json with the response based on the [OUTPUT_JSON] of the [TASK]
"""

TASK_PROMPT generates task-specific exam questions. An example for predicting the severity of a code change during code review is shown below.

TASK_PROMPT = """Assign an issue type to the code below.

[ISSUE_TYPES]
critical: Security vulnerabilities, bugs that will cause a crash or code that can abruptly exit the execution.
major: Severe bugs that will cause a system to produce incorrect results.
minor: Results in some unexpected or undesired behavior, but not enough to disrupt system function.
trivial: Issue won't result in any noticeable breakdown of the system. (e.g. docstring changes, comments etc)

The code and review are formatted as json below.
[INPUT_JSON]
%s

Your output must be in json with no other text. The format is below.
{"issue_type": A value in [critical, major, minor, trivial]}
"""

INPUT_JSON = """
"code": The code snippet under review
"review": A review of the code.
"""

The exam generation step produces a list of question-and-answer pairs in JSON format. Below is an example question-and-answer pair for the code review severity prediction task. The question includes the code change and review feedback, while the answer specifies the expected severity level.

{
    "question": {
        "code": "<code snippet>",
        "review": "<code review>"
    },
    "answer": {
        "issue_type": "major"
    }
}

2. Taking the exam 

In this step, the questions generated in Step 1 are used to evaluate the student LLM. Each exam question is combined with TASK_PROMPT to create a list of prompts. The student LLM processes these prompts and generates answers based on its understanding.

For example, in code review severity prediction, the student LLM analyzes the code snippet and review feedback, classifying the severity as critical, major, minor, or trivial.
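A minimal sketch of this step is shown below. Here, student_generate is a hypothetical callable wrapping the student SLM’s inference endpoint; output that fails to parse is recorded as a missing answer, which the evaluation prompt in the next step treats as incorrect.

import json

def take_exam(student_generate, exam):
    # student_generate: hypothetical callable mapping a prompt string to
    # the student LLM's raw completion.
    results = []
    for item in exam:
        prompt = TASK_PROMPT % json.dumps(item["question"])
        raw = student_generate(prompt)
        try:
            model_answer = json.loads(raw)
        except json.JSONDecodeError:
            model_answer = None  # evaluator treats missing answers as incorrect
        results.append({
            "question": item["question"],
            "answer": item["answer"],
            "model_answer": model_answer,
        })
    return results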

3. Evaluation 

After the student LLM takes the exam, the teacher LLM evaluates its performance using the EXAM_EVALUATION_PROMPT, which includes:

  1. Task prompt: Used to generate the exam questions.
  2. Exam results: Student’s answers from Step 2, formatted as a list of question-answer pairs.
  3. Data source: An optional data source the teacher uses to generate additional training examples. If not provided, the teacher LLM generates its own examples.
  4. Number of training samples: Specifies how many new training samples to create. These examples are added to the existing training data to address the student’s weaknesses. 

The teacher LLM assigns a proficiency score (1-10), provides feedback, and generates a tailored training dataset that addresses the student’s weaknesses. 

An example EXAM_EVALUATION_PROMPT is shown below.

EXAM_EVALUATION_PROMPT = """For the task specified by [TASK], an exam was
administered to evaluate the model's current capabilities. The results
appear after [EXAM_RESULTS] as a json list where each element is a json
consisting of the fields:
"question": The data for the task
"answer": The correct answer
"model_answer": The model's answer. If there is no answer here, then assume that 
the model could not answer the question correctly.

[TASK]
%s

[EXAM_RESULTS]
%s

Your task is to evaluate these exam results to determine the model's capabilities
on a scale of 1-10 with 10 being proficient. Next, based on the model's results
on the exam, identify areas for improvement.
Finally, please develop a curriculum of %s examples to help train the model.
The output is VALID JSON formatted as follows:
"feedback": The feedback as a string
"proficiency": The proficiency score of the model as an integer.
"dataset": A list of json objects for training the model.
   "question": A question for training the model in the exact same format as the [TASK] asks
   "answer": The answer to the question in the same format as the [TASK] asks

You must use the data after [DATA_SOURCE] for creating the dataset. You can modify
the data as you see fit. If there is none, then create your own.

[DATA_SOURCE]
%s
"""

4. Curriculum generation

The new training examples generated in Step 3 are combined with the existing dataset to create an updated curriculum. This tailored curriculum specifically addresses the student’s areas of weakness, enabling more effective fine-tuning.
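One way to implement the merge is sketched below, under two assumptions: duplicates are dropped by question payload, and the curriculum is serialized as the input/output JSONL records that NeMo’s fine-tuning script consumes by default (verify the field names against your NeMo version).

import json

def build_curriculum(existing, new_examples, path="curriculum_train.jsonl"):
    # Merge, skipping questions already present in the curriculum.
    seen = {json.dumps(ex["question"], sort_keys=True) for ex in existing}
    curriculum = list(existing)
    for ex in new_examples:
        key = json.dumps(ex["question"], sort_keys=True)
        if key not in seen:
            curriculum.append(ex)
            seen.add(key)
    # Assumption: NeMo's default prompt template expects "input"/"output" keys.
    with open(path, "w") as f:
        for ex in curriculum:
            record = {"input": TASK_PROMPT % json.dumps(ex["question"]),
                      "output": json.dumps(ex["answer"])}
            f.write(json.dumps(record) + "\n")
    return curriculum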

5. Fine-tuning

The student LLM is fine-tuned on the updated curriculum using the NVIDIA NeMo Framework, specifically the megatron_gpt_finetuning.py script available in the NeMo Framework Docker container. 

To simplify the process, the fine-tuning workflow is encapsulated in the PEFTFineTuning class, which enables seamless integration and execution within Python. Below is an example of how fine-tuning is initiated.

import subprocess
import pathlib
import os
import shutil


def initialize_directory(directory, clean=True):
   if os.path.exists(directory) and clean:
       shutil.rmtree(directory)
   os.makedirs(directory, exist_ok=True)

class PEFTFineTuning:

   # Path to the PEFT fine-tuning script inside the NeMo Framework container
   MEGATRON_GPT_FINETUNING_SCRIPT = \
       "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py"

   def __init__(self, scheme, dataset,
       model,
       adapter_name=None,
       output_dir=None,
       torchrun_nproc_per_node=1,
       devices=1, num_nodes=1,
       megatron_amp_O2=True, mcore_gpt=True,
       tensor_size=1,
       pipeline_size=1,
       micro_batch_size=1,
       global_batch_size=16,
       ds_num_workers=0,
       train_sampling_probs=[1.0],
       adapter_restore_path=None,
       lr=1e-4,
       adapter_dim=32):

       self.nproc_per_node = torchrun_nproc_per_node

       self.megatron_gpt_params = {
           "trainer.devices": devices,
           "trainer.num_nodes": num_nodes,
           "model.megatron_amp_O2": megatron_amp_O2,
           "++model.mcore_gpt": mcore_gpt,
           "model.tensor_model_parallel_size": tensor_size,
           "model.pipeline_model_parallel_size": pipeline_size,
           "model.micro_batch_size": micro_batch_size,
           "model.global_batch_size": global_batch_size,
           "model.data.train_ds.num_workers": ds_num_workers,
           "model.data.train_ds.concat_sampling_probabilities": train_sampling_probs,
           "model.data.validation_ds.num_workers": ds_num_workers,
           "model.peft.peft_scheme": scheme,
           "model.optim.lr": lr,
           "model.peft.lora_tuning.adapter_dim": adapter_dim
       }

       if adapter_restore_path is not None:
           self.megatron_gpt_params["model.peft.restore_from_path"] = \
               adapter_restore_path


       self.model = model
       self.dataset = dataset
       self._adapter_name = adapter_name
       if self._adapter_name is None:
           self._adapter_name = "%s_%s" % (scheme, dataset.name)

       self.output_dir = output_dir
       if self.output_dir is None:

           self.output_dir = "%s/%s" % (self.model.model_dir,
                                        self._adapter_name)
  
   @property
   def adapter_name(self):

       return self._adapter_name

   def _get_peft_cmd(self):

       cmd = ["torchrun"]
       cmd.append("--nproc_per_node=%s" % (self.nproc_per_node))
       cmd.append(PEFTFineTuning.MEGATRON_GPT_FINETUNING_SCRIPT)
      
       for key, value in self.megatron_gpt_params.items():
           cmd.append("%s=%s" % (key, value))

       return cmd

   def finetune(self, clean=True,
                val_check_interval=20, max_steps=8000):
      
       initialize_directory(self.output_dir, clean)   

       cmd = self._get_peft_cmd()

       cmd += [
           "exp_manager.exp_dir=%s" % (self.output_dir),
           "exp_manager.explicit_log_dir=%s" % (self.output_dir),
           "trainer.precision=%s" % (self.model.precision),
           "trainer.val_check_interval=%s" % (val_check_interval),
           "trainer.max_steps=%s" % (max_steps),
           "model.restore_from_path=%s" % (self.model.model_path),
           "model.data.train_ds.file_names=%s" % (self.dataset.train_ds),
           "model.data.validation_ds.file_names=%s" % (self.dataset.val_ds),
       ]

       subprocess.call(cmd)

   def get_nim_adapter_path(self, base_dir=ncodepro.NIM_STORE):
       # ncodepro is an internal module; NIM_STORE is the directory from
       # which NIM loads LoRA adapters.
       nim_store_dir = "%s/%s" % (base_dir, self._adapter_name)
       nemo_model_path = "%s/%s.nemo" % (nim_store_dir, self._adapter_name)
       return nemo_model_path
  
   def save(self, base_dir=ncodepro.NIM_STORE, clean=True):

       nim_store_dir = "%s/%s" % (base_dir, self._adapter_name)
       nemo_model_path = "%s/%s.nemo" % (nim_store_dir, self._adapter_name)

       initialize_directory(nim_store_dir, clean)

       # Copy the LoRA adapter checkpoint produced by the NeMo script
       # into the NIM adapter store.
       peft_checkpoint = "%s/checkpoints/" \
           "megatron_gpt_peft_lora_tuning.nemo" % (self.output_dir)

       shutil.copyfile(peft_checkpoint, nemo_model_path)
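A hypothetical usage sketch follows. The model and dataset objects are SimpleNamespace stand-ins exposing the attributes PEFTFineTuning reads (model_path, model_dir, precision; name, train_ds, val_ds), and every path is illustrative.

from types import SimpleNamespace

# Illustrative stand-ins; in practice these come from your model and
# dataset management code.
model = SimpleNamespace(
    model_path="/models/llama3-8b-instruct.nemo",
    model_dir="/models/llama3-8b-instruct",
    precision="bf16",
)
dataset = SimpleNamespace(
    name="code_review_severity",
    train_ds="[curriculum_train.jsonl]",  # Hydra-style list override
    val_ds="[curriculum_val.jsonl]",
)

finetuner = PEFTFineTuning(scheme="lora", dataset=dataset, model=model,
                           devices=8, torchrun_nproc_per_node=8)
finetuner.finetune(max_steps=2000)
finetuner.save(base_dir="/models/nim_store")  # export the LoRA adapter as .nemo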

Real-world application in code review automation

Code reviews are essential for ensuring software quality and performance, and are traditionally performed by human reviewers. A typical code review process involves the following: 

  • The author submits a merge request (MR) containing code that implements a feature or a bug fix.
  • Human reviewers assess the MR, suggesting changes or approving the code. 
  • If changes are requested, the author revises and resubmits the MR, repeating the process until the code is accepted.
Overview of the code review process. The author submits an initial MR, which is reviewed by others. Feedback leads to updates, and the cycle repeats until the code is accepted.
Figure 2. Code review process

Recent advancements in generative AI have enabled the automation of the code review process as shown in Figure 3. A fine-tuned LLM evaluates MRs to identify bugs or issues, assigns severity levels to each, and provides explanations for its ratings. The process filters out low-severity issues below a user-defined threshold, enabling developers to focus on critical concerns such as security vulnerabilities.
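The threshold filter itself is straightforward; here is an illustrative sketch, with field names following the TASK_PROMPT output format shown earlier.

# Severity order matching the [ISSUE_TYPES] taxonomy above.
SEVERITY_RANK = {"trivial": 0, "minor": 1, "major": 2, "critical": 3}

def filter_issues(issues, threshold="major"):
    # Keep only issues at or above the user-defined severity threshold.
    return [issue for issue in issues
            if SEVERITY_RANK[issue["issue_type"]] >= SEVERITY_RANK[threshold]]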

Fine-tuned SLMs enhanced the following two key areas in automated code reviews at NVIDIA:

  • Severity rating: Improving the LLM accuracy in assigning severity levels.
  • Explanation generation: Enhancing the clarity and quality of reasoning provided by the LLM. 
Diagram showing automated code review process with LLMs. Code is analyzed, issues are identified and rated for severity, and feedback is provided for the authors to address.The process repeats until the MR is accepted.
Figure 3. Automating code reviews using LLMs

Performance evaluation: Accuracy and quality gains

We fine-tuned the Llama 3 8B Instruct model using our automated fine-tuning technique, resulting in llama3-8b+LoRA. We assessed its performance on the following two tasks: 

  • Severity rating prediction: Measuring the accuracy of severity predictions.
  • Severity explanation generation: Evaluating the quality of explanations for severity ratings. 

Severity rating prediction

Fine-tuning with GPT-4 as a teacher, leveraging knowledge distillation, significantly enhanced the severity rating prediction accuracy of the smaller model. As shown in Figure 4, the fine-tuned Llama 3 8B plus LoRA (highlighted in green) achieved an improvement of more than 18% compared to its baseline model (Llama 3 8B without fine-tuning).

Notably, the fine-tuned Llama 3 8B plus LoRA (llama3-8b+LoRA) also outperformed much larger models, such as Llama 3 70B (8x larger) and Nemotron 4 340B Instruct (40x larger). Despite its superior accuracy, the model maintained lower latency and reduced inference costs, demonstrating that this fine-tuning approach is both efficient and highly effective for optimizing smaller models.

Bar chart comparison of severity rating accuracy across models. The fine-tuned Llama 3 8B Instruct (llama3-8b+LoRA) with GPT-4 as a teacher outperformed its baseline by 18% and surpassed larger models, demonstrating better performance with reduced latency and resource usage.
Figure 4. Severity rating accuracy achieved with different LLMs. The LoRA fine-tuned Llama 3 8B Instruct model (llama3-8b+LoRA) outperformed its baseline by 18% and surpassed larger models

Severity explanation quality

To evaluate explanation quality, the teacher LLM (GPT-4) was used as a judge. The teacher compared explanations from the fine-tuned model to those produced by other models. Figure 5 illustrates the preference differential, which measures how often GPT-4 preferred the fine-tuned model’s explanation over another model’s. A positive preference differential indicates the fine-tuned model outperformed the other model, while a negative value suggests the opposite. 
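The post does not give the exact formula, but one plausible formulation is the judge’s wins for the fine-tuned model minus its wins for the comparison model, as a fraction of all judgments, with ties counting for neither side.

def preference_differential(judgments):
    # judgments: per-example verdicts from the GPT-4 judge, each one of
    # "finetuned", "other", or "tie".
    wins = judgments.count("finetuned")
    losses = judgments.count("other")
    return (wins - losses) / len(judgments)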

The results demonstrate that the LoRA fine-tuned Llama 3 8B model (llama3-8b+LoRA) consistently outperformed Llama 3 70B, Nemotron 4 340B, and even its own baseline (Llama 3 8B). For all comparisons, the fine-tuned model was either preferred or performed equally well, showcasing its strong alignment with expert-level standards for explanation quality. 

Bar chart comparing the explanation quality of the fine-tuned Llama 3 8B (llama3-8b+LoRA) model against other models. Green bars indicate GPT-4 preference for the LoRA fine-tuned (llama3-8b+LoRA) model, while gray bars indicate preference for the other model. The fine-tuned model consistently matches or outperforms other models.
Figure 5. Comparison of severity explanation quality. The preference differential highlights how often GPT-4 preferred the explanations from the LoRA fine-tuned model (llama3-8b+LoRA)

Benefits of fine-tuned SLMs: Efficiency and performance gains

The application of fine-tuned SLMs to code review automation demonstrates two primary advantages:

  • Cost-effective fine-tuning: Using fine-tuned SLMs for code review tasks reduces costs and latency. This makes it an ideal approach for enterprise workflows that need to balance budget constraints with performance requirements.
  • Improved accuracy and alignment: Using fine-tuned SLMs significantly enhances task-specific performance. By improving severity ratings and aligning explanations with expert-level standards, fine-tuned SLMs deliver reliable evaluations that help development teams focus on critical code issues.

Lessons learned from scaling AI with SLMs

The development of fine-tuned SLMs using an automated approach has provided valuable insights into creating cost-efficient and scalable AI solutions tailored for enterprise applications. Key lessons include:

  • Start with targeted fine-tuning: Focus on smaller models to achieve an optimal balance between performance and resource utilization. This enables enterprises to evaluate trade-offs effectively before scaling up.
  • Leverage parameter-efficient fine-tuning (PEFT) and knowledge distillation: Combining PEFT methods like LoRA with knowledge distillation ensures high performance while minimizing computational overhead, making them ideal for resource-limited environments.

By using fine-tuned smaller LLMs, enterprises can address the challenges of high costs and slow performance often associated with large models. This strategy enables businesses to achieve competitive accuracy while keeping AI solutions cost-effective, responsive, and tailored to specific needs. Although this post highlights applications in code assistance, this methodology is highly versatile and applicable across a wide range of enterprise use cases.

Begin fine-tuning models for your AI applications

Discover how NVIDIA generative AI technologies can help you fine-tune and deploy models for your specific needs. If you’re just getting started, check out Building Your First LLM Agent Application and Build Your First Human-in-the-Loop AI Agent with NVIDIA NIM to gain practical experience with NVIDIA tools and methodologies for developing and deploying NVIDIA NIM LLM microservices.

Acknowledgments

We extend our heartfelt gratitude to Rushang Karia, Agustin Rivera, Mark Philipp, Abhinav Kumar, Anbang Xu, Rama Akkiraju, Ashwin Poojary, Ahmad Daoud, and Ashwin Jha for their invaluable contributions and unwavering support. Their expertise and dedication were instrumental in bringing this work to fruition.
