Auto-code generation using GPT-2
About GPT-2
GPT-2 stands for “Generative Pre-trained Transformer”. It is an open-source model with over 1.5 billion parameters, trained to generate the next sequence of text for a given input sequence.
GPT-2 has a remarkable ability to generate text, far beyond the expectations of conventional language models.
“Too dangerous to be released.”
A phrase from the press coverage of OpenAI’s announcement of its GPT-2 language model in February 2019.

The threat of AI-driven misinformation has become a huge issue that remains unresolved in today’s post-factual information ecosystem, especially with the recent release of the more powerful GPT-3.

Performance of GPT on different datasets
“Today, the United Nations has called for the immediate withdrawal of all nuclear weapons from the world.”
The sentence you just read was neither written by me, nor was it written by the editor. This sentence was written by GPT-2.
GPT-2 is a 1.5-billion-parameter transformer-based language model, trained on a dataset of 8 million web pages (about 40 GB of Internet text) with the simple objective of predicting the next word. Because of concerns about misuse, only a much smaller model was initially released for researchers to experiment with.
Working mechanism of GPT-2

Working of GPT-2
- Generative: This means the model was trained to predict the next token in a given sequence of tokens. The model is given a lot of raw text and is then asked to generate more text using the statistical patterns it has learned; earlier language models did this with RNN and LSTM layers, whereas GPT-2 relies on the attention mechanism.
- Pre-trained: OpenAI trained a large, powerful transformer language model for tasks like summarization and neural machine translation. The model was trained on 40 GB of text known as WebText.

Transformer Architecture
- Transformer: GPT-2 is built using layers of transformer decoder blocks.

The text in blue is given as an input and the text in red is the predicted output
The GPT-2 architecture is based on the concepts of the transformer. Unlike the original encoder-decoder transformer, GPT-2 uses only the decoder stack to learn the dependencies between the tokens it has seen and the token it produces next. To generate the next output in a given sequence, the model needs the previously generated data as input.
- The GPT-2 has a great ability to adapt to the context of the text and thus generates realistic and coherent output.
- The model works by adding each token to the sequence of inputs as it is created. In the next step, that new sequence becomes the model’s input. This idea is called “auto-regression”, and it is one of the ideas that made RNNs unreasonably effective.
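To make this loop concrete, here is a minimal sketch in Python, assuming the Hugging Face transformers library and the public distilgpt2 checkpoint; the library’s generate() method performs the same loop internally, so this is for illustration only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

input_ids = tokenizer.encode("Today, the United Nations", return_tensors="pt")

# Auto-regression: predict one token, append it, and feed the longer sequence back in.
for _ in range(15):
    with torch.no_grad():
        logits = model(input_ids).logits            # shape: (1, sequence_length, vocab_size)
    next_id = logits[0, -1].argmax().reshape(1, 1)  # greedy pick of the most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))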
What is the meaning of automatic code generation?
Automatic code generation essentially involves completing a sequence of code based on the user’s previous inputs and past coding habits.
- Several commercial tools, such as TabNine and Kite, are already available in the market for this task. Both use GPT-2 to predict the next sequence of code based on the input the user has already provided.
- Here is a short video that demonstrates the capabilities of an automatic code generation process:
Steps to fine-tune GPT-2 for code generation
1. Cloning the required repositories
- Here, we clone the auto_coding repository, which contains the code and scripts to fine-tune the GPT-2 model for automatic code generation.
- We need to provide the training examples in the form of scripts (for example: Python, C, C++, Java, and JavaScript).
- For fine-tuning our GPT-2 model, we have used scripts from the scikit-learn examples.
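The exact commands and folder layout are defined by the repository itself and are not reproduced here. The snippet below is only a rough sketch of this step: it gathers the scikit-learn example scripts into a local dataset folder, with scikit-learn/examples and auto_coding/dataset/python used as placeholder paths rather than the repository’s actual ones.

import shutil
from pathlib import Path

# Placeholder paths: point these at the cloned scikit-learn examples
# and at whatever folder the auto_coding scripts expect the data in.
examples_dir = Path("scikit-learn/examples")
dataset_dir = Path("auto_coding/dataset/python")
dataset_dir.mkdir(parents=True, exist_ok=True)

# Copy every Python example script into the training-data folder.
for script in examples_dir.rglob("*.py"):
    shutil.copy(script, dataset_dir / script.name)

print(f"Collected {len(list(dataset_dir.glob('*.py')))} training scripts")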
2. Downloading the required scripts
Now, we download the convert.py script. It contains the code to convert our examples into training data in the format expected by our model.
3. Navigating to the required directory and converting the data
Here, we execute the convert.py script with a segment length of 256, a stride of 10, and a development size of 10%. 90% of the data will be used for training and the remaining 10% for testing our model.
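The real convert.py lives in the cloned repository and is not shown here. The sketch below only illustrates the kind of preprocessing such a script typically performs: cutting each tokenized file into segments of 256 tokens, stepping through the file with a stride of 10, and holding out 10% of the segments as a development set. The exact meaning of the stride argument in the actual script may differ.

import random

def split_into_segments(tokens, segment_len=256, stride=10):
    # Sliding windows over the token sequence; one plausible reading of "stride".
    segments = []
    for start in range(0, max(len(tokens) - segment_len, 0) + 1, stride):
        segments.append(tokens[start:start + segment_len])
    return segments

def train_dev_split(segments, dev_size=0.1, seed=42):
    # Shuffle and hold out dev_size (here 10%) of the segments for evaluation.
    random.Random(seed).shuffle(segments)
    n_dev = int(len(segments) * dev_size)
    return segments[n_dev:], segments[:n_dev]

tokens = list(range(1000))  # stand-in for one tokenized source file
train_segments, dev_segments = train_dev_split(split_into_segments(tokens))
print(len(train_segments), "training segments,", len(dev_segments), "dev segments")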
4. Executing the training script and selecting distilgpt2 as the model
Different model sizes are available for fine-tuning; here, distilgpt2 is selected. If you have enough computing resources and a large dataset, you may go for a version with a larger number of parameters.
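The repository ships its own training script with its own command-line flags, which are not reproduced here. The block below is a rough stand-in built on Hugging Face’s Trainer API; the input file train.txt, the output directory, and the hyperparameters are placeholder assumptions rather than values from the original script.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

model_name = "distilgpt2"   # swap in gpt2, gpt2-medium, ... if resources allow
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder file: the converted training segments written out as plain text.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

args = TrainingArguments(output_dir="distilgpt2-code", num_train_epochs=3,
                         per_device_train_batch_size=2, save_steps=1000)

Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=collator).train()

Sticking with distilgpt2 keeps a fine-tuning run like this feasible on a single modest GPU, which is the main reason the smaller distilled model is chosen here.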
5. The model is now trained. Let’s check it out
After the training is completed, we need to execute the interact.py script. This script runs the model for testing purposes: we provide an input, and the model predicts the following sequence.
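interact.py itself is not reproduced here; the snippet below sketches the same idea, assuming the fine-tuned checkpoint was saved to a placeholder folder named distilgpt2-code. It reads a code prefix from the user and lets the model complete it, capped at 15 new tokens as discussed later in this post.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "distilgpt2-code"   # placeholder: path to the fine-tuned model from the previous step
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

while True:
    prefix = input("code> ")     # e.g. "import numpy as"
    if not prefix:
        break
    input_ids = tokenizer.encode(prefix, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=15, do_sample=True, top_k=50,
                            pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output[0], skip_special_tokens=True))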
How is this use case helping tech professionals?
- This solution is already being used in the industry. It allows developers to code faster with 47% fewer keystrokes, which helps them become more productive.
- Also, if a block of code has already been written earlier, the user only needs to type a few identical keystrokes and the rest of the block is completed automatically.
- According to some reviews, this has helped developers save a significant chunk of time, since they write only 70% to 80% of the code and the remaining 20% to 30% is generated automatically.
- With all these advantages, the industry is also saving a significant amount of time.
Limitations of GPT-2
- GPT-2 cannot handle complex and long language constructions.
- If someone wants to generate a sequence of text related to a particular domain such as literature, finance, or medicine, then it will not perform well.
- There are also limitations in terms of computing resources: training such a huge model, with billions of parameters, requires very expensive hardware for the model to perform well.
Writer’s perspective on the GPT architecture
- GPT is no doubt an amazing invention in the field of Natural Language Processing, but its full capabilities are still being explored, since OpenAI initially withheld the complete pre-trained model because of concerns about misuse.
- For this automatic code generation task, I used distilGPT-2 because of its small size and relatively inexpensive fine-tuning. We limited the generated sequence length to 15 tokens for more effective and crisp predictions.
- If one chooses to use a GPT-2 model with a higher number of parameters, the sequence length can be increased accordingly, keeping the available computing resources in mind.

Here is a tweet from the VP of AI at Facebook.
I’m quite aware of the additional risks, such as the possibility of GPT-3’s human-like text generation capacity being used for phishing, scamming, spamming, distributing false news, or other fraudulent actions. Hence, one should use such models keeping ethics in mind.
We should use artificial intelligence to make our lives better, not to carry out any type of criminal activity.
Conclusion
Congratulations on making it all the way to the end of this blog! Thank you very much for taking the time to read it. I hope it was useful in getting you up and running.
Did you like GPT-2’s superpowers? Please let me know in the comments section; all thoughts and insights are eagerly appreciated.