
<Deep Learning> Simple GPT: Building a Lightweight GPT Model from Scratch

by 바건정 2025. 2. 28.

Introduction

Creating a GPT model from scratch is an exciting way to understand the fundamentals of natural language processing and deep learning. In this post, we'll walk through the step-by-step process of building a simple GPT model, covering the Transformer architecture, training process, and text generation techniques.

What is GPT?

GPT (Generative Pre-trained Transformer) is a decoder-only Transformer model designed for text generation. It is trained to predict the next token in a sequence, so at inference time it generates text one token at a time, conditioned on the input context.

Key Components of GPT

  • Embedding Layer: Converts tokens into numerical vectors.
  • Transformer Blocks: Use self-attention to model relationships between tokens.
  • Feed-Forward Network (FFN): Processes information inside each Transformer block.
  • Final Linear Layer: Maps the hidden states back to vocabulary logits (the tensor shapes involved are sketched below).
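
Before implementing anything, it can help to trace the tensor shapes that flow through these components. The sketch below uses made-up sizes (vocab_size=1000, embed_dim=128, and so on) purely for illustration:

import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to show the shapes involved
vocab_size, embed_dim, seq_len, batch_size = 1000, 128, 16, 2

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # (2, 16) integer token ids

embedding = nn.Embedding(vocab_size, embed_dim)
hidden = embedding(token_ids)                                    # (2, 16, 128) vectors

# ... the Transformer blocks keep the shape (batch, seq_len, embed_dim) ...

final_linear = nn.Linear(embed_dim, vocab_size)
logits = final_linear(hidden)                                    # (2, 16, 1000) vocabulary logits
print(token_ids.shape, hidden.shape, logits.shape)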

Step 1: Implementing the Transformer Block

The core of GPT is the Transformer block, which consists of masked (causal) self-attention, layer normalization, and a feed-forward network. Because GPT generates text left to right, each position is only allowed to attend to itself and earlier positions.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden_dim):
        super(TransformerBlock, self).__init__()
        # batch_first=True so inputs are (batch, seq_len, embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        attn_output, _ = self.attention(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_output)
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x

Explanation:

  • Masked (Causal) Multi-Head Attention: Captures dependencies between tokens while preventing each position from looking at future tokens.
  • Layer Normalization: Stabilizes training.
  • Feed-Forward Network: Applies transformations to extract features.
  • Residual Connections: Help gradients flow through the network (a quick shape check of the block follows below).
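
As a quick sanity check, the block should map a (batch, seq_len, embed_dim) tensor to a tensor of the same shape. A minimal sketch with made-up sizes:

block = TransformerBlock(embed_dim=128, num_heads=4, ff_hidden_dim=512)
x = torch.randn(2, 16, 128)      # (batch=2, seq_len=16, embed_dim=128)
out = block(x)
print(out.shape)                  # torch.Size([2, 16, 128])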

Step 2: Building the Simple GPT Model

Now, we stack multiple Transformer blocks to build a GPT model.

class SimpleGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_hidden_dim, num_layers, max_len=512):
        super(SimpleGPT, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Learned positional embeddings so the model knows the order of tokens
        self.pos_embedding = nn.Embedding(max_len, embed_dim)
        self.transformer_blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads, ff_hidden_dim) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)  # (1, seq_len)
        x = self.embedding(x) + self.pos_embedding(positions)
        x = self.transformer_blocks(x)
        x = self.fc(x)  # (batch, seq_len, vocab_size) logits
        return x

Explanation:

  • Embedding Layers: Convert token ids into vectors and add positional information so the model knows token order.
  • Stacked Transformer Blocks: Enhance understanding of long-range dependencies.
  • Final Linear Layer: Produces next-token logits at every position (a quick shape check follows below).
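
Before training, it is worth instantiating a small model and confirming that it produces one vector of vocabulary logits per input position. A minimal sketch (the names tiny_gpt and dummy_tokens, and every size here, are made up for illustration):

tiny_gpt = SimpleGPT(vocab_size=1000, embed_dim=128, num_heads=4,
                     ff_hidden_dim=512, num_layers=2)
dummy_tokens = torch.randint(0, 1000, (1, 10))         # (batch=1, seq_len=10)
logits = tiny_gpt(dummy_tokens)
print(logits.shape)                                     # torch.Size([1, 10, 1000])
print(sum(p.numel() for p in tiny_gpt.parameters()))    # total number of parameters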

Step 3: Training the Model

We train the model on a tiny chatbot-style text snippet. In practice you would use a much larger dataset, but this is enough to see the training loop in action.

from transformers import GPT2Tokenizer
import torch.optim as optim

# Load tokenizer and prepare data
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab_size = tokenizer.vocab_size
text = """Hello, how are you today? I am a chatbot designed to assist you.
You can ask me about machine learning, artificial intelligence, and programming."""
tokens = tokenizer.encode(text, return_tensors="pt")

# Model setup
model = SimpleGPT(vocab_size=vocab_size, embed_dim=128, num_heads=4, ff_hidden_dim=512, num_layers=2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop: learn to predict each next token from the tokens before it
for epoch in range(100):
    optimizer.zero_grad()
    output = model(tokens[:, :-1])                      # inputs: all tokens except the last
    loss = criterion(output.reshape(-1, vocab_size),    # logits for every position
                     tokens[:, 1:].reshape(-1))         # targets: the same tokens shifted by one
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Key Training Considerations:

  • Use a meaningful dataset so the model can learn useful patterns.
  • Monitor the loss (and perplexity, sketched below) to confirm the model is actually learning.
  • Experiment with different hyperparameters to find the best configuration.
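
One simple way to monitor training beyond the raw loss is perplexity, the exponential of the average cross-entropy; lower perplexity means the model assigns higher probability to the actual next tokens. A minimal sketch (the helper name perplexity is my own):

import math

def perplexity(cross_entropy_loss):
    # Perplexity is exp(average cross-entropy); lower is better
    return math.exp(cross_entropy_loss)

# Example usage inside the training loop above:
# print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}, PPL: {perplexity(loss.item()):.2f}")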

Step 4: Generating Text with the Model

Once trained, we can use our model to generate text.

def generate_text(model, prompt, max_length=20):
    model.eval()  # switch to inference mode
    tokens = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        for _ in range(max_length):
            output = model(tokens)                               # (1, seq_len, vocab_size)
            next_token = torch.argmax(output[:, -1, :], dim=-1)  # greedy: most likely next token
            next_token = next_token.unsqueeze(0)                 # reshape to (1, 1)
            tokens = torch.cat([tokens, next_token], dim=1)      # append and feed back in
    return tokenizer.decode(tokens[0], skip_special_tokens=True)

print(generate_text(model, "Hello,"))

How It Works:

  • Takes an input prompt and predicts the next token iteratively.
  • Uses argmax (greedy decoding) to choose the most likely next token; a sampling alternative is sketched below.
  • Appends each new token to the growing sequence until the desired length is reached.
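
Greedy argmax decoding always picks the single most likely token, which tends to produce repetitive text. A common alternative is to sample from the k most likely tokens; here is a minimal sketch (the function name sample_next_token and the default temperature/top_k values are my own, chosen for illustration):

def sample_next_token(logits, temperature=1.0, top_k=10):
    # Scale the logits: higher temperature flattens the distribution, lower sharpens it
    logits = logits / temperature
    # Keep only the k most likely tokens and sample among them
    top_values, top_indices = torch.topk(logits, top_k, dim=-1)
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)     # index into the top-k set
    return top_indices.gather(-1, choice)                # shape (1, 1): the chosen token id

# Inside generate_text, the two argmax/unsqueeze lines could be replaced with:
# next_token = sample_next_token(output[:, -1, :])
# tokens = torch.cat([tokens, next_token], dim=1)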

Conclusion

Building a GPT model from scratch helps in understanding how Transformer architectures work and how text generation is performed. By following these steps, you can create your own lightweight GPT model and extend it for more advanced tasks such as fine-tuning on larger datasets or integrating with real-world applications.

🔥 Next Steps:

  • Experiment with different datasets to improve the model’s performance.
  • Implement sampling methods (top-k, nucleus sampling) for better text generation; a nucleus-sampling sketch follows this list.
  • Fine-tune the model on custom domains (e.g., medical, legal, gaming texts).
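
For the sampling-methods item above, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability exceeds a threshold p and samples only from that set. A minimal sketch (the function name sample_top_p and the default values are illustrative):

def sample_top_p(logits, p=0.9, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative probability *before* them already exceeds p,
    # so the most likely token always survives
    cutoff = (cumulative - sorted_probs) > p
    sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices.gather(-1, choice)              # shape (1, 1): the chosen token id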

🎯 By taking this structured approach, you gain hands-on experience in building and training language models from scratch!


This post was written using GPT. I asked it to write a blog post on my behalf about the material I have been studying hands-on, and this is the result. If you spot anything wrong, please feel free to let me know anytime!

Thank you, and have a great day.
