Usage
1. Custom Language Model Training with TensorFlow
import numpy as np
import tensorflow as tf

Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
Sequential = tf.keras.models.Sequential
Embedding = tf.keras.layers.Embedding
SimpleRNN = tf.keras.layers.SimpleRNN
Dense = tf.keras.layers.Dense
LSTM = tf.keras.layers.LSTM
Dropout = tf.keras.layers.Dropout

# Load your text data
# Here the text is loaded from a relative file which contains the array of data (data.py)
from data import text_data_arr

# Tokenize the text at the character level
tokenizer = Tokenizer(char_level=True, lower=True)
tokenizer.fit_on_texts(text_data_arr)

# Convert text to sequences of token ids
sequences = tokenizer.texts_to_sequences(text_data_arr)[0]

# Prepare input and target sequences
input_sequences = []
output_sequences = []
sequence_length = 100
for i in range(len(sequences) - sequence_length):
    input_sequences.append(sequences[i:i + sequence_length])
    output_sequences.append(sequences[i + sequence_length])

input_sequences = np.array(input_sequences)
output_sequences = np.array(output_sequences)

vocab_size = len(tokenizer.word_index) + 1

# Define the model architecture:
model = Sequential([
    # Embedding layer that maps each character in the input sequence to a dense vector
    Embedding(vocab_size, 32, input_length=sequence_length),
    # First LSTM layer with 128 units, returning a sequence of outputs for each time step
    LSTM(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    # Second LSTM layer with 128 units, returning only the final output for the whole sequence
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    # Dense layer with a softmax activation, outputting a probability distribution over the vocabulary
    Dense(vocab_size, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

# Train the model
epochs = 250  # Increase the number of epochs to give the model more time to learn
batch_size = 32
model.fit(input_sequences, output_sequences, epochs=epochs, batch_size=batch_size)

model.save('custom_llm_model.keras')
This Python code trains a custom character-level language model with TensorFlow. It tokenizes the text data at the character level, slices it into fixed-length input and target sequences, and defines a Sequential model with an Embedding layer, two LSTM layers, and a softmax output over the vocabulary. After compiling with sparse categorical cross-entropy, the model is trained on the sequences to learn patterns in the text and then saved for later use, enabling text generation based on the learned patterns.
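The training script above stops after saving the model; below is a minimal generation sketch showing how the saved character-level model could be used. It assumes the tokenizer, model, sequence_length, and pad_sequences defined in the training script are still in scope; seed_text and num_chars_to_generate are illustrative values, and simple greedy (argmax) sampling is used.

import numpy as np

seed_text = "real estate"        # illustrative seed prompt
generated = seed_text
num_chars_to_generate = 200      # illustrative sample length

for _ in range(num_chars_to_generate):
    # Encode the text generated so far and keep only the last `sequence_length` characters
    encoded = tokenizer.texts_to_sequences([generated])[0][-sequence_length:]
    # Pad on the left so the input always has shape (1, sequence_length)
    padded = pad_sequences([encoded], maxlen=sequence_length)
    # Predict a probability distribution over the vocabulary and pick the most likely character
    probs = model.predict(padded, verbose=0)[0]
    next_char = tokenizer.index_word.get(int(np.argmax(probs)), "")
    generated += next_char

print(generated)

Greedy sampling tends to produce repetitive text; sampling from the predicted distribution (for example with a temperature) usually gives more varied output.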
2. Predictive Text Generation with Custom Language Models using TensorFlow
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Load your custom language model
custom_model = tf.keras.models.load_model('custom_llm_model.h5')

# Generate text samples
input_text = "What is real estate in dubai?"
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts([input_text])
input_ids = tokenizer.texts_to_sequences([input_text])

# Ensure the input sequence has length 50 by padding/truncating
input_ids_padded = pad_sequences(input_ids, maxlen=50, padding='post', truncating='post')

# Print shapes for debugging
print("Input shape before prediction:", input_ids_padded.shape)

# Reshape input for the LSTM layer (cast to float32 so the LSTM can consume it directly)
input_ids_3d = tf.expand_dims(tf.cast(input_ids_padded, tf.float32), axis=-1)

# Define a function for predicting the next token using the LSTM layer
@tf.function
def lstm_predict_step(input_data):
    return custom_model.get_layer('lstm')(input_data)

# Apply the function to get the LSTM output
lstm_output = lstm_predict_step(input_ids_3d)

# Continue with the rest of the prediction
output_ids = custom_model.get_layer('dense')(lstm_output)

# Print shapes for debugging
print("Output shape after prediction:", output_ids.shape)
This Python code shows how to use a custom language model trained with TensorFlow for prediction. The script loads the saved model, tokenizes an input prompt, pads the sequence to a fixed length of 50, reshapes it for LSTM compatibility, and feeds it through the model's LSTM and dense layers to predict the next token. It can serve as a starting point for integrating and testing custom language models within TensorFlow applications.
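The script above only prints the shape of the dense layer's output; a small follow-up sketch of turning that output into an actual token is shown below. It assumes output_ids and tokenizer from the script above, assumes the output has shape (1, vocab_size), and uses greedy (argmax) decoding. Note that because the tokenizer in this script is refit on the prompt alone, its id-to-word mapping will generally not match the vocabulary the model was trained on; in practice the training tokenizer should be reused.

import numpy as np

# Take the most likely token id from the dense layer's probability output (assumed shape (1, vocab_size))
predicted_id = int(np.argmax(output_ids, axis=-1)[0])

# Map the id back to a word; id 0 is reserved for padding and has no entry
predicted_token = tokenizer.index_word.get(predicted_id, "<OOV>")
print("Predicted next token:", predicted_token)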
3. Text Generation Model Training from PDFs using TensorFlow and PyPDF2
from PyPDF2 import PdfReader
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
import numpy as np

# Function to extract text from a PDF using PyPDF2
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

# Example PDF file path
pdf_path = 'assets/book.pdf'

# Extract text from the PDF
pdf_text = extract_text_from_pdf(pdf_path)

# Tokenize the text
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts([pdf_text])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences([pdf_text])

# Build incremental input/target pairs: each prefix predicts the next word
input_sequences = []
output_sequences = []
for sequence in sequences:
    for i in range(1, len(sequence)):
        input_sequences.append(sequence[:i])
        output_sequences.append(sequence[i])

# Pad sequences
max_seq_length = 50  # Adjust as needed
padded_sequences = pad_sequences(input_sequences, maxlen=max_seq_length)

# Convert to NumPy arrays
X = np.array(padded_sequences)
y = np.array(output_sequences)

# Define hyperparameters
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 128
lstm_units = 256
output_units = vocab_size

# Build the model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length),
    LSTM(units=lstm_units),
    Dense(units=output_units, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# Display model summary
model.summary()

model.save('custom_llm_model.h5')
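This script extracts the raw text of a PDF with PyPDF2, tokenizes it at the word level, builds incremental input/target pairs, and trains an Embedding + LSTM model that is saved as custom_llm_model.h5. A minimal sketch of loading that file and predicting the next word for a seed phrase follows; it assumes the tokenizer, max_seq_length, and pad_sequences from the script above are still in scope, and seed_text is an illustrative value.

import numpy as np
import tensorflow as tf

# Load the model saved by the training script
loaded_model = tf.keras.models.load_model('custom_llm_model.h5')

seed_text = "the market"         # illustrative seed phrase
encoded = tokenizer.texts_to_sequences([seed_text])[0]
padded = pad_sequences([encoded], maxlen=max_seq_length)

# Probability distribution over the vocabulary for the next word
probs = loaded_model.predict(padded, verbose=0)[0]
next_word = tokenizer.index_word.get(int(np.argmax(probs)), "<OOV>")
print("Predicted next word:", next_word)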