Usage
1. Custom Language Model Training with TensorFlow
import numpy as np
import tensorflow as tf

Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
Sequential = tf.keras.models.Sequential
Embedding = tf.keras.layers.Embedding
SimpleRNN = tf.keras.layers.SimpleRNN
Dense = tf.keras.layers.Dense
LSTM = tf.keras.layers.LSTM
Dropout = tf.keras.layers.Dropout

# Load your text data
# Here the text is loaded from a relative file which contains the array of data (data.py)
from data import text_data_arr

# Tokenize the text at the character level
tokenizer = Tokenizer(char_level=True, lower=True)
tokenizer.fit_on_texts(text_data_arr)

# Convert text to sequences of token ids
sequences = tokenizer.texts_to_sequences(text_data_arr)[0]

# Prepare input and target sequences
input_sequences = []
output_sequences = []
sequence_length = 100
for i in range(len(sequences) - sequence_length):
    input_sequences.append(sequences[i:i + sequence_length])
    output_sequences.append(sequences[i + sequence_length])

input_sequences = np.array(input_sequences)
output_sequences = np.array(output_sequences)

vocab_size = len(tokenizer.word_index) + 1

# Define the model architecture:
model = Sequential([
    # Embedding layer that maps each character in the input sequence to a dense vector
    Embedding(vocab_size, 32, input_length=sequence_length),
    # First LSTM layer with 128 units, returning a sequence of outputs for each time step
    LSTM(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    # Second LSTM layer with 128 units, returning only the final output for the whole sequence
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    # Dense layer with a softmax activation, outputting a probability distribution over the vocabulary
    Dense(vocab_size, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

# Train the model
epochs = 250  # Increase the number of epochs to give the model more time to learn
batch_size = 32
model.fit(input_sequences, output_sequences, epochs=epochs, batch_size=batch_size)

model.save('custom_llm_model.keras')
This Python code trains a custom character-level language model with TensorFlow. It tokenizes the text data at the character level, slices it into fixed-length input and target sequences, and defines a Sequential model with an Embedding layer, two LSTM layers, and a softmax output over the vocabulary. After compiling with sparse categorical cross-entropy, the model is trained on the sequences to learn patterns in the text and then saved for later use, enabling text generation based on the learned patterns.
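The training script above stops after saving the model; below is a minimal generation sketch showing how the saved character-level model could be used. It assumes the tokenizer, model, sequence_length, and pad_sequences defined in the training script are still in scope; seed_text and num_chars_to_generate are illustrative values, and simple greedy (argmax) sampling is used.

import numpy as np

seed_text = "real estate"        # illustrative seed prompt
generated = seed_text
num_chars_to_generate = 200      # illustrative sample length

for _ in range(num_chars_to_generate):
    # Encode the text generated so far and keep only the last `sequence_length` characters
    encoded = tokenizer.texts_to_sequences([generated])[0][-sequence_length:]
    # Pad on the left so the input always has shape (1, sequence_length)
    padded = pad_sequences([encoded], maxlen=sequence_length)
    # Predict a probability distribution over the vocabulary and pick the most likely character
    probs = model.predict(padded, verbose=0)[0]
    next_char = tokenizer.index_word.get(int(np.argmax(probs)), "")
    generated += next_char

print(generated)

Greedy sampling tends to produce repetitive text; sampling from the predicted distribution (for example with a temperature) usually gives more varied output.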
2. Predictive Text Generation with Custom Language Models using TensorFlow
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Load your custom language model
custom_model = tf.keras.models.load_model('custom_llm_model.h5')

# Generate text samples
input_text = "What is real estate in dubai?"
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts([input_text])
input_ids = tokenizer.texts_to_sequences([input_text])

# Ensure the input sequence has length 50 by padding/truncating
input_ids_padded = pad_sequences(input_ids, maxlen=50, padding='post', truncating='post')

# Print shapes for debugging
print("Input shape before prediction:", input_ids_padded.shape)

# Reshape input for the LSTM layer (cast to float32 so the LSTM can consume it directly)
input_ids_3d = tf.expand_dims(tf.cast(input_ids_padded, tf.float32), axis=-1)

# Define a function for predicting the next token using the LSTM layer
@tf.function
def lstm_predict_step(input_data):
    return custom_model.get_layer('lstm')(input_data)

# Apply the function to get the LSTM output
lstm_output = lstm_predict_step(input_ids_3d)

# Continue with the rest of the prediction
output_ids = custom_model.get_layer('dense')(lstm_output)

# Print shapes for debugging
print("Output shape after prediction:", output_ids.shape)
This Python code shows how to use a custom language model trained with TensorFlow for prediction. The script loads the saved model, tokenizes an input prompt, pads the sequence to a fixed length of 50, reshapes it for LSTM compatibility, and feeds it through the model's LSTM and dense layers to predict the next token. It can serve as a starting point for integrating and testing custom language models within TensorFlow applications.
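The script above only prints the shape of the dense layer's output; a small follow-up sketch of turning that output into an actual token is shown below. It assumes output_ids and tokenizer from the script above, assumes the output has shape (1, vocab_size), and uses greedy (argmax) decoding. Note that because the tokenizer in this script is refit on the prompt alone, its id-to-word mapping will generally not match the vocabulary the model was trained on; in practice the training tokenizer should be reused.

import numpy as np

# Take the most likely token id from the dense layer's probability output (assumed shape (1, vocab_size))
predicted_id = int(np.argmax(output_ids, axis=-1)[0])

# Map the id back to a word; id 0 is reserved for padding and has no entry
predicted_token = tokenizer.index_word.get(predicted_id, "<OOV>")
print("Predicted next token:", predicted_token)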
3. Text Generation Model Training from PDFs using TensorFlow and PyPDF2
from PyPDF2 import PdfReader
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
import numpy as np

# Function to extract text from a PDF using PyPDF2
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

# Example PDF file path
pdf_path = 'assets/book.pdf'

# Extract text from the PDF
pdf_text = extract_text_from_pdf(pdf_path)

# Tokenize the text
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts([pdf_text])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences([pdf_text])

# Build incremental input/target pairs: each prefix predicts the next word
input_sequences = []
output_sequences = []
for sequence in sequences:
    for i in range(1, len(sequence)):
        input_sequences.append(sequence[:i])
        output_sequences.append(sequence[i])

# Pad sequences
max_seq_length = 50  # Adjust as needed
padded_sequences = pad_sequences(input_sequences, maxlen=max_seq_length)

# Convert to NumPy arrays
X = np.array(padded_sequences)
y = np.array(output_sequences)

# Define hyperparameters
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 128
lstm_units = 256
output_units = vocab_size

# Build the model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length),
    LSTM(units=lstm_units),
    Dense(units=output_units, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# Display model summary
model.summary()

model.save('custom_llm_model.h5')
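This script extracts the raw text of a PDF with PyPDF2, tokenizes it at the word level, builds incremental input/target pairs, and trains an Embedding + LSTM model that is saved as custom_llm_model.h5. A minimal sketch of loading that file and predicting the next word for a seed phrase follows; it assumes the tokenizer, max_seq_length, and pad_sequences from the script above are still in scope, and seed_text is an illustrative value.

import numpy as np
import tensorflow as tf

# Load the model saved by the training script
loaded_model = tf.keras.models.load_model('custom_llm_model.h5')

seed_text = "the market"         # illustrative seed phrase
encoded = tokenizer.texts_to_sequences([seed_text])[0]
padded = pad_sequences([encoded], maxlen=max_seq_length)

# Probability distribution over the vocabulary for the next word
probs = loaded_model.predict(padded, verbose=0)[0]
next_word = tokenizer.index_word.get(int(np.argmax(probs)), "<OOV>")
print("Predicted next word:", next_word)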