Author | Susan Li
Source | Medium
Editor | 代码医生团队
To build a content-based recommender system, I had collected descriptions of 152 hotels in Seattle, and I was looking for other ways to put this high-quality, clean dataset to work.
Why not train my own text-generating neural network on these hotel descriptions? That is, build a language model for generating natural-language text (hotel descriptions) by implementing and training a word-based recurrent neural network.
The goal of this project is to generate new hotel descriptions given some seed text. I don't expect the results to be accurate; as long as the predicted text is coherent, I'll be happy.
Data
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import pandas as pd
import numpy as np
import string, os
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
hotel_df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
all_descriptions = list(hotel_df.desc.values)
len(all_descriptions)
desc_preprocessing.py
There are 152 descriptions (i.e., hotels) in the dataset.
Let's take a look at the first description:
corpus = [x for x in all_descriptions]
corpus[:1]
Figure 1
Text Preprocessing
Tokenization
We use Keras' Tokenizer to vectorize the text descriptions: punctuation is stripped, the text is lowercased and split on spaces into word-level tokens (char_level=False), and every word is mapped to an integer index. After tokenization we can explore the dictionary of word counts, the per-word document counts, the total number of documents used to fit the Tokenizer, and the word index:
t = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)
t.fit_on_texts(corpus)
print(t.word_counts)
print(t.word_docs)
print(t.document_count)
print(t.word_index)
print('Found %s unique tokens.' % len(t.word_index))
Next we convert the corpus into sequences of tokens: each description is turned into its list of word indices, and every prefix of that list (of length two or more) is kept as one n-gram input sequence:
# Tokenization
t = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)
def get_sequence_of_tokens(corpus):
    t.fit_on_texts(corpus)
    total_words = len(t.word_index) + 1
    input_sequences = []
    for line in corpus:
        token_list = t.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words
input_sequences, total_words = get_sequence_of_tokens(corpus)
input_sequences[:10]
sequence.py
Figure 2
The integer lists above are the n-gram phrases generated from the corpus. For example, suppose the sentence "located on the southern tip of lake Union" is represented by word indices like this:
Table 1
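Since the table itself is not reproduced here, the same idea as a small sketch; the integer indices below are made up for illustration, and the real values come from t.word_index after fitting:
# Illustrative only: how one description line becomes n-gram training sequences.
sentence = "located on the southern tip of lake union"
token_list = t.texts_to_sequences([sentence])[0]
# token_list might look like [14, 7, 1, 251, 252, 5, 65, 66] (hypothetical indices)
ngrams = [token_list[:i + 1] for i in range(1, len(token_list))]
for seq in ngrams:
    print(seq)   # [14, 7], [14, 7, 1], ..., up to the full sentence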
Pad the sequences and create predictors and labels
Table 2
As you can see, if we were after accuracy we would be in trouble: at every step the model has to pick the single correct next word out of the entire vocabulary.
# pad sequences
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(input_sequences)
pad_sequence.py
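Since Table 2 is not reproduced here, a minimal sketch of the padding step, using made-up indices, shows how one pre-padded sequence is split into predictors and a label:
# Illustrative only: indices are hypothetical; names are prefixed ex_ to avoid
# clobbering the real predictors/label created above.
import numpy as np
from keras.preprocessing.sequence import pad_sequences

example = [[14, 7, 1, 251]]                       # one hypothetical n-gram sequence
padded = pad_sequences(example, maxlen=6, padding='pre')
print(padded)                                     # [[  0   0  14   7   1 251]]
ex_predictors, ex_label = padded[:, :-1], padded[:, -1]
print(ex_predictors)                              # [[ 0  0 14  7  1]]  -> network input
print(ex_label)                                   # [251] -> next word, one-hot encoded later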
Model
We can now define our single LSTM model: one hidden LSTM layer with 100 memory units, dropout with a rate of 0.1, and a Dense output layer that uses softmax activation to produce a probability for every word in the vocabulary.
def create_model(max_sequence_len, total_words):
    model = Sequential()
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=max_sequence_len - 1))
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = create_model(max_sequence_len, total_words)
model.summary()
model.fit(predictors, label, epochs=100, verbose=5)
text_generator.py
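EarlyStopping is imported at the top but never used; one possible way to wire it in, a sketch that monitors training loss since there is no validation split here, would be:
# Hypothetical callback settings; patience value is arbitrary.
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
model.fit(predictors, label, epochs=100, verbose=5, callbacks=[early_stop])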
Generating text with the trained LSTM network
def generate_text(seed_text, next_words, model, max_seq_len):
    for _ in range(next_words):
        token_list = t.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')
        predicted = model.predict_classes(token_list, verbose=0)
        output_word = ''
        for word, index in t.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text = seed_text + " " + output_word
    return seed_text.title()
generate_text.py
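One caveat: Sequential.predict_classes has been removed in newer releases of TensorFlow/Keras; if you run this against a recent version, the equivalent of that call is:
# Equivalent of model.predict_classes(token_list, verbose=0) on newer Keras/TF versions.
predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)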
print(generate_text("hilton seattle downtown", 100, model, max_sequence_len))
Figure 3
Then I chose "best western seattle airport hotel" as the seed text and had the model predict the next 200 words.
print(generate_text("best western seattle airport hotel", 200, model, max_sequence_len))
Figure 4
Finally, with "located in the heart of downtown seattle" as the seed text, the model generates 300 words:
print(generate_text('located in the heart of downtown seattle', 300, model, max_sequence_len))
Figure 5
Conclusion
Some ideas for improvement: more training data, more training epochs, more layers, more memory units per layer, and predicting a smaller number of words as output for a given seed.
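For instance, "more layers" could mean stacking LSTM layers; a rough sketch of such a variant of create_model (untested, and the layer sizes are arbitrary):
def create_deeper_model(max_sequence_len, total_words):
    # Hypothetical deeper variant: two stacked LSTM layers with more memory units.
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=max_sequence_len - 1))
    model.add(LSTM(150, return_sequences=True))   # pass the full sequence to the next LSTM
    model.add(Dropout(0.2))
    model.add(LSTM(100))
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model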
The Jupyter notebook can be found on GitHub:
https://github.com/susanli2016/NLP-with-Python/blob/master/Hotel%20Description%20Generation%20LSTM.ipynb
References:
https://keras.io/preprocessing/text/
https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275