NLP Movie Review Sentiment Analysis Project

https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/

This tutorial is divided into 5 parts; they are:

  1. Movie Review Dataset
  2. Data Preparation
  3. Training an Embedding Layer
  4. Training a word2vec Embedding
  5. Using a Pre-trained Embedding

Data Preparation

1. Split the data into training and test sets.
2. Load and clean the data to remove punctuation and numbers.
3. Define a vocabulary of preferred words.

The split is done by filename: reviews whose filenames start with 'cv9' (the last 100 per class) are held out as the test set, and the remaining 900 per class are used for training. Cleaning a document involves:

Splitting tokens on white space.
Removing all punctuation from words.
Removing all words that are not purely comprised of alphabetical characters.
Removing all words that are known stop words.
Removing all words that have a length <= 1 character.

We can filter out punctuation from tokens using the string translate() function.
We can remove tokens that are just punctuation or that contain numbers by using an isalpha() check on each token.
We can remove English stop words using the list loaded with NLTK.
We can filter out short tokens by checking their length.
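
To make these steps concrete, here is a minimal, self-contained sketch that applies the same operations to a single made-up sentence, before the full vocabulary-building script:

from string import punctuation
from nltk.corpus import stopwords   # requires a one-off nltk.download('stopwords')

sample = "The movie wasn't great -- 2 hours of a dull, dull plot!"   # made-up example sentence
tokens = sample.split()                                   # split on white space
table = str.maketrans('', '', punctuation)
tokens = [w.translate(table) for w in tokens]             # "wasn't" -> "wasnt", "--" -> ""
tokens = [w for w in tokens if w.isalpha()]               # drops "2" and the empty string
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]       # drops stop words such as "of" and "a"
tokens = [w for w in tokens if len(w) > 1]                # drops any single-character tokens
print(tokens)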

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

Counter() is used here both to count word occurrences and to de-duplicate the vocabulary.
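
A Counter is a dictionary mapping each token to its count, so updating it with every document's tokens accumulates frequencies and de-duplicates at the same time. A minimal sketch of the two calls used above, with made-up tokens:

from collections import Counter

vocab = Counter()
vocab.update(['great', 'film', 'great', 'plot'])   # counts from a first "document"
vocab.update(['dull', 'plot'])                     # counts from a second one
print(len(vocab))              # 4 distinct tokens
print(vocab.most_common(2))    # the two most frequent tokens, e.g. [('great', 2), ('plot', 2)]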

Saving the vocabulary

# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()

# keep tokens with a minimum occurrence count (a cut-off of 2, as in the linked tutorial)
min_occurrence = 2
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
print(len(tokens))

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

Training an Embedding Layer

https://machinelearningmastery.com/what-are-word-embeddings/
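
The script below relies on two Keras utilities: Tokenizer, which maps each word to an integer and turns documents into integer sequences, and pad_sequences, which pads every sequence to the same length so the reviews can be fed to the network as one matrix. A minimal sketch of the two on made-up toy documents:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

docs = ['great film great plot', 'dull plot']       # toy, already-cleaned documents
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)                        # builds the word -> integer mapping
encoded = tokenizer.texts_to_sequences(docs)        # e.g. [[1, 3, 1, 2], [4, 2]]
padded = pad_sequences(encoded, maxlen=4, padding='post')   # zero-pad on the right
print(tokenizer.word_index)
print(padded)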

from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
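
Once the network is fit, it can score an unseen review. A minimal sketch, using a made-up review and reusing clean_doc, vocab, tokenizer, max_length and model from the script above:

# sketch: classify a new review with the fitted model (variables reused from the script above)
review = 'a moving film with a great cast and a clever plot'      # made-up review text
line = clean_doc(review, vocab)                                    # keep only vocabulary words
encoded = tokenizer.texts_to_sequences([line])
padded = pad_sequences(encoded, maxlen=max_length, padding='post')
yhat = model.predict(padded, verbose=0)
print('Positive sentiment probability: %.3f' % yhat[0, 0])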

Training a word2vec Embedding


The word2vec algorithm processes documents sentence by sentence. This means we preserve the sentence-based structure of each review during cleaning.
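
Concretely, gensim's Word2Vec expects a list of sentences where each sentence is itself a list of tokens, so every line of a review contributes one token list. A toy illustration of the expected shape (the tokens are made up); doc_to_clean_lines below produces exactly this structure:

# toy illustration of the input structure Word2Vec expects:
# one inner list of tokens per sentence/line of a review
sentences = [
    ['great', 'film', 'great', 'cast'],
    ['plot', 'dull', 'ending', 'predictable'],
]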

from string import punctuation
from os import listdir
from gensim.models import Word2Vec

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean lines of tokens
def doc_to_clean_lines(doc, vocab):
    clean_lines = list()
    lines = doc.splitlines()
    for line in lines:
        # split into tokens by white space
        tokens = line.split()
        # remove punctuation from each token
        table = str.maketrans('', '', punctuation)
        tokens = [w.translate(table) for w in tokens]
        # filter out tokens not in vocab
        tokens = [w for w in tokens if w in vocab]
        clean_lines.append(tokens)
    return clean_lines

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        doc = load_doc(path)
        doc_lines = doc_to_clean_lines(doc, vocab)
        # add lines to list
        lines += doc_lines
    return lines

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load training data
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs
print('Total training sentences: %d' % len(sentences))

# train word2vec model
# note: the size= argument and model.wv.vocab follow the gensim 3.x API
# (gensim 4.x renamed size to vector_size and replaced wv.vocab with wv.key_to_index)
model = Word2Vec(sentences, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))

# save model in ASCII (word2vec) format
filename = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(filename, binary=False)
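
After training, the learned vectors can be inspected directly from the model's wv attribute before being written to disk; a quick sanity check such as the one below (the query word 'great' is only an example and assumes it survived the vocabulary filtering) helps show whether the embedding has picked up useful structure. The next script then loads the saved embedding_word2vec.txt file into a Keras Embedding layer.

# sketch: inspect the trained vectors (assumes 'great' is in the model's vocabulary)
print(model.wv['great'].shape)                 # a single 100-dimensional vector
print(model.wv.most_similar('great', topn=5))  # the 5 nearest words by cosine similarity
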
from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skipping the first (header) line of the word2vec text format
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # skip words without a vector (their row stays all zeros)
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Using a Pre-trained Embedding
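
This final model is the same as the previous one, except that the Embedding layer is seeded with pre-trained 100-dimensional GloVe vectors from glove.6B.100d.txt (part of the publicly available glove.6B archive) instead of the word2vec vectors trained above. Unlike the word2vec text format, the GloVe file has no header line, which is why load_embedding below reads every line. A minimal sketch for inspecting the file format, assuming the file has already been downloaded into the working directory:

# sketch: peek at the GloVe file format (assumes glove.6B.100d.txt is in the working directory)
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    parts = f.readline().split()
print('word:', parts[0])
print('vector length:', len(parts) - 1)   # should be 100 for the 100d file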

from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory; the GloVe file has no header line, so read every line
    file = open(filename, 'r')
    lines = file.readlines()
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # skip words without a GloVe vector (their row stays all zeros)
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding('glove.6B.100d.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
