DS Wannabe之5-AM Project: DS 30day int prep day14

Q1. What is Autoencoder? 自编码器是什么?

自编码器是一种特殊类型的神经网络,它通过无监督学习尝试复现其输入数据。它通常包含两部分:编码器和解码器。编码器压缩输入数据成为一个低维度的中间表示,解码器则从这个中间表示重建输出,输出尽可能接近原始输入。自编码器常用于特征学习、降维和去噪。

Autoencoderneural network: It is an unsupervised Machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. It is trained to attempt to copy its input to its output. Internally, it has the hidden layer that describes a code used to represent the input.

It is trying to learn the approximation to the identity function, to output x̂ x^ that is similar to the xx. Autoencoders belongs to the neural network family, but they are also closely related to PCA

(principal components analysis).

Auto encoders, although it is quite similar to PCA, but its Autoencoders are much more flexible than PCA. Autoencoders can represent both liners and non-linear transformation in encoding, but PCA can perform linear transformation. Autoencoders can be layered to form deep learning network due to its Network representation.

Autoencoder use cases

  • Dimensionality reduction: Smaller dimensional space representation of our inputs.
  • De-noising data: If trained with clean data, irrelevant noise will be ltered out during reconstruction.
  • Anomaly detection: A poor reconstruction will result when the model is fed with unseen inputs.

Types of Autoencoders:

1. Denoising autoencoder

Autoencoders are Neural Networks which are used for feature selection and extraction. However, when there are more nodes in hidden layer than there are inputs, the Network is risking to learn so-called “Identity Function”, also called “Null Function”, meaning that output equals the input, marking the Autoencoder useless.

2. Sparse autoencoder

An autoencoder takes the input image or vector and learns code dictionary that changes the raw input from one representation to another. Where in sparse autoencoders with a sparsity enforcer that directs a single-layer network to learn code dictionary which in turn minimizes the error in reproducing the input while restricting number of code words for reconstruction. The sparse autoencoder consists a single hidden layer, which is connected to the input vector by a weight matrix forming the encoding step. The hidden layer then outputs to a reconstruction vector, using a tied weight matrix to form the decoder.

Q2. What Is Text Similarity? 文本相似性是什么?

文本相似性是指评估两段文本在内容、语义或结构上的相似度。这可以通过各种算法实现,如余弦相似度、杰卡德相似度或基于词嵌入的方法。文本相似性在信息检索、文档分类和自然语言处理中有广泛的应用。

When talking about text similarity, different people have a slightly different notion on what text similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity, and the latter is referred to as lexical similarity. Although the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved.

Lexical or Word Level Similarity

When referring to text similarity, people refer to how similar the two pieces of text are at the surface level. Example- how similar are the phrases “the cat ate the mouse” with “the mouse ate the cat food” by just looking at the words? On the surface, if you consider only word-level similarity, these two phrases (with determiners disregarded) appear very similar as 3 of the 4 unique words are an exact overlap.

Semantic Similarity

Another notion of similarity mostly explored by NLP research community is how similar in meaning are any two phrases? If we look at the phrases, “ the cat ate the mouse ” and “ the mouse ate the cat food”. As we know that while the words significantly overlaps, these two phrases have different meaning. Meaning out of the phrases is often the more difficult task as it requires deeper level of analysis.Example, we can actually look at the simple aspects like order of words: “cat==>ate==>mouse” and “mouse==>ate==>cat food”. Words overlap in this case, the order of the occurrence is different, and we can tell that, these two phrases have different meaning. This is just the one example. Most people use the syntactic parsing to help with the semantic similarity. Let’s have a look at the parse trees for these two phrases. What can you get from it?

Q3. What is dropout in neural networks? 神经网络中的Dropout是什么?

Dropout是一种正则化技术,用于防止神经网络的过拟合。在训练过程中,Dropout会随机地丢弃(即,暂时移除)网络中的一些神经元及其连接,这迫使网络以更健壮的方式学习特征,因为它不能依赖于任何单一的输入特征。

When we training our neural network (or model) by updating each of its weights, it might become too dependent on the dataset we are using. Therefore, when this model has to make a prediction or classification, it will not give satisfactory results. This is known as over-fitting. We might understand this problem through a real-world example: If a student of science learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.

To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012. This technique is known as dropout.

Dropout refers to ignoring units (i.e., neurons) during the training phase of certain set of neurons, which is chosen at random. By “ignoring”, I mean these units are not considered during a particular forward or backward pass.

At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Q4. What is Forward Propagation? 什么是前向传播?

前向传播是神经网络中的一个过程,其中输入数据在网络的各层之间传递,从输入层开始,经过隐藏层,最终到输出层产生预测。在这个过程中,每一层的输出将成为下一层的输入,直到最终产生输出。Input X provides the information that then propagates to hidden units at each layer and then finally produce the output y. The architecture of network entails determining its depth, width, and the activation functions used on each layer. Depth is the number of the hidden layers. Width is the number of units (nodes) on each hidden layer since we don’t control neither input layer nor output layer dimensions. There are quite a few set of activation functions such Rectified Linear Unit, Sigmoid, Hyperbolic tangent, etc. Research has proven that deeper networks outperform networks with more hidden units. Therefore, it’s always better and won’t hurt to train a deeper network.

Q5. What is Text Mining? 什么是文本挖掘?

文本挖掘是指从文本数据中提取有价值信息的过程。它涉及到信息检索、词性标注、情感分析、主题识别等多种技术。文本挖掘使我们能够从大规模的文本数据集中发现模式、趋势和关联,常用于社交媒体分析、市场情报、客户服务等领域。

Text mining: It is also referred as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Q6. What is Information Extraction?

信息提取是自然语言处理的一个分支,它的目标是从非结构化文本数据中自动提取结构化信息。这包括提取实体(如人名、地点、组织名)、关系(如员工-公司关系)、事件(如交易或投资)和其他特定于领域的信息。

Information extraction (IE): It is the task of automatically extracting structured information from the unstructured and/or semi-structured machine-readable documents. In most of the cases, this activity concerns processing human language texts using natural language processing (NLP).

Information extraction depends on named entity recognition (NER), a sub-tool used to find targeted information to extract. NER recognizes entities first as one of several categories, such as location (LOC), persons (PER), or organizations (ORG). Once the information category is recognized, an information extraction utility extracts the named entity’s related information and constructs a machine-readable document from it, which algorithms can further process to extract meaning. IE finds meaning by way of other subtasks, including co-reference resolution, relationship extraction, language, and vocabulary analysis, and sometimes audio extraction.

Q7. What is Text Generation?

文本生成是指使用自然语言处理技术自动生成人类可读的文本。这可以包括基于规则的方法或使用机器学习模型,如神经网络,来生成新颖的文本内容,如新闻文章、故事、代码或诗歌。

Text Generation: It is a type of the Language Modelling problem. Language Modelling is the core problem for several of natural language processing tasks such as speech to text, conversational system, and the text summarization. The trained language model learns the likelihood of occurrence of the word based on the previous sequence of words used in the text. Language models can be operated at the character level, n-gram level, sentence level or even paragraph level.

A language model is at the core of many NLP tasks, and is simply a probability distribution over a sequence of words:

It can also be used to estimate the conditional probability of the next word in a sequence:

Q8. What is Text Summarization?

文本摘要是将长文本简化为较短版本的过程,同时保留关键信息和文本的主要意图。文本摘要可以是抽取式的,即直接从原文中选取片段;也可以是归纳式的,即重新表述原文以产生更简洁的版本

We all interact with the applications which uses the text summarization. Many of the applications are for the platform which publishes articles on the daily news, entertainment, sports. With our busy schedule, we like to read the summary of those articles before we decide to jump in for reading entire article. Reading a summary helps us to identify the interest area, gives a brief context of the story.

Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of texts. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques.

Text summarization: It refers to the technique of shortening long pieces of text. The intention is to create the coherent and fluent summary having only the main points outlined in the document.

How text summarization works:

The two types of summarization, abstractive and the extractive summarization.

1. AbstractiveSummarization:Itselectwordsbasedonthesemanticunderstanding;eventhose words did not appear in the source documents. It aims at producing important material in the new way. They interprets and examines the text using advanced natural language techniques to generate the new shorter text that conveys the most critical information from the original text.

It can be correlated in the way human reads the text article or blog post and then summarizes in their word.

2. Extractive Summarization: It attempt to summarize articles by selecting the subset of words that retain the most important points.

This approach weights the most important part of sentences and uses the same to form the summary. Different algorithm and the techniques are used to define the weights for the sentences and further rank them based on importance and similarity among each other.

Q9. What is Topic Modelling?

主题建模是一种自然语言处理技术,用于发现大量文档集中隐藏的主题结构。常用方法包括潜在语义分析(LSA)和潜在狄利克雷分配(LDA)。这些方法可以帮助组织和理解大型文本集合中的主题和概念。

Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as a set of words) that occur in a collection of documents.
Topic modeling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts.

Dimensionality Reduction:

Topic modeling is the form of dimensionality reduction. Rather than representing the text T in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic space as ( Topic_i: weight(Topic_i, T) for Topic_i in Topics ).
Unsupervised learning:

Topic modeling can be compared to the clustering. As in the case of clustering, the number of topics, like the number of clusters, is the hyperparameter. By doing the topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a certain weight.

A Form of Tagging

If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.

Q10.What is Hidden Markov Models?

隐马尔可夫模型(HMM)是一种统计模型,用于描述具有隐藏状态的马尔可夫过程。在自然语言处理中,HMM常用于词性标注、语音识别和其他任务,其中需要推断出最可能的隐藏状态序列(如句子中单词的词性)。

Why Hidden, Markov Model?

The reason it is called the Hidden Markov Model is because we are constructing an inference model based on the assumptions of a Markov process. The Markov process assumption is simply that the “future is independent of the past given the present”.

To make this point clear, let us consider the scenario below where the weather, the hidden variable, can be hot, mild or cold, and the observed variables are the type of clothing worn. The arrows represent transitions from a hidden state to another hidden state or from a hidden state to an observed variable.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/467658.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

代码随想录刷题笔记 DAY 24 | 回溯算法理论基础 | 组合问题 No. 77

文章目录 Day 2401. 回溯算法理论基础1.1 什么是回溯法?1.2 为什么要使用回溯法?1.3 如何理解回溯法? 02. 组合问题(No. 77)2.1 题目2.2 笔记2.3 代码 Day 24 01. 回溯算法理论基础 1.1 什么是回溯法? &…

【小沐学GIS】基于C++QT绘制三维数字地球Earth(OpenGL)

🍺三维数字地球系列相关文章如下🍺:1【小沐学GIS】基于C绘制三维数字地球Earth(456:OpenGL、glfw、glut)第一期2【小沐学GIS】基于C绘制三维数字地球Earth(456:OpenGL、glfw、glut)第二期3【小沐…

解决MAC连上wifi或热点却不能上网问题

解决MAC连上wifi或热点却不能上网问题 #新换的mac昨天还能连上wifi,今天就不好使了。 找到连接的wifi点击详细信息,选择TCP/IP 中的配置IPV4 选择关闭

从零开始:Windows下的Miniconda安装教程

🚀从零开始:Windows下的Miniconda安装教程🚀 🌳一、引言🌳 【🚀全新启程🚀】✨踏入Python数据科学的大门,🔥Miniconda🔥将成为您不可或缺的得力助手&#xf…

Spring Boot 笔记 007 创建接口_登录

1.1 登录接口需求 1.2 JWT令牌 1.2.1 JWT原理 1.2.2 引入JWT坐标 1.2.3 单元测试 1.2.3.1 引入springboot单元测试坐标 1.2.3.2 在单元测试文件夹中创建测试类 1.2.3.3 运行测试类中的生成和解析方法 package com.geji;import com.auth0.jwt.JWT; import com.auth0.jwt.JWTV…

Spring Boot 笔记 009 创建接口_更新用户基本信息

1.1.1 给User实体类添加校验 package com.geji.pojo;import com.fasterxml.jackson.annotation.JsonIgnore; import jakarta.validation.constraints.Email; import jakarta.validation.constraints.NotEmpty; import jakarta.validation.constraints.NotNull; import jakarta…

python-自动化篇-办公-批量新建文件夹并保存日志信息

文章目录 说明代码效果 说明 因为业务需要,每天都需要按当天的日期创建很多新文件夹。把这种重复又繁重的操作交给Python来做,一直是我的目标。先说下要求: 默认在桌面新建文件夹。文件夹命名方式,“月.日-1”,比如7…

全国计算机等级考试二级,Web考试大纲(2023年版)

基本要求 1.理解Web基本工作原理,掌握Web技术的基本概念和基础。 2.理解超文本传输协议 HTTP的基本概念和模型,掌握 HTTP的消息格式、常用消息头、请求消息和常用请求方法、响应消息和常用响应状态。 3.熟练掌握超文本标记语言 HTMl文档的结构、常用文档…

JavaWeb:SpingBoot原理 --黑马笔记

1. 配置优先级 在我们前面的课程当中,我们已经讲解了SpringBoot项目当中支持的三类配置文件: application.properties application.yml application.yaml 在SpringBoot项目当中,我们要想配置一个属性,可以通过这三种方式当中…

C++ 贪心 Huffman树 合并果子

在一个果园里,达达已经将所有的果子打了下来,而且按果子的不同种类分成了不同的堆。 达达决定把所有的果子合成一堆。 每一次合并,达达可以把两堆果子合并到一起,消耗的体力等于两堆果子的重量之和。 可以看出,所有…

spring boot整合cache使用Ehcache 进行数据缓存

之前的文章 spring boot整合 cache 以redis服务 处理数据缓存 便捷开发 带着大家通过spring boot整合了 cache 缓存 那么 我们就来说说 其他服务的缓存 而spring boot默认的缓存方案就是 cache 用simple模式 spring boot的强大在于它的整合能力 它将其他缓存技术整合 统一了接…

【北邮鲁鹏老师计算机视觉课程笔记】08 texture 纹理表示

【北邮鲁鹏老师计算机视觉课程笔记】08 texture 纹理表示 1 纹理 规则和不规则的 2 纹理的用处 从纹理中恢复形状 3 分割与合成 4 分析纹理进行分类 通过识别纹理分析物理性质 如何区分纹理 5 寻找有效的纹理分类方法 发现模式、描述区域内模式 A对应图2 B对应图…