Machine Learning: Natural Language Processing (Part 2)

Content

  1. Content
  2. How do you find documents similar to a query sentence/search?
  3. What is POS tagger?
    1. How to build a simple POS tagger? How do you account for new words?
  4. Learning word sense?
    1. How to build your own NER model?
  5. How would you find all the occurrences of quoted text in a news article?
    1. Simple Solution
    2. Maximum Entropy Model
  6. Build a system that auto corrects text
  7. How would you build a system to translate English text to Greek and vice-versa?
  8. How would you build a system that automatically groups news articles by subject?
  9. How would you design a model to predict whether a movie review was positive or negative?
  10. What is Lexicon and Ontology?
    1. Why would someone want to develop an ontology? Some of the reasons are:
  11. What is Syntactic analysis or parsing?
  12. Concept of Parser
  13. What is Semantic analysis?
  14. What is Taxonomy and Ontology?
  15. Explain TF-IDF
    1. Advantages:
    2. Disadvantages:
  16. What is word2vec? What is the cost function for the skip-gram model (k-negative sampling)?
  17. TODO: Where does Tf-Idf fail in document classification/clustering? How can you improve it further?
  18. What are word2vec vectors?
  19. How can I design a chatbot?
  20. Exercise
    1. Role-specific questions
    2. Related fields such as information theory, linguistics and information retrieval
    3. Tools and languages
  21. Question Source

How do you find documents similar to a query sentence/search?

  • The simplest approach is to compute tf-idf vectors for the documents and the query, and then measure cosine similarity (the dot product of the length-normalized vectors); see the sketch below.
  • On top of that, applying SVD/PCA/LSA to the tf-idf matrix should further improve results.
  • For more, look up LSI (Latent Semantic Indexing).
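A minimal sketch with scikit-learn; the toy corpus and query are illustrative:

```python
# Rank documents against a query by tf-idf cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply today"]
query = ["a cat on a mat"]

vec = TfidfVectorizer()
doc_vecs = vec.fit_transform(docs)   # one tf-idf vector per document
query_vec = vec.transform(query)     # project the query into the same space
scores = cosine_similarity(query_vec, doc_vecs)[0]
print(max(zip(scores, docs)))        # highest-scoring document
```

Applying scikit-learn's TruncatedSVD to `doc_vecs` before the similarity step gives the LSA/LSI variant.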


Content


What is POS tagger?

:sparkles: A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc.

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.

How to build a simple POS tagger? How do you account for new words?

Simple Idea:

  • First collect tagged sentences:

```python
import nltk
tagged_sentences = nltk.corpus.treebank.tagged_sents()
```
  • Preprocess the sentences to create pairs [(word_1, tag_1), ..., (word_n, tag_n)]. The words become your $X$ and the tags your $Y$.

  • Train a multiclass classification algorithm such as a Random Forest (or a sequence model such as a CRF) and build your model.

  • Given a test sentence, split it into words, feed them to the model, and get the corresponding tags. To account for new (unseen) words, use features that generalize beyond word identity, such as suffixes, capitalization, and word shape; see the sketch below.
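A minimal sketch under these assumptions: NLTK's treebank corpus has been downloaded (`nltk.download('treebank')`), and the feature set is illustrative:

```python
# Feature-based POS tagger; suffix/shape features let the model make a
# reasonable guess even for words never seen during training.
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def features(words, i):
    w = words[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:].lower(),            # generalizes to unseen words
        "is_capitalized": w[0].isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "prev_word": words[i - 1].lower() if i > 0 else "<S>",
    }

X, y = [], []
for sent in nltk.corpus.treebank.tagged_sents()[:2000]:  # subset for speed
    words = [w for w, _ in sent]
    for i, (_, tag) in enumerate(sent):
        X.append(features(words, i))
        y.append(tag)

model = Pipeline([("vec", DictVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)

test = "Bob drank coffee at Starbucks".split()
print(list(zip(test, model.predict([features(test, i) for i in range(len(test))]))))
```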


Content


Learning word sense?

Q. How would you train a model that identifies whether the word “Apple” in a sentence refers to the fruit or the company?

  • This is a classic example of Named Entity Recognition (NER). It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities. Essentially, it looks at the content and context of the word (looking back and forward a few words) to estimate the probability that the word is a named entity.


How to build your own NER model?

  • It’s a supervised learning problem, so first you need labelled data, i.e., (word, entity_tag) pairs such as (London, GEO) and (Apple Corp., ORG), and then you train a model.
  • As a novice model, apply a scikit-learn multiclass classification algorithm.
  • For a more mature model, use a conditional random field (e.g., via the sklearn-crfsuite package) to create a better model; a sketch follows.
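A minimal sketch with the sklearn-crfsuite package (`pip install sklearn-crfsuite`); the tiny training set and feature function are illustrative placeholders for a real annotated corpus:

```python
# CRF-based NER over (word, entity_tag) pairs.
import sklearn_crfsuite

train = [
    [("London", "GEO"), ("is", "O"), ("calling", "O")],
    [("Apple", "ORG"), ("hired", "O"), ("staff", "O"), ("in", "O"), ("London", "GEO")],
]

def word_features(words, i):
    w = words[i]
    return {
        "word.lower": w.lower(),
        "is_title": w.istitle(),
        "prev": words[i - 1].lower() if i > 0 else "<S>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "</S>",
    }

X = [[word_features([w for w, _ in s], i) for i in range(len(s))] for s in train]
y = [[tag for _, tag in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)

test = ["Apple", "opened", "offices", "in", "London"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))
```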


Content


How would you find all the occurrences of quoted text in a news article?

Simple Solution

You can use a regex to pick up everything between quotes:

```python
import re

quotes = re.findall(r'".*?"', text)  # non-greedy match between double quotes
```

The problem you’ll run into is that there can be a surprisingly large amount of things between quotation marks that are actually not quotations.

"(said|writes|argues|concludes)(,)? \".?\""

But quotes are a tricky business. Lots of things look like quotes that aren’t, and some things are more quote-like than others. The ideal approach would be able to account for some of that fuzziness in a way that pattern matching doesn’t.

Maximum Entropy Model

:sparkles: This model considers all of the probability distributions that are empirically consistent with the training data; and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if its estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.

  • Many problems in natural language processing can be viewed as linguistic classification problems, in which linguistic contexts are used to predict linguistic classes.
  • Maximum entropy models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context.

In the above problem, engineer features and apply a maximum entropy model to classify whether a paragraph contains a quotation (for example, does the paragraph contain an attribution word like “said”?). A sketch follows.
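A minimal sketch: a maximum entropy model is equivalent to (multinomial) logistic regression, so scikit-learn's LogisticRegression stands in for it here; the two training paragraphs, labels, and features are illustrative:

```python
# MaxEnt-style quote detection over hand-crafted contextual features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

ATTRIBUTION = {"said", "writes", "argues", "concludes"}

def para_features(paragraph):
    tokens = [t.strip(',.') for t in paragraph.lower().split()]
    return {
        "has_double_quote": '"' in paragraph,
        "has_attribution_verb": any(t in ATTRIBUTION for t in tokens),
        "n_quote_chars": paragraph.count('"'),
    }

paragraphs = ['He said, "We will win."', 'The so-called "budget" plan stalled.']
labels = [1, 0]  # 1 = contains a real quotation

model = Pipeline([("vec", DictVectorizer()), ("maxent", LogisticRegression())])
model.fit([para_features(p) for p in paragraphs], labels)
print(model.predict([para_features('She argues, "Taxes must fall."')]))
```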


Content


Build a system that auto corrects text

Q. How would you build a system that auto corrects text that has been generated by a speech recognition system?

A spellchecker points out spelling errors and possibly suggests alternatives. An autocorrector usually goes a step further and automatically picks the most likely word; if the typed word is already correct, it is left unchanged. So, in practice, an autocorrector is a bit more aggressive than a spellchecker, but this is more of an implementation detail: tools let you configure the behaviour.
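A minimal Norvig-style sketch of the "pick the most likely word" step; the word-frequency table is a toy stand-in for counts from a large corpus (for speech-recognition output, the same idea can rescore the recognizer's candidate words):

```python
# Generate edit-distance-1 candidates and keep the most frequent known word.
WORD_FREQ = {"coffee": 100, "copy": 40, "toffee": 5}  # illustrative counts

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    candidates = ({word} if word in WORD_FREQ else set()) \
        or {w for w in edits1(word) if w in WORD_FREQ} \
        or {word}  # fall back to the original word
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print(correct("coffe"))  # -> "coffee"
```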

Content


How would you build a system to translate English text to Greek and vice-versa?

Use a sequence-to-sequence (seq2seq) learning model with attention; a sketch of the backbone follows.
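A compact sketch of the encoder-decoder backbone in Keras (the attention layer is omitted for brevity); vocabulary sizes and dimensions are illustrative:

```python
# English-to-Greek backbone: encode the source sentence into a state,
# then decode target tokens conditioned on that state.
from tensorflow.keras import Model, layers

SRC_VOCAB, TGT_VOCAB, EMB, HID = 5000, 5000, 128, 256

enc_in = layers.Input(shape=(None,))            # English token ids
enc_emb = layers.Embedding(SRC_VOCAB, EMB)(enc_in)
_, h, c = layers.LSTM(HID, return_state=True)(enc_emb)

dec_in = layers.Input(shape=(None,))            # Greek token ids (shifted right)
dec_emb = layers.Embedding(TGT_VOCAB, EMB)(dec_in)
dec_out, _, _ = layers.LSTM(HID, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[h, c])
logits = layers.Dense(TGT_VOCAB, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

Translating in the other direction ("vice-versa") is the same architecture with source and target swapped.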

Content


How would you build a system that automatically groups news articles by subject?

  • Text Classification (when the subjects are known in advance)
  • Topic Modelling (when the subjects must be discovered); see the sketch below
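A quick topic-modelling sketch with scikit-learn's LDA; the three articles and the number of topics are illustrative:

```python
# Discover latent topics, then assign each article to its dominant topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "The team won the championship game last night",
    "Parliament passed the new budget bill",
    "The striker scored twice in the final match",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).argmax(axis=1))  # topic id per article
```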


Content


How would you design a model to predict whether a movie review was positive or negative?

Typically, sentiment analysis for text data can be computed on several levels, including on an individual sentence level, paragraph level, or the entire document as a whole. Often, sentiment is computed on the document as a whole or some aggregations are done after computing the sentiment for individual sentences. There are two major approaches to sentiment analysis.

  • Supervised machine learning or deep learning approaches
  • Unsupervised lexicon-based approaches

However, most of the time we don’t have labelled data, so let’s go for the second approach. Hence, we will need to use unsupervised techniques that predict sentiment by using knowledge bases, ontologies, databases, and lexicons that have detailed information, specially curated and prepared just for sentiment analysis.

Various popular lexicons are used for sentiment analysis, including the following.

  1. AFINN lexicon
  2. Bing Liu’s lexicon
  3. MPQA subjectivity lexicon
  4. SentiWordNet
  5. VADER lexicon
  6. TextBlob lexicon

Using these lexicons, convert words to their sentiment scores.

There is actually no machine learning going on here: the lexicon-based tool parses every tokenized word, compares it with its lexicon, and returns polarity scores, which roll up into an overall sentiment score for the document; see the sketch below.
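A minimal lexicon-based sketch using NLTK's VADER (run `nltk.download('vader_lexicon')` once first):

```python
# Score the review with VADER's lexicon and rules, then aggregate.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
review = "The plot was dull, but the acting was absolutely brilliant."
scores = sia.polarity_scores(review)  # neg/neu/pos plus a compound score
print("positive" if scores["compound"] >= 0 else "negative", scores)
```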

Content


What is Lexicon and Ontology?

  • A lexicon is a dictionary, vocabulary, or a book of words. In our case, lexicons are special dictionaries or vocabularies that have been created for analyzing sentiments.

  • Ontologies provide semantic context. Identifying entities in unstructured text is a picture only half complete. Ontology models complete the picture by showing how these entities relate to other entities, whether in the document or in the wider world.

:bulb: An ontology is a formal and structural way of representing the concepts and relations of a shared conceptualization.

[image: a sentence annotated with ontology relations]

I realize that this sentence is heavily marked up, with arrows and red text going all over the place, so let’s examine it closely.

  • We’ve only recognized (i.e., annotated) two words in this entire sentence: William Shakespeare as a Playwright and Hamlet as a Play. But look at the depth of the understanding that we have. There’s a model depicted in this image, and we want to examine it more carefully.
  • You’ll notice first of all that there are a total of 6 annotations represented on the diagram, with arrows flowing between them. These annotations are produced by the NLP parser and (here’s the key point) they are modeled in the Ontology. It’s in the Ontology that we specify how a Book is related to a Date or to a Language, a Language to a Country, a Country to an Author, an Author to a work produced by that Author, and so on.
  • Each annotation is backed by a dictionary. The data for that dictionary is generated out of the triple store that conforms to the Ontology. The Ontology shows the relationship of all the annotations to each other.

Why would someone want to develop an ontology? Some of the reasons are:

  • To share common understanding of the structure of information among people or software agents
  • To enable reuse of domain knowledge
  • To make domain assumptions explicit
  • To separate domain knowledge from operational knowledge
  • To analyze domain knowledge

image


Content


What is Syntactic analysis or parsing?

Syntax analysis, or syntactic analysis, is the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar. The term parsing comes from the Latin pars (orationis), meaning part (of speech).

  • Syntactic analysis of a sentence is the task of recognising the sentence and assigning a syntactic structure to it. These syntactic structures are assigned by a context-free grammar (most often a PCFG) using parsing algorithms such as Cocke-Kasami-Younger (CKY), the Earley algorithm, or chart parsers, and they are represented as tree structures. These parse trees serve as an important intermediate representation for semantic analysis; a parsing sketch follows the figure.

Syntactic Parse Tree

[image: example syntactic parse tree]
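A tiny chart-parsing sketch with NLTK; the toy grammar covers just the example sentence:

```python
# Parse a sentence with a hand-written CFG and print its parse tree(s).
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP PP | V NP
PP -> P NP
NP -> 'Bob' | 'coffee' | 'Starbucks'
V -> 'drank'
P -> 'at'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Bob drank coffee at Starbucks".split()):
    tree.pretty_print()
```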


Content


Concept of Parser

  • A parser implements the task of parsing. It can be defined as a software component that takes input data (text) and gives a structural representation of the input after checking for correct syntax against a formal grammar.
  • It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree, or another hierarchical structure.

image

Content


What is Semantic analysis?

:bulb: Lexical analysis operates on smaller tokens, whereas semantic analysis focuses on larger chunks.

We already know that lexical analysis also deals with the meaning of words, so how is semantic analysis different? Lexical analysis is based on smaller tokens, while semantic analysis focuses on larger chunks. That is why semantic analysis can be divided into the following two parts:

  • The semantic analysis of natural language content starts by reading all of the words in content to capture the real meaning of any text.
  • It identifies the text elements and assigns them to their logical and grammatical role.
  • It analyzes context in the surrounding text and it analyzes the text structure to accurately disambiguate the proper meaning of words that have more than one definition.

  • Semantic technology processes the logical structure of sentences to identify the most relevant elements in text and understand the topic discussed.
  • It also understands the relationships between different concepts in the text.
    • For example, it understands that a text is about “politics” and “economics” even if it doesn’t contain the actual words but only related concepts such as “election,” “Democrat,” “speaker of the house,” or “budget,” “tax” or “inflation.”

Semantic analysis is a broader term: it means analysing the meaning contained within text, not just the sentiment. It looks for relationships among the words, how they are combined, and how often certain words appear together.


Content


What is Taxonomy and Ontology?

An ontology identifies and distinguishes concepts and their relationships; it describes content and relationships.

A taxonomy formalizes the hierarchical relationships among concepts and specifies the term to be used to refer to each; it prescribes structure and terminology. Taxonomy identifies hierarchical relationships within a category.

Ontology example:

[image: ontology example]

Taxonomy example:

[image: taxonomy example]



Explain TF-IDF

Q. What is the drawback of Tf-Idf ? How do you overcome it ?
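Before the pros and cons, the score itself (one common weighting variant; libraries differ in smoothing and normalization):

$$ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)} $$

where $\mathrm{tf}(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$. A term scores highly when it is frequent in one document but rare across the corpus.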

Advantages:

  • Easy to compute
  • You have some basic metric to extract the most descriptive terms in a document
  • You can easily compute the similarity between two documents using it

Disadvantages:

  • TF-IDF is based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, co-occurrences in different documents, etc.
  • For this reason, TF-IDF is only useful as a lexical level feature
  • Cannot capture semantics (e.g. as compared to topic models, word embeddings)


Content


What is word2vec? What is the cost function for the skip-gram model (k-negative sampling)?
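For the negative-sampling variant (SGNS), the full softmax is replaced by a binary objective. For a centre word $w_I$, an observed context word $w_O$, and $k$ negative words drawn from a noise distribution $P_n(w)$, the objective to maximize per training pair is (Mikolov et al., 2013):

$$ \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right] $$

where $v_w$ and $v'_w$ are the input and output vector representations of word $w$, and $\sigma$ is the sigmoid function.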

Content


TODO: Where does Tf-Idf fail in document classification/clustering? How can you improve it further?


What are word2vec vectors?

Word2Vec embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors where vectors close together in vector space have similar meanings based on context, and word vectors distant from each other have differing meanings. For example, apple and orange would be close together, while apple and gravity would be relatively far apart. There are two versions of this model, based on skip-grams (SG) and continuous bag-of-words (CBOW); a quick sketch follows.
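A quick sketch with gensim (`pip install gensim`); the two-sentence corpus is a toy illustration:

```python
# Train a tiny skip-gram model and query nearest neighbours.
from gensim.models import Word2Vec

sentences = [["apple", "orange", "fruit"],
             ["apple", "gravity", "physics"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("apple", topn=2))
```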

Content


How can I design a chatbot?

(I had little idea, but I tried answering it with tf-idf-based similarity between the user’s message and known intents, returning the paired response; a sketch follows.)
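A minimal retrieval-style sketch of that idea: match the user's message to the closest known intent by tf-idf cosine similarity and return the canned response; the intents are illustrative:

```python
# Retrieval chatbot: nearest intent by tf-idf cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

intents = {
    "what are your opening hours": "We are open 9am-5pm, Monday to Friday.",
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
}

vec = TfidfVectorizer()
intent_vecs = vec.fit_transform(list(intents))

def reply(message):
    sims = cosine_similarity(vec.transform([message]), intent_vecs)[0]
    return list(intents.values())[sims.argmax()]

print(reply("when are you open?"))
```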

Content


Exercise

  1. Can I develop a chatbot with an RNN, providing intent and response pairs as input?
    1. Suppose I developed a chatbot with RNNs/LSTMs on a Reddit dataset, and it gives me 10 probable responses. How can I choose the best reply, or how can I eliminate the other replies?
  2. How do you perform text classification?
  3. How can you make sure to learn context? That is not possible with TF-IDF alone, is it?
    • I told him about taking n-grams, say n = 1, 2, 3, 4, and concatenating their tf-idf vectors to make one long count vector. Okay, that is the baseline people start with; what more can you do with machine learning? (I tried suggesting LSTM with word2vec or 1D-CNN with word2vec for classification, but he wanted improvements in machine-learning-based methods :-|)
  4. How do neural networks learn non-linear shapes when they are made of linear nodes? What makes them learn non-linear boundaries?
  5. What is the range of the sigmoid function?
  6. Text classification: how will you do it?
  7. Explain Tf-Idf.
  8. What are bigrams and trigrams? Explain the Tf-Idf of bigrams and trigrams with an example text sentence.
  9. What is an application of word2vec? Give an example.
  10. How will you design a neural network? How about making it very deep? (Very basic questions on neural networks.)
  11. How did you perform language identification? What were the features?
  12. How did you model classifiers like speech vs. music and speech vs. non-speech?
  13. How can deep neural networks be applied in these speech analytics applications?

Role-specific questions

Natural language processing

  1. What are stop words? Describe an application in which stop words should be removed.
  2. How would you design a model to predict whether a movie review was positive or negative?
  3. Which is a better algorithm for POS tagging – SVM or hidden Markov models?
  4. What is the difference between shallow parsing and dependency parsing?
  5. What packages are you aware of in Python that are used in NLP and ML?
  6. Explain one application in which stop words should be removed.
  7. Which is better to use when extracting features, character n-grams or word n-grams? Why?
  8. What is dimensionality reduction?
  9. Explain the working of SVM/NN/MaxEnt algorithms.
  10. Which is a better algorithm for POS tagging, SVM or hidden Markov models? Why?
  11. What packages are you aware of in Python which are used in NLP and ML?
  12. What are conditional random fields?
  13. When can you use the Naive Bayes algorithm for training, and what are its advantages and disadvantages?
  14. How would you build a POS tagger from scratch given a corpus of annotated sentences? How would you deal with unknown words?

Related fields such as information theory, linguistics and information retrieval

  1. What is entropy? How would you estimate the entropy of the English language?
  2. What is a regular grammar? Does this differ in power from a regular expression and, if so, in what way?
  3. What is the TF-IDF score of a word and in what context is this useful?
  4. How does the PageRank algorithm work?
  5. What is dependency parsing?
  6. What are the difficulties in building and using an annotated corpus of text such as the Brown Corpus and what can be done to mitigate them?
  7. Differentiate regular grammar and regular expression.
  8. How will you estimate the entropy of the English language?
  9. Describe dependency parsing.
  10. What do you mean by information rate?
  11. Explain Discrete Memoryless Channel (DMC).
  12. How does correlation work in text mining?
  13. How to calculate TF*IDF for a single new document to be classified?
  14. How to build ontologies?
  15. What is an N-gram in the context of text mining?
  16. What do you know about linguistic resources such as WordNet?
  17. Explain the tools you have used for training NLP models?

Tools and languages

  1. What tools for training NLP models (NLTK, Apache OpenNLP, GATE, MALLET, etc.) have you used?
  2. Do you have any experience in building ontologies?
  3. Are you familiar with WordNet or other related linguistic resources?
  4. Do you speak any foreign languages?

Question Source


Back to Top

Published on August 13, 2019