- SME/Senior Faculty with Jigsaw Academy
- Worked in academia prior to Jigsaw; taught courses in Statistics and Research Methods @ GNDU, Amritsar
- Academics: Statistics, Maths and Economics
Gunnvant (Jigsaw Academy)
If you have Python (Anaconda) installed along with spaCy and TextBlob, all the code is hosted here; please download this repo.
If not, I have created a lab environment. The cloud instance takes around 3-4 minutes to spin up, so you can click here and start the lab.
Most ML tasks are centered around finding a mapping between predictors and labels
\[labels=f(Predictors)\]
And this is how data should typically be laid out if any ML is to be used. But what will you do if you want to use text as an input to an ML task?
Predictor | Label |
---|---|
This Joke is very funny | Haha |
This Joke is not funny | Meh |
Good joke | Haha |
Pathetic joke | Meh |
The underlying assumption while doing any kind of ML task is that one can estimate the functional dependence \(f\) between predictors and targets
\[labels=f(Predictors)\]
To find any sort of functional dependence one will need to represent both predictors and labels as numbers
It's easier to deal with labels:
Predictor | Label |
---|---|
This Joke is very funny | 1 |
This Joke is not funny | 0 |
Good joke | 1 |
Pathetic joke | 0 |
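For instance, a minimal sketch of this mapping with pandas (the toy DataFrame just mirrors the table above):

```python
import pandas as pd

# Toy data matching the table above
df = pd.DataFrame({
    "Predictor": ["This Joke is very funny", "This Joke is not funny",
                  "Good joke", "Pathetic joke"],
    "Label": ["Haha", "Meh", "Haha", "Meh"],
})

# Map the categorical labels to numbers
df["Label"] = df["Label"].map({"Haha": 1, "Meh": 0})
print(df)
```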
But how do we handle the text?
Traditionally, we can use a BOW (bag of words) approach to represent text.
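As a minimal sketch of BOW, assuming scikit-learn is available in your Anaconda install:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This Joke is very funny", "This Joke is not funny",
          "Good joke", "Pathetic joke"]

# Each document becomes a vector of raw word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # learnt vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # one row of counts per document
```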
The problem with the BOW representation of text is that raw counts let frequent but uninformative words dominate.
An improvement on the BOW approach is the tfidf representation.
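A minimal sketch of the same corpus under tfidf:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This Joke is very funny", "This Joke is not funny",
          "Good joke", "Pathetic joke"]

# tfidf down-weights words that appear in many documents
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(X.toarray().round(2))
```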
If you use the tfidf representation on a corpus which is fairly large, then (choose the options which are correct)
Let's head over to our lab to see how we can predict whether a tweet was made by Donald Trump or by someone else.
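As a rough sketch of what such an exercise involves, a tfidf-plus-classifier pipeline might look like this; the file name `tweets.csv` and the columns `text` and `author` are assumptions for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical file and column names; adapt to the lab's dataset
df = pd.read_csv("tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["author"] == "Donald Trump", random_state=42)

# tfidf features fed into a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```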
Let's take another example to see how we can use the tfidf representation in other creative ways. We will take a look at a dataset and try to extract "important" terms out of a document. All we need to do is pick the words that have very high tfidf scores.
Doing this makes sense when you have a text corpus that respects the rules of standard English grammar and uses standard English vocabulary.
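A minimal sketch of ranking terms in a document by tfidf score (the toy documents are made up; `stop_words="english"` drops common words first):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The referee awarded a penalty in the final minute of the match",
        "The match ended in a draw after a disputed penalty"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Rank terms in the first document by their tfidf score
terms = tfidf.get_feature_names_out()
scores = X[0].toarray().ravel()
top = np.argsort(scores)[::-1][:5]
print([terms[i] for i in top])
```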
Let's visit the lab again; you can follow along locally as well.
Numerical representation of text enables us to calculate text similarity. This helps in text search and therefore aids in making recommendations.
The idea is that two similar pieces of text will have similar tfidf representation.
One very popular way of finding the distance between vectors is to use cosine distance.
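A minimal sketch: build tfidf vectors and compare them with scikit-learn's `cosine_similarity` (cosine distance is just one minus this similarity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["cheap flights to london",
        "low cost air tickets to london",
        "best pizza recipes"]

X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity; the first two documents should score highest
print(cosine_similarity(X).round(2))
```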
The ideas discussed so far can be summarised as follows: represent text numerically (e.g., with tfidf), then measure similarity between the resulting vectors (e.g., with cosine distance).
These basic building blocks can be used to create a contextual search/recommendation engine. Head over to the lab or the local code.
Let's now turn our attention to what NLP is and how it can help us in different business contexts.
So, how does the ability to do POS tagging or to find Subject-Object relationships help us? Consider the sentences below.
Can you see how POS tagging might be of use here? Let's again go back to our lab and work with nlp_basic.ipynb
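Before heading over, here is a minimal sketch of POS tags and dependency labels with spaCy, assuming the small English model `en_core_web_sm` has been downloaded:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The camera quality disappointed me, but the battery life is great.")

for token in doc:
    # token.pos_ is the part of speech, token.dep_ the dependency relation
    print(token.text, token.pos_, token.dep_, token.head.text)
```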
What does this lead to?
If you are interested in finding out what people are talking about, you can extract the nouns and noun phrases they mention:
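A minimal sketch, using spaCy's noun chunks as a quick proxy for the topics being mentioned:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The battery life is poor but the screen and the camera are excellent.")

# Noun chunks surface the things being talked about
for chunk in doc.noun_chunks:
    print(chunk.text)
```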
Let's head back to our notebook.
The approach discussed so far is useful for exploratory analysis, when you don't yet know what people are talking about.
What if you already knew the aspects you want to search for? You can use word vectors to find words similar to the aspects you have finalised.
But first, let's talk about word vectors. What will happen if the following sentences are represented by tfidf vectors?
So, what can be done? Before word vectors became popular, people experimented with Markov models, where one essentially creates a probability distribution over the co-occurrence of words, something like this:
\[P(word_i \mid word_j) \quad \forall \ (word_i, word_j)\]
This partially solves the problem of capturing some sense of the co-occurrence of words.
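A minimal sketch of estimating these conditional probabilities from bigram counts on a toy corpus:

```python
from collections import Counter

corpus = "the joke was funny and the story was good".split()

# Count bigrams and the words that start them, then estimate P(word_i | word_j)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p(word_i, word_j):
    """Estimated probability of seeing word_i right after word_j."""
    return bigrams[(word_j, word_i)] / unigrams[word_j]

print(p("joke", "the"))  # 0.5: "the" is followed by "joke" once and "story" once
```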
Word vectors solve this problem by creating vector representations of words in such a manner that the co-occurrence structure of words is retained.
One makes use of neural networks while creating word vectors. Here is a schematic:
This training process results in a vector representation of each word in the corpus being learnt:
But what to train for?
How can word vectors help us?
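One answer: measure similarity between words. A minimal sketch with spaCy, assuming the medium English model `en_core_web_md` (the small model ships no real word vectors):

```python
import spacy

# Assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

aspect = nlp("battery")
for word in ["charge", "screen", "pizza"]:
    # Cosine similarity between word vectors; related words score higher
    print(word, round(aspect.similarity(nlp(word)), 2))
```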
Let's head over to the lab once again and see how we can extract mentions of specific aspects.
In aspect-based sentiment analysis, we try to find out what people think about different features of a product.
One way this problem can be solved is by treating it as a supervised ML task and labelling the data as follows:
This is labour intensive, and probably not a good idea if an MVP is all you want to build. In the long term, though, this is ideally what one should do.
What can we do then if we want to build a quick and dirty MVP?
We already have pre-trained sentiment analysers, such as TextBlob:
Let's see an example and head over to the lab.
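A minimal sketch with TextBlob's pre-trained analyser:

```python
from textblob import TextBlob

# polarity ranges from -1 (negative) to +1 (positive)
for text in ["The battery life is excellent.", "The camera is pathetic."]:
    print(text, TextBlob(text).sentiment.polarity)
```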