So, from Computer Vision, we will move next to NLP,
and when we talk about NLP,
I cannot think of any better name
than Professor Pushpak Bhattacharyya,
who is from IIT Bombay, as probably most of you know,
and who now divides his time between IIT Bombay and
IIT Patna, where he is the Director.
So, over to you, Sir, Pushpak.
>> Okay. So, yeah, from vision to language.
My talk is on the state of the art for NLP.
And I've also added some observations and insights,
based on whatever we have seen over a number of years.
Okay. So I'll make
a few remarks on the nature of NLP as such,
then I'll take a representative problem
and talk about the new world order for NLP,
then go into some trends and Indian efforts,
and draw some conclusions and future work.
So, Natural Language Processing is one of the areas
of AI, and the remarks that
Professor Raj Reddy made in the morning
sort of contextualize AI and Computer Vision as well,
and we can draw from those remarks.
So, the areas on
the outer circle are the ones which
are closer to human beings
and real-life applications, but all of them draw from
very fundamental ideas and techniques of search,
reasoning, and learning, and some planning also.
So, we have always believed that
Natural Language Processing
is a confluence of linguistics and computation.
We believe linguistics is
the eye and computation is the body.
And we can see on the top layer,
the basic ingredients of
linguistics: lexicon, morphology, syntax, semantics.
And at the bottom layer, the
nuts and bolts of computer science:
probability theory, machine learning,
graphs and trees, and so on.
And this interaction between
linguistics and Computer Science gives rise to
very important areas whose utility
as well as research interest are supreme,
like sentiment analysis, summarization,
morphological analysis, and machine translation.
All these are extremely big areas
and they have tremendous applications.
Next please? So, just to make the point
that linguistics really is the eye and computation is the body,
I would like to put this example
in front of you right at the beginning.
Okay? Just to emphasize the point:
as physics is for Computer Vision
and phonetics is for speech,
so linguistics is for Natural Language Processing.
I think these foundations will always remain,
they can never be discounted.
So, here is an example,
this is the distributional hypothesis
from Harris back in 1970.
Words with similar distributional properties
have similar meanings
and Harris does mention that distributional approaches
can model differences in
meaning rather than the proper meaning itself.
So, this particular idea is
the foundation for Word embedding or Word vectors.
So that is the "linguistics is the eye" point.
Now, let's see the "computation is the body" point.
Next, so, we have the CBOW and
Skip-gram models for generating
word vectors or word embeddings.
So in Skip-gram, what happens is that
a word is presented,
as a one-hot representation,
to a feed-forward network, and
the output is prescribed as
the contextual words for that particular word.
For example, if "dog" is the input word,
represented in one-hot form,
then all the contextual words for
dog, especially content words like bark,
policeman, vigilance, and so on,
are placed at the output layer,
in the form of a supervised learning setup.
And CBOW is exactly the opposite:
the context words are presented as
one-hot representations, and the output
is the word itself, okay?
So now, the weight vector for the neuron
corresponding to the word's one-hot position
becomes the representation of the word.
Okay? So the weight vector represents
the word. Next slide.
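As a concrete illustration of the Skip-gram and CBOW setup just described, here is a minimal sketch using the gensim library; the toy corpus, the hyperparameters, and the gensim 4.x parameter names are assumptions for illustration, not part of the talk.

```python
# Minimal sketch: training Skip-gram and CBOW word vectors with gensim.
# The tiny corpus and hyperparameters are illustrative only; parameter
# names (vector_size, sg) assume gensim 4.x.
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "barks", "at", "the", "thief"],
    ["the", "policeman", "praised", "the", "faithful", "dog"],
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "milk"],
]

# sg=1 selects Skip-gram (predict context words from the centre word);
# sg=0 selects CBOW (predict the centre word from its context).
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# The learned weight vector for a word is taken as its embedding.
print(skipgram.wv["dog"][:5])           # first few dimensions of the vector
print(skipgram.wv.most_similar("dog"))  # words with similar distributions
```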
So now, if we take an example here,
when we hear the word "dog",
by association quite a few other words
appear in our mind, and they also appear
in its syntactic environment in the corpus.
So, bark, police, thief, vigilance,
faithful, friend, animal, milk, carnivore, and so on.
Similarly, for the word "cat",
we have associated words from the corpora,
and similarly for "lamp".
Now, the similarity between dog and
cat is much more than
the similarity between dog and lamp.
As is shown in the associated word list also.
Now, this is an intuition we have,
this is also specified by
Harris's distributional hypothesis and next slide please?
And, if you present these words to the neural network,
it is these very same words which
appear by excitation at the outermost layer.
Okay? So the representation in this case captures
the similarity of the words through
the weight vectors.
So this is the way
we have captured the word embedding,
and we take it as the meaning of the word.
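To make the similarity intuition concrete, here is a minimal sketch of cosine similarity between word vectors; the three vectors are made-up toy values, standing in for embeddings learned as above.

```python
# Cosine similarity between word vectors; the vectors here are invented
# toy values, not real embeddings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

dog = np.array([0.9, 0.8, 0.1])   # hypothetical embedding
cat = np.array([0.8, 0.9, 0.2])   # shares many contexts with "dog"
lamp = np.array([0.1, 0.2, 0.9])  # shares few contexts with either

print(cosine(dog, cat))   # high: dog and cat have similar distributions
print(cosine(dog, lamp))  # low: dog and lamp do not
```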
Okay? Next please? So, coming back to this aphorism,
linguistics is the eye and computation is the body.
The encode-decode deep learning network is nothing
but an implementation
of Harris's distributional hypothesis.
Okay? So this 1970 hypothesis
has appeared as the word embedding,
trained through a feed-forward neural network
by applying the backpropagation algorithm.
And this whole network is nothing
but the implementation of that particular principle:
linguistics is the eye, computation
is the body. Next please?
So Natural Language Processing,
we know, is a layered task, going from morphology at
the bottom, which processes words, and transiting through
part-of-speech tagging, chunking, parsing,
and semantics, going up to
tough problems like pragmatics and discourse.
And Natural Language Processing
is also a multi-dimensional problem.
There are languages on the X-axis,
the Y-axis has the problems
with respect to language processing,
and computer scientists are interested in
the other axis, which is algorithm design.
Now, the property and
the personality of the language is very important.
We have a very favorite example of ours,
[FOREIGN] This is a Marathi sentence
and the English sentence is,
The one who is in front of the house told me.
So, this is the way languages represent information.
[FOREIGN] is a continuous string,
and English has this isolating behavior:
it has taken different morphemes from that long string
and has different words representing the morphemes.
So, this difference of property between the languages,
is important to keep in mind for algorithm design.
Next? So, the need for NLP is well known:
we have to track the sentiment of people,
and e-commerce companies are very interested in this.
And as Professor Reddy mentioned in the morning,
translation is going to be a very,
very important problem in the years to come.
Next? Now, machine learning has
become one of the important paradigms
of Natural Language Processing.
And in machine learning, what we do is
look at instances and arrive at
an abstract concept by which we can classify objects.
So, when instances of tables are given,
what we arrive at is the concept of "tableness" in
some approximate form, and
that is refined gradually in the presence of more data.
Okay. So, what we are acquiring is
the concept of tableness, not particular tables.
Next? So this picture I'm very fond of,
Natural Language Processing and machine learning have got
married for the benefit of both fields. Next please?
Now, the challenge of natural language processing
comes from the fact that
there is ambiguity at every stage.
And this is what makes
machine learning and natural language processing
come close to each other.
Ambiguity processing means choosing
one amongst many options,
and that, precisely, is the classification task.
We choose one of the many classes present.
So there is ambiguity at the lexical level,
ambiguity of words: the word
"present" can mean present time or a gift.
There is structural ambiguity:
"one and two bedroom flats, live-in ready."
"Live-in ready" has to be grouped together;
similarly, "one and two bedroom"
has to be grouped together,
not only as a set but also recursively,
since they have their own internal structure.
There is a very famous sentence
in semantics: "Flying planes can be dangerous."
What is dangerous, the flying or the planes?
Pragmatic ambiguity.
Sarcasm is a case of pragmatic ambiguity.
If I'm not given any attention at
a party, and while I'm taking leave
the host asks me how I liked the party, then I reply,
or let's say retort, "I love being ignored,"
which is a sarcastic sentence.
So the surface meaning is
different from the intended meaning.
Forward. The other challenge of NLP is multilinguality,
and if you have seen the list of
things that the Professor already put up,
multilingual computation is listed
as one of the basic needs of our civilization.
Next. So now an important point about the nature
of NLP and its interaction with the machine learning.
Rules have been used in
artificial intelligence always right
from the beginning of AI.
Even when data-driven computation is used,
the underlying mechanism is rules,
even though they are not apparent,
and we would like to extract those rules
from the machine that has been trained.
That is what the research on explainability is about.
Rules are very good for explaining the computation.
And we also like to believe that we should
not learn when we know the underlying knowledge.
Only when the phenomenon seems
arbitrary at the current state of knowledge
do we like to go to data;
for example, [inaudible] is translated as "many thanks",
not as "several thanks", and there is no explanation for this.
Okay? So, since this is
a fixed behavior we'd
like to extract this pattern from the data,
and use it. Next please.
So, in this scenario probability
has played a very, very important role.
This is a favorite example of mine again.
If we look at these four sentences:
"the sun rises in the east",
"the sun rise in the east",
"the svn rises in the east", and "the sun rises in the west".
The last three sentences each have
some defect: a grammatical mistake,
a spelling mistake, a semantic mistake.
The first sentence is correct.
And when we compute the probability of the sentences,
for the first sentence
the probability comes out to be highest.
What is the meaning of the probability of
the sentence? Next slide please.
The probability of a sentence is
the product of the probabilities of its n-grams.
So the probability of "the sun rises in the east" is
the probability of "sun", times the probability of "rises" given "sun",
times the probability of "in" given "sun rises",
times the probability of "the" given "sun rises in", and so on.
So this is the quadrigram probability.
So, when we try to compute this probability,
because "rises in the west" as
a quadrigram is less frequent in the corpora
than "rises in the east",
the probability of the first sentence
comes out to be the highest.
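Written out, the quadrigram factorization the speaker is listing looks like this; the count-ratio estimate is the standard maximum-likelihood form, and words at the start of the sentence are simply conditioned on fewer predecessors.

```latex
% Quadrigram (4-gram) language model: each word is conditioned on at most
% the three preceding words, with probabilities estimated as count ratios.
\begin{align*}
P(w_1 \dots w_n) &\approx \prod_{i=1}^{n} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}),
\qquad
P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) =
  \frac{\operatorname{count}(w_{i-3}\, w_{i-2}\, w_{i-1}\, w_i)}
       {\operatorname{count}(w_{i-3}\, w_{i-2}\, w_{i-1})} \\[4pt]
P(\text{sun rises in the east}) &\approx
  P(\text{sun})\,
  P(\text{rises} \mid \text{sun})\,
  P(\text{in} \mid \text{sun rises})\,
  P(\text{the} \mid \text{sun rises in})\,
  P(\text{east} \mid \text{rises in the})
\end{align*}
```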
Can you go back once? Yeah. So, the probability
of the first sentence comes out to be the highest,
and it produces an illusion that
the system understands grammatical mistakes,
semantic mistakes, and spelling mistakes.
Okay. So this very simple calculation, which
is robust and fast,
and based on simple ratios of counts,
creates an illusion that
the system is aware of grammatical mistakes,
spelling mistakes, and semantic mistakes.
So this method of doing AI is,
according to me, a slightly different world order,
and it is going to stay for
quite some time because data is available.
Next. Next please.
So this particular power of data,
and of simple processing on data,
was brought out in 2014 by a paper from Google,
which also got a lot of press,
and Johann mentioned this as well.
There are lots of papers on automatic image captioning.
And the main methodology is simple.
We do image processing on the image,
and natural language processing on the caption,
and put features and parts in
correspondence. Next slide please.
So, one more slide.
Yeah. So, you see,
all these images have been captioned automatically.
And now there are slight mistakes which are also curious.
So, look at the second image:
"two dogs play in the grass."
Actually, three dogs are playing.
And also the last-but-one image,
"red motorcycle": actually the motorcycle is pink.
Now, one begins to wonder why
a system which has done everything else right
makes these mistakes of
counting the dogs here and the color of the motorcycle there.
This, we know from the history of AI,
is the sign of shallow understanding.
Systems which have shallow understanding
can do mostly all right,
but make some curious mistakes which are
unexplainable. So next please.
So, this kind of
shallow understanding is characteristic of NLP, ML,
AI, and maybe vision and speech also, where we can
sort of write an equation
with some violation of notation:
deep understanding is equal to
shallow understanding plus big data.
That seems to be the world order
today in terms of research.
Next. So, this reminds us of the grind methodology:
show umpteen problems
for a particular concept,
for example, Newton's Third Law of Motion.
There is the problem of the spring,
there is the problem of the recoil of a gun.
All these are explained by Newton's Third Law of Motion.
But what happens today is that students are subjected to
a very large number of solved problems. Next slide.
And what they typically do is memorize the patterns:
there are multiple-choice questions, so they
match the pattern, eliminate choices, and select from a few.
And it seems they do not have
a unifying theme which runs across the problems.
Okay? So, next.
So, there is an uncanny resemblance to
today's NLP, as I see it.
There is this kind of huge memorization of patterns,
along with a probability distribution.
So if I take this example, "I love being ignored":
this is sarcastic, yes; non-sarcastic, no.
This is the hardmax decision.
"The movie is great for putting you
to sleep" is slightly more difficult,
because the sarcasm is not overt.
So this is sarcastic with probability 0.9,
and non-sarcastic with probability 0.1.
So, instead of classifying in
a hard manner, what we are
learning is the probability distribution,
where all the classes are
probable with different probability values.
Okay? And this corresponds
to actually extracting patterns
and giving them probabilities,
assuming some kind of probability distribution
on the underlying data.
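A minimal sketch of this contrast between the hard decision and the learned probability distribution, using a softmax over made-up logits for the sleep-inducing-movie example; the numbers are purely illustrative.

```python
# Hardmax vs. probability distribution: softmax turns a classifier's raw
# scores (logits) into class probabilities; the hard decision is the argmax.
# The logits below are made up for illustration.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

classes = ["sarcastic", "non-sarcastic"]

# Hypothetical logits for "The movie is great for putting you to sleep."
logits = np.array([2.2, 0.0])

probs = softmax(logits)
print(dict(zip(classes, probs.round(2))))    # roughly {'sarcastic': 0.9, 'non-sarcastic': 0.1}
print("hard decision:", classes[int(np.argmax(probs))])
```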
Next. So the main methodology in NLP,
and I think in many other fields of AI, is this:
object A and object B are put in correspondence,
parts and features are
extracted from both objects,
and we establish correspondences
between these parts and features.
We go from larger correspondences
to smaller correspondences.
And the methodology is,
most of the time, expectation maximization.
We learn these mappings, and
we use these mappings in a new situation, called decoding.
This is the main methodology which is
applied across machine learning driven NLP.
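To make the correspondence-plus-EM recipe concrete, here is a minimal sketch in the spirit of IBM Model 1 word alignment; the two-sentence parallel corpus is a toy example, the usual NULL word is omitted, and this is only an illustration of the general idea, not any specific system from the talk.

```python
# IBM Model 1-style EM: learn word-level correspondences t(target | source)
# from sentence-aligned pairs. Toy data; NULL alignment omitted for brevity.
from collections import defaultdict

corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]

# Initialise translation probabilities uniformly over target words.
targets = {t for _, tgt in corpus for t in tgt}
t = defaultdict(lambda: 1.0 / len(targets))

for _ in range(10):  # a few EM iterations
    count = defaultdict(float)   # expected counts for (target, source) pairs
    total = defaultdict(float)   # expected counts for source words
    # E-step: distribute each target word's mass over candidate source words.
    for src, tgt in corpus:
        for tw in tgt:
            norm = sum(t[(tw, sw)] for sw in src)
            for sw in src:
                frac = t[(tw, sw)] / norm
                count[(tw, sw)] += frac
                total[sw] += frac
    # M-step: re-estimate t(target | source) from the expected counts.
    for (tw, sw), c in count.items():
        t[(tw, sw)] = c / total[sw]

print(round(t[("the", "das")], 2))     # rises toward 1.0 over iterations
print(round(t[("house", "haus")], 2))  # likewise: the correspondence is learned
```

Decoding, in this picture, is simply applying the learned mappings t to a new input.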
Next. And the interaction of linguistics and
computation happens through annotation.
Annotated data is better than
raw data; it dramatically
changes the way computation is done.
But good annotation design and
good annotators are very difficult to find,
because they have to understand
both statistical phenomena and linguistic phenomena.
Next. So I take a representative problem now,
and this is the problem of numerical sarcasm.
It shows, you know, why rules,
classical machine learning, and
deep learning are all important,
and why it is necessary to put all of them in perspective.
Next please.
So, about 17% of sarcastic tweets have their origin in numbers.
"This phone has an awesome battery backup of 38 hours"
is a non-sarcastic statement with positive sentiment.
The last sentence, "This phone has
a terrible battery backup of
two hours", is non-sarcastic with negative sentiment.
The second sentence is the curious one:
"This phone has an awesome battery backup of two hours."
Why do we resort to sarcasm?
Okay? Why can't we plainly and directly
express a negative sentiment?
So this is an interesting question:
why do people use sarcasm?
I think human beings inherently like dramatization.
They also like forceful articulation.
And here one of the hypotheses is that
sarcasm has the effect
of lowering the defense and then attacking,
and at that point the attack is much more forceful.
So: "This phone has an awesome battery backup."
Fine. The defense is
lowered, and then you dramatically come back saying,
"backup of two hours", which is a negative sentiment.
Okay? So there are many, many theories about why
sarcasm is used as an instrument of communication.
Next. So, numerical sarcasm.
All these are examples of numerical sarcasm.
"Waiting 45 minutes for
the subway in the freezing cold is so much fun."
"So much fun" is positive,
but waiting 45 minutes
for the subway in the freezing cold is negative.
So, the simultaneous presence of
positive and negative sentiment indicates sarcasm.
Next. So, there are important datasets
in the area for the sarcasm task. Next.
So what is done in numerical sarcasm detection is this:
take "This phone has an awesome battery backup of two hours."
It is passed through natural language processing;
this is the dependency parsing output.
So, we create tuples of this kind:
the noun phrases, "phone", "awesome battery backup", et cetera,
the number, and the unit, "hours",
and the algorithm goes as follows.
In the rule-based matching system,
"I love writing this paper at 9:00 am"
is a new test sentence. The matched sarcastic tweet is
"I love writing this paper
daily at 3:00 am." That is sarcastic.
Since nine is not close to three,
the test sentence is labeled non-sarcastic.
I'm sure you're raising your eyebrows,
okay, because this is still ad-hoc.
See the next example also.
Test Tweet: "I'm so
productive when my room is 81 degrees."
The matched non-sarcastic tweet is,
"I'm very much productive in
my room as it has 21 degrees."
The absolute difference between 81 and 21
is high, and therefore the test tweet is labeled sarcastic.
So this is the rule-based system.
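A minimal sketch of the rule-based decision just described; the word-overlap matching, the two stored tweets, and the closeness threshold of 5 are simplifying assumptions of mine, not the actual published system.

```python
# Rule-based numerical sarcasm check: match the test tweet to the most
# similar stored tweet, then compare the numbers. Matching is reduced to
# crude word overlap, and the threshold is a made-up value.
import re

stored = [
    ("I love writing this paper daily at 3 am", True),                     # sarcastic
    ("I am very much productive in my room as it has 21 degrees", False),  # non-sarcastic
]

def number_of(text):
    m = re.search(r"\d+", text)
    return int(m.group()) if m else None

def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def classify(test, threshold=5):
    match_text, match_is_sarcastic = max(stored, key=lambda s: overlap(test, s[0]))
    close = abs(number_of(test) - number_of(match_text)) <= threshold
    # If the numbers are close, the test tweet inherits the matched label;
    # if they are far apart, the label is flipped.
    return match_is_sarcastic if close else not match_is_sarcastic

print(classify("I love writing this paper at 9 am"))              # False: 9 is not close to 3
print(classify("I am so productive when my room is 81 degrees"))  # True: 81 is far from 21
```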
Okay? Now the rule-based system is naive,
very ad hoc, agreed?
So then we go to machine learning:
when the decision seems ad hoc,
we'd like to go to data and
make these decisions come from the data itself.
So here we use classical machine learning
with different features, like positive words,
negative words, and so on:
positive emoticons, negative emoticons,
punctuation features.
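A minimal sketch of this classical machine learning route, with a few hand-crafted features fed to an off-the-shelf scikit-learn classifier; the lexicons, the two training tweets, and their labels are illustrative stand-ins for real annotated data.

```python
# Classical ML with hand-crafted features: sentiment word counts,
# punctuation, and presence of numbers, fed to logistic regression.
# Lexicons, training examples, and labels are toy placeholders.
from sklearn.linear_model import LogisticRegression

POSITIVE = {"awesome", "love", "great", "fun"}
NEGATIVE = {"terrible", "hate", "bad"}

def features(text):
    words = text.lower().split()
    return [
        sum(w in POSITIVE for w in words),   # count of positive words
        sum(w in NEGATIVE for w in words),   # count of negative words
        text.count("!"),                     # punctuation feature
        sum(w.isdigit() for w in words),     # numbers present in the tweet
    ]

train_texts = [
    "This phone has an awesome battery backup of 2 hours",   # sarcastic
    "This phone has a terrible battery backup of 2 hours",   # non-sarcastic
]
train_labels = [1, 0]

clf = LogisticRegression().fit([features(t) for t in train_texts], train_labels)
test = "Waiting 45 minutes in the freezing cold is so much fun"
print(clf.predict([features(test)]))  # expected to flag the tweet as sarcastic (1)
```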
And then, since features also depend
on human beings' whims and fancies,
we'd like to eliminate even that part
of the computation and make
it completely deep learning based,
with very little feature engineering:
a Convolutional Neural Network,
a feed-forward network with backpropagation.
And here are the results.
So we find that deep learning-based approaches
give about 93% accuracy,
the machine learning-based approach about 80%,
and the rule-based system 82%, okay,
so both of them are around 80%.
What is the main difference? The difference is
that in the rule-based system, the decision is ad hoc:
coming from human beings, it is ad hoc.
The problem with the machine learning-based system
is that features are being used,
and features are also given by human beings.
The ideal deep learning system would eliminate
all human intervention, and
everything would come from data.
That is the ideal deep learning scenario.
And the message, the insight, is this:
there is ad hocism in the decision,
and you can remove this ad hocism by relying on
the data to give us the decision threshold.
If we go to machine learning,
then human intervention is reduced a little.
Still, the features are coming from human beings,
and even that level of human intervention is
removed when we resort to a deep learning system.
The message is that rule-based systems are
great for intuition building and explainability;
that we cannot take away from rule-based systems.
Okay? The idea for
building the machine comes from rule-based systems.
However, some human decisions seem to be ad hoc,
so we relegate those decisions to come from the data.
And finally, in the final step,
we resort to deep learning to have
even the feature engineering coming from data.
So this is the trend: rule-based systems,
classical machine learning-based systems, and
then deep learning-based systems.
So the new world order for NLP-ML is the following.
There is this trinity of data, technique and idea.
In this new world order,
data has become available.
Techniques have improved a lot,
with a lot of insights from machine learning.
Now, there is this playing field where people with ideas
can make use of data and
technique to come up with very useful systems.
Okay? But this playing field is not a level one.
There are haves and have-nots.
And data leads to applications, and
applications lead to more data: a steeply ascending gradient.
The Googles, Facebooks, and Ubers
have a tremendous amount of data.
So they will always try to either outsmart
new entities with ideas or try to acquire them.
So, I saw this San Francisco-based startup called Postmates,
which aims at getting anything delivered in minutes.
Okay? The idea is very similar to Uber.
Just as Uber has a fleet of cars with drivers,
Postmates has a fleet of couriers,
currently about 13,000, who can deliver goods locally.
So a huge amount of data is gathered and
subjected to analytics to reduce delivery time,
by drafting the nearest courier personnel to pick
up items from specified shops and deliver them,
just like Uber drafts its drivers.
Okay? So, same idea. Apple
is very interested in bringing multilinguality,
sentiment, et cetera into
its Siri system, where again NLP plays an important role.
NVIDIA is more interested
in attacking a very fundamental question.
A lot of deep learning is actually
based on matrix multiplication.
So they would like to make matrix
multiplication faster and faster.
Now, a common start-up theme seems to be the prediction
of problematic instances out of a conceptual space:
separating whether a loanee will turn out to be a defaulter,
whether an organization will fall into financial distress,
whether a renter will prove "bad" for Airbnb.
So, those are all one class of problems
which startups are handling.
But another common theme is this:
we have services like cars, shops, restaurants,
movies, et cetera, and people wanting service,
like travelers, consumers, movie-goers.
And between these two, there
is this matchmaking software.
Uber, Lyft, MakeMyTrip,
Postmates are nothing but matchmaking software.
And therefore, the methodology is:
services register themselves with the software;
people wanting service download the app, and that
builds the bridge; and then
matchmaking takes place for fast service,
quality of service, and customer satisfaction.
They collect users' feedback in spoken and written form.
This is a very, very important trend for
all this matchmaking software. Indian Effort.
This has been mainly multilinguality driven.
And many of
our very well-known researchers are right in this room.
In the 90s, with the Fifth Generation Computer Project:
IISc (architecture), IITM (expert
systems), the then NCST doing NLP,
ISI (computer vision), TIFR (speech).
Machine Translation has been a large effort since the 1980s,
and many institutes have been involved.
Cross-lingual information retrieval: we have
been leading this effort from IIT Bombay.
Indian language lexical resources, like the Indian WordNets.
Information extraction: very strong
groups exist at IITKGP,
IIT Delhi, IISc, IIT Bombay,
IIIT Hyderabad, and IIT Patna also.
Speech and NLP: IITM, TIFR, IIT Bombay.
OCR and NLP: IITD and IISc.
So these institutes have been
contributing a lot to Indian language NLP and speech.
So, the future outlook is as follows.
It is a reality for NLP
that lab ideas are going to the land:
they are becoming large utilities,
very, very useful utilities making a lot of money,
and the processing is
based on quick and robust computation.
Quick and robust computation,
shallow understanding, shallow processing,
but a lot of data: this is here to stay.
And the marriage of NLP and speech is inevitable,
led by speech, with NLP as the backend.
Okay.
So voice-activated speech systems are going to be there,
but they will be helped a lot
by Natural Language Processing at the backend.
Chatbots are becoming ubiquitous:
banks, insurance companies, airlines,
all of them are on chatbots.
Emotion and opinion tracking:
we will have to ignore this only at our peril.
And there is a huge possibility of economic impact.
I'll end with some philosophical questions.
One: the method changes but the philosophy does not.
Classical, statistical, and deep learning:
the method changes, not the fundamental principle.
What is the fundamental principle? Let's remember this.
NLP and machine learning work
on the principle of establishing correspondences.
Okay, there are large objects
which are put in correspondence.
Their parts and features are
put in correspondence by
expectation maximization or some such algorithm.
This is the basic methodology.
And then we apply these learned mappings
to new situations, which is called decoding.
So this particular thing hasn't changed.
Whether it is rule-based,
machine learning-based, or deep learning-based,
this fundamental principle hasn't changed
and, according to me, will not change.
So this is based on
the correspondence of parts
and the correspondence of features.
Now, when we learn
the correspondence of parts and features,
the ingenuity lies in modeling the probability.
Okay, what exactly is the probability distribution?
That is the crux of the matter.
That's where human ingenuity will play its role.
And the philosophical view on
neural networks versus language is the following.
In deep learning, or neural network-based computation,
we know there are only two kinds of
representations, vectors and matrices, in fact only one,
matrices, and the operators are
only two: addition and multiplication.
If we go to the world of languages,
the representation is much, much richer.
There are characters, words,
sentences, paragraphs, punctuation,
syntax trees, meaning graphs,
silence, modulation, et cetera, et cetera.
And the operators: concatenation, reversal, reordering,
implicature, irony, metaphor, sarcasm,
and other operations of language.
So a very, very important question is this:
can these two kinds of representation and these two operators
capture the whole gamut of operations and
representations in the field of language? Okay. Thank you.
>> Thank you professor.
We'll take only one question since
we're running a bit over time.
One question from the audience?
Anyone? Just wait for the microphone.
>> In the word-to-vector representation,
let's say there is one word
that appears in two contexts,
like "the river bank of the Ganga..." and
"the State Bank of India increased the interest rate...".
These two occurrences are placed in different contexts,
but how will you make
one vector representation for the two?
>> Yeah. So, this is an example of word ambiguity.
Bank as River Bank or bank as Financial Bank.
So, if you create
the word representation from the corpus as such,
then you will
not have two different representations
for the two different senses.
However, if you generate word vectors from domains,
then the representations will be different.
The other thing is that there is work
proposing explainable word vectors,
interpretable word embeddings;
Kevin is working on this.
So you can trace components of the word vector,
which correspond to those domain-specific attributes.
For example, the financial bank
has a very close association with money,
so maybe we can trace that particular component
to a particular position in the word vector.
>> With that, we'll close this talk.
Thank you very much Professor.