>> Okay. So, the next speaker is
Hai Le from Washington University in St. Louis.
>> Yeah. >> Yeah.
>> Thank you.
>> And he's going to talk about Precision-Recall.
>> Hi, everyone. My name is
Hai from Washington University in St. Louis.
This is joint work with my advisor, Professor Brendan Juba.
So the problem that we worked on is class imbalance.
We all know that class imbalance is a situation
where one class severely out-represents the other classes.
Some practitioners claim that
class imbalance makes it harder for them
to classify the data,
but existing learning theory does
not refer to class imbalance.
So although there are
many methods that have been proposed
to correct the imbalance, like
various sampling methods or cost-sensitive methods,
we have a question:
do those methods really help?
So in our work,
first we derived the relationship
between precision-recall and accuracy,
and then we claim that a large data set
is the only cure for class imbalance.
We support that claim both theoretically and empirically.
As for related work on this problem, Raeder et al.
empirically showed that every metric except
accuracy depends on the class imbalance.
Alvarez also derived a relationship
between precision, recall, and accuracy;
however, he did not interpret
the consequences of that relationship.
So first, let's look at our derivation of
the relationship between precision-recall and accuracy.
Our theorem is:
suppose D is a distribution over examples with
Boolean labels and base positive rate mu,
and suppose we have
a classifier h with precision greater than 0.5.
Define epsilon_max as the maximum of
the precision error and the recall error.
Then epsilon_max has to satisfy
mu * epsilon_max <= epsilon_accuracy <= 3 * mu * epsilon_max.
So, in general, the precision-recall error is equivalent
to the accuracy error scaled by a factor
of up to three times mu, the positive rate.
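
To make the bound concrete, here is a quick numeric sanity check in Python; the confusion-matrix counts are made up for illustration and are not from the talk.

    def error_rates(tp, fp, fn, tn):
        n = tp + fp + fn + tn
        mu = (tp + fn) / n                  # base positive rate
        eps_prec = 1 - tp / (tp + fp)       # precision error
        eps_rec = 1 - tp / (tp + fn)        # recall error
        eps_max = max(eps_prec, eps_rec)    # the larger of the two errors
        eps_acc = (fp + fn) / n             # accuracy error
        return mu, eps_max, eps_acc

    # 10% positives, a classifier with precision = recall = 0.9:
    mu, eps_max, eps_acc = error_rates(tp=90, fp=10, fn=10, tn=890)
    assert mu * eps_max <= eps_acc <= 3 * mu * eps_max
    print(mu * eps_max, eps_acc, 3 * mu * eps_max)   # ~0.01, 0.02, ~0.03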
So the consequence of this theorem is,
if we combine it with the VC bound for accuracy,
we can conclude that with
probability at least 1 - delta,
on the order of
(1 / (mu * epsilon_max)) * (d + log(1/delta))
examples are necessary and sufficient to
achieve precision and recall
greater than 1 - epsilon_max,
where d is the VC dimension of the classifier class.
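
The 1/mu factor is easiest to appreciate numerically. Below is a small Python sketch of the bound; the theorem suppresses the leading constant, so c = 1, d = 10, and delta = 0.05 are arbitrary illustrative choices, and the outputs should be read as relative, not absolute, sample sizes.

    import math

    def sample_bound(mu, eps_max, d, delta, c=1.0):
        # Order-of-magnitude sample size: (c / (mu * eps_max)) * (d + log(1/delta)).
        return c * (d + math.log(1 / delta)) / (mu * eps_max)

    for mu in (0.5, 0.1, 0.01, 0.001):
        print(f"mu = {mu}: ~{sample_bound(mu, eps_max=0.1, d=10, delta=0.05):,.0f}")
    # Every 10x increase in imbalance multiplies the required data by 10.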
So the presence of the 1/mu term is very
important, because it leads to the consequence that
when the objective is high precision and high recall,
class imbalance does impose a cost on learning.
That means that in order
to achieve high precision and recall,
the size of the training data
must scale with the imbalance.
So we conclude that using a
large training data set is the cure for imbalanced data,
because it is what controls the generalization error.
So, let's look at some case studies where we really need
high precision, because remember,
when I talked about our theorem,
we made the assumption that
the precision is greater than 0.5.
So the first example is
an ICU alarm system for predicting cardiac arrest.
Because cardiac arrest is
a very rare event and an ICU transfer is very expensive,
we really want that when the alarm
is triggered, the patient
actually needs to be transferred to the ICU,
so indeed we want high precision.
Another example that we
want to talk about is machine translation.
Here we also need
really high precision and recall, because you
can have high accuracy but
low recall if you never predict the rare words,
and you can also have high accuracy but low precision,
in which case the rare words you do use
in the translation might be meaningless.
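
Here is a tiny Python illustration of the first failure mode, with made-up labels: a classifier that never predicts the rare class still looks excellent by accuracy.

    # 10 rare positives out of 1,000 examples; always predict negative.
    y_true = [1] * 10 + [0] * 990
    y_pred = [0] * 1000

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    recall = true_pos / sum(y_true)
    print(accuracy, recall)   # 0.99 0.0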
So, Brants et al.
have a system that shows continual improvement as
the training data increases,
up to 100 billion examples.
It's not too
clear why it needs that much data,
so we did some investigation and
we observed that most of
the sentences contain rare words,
and the frequency of
the rarest word in the target sentence determines
the amount of data that is required to
reliably translate the sentence.
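
In the spirit of that observation, here is a hypothetical back-of-the-envelope sketch: treat the rarest word's corpus frequency as the base positive rate mu, so the data needed scales like 1/mu. The word frequencies and the constant c are invented for illustration.

    # Hypothetical corpus frequencies, not real statistics.
    word_freq = {"the": 5e-2, "translate": 1e-4, "defibrillator": 1e-7}

    def examples_needed(sentence, freq, c=10.0):
        # The rarest word in the target sentence drives the data requirement.
        mu = min(freq.get(w, 1e-8) for w in sentence.lower().split())
        return c / mu

    print(f"{examples_needed('translate the defibrillator', word_freq):.0e}")  # 1e+08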
So we did an experiment on
the New York Times corpus and we observed that at
least 1.5 million examples are necessary
to correctly learn 55% of the sentences.
So, for the experiment part,
the aim
is to verify our conclusion
that a large training data size is
necessary to cope with class imbalance.
In order to do that,
we compare
the performance of different techniques for fixing
class imbalance against a no-modification version
trained on a larger training set.
So the data set that we use in
this experiment is a drug discovery data set with
about 1 million negative labels
and about 62,000 positive labels.
So in the first part of the experiment,
we stratify the data into
20 folds with about 50,000 samples each.
Then we select one fold
and we create different models by
re-sampling it using different imbalance techniques,
like oversampling, undersampling, and SMOTE,
and then we train the models using k-NN.
On the rest of the data,
we just train using k-NN, and we
repeat the experiment 200
times and compare the performance.
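
Here is a minimal Python sketch of this setup using scikit-learn and imbalanced-learn; synthetic data stands in for the drug discovery set, and the fold size, k, and seeds are illustrative choices, not the talk's actual parameters.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import precision_score, recall_score
    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Synthetic imbalanced data (about 6% positives) standing in for the real set.
    X, y = make_classification(n_samples=60_000, weights=[0.94], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    X_fold, y_fold = X_train[:5_000], y_train[:5_000]   # one small "fold"

    def evaluate(X_tr, y_tr, sampler=None):
        # Optionally re-balance the training data, then fit k-NN and score it.
        if sampler is not None:
            X_tr, y_tr = sampler.fit_resample(X_tr, y_tr)
        y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_test)
        return precision_score(y_test, y_pred), recall_score(y_test, y_pred)

    for name, s in [("oversample", RandomOverSampler(random_state=0)),
                    ("undersample", RandomUnderSampler(random_state=0)),
                    ("SMOTE", SMOTE(random_state=0))]:
        print(name, evaluate(X_fold, y_fold, s))
    print("no modification, more data", evaluate(X_train, y_train))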
So you can see that the no-modification version,
when trained on a larger data size,
has significantly higher precision and
recall than the models
that use the different imbalance techniques.
In the second part of the experiment,
we compare the performance across different data
set sizes, from 100,000 to 500,000 samples.
You can see that the precision and recall
scale with the size of the data
set in the no-modification version.
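
The second experiment is essentially a learning curve for the unmodified model. Reusing evaluate() and the synthetic data from the sketch above, with sizes scaled down from the talk's 100,000-500,000 so the snippet runs quickly:

    # Learning curve for the unmodified k-NN model (no re-sampling).
    for n in (10_000, 20_000, 30_000, 40_000):
        print(n, evaluate(X_train[:n], y_train[:n]))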
So, to conclude, we
find that it is not possible to achieve
high precision and recall under
severe class imbalance unless
you have a certain amount of data.
And we claim that the methods for correcting
class imbalance don't help, and you
shouldn't trust classifiers trained on
typical-size data sets with high imbalance, because
it is very similar to the situation where you
have low accuracy on a small data set.
So, our advice, when you
have to deal with class imbalance
but you don't have enough data,
is that rather than
trying different techniques for correcting the imbalance,
you should try to use
some prior knowledge about the domain.
For example, for the machine translation example
that I talked about before,
you can look up the definition of
the rare word in a dictionary rather than trying
different techniques for correcting the imbalance.
That's all for my talk. Thank you, everyone.