Entropic Evaluation of Classification
A hands-on, get-dirty introduction
Sunday, 8 July 2018, Oceania 7
Resources
Slides for the IJCNN'18 tutorial: IJCNN18EntropyTriangle
Implementations of the Entropy Triangle:
 R Package and an initialisation file to configure the library's dependencies.
 Weka package
 Matlab package
 Python package: to be released soon.
Use case vignettes in R
If you really want to get your hands dirty, these are the use cases we will use to illustrate the affordances of the Entropy Triangle in Rmd: Analysis of Confusion Matrices and Simple Use Case for the CBET on classification. You will be able to analyse different classifiers and find out for yourself what the Entropy Triangle is doing. In this case, it is recommended to have RStudio installed.
For those who just want to peruse the illustration cases: Analysis of Confusion Matrices and Simple Use Case for the CBET on classification
Our project page at ResearchGate, where updates are posted:
The main papers for the theory:
[bibshow file=http://www.tsc.uc3m.es/~carmen/CMET.bib] The first introduction to the Entropy Triangle, the CBET [bibcite key=val:pel:10b], and the related metrics EMA & NIT [bibcite key=val:pel:14a]; the source multivariate extension SMET [bibcite key=val:pel:17b] and the channel multivariate extension CMET [bibcite key=val:pel:18c].
[/bibshow]
Organizers
 Francisco J. Valverde-Albacete

 Departamento de Teoría de la Señal y de las Comunicaciones
 Universidad Carlos III de Madrid
 Avda de la Universidad, 30. Leganés 28911 (España)
 https://www.researchgate.net/profile/Francisco_J_ValverdeAlbacete
 Carmen Peláez-Moreno

 Departamento de Teoría de la Señal y de las Comunicaciones
 Universidad Carlos III de Madrid
 Avda de la Universidad, 30. Leganés 28911 (España)
 https://www.researchgate.net/profile/Carmen_PelaezMoreno
Description of Goal
To evaluate supervised classification tasks we have two different data sources:
 The observation vectors themselves, used to induce the classifier that produces the predicted labels.
 The true labels of the observations, to be compared to the predicted labels in the form of a confusion matrix.
Two main kinds of measures try to analyze the confusion matrix: count-based measures (accuracy, TPR, FPR and derived measures, like AUC, etc.) and entropy-based measures (variation of information, KL-divergence, mutual information and derived measures).
The first kind uses the minimization of such count-based error measures as a heuristic to improve the quality of the classifier, while the latter tries to optimize the flow of information between the input and output label distributions (e.g. minimize the variation of information, maximize the mutual information).
Specifically, this second type of evaluation considers the classification an information-transmitting channel like the one seen in Figure 1, where K are the labels, X the observation vectors, Y a possible transformation of the observations, and K̂ the predicted labels.
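To make the channel view concrete, here is a minimal plain-Python sketch (the counts are made up for illustration) of how a confusion matrix of counts is normalized into the joint distribution of true and predicted labels, with its two marginals:

```python
# Minimal sketch: a made-up 3-class confusion matrix of counts,
# rows = true labels, columns = predicted labels.
cm = [[30, 5, 5],
      [4, 32, 4],
      [2, 3, 35]]

total = sum(sum(row) for row in cm)

# Joint distribution of (true, predicted) labels: normalize the counts.
p_joint = [[c / total for c in row] for row in cm]

# Marginals: true-label distribution (row sums), predicted-label distribution (column sums).
p_true = [sum(row) for row in p_joint]
p_pred = [sum(col) for col in zip(*p_joint)]
```

All entropic quantities discussed below are computed from `p_joint` and these two marginals.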
The purpose of the tutorial is to train attendees in using a visual and numerical standalone tool, the Entropy Triangle, to carry out entropy-based evaluation on the blocks of Figure 1, and to compare it to accuracy-based evaluation.
This is schematically shown in Figure 2, where an entropy triangle is shown for a joint distribution P_{XY}.
The idea is to apply this triangle to the distribution of true and predicted labels issued by the classifier. Every single classifier, evaluated on its confusion matrix, appears as a distinct dot in the previous diagram and has different characteristics depending on which of the apexes it lies closer to. For instance, the triangles in Figure 3 show how different classifiers in WEKA can be evaluated on the same task (in this case, anneal from the UCI repository), a diagram that is relevant for data challenge competitions on supervised classification.
We will also offer a glimpse on how Entropy Triangles can be used
 to ascertain whether the underlying classification task, that is, the labels and the observations (K, X), is easy to solve or not, and
 whether the transformed features Y are going to help or hinder the process.
Learning by doing
Axiomatic approaches to evaluation ([bibcite key=sok:lap:09,ami:gon:art:ver:09]) sometimes forget that the proof of the pudding is in the eating. We intend to invite attendees to bring their own confusion matrices and cook some good entropy triangles with them. We will guide attendees through a session where they will learn to assess a single classifier, then compare different classifiers on the same task, and finally evaluate a classifier on different tasks.
When the data challenges for IJCNN'18 are decided, and if any of them is a classification task, we intend to contact the challenge organizers so that we can analyze the data in the tutorial along with the attendees.
The main expected outcome of the tutorial is that when attendees see a classifier evaluated on a dataset and represented in an Entropy Triangle, they will be able to judge what is wrong or right with it: whether the data are unbalanced, whether the classifier is good on the dataset, whether it is overtrained, etc.
And, perhaps more importantly, we will advise them on when to trust and when not to trust accuracy without p-values! ([bibcite key=val:pel:14a]).
The Entropy Triangle is available in this basic configuration for Matlab, R and Weka, and is forthcoming for Python; these resources will be made available to the attendees.
Plan of the session
The following points will be roughly evenly distributed over the 2 h of the tutorial:
 Intro: multiclass classification assessment and confusion matrices. We will extend the material in the introduction above.
 The entropy balance equation and the entropy triangle. We will introduce this visualization technique, which acts as a visual summary of the normalized variation of information, mutual information and divergence from uniformity of a joint distribution.
 Assessing single classifiers with the ET. At first, we will provide simple guidance on how to assess individually sampled confusion matrices: when to say, in absolute terms, that a classifier is good, bad, or cheating (on some data).
 Assessing a type of classifier on several tasks with the ET. We will show attendees how they can visualize the “no free lunch theorem” on their own classifiers to prevent public remonstrance at conferences.
 Assessing several types of classifiers on the same task with the ET. We will use this modality to carry out task assessment at a glance. We will get dirty with data and see examples of when a task has not been solved by a community in spite of stating accuracies of 65% and above.
 Assessing a data source with a variation of the ET. We will use this modality to see whether the data are unbalanced or whether the transformed features Y are good.
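The entropy balance equation behind the triangle can be verified numerically. The sketch below (plain Python, with an illustrative joint distribution) decomposes the uniform entropy budget H(U_X) + H(U_Y) into the divergence from uniformity, twice the mutual information, and the variation of information; normalized by the budget, the three terms become the triangle coordinates and sum to one:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero cells."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Illustrative joint distribution P_XY of true vs. predicted labels (3 classes).
p = [[0.30, 0.03, 0.02],
     [0.02, 0.28, 0.03],
     [0.01, 0.02, 0.29]]

px = [sum(row) for row in p]                  # marginal of X (true labels)
py = [sum(col) for col in zip(*p)]            # marginal of Y (predicted labels)
hxy = entropy([c for row in p for c in row])  # joint entropy H(P_XY)

hu = log2(len(px)) + log2(len(py))            # uniform budget H(U_X) + H(U_Y)
mi = entropy(px) + entropy(py) - hxy          # mutual information I(P_XY)
vi = 2 * hxy - entropy(px) - entropy(py)      # variation of information H(X|Y) + H(Y|X)
dh = hu - entropy(px) - entropy(py)           # divergence from uniformity

# Balance equation: dh + 2*mi + vi == hu; normalizing yields the ET coordinates.
coords = (dh / hu, 2 * mi / hu, vi / hu)
```

Each classifier's confusion matrix produces one such coordinate triple, i.e. one dot in the triangle.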
Outline of the Covered Material
The traditional distinction between classification measures roughly parallels the error-counting vs. entropy-summing distinction: the first uses techniques from classical statistics, while the second uses classical information theory ([bibcite key=mir:96,jap:sha:11]).
In this view, the confusion matrix is a summary of the errors and successes of the classifier on the test data. For binary classification tasks, the measure of success is accuracy and the measures of error are the False Positive Rate (FPR) and False Negative Rate (FNR). If the classifier-inducing technique has any parameters, the balance of FPR vs. FNR as parameterised can be observed in the ROC curve ([bibcite key=Fawcett:2004p59]). If a single measure is required, the Area-Under-the-Curve (AUC) is a convenient summary of it ([bibcite key=bra:97]). Although accuracy is straightforward for more than two classes, the multiclass ROC and Volume-Under-the-Curve took longer to be generalised. FPR and FNR are not used in this context.
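For the binary case, these count-based measures are easy to state in code (a plain-Python sketch with made-up counts; TP, FP, FN and TN follow the usual conventions):

```python
# Made-up binary confusion matrix entries.
tp, fn = 80, 20   # true positives / false negatives
fp, tn = 10, 90   # false positives / true negatives

total = tp + fn + fp + tn
accuracy = (tp + tn) / total     # fraction of correct decisions
tpr = tp / (tp + fn)             # True Positive Rate (sensitivity)
fpr = fp / (fp + tn)             # False Positive Rate
fnr = fn / (tp + fn)             # False Negative Rate = 1 - TPR
```

Sweeping a decision threshold traces (fpr, tpr) pairs, which is exactly the ROC curve mentioned above.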
Are so many measures necessary? The paramount error-counting measure, accuracy, is known to suffer from a paradox: good lab measurements often drop spectacularly on deployment ([bibcite key=zhu:dav:07]), among other problems ([bibcite key=dav:07]). Yet it is the measure most often reported for classifiers.
The model behind entropy measurements is that a classifier is a channel of information from the data to the output class labels (see Figure 1): the better this channel is, the better the classifier has captured the “essence” of the task and hence, the more interpretable the results are. In this sense, the confusion matrix is the joint distribution between the input and the output of the channel. Of course, this second interpretation is layered on top of the previous one.
In this interpretation, the variation of information ([bibcite key=mei:07]) measures how different the sets of input and output labels are: this is a quantity to be minimized. Similarly, the mutual information (MI) ([bibcite key=fan:61]) measures how similar both sets are, so this is a quantity to be maximized. Note that since MI is just the Kullback-Leibler divergence between the joint distribution of input and output labels and the product of their marginals, many intuitions regarding the KL-divergence can be applied to MI.
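Both quantities can be computed directly from a joint distribution; the sketch below (plain Python, with illustrative values) also checks the identity just stated, namely that MI equals the KL-divergence between the joint and the product of its marginals:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero cells."""
    return -sum(q * log2(q) for q in probs if q > 0)

# Illustrative 2x2 joint distribution of input vs. output labels.
p = [[0.4, 0.1],
     [0.1, 0.4]]

px = [sum(row) for row in p]            # input marginal
py = [sum(col) for col in zip(*p)]      # output marginal
hxy = entropy([c for row in p for c in row])

mi = entropy(px) + entropy(py) - hxy        # mutual information (to maximize)
vi = 2 * hxy - entropy(px) - entropy(py)    # variation of information (to minimize)

# MI as the KL-divergence between the joint and the product of the marginals.
kl = sum(p[i][j] * log2(p[i][j] / (px[i] * py[j]))
         for i in range(2) for j in range(2) if p[i][j] > 0)
```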
In this picture, the actual entropy of the input labels is of great importance, since it limits the amount of information that classifiers can learn from the training data: maxims like “the processing of information can only deteriorate it” depend on this. Many measures try to include this factor by normalizing with respect to the marginal entropies.
An analogue of the ROC in this context is the Entropy Triangle (ET) ([bibcite key=val:pel:10b]) (see Figure 2), showing the balance of MI, VI and the entropies of the marginals. Likewise, a summary measure on the ET is the Entropy-Modulated Accuracy (EMA) ([bibcite key=val:pel:14a]); it goes hand in hand with the Normalized Information Transfer rate, which measures the actual fraction of information learnt by a classifier.
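As a rough numerical sketch of these two summary measures (plain Python; the joint distribution is made up, and the formulas below are our reading of the cited paper and should be treated as assumptions to be checked against it: EMA as the inverse conditional perplexity 2^{−H(K|K̂)}, and the NIT rate as 2^{I(K;K̂)}/k for k classes):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero cells."""
    return -sum(q * log2(q) for q in probs if q > 0)

# Illustrative joint distribution of true labels K vs. predicted labels (k = 4 classes).
k = 4
p = [[0.20, 0.02, 0.02, 0.01],
     [0.02, 0.19, 0.02, 0.02],
     [0.01, 0.02, 0.20, 0.02],
     [0.01, 0.02, 0.02, 0.20]]

pk = [sum(row) for row in p]                 # marginal of true labels
pkhat = [sum(col) for col in zip(*p)]        # marginal of predicted labels
hj = entropy([c for row in p for c in row])  # joint entropy
h_cond = hj - entropy(pkhat)                 # conditional entropy H(K | prediction)
mi = entropy(pk) + entropy(pkhat) - hj       # mutual information

# ASSUMED formulas (verify against [bibcite key=val:pel:14a] before reuse):
ema = 2 ** (-h_cond)    # inverse conditional perplexity of K given the prediction
nit = (2 ** mi) / k     # fraction of the k-class information actually transferred
```

Both quantities lie in (0, 1], with higher values indicating a better classifier on this reading.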
In this tutorial we present a new context for entropic measures and introduce attendees to an entropy-based visualization aid for evaluating multiclass classifiers from their confusion matrices. We first make explicit the information-theoretic model for confusion matrices ([bibcite key=fan:61]) and provide an easy-to-understand information-theoretic introduction to the Entropy Triangle, which is capable of balancing the mutual information and the variation of information while, at the same time, taking into consideration the peculiarities of datasets.
Since it is difficult for a practitioner not to come up with her own measure, we provide a glimpse into the affordances of the Entropy-Modulated Accuracy and the Normalized Information Transfer rate ([bibcite key=val:pel:14a]) for measuring performance and learning ability, respectively.
This basic framework can be extended to data source evaluation and to the evaluation of multivariate data transformations. However, these are advanced topics and will only be dwelt upon if the attendees demand further information.
Justification
This tutorial is intended for practitioners in machine learning and data science who have ever been baffled by the evaluation of their favorite pet classifier. Since the technique is technology-agnostic, it can be used with any kind of classifier and supervised classification data whatsoever (and indeed with many more types of datasets!).
The not-so-gory details are easy to understand for researchers and students who have ever seen and understood the concepts of entropy, mutual information and Venn diagrams.
We are aiming at a 1.5 h duration to develop the learning outcomes spelled out above, but we could extend it to 3 h if necessary and include more examples of evaluation, more in-depth analysis, and a bit of theoretical justification once the main ideas have been captured by learning to read the visual tool.
Information-theoretic methods of evaluation are being actively pursued, e.g. to explain the good results of deep learning [bibcite key=sch:tis:17,tis:zas:15]. This tutorial addresses the issue from a technique-independent, wide-scope perspective that will help attendees apply it to their own experiments and research. These aims are furthered by the ready-to-use software that attendees will become acquainted with.
Although this is the first time this tutorial will be taught, the authors have explained this material in research talks to several groups, in Ph.D. classes and at conferences over the past 5 years.
They have a tendency to explain it to solitary researchers quite unaware of what is about to befall them. They unanimously get wide-open eyes and mouths from older researchers and expressions of "So what?" from younger ones. There have been 2 journal papers (PLOS ONE, Pattern Recognition Letters) and 2 conference papers (Interspeech, CLEF) written around this topic.
Francisco J. Valverde-Albacete has been teaching a number of subjects in electrical engineering, signal processing, data mining and pattern recognition for the past 20+ years, including Master's and Ph.D. level subjects. His interests now lie in information-theoretic approaches to machine learning and non-standard algebras for signal processing.
Carmen Peláez-Moreno has been teaching a number of subjects in signal processing; speech, audio and multimedia processing; and pattern recognition for the past 20+ years, including Master's and Ph.D. level subjects. She is an Associate Professor in the Multimedia Processing group at Universidad Carlos III de Madrid, Spain, where she applies signal processing to automatic speech recognition.
Special requirements
Internet connection; participants should bring their own computers with Octave (or Matlab), R, Python and/or Weka installed.
References