IJCNN 2018 Tutorial

Entropic Evaluation of Classification
A hands-on, get-dirty introduction

Sunday, 8 July 2018, Oceania 7


Slides for IJCNN-04 tutorial: IJCNN18-EntropyTriangle

Implementations of the Entropy Triangle:

Use case vignettes in R

If you really want to get dirty, these are the use cases we will use to illustrate the affordances of the Entropy Triangle in Rmd: Analysis of Confusion Matrices and Simple Use Case for the CBET on classification. You will be able to analyse different classifiers and find out for yourself what the Entropy Triangle is doing. In this case, it is recommended to have RStudio installed.

For those who just want to peruse the illustration cases: Analysis of Confusion Matrices and Simple Use Case for the CBET on classification

Our project at ResearchGate:

Web page at ResearchGate where updates to it are posted:

The main papers for the theory:

[bibshow file=http://www.tsc.uc3m.es/~carmen/CMET.bib] The first introduction to the Entropy Triangle, the CBET [bibcite key=val:pel:10b], and its related metrics EMA & NIT [bibcite key=val:pel:14a]; the source multivariate extension SMET [bibcite key=val:pel:17b]; and the channel multivariate extension CMET [bibcite key=val:pel:18c].



Francisco J. Valverde-Albacete
Carmen Peláez-Moreno

Description of Goal

To evaluate supervised classification tasks we have two different data sources:

  • The observation vectors themselves, used to induce the classifier that produces the predicted labels.
  • The true labels of the observations, to be compared to the predicted labels in the form of a confusion matrix.

Two main kinds of measures are used to analyze the confusion matrix: count-based measures (accuracy, TPR, FPR and derived measures, like the AUC, etc.) and entropy-based measures (variation of information, KL-divergence, mutual information and derived measures).

The first kind uses the minimization of such error counts as a heuristic to improve the quality of the classifier, while the second tries to optimize the flow of information between the input and output label distributions (e.g. minimize the variation of information, maximize the mutual information).
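To make the contrast concrete, here is a minimal Python sketch computing one measure of each kind from the same confusion matrix. The function names are ours, for illustration only; the tutorial's own implementations are in R, Matlab and Weka.

```python
import numpy as np

def accuracy(cm):
    """Count-based: fraction of correctly classified examples."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def mutual_information(cm):
    """Entropy-based: I(K; K_hat) in bits, from the joint distribution
    estimated by normalizing the confusion matrix."""
    p = np.asarray(cm, dtype=float)
    p = p / p.sum()
    pk = p.sum(axis=1, keepdims=True)    # marginal of true labels
    pkh = p.sum(axis=0, keepdims=True)   # marginal of predicted labels
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (pk @ pkh)[mask])).sum())

cm = np.array([[45, 5],
               [10, 40]])
print(accuracy(cm))            # 0.85
print(mutual_information(cm))  # ~0.40 bits
```

The two numbers answer different questions: accuracy counts hits, while mutual information measures how much knowing the prediction reduces uncertainty about the true label.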

Specifically, this second type of evaluation considers the classification process an information-transmitting channel like the one seen in Figure 1, where K are the labels, X the observation vectors, Y a possible transformation of the observations, and \hat{K} the predicted labels.

Figure 1: Conceptual scheme for the measurement of the information balances in a
multiclass/multilabel classification task.

The purpose of the tutorial is to train attendees in using a visual and numerical standalone tool, the Entropy Triangle, to carry out entropy-based evaluation of the blocks of Figure 1, and to show how it compares to accuracy-based evaluation.

This is schematically shown in Figure 2, where an entropy triangle is shown for a joint distribution P_{XY}.

Figure 2:
Schematic ET as applied to supervised classifier assessment (from [bibshow file=http://www.tsc.uc3m.es/~carmen/MLmeasures.bib] [bibcite key=val:pel:14a])

The idea is to apply this triangle to the distribution of true and predicted labels issued from the classifier, P_{K\hat{K}}. Every single classifier, evaluated on its confusion matrix, appears as a distinct dot in the previous diagram and has different characteristics depending on how close it lies to each of the apexes. For instance, the triangles in Figure 3 show how different classifiers in WEKA can be evaluated on the same task (in this case, anneal from the UCI repository), a diagram that is relevant for data challenge competitions on supervised classification.

Figure 3: (Color online) Entropy triangle for some Weka classifiers on the UCI anneal task

We will also offer a glimpse on how Entropy Triangles can be used

  • to ascertain whether the underlying classification task, that is, the labels and the observations (K, X), is easy to solve or not, and
  • even whether the transformed features Y are going to be beneficial or detrimental to the process.

Learning by doing

Axiomatic approaches to evaluation ([bibcite key=sok:lap:09,ami:gon:art:ver:09]) sometimes forget that the proof of the pudding is in the eating. We intend to entreat attendees to bring their own confusion matrices and cook some good entropy triangles with them. We will guide attendees through a session where they will learn to assess a single classifier, then compare different classifiers on the same task, and finally evaluate a classifier on different tasks.

When the data challenges for IJCNN’18 are decided, and if any of them is a classification task, we intend to contact the challenge organizers so that we can analyze the data in the tutorial along with the attendees.

The main expected outcome of the tutorial is that when attendees see a classifier evaluated on a dataset and represented in an Entropy Triangle, they will be able to judge what is wrong or right with it: whether the data are unbalanced, whether the classifier is good on the dataset, whether it is over-trained, etc.

And, perhaps more importantly, we will advise them on when to trust and when not to trust accuracy without p-values! ([bibcite key=val:pel:14a]).

The Entropy Triangle is available in this basic configuration for Matlab, R and Weka, and is forthcoming for Python; these resources will be made available to the attendees.

Plan of the session

The following points will be roughly evenly distributed over the 2 h allotted for the tutorial:

  1. Intro: multi-class classification assessment and confusion matrices. We will extend the material in the introduction above.
  2. The entropy balance equation and the entropy triangle. We will introduce this visualization technique, which acts as a visual summary of the normalized variation of information, mutual information and divergence from uniformity of a joint distribution.
  3. Assessing single classifiers with the ET. At first, we will provide simple guidance on how to assess individually sampled confusion matrices: when to say in absolute terms that a classifier is good, bad, or cheating (on some data).
  4. Assessing a type of classifier on several tasks with the ET. We will show attendees how they can visualize the “no free lunch theorem” on their own classifiers to prevent public remonstrance at conferences.
  5. Assessing several types of classifiers on the same task with the ET. We will use this modality to carry out task assessment at a glance. We will get dirty with data and see examples of tasks that have not been solved by a community in spite of reported accuracies of 65% and above.
  6. Assessing a data source with a variation of the ET. We will use this modality to see if the data is unbalanced or if the transformed features Y are good.
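As a foretaste of point 2, the entropy balance equation splits the sum of the (maximal) uniform marginal entropies into three non-negative terms, H_{U_K} + H_{U_\hat{K}} = \Delta H + 2 \cdot MI + VI, which, once normalized, become the coordinates plotted in the triangle. The Python sketch below checks this numerically on a confusion matrix; the function name is ours, and the tutorial's own software is in R, Matlab and Weka.

```python
import numpy as np

def et_coordinates(cm):
    """Normalized entropy-balance coordinates of a confusion matrix:
    (DeltaH, 2*MI, VI) / (H_Ux + H_Uy), which sum to 1 and locate
    the classifier inside the Entropy Triangle."""
    p = np.asarray(cm, dtype=float)
    p = p / p.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)

    def H(q):
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    hx, hy, hxy = H(px), H(py), H(p.ravel())
    mi = hx + hy - hxy                        # mutual information
    vi = hxy - mi                             # H(X|Y) + H(Y|X)
    hu = np.log2(len(px)) + np.log2(len(py))  # uniform (maximal) entropies
    dh = hu - hx - hy                         # divergence from uniformity
    return dh / hu, 2 * mi / hu, vi / hu

coords = et_coordinates([[45, 5], [10, 40]])
print(coords, sum(coords))  # three non-negative coordinates summing to 1.0
```

A classifier near the 2·MI apex transmits most of the available information; one near the VI apex is close to guessing; a large ΔH signals unbalanced marginals.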

Outline of the Covered Material

The traditional distinction between classification measures roughly parallels the error-counting vs. entropy-summing divide: the first uses techniques from classical statistics, while the second uses classical information theory ([bibcite key=mir:96,jap:sha:11]).

In this view, the confusion matrix is a summary of the errors and successes of the classifier on the test data. For binary classification tasks, the measure of success is accuracy and the measures of error are the False Positive Rate (FPR) and the False Negative Rate (FNR). If the classifier-inducing technique has any parameters, the balance of FPR vs. FNR as they are varied can be observed in the ROC curve ([bibcite key=Fawcett:2004p59]). If a single measure is required, the Area Under the Curve (AUC) is a convenient summary of it ([bibcite key=bra:97]). Although accuracy is straightforward for more than two classes, the multi-class ROC and Volume Under the Curve took longer to be generalised; FPR and FNR are not used in this context.
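For reference, the binary count-based measures follow directly from the four cells of a 2×2 confusion matrix. A minimal sketch (our own function name, not part of the tutorial's software):

```python
def binary_rates(tp, fn, fp, tn):
    """Count-based measures for a binary confusion matrix.
    tp/fn: true positives/false negatives (true class positive);
    fp/tn: false positives/true negatives (true class negative)."""
    acc = (tp + tn) / (tp + fn + fp + tn)  # overall success rate
    fpr = fp / (fp + tn)                   # errors among true negatives
    fnr = fn / (fn + tp)                   # errors among true positives
    return acc, fpr, fnr

print(binary_rates(tp=40, fn=10, fp=5, tn=45))  # (0.85, 0.1, 0.2)
```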

Are so many measures necessary? The paramount error-counting measure, accuracy, is known to suffer from a paradox: good lab measurements often drop spectacularly on deployment ([bibcite key=zhu:dav:07]), among other problems ([bibcite key=dav:07]). Yet it is the measure most often reported for classifiers.

The model behind entropy measurements is that a classifier is a channel of information from the data to the output class labels (see Figure 1): the better this channel is, the better the classifier has captured the “essence” of the task and, hence, the more interpretable the results are. In this sense, the confusion matrix is the joint distribution of the input and output labels. Of course, this second interpretation is added on top of the previous one.

In this interpretation, the variation of information ([bibcite key=mei:07]) measures how different the input and output label distributions are: this is a quantity to be minimized. Conversely, the mutual information (MI) ([bibcite key=fan:61]) measures how similar they are, so this is a quantity to be maximized. Note that since MI is just the Kullback-Leibler divergence between the joint distribution of input and output labels and the product of their marginals, many intuitions regarding the KL-divergence can be applied to MI.
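That identity, MI = D(P_{XY} || P_X · P_Y), is easy to verify numerically. The Python sketch below (our own helper names, for illustration) computes MI both via entropies and as a KL-divergence and confirms they agree:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

def H(q):
    """Shannon entropy in bits."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

pxy = np.array([[0.45, 0.05],
                [0.10, 0.40]])          # joint of true/predicted labels
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi_entropies = H(px) + H(py) - H(pxy)   # MI via entropies
mi_kl = kl(pxy, np.outer(px, py))       # MI as D(P_XY || P_X * P_Y)
print(np.isclose(mi_entropies, mi_kl))  # True
```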

In this picture, the actual entropy of the input labels has great importance, since it limits the amount of information that classifiers can learn from the training data: maxims like “the processing of information can only deteriorate it” (the data processing inequality) depend on this. Many measures try to include this factor by normalizing with respect to the marginal entropies.
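The maxim can be demonstrated directly: any deterministic post-processing of the channel output (here, merging two output symbols of a hypothetical joint distribution of our own choosing) can never increase the mutual information with the input.

```python
import numpy as np

def mi(p):
    """Mutual information in bits of a joint distribution given as a matrix."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log2(p[m] / (px @ py)[m])).sum())

pxy = np.array([[0.30, 0.10, 0.05],
                [0.05, 0.25, 0.25]])
# Deterministic processing of the output: merge the last two symbols.
merged = np.column_stack([pxy[:, 0], pxy[:, 1] + pxy[:, 2]])
print(mi(merged) <= mi(pxy) + 1e-12)  # True: processing cannot add information
```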

An analogue of the ROC curve in this context is the Entropy Triangle (ET) ([bibcite key=val:pel:10b]) (see Figure 2), showing the balance of MI, VI and the entropies of the marginals. Likewise, a summary measure on the ET is the Entropy-Modified Accuracy (EMA) ([bibcite key=val:pel:14a]); it goes hand in hand with the Normalized Information Transfer (NIT) rate, which measures the actual fraction of information learnt by a classifier.

In this tutorial we present a new context for entropic measures and introduce attendees to an entropy-based visualization aid for evaluating multi-class classifiers from their confusion matrices. We first make explicit the information-theoretic model for confusion matrices ([bibcite key=fan:61]) and provide an easy-to-understand information-theoretic introduction to the Entropy Triangle, which is capable of balancing the mutual information and the variation of information while, at the same time, taking into consideration the peculiarities of datasets.

Since it is difficult for a practitioner not to come up with her own measure, we provide a glimpse into the affordances of the Entropy-Modulated Accuracy and the Normalized Information Transfer rate ([bibcite key=val:pel:14a]) for measuring performance and learning ability, respectively.

This basic framework can be extended to data source evaluation and to the evaluation of multivariate data transformations. However, these are advanced topics that will only be dwelt upon if the attendees demand further information.



This tutorial is intended for practitioners in machine learning and data science who have ever been baffled by the evaluation of their favorite pet classifier. Since the technique is technology-agnostic, it can be used with any kind of classifier and supervised classification data whatsoever (and indeed with many more types of datasets!).

The not-so-gory details are easy to understand for researchers and students who have ever seen and understood the concepts of entropy, mutual information and Venn diagrams.

We are aiming at a 1.5 h duration to develop the learning outcomes spelled out above, but we could extend it to 3 h if necessary and include more examples of evaluation, more in-depth analysis, and a bit of theoretical justification once the main ideas have been captured by learning to read the visual tool.

Information-theoretic methods of evaluation are being actively pursued, e.g. to justify the good results of deep learning [bibcite key=sch:tis:17,tis:zas:15]. This tutorial addresses this issue from a technique-independent, wide-scoped perspective that will help attendees apply it to their own experiments and research. These aims are furthered by the ready-to-use software that attendees will become acquainted with.

Although this is the first time this tutorial will be taught, the authors have presented this material in research talks at several groups, Ph.D. classes and conferences over the past 5 years.

They have a tendency to explain it to solitary researchers quite unaware of what is about to befall them. They unanimously get wide-open eyes and mouths from older researchers and expressions of “So what?” from younger ones. There have been 2 journal papers (PLOS ONE, Pattern Recognition Letters) and 2 conference papers (Interspeech, CLEF) written around this topic.

Francisco J. Valverde-Albacete has been teaching a number of subjects in Electrical Engineering, Signal Processing, data mining and pattern recognition for the past 20+ years, including master and Ph.D. level subjects. His interests now lie in information-theoretic approaches to Machine Learning and non-standard algebras for signal processing.

Carmen Peláez-Moreno has been teaching a number of subjects in Signal Processing, Speech, Audio and Multimedia processing and Pattern Recognition for the past 20+ years, including master and Ph.D. level subjects. She is an Associate Professor in the Multimedia Processing group at Universidad Carlos III de Madrid, Spain, where she applies signal processing to Automatic Speech Recognition.

Special requirements

Internet connection; participants should bring their computers with Octave (or Matlab), R, Python
and/or Weka installed.

