## What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. **Entropy** is defined as:

> Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply *entropy* and *maximum entropy* in terms of text mining? Can someone give me an easy, simple (visual) example?

I assume entropy was mentioned in the context of building **decision trees**.

To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either `m` or `f`, we want to learn a model that fits the data and can be used to predict the gender of a new, unseen first name.

```
name       gender
-----------------
Ashley        f        Now we want to predict
Brian         m        the gender of "Amro" (my name)
Caroline      f
David         m
```

First step is deciding what **features** of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, does it end with a vowel, etc.. So after feature extraction, our data looks like:

```
# name      ends-vowel  num-vowels  length  gender
# ------------------------------------------------
Ashley          1           3         6       f
Brian           0           2         5       m
Caroline        1           4         8       f
David           0           2         5       m
```
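As a sketch, the feature-extraction step could look like this in Python (a hypothetical helper; note that `y` is counted as a vowel to match the table):

```python
def gender_features(name):
    """Extract the example features described above from a first name."""
    name = name.lower()
    vowels = set("aeiouy")  # 'y' counted as a vowel, matching the table
    return {
        "ends-vowel": int(name[-1] in vowels),
        "num-vowels": sum(ch in vowels for ch in name),
        "length": len(name),
    }

print(gender_features("Ashley"))
# {'ends-vowel': 1, 'num-vowels': 3, 'length': 6}
```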

The goal is to build a decision tree. An example of a tree would be:

```
length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   length=5: male
```

Basically each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (`m` or `f`).

So if we run the name *Amro* down this tree, we start by testing "*is the length<7?*" and the answer is *yes*, so we go down that branch. Following the branch, the next test "*is the number of vowels<3?*" again evaluates to *true*. This leads to a leaf node labeled `m`, and thus the prediction is *male* (which I happen to be, so the tree predicted the outcome correctly).
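To make the traversal concrete, here is a small sketch of the example tree as nested conditionals (the `length>=7` sub-tree is collapsed to a single prediction for brevity, and `y` counts as a vowel as in the tables above):

```python
def predict(name):
    """Run a name down the hand-written example tree above."""
    name = name.lower()
    vowels = set("aeiouy")  # 'y' counted as a vowel, as in the tables
    num_vowels = sum(ch in vowels for ch in name)
    ends_vowel = name[-1] in vowels

    if len(name) < 7:                  # test at the root
        if num_vowels < 3:
            return "m"
        return "f" if ends_vowel else "m"
    return "m"                         # length>=7 branch (collapsed here)

print(predict("Amro"))  # 'm' -- length 4 < 7, then 2 vowels < 3 -> male
```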

The decision tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e. nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of *purity* is called **information**. It represents the expected amount of information that would be needed to specify whether a new instance (first name) should be classified male or female, given the examples that reached the node. We calculate it based on the number of male and female classes at the node.

**Entropy**, on the other hand, is a measure of *impurity* (the opposite). It is defined for a binary class with values `a`/`b` as:

```
Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
```

This binary entropy function, for a random variable that can take one of two values, reaches its maximum when the probability is `p=1/2`, meaning that `p(X=a)=0.5` or equivalently `p(X=b)=0.5`: a 50%/50% chance of being either `a` or `b` (uncertainty is at a maximum). The entropy function is at its minimum of zero when the probability is `p=1` or `p=0`, i.e. complete certainty (`p(X=a)=1` or `p(X=a)=0` respectively; the latter implies `p(X=b)=1`).
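As a quick numeric check of these endpoints (a minimal sketch):

```python
from math import log2

def binary_entropy(p):
    """Entropy of a two-valued variable with p(X=a) = p."""
    if p in (0.0, 1.0):
        return 0.0  # 0*log(0) is taken to be 0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))            # 1.0   -> maximum uncertainty
print(binary_entropy(1.0))            # 0.0   -> complete certainty
print(round(binary_entropy(0.9), 4))  # 0.469 -> skewed, less uncertain
```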

Of course the definition of entropy can be generalized for a discrete random variable X with N outcomes (not just two):

```
H(X) = - Σ p(x_i) * log(p(x_i))        (summing over i = 1..N)
```

*(the log in the formula is usually taken as the logarithm to the base 2)*
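The generalized definition can be sketched directly from class counts (assuming `counts` holds the number of instances of each class at a node):

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # p=0 terms contribute 0
    return -sum(p * log2(p) for p in probs)

print(round(entropy([9, 5]), 4))  # 0.9403 -> the [9m,5f] node used below
print(entropy([7, 7]))            # 1.0    -> equiprobable, maximum entropy
```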

Back to our task of name classification, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

```
     ends-vowel
      [9m,5f]           <--- the [..,..] notation represents the class
    /          \             distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]
```

As you can see, before the split we had 9 males and 5 females, i.e. `P(m)=9/14` and `P(f)=5/14`. According to the definition of entropy:

```
Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403
```

Next we compare it with the entropy computed after considering the split, by looking at the two child branches. In the left branch of `ends-vowel=1`, we have:

```
Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852
```

and in the right branch of `ends-vowel=0`, we have:

```
Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917
```

We combine the left/right entropies using the number of instances down each branch as a weighting factor (7 instances went left, and 7 instances went right), and get the final entropy after the split:

```
Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885
```

Now by comparing the entropy before and after the split, we obtain a measure of **information gain**, or how much information we gained by doing the split using that particular feature:

```
Information_Gain = Entropy_before - Entropy_after = 0.1518
```

*You can interpret the above calculation as follows: by doing the split with the ends-vowel feature, we were able to reduce uncertainty in the sub-tree prediction outcome by a small amount of 0.1518 (measured in bits as units of information).*

At each node of the tree, this calculation is performed for every feature, and the feature with the *largest information gain* is chosen for the split in a greedy manner (thus favoring features that produce *pure* splits with low uncertainty/entropy). This process is applied recursively from the root-node down, and stops when a leaf node contains instances all having the same class (no need to split it further).
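A sketch of this greedy selection, assuming each example is a dict of feature values (the helper names here are hypothetical):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Entropy before the split minus weighted entropy after it."""
    groups = {}
    for example, label in zip(examples, labels):
        groups.setdefault(example[feature], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

def best_feature(examples, labels):
    """Pick the feature with the largest information gain (greedy step)."""
    return max(examples[0], key=lambda f: information_gain(examples, labels, f))

# The ends-vowel split from the worked example: [9m,5f] -> [3m,4f] / [6m,1f]
examples = [{"ends-vowel": 1}] * 7 + [{"ends-vowel": 0}] * 7
labels = ["m"]*3 + ["f"]*4 + ["m"]*6 + ["f"]
print(round(information_gain(examples, labels, "ends-vowel"), 4))  # 0.1518
```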

Note that I skipped over some details which are beyond the scope of this post, including how to handle numeric features, missing values, overfitting and pruning trees, etc.


I can't give you graphics, but maybe I can give a clear explanation.

Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two *adds* one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.

Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge), so that the lamp must always flash green. The channel is now useless; *we know what the next flash will be*, so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream over this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). And if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free) but green flashes are more common than red: the more skewed the probability, the fewer bits you need to describe the sequence, and the less information it contains, all the way to the all-green, bulb-burnt-out limit.

It turns out there's a way to measure the amount of information in a signal, based on the probabilities of the different symbols. If the probability of receiving symbol `x_i` is `p_i`, then consider the quantity

```
-log(p_i)
```

The smaller `p_i`, the larger this value. If `x_i` becomes twice as unlikely, this value increases by a fixed amount (log(2)). This should remind you of adding one bit to a message.

If we don't know what the symbol will be (but we know the probabilities), then we can calculate the average of this value, i.e. how much information we expect to get, by summing over the different possibilities:

```
I = - Σ p_i * log(p_i)
```

This is the information content of one flash.

```
Red bulb burnt out:            p_red = 0,   p_green = 1
                               I = -(0 + 0) = 0
Red and green equiprobable:    p_red = 1/2, p_green = 1/2
                               I = -(2 * 1/2 * log(1/2)) = log(2)
Three colors, equiprobable:    p_i = 1/3
                               I = -(3 * 1/3 * log(1/3)) = log(3)
Green twice as likely as red:  p_red = 1/3, p_green = 2/3
                               I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)
```

This is the information content, or entropy, of the message. It is maximal when the different symbols are equiprobable. If you're a physicist you use the natural log; if you're a computer scientist you use log2 and get bits.
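These cases can be checked numerically (a small sketch using the natural log, so the results come out in nats rather than bits):

```python
from math import log

def info(probs):
    """Average information per symbol: -sum(p * log(p))."""
    return -sum(p * log(p) for p in probs if p > 0)

print(info([0, 1]) == 0)                # True   -> burnt-out bulb, no information
print(round(info([0.5, 0.5]), 4))       # 0.6931 =  log(2)
print(round(info([1/3, 1/3, 1/3]), 4))  # 1.0986 =  log(3)
print(round(info([1/3, 2/3]), 4))       # 0.6365 =  log(3) - 2/3*log(2)
```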


I really recommend you read about information theory, Bayesian methods and MaxEnt. The place to start is this (freely available online) book by David MacKay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are really far more general than just text mining, and I can't really see how one would learn to apply this to NLP without learning some of the general basics contained in this book or other introductory books on machine learning and MaxEnt Bayesian methods.

The connection between entropy and probability theory and information processing and storage is really, really deep. To give a taste of it, there's a theorem due to Shannon (the noisy-channel coding theorem) that relates the maximum amount of information you can pass without error through a noisy communication channel to the entropy of the noise process. There's also a theorem (source coding) that connects how much you can compress a piece of data, to occupy the minimum possible memory in your computer, to the entropy of the process that generated the data.

I don't think it's really necessary that you go and learn all those theorems of communication theory, but it's not possible to learn this without learning the basics of what entropy is, how it's calculated, what its relationship with information and inference is, etc.


When I was implementing an algorithm to calculate the entropy of an image, I found these links, see here and here.

This is the pseudo-code I used; you'll need to adapt it to work with text rather than images, but the principles should be the same.

```
// Loop over image array elements and count occurrences of each possible
// pixel-to-pixel difference value. Store these counts in prob_array
for j = 0, ysize-1 do begin
    for i = 0, xsize-2 do begin
        diff = array(i+1,j) - array(i,j)
        if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
        endif
    endfor
endfor

// Convert counts in prob_array to probabilities and compute entropy
n = total(prob_array)
entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n
    // Base-2 log of x is ln(x)/ln(2). Take the ln of the array element
    // here and divide the final sum by ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor
entrop = entrop/alog(2)
```

I got this code from somewhere, but I can't dig out the link.
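Adapted to text, the same idea might look like this in Python (a sketch: probabilities are estimated from character frequencies instead of pixel differences):

```python
from collections import Counter
from math import log2

def text_entropy(text):
    """Shannon entropy (bits per character) of a string."""
    counts = Counter(text)
    n = len(text)
    return -sum(c/n * log2(c/n) for c in counts.values())

print(text_entropy("aabb"))       # 1.0 bit per character
print(text_entropy("abcd"))       # 2.0 bits per character
print(text_entropy("aaaa") == 0)  # True: a constant string carries no information
```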


##### Comments

- A nice and intuitive solution math.stackexchange.com/questions/331103/…
- @all3fox: this is explained in the last paragraph; the process should stop for that particular branch if it gets to a *pure* node (a leaf node where all the instances belong to the same class, so it can't be split any further). The node thus predicts the only class it contains.
- @all3fox: in practice, going all the way to *pure nodes* produces quite deep decision trees that suffer from overfitting (i.e. trees that fit the training data too well, but generalize poorly to data not represented in the training set). Hence we usually stop when we reach a certain minimum number of instances in leaf nodes (and just predict the majority class), and/or perform some kind of pruning (see the Wikipedia links provided above to learn more).
- @Jas: this is well explained here: en.wikipedia.org/wiki/…
- @Rami: Right; to avoid problems like overfitting, smaller trees are preferred over larger ones (i.e. reaching decisions with fewer tests). Note that the heuristic by which splitting features are chosen is a greedy search algorithm, so the generated tree is not guaranteed to be the smallest possible one in the space of all possible trees (nor is it guaranteed to be the globally optimal one w.r.t. classification error). This is in fact an NP-complete problem...
- @Rami: Interestingly, there are ensemble learning methods that take a different approach. One idea is to randomize the learning algorithm by picking a random subset of features at each candidate split, building a bunch of these random trees, and averaging their results. Also worth checking out are algorithms like Random Forests.
- I had the same thoughts, Rafael. It's like asking "what is quantum physics?" on Stack Overflow: a very broad area that doesn't distill into a single answer well.
- There are so many different entropy() functions out there for images, but without good previews. How can you compare your code to MATLAB's own entropy() and to the code here: mathworks.com/matlabcentral/fileexchange/28692-entropy ? In the latter, the developer says it is for 1D signals, but users keep expanding it to 2D.
- Your entropy function assumes that the original signal is 2-bit, and it is rather simplistic. Assume it is an MIT-BIH arrhythmia ECG signal (11-bit) but generated for 2D images. I think you cannot use a simple 2-bit base here then.