Episode 9: ML Notes (Part II): Fundamentals of Decision Tree Theory
These days I am reading the book “Machine Learning” by Tom M. Mitchell and I will be documenting some important concepts from its chapters in a series of blog posts.
Check out “Part I” with notes from Chapters I and II if you haven’t already!
[Background]:
Decision trees iteratively partition the data and generate rules that help us make predictions.

Concepts:
- What are the different decision tree algorithms and how do they differ from each other? [2]
Iterative Dichotomiser 3 (ID3):
- Developed by Ross Quinlan in 1986.
- Starts with an empty tree and iteratively constructs a final tree, which is supposedly the shortest tree consistent with the training data.
- Addresses overfitting with post-pruning.
- Uses information gain to select attributes.
- Works only with discrete-valued attributes.
C4.5:
- Successor of ID3 [3].
- Works with both discrete and continuous-valued attributes (continuous attributes are handled by partitioning them into discrete intervals).
- Converts the learned tree into a set of if-then rules.
- Has a special technique for handling missing values: if an attribute value is missing, it assigns a probability to each possible value of that attribute [3, 4].
CART (Classification and Regression Trees):
- Similar to C4.5, but in addition to numerical-valued attributes it also supports numerical target values (regression), not just discrete-valued targets.
- Creates binary trees, as the quick demo below illustrates.
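To see CART in action, here is a minimal sketch using scikit-learn, whose DecisionTreeClassifier and DecisionTreeRegressor implement an optimized version of CART; the toy regression target below is an arbitrary choice made purely for illustration:

```python
# Quick demo: scikit-learn's tree module implements an optimized
# version of CART, so it builds binary trees and supports both
# classification and regression.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification: discrete target classes.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Regression: CART also supports numerical targets; here we
# (arbitrarily) predict petal width from the other three features.
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X[:, :3], X[:, 3])
print("regression R^2:", reg.score(X[:, :3], X[:, 3]))
```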
- What are the different ways in which entropy can be understood?
The ID3 decision tree algorithm works by finding the best attribute to use at each node of the tree. It selects the attribute with the highest information gain, a statistical measure. Information gain is defined in terms of entropy, a measure that characterizes the impurity of a set of examples.
An entropy of 0 means all the examples in the set belong to the same target class (pure data; impurity is 0).
An entropy of 1 means half of the examples belong to one class and the other half to the other class (maximally impure data, in the binary-classification case).
Entropy can also be interpreted as the minimum number of bits needed to encode the target class of a given instance.
The information gain of an attribute measures the drop in entropy (impurity) when the data is split using this attribute. It tells us the number of bits saved in encoding the target class of an instance if this attribute is used to split the data, as sketched below.
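Since the definitions above are compact, here is a minimal Python sketch of both quantities; the “wind” attribute and labels are hypothetical toy data in the spirit of Mitchell’s PlayTennis example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes i of p_i * log2(p_i),
    where p_i is the proportion of examples in class i."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of A of
    (|S_v| / |S|) * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical toy data in the spirit of Mitchell's PlayTennis example:
examples = [{"wind": "weak"}, {"wind": "strong"},
            {"wind": "weak"}, {"wind": "weak"}]
labels = ["yes", "no", "yes", "no"]
print(entropy(labels))                             # 1.0: a 50/50 split
print(information_gain(examples, labels, "wind"))  # ~0.31 bits saved
```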
- Why is ID3 an example of inductive learning?
Constructing a decision tree with the ID3 algorithm is essentially a search problem: we are finding, from a large set of possible trees (or hypotheses), the tree that best fits the training data; the learned tree is then used inductively to classify unseen instances. This search, in the case of ID3, is termed a “simple-to-complex”, “hill-climbing” search. It is called simple-to-complex since we start with an empty tree and end up with a final tree consisting of many nodes. Hill-climbing means that it moves incrementally toward an optimal solution, which may be a locally rather than globally optimal one. A minimal sketch of this greedy construction follows.
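To make the simple-to-complex, hill-climbing nature of the search concrete, here is a minimal ID3 sketch; it reuses the entropy and information_gain helpers from the previous snippet, and represents examples as dicts mapping attribute names to discrete values:

```python
from collections import Counter

def id3(examples, labels, attributes):
    """Greedy, simple-to-complex tree construction. Returns either a
    class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    # All examples share one class: return a pure leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Hill-climbing step: commit to the single attribute with the
    # highest information gain; ID3 never backtracks on this choice.
    best = max(attributes,
               key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        rows = [(ex, lab) for ex, lab in zip(examples, labels)
                if ex[best] == v]
        tree[best][v] = id3([ex for ex, _ in rows],
                            [lab for _, lab in rows],
                            [a for a in attributes if a != best])
    return tree

# Continuing the toy example above:
print(id3(examples, labels, ["wind"]))
# {'wind': {'weak': 'yes', 'strong': 'no'}}
```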
- What is the inductive bias of the ID3 algorithm?
It prefers shorter trees over longer trees, since it is a simple-to-complex algorithm. Among trees of similar size, it prefers those that place the attributes with the highest information gain closest to the root.
- What is the difference between preference bias and restriction bias?
Preference (or search) bias: the inductive bias of the ID3 algorithm is an example of preference or search bias. This type of bias stems from the search strategy of the algorithm. ID3 searches a complete hypothesis space, but it stops searching as soon as it finds the first acceptable hypothesis, even though there might be several other equally good or even better hypotheses present.
Restriction (or language) bias: here the algorithm searches its hypothesis space completely, finding all the hypotheses consistent with the data, but the hypothesis space itself is chosen based on some assumptions and is therefore incomplete. An example of an algorithm that uses this type of bias is the Candidate-Elimination algorithm.
Preference bias is often preferred over restriction bias, since it works with a complete hypothesis space instead of an incomplete one.
- Give an example of an algorithm that uses both preference and restriction bias.
Linear Regression.
It uses restriction bias because its hypothesis space consists only of linear functions (and not, for example, non-linear functions), so it is an incomplete hypothesis space.
It uses preference bias because it uses the least mean squares (LMS) algorithm to find the optimal solution, which performs an ordered search through the set of all possible parameter values, as in the sketch below.
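As an illustration of that ordered search, here is a minimal LMS sketch for a one-dimensional linear hypothesis; the learning rate, epoch count, and toy data are all hypothetical choices:

```python
def lms_fit(xs, ys, lr=0.01, epochs=1000):
    """Least mean squares via gradient descent: an ordered, incremental
    search through the parameter space (w, b) of the hypothesis
    h(x) = w * x + b, always moving downhill on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w * x + b)   # residual under the current hypothesis
            w += lr * err * x       # LMS (Widrow-Hoff) update rule
            b += lr * err
    return w, b

# Hypothetical toy data drawn from y = 2x + 1:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(lms_fit(xs, ys))  # approaches (2.0, 1.0)
```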
- What are some of the limitations of the ID3 algorithm?
- It maintains only a single current hypothesis and cannot tell whether there are alternative trees that are also consistent with the training examples.
- It has no mechanism to backtrack: once it chooses a certain attribute for a node, it cannot go back and reconsider that choice. This means it has a tendency to converge to locally rather than globally optimal solutions (it is a hill-climbing algorithm). A common remedy is to post-prune the trees.
- What is Occam’s razor?
This is the inductive bias introduced by William of Occam around 1320. It says: “Prefer the simplest hypothesis that fits the data.” It is also called the “law of parsimony”. It is often used to describe a mental model that encourages looking for the simplest solution to a problem that works for a given set of circumstances, rather than an over-complicated one. Read this fun article to learn more about it: “How to Use Occam’s Razor Without Getting Cut”.
- What is reduced-error pruning?
It is a technique to fix over-fitting in decision trees. After the tree is constructed, each node is considered for pruning based on the results over a held-out validation set. A node is removed only if the tree’s accuracy over the validation set remains the same or improves when the node is removed. Pruning proceeds iteratively, considering each node in turn, and stops when further pruning would make accuracy worse. A simplified sketch follows.
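Here is a simplified sketch of the idea, operating on the dict-based trees produced by the ID3 snippet above. Note one assumption: instead of Mitchell’s procedure of repeatedly pruning whichever node most improves validation accuracy, this version makes a single greedy bottom-up pass, which keeps the code short:

```python
from collections import Counter

def classify(tree, example, default):
    """Walk a dict-based tree ({attribute: {value: subtree}}) to a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(example[attribute], default)
    return tree

def accuracy(tree, examples, labels, default):
    hits = sum(classify(tree, ex, default) == lab
               for ex, lab in zip(examples, labels))
    return hits / len(labels)

def reduced_error_prune(tree, examples, labels, val_examples, val_labels):
    """Bottom-up pruning: replace a subtree with a majority-class leaf
    whenever that does not decrease accuracy on the validation set.
    `examples`/`labels` are the training data that reach this node."""
    if not isinstance(tree, dict):
        return tree  # already a leaf
    attribute, branches = next(iter(tree.items()))
    # Prune the children first, routing data down each branch.
    for value in list(branches):
        tr = [(ex, lab) for ex, lab in zip(examples, labels)
              if ex[attribute] == value]
        va = [(ex, lab) for ex, lab in zip(val_examples, val_labels)
              if ex[attribute] == value]
        branches[value] = reduced_error_prune(
            branches[value],
            [ex for ex, _ in tr], [lab for _, lab in tr],
            [ex for ex, _ in va], [lab for _, lab in va])
    if not val_labels or not labels:
        return tree  # no data reaches this node; leave it alone
    # Compare the subtree against a single majority-class leaf.
    majority = Counter(labels).most_common(1)[0][0]
    keep = accuracy(tree, val_examples, val_labels, majority)
    as_leaf = sum(lab == majority for lab in val_labels) / len(val_labels)
    return majority if as_leaf >= keep else tree
```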
- Describe one problem with using information gain as a statistical measure.
Information gain favors attributes that can take many values (e.g., “Date”) over attributes that can take few values (e.g., “Gender”). This is the natural bias of information gain. It can be avoided by using other measures such as gain ratio, which penalizes many-valued attributes using split information, as sketched below.
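A minimal sketch of split information and gain ratio, reusing the information_gain helper from the entropy snippet above:

```python
from collections import Counter
from math import log2

def split_information(examples, attribute):
    """SplitInformation(S, A): the entropy of S with respect to the
    values of attribute A itself (large when A has many values)."""
    total = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(examples, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), which
    penalizes many-valued attributes such as "Date"."""
    si = split_information(examples, attribute)
    return information_gain(examples, labels, attribute) / si if si else 0.0
```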
Some classic, old-but-gold papers on decision trees by the ML gurus themselves:
- Induction of Decision Trees by J.R. Quinlan (1986)
- An Empirical Comparison of Pruning Methods for Decision Tree Induction by John Mingers (1989)
- Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms by Thomas G. Dietterich and Eun Bae Kong (1995)
- Symbolic and Neural Learning Algorithms: An Experimental Comparison by Jude W. Shavlik, Raymond J. Mooney, and Geoffrey G. Towell (1991)
Thought of the Week:
This week’s picks from articles on COVID-19:
- The Shift Americans Must Make to Fight the Coronavirus by Meghan O’Rourke [5]
“Americans today are not used to the notion that an infectious disease we have no treatment for might sweep through the nation. It startles the mind that our high-tech, modern medical system, which routinely wrests sick people from the arms of death, could actually be overwhelmed by a challenging but not remarkably deadly virus. One hundred or so years ago, most deaths in the U.S. were caused by infectious disease. Today, most are caused by chronic conditions. This is new to us…But we also live in a country stubbornly hung up on a damaging idea of self-reliance, a nation pathologically invested in the idea that we should all “just do it”—an attitude that challenges us to muscle through it—whatever it might be. We have no shared discourse for the idea that the hard thing to do, the truly challenging thing to do, might be to do less in order to help another…We are so addicted to the concept of individual responsibility that we have a fragmented health-care system, a weak social safety net, and a culture of averting our eyes from other people’s physical vulnerability…Unfortunately, there is no prescription for the “right” thing to do, no answer to the question of whether to have the playdate or fly to Florida right now. But there is an ethos to cling to…To be ill is to know our interconnectedness, but to be ill in America today is to be brought up against the pathology of a culture that denies this fact…No person is an island; the nation that believes in individuals more than it values community risks its own survival. Accepting and embracing this might be the tougher path than muscling through, each by each, whatever comes.”
- Why the Coronavirus Has Been So Successful by Ed Yong [6]
“One of the few mercies during this crisis is that, by their nature, individual coronaviruses are easily destroyed. Each virus particle consists of a small set of genes, enclosed by a sphere of fatty lipid molecules, and because lipid shells are easily torn apart by soap, 20 seconds of thorough hand-washing can take one down. Lipid shells are also vulnerable to the elements; a recent study shows that the new coronavirus, SARS-CoV-2, survives for no more than a day on cardboard, and about two to three days on steel and plastic. These viruses don’t endure in the world. They need bodies. “The virus has been remarkably stable given how much transmission we’ve seen,” says Lisa Gralinski of the University of North Carolina. “That makes sense, because there’s no evolutionary pressure on the virus to transmit better. It’s doing a great job of spreading around the world right now.”…Researchers can, however, offer a preliminary account of what the new coronavirus does to the people it infects. Once in the body, it likely attacks the ACE2-bearing cells that line our airways. Dying cells slough away, filling the airways with junk and carrying the virus deeper into the body, down toward the lungs. As the infection progresses, the lungs clog with dead cells and fluid, making breathing more difficult. (The virus might also be able to infect ACE2-bearing cells in other organs, including the gut and blood vessels.) The immune system fights back and attacks the virus; this is what causes inflammation and fever. But in extreme cases, the immune system goes berserk, causing more damage than the actual virus. For example, blood vessels might open up to allow defensive cells to reach the site of an infection; that’s great, but if the vessels become too leaky, the lungs fill even more with fluid.”
- The Coronavirus Is a Disaster for Feminism by Helen Lewis [7]
“When people try to be cheerful about social distancing and working from home, noting that William Shakespeare and Isaac Newton did some of their best work while England was ravaged by the plague, there is an obvious response: Neither of them had childcare responsibilities…For those with caring responsibilities, an infectious-disease outbreak is unlikely to give them time to write King Lear or develop a theory of optics…A pandemic magnifies all existing inequalities (even as politicians insist this is not the time to talk about anything other than the immediate crisis). Working from home in a white-collar job is easier; employees with salaries and benefits will be better protected; self-isolation is less taxing in a spacious house than a cramped apartment…In both rich and poor countries, campaigners expect domestic-violence rates to rise during lockdown periods. Stress, alcohol consumption, and financial difficulties are all considered triggers for violence in the home, and the quarantine measures being imposed around the world will increase all three…They also worry that opportunities to collect high-quality data which will be useful for the future are being missed…We shouldn’t make that mistake again. Grim as it is to imagine now, further epidemics are inevitable, and the temptation to argue that gender is a side issue, a distraction from the real crisis, must be resisted. What we do now will affect the lives of millions of women and girls in future outbreaks.”
- DeepMind’s Protein Folding AI Is Going After Coronavirus by Shelly Fan [8]
“How a protein “looks” in 3D is essential for developing new drugs, especially for new viruses. COVID-19, for example, has really spikey proteins that jut out from its surface. Normally, human cells don’t care—they won’t let the virus inside. But COVID-19’s spikey proteins also harbor a Trojan Horse that “activates” it in certain cells with a complementary component. Lung cells have an abundance of these factors, which is why they’re susceptible to invasion. Bottom line: if a drug is going to “fit” into a protein like a key into a lock to trigger a whole cascade of nasty reactions, then the first step is to figure out the structure of the lock. That’s what DeepMind’s AlphaFold is doing…China’s long-time Google surrogate and AI behemoth, Baidu, is using an algorithm to predict the structure of another important biomolecule, mRNA. mRNA shuttles information from the genome to protein factories, so shoot the mRNA messenger, then the viral proteins are never born. Similarly, AI could one day potentially predict epidemics and how a virus changes over time—but it will only help if there’s enough trust to listen to the models.”
#STAYHOME
References:
[1] Machine Learning by Tom M. Mitchell
[2] Tree algorithms: ID3, C4.5, C5.0 and CART
[3] C4.5: Programs for Machine Learning
[4] In simple language, how does C4.5 deal with missing values?
[5] The Shift Americans Must Make to Fight the Coronavirus
[6] Why the Coronavirus Has Been So Successful
[7] The Coronavirus Is a Disaster for Feminism
[8] DeepMind’s Protein Folding AI Is Going After Coronavirus