Random forest definition
A random forest is a machine learning technique for predicting numeric values (regression) or sorting data into categories (classification). It is an ensemble of decision trees, each built from a random subset of the training data. Combining many trees makes the model more accurate and stable than any single tree.
See also: machine learning, training data, unsupervised machine learning, artificial intelligence, cluster analysis, anomaly-based detection
History of random forests
- 1980s-early 1990s: Decision trees, which would later become the building blocks for random forests, gained attention. J. Ross Quinlan's ID3 (1986) and C4.5 (1993) algorithms were among the most notable.
- 1996: Leo Breiman introduced the concept of bagging (bootstrap aggregating). He trained many decision trees on different bootstrap samples of the data and combined their predictions.
- 2001: Breiman named the method random forests. He improved on bagging by letting each split in a tree consider only a random subset of the features, which made the combined model more accurate.
- Early 2000s: Random forests became popular in the machine learning community for being accurate, easy to use, and versatile.
- 2000s-2010s: Researchers and practitioners further explored the model's capabilities. Its applications expanded to bioinformatics, finance, and environmental modeling.
- Ongoing: Random forests are a mainstay in machine learning. They are often cited in academic research, applied in industry, and serve as a benchmark in machine learning competitions.
How a random forest works
- 1. A random forest starts with bootstrap sampling: drawing many random samples from the original dataset with replacement. Each sample is typically the same size as the original, and the same data point can appear more than once within a sample and across samples.
- 2. A decision tree is built on each bootstrap sample. When splitting a node, the tree considers only a random subset of the features. This extra randomness decorrelates the trees so their errors don't all point the same way.
- 3. The model combines the trees' predictions into a final result. For classification, it takes the majority vote among the trees; for regression, it averages the trees' outputs (see the sketch after this list).
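The following is a minimal sketch of these three steps, assuming NumPy and scikit-learn are available. The class and parameter names (SimpleRandomForest, n_trees) are illustrative only, not a standard API; in practice you would use a library implementation such as scikit-learn's RandomForestClassifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    def __init__(self, n_trees=25, max_features="sqrt", random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n_samples = X.shape[0]
        for _ in range(self.n_trees):
            # Step 1: bootstrap sample - draw n_samples rows with replacement,
            # so the same data point can appear more than once.
            idx = self.rng.integers(0, n_samples, size=n_samples)
            # Step 2: grow a tree; max_features limits each split to a random
            # subset of the features.
            tree = DecisionTreeClassifier(
                max_features=self.max_features,
                random_state=int(self.rng.integers(1_000_000)))
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Step 3: aggregate - majority vote across all trees.
        votes = np.stack([t.predict(X) for t in self.trees]).astype(int)
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), axis=0, arr=votes)

X, y = load_iris(return_X_y=True)
forest = SimpleRandomForest().fit(X, y)
print(forest.predict(X[:5]))  # predicted classes for the first five samples
```

For regression, the same structure applies, except each tree would be a regression tree and the aggregation step would average the trees' outputs instead of taking a vote.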
Random forest applications
- 1. Classification and regression. Random forests can sort items into categories and predict numeric values in fields such as banking, healthcare, and stock markets.
- 2. Feature importance. They also reveal which input features contribute most to the predictions (see the sketch after this list).
- 3. Handling large, noisy datasets. Random forests work well with large, complex datasets with many input variables, and they are relatively robust to noise and outliers in the data.
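A short sketch of items 1 and 2 using scikit-learn's RandomForestClassifier, assuming scikit-learn is installed. The breast cancer dataset is used only as a convenient stand-in for a real classification problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Classification: fit a forest of 200 trees and evaluate on held-out data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Feature importance: higher values mean the feature contributed more to the
# trees' splits, which helps identify which inputs matter most.
names = load_breast_cancer().feature_names
top = sorted(zip(model.feature_importances_, names), reverse=True)[:5]
for importance, name in top:
    print(f"{name}: {importance:.3f}")
```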