Welcome to the lecture on ML algorithms. This topic was held until the third installment of this series so that a foundation for the concept of machine learning could be built first. At some point you are going to want to operationalize your knowledge of machine learning to actually do something, and for the vast majority of you one of these ML algorithms will be that something. Before we dive in, take a step back and consider a very real scenario. Within the general scientific community, getting different results every time you run the same experiment makes publishing difficult. That does not seem to stop authors in the ML space. Replication and the process of verifying scientific results are often difficult or impossible without similar setups and the same datasets, and within machine learning, where a variety of different algorithms exist, getting a variety of results is a very normal outcome. Researchers certainly seem to have gotten used to it. I’m not talking about post theory science, where publishing happens first and the findings are allowed to build knowledge instead of the other way around; I simply mean that you may very well get slightly different results every time one of these ML algorithms is invoked. You have been warned. Now let the adventure begin.
One of the few tweets that really made me think about the quality of ML research papers, and the research patterns impacting that quality, came from Yaroslav Bulatov, who works on the PyTorch team, back on January 22, 2022. That tweet referenced a 2021 paper on arXiv called “Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers” [1].


That paper digs into a state of affairs where hundreds of optimization methods exist, and it pulls together a really impressive list. The sheer volume of options available was striking, and it got me thinking about just how many people are contributing to this highly overcrowded field of machine learning. That paper about deep learning optimizers covers a lot of ground and would be a good place to start digging around. We are going to approach things a little differently here and look at the most common algorithms.
Here are ten very common ML algorithms (this is not intended to be an exhaustive list):
XGBoost
Naive Bayes algorithm
Linear regression
Logistic regression
Decision tree
Support Vector Machine (SVM) algorithm
K-nearest neighbors (KNN) algorithm
K-means
Random forest algorithm
Diffusion
I’m going to talk about each of these algorithms only briefly, or this would be a very long lecture. We certainly could go all hands and spend several hours together covering these topics in depth, but that is not going to happen today. To make this a more detailed syllabus version of the lecture, I’m including after each general introduction a few references to relevant papers you can access and read. My selected papers might not be the key paper or the most cited one for a given algorithm, so feel free to make suggestions if you feel a paper better represents it; I’m open to suggestions.
XGBoost - Some people would argue with a great deal of passion that we could probably be one and done after introducing this ML algorithm. You can freely download the package [2]. It has over 20,000 stars on GitHub and has been forked over 8,000 times [3]. People really seem to like this one and have used it to win competitions and generally get great results. Seriously, you will find references to XGBoost all over the place these days. It has gained a ton of attention and popularity. Not exactly to the level of being a pop culture reference, but within the machine learning community it is well known. The package is based on gradient boosting and provides parallel tree boosting (GBDT, GBM). It builds a series of tree models sequentially, with each new tree correcting the errors of the ensemble built so far, and it includes regularization to help keep that process from overfitting. You can read the 2016 paper about it on arXiv called “XGBoost: A Scalable Tree Boosting System” [4]. The bottom line on this one is that you get the benefits of gradient boosting built into a software package that can get you moving quickly toward your goal of success.
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794). https://dl.acm.org/doi/pdf/10.1145/2939672.2939785
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., & Chen, K. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4), 1-4. https://cran.microsoft.com/snapshot/2017-12-11/web/packages/xgboost/vignettes/xgboost.pdf
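If you want to see what reaching for XGBoost looks like in practice, here is a minimal sketch. It assumes the xgboost and scikit-learn Python packages are installed, and the synthetic dataset and hyperparameter values are purely illustrative rather than tuned recommendations.

```python
# Minimal XGBoost classification sketch on synthetic data (illustrative settings only).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# A made-up binary classification dataset stands in for whatever data you actually have.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gradient boosting: each new tree is fit to correct the errors of the ensemble built so far.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```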
Naive Bayes algorithm - You knew I would have to have something Bayes related near the top of this list. This one is a classifier that uses Bayes’ theorem to estimate the probability of each class given the features, under the simplifying assumption that the features are independent of each other within a class. The class with the highest resulting probability is taken as the prediction. I found a paper on this one that has been cited about 4,146 times called “An empirical study of the naive Bayes classifier” [5].
Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46). https://www.researchgate.net/profile/Irina-Rish/publication/228845263_An_Empirical_Study_of_the_Naive_Bayes_Classifier/links/00b7d52dc3ccd8d692000000/An-Empirical-Study-of-the-Naive-Bayes-Classifier.pdf
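To make that concrete, here is a minimal Gaussian Naive Bayes sketch using scikit-learn; the iris dataset is just a convenient stand-in for whatever data you actually care about.

```python
# Minimal Gaussian Naive Bayes sketch; features are modeled as independent given the class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one Gaussian per class and feature, then predict the class with the highest posterior probability.
clf = GaussianNB().fit(X_train, y_train)

print("class probabilities for one sample:", clf.predict_proba(X_test[:1])[0])
print("test accuracy:", clf.score(X_test, y_test))
```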
Linear regression - This is the most basic algorithm and statistical technique on the list. It fits a line (hence linear) that models the relationship between an input variable and an output variable so you can make predictions. A lot of the graphics you will see where data points are scattered across a chart with a single line running through the middle of the distribution are potentially showing some form of linear regression.
Forkuor, G., Hounkpatin, O. K., Welp, G., & Thiel, M. (2017). High resolution mapping of soil properties using remote sensing variables in south-western Burkina Faso: a comparison of machine learning and multiple linear regression models. PloS one, 12(1), e0170478. https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170478&type=printable
Maulud, D., & Abdulazeez, A. M. (2020). A review on linear regression comprehensive in machine learning. Journal of Applied Science and Technology Trends, 1(4), 140-147. https://jastt.org/index.php/jasttpath/article/view/57/20
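Here is a minimal sketch of fitting a line with scikit-learn; the slope, intercept, and noise level in the synthetic data are made up purely for illustration.

```python
# Minimal linear regression sketch: recover a line from noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * x[:, 0] + 1.0 + rng.normal(scale=1.0, size=100)  # true relationship: y = 2.5x + 1 plus noise

model = LinearRegression().fit(x, y)

print("estimated slope:", model.coef_[0], "estimated intercept:", model.intercept_)
print("prediction at x = 4:", model.predict([[4.0]])[0])
```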
Logistic regression - This type of statistical model estimates the probability of a binary outcome such as success or failure, and you could model other yes-or-no style questions the same way. The good folks over at IBM have an entire set of pages set up to run through how logistic regression can be a tool to help with decision making [6]. This model shows up everywhere in simple analyses where people are trying to work toward a single decision.
Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology, 110, 12-22. https://www.researchgate.net/profile/Ewout-Steyerberg/publication/331028284_A_systematic_review_shows_no_performance_benefit_of_machine_learning_over_logistic_regression_for_clinical_prediction_models/links/5c66bed192851c1c9de3251b/A-systematic-review-shows-no-performance-benefit-of-machine-learning-over-logistic-regression-for-clinical-prediction-models.pdf
Dreiseitl, S., & Ohno-Machado, L. (2002). Logistic regression and artificial neural network classification models: a methodology review. Journal of biomedical informatics, 35(5-6), 352-359. https://core.ac.uk/download/pdf/82131402.pdf
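As a quick illustration, here is a minimal logistic regression sketch in scikit-learn on synthetic binary data; nothing about the dataset or settings is meant as a recommendation.

```python
# Minimal logistic regression sketch: model the probability of a binary outcome.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns the modeled probability of each outcome for a sample.
print("P(class 0), P(class 1) for one sample:", clf.predict_proba(X_test[:1])[0])
print("test accuracy:", clf.score(X_test, y_test))
```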
Decision tree - Imagine diagramming decisions and coming to a fork where you have to go one way or the other. That is how decision trees work: inputs get routed through a series of those forks until they arrive at an output. Normally you will have a bunch of interconnected forks in the road, and together they form a decision tree. A lot of really great explanations of this exist online. One of my favorites is from Towards Data Science and was published way back in 2017 [7].
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms (pp. 0-13). Technical report, Department of Computer Science, Oregon State University. https://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.38.2702
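Here is a minimal decision tree sketch in scikit-learn; the export_text helper prints the forks in the road directly, and the depth limit of 2 is only there to keep the printout short.

```python
# Minimal decision tree sketch; the printed rules are the "forks" described above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each branch below is a feature threshold that routes samples left or right.
print(export_text(tree, feature_names=list(data.feature_names)))
```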
Support Vector Machine (SVM) algorithm - Imagine graphing out a bunch of data points and then trying to come up with a line (or, in more than two dimensions, a hyperplane) that separates the classes with the maximum possible margin [8].
Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565-1567. https://www.ifi.uzh.ch/dam/jcr:00000000-7f84-9c3b-ffff-ffffc550ec57/what_is_a_support_vector_machine.pdf
Wang, L. (Ed.). (2005). Support vector machines: theory and applications (Vol. 177). Springer Science & Business Media. https://personal.ntu.edu.sg/elpwang/PDF_web/05_SVM_basic.pdf
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28. https://www.ifi.uzh.ch/dam/jcr:00000000-7f84-9c3b-ffff-ffffbdb9a74e/SVM.pdf
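Here is a minimal linear SVM sketch in scikit-learn on synthetic blob data; a real application would also involve feature scaling and kernel or parameter choices that are glossed over here.

```python
# Minimal linear SVM sketch: find the separating line with the maximum margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=3)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the points that sit closest to the separating boundary.
print("support vectors per class:", clf.n_support_)
print("predicted class for the first point:", clf.predict(X[:1])[0])
```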
K-nearest neighbors (KNN) algorithm - Our friends over at IBM are sharing all sorts of knowledge online, including a bit about the KNN algorithm [9]. Apparently, the best commentary explaining this one comes from Sebastian Raschka back in the fall of 2018 [10]. It is pretty much what you would expect from a technique built on the distance between neighboring points: a new point gets classified based on the labels of its k closest neighbors.
Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4(2), 1883. http://scholarpedia.org/article/K-nearest_neighbor
Zhang, M. L., & Zhou, Z. H. (2005, July). A k-nearest neighbor based algorithm for multi-label classification. In 2005 IEEE international conference on granular computing (Vol. 2, pp. 718-721). IEEE. https://www.researchgate.net/profile/Min-Ling-Zhang-2/publication/4196695_A_k-nearest_neighbor_based_algorithm_for_multi-label_classification/links/565d98f408ae1ef92982f866/A-k-nearest-neighbor-based-algorithm-for-multi-label-classification.pdf
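Here is a minimal KNN sketch in scikit-learn; k = 5 is just a common starting point, not a recommendation for your particular data.

```python
# Minimal k-nearest neighbors sketch: label a point by a vote of its closest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is labeled by a majority vote among its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```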
K-means - Some algorithms work by finding clusters, and K-means is one of those. It partitions unlabeled data into k clusters by repeatedly assigning each point to the nearest cluster center and then recomputing those centers, which can be helpful when you want structure out of data that has no labels.
Sinaga, K. P., & Yang, M. S. (2020). Unsupervised K-means clustering algorithm. IEEE access, 8, 80716-80727. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9072123
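Here is a minimal K-means sketch in scikit-learn; choosing three clusters matches how the synthetic data was generated, which is a luxury you rarely have with real unlabeled data.

```python
# Minimal K-means sketch: partition unlabeled points into k clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)  # the true labels are discarded

# Alternate between assigning points to the nearest center and recomputing the centers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("cluster assignments of the first five points:", kmeans.labels_[:5])
```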
Random forest algorithm - Most of the jokes told within the machine learning space relate to decision trees. The field is not full of a lot of jokes, but trees falling in a random forest are often included in that branch. People really liked the random forest algorithm for a time. You can imagine that a whole bunch of trees are created, each trained on a random sample of the data and random subsets of the features, and the forest makes its prediction by combining the votes of all those trees rather than crowning a single winning tree. That built-in randomness is great because it can surface a novel or unexpected result while keeping any one overfit tree from dominating.
Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227. https://arxiv.org/pdf/1511.05741.pdf
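Here is a minimal random forest sketch in scikit-learn; notice that the prediction comes from combining many randomized trees rather than selecting one best tree.

```python
# Minimal random forest sketch: an ensemble of randomized decision trees votes on the answer.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# Each tree sees a bootstrap sample of the rows and random subsets of the features.
forest = RandomForestClassifier(n_estimators=100, random_state=5).fit(X_train, y_train)

print("number of trees in the forest:", len(forest.estimators_))
print("test accuracy:", forest.score(X_test, y_test))
```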
Diffusion - I previously covered diffusion back in week 79 to try to figure out why it is becoming so popular. It is in no way as popular as XGBoost, but it has been gaining ground. Over in the field of thermodynamics you could study gas molecules: how they diffuse from a high density area to a low density area, and what it would take to reverse that process. That is the basic theoretical intuition you need to absorb at the moment. Within machine learning, people have been building models that gradually add noise to data over a series of steps and then learn to reverse that process, recovering structure from noise. That is basically the diffusion process in a nutshell. As you can imagine, doing this is computationally expensive.
Wei, Q., Jiang, Y., & Chen, J. Z. (2018). Machine-learning solver for modified diffusion equations. Physical Review E, 98(5), 053304. https://arxiv.org/pdf/1808.04519.pdf
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794. https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
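Training an actual diffusion model means teaching a neural network to run the reverse (denoising) process, which is well beyond a few lines, but the forward (noising) half of the story can be sketched in plain NumPy. This toy example uses schedule values in the style of the common linear schedule from the diffusion literature purely as an illustration, not as a reference implementation of any particular model.

```python
# Toy forward-diffusion sketch: gradually noise a simple 1D "dataset" toward pure Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, scale=0.5, size=10_000)  # stand-in data distribution

T = 1000                                # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)      # per-step noise levels (a common linear schedule)
alphas_bar = np.cumprod(1.0 - betas)    # cumulative fraction of the original signal retained

def q_sample(x0, t):
    """Sample the noised data x_t directly from x_0 using the closed-form forward process."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early steps barely change the data; by the final step it is essentially standard Gaussian noise.
for t in (0, T // 2, T - 1):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  mean={xt.mean():6.3f}  std={xt.std():5.3f}")

# A generative diffusion model would then train a network to predict and remove that noise step by step.
```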
Wrapping this lecture up should be pretty straightforward. Feel free to dig into some of those papers if anything grabbed your attention this week. A lot of algorithms exist in the machine learning space. I tried to pick algorithms that are timeless and will remain relevant as the field keeps moving forward.
Links and thoughts:
“[ML News] BLOOM: 176B Open-Source | Chinese Brain-Scale Computer | Meta AI: No Language Left Behind”
“Is Intel ARC REALLY Canceled? - WAN Show July 29, 2022”
Top 5 Tweets of the week:
Footnotes:
[1] https://arxiv.org/pdf/2007.01547.pdf
[2] https://xgboost.ai/
[3] https://github.com/dmlc/xgboost
[4] https://arxiv.org/pdf/1603.02754.pdf
[5] https://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf
[6] https://www.ibm.com/topics/logistic-regression
[7] https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
[9] https://www.ibm.com/topics/knn
[10] https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf
Research Note:
You can find the files for the syllabus being built on GitHub. The latest version of the draft is exported and shared whenever changes are made. https://github.com/nelslindahlx/Introduction-to-machine-learning-syllabus-2022
What’s next for The Lindahl Letter?
Week 83: Machine learning Approaches (ML syllabus edition 4/8)
Week 84: Neural networks (ML syllabus edition 5/8)
Week 85: Neuroscience (ML syllabus edition 6/8)
Week 86: Ethics, fairness, bias, and privacy (ML syllabus edition 7/8)
Week 87: MLOps (ML syllabus edition 8/8)
I'll try to keep the what’s next list forward looking with at least five weeks of posts in planning or review. If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.