Edgar Anderson’s iris data set is well known and freely available on the web as well as in R. We use it here to show some basic techniques in Mathematica.
We start by importing the iris data set from the UCI repository.
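A minimal sketch of the import step. The URL below is an assumption (the long-standing location of the UCI file) and may need adjusting; the name `iris` is chosen here for illustration.

```mathematica
(* Assumed location of the classic UCI file; adjust if the repository has moved it *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";

(* The file has no header row and may end with blank lines, so keep only complete rows *)
iris = Select[Import[url, "CSV"], Length[#] == 5 &];

Length[iris]   (* number of flowers *)
First[iris]    (* one row: four measurements followed by the species name *)
```

Filtering on the row length is a defensive choice: it drops any empty trailing rows without touching the measurements.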
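The data has the following properties: sepal length, sepal width, petal length and petal width (all in cm), and the species.
We combine these property names with the data into a dictionary (an association in Mathematica)
as well as into a dataset.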
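The dictionary and dataset forms could look as follows. This is a sketch: the key names and the variables `names`, `columns` and `ds` are hypothetical, and the URL is the assumed UCI location.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

(* Hypothetical key names for the five columns *)
names = {"SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"};

(* Dictionary keyed by property: each key maps to a whole column *)
columns = AssociationThread[names, Transpose[iris]];
Keys[columns]

(* Dataset: one association per flower *)
ds = Dataset[AssociationThread[names, #] & /@ iris];
ds[1 ;; 5]
```

The column-wise association is handy for statistics on a single property, while the row-wise `Dataset` is easier to browse and query.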
Looking, for example, at the sepal length, we can plot a histogram.
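A possible version of the histogram, assuming the sepal length is the first column of the imported rows (as it is in the UCI file). The `"PDF"` height specification is a choice made here so that the bars can be compared with density functions later.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

sepalLength = iris[[All, 1]];   (* first column: sepal length in cm *)

Histogram[sepalLength, Automatic, "PDF",
 AxesLabel -> {"sepal length (cm)", "density"}]
```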
The histogram is only a picture, but one can also obtain the underlying distribution as a function,
which reflects the distribution of the sepal length; most of the flowers have a sepal length of around 6 cm.
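One way to turn the histogram into a function is `HistogramDistribution`, whose density is piecewise constant over the histogram bins. Whether the original notebook used this or another distribution is not known, so treat it as an assumption.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];
sepalLength = iris[[All, 1]];

(* A distribution object whose density follows the histogram bars *)
histDist = HistogramDistribution[sepalLength];

(* The density is now an ordinary function of x *)
Plot[PDF[histDist, x], {x, 4, 8}, Filling -> Axis,
 AxesLabel -> {"sepal length (cm)", "density"}]
```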
You can ask questions about the distribution, for example what the probability is that the length is less than 6 cm,
or what the standard deviation is.
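Both questions can be put to a distribution object directly. The sketch below uses `EmpiricalDistribution`; the variable names are illustrative and the URL is assumed.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];
sepalLength = iris[[All, 1]];

empDist = EmpiricalDistribution[sepalLength];

(* Fraction of flowers with a sepal length below 6 cm *)
Probability[x < 6, x \[Distributed] empDist]

(* Standard deviation of the empirical distribution *)
StandardDeviation[empDist]
```

Note that `StandardDeviation[sepalLength]` applied to the raw list uses the sample normalization, so it can differ slightly from the value for the distribution object.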
The empirical probability distribution reflects the histogram and is based simply on counting the occurrences in the data.
If you wish to treat the data as a sample from a particular (continuous) distribution, you can fit, say, a gamma distribution and obtain its parametrization.
Alternatively, you can ask for the fitted parameters directly.
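The two routes to a fitted gamma distribution might look like this. The symbols `a` and `b` are placeholders for the shape and scale parameters.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];
sepalLength = iris[[All, 1]];

(* Returns a GammaDistribution with the estimated parameters filled in *)
gammaFit = EstimatedDistribution[sepalLength, GammaDistribution[a, b]]

(* Returns only the parameter estimates, as a list of rules for a and b *)
FindDistributionParameters[sepalLength, GammaDistribution[a, b]]
```

The first form is convenient when the fitted distribution is used afterwards (for `PDF`, `Probability` and so on); the second when only the numbers are of interest.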
A comparison with a normal distribution shows that the Gaussian is not a bad fit.
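A sketch of such a comparison, assuming a normal distribution fitted with `EstimatedDistribution`. The original notebook may have used a different diagnostic; here the fitted density is overlaid on the histogram and a quantile plot is drawn.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];
sepalLength = iris[[All, 1]];

(* Fit a normal distribution; m and s are placeholder symbols *)
normalFit = EstimatedDistribution[sepalLength, NormalDistribution[m, s]];

(* Fitted density drawn on top of the normalized histogram *)
Show[
 Histogram[sepalLength, Automatic, "PDF"],
 Plot[PDF[normalFit, x], {x, 4, 8}, PlotStyle -> Thick]]

(* Points close to the reference line indicate a reasonable fit *)
QuantilePlot[sepalLength, normalFit]
```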
This can also be presented as a discrete plot.
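It is not clear from the text which quantity was plotted. As one plausible reading, the sketch below samples the density of a fitted normal distribution at discrete steps; the step size of 0.2 cm is arbitrary.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];
sepalLength = iris[[All, 1]];

normalFit = EstimatedDistribution[sepalLength, NormalDistribution[m, s]];

(* Density of the fitted normal distribution sampled every 0.2 cm *)
DiscretePlot[PDF[normalFit, x], {x, 4, 8, 0.2},
 AxesLabel -> {"sepal length (cm)", "density"}]
```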
Let’s look at whether the sepal length and width cluster and reflect the species to which the flowers belong.
A simple list plot does not produce a clear clustering.
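The plain scatter plot can be sketched as follows, assuming the first two columns hold the sepal length and width.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

(* Sepal length against sepal width, without any species information *)
ListPlot[iris[[All, {1, 2}]],
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```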
The data set contains three clusters of equal size (50 flowers per species),
and an exact clustering can be obtained by grouping the data by species.
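Grouping by the species column gives the exact clusters. A minimal sketch, with `bySpecies` as an illustrative name:

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

(* Number of flowers per species *)
Counts[iris[[All, 5]]]

(* Key: species (last column); value: list of {sepal length, sepal width} pairs *)
bySpecies = GroupBy[iris, Last -> (Take[#, 2] &)];

(* One color per species *)
ListPlot[Values[bySpecies], PlotLegends -> Keys[bySpecies],
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```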
If we try automatic clustering, we get almost the same result.
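Automatic clustering could be done with `FindClusters`. Asking for three clusters is an assumption based on the known number of species, and the result may depend on the clustering method the system selects.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

points = iris[[All, {1, 2}]];   (* sepal length and width only *)

(* Ask for three clusters; the species labels are not used *)
clusters = FindClusters[points, 3];

Length /@ clusters   (* cluster sizes, to compare with the species counts *)

ListPlot[clusters,
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```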
The points that differ from the exact clustering lie where the species overlap, so no algorithm working from the sepal measurements alone could be expected to assign them correctly.
Using the built-in machine learning functions, we can train a classifier on the sepal data.
By looking at the clusters above we anticipate that a flower with a sepal length of 5 cm and a sepal width of 4 cm should be an Iris-setosa. This is indeed what the logistic regression predicts,
while the value {6.2, 3} sits close to the boundary between two species.
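A sketch of the classifier, assuming `Classify` with the method fixed to logistic regression (the text names this method, but whether it was set explicitly or chosen automatically is not known).

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

(* Features: {sepal length, sepal width}; labels: species *)
classifier = Classify[iris[[All, {1, 2}]] -> iris[[All, 5]],
   Method -> "LogisticRegression"];

(* The flower the text expects to be an Iris-setosa *)
classifier[{5, 4}]

(* Class probabilities for the point near the species boundary *)
classifier[{6.2, 3}, "Probabilities"]
```

Asking for `"Probabilities"` rather than a single label makes the ambiguity of the second point visible.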
Given the sepal length and the species, we can predict the sepal width through machine learning as well.
Again, by looking at the clustering above we anticipate an Iris-setosa with a sepal length of 4.5 cm to have a sepal width of around 3 cm.
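The prediction might be set up with `Predict` as below. Fixing the method to linear regression is an assumption made here; the species string is treated as a nominal feature.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

(* Features: {sepal length, species}; target: sepal width *)
predictor = Predict[iris[[All, {1, 5}]] -> iris[[All, 2]],
   Method -> "LinearRegression"];

(* Predicted sepal width for an Iris-setosa with a sepal length of 4.5 cm *)
predictor[{4.5, "Iris-setosa"}]
```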
That the prediction algorithm is a linear regression can be seen in the plot of the prediction for a fixed species.
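To see the linear behaviour, the predictor can be plotted against the sepal length with the species held fixed. The sketch retrains the predictor under the same assumptions (explicit linear regression, assumed URL).

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

predictor = Predict[iris[[All, {1, 5}]] -> iris[[All, 2]],
   Method -> "LinearRegression"];

(* Predicted sepal width as a function of sepal length for one fixed species *)
Plot[predictor[{x, "Iris-setosa"}], {x, 4, 8},
 AxesLabel -> {"sepal length (cm)", "predicted sepal width (cm)"}]
```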
More information about the predictor can be obtained as well.
For example, it seems the machine learning algorithm considers the sepal width to be normally distributed.
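A sketch of how this information could be queried. `Information` works on predictor functions in recent versions (older versions used `PredictorInformation`), and the `"Distribution"` property returns the predictive distribution for a given input.

```mathematica
(* Reload the data so the snippet runs on its own; the URL is an assumption *)
iris = Select[
   Import["https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "CSV"],
   Length[#] == 5 &];

predictor = Predict[iris[[All, {1, 5}]] -> iris[[All, 2]],
   Method -> "LinearRegression"];

(* Summary of the trained predictor *)
Information[predictor]

(* Predictive distribution of the sepal width for one input *)
predictor[{4.5, "Iris-setosa"}, "Distribution"]
```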