Edgar Anderson’s iris data set is well known and freely available on the web as well as through RStudio. We use it here to show some basic techniques in Mathematica.

We start by importing the iris data collection from the UCI repository

IrisAnalysis_1.png
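Since the input cell is only available as an image, here is a minimal sketch of what such an import might look like. The URL is the customary location of the file in the UCI repository and is an assumption, as are the names `url` and `rows`.

```mathematica
(* Assumed location of the raw file in the UCI repository *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";

(* The file has no header row; keep only complete records and drop blank lines *)
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

Length[rows]   (* number of flowers *)
First[rows]    (* {sepal length, sepal width, petal length, petal width, species} *)
```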

The properties of the data are as follows

IrisAnalysis_2.png

and we combine this with the data into a dictionary

IrisAnalysis_3.png

as well as in the form of a dataset

IrisAnalysis_4.png

IrisAnalysis_5.png
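One possible way to attach property names to the rows and wrap the result in a `Dataset` is sketched below. The column names in `keys` are hypothetical labels chosen for illustration, and the URL is again an assumption.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Hypothetical column names, in the column order of the UCI file *)
keys = {"SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"};

(* One association (dictionary) per flower *)
iris = Map[AssociationThread[keys -> #] &, rows];

(* The same data as a Dataset *)
ds = Dataset[iris];

ds[1 ;; 5]               (* first five rows *)
ds[All, "SepalLength"]   (* a single column *)
```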

Looking, for example, at the sepal length, we can plot a histogram:

IrisAnalysis_6.gif

Graphics:Sepal length histogram
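A histogram of this kind can be sketched as follows, assuming the data has been imported from the UCI URL below and that the sepal length is the first column. The name `lengths` is introduced here for illustration.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Sepal length is assumed to be the first column *)
lengths = rows[[All, 1]];

(* Automatic binning, bar heights are counts *)
Histogram[lengths, Automatic, "Count",
 AxesLabel -> {"sepal length (cm)", "count"}]
```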

The histogram is only a picture of the data, but one can also obtain the function that is being plotted

IrisAnalysis_8.gif

IrisAnalysis_9.gif
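It is not clear from the text which function the original cell used. One way to get a density function from the data is a kernel density estimate via `SmoothKernelDistribution`, sketched here as an assumption rather than as the original method.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
lengths = rows[[All, 1]];

(* Kernel density estimate of the sepal length *)
dist = SmoothKernelDistribution[lengths];

(* PDF[dist, x] is an ordinary function of x and can be plotted or evaluated *)
Plot[PDF[dist, x], {x, 4, 8},
 AxesLabel -> {"sepal length (cm)", "density"}]
PDF[dist, 6.]
```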

which reflects the distribution of the sepal length; most of the flowers have a sepal length of around 6 cm.
You can ask questions about the distribution, for example the probability that the length is less than 6 cm:

IrisAnalysis_10.png

IrisAnalysis_11.png

or what the standard deviation is

IrisAnalysis_12.png

IrisAnalysis_13.png
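Both questions can be put to a distribution object directly. The sketch below assumes a kernel density estimate as the distribution; the original cells may have used a different one.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
lengths = rows[[All, 1]];

(* Assumed choice of distribution: a kernel density estimate *)
dist = SmoothKernelDistribution[lengths];

(* Probability that the sepal length is below 6 cm *)
Probability[x < 6, x \[Distributed] dist]

(* Standard deviation of the estimated distribution and of the raw sample *)
StandardDeviation[dist]
StandardDeviation[lengths]
```

The two standard deviations need not agree exactly, since the kernel estimate smooths the data slightly.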

The empirical probability distribution reflects the histogram and is based on simply counting the occurrences in the data

IrisAnalysis_14.gif

IrisAnalysis_15.gif
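The counting interpretation can be checked with a short sketch. It assumes the same import as before; `emp` is a name introduced for illustration.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
lengths = rows[[All, 1]];

emp = EmpiricalDistribution[lengths];

(* Probability according to the empirical distribution ... *)
Probability[x < 6, x \[Distributed] emp]

(* ... compared with the fraction of observations below 6, counted by hand *)
N[Count[lengths, v_ /; v < 6]/Length[lengths]]

(* The empirical CDF is a step function with a jump at every observed value *)
Plot[CDF[emp, x], {x, 4, 8}, AxesLabel -> {"sepal length (cm)", "CDF"}]
```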

If you wish to approach the data as an instance of a particular (continuous) distribution you can get the parametrization of, say, the gamma distribution

IrisAnalysis_16.gif

IrisAnalysis_17.gif

Alternatively, you can ask for the parameters like this

IrisAnalysis_18.png

IrisAnalysis_19.png
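The two routes to a fitted gamma distribution might look like the sketch below. The symbols `a` and `b` are placeholder parameter names and must not have values assigned; the import is assumed as before.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
lengths = rows[[All, 1]];

(* Returns the fitted distribution object, GammaDistribution[shape, scale] *)
gamma = EstimatedDistribution[lengths, GammaDistribution[a, b]]

(* Returns only the fitted parameters, as a list of rules for a and b *)
FindDistributionParameters[lengths, GammaDistribution[a, b]]
```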

To compare the data with a normal distribution you can use the following, which shows that a Gaussian is a reasonable fit:

IrisAnalysis_20.png

IrisAnalysis_21.gif
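Which comparison the original cell used is not visible in the text. Two common options are a quantile plot against a fitted normal distribution and an overlay of the fitted density on the histogram, sketched here under that assumption. The symbols `m` and `s` are placeholder parameter names.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
lengths = rows[[All, 1]];

(* Fit a normal distribution to the sepal lengths *)
normal = EstimatedDistribution[lengths, NormalDistribution[m, s]];

(* Points close to the diagonal indicate a good fit *)
QuantilePlot[lengths, normal]

(* Histogram normalised as a density, with the fitted PDF on top *)
Show[
 Histogram[lengths, Automatic, "PDF"],
 Plot[PDF[normal, x], {x, 4, 8}]
]
```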

A discrete plot of this can be presented as follows

IrisAnalysis_22.gif

IrisAnalysis_23.png

IrisAnalysis_24.gif

Let’s look at whether the sepal length and width cluster and reflect the species to which the flowers belong

IrisAnalysis_25.png

A simple list plot does not produce a clear clustering

IrisAnalysis_26.png

IrisAnalysis_27.gif
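A plain scatter plot of the two sepal measurements can be sketched as follows, assuming length and width are the first two columns. All points share one colour, which is why no grouping is visible.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* {sepal length, sepal width} pairs *)
points = rows[[All, {1, 2}]];

ListPlot[points,
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```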

The data set contains three groups of equal size, one per species

IrisAnalysis_28.png

IrisAnalysis_29.png

and an exact clustering can be obtained by grouping the data

IrisAnalysis_30.png

IrisAnalysis_31.gif
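Counting and grouping by the species label might be done as in this sketch, which assumes the species is the last column. `bySpecies` is a name introduced for illustration.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Number of flowers per species *)
Counts[rows[[All, 5]]]

(* Group on the species label and keep only {sepal length, sepal width} *)
bySpecies = GroupBy[rows, Last -> (#[[{1, 2}]] &)];

(* One colour per species *)
ListPlot[Values[bySpecies],
 PlotLegends -> Keys[bySpecies],
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```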

If we try the automatic clustering we get almost the same result

IrisAnalysis_32.gif

IrisAnalysis_33.gif
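An automatic clustering that ignores the species label can be sketched with `FindClusters`. Asking for exactly three clusters is an assumption made here to match the number of species; the method and distance function are left at their defaults.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
points = rows[[All, {1, 2}]];

(* Partition the points into three clusters *)
clusters = FindClusters[points, 3];

Length /@ clusters   (* cluster sizes *)

ListPlot[clusters,
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```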

It’s clear from the location of the points that differ from the exact grouping that an algorithm working from these two measurements alone could hardly be expected to detect them; they are simply statistical outliers.

Using the machine learning functions we can train a classifier on the sepal data

IrisAnalysis_34.gif

IrisAnalysis_35.gif

Looking at the clusters above, we anticipate that a flower with a sepal length of 5 cm and a sepal width of 4 cm should be an Iris-setosa. This is indeed what the logistic regression predicts:

IrisAnalysis_36.png

IrisAnalysis_37.png

while the value {6.2, 3} lies near the boundary between two species.
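The classification step can be sketched as below. The training set maps {sepal length, sepal width} to the species label, and the method is set explicitly to logistic regression to match the text; the original cell may have let `Classify` choose the method itself.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Rules of the form {length, width} -> species *)
training = Map[#[[{1, 2}]] -> #[[5]] &, rows];

c = Classify[training, Method -> "LogisticRegression"];

(* Most likely species for a single flower *)
c[{5, 4}]

(* Class probabilities show how close a point is to a decision boundary *)
c[{6.2, 3}, "Probabilities"]
```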

Given the sepal length and the species, we can predict the sepal width through machine learning as well:

IrisAnalysis_38.gif

IrisAnalysis_39.gif

Again, looking at the clustering above, we anticipate an Iris-setosa with a sepal length of 4.5 cm to have a sepal width of around 3 cm:

IrisAnalysis_40.png

IrisAnalysis_41.png

That the prediction algorithm is a linear regression can be seen in the plot of the prediction for a fixed category:

IrisAnalysis_42.png

IrisAnalysis_43.gif
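The training, the single prediction and the plot for a fixed species can be sketched together. The feature order {sepal length, species} and the explicit choice of linear regression are assumptions made to match the text; the species string is treated as a nominal feature.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Rules of the form {sepal length, species} -> sepal width *)
training = Map[{#[[1]], #[[5]]} -> #[[2]] &, rows];

p = Predict[training, Method -> "LinearRegression"];

(* Predicted sepal width for one flower *)
p[{4.5, "Iris-setosa"}]

(* With the species held fixed the prediction is a straight line in the length *)
Plot[p[{x, "Iris-setosa"}], {x, 4, 8},
 AxesLabel -> {"sepal length (cm)", "predicted sepal width (cm)"}]
```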

More information about the prediction can be obtained

IrisAnalysis_44.png

IrisAnalysis_45.png

For example, it seems the machine learning algorithm considers the sepal width to be normally distributed:

IrisAnalysis_46.png

IrisAnalysis_47.png
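A sketch of how such information might be queried is given below. It retrains the same hypothetical predictor as a self-contained example and assumes a recent Mathematica version in which `Information` accepts a predictor; older versions offer `PredictorInformation` for the same purpose.

```mathematica
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
rows = Select[Import[url, "CSV"], Length[#] == 5 &];
training = Map[{#[[1]], #[[5]]} -> #[[2]] &, rows];
p = Predict[training, Method -> "LinearRegression"];

(* Summary of the trained predictor: method, features, training size, ... *)
Information[p]

(* Full predictive distribution for one flower instead of a single number *)
p[{4.5, "Iris-setosa"}, "Distribution"]
```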