# Basic data analysis with Mathematica

Edgar Anderson’s iris data set is well known and freely available on the web; it also ships with R. We use it here to show some basic data-analysis techniques in Mathematica.

We start by importing the iris data from the UCI Machine Learning Repository.
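The original import cell is not shown, so the following is only a sketch of how the import might look. The URL is an assumption (the commonly used location of the UCI file), and the variable names `url` and `data` are illustrative.

```mathematica
(* Assumed location of the UCI iris file; the original notebook's address is not shown *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";

(* Import as CSV; the file ends with blank lines, so keep only complete five-field rows *)
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Number of rows and columns actually read *)
Dimensions[data]
```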

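The data has five properties: sepal length, sepal width, petal length and petal width (all in cm), and the species. We combine these property names with the data into associations (Mathematica's dictionaries), and also into a `Dataset`.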
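A minimal sketch of this step, assuming the UCI file location used for the import. The property names in `names` are hypothetical labels, since the UCI file itself has no header row.

```mathematica
(* Assumed UCI location; rows are {sepal length, sepal width, petal length, petal width, species} *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Hypothetical property names, one per column *)
names = {"SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"};

(* One association ("dictionary") per flower *)
records = AssociationThread[names -> #] & /@ data;

(* The same records wrapped as a Dataset for tabular queries *)
iris = Dataset[records];
iris[1 ;; 3]
```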

Looking, for example, at the sepal length, we can plot a histogram.
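A sketch of such a histogram, assuming the sepal length is the first column of the UCI file (the URL is again an assumption):

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

(* Histogram of counts with automatically chosen bins *)
Histogram[sepalLength, AxesLabel -> {"sepal length (cm)", "count"}]
```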

The histogram is only a plot, but one can also obtain the underlying distribution function, which reflects the distribution of the sepal length; most of the flowers have a sepal length around 6 cm.
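The original cell is missing, so it is not clear which function was used here. One plausible reading is `HistogramDistribution`, which turns the histogram into a distribution object whose density can be plotted like any other function.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

(* Distribution object built from the histogram bins (an assumed choice) *)
dist = HistogramDistribution[sepalLength];

(* Its piecewise-constant density as an ordinary function of x *)
Plot[PDF[dist, x], {x, 4, 8}, Filling -> Axis, Exclusions -> None,
 AxesLabel -> {"sepal length (cm)", "density"}]
```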
You can ask questions about the distribution, for example the probability that the sepal length is less than 6 cm, or what the standard deviation is.
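Both questions can be put to a distribution object directly. The sketch below assumes a `HistogramDistribution` built from the sepal lengths; the original notebook may have used a different distribution.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];
dist = HistogramDistribution[sepalLength];

(* Probability that the sepal length is below 6 cm *)
Probability[x < 6, x \[Distributed] dist]

(* Standard deviation of the distribution *)
StandardDeviation[dist]
```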

The empirical probability distribution simply mirrors the histogram: it is obtained by counting occurrences in the data.
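To make the counting explicit, the sketch below compares a probability from `EmpiricalDistribution` with the same fraction computed by hand. The two numbers should agree; the data source is the assumed UCI file.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

emp = EmpiricalDistribution[sepalLength];

(* Probability from the distribution versus the fraction of values below 6, counted by hand *)
{Probability[x < 6, x \[Distributed] emp],
 N[Count[sepalLength, v_ /; v < 6]/Length[sepalLength]]}
```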

If you wish to treat the data as a sample from a particular (continuous) distribution, you can estimate the parameters of, say, the gamma distribution.
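A sketch using `EstimatedDistribution`, which returns the fitted distribution itself. The symbols `a` and `b` are placeholders for the shape and scale parameters, and the overlay plot is an illustrative addition.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

(* Fit a gamma distribution; a and b are symbolic placeholders for its parameters *)
gamma = EstimatedDistribution[sepalLength, GammaDistribution[a, b]]

(* Overlay the fitted density on a histogram normalised to a probability density *)
Show[Histogram[sepalLength, Automatic, "PDF"],
 Plot[PDF[gamma, x], {x, 4, 8}]]
```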

Alternatively, you can ask for the parameters directly, as a list of replacement rules.
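`FindDistributionParameters` is the likely counterpart here: it returns only the parameter values as rules rather than a distribution object. As before, `a` and `b` are placeholder symbols.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

(* Parameter estimates as rules of the form {a -> ..., b -> ...} *)
params = FindDistributionParameters[sepalLength, GammaDistribution[a, b]]

(* Substituting the rules yields the fitted distribution *)
GammaDistribution[a, b] /. params
```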

Comparing the data with a normal distribution shows that the Gaussian is not a bad fit.
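The original comparison cell is not shown. A quantile plot and a goodness-of-fit test are two common ways to make this comparison, so the sketch below is one possible approach, not necessarily the author's.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

(* Normal distribution with parameters estimated from the data *)
normal = EstimatedDistribution[sepalLength, NormalDistribution[m, s]];

(* Points close to the reference line indicate a reasonable normal fit *)
QuantilePlot[sepalLength, normal]

(* p-value of a goodness-of-fit test against the fitted normal *)
DistributionFitTest[sepalLength, normal]
```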

The same comparison can also be presented as a discrete plot.
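It is not clear from the text what exactly was plotted. As an assumption, the sketch below uses `DiscretePlot` to compare the empirical and fitted normal cumulative distribution functions on a 0.1 cm grid. The CDF is used rather than the PDF because the empirical PDF is nonzero only at exact data values.

```mathematica
(* Assumed UCI location; column 1 is the sepal length in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = data[[All, 1]];

emp = EmpiricalDistribution[sepalLength];
normal = EstimatedDistribution[sepalLength, NormalDistribution[m, s]];

(* Empirical versus normal CDF, sampled every 0.1 cm *)
DiscretePlot[{CDF[emp, x], CDF[normal, x]}, {x, 4, 8, 0.1},
 PlotLegends -> {"empirical", "normal"}]
```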

Let’s look at whether the sepal length and width cluster in a way that reflects the species to which the flowers belong.

A simple list plot does not show a clear clustering.
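A sketch of the unlabelled scatter plot, assuming sepal length and width are the first two columns of the UCI file:

```mathematica
(* Assumed UCI location; columns 1 and 2 are sepal length and sepal width in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* All flowers in one colour: species information is deliberately ignored *)
ListPlot[data[[All, {1, 2}]],
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```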

The data set contains three equally sized groups of 50 flowers, one per species, so an exact clustering can be obtained by grouping the data by species.
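Both the group sizes and the exact clustering can be obtained with `Counts` and `GroupBy`. This is a sketch under the assumption that the species label is the last column of each row.

```mathematica
(* Assumed UCI location; the species label is the last field of each row *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Number of flowers per species *)
Counts[data[[All, 5]]]

(* Group by species, keeping only {sepal length, sepal width} for each flower *)
groups = GroupBy[data, Last -> (#[[{1, 2}]] &)];

(* One colour per species *)
ListPlot[Values[groups], PlotLegends -> Keys[groups],
 AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```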

If we try automatic clustering, we get almost the same result.
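A sketch using `FindClusters` on the two sepal measurements, asking for three clusters. The method is left at its default, which may differ from what the original notebook used, and results can vary between versions.

```mathematica
(* Assumed UCI location; columns 1 and 2 are sepal length and sepal width in cm *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Ask for exactly three clusters; species labels are not used *)
clusters = FindClusters[data[[All, {1, 2}]], 3];

(* Cluster sizes, for comparison with the three species groups *)
Length /@ clusters

ListPlot[clusters, AxesLabel -> {"sepal length (cm)", "sepal width (cm)"}]
```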

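The location of the points that differ from the exact clustering makes it clear that no algorithm working on these two sepal measurements alone could be expected to assign them correctly: in this plane they lie among the points of another species and are simply statistical exceptions.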

Using the built-in machine learning functions, we can train a classifier on the sepal data.
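A sketch of the training step with `Classify`. The method is set explicitly to logistic regression because the text mentions it later; whether the original notebook set it or relied on the automatic choice is not known.

```mathematica
(* Assumed UCI location; columns 1, 2 and 5 are sepal length, sepal width and species *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Training examples of the form {sepal length, sepal width} -> species *)
training = (#[[{1, 2}]] -> #[[5]]) & /@ data;

(* Method chosen explicitly here; this is an assumption about the original *)
classifier = Classify[training, Method -> "LogisticRegression"]
```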

By looking at the clusters above, we anticipate that a flower with a sepal length of 5 cm and a sepal width of 4 cm should be an Iris-setosa. This is indeed what the logistic regression predicts, while the value {6.2, 3} lies near the boundary between two species.
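Both points can be passed to the classifier. Asking for the `"Probabilities"` property shows how confident the classifier is about a borderline point. The training step is repeated so the sketch runs on its own; the data source and method are assumptions as before.

```mathematica
(* Assumed UCI location; columns 1, 2 and 5 are sepal length, sepal width and species *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
training = (#[[{1, 2}]] -> #[[5]]) & /@ data;
classifier = Classify[training, Method -> "LogisticRegression"];

(* Most likely species for sepal length 5 cm and sepal width 4 cm *)
classifier[{5, 4}]

(* Class probabilities for the borderline point *)
classifier[{6.2, 3}, "Probabilities"]
```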

Given the sepal length and the species, we can predict the sepal width through machine learning as well.

Again, by looking at the clustering above, we anticipate an Iris-setosa with a sepal length of 4.5 cm to have a sepal width of around 3 cm.
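A sketch with `Predict`, using one numerical and one nominal feature. Linear regression is requested explicitly to match the text; the original notebook may have left the method to be chosen automatically.

```mathematica
(* Assumed UCI location; columns 1, 2 and 5 are sepal length, sepal width and species *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Training examples of the form {sepal length, species} -> sepal width *)
training = ({#[[1]], #[[5]]} -> #[[2]]) & /@ data;

(* Method set explicitly; this is an assumption about the original *)
predictor = Predict[training, Method -> "LinearRegression"];

(* Predicted sepal width for an Iris-setosa with sepal length 4.5 cm *)
predictor[{4.5, "Iris-setosa"}]
```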

That the predictor is a linear regression can be seen by plotting the prediction for a fixed species.
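A sketch of such a plot, with the species held fixed at Iris-setosa and the sepal length varied. The predictor is retrained so the sketch runs on its own, with the same assumptions about data source and method as above. A linear-regression predictor should show up as a straight line.

```mathematica
(* Assumed UCI location; columns 1, 2 and 5 are sepal length, sepal width and species *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
data = Select[Import[url, "CSV"], Length[#] == 5 &];
training = ({#[[1]], #[[5]]} -> #[[2]]) & /@ data;
predictor = Predict[training, Method -> "LinearRegression"];

(* Predicted sepal width as a function of sepal length for one fixed species *)
Plot[predictor[{x, "Iris-setosa"}], {x, 4, 8},
 AxesLabel -> {"sepal length (cm)", "predicted sepal width (cm)"}]
```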