# Basic data analysis with Mathematica

Edgar Anderson’s iris data set is well known and freely available on the web; it also ships with R. We use it here to show some basic data analysis techniques in Mathematica.

We start by importing the iris data collection from the UCI repository. Each record has five properties: sepal length, sepal width, petal length and petal width (all in cm), and the species. We combine these property names with the data into a dictionary (an association) as well as into a dataset.

Looking, for example, at the sepal length, we can plot a histogram. The histogram is only a plot, but one can also obtain the underlying distribution function, which reflects the distribution of the sepal length; most of the flowers have a sepal length around 6 cm.
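The import, dictionary, dataset and histogram steps described above could look roughly like the following sketch. The URL, the column names and the choice of `HistogramDistribution` are assumptions made for illustration; the original notebook cells are not available.

```mathematica
(* Assumed location of the UCI iris file; adjust if the repository layout has changed *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
raw = Import[url, "CSV"];

(* Keep only complete rows; the raw file may end with blank lines *)
iris = Select[raw, Length[#] == 5 &];

(* Column names chosen here for illustration *)
keys = {"SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"};

(* One association (dictionary) per flower, then the same data as a Dataset *)
assoc = AssociationThread[keys -> #] & /@ iris;
ds = Dataset[assoc];

(* Histogram of the sepal length, normalized as a probability density *)
sepalLength = iris[[All, 1]];
Histogram[sepalLength, Automatic, "PDF",
 AxesLabel -> {"sepal length (cm)", "density"}]

(* A distribution object built from the same bins, and a plot of its PDF *)
histDist = HistogramDistribution[sepalLength];
Plot[PDF[histDist, x], {x, 4, 8}]
```

Working with a distribution object rather than a picture is what makes the later questions (probabilities, standard deviation, fits) possible.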
You can ask questions about the distribution, for example what the probability is that the length is less than 6 cm, or what the standard deviation is. The empirical probability distribution simply reflects the histogram and is based on counting the occurrences in the data.

If you wish to treat the data as a sample from a particular (continuous) distribution, you can fit, say, a gamma distribution and obtain its parametrization. Alternatively, you can ask for the fitted parameters directly.

Comparing the data with a normal distribution shows that the Gaussian distribution is not too bad a fit. The comparison can also be presented as a discrete plot.

Let’s look at whether the sepal length and width cluster and reflect the species to which the flowers belong. A simple list plot does not produce a clear clustering. The data set contains three equal clusters (50 flowers per species), and an exact clustering can be obtained by grouping the data by species. If we try automatic clustering we get almost the same result. It is clear from the location of the points that differ from the exact clustering that no algorithm could be expected to detect them; they are simply statistical exceptions.
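The distribution questions and the clustering steps above can be sketched as follows. This is one possible set of calls, not necessarily the ones used in the original notebook: the URL is an assumption, the symbols `a` and `b` are placeholder parameter names, and `DistributionFitTest` and `QuantilePlot` stand in for whatever normality comparison the original showed.

```mathematica
(* Reload the data so the sketch is self-contained; the URL is an assumption *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
iris = Select[Import[url, "CSV"], Length[#] == 5 &];
sepalLength = iris[[All, 1]];

(* Empirical distribution: probabilities come from counting occurrences *)
emp = EmpiricalDistribution[sepalLength];
Probability[x < 6, x \[Distributed] emp]
StandardDeviation[emp]

(* Fit a gamma distribution; a and b are placeholder parameter symbols *)
EstimatedDistribution[sepalLength, GammaDistribution[a, b]]
FindDistributionParameters[sepalLength, GammaDistribution[a, b]]

(* Compare with a normal distribution: p-value of a normality test,
   and a plot of data quantiles against normal quantiles *)
DistributionFitTest[sepalLength]
QuantilePlot[sepalLength]

(* Sepal length and width without any grouping *)
sepal = iris[[All, {1, 2}]];
ListPlot[sepal]

(* Number of flowers per species *)
Counts[iris[[All, 5]]]

(* Exact clustering: group the sepal coordinates by species *)
bySpecies = GroupBy[iris, Last -> (#[[{1, 2}]] &)];
ListPlot[Values[bySpecies], PlotLegends -> Keys[bySpecies]]

(* Automatic clustering into three groups, ignoring the species label *)
clusters = FindClusters[sepal, 3];
ListPlot[clusters]
```

Placing the two grouped plots side by side is a quick way to see which points the automatic clustering assigns differently from the true species.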

Using the built-in machine learning functions we can train a classifier on the sepal data. By looking at the clusters above we anticipate that a flower with a sepal length of 5 cm and a sepal width of 4 cm should be an Iris-setosa. This is indeed what the logistic regression predicts, while the value {6.2, 3} lies near the boundary between two species.
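A minimal sketch of such a classifier, assuming the UCI file is loaded from the URL below and that the method is set explicitly to logistic regression (the original may have relied on the automatic method selection):

```mathematica
(* Reload the data so the sketch is self-contained; the URL is an assumption *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
iris = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Training examples of the form {sepal length, sepal width} -> species *)
training = (#[[{1, 2}]] -> #[[5]]) & /@ iris;

(* Method fixed here for illustration *)
classifier = Classify[training, Method -> "LogisticRegression"];

(* Predicted species for a flower with length 5 cm and width 4 cm *)
classifier[{5, 4}]

(* Class probabilities for a point near the boundary between two species *)
classifier[{6.2, 3}, "Probabilities"]
```

Asking for `"Probabilities"` rather than just the predicted class makes the ambiguity of a boundary point visible.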

Given the sepal length and the species we can predict the sepal width through machine learning as well. Again, by looking at the clustering above we anticipate an Iris-setosa with a sepal length of 4.5 cm to have a sepal width of around 3 cm. That the prediction algorithm is a linear regression can be seen in the plot of the prediction for a fixed category: the predicted width is a straight line in the sepal length.
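The prediction step might be sketched like this. The URL is an assumption, and the method is set explicitly to linear regression so that the plot is a straight line; the original appears to have used whatever method `Predict` selected automatically.

```mathematica
(* Reload the data so the sketch is self-contained; the URL is an assumption *)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
iris = Select[Import[url, "CSV"], Length[#] == 5 &];

(* Training examples of the form {sepal length, species} -> sepal width *)
training = ({#[[1]], #[[5]]} -> #[[2]]) & /@ iris;

(* Method fixed here for illustration *)
predictor = Predict[training, Method -> "LinearRegression"];

(* Predicted sepal width for an Iris-setosa with sepal length 4.5 cm *)
predictor[{4.5, "Iris-setosa"}]

(* Prediction as a function of sepal length for a fixed species *)
Plot[predictor[{x, "Iris-setosa"}], {x, 4, 8},
 AxesLabel -> {"sepal length (cm)", "predicted sepal width (cm)"}]
```

Repeating the plot with `"Iris-versicolor"` or `"Iris-virginica"` in place of `"Iris-setosa"` shows how the species changes the fitted line.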