Tuesday, September 6, 2016

Let's Do Data Science with Python

I spent some time this summer playing with Data Science.  My language of choice was Python.

So, what makes Python a great choice for Data Science?  It has to do with how easy Python is to learn as a general programming language, as well as with the availability of popular libraries such as NumPy, SciPy, matplotlib, and IPython, which we will use in the example below. 

I plan to write this as the first of two posts. Post 1 (this post) presents the basic prerequisites, concepts, tools, and an example. In an upcoming Post 2, we will scale up by applying what was presented in Post 1 to Big Data. This will require us to learn and use different tools. There may also be a third post on Machine Learning with Python.


Prerequisites

  • Basic understanding of statistics. You should be familiar with concepts such as mean, mode, median, standard deviation, data distributions, conditional probabilities and Bayes' Theorem. If you are not familiar with basic statistics, you should first go over a tutorial such as http://stattrek.com/tutorials/statistics-tutorial.aspx

Development Environment Overview

We will use Enthought Canopy, "a comprehensive Python analysis environment that provides easy installation of over 450 core scientific analytic and Python packages, creating a robust platform you can explore, develop, and visualize on." 
We will use the free version of the product. There are paid, premium versions as well.

Also, note that Canopy uses Python 2.7 rather than 3.x. This is not ideal, but it is worth it for the ease of use, as you will see in the following example. You can still set up a Python 3.x environment on your machine for other projects.

Unfamiliar with the differences between Python 2 and 3? Take a look at this tutorial:

1. Install Enthought Canopy from this link:


2. Select your OS (I am using Windows) and follow the instructions to download the Canopy installer.

3. Run the Installer and follow the prompts. This is very basic, so I won't add any more details. 

4. When you first launch Canopy, it will ask if you want to set Canopy as the default Python development environment. I recommend not selecting this option.

5. If your installation succeeded, you should see the following Welcome Screen:

6. Now let's also install pydot2. We will install it from the Canopy command prompt. Launch the Canopy command prompt from the Windows start menu and type the following command:

pip install pydot2

After a few moments, you should see a message that the installation has succeeded (not shown, since I already had it installed).

What data types are used in data science?

Data points in statistics are classified as belonging to statistical data types.

Categorical data (e.g. black tops, green tops, blue tops) has no intrinsic quantitative meaning, but we can assign numbers to the categories for the purpose of statistical analysis. These numbers cannot be used in a meaningful way as numerical data. For example, it doesn't make sense to add black tops and blue tops.

Numerical data represents measurable quantities (e.g. number of students in the classroom).  

Ordinal data is categorical data whose categories have a meaningful order, even though the values are not true numerical measurements (e.g. ratings).

For more details on statistical data types: https://en.wikipedia.org/wiki/Statistical_data_type
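To make the three data types concrete, here is a minimal sketch using NumPy. The sample values (top colors, star ratings, class sizes) are made up for illustration, and using np.unique to assign integer codes is just one convenient way to number categories.

```python
import numpy as np

# Categorical: labels with no intrinsic order or quantity (made-up sample).
top_colors = np.array(["black", "green", "blue", "black", "blue"])

# return_inverse=True gives an integer code for each label.
labels, codes = np.unique(top_colors, return_inverse=True)
print(labels)  # the distinct categories, sorted alphabetically
print(codes)   # the code assigned to each original entry

# Counting entries per category is meaningful; summing the codes is not.
counts = np.bincount(codes)
print(dict(zip(labels, counts)))

# Ordinal: categories with a meaningful order (hypothetical 1-5 star ratings).
ratings = np.array([5, 3, 4, 1, 4])
print(np.median(ratings))  # order-based statistics such as the median make sense

# Numerical: measurable quantities (hypothetical class sizes).
class_sizes = np.array([28, 31, 25])
print(class_sizes.sum())   # arithmetic on the values is meaningful
```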

We also need to be concerned with normalizing the data. This means making sure that our numerical inputs are on comparable scales before we compare or combine them.
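As a rough illustration of normalization, the sketch below shows two common approaches, min-max scaling and z-scores. The post does not say which method it has in mind, so treat both as assumptions. The quiz and exam scores, and the helper names min_max_scale and z_score, are made up for this example.

```python
import numpy as np

# Hypothetical scores from two tests graded on different scales.
quiz_scores = np.array([6.0, 8.0, 10.0])     # graded out of 10
exam_scores = np.array([55.0, 70.0, 100.0])  # graded out of 100

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    return (values - values.mean()) / values.std()

# After scaling, both sets of scores live on the same 0-to-1 range.
print(min_max_scale(quiz_scores))
print(min_max_scale(exam_scores))

# Z-scores have mean 0 and standard deviation 1.
print(z_score(exam_scores))
```

Both helpers assume the input contains at least two distinct values; otherwise the denominator is zero.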

There is much more to learn about the concepts and available tools of data science.  But for now, let's move on to a basic coding example, which should be easy to follow for anyone who meets the prerequisites mentioned above. 

Example: Calculate the mean and median of student test scores. Generate a histogram of the scores.

1. Generate the data set

We need some data to work with. Since we don't have real data, we will create our own randomly generated data. In the real world you would have real data - perhaps from  a public data source or data provided by your company.

Sidenote: Recently I learned about the Python tool Scrapy, which you can use for "extracting the data you need from websites in a fast, simple, yet extensible way." I haven't had a chance to write code using Scrapy yet, but the tool comes highly recommended. Plus, it's free. Just be sure to read the terms of use before crawling the web for data. https://scrapy.org/

Let's randomly generate data for the test scores of 350 students with the use of the numpy package.  "NumPy is the fundamental package for scientific computing with Python."

What tool are we using here for coding and displaying the result? 

Jupyter (IPython) notebooks. If you have a notebook file with the .ipynb extension, double-clicking it will automatically open the file in the browser, where you can edit or run it.

If you are launching Jupyter for the first time, you can do so by typing jupyter notebook at a Windows command prompt. Jupyter notebooks are installed as part of Enthought Canopy.

As you can see from the few lines of code and comments below, we import the numpy package and then call the randint function, passing it the low and high score bounds and the sample size. It is this easy!
I have linked to the documentation for the functions we are using, so be sure to check those links so that you know what arguments and return values to provide/expect.
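A minimal sketch of this step might look like the following. The score range (40 to 100) and the fixed seed are assumptions made for illustration, not values taken from the post's notebook.

```python
import numpy as np

# Optional: fix the seed so that rerunning the cell gives the same data.
np.random.seed(0)

# Assumed bounds: scores from 40 to 100. randint excludes `high`,
# so we pass 101 to make 100 a possible score.
scores = np.random.randint(low=40, high=101, size=350)

print(scores[:10])   # peek at the first ten scores
print(scores.shape)  # one score per student
```

Note that the upper bound passed to randint is exclusive, which is easy to miss when you want the top score to be reachable.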

2. Calculate the mean.
The "mean", or average, is calculated by adding up all the numbers and then dividing by how many numbers there are.

Yes, it is as easy as calling the mean function of the numpy package.  Based on the above data, the mean test score is 73.88. 
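Here is a small sketch comparing the definition of the mean with the numpy call. It regenerates data using an assumed score range of 40 to 100 and a fixed seed, so the printed value will not match the 73.88 reported above.

```python
import numpy as np

np.random.seed(0)
scores = np.random.randint(low=40, high=101, size=350)  # assumed range

# The definition: add up all the scores, then divide by how many there are.
mean_by_hand = float(scores.sum()) / len(scores)

# The numpy shortcut.
mean_numpy = np.mean(scores)

print(mean_by_hand)
print(mean_numpy)
```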

3.  Calculate the median. 

The "median" is the middle value in the sorted list of numbers. To calculate the median for our data set, we call the numpy median function. The median for this data set is 74, which makes sense given that the mean is 73.88.
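The sketch below checks numpy's median against the definition. With an even number of values (350 here), there is no single middle value, so the median is the average of the two middle values of the sorted data. The score range and seed are again assumptions, so the result will differ from the 74 quoted above.

```python
import numpy as np

np.random.seed(0)
scores = np.random.randint(low=40, high=101, size=350)  # assumed range

# Sort the scores and average the two values in the middle.
sorted_scores = np.sort(scores)
middle = len(sorted_scores) // 2
median_by_hand = (sorted_scores[middle - 1] + sorted_scores[middle]) / 2.0

# The numpy shortcut.
median_numpy = np.median(scores)

print(median_by_hand)
print(median_numpy)
```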

Mean and median will be similar so long as there aren't any outliers in our data set to affect the mean. This is not a problem for our data set, but worth paying attention to. 
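To see how an outlier pulls the mean but barely moves the median, consider this small sketch. The five scores and the outlier of 1000 (think of a data-entry error) are made up for illustration.

```python
import numpy as np

scores = np.array([70, 72, 74, 76, 78])

# Append one extreme value, e.g. a data-entry error.
with_outlier = np.append(scores, 1000)

print(np.mean(scores))          # 74.0
print(np.median(scores))        # 74.0

print(np.mean(with_outlier))    # about 228.33, dragged up by the outlier
print(np.median(with_outlier))  # 75.0, barely changed
```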

4. Display a histogram of the test scores.
A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval. In this case we will display the test scores in buckets of 10.
To create and show a histogram we will use the matplotlib library. According to the documentation, "matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms."

After importing matplotlib, we call the two functions needed to create and display the histogram: plt.hist and plt.show  

Now we have a histogram showing us visually the frequency of our variable, the test scores.
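A sketch of the histogram step could look like this. The score range, seed, axis labels and title are assumptions for illustration. The bucket edges are chosen to give buckets of 10, as described above.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
scores = np.random.randint(low=40, high=101, size=350)  # assumed range

# Bucket edges every 10 points: 40, 50, ..., 100.
# The last bucket includes its right edge, so a score of 100 is counted.
bins = np.arange(40, 101, 10)

# plt.hist draws the bars and also returns the count in each bucket.
counts, edges, patches = plt.hist(scores, bins=bins)

plt.xlabel("Test score")
plt.ylabel("Number of students")
plt.title("Distribution of test scores")
plt.show()

print(counts)  # how many students fall in each bucket
```

In a Jupyter notebook you may need to run %matplotlib inline in a cell first so that the figure appears below the code.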

The source code for this example is available from: