Tuesday, September 6, 2016

Let's Do Data Science with Python

I spent some time this summer playing with Data Science.  My language of choice was Python.

So, what makes Python a great choice for Data Science?  It has to do with how easy Python is to learn as a general programming language, as well as with the availability of popular libraries such as NumPy, SciPy, matplotlib, and IPython, which we will use in the example below. 

I plan to write this as the first of two posts. Post 1 (this post) will present basic prerequisites, concepts, tools and an example. In an upcoming Post 2, we will scale up by applying what was presented in Post 1 to Big Data. This will require us to learn and use different tools. Possibly there will be a third post on Machine Learning with Python.

Prerequisites 


  • Basic Understanding of Statistics. You should be familiar with concepts such as Mean, Mode, Median, Standard Deviation, Data Distributions, Conditional Probabilities and Bayes' Theorem. If you are not familiar with basic statistics, you should first go over a tutorial such as http://stattrek.com/tutorials/statistics-tutorial.aspx 

Development Environment Overview

We will use Enthought Canopy, "a comprehensive Python analysis environment that provides easy installation of over 450 core scientific analytic and Python packages, creating a robust platform you can explore, develop, and visualize on." 
We will use the free version of the product. There are paid, premium versions as well.

Also, note that Canopy uses Python 2.7 rather than 3.x.
This is not ideal, but it's worth it because of the ease of use, as you will see in the following example. You can still have a Python 3.x environment set up on your machine for other projects.

Unfamiliar with the differences between Python 2 and 3? Take a look at this tutorial:

1. Install Enthought Canopy from this link:

https://www.enthought.com/products/canopy/

2. Download Canopy, selecting your OS (I am using Windows) and follow instructions to download and install. 

3. Run the Installer and follow the prompts. This is very basic, so I won't add any more details. 

4. When you first launch Canopy, it will ask if you want to set Canopy as the default Python development environment. I recommend not selecting this option.

5. If your installation succeeded, you should see the following Welcome Screen:




6. Now let's also install pydot2. We will install it from the Canopy command prompt. Launch the Canopy command prompt from the Windows start menu and type the following command:

pip install pydot2



After a few moments, you should see a message that the installation has succeeded (not shown, since I already had it installed).

What data types are used in data science?

Data points in statistics are classified as belonging to statistical data types.


Categorical data types (e.g. black tops, green tops, blue tops) don't have an intrinsic quantitative meaning, but we can assign numbers to categorical data for the purpose of statistical analysis. These numbers cannot be used in a meaningful way as numerical data - for example, it doesn't make sense to add black tops and blue tops.

Numerical data represents measurable quantities (e.g. number of students in the classroom).  

Ordinal data is categorical data whose categories have a meaningful order, even though the values themselves are not numbers (e.g. ratings).

For more details on statistical data types: https://en.wikipedia.org/wiki/Statistical_data_type
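
To make the point about categorical data concrete, here is a minimal sketch using only the Python standard library. The list of tops and the integer codes are made up for illustration; the codes are arbitrary labels, so counting them is meaningful while adding or averaging them is not.

```python
from collections import Counter

# Hypothetical sample of categorical observations (colors of tops).
tops = ["black", "green", "blue", "black", "blue", "black"]

# Assign an arbitrary integer code to each category.
# Sorting is only there so the codes come out the same on every run.
codes = {color: index for index, color in enumerate(sorted(set(tops)))}
encoded = [codes[color] for color in tops]
print(encoded)  # [0, 2, 1, 0, 1, 0]

# Counting categories is meaningful: this gives the mode.
most_common_color, count = Counter(tops).most_common(1)[0]
print(most_common_color, count)  # black 3

# Averaging the codes would NOT be meaningful: the result would change
# if we had numbered the colors in a different order.
```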

We also need to be concerned with normalizing the data. What this means is that we need to ensure that our input numerical data is comparable.
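
As a rough illustration of what "comparable" can mean, here is a minimal sketch of min-max scaling, which is one common normalization technique. The post does not say which method it has in mind, so the choice of technique, the function name min_max_normalize and the sample values are all assumptions.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    span = float(highest - lowest)  # float() keeps the division exact on Python 2.7 too
    if span == 0:
        # All values are identical; map them all to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]


# Hypothetical inputs measured on different scales.
test_scores = [50, 75, 100]   # points out of 100
hours_studied = [2, 6, 10]    # hours per week

print(min_max_normalize(test_scores))    # [0.0, 0.5, 1.0]
print(min_max_normalize(hours_studied))  # [0.0, 0.5, 1.0]
```

After scaling, both columns live on the same 0 to 1 range, so neither one dominates simply because its raw numbers are larger.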

There is much more to learn about the concepts and available tools of data science.  But for now, let's move on to a basic coding example, which should be easy to follow for anyone who meets the prerequisites mentioned above. 


Example: Calculate the mean and median of student test scores. Generate a histogram of the scores.


1. Generate the data set

We need some data to work with. Since we don't have real data, we will create our own randomly generated data. In the real world you would have real data - perhaps from  a public data source or data provided by your company.

Sidenote: Recently I learned about the Python tool Scrapy, which you can use for "extracting the data you need from websites in a fast, simple, yet extensible way." I haven't had a chance to write code using Scrapy yet, but the tool comes highly recommended. Plus, it's free. Just be sure to read the terms of use before crawling the web for data. https://scrapy.org/

Let's randomly generate data for the test scores of 350 students with the use of the numpy package.  "NumPy is the fundamental package for scientific computing with Python."

What tool are we using here for coding and displaying the result? 

Jupyter (IPython) notebooks. If you have a notebook file with the .ipynb extension, double clicking it will automatically open the file in the browser, where you can edit or run it.

If you are launching Jupyter for the first time, you can do so by typing jupyter notebook from a Windows command prompt. Jupyter notebooks are installed as part of Enthought Canopy.

As you can see from the few lines of code and comments below, we import the numpy package, after which we call the randint function, passing it the low score, the high score and the sample size. It is this easy!
I have linked to the documentation for the functions we are using, so be sure to check those links so that you know what arguments and return values to provide/expect.
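
Since the notebook screenshot is not reproduced here, the following is a sketch of what such a cell might look like, not the exact code from the notebook. The score range of 50 to 100 and the fixed seed are assumptions; only the sample size of 350 and the use of numpy's randint come from the text above.

```python
import numpy as np

np.random.seed(0)   # assumption: fixed seed so reruns give the same numbers

LOW_SCORE = 50      # assumed lowest possible score (inclusive)
HIGH_SCORE = 101    # randint excludes the upper bound, so 101 allows a score of 100
NUM_STUDENTS = 350  # sample size stated in the post

# randint(low, high, size) draws `size` integers from low up to high - 1.
scores = np.random.randint(LOW_SCORE, HIGH_SCORE, NUM_STUDENTS)

print(scores[:10])  # peek at the first ten generated scores
print(len(scores))  # 350
```

Note that the upper bound passed to randint is exclusive, which is an easy detail to miss when choosing the high score.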





2. Calculate the mean.
The "mean", or average, is calculated by adding up all the numbers and then dividing by the number of numbers.

Yes, it is as easy as calling the mean function of the numpy package.  Based on the above data, the mean test score is 73.88. 
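
A minimal sketch of this step follows. It regenerates its own sample data (assumed score range 50 to 100, assumed seed), so the printed mean will not match the 73.88 quoted above. It also computes the mean by hand to show that numpy's mean function does exactly what the definition says.

```python
import numpy as np

np.random.seed(0)                         # assumption: fixed seed for repeatability
scores = np.random.randint(50, 101, 350)  # assumed score range, 350 students

# The library call.
mean_score = np.mean(scores)

# The same thing by hand: add everything up, divide by how many there are.
manual_mean = scores.sum() / float(len(scores))

print(mean_score)
print(manual_mean)
```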

3.  Calculate the median. 

The "median" is the middle value in the list of numbers. To calculate the median for our data set, we call the numpy median function. The median for this data set is 74, which makes sense given that the mean is 73.88.

Mean and median will be similar so long as there aren't any outliers in our data set to affect the mean. This is not a problem for our data set, but worth paying attention to. 
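
To see why outliers matter, here is a small sketch with hand-picked scores (hypothetical values, chosen so the arithmetic is easy to check). Adding a single score of 0 pulls the mean down by more than 12 points, while the median moves by only 1.

```python
import numpy as np

# Hypothetical, evenly spread scores: mean and median agree.
typical = [70, 72, 74, 76, 78]
print(np.mean(typical))    # 74.0
print(np.median(typical))  # 74.0

# The same scores plus one outlier (a student who scored 0).
with_outlier = typical + [0]
print(np.mean(with_outlier))    # 370 / 6, roughly 61.67
print(np.median(with_outlier))  # 73.0 (average of the two middle values, 72 and 74)
```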




4. Display a histogram of the test scores.
A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval. In this case we will display the test scores in buckets of 10.
To create and show a histogram we will use the matplotlib library. According to the documentation, "matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms."

After importing matplotlib, we call the two functions needed to create and display the histogram: plt.hist and plt.show  

Now we have a histogram showing us visually the frequency of our variables - the test scores. 
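
The two calls can be sketched as follows. This is an approximation of the notebook cell, not a copy of it: the data is regenerated with an assumed score range of 50 to 100 and an assumed seed, and the bucket edges are spelled out explicitly so that each bar covers 10 points.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)                         # assumption: fixed seed for repeatability
scores = np.random.randint(50, 101, 350)  # assumed score range, 350 students

# Bucket edges 50, 60, ..., 100 give five buckets of width 10.
# The last bucket includes its right edge, so a score of 100 is counted.
bucket_edges = list(range(50, 101, 10))

# plt.hist draws the bars and also returns the count per bucket.
counts, edges, patches = plt.hist(scores, bins=bucket_edges)
plt.xlabel("Test score")
plt.ylabel("Number of students")
plt.show()

print(counts)  # how many students fall in each bucket
```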


The source code for this example is available from:
https://github.com/stanimeredith/Python/blob/master/firstdatascience.ipynb

Wednesday, April 20, 2016

How to Ace the Whiteboard Coding Interview

1. When presented with the coding problem by the interviewer, be sure to communicate and ask questions to confirm the details of the problem statement. For example, if you are asked to check whether two strings are permutations of each other, clarify whether the solution should account for case sensitivity and whether whitespace counts. Sometimes the interviewer might intentionally omit part of the information necessary for solving the problem, just to see if you will ask for it.

2. The whiteboard has only so much writing space, so plan ahead starting at the top left corner  and possibly dividing the board into two or three columns.

3. The interviewer will often let you pick the coding language you want to use to implement the solution, but this is not always the case. You should do some homework and find out what development stack is used at the company where you are interviewing. Or perhaps you have already contributed to the company's open source git repository and are familiar with their code base.

4. As you work on your coding solution, talk out loud so that the interviewer can get an insight into your thought processes. They want to learn about your general problem-solving skills as much as about your solution to the specific coding problem.

5. If stuck, don't be afraid to ask for hints. Asking  the right questions is a big part of problem solving and therefore, coding.

6. Draw yourself a visual diagram to help you understand the problem. This also demonstrates your use of a powerful problem solving technique: visualization. However, be mindful of the available space on the whiteboard.

7. Unless you know exactly how to code the solution, work in a progressive enhancement manner. Code a simpler version of the solution first, perhaps adding //todo comments for the parts you don't know how to code yet.

8. Syntax is important, but it's not all-important, at least not for the whiteboard interview. If your algorithm is correct but you forgot a semicolon, you have in essence successfully solved the coding problem.

9. Ask the interviewer if they want you to test the code with sample value(s).

10. Use basic good programming style: choose meaningful variable names, and pick a naming style such as camelCase and use it consistently for all your variables.

11.  Become familiar with online interview tools, which are becoming a more common part of the interview process. A couple examples of such tools include  https://www.hackerrank.com/  or https://www.proveit.com/default.htm

12. Show a generally positive attitude. There is a large human interaction component to the whiteboarding interview. The interviewer is also assessing if you will be a good fit for their team.


13. The above tips will help with the whiteboard coding interview, but they won't be enough if you don't come prepared with solid programming knowledge, which can only be acquired over an extended period of time and with daily practice.
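
To illustrate why the clarifying questions in tip 1 matter, here is a minimal sketch of a permutation check in which the two ambiguous details become explicit options. The function name and parameter names are hypothetical, and this is just one reasonable way to solve the problem.

```python
from collections import Counter


def is_permutation(first, second, case_sensitive=True, ignore_whitespace=False):
    """Return True if both strings contain the same characters the same number of times."""
    if not case_sensitive:
        first, second = first.lower(), second.lower()
    if ignore_whitespace:
        # split() with no arguments drops all whitespace; join glues the rest back together.
        first = "".join(first.split())
        second = "".join(second.split())
    # Counter builds a character -> count table for each string.
    return Counter(first) == Counter(second)


print(is_permutation("Listen", "Silent"))                                 # False
print(is_permutation("Listen", "Silent", case_sensitive=False))           # True
print(is_permutation("dormitory", "dirty room"))                          # False
print(is_permutation("dormitory", "dirty room", ignore_whitespace=True))  # True
```

The same pair of inputs gives a different answer depending on the assumptions, which is exactly what the interviewer wants to see you ask about.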

The following is a list of topics which often make an appearance at programming interviews. Mastery of these topics can only be acquired with daily practice, so start practicing your coding now!
For in-depth coverage of the coding interview process as well as coding challenges, I recommend the book Cracking the Coding Interview by Gayle Laakmann McDowell.

Possible Technical Interview topics:
String Manipulation
Data Structures: Arrays, Linked Lists, Stacks, Queues, Trees, Graphs
Code efficiency (Big O notation)
Sorting and Searching, Bit Manipulation, Recursion
Object Oriented Design
Testing Fundamentals
Scalability
and more...


Monday, March 28, 2016

Getting started as an Android Developer

Android's dominating market share, its open source license to device manufacturers, and its open source development model make it a great choice for starting out as a mobile developer. 
It can be expected that the demand for Android developers will continue to grow. 

As an Android Developer you can develop for a wide array of devices ranging from phones and tablets to  wearables, home  and vehicle systems. You will develop a variety of apps such as games, news, photography, education or shopping apps, to name a few. 

To develop Android Apps you  will use Android Studio, which is now the official IDE for developing Android Apps. Be sure to check out the many helpful samples which come with the SDK. 
Android Studio and the SDK are available as free downloads and are cross-platform compatible.

Prerequisite Coding Skills:
In order to be successful in learning Android programming, you should already have a background in the following languages:

Java: You need to have solid basic programming skills in Java and also be well versed in Object Oriented Programming in Java.

XML: (eXtensible Markup Language) is used to store data. 

SQL:  You need to have basic working knowledge of databases, such as writing CRUD  queries.  (Android comes with the popular embedded SQLite database).
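
As a quick refresher on what CRUD queries look like, here is a minimal sketch that runs the four statements against an in-memory SQLite database. It uses Python's built-in sqlite3 module purely for illustration; in an Android app the same SQL would be issued from Java through the platform's own SQLite classes. The notes table and its contents are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# Create
cur.execute("INSERT INTO notes (title) VALUES (?)", ("Buy milk",))
note_id = cur.lastrowid

# Read
cur.execute("SELECT title FROM notes WHERE id = ?", (note_id,))
title_after_insert = cur.fetchone()[0]

# Update
cur.execute("UPDATE notes SET title = ? WHERE id = ?", ("Buy oat milk", note_id))
cur.execute("SELECT title FROM notes WHERE id = ?", (note_id,))
title_after_update = cur.fetchone()[0]

# Delete
cur.execute("DELETE FROM notes WHERE id = ?", (note_id,))
cur.execute("SELECT COUNT(*) FROM notes")
rows_remaining = cur.fetchone()[0]

conn.close()

print(title_after_insert)  # Buy milk
print(title_after_update)  # Buy oat milk
print(rows_remaining)      # 0
```

The ? placeholders keep values separate from the SQL text, a habit worth carrying over to any database API.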

Ways to learn to develop for Android:

For the most basic introduction, see Android's Building Your First App.
If you are self motivated you can try using the Udacity Android Beginner course:

If you are based in Seattle, Seattle Central College is one of several venues in the Seattle area where you can take an instructor led Android class.  I will be teaching the next session in Spring 16 quarter. For registration and information visit the college website and search for ITC 162. 


If you are looking for a book, I highly recommend the book we use in my Android class: Murach's Android Programming, 2nd Edition.

Also, consider joining meetup groups in your area - check out meetup.com and search for Android and Mobile development meetups.  In Seattle you can join Seattle Android Developers GDG, for example. 

In addition, there are numerous online forums and social media outlets where you can follow and participate in the latest Android news and updates.


Deploying and monetizing your app:

You can distribute your app through a marketplace such as Google Play, Amazon Appstore or directly to users via website or email. At this time there is a $25 registration fee to register with Google Play. 

In house or contract Android developer career path:

There is also a great demand for contract or in house Android developers. Some of the core skills required for entry level Android developers include experience programming with Java, using the Android SDK, Android Studio, Gradle, and Git as well as familiarity with the software development life cycle and agile methodologies.

Learning Android programming and keeping up with constant updates from Google can be challenging, but it is also fun and rewarding.












Thursday, March 24, 2016

It's like Scully and Mulder in my head, but more like Scully



Image Credit: The X Files
While it is narrow-minded not to admit that there is more outside the domain of science than we know today, investigating the subject can be a slippery slope filled with charlatans such as Stuart Hameroff or Deepak Chopra.

Or, going back in time, Madame Blavatsky or G.I. Gurdjieff, who made claims about magical kingdoms and mystical lands, which would have been impossible to disprove in the days when we had no GPS...

Cults take this pseudoscientific pursuit to another level, preying on the vulnerable and on the basic human need to search for meaning. "I want to believe", as Mulder from the X Files said... A cloud of obscurity is used as a way to keep people in cults, such as Gurdjieff's Fourth Way, because what is discussed there is simply too crazy to be shared with others. However, the version given to the cult members is that this is sacred knowledge beyond the access of "ordinary" people. This is a very problematic approach, in contradiction with the scientific method.

So, while there is more to explore beyond the event horizon of science today, I am highly skeptical (this does not make me a narrow-minded materialist, by the way) of going about it by investigating so-called ancient traditions, because they generally contain more nonsense than usefulness, and the two parts are intertwined in a way that most people find very difficult to separate. The question, then, is: how does one go about investigating psi phenomena and other present-day speculative subjects while keeping to the scientific method (the best we've got so far)?

There must be some objective, repeatable criteria, which are missing from the "mystical" traditions. Recently I listened to Ben Goertzel's talk on expanding science to include second person verification - this sounds promising.

Also, studying secular meditation in a scientific setting sounds promising - we are faced with the barrier of subjective experience as the main obstacle, I think.

Top Tips for Web Project Management by the Seattle Central College Web 105 students

This is a compilation of student tips shared at the end of Web 105 Working on a Web Project class I taught at SCC.




  1. Map out what you’re doing before you do it with clearly defined goals.
  2. Use tools such as Cost estimator, Keyword search tool, Adobe Kuler, etc.
  3. Collaborate with your team for a more successful project.
  4. Organize with Post-Its.
  5. Research the competitive landscape.
  6. Respect contracts (non-disclosure agreements, customer contracts).
  7. Value an effective division of labor on your project, playing to different employees' strengths.
  8. Document everything to keep everyone on the same page and make sure everyone has access to all the ideas, research and information.
  9. Develop a workflow and stick to it broadly.
  10. Have a good communication plan (email, Google Docs, staging area, video conference).
  11. Understand how agile workflow functions and its importance to your future employment.
  12. Be Agile during the project.
  13. Have an understanding of all aspects of the project (e.g. front-end development, wireframes, SEO, agile workflow, industry standards, etc.).
  14. Time management is HUGE. Budget, understand how much time each task takes.
  15. Understand the USER, and that their needs may not be glamorous but are important.
  16. Typography is (arguably) 95% of web design (a “huge deal”).
  17. Know the project phases: Plan, Develop the structure, Design visual interface, Build and integrate.
  18. Prepare a project summary - vital to making sure you and your client are on the same page.
  19. Get paid what you’re worth and manage your time.
  20. Watch out for scope creep.
  21. Scope creep is inevitable, but it can be managed.
  22. Be mindful of who you're designing for and who you're leaving out, especially when it comes to technology.
  23. Don't assume you're on the same page; get all agreements in writing.

Tuesday, March 8, 2016

Top 5 skills to keep IT professionals employed as we transition to an AI driven economy

While the demand for software developers is at an all-time high,  exponential growth of technology, machine learning, data science, as well as large capital investment in AI research are expected to lead to the automation and/or elimination of most jobs, including most IT and programming jobs.
Ed Messerly, BITCA Faculty at Seattle Central College, and I have put together a list of several career paths which, we believe, are not at near-term risk of large-scale automation. Long-term solutions to this problem must include high-level policy reform, such as Universal Basic Income.
Please note that there is always the possibility of technological disruption which could change current trends in  unpredictable directions.
1. Tool-Centric Development: Becoming craftsman-like developers, learning the tools and “tricks” to build applications. The skill set is not very specialized in terms of math or computer science, but one needs to know Systems Analysis and Traditional Logic and be good at problem solving in order to put together or maintain programming snippets from various frameworks and tools. Some examples of such tool-centric technologies include Bootstrap, WordPress and Drupal.

2. Mobile Development is predicted to be one of the growing programming fields. Mobile development has moved beyond cell phones and now includes a wide range of mobile devices, including wearables and automotive devices. To develop mobile apps for the widely popular, open source Android platform you need to know Java and XML.

3. Data Centric Programming, including developing in a Cloud environment. Data Analytics is a booming field, and today humans are generating digital data at unprecedented rates. This data is very useful for business insights and needs to be mined and analyzed. Scratch is a free and easy to learn, cloud-based programming framework to teach basic coding skills. One of the most popular languages used in data science is Python. You also need to have an aptitude for math and especially for statistics.

4. AI Programming is another area which is not going away just yet (not until the AIs learn fully how to program themselves).  To program AIs you need to have strong computer science background and to understand different AI programming approaches such as Search, Logic, Probabilistic reasoning, Decision Making, Natural Language Processing, Genetic Algorithms and more... Some of the popular languages for programming AIs include LISP, Python and Haskell.

5. There is still a lot of legacy code which needs to be maintained. Many companies are invested in their current infrastructure and will not upgrade to the latest and greatest technologies due to business considerations.   Much of the legacy code is written in Java and even C.