This repo contains the teaching material for the Introduction to Python (and useful libraries) masterclass at the Data Science Retreat.
- About Me
- The Python Programming Language
- Pandas
- NumPy and Matplotlib
- Scikit-learn and your first Data Science case
- SciPy
Slides for this section can be found here.
Slide deck for this entire section is available here.
Slides on this topic start here
Slides on this topic start here
Slides on this topic start here
A great notebook covering the main differences between Python 2 and Python 3 has been written by Sebastian Raschka.
To keep your code compatible with both Python 2 and Python 3, you might also want to use this Cheat Sheet.
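As a minimal sketch of what version-agnostic code can look like (the helper name `to_text` is hypothetical and is not taken from the cheat sheet), the snippet below branches once on `sys.version_info` so that the same file runs under both interpreters:

```python
import sys

# True when running under Python 2, False under Python 3.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists in Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return `value` as a text string on either Python version (hypothetical helper)."""
    # Byte strings are decoded; everything else is converted with the text type.
    if isinstance(value, bytes) and not isinstance(value, text_type):
        return value.decode(encoding)
    return text_type(value)


print(to_text(b"hello"))
print(to_text(42))
```

Keeping the version check in one place means the rest of the code can call `to_text` without caring which interpreter is running.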
Slides on this topic start here
The most basic interactive Python command line, where each line starts with a >>> prompt.
Standard editor in Python distributions, easy to use but very basic.
A more sophisticated interactive Python command line. It incorporates tab-completion, interactive help and regular shell commands. Also look up the %-magic commands.
Spyder is part of the Anaconda Python distribution. It is a small IDE mostly for data analysis, similar to RStudio. It automatically highlights syntax errors and includes a variable explorer, debugging functionality and other useful things.
Interactive environment for the web browser. A Jupyter notebook contains Python code, text, images and any output from your program (including plots!). It is a great tool for exploratory data analysis.
A general-purpose text editor that works on all systems. Many plugins for Python are available. Both a free and a commercial version exist.
The open-source cousin of Sublime Text 2.
PyCharm is probably the most luxurious IDE for Python. It contains tons of functions that are a superset of all the above. PyCharm is a great choice for bigger Python projects.
If you must use a text editor on Windows to edit Python code, refuse to use anything worse than Notepad++.
I know people who are successfully using Vim to write Python code and are happy with it.
I know people who are successfully using Emacs to write Python code, but haven't asked them how happy they are.
Slides on this topic start here
A live demo will be given during the masterclass.
Experiment further with the IPython Notebook environment with this Jupyter Notebook. Clone or download it, then open it and try running and modifying its cells.
Many more Jupyter features are described in this blog post.
Time to get your hands dirty. Read and test for yourself the examples provided in: The SciPy Lectures -- The Python Language.
Practice those examples by alternating between Python files, the IPython interpreter and an IPython Notebook.
To practice:
- Tutorial: Data structures
- Tutorial: Working with dataframes
- Tutorial: Using pandas on the MovieLens dataset
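As a quick warm-up before those tutorials, here is a minimal sketch of the two core pandas data structures and a group-by aggregation. The values are made-up toy data for illustration only and are not taken from the MovieLens dataset.

```python
import pandas as pd

# A Series is a one-dimensional array with labels (the index).
ratings = pd.Series([4.0, 3.5, 5.0], index=["alice", "bob", "carol"])

# A DataFrame is a table of named columns (toy values, not real data).
df = pd.DataFrame(
    {
        "user": ["alice", "alice", "bob", "carol"],
        "movie": ["A", "B", "A", "B"],
        "rating": [4.0, 3.0, 5.0, 2.0],
    }
)

# Boolean indexing: keep only the rows with a rating of at least 4.
high = df[df["rating"] >= 4.0]

# Split-apply-combine: average rating per movie.
mean_by_movie = df.groupby("movie")["rating"].mean()

print(ratings["bob"])
print(high)
print(mean_by_movie)
```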
Start with the official NumPy Tutorial. Note: if this link returns an error, move to the PDF version.
Move on to these exercises.
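For a first taste of what the tutorial and exercises cover, the sketch below (an illustration written for this README, not an excerpt from either resource) shows array creation, an axis-wise reduction, broadcasting and boolean masking:

```python
import numpy as np

# Build a 2x3 array holding the integers 0..5.
a = np.arange(6).reshape(2, 3)

# Reduce along axis 0 to get one mean per column.
col_means = a.mean(axis=0)

# Broadcasting: the (3,) vector is subtracted from every row of the (2, 3) array.
centered = a - col_means

# Boolean masking: select the even entries as a flat array.
evens = a[a % 2 == 0]

print(a)
print(col_means)
print(centered)
print(evens)
```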
Learn the basics and some more advanced plotting tricks in Matplotlib with this hands-on tutorial.
- Introduction to machine learning with scikit-learn slides
- Doing machine learning with scikit-learn slides
- Tutorial: Introduction to scikit-learn
- To go further
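As a rough sketch of a typical scikit-learn workflow (not taken from the slides or the tutorial; the choice of the bundled iris dataset and of logistic regression is an assumption made purely for illustration), the steps are: load the data, hold out a test set, fit an estimator, then score it on the held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator follows the same fit / predict / score interface.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy: %.2f" % accuracy)
```

Swapping in another classifier usually only requires changing the line that constructs `model`, since the estimator interface stays the same.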
A great source of data problems nowadays is the Kaggle platform. We'll be starting today with a simple but representative dataset: Titanic: Machine Learning from Disaster.
- Orientation guide for approaching the problem
IMPORTANT: you will find plenty of material analyzing this data; however, you'll learn the most if you give the problem some thought and try out several things before resorting to ready-made answers.
SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. Here is a hands-on overview of this collection, together with practical exercises and more advanced problems.
For those willing to go further on the statistical aspects of SciPy, I recommend having a look at these IPython Notebooks on Effect Size, Random Sampling and Hypothesis Testing.
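To make the notion of effect size concrete, here is a minimal sketch of Cohen's d with a pooled standard deviation. It is an illustration written for this README, not code from those notebooks, and it deliberately uses only the Python standard library rather than SciPy. The function name `cohens_d` and the simulated samples are assumptions.

```python
import random
import statistics


def cohens_d(sample_a, sample_b):
    """Standardized difference between two sample means (pooled SD)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Random sampling: draw two simulated groups whose true means differ by one SD.
rng = random.Random(0)  # fixed seed so the run is reproducible
group_a = [rng.gauss(1.0, 1.0) for _ in range(200)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(200)]

d = cohens_d(group_a, group_b)
print("Cohen's d: %.2f" % d)
```

Because the groups are drawn one standard deviation apart, the printed value should land near 1, with some sampling noise.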
This repository contains a variety of content: some developed by Amélie Anglade, some derived from or largely inspired by third parties' work, and some entirely from third parties.
The third-party content is distributed under the license provided by those parties. Any derivative work respects the original licenses, and credits its initial authors.
Original content developed by Amélie Anglade is distributed under the MIT license.