Course Outline

Week 1 - Hands-on with the Terminal

Objective:

  • Able to navigate the file system in the Terminal using a shell
  • Create your first Python script and execute it

macOS:

  • Cmd+space to open Spotlight; search “Terminal” to open terminal

Shell commands:

  • cd to switch working folder
    • Path separated by /
    • Special paths: ., .., -, ~
  • ls to list files/folders in the current folder
  • pwd to check the current working folder
  • ls/pwd are your friends; type them often to make sure where you are
  • touch to create an empty new file; mkdir to create a new directory
  • python to execute Python scripts (usually named with a .py extension, but not necessarily)
  • Format of shell commands:
    • <command-name> <arg1> <arg2> ... (space-separated arguments)

Challenge:

  1. Write a Python script to output "Good evening" in the Terminal.
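
A minimal solution might look like the following; the filename good_evening.py is just a suggestion. Create the file with touch (or a text editor), put one line in it, and run it with python:

```python
# good_evening.py -- a minimal first script (the filename is a suggestion)
message = "Good evening"
print(message)
```

From the folder containing the file, run: python good_evening.py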

References:

Week 2 - Use Python as a daily tool

Objective:

  • Can use Python as a daily tool -- at least as a powerful calculator

Python language introduction:

  • Variables and assignment
  • Basic data types: int, float, str, bool
  • Arithmetic:
    • +, -, *, /, //, %, **
    • math, numpy (may need pip to install)
  • Use functions and modules: import; . notation; () notation.
  • Common modules
    • sys
    • numpy, scipy
    • str.* functions
    • random

Challenge:

  1. Build a mortgage calculator - given principal P, interest rate r and loan period n, calculate the amortised monthly payment A
  2. Calculate the area of a circle given its radius r
  3. Given the length of the hypotenuse of a right triangle, calculate the lengths of its legs. You may want to get values like sin(π/6) via numpy.pi and numpy.sin
  4. Generate 10 random numbers. (it is OK to run your script 10 times)
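
For challenge 1, the standard amortisation formula is A = P·r·(1+r)^n / ((1+r)^n − 1). A sketch with made-up loan figures, assuming r is the *monthly* interest rate and n the number of monthly payments:

```python
# Amortised monthly payment: A = P * r * (1+r)**n / ((1+r)**n - 1)
# Assumes r is the monthly rate and n the number of monthly payments.
P = 1_000_000       # principal (made-up figure)
r = 0.04 / 12       # 4% annual rate, compounded monthly (made-up figure)
n = 30 * 12         # 30-year loan, in months

A = P * r * (1 + r) ** n / ((1 + r) ** n - 1)
print(round(A, 2))  # roughly 4774 for these inputs
```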

References:

Week 3 - Python for anything

Objective:

  • Master the composite data type [] and {} in Python
  • Master the control logics in Python
  • Understand Python engineering

Python language:

  • help
  • bool and comparisons
    • str comparison and int comparison
  • Composite data types: list [], dict {}
  • Control flow:
    • for, while
    • if
    • try..except
  • Function, class, module:
    • def
    • class
    • *.py; from, import

Workflow:

  • Python interpreter
  • pip: pip3 for python3
    • --user option on shared computers

Challenge:

  1. Distances among cities:
    1. Calculate the "straight-line" distance on the Earth's surface from several source cities to Hong Kong. The source cities: New York, Vancouver, Stockholm, Buenos Aires, Perth. For each source city, print one line containing the name of the city and the distance. "Great-circle distance" is the term to search for to find the formula.
    2. Use a list and a for loop to handle multiple cities
    3. Use a function to increase reusability
  2. Divide HW1 groups randomly: (case contribution)
    1. Get the list of student IDs from the lecturer
    2. Generate the grouping randomly
  3. Solve the "media business model" calculator.
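
For challenge 1, the haversine formula is one common way to compute great-circle distance. A sketch combining a function, a dict of (lat, lon) pairs and a for loop; the coordinates below are approximate and should be double-checked against a proper source:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine formula: great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

HONG_KONG = (22.3, 114.2)  # approximate (lat, lon)

# Approximate coordinates (assumptions); refine with a gazetteer if needed.
cities = {
    "New York": (40.7, -74.0),
    "Vancouver": (49.3, -123.1),
    "Stockholm": (59.3, 18.1),
    "Buenos Aires": (-34.6, -58.4),
    "Perth": (-31.95, 115.86),
}

for name, (lat, lon) in cities.items():
    print(name, round(great_circle_km(lat, lon, *HONG_KONG)))
```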

References:

Week 4 - Web Scraping

Objective:

  • Understand the basics of HTML language, HTTP protocol, web server and Internet architecture
  • Able to scrape static web pages and turn them into CSV files

Tools: (step-by-step reference)

  • Virtualenv -- Create isolated environments to avoid projects cluttering each other
  • Jupyter notebook -- Web-based REPL; ready to distribute; all-in-one presentation

Modules:

  • Handle HTTP request/ response: requests
  • Parse web page: lxml, Beautiful Soup, HTMLParser, help(str)
    • Useful string functions: strip(), split(), find(), replace(), str[begin:end]
  • Serialiser: csv, json

Challenges: (save to *.csv)

  • Use lxml / bs4 with requests
  • Bonus:
    • Collect the tweets from a celebrity like this post. You can search "python twitter" for many useful modules online.
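
As a warm-up before reaching for lxml or bs4, the extract-then-serialise pipeline can be sketched with plain string functions plus the csv module. The HTML snippet below is made up, and for real pages a proper parser is strongly preferred over string slicing:

```python
import csv
import io

# A tiny static page standing in for a downloaded response body (made-up data).
html = """
<ul>
  <li><a href="/a">Alpha</a></li>
  <li><a href="/b">Beta</a></li>
</ul>
"""

# Extract (href, text) pairs with plain string functions;
# on real pages use lxml or Beautiful Soup instead.
rows = []
for chunk in html.split("<a href=")[1:]:
    href = chunk.split('"')[1]
    text = chunk.split(">")[1].split("<")[0]
    rows.append((href, text))

buf = io.StringIO()  # swap for open("links.csv", "w", newline="") to write a file
writer = csv.writer(buf)
writer.writerow(["href", "text"])
writer.writerows(rows)
print(buf.getvalue())
```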

References:

Further reading:

  • Study urllib as an alternative to requests
  • Study Regular Expression and re library in Python
  • See how reproducibility is improved with Jupyter notebook and other tools (not only Python).

Week 5 - JSON and API

Objective:

  • Reinforce the knowledge of scraper. Able to analyse and scrape normal web pages
  • Understand API/ JSON and can retrieve data from online databases (twitter, GitHub, weibo, douban, ...)

Modules:

  • Handle HTTP request/ response: requests
  • Serialiser: json

Challenges:

  • Taiwan had an earthquake in early Feb. Let's discuss this issue:
    • Search for the earthquake instances around Taiwan in the recent 100 years and analyse the occurrences of earthquakes. You can refer to the same database used here. Check out the API description. The count and query APIs are useful.
    • Search on Twitter and collect users' discussions about this topic. See if there are any findings. You can approach from the human interface here (hard mode) or use the python-twitter module (requires registering as a developer and obtaining an API key).
  • Retrieve and analyse recent movies. Douban's API will be helpful here.
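
The general API pattern is: fetch a JSON body, decode it with json.loads, then walk the nested dicts and lists. The field names below (features, properties, mag) mimic a typical earthquake-catalogue response but are assumptions; always check the actual API description first:

```python
import json

# A made-up response body shaped like a typical earthquake-catalogue API.
# Real APIs differ; verify the documented field names before relying on them.
body = '{"features": [{"properties": {"mag": 6.4, "place": "Taiwan"}},' \
       ' {"properties": {"mag": 5.1, "place": "Taiwan"}}]}'

data = json.loads(body)  # with requests: data = requests.get(url).json()
mags = [f["properties"]["mag"] for f in data["features"]]
print(len(mags), max(mags))
```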

Further readings:

Post-class note: Week 5 was spent strengthening the knowledge of scrapers. This section is left for self-study. It is not a dependency for future weeks; one can pick it up when needed.

Week 6 - Table manipulation and 1-D analysis

Objective:

  • Master the schema of "data-driven storytelling": the crowd (pattern) and the outlier (anomaly)
  • Can efficiently manipulate structured, table-formatted datasets
  • Use pandas for basic calculation and plotting

Modules:

  • pandas
  • seaborn
  • matplotlib

Statistics:

  • mean, median, percentile
  • min, max
  • variance
  • histogram
  • sort
  • central tendency and spread of data
  • Scatter plot and correlation
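
The statistics above map directly onto pandas Series methods. A sketch on a made-up column of scores:

```python
import pandas as pd

# A tiny made-up dataset standing in for a real table.
df = pd.DataFrame({"score": [52, 61, 61, 70, 74, 83, 97]})

print(df["score"].mean())          # central tendency
print(df["score"].median())
print(df["score"].quantile(0.9))   # percentile
print(df["score"].var())           # spread
print(df["score"].sort_values().tolist())
# df["score"].hist() draws a histogram when matplotlib is available
```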

Datasets to work on:

References:

Additional notes:

  • You need to finish Dataprep before analysis. That is, we start with structured data. There is no common schema for preparing structured and cleaned data. We have pointers in Dataprep for your own reading.

Week 7 - Text analysis

Objective:

  • Further strengthen the proficiency of pandas: DataFrame and Series
  • Learn to plot and adjust charts with matplotlib
  • Master basic string operations
  • Understand some major text mining models and be able to apply algorithm from 3rd party libraries.

Modules & topics:

  • str - basic string processing
    • .split(), in, .find()
    • %s format string
    • ''.format() function
  • collections.Counter for word frequency calculation
  • jieba - the most widely used Chinese word segmentation package.
  • (optional) re - Regular Expressions (regex) are the Swiss Army knife of text pattern matching.
  • (optional) nltk - contains common routines for text analysis
  • (optional) gensim - topic mining package. It also contains the Word2Vec routine.
  • (optional) Sentiment analysis - construct classifier using sklearn or use an API like text-processing. TextBlob is also useful and applied in group 2's work.
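
Word-frequency counting with str.split and collections.Counter fits in a few lines; the sample sentence is made up:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Tokenise with str.split, then count with collections.Counter.
freq = Counter(text.split())
for word, count in freq.most_common(3):
    print("{}: {}".format(word, count))  # or the older "%s: %d" % (word, count)
```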

Related cases:

References:

  • Construct Naive Bayes based classifier for sentiment analysis. Read here

Datasets to work on:

Week 8 - Time series

Objective:

  • Understand the principle of timestamp and datetime format
  • Master basic computation on datetime values
  • Understand periodical analysis (daily, weekly, monthly, seasonal, etc)
  • Can handle timezone conversion

Modules:

  • datetime
  • dtparser
  • pandas
    • basic visualisation .plot
    • zoom in/ out: .resample, .aggregate
  • seaborn

References:

  • Timestamps usually come in units of milliseconds (1/1000 of a second); divide by 1000 before parsing into datetime format.
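
A sketch of that conversion; the timestamp value is an arbitrary example:

```python
from datetime import datetime, timezone

ts_ms = 1514764800000  # a millisecond Unix timestamp (arbitrary example value)
# Divide by 1000 to get seconds before handing it to datetime.
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())
```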

Datasets:

Week 9 - Graph theory and social network analysis

Objective:

  • Understand the basics of graph theory
  • Understand most common applications in social network analysis
  • Can conduct graph analysis and visualisation in networkx

Graph metrics and algorithms:

  • Shortest path
  • Graph profiling: diameter, degree distribution, clustering coefficient
  • Centrality: degree, PageRank, betweenness, closeness, ...
  • Community detection
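
These metrics are one-liners in networkx, but the core idea of shortest path fits in plain Python. A BFS sketch on a made-up toy graph; networkx offers the same via nx.shortest_path(G, "A", "E") and G.degree():

```python
from collections import deque

# Toy undirected graph as an adjacency dict (made-up data).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists

print(shortest_path(graph, "A", "E"))
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
print(degrees)  # raw input for a degree distribution
```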

Challenges:

References:

Week 10 - 2D analysis and more on visualisations

Week 11 - High-dimensional analysis

Objective:

  • Understand correlation and causality. Can conduct visual (explorative) analysis of correlation
  • Can interpret common statistical quantities
  • Dimensionality reduction

Challenge:

  1. Explore the HK Legco voting records

Modules:

  • sklearn
    • decomposition.PCA
  • seaborn
  • (optional) scipy.stats, statsmodels
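
A minimal sketch of sklearn's PCA on made-up 2-D data that is nearly one-dimensional, showing how the explained variance ratio reveals the effective dimensionality:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2-D data lying mostly along a single direction.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)  # first component carries almost all variance
```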

References:


The following topics are TBC


Week 10 - Machine learning: clustering, classification and regression

Objective:

  • (TODO)

Week 12 - Project presentation

Objective:

  • Be able to efficiently sell your work after so much heavy-duty hard work!

Open topics

These topics may be discussed if there is plenty of Q&A time left in a given week. You are also welcome to explore them via the group project.

  • Cloud (AWS)
  • Deep learning
