Week 06 1-D and 2-D analysis

1. Preperations Before Data Analysis

Notes:

Before learning, we will use jupyter notebook here, please enter venv environment first and enter into jupyter notebook.

install pandas, seaborn, matplotlib, requests, csv

pip install pandas

pip install seaborn

pip install matplotlib

2. Use "Pandas" to do Data Analysis

Example: Today, We will use the data from Openrice as an example and do the restaurant analysis. Assuming that we have already got certain amount of data from Openrice and saved it into csv file.

Step1: Save csv file

Here is the link of csv life and the csv needs to be downloaded here.

https://github.com/hupili/python-for-data-and-media-communication/tree/master/w6-pandas
Click "raw" on the right upper corner.
You can see the raw csv file as below.
Click right and choose "save as"
Then the csv file can be saved as csv(comma-separated values).

Step2: Read csv file

Put csv file into the same folder with venv.
import pandas
Read csv file pandas.read_csv('openrice.csv')
The output will be as below:
If there is no header in the csv file.We can use Pandas as below to add proper headers for a form.
```
df = pandas.read_csv('openrice.csv', header=None, names=['name', 'location', 'price', 'style', 'type', 'likes'])
```
then the output will be like this:

Notes:
dfis short for "dataframe", is used in as return value in pandas.

Step3: Select data from csv

If you want to the first 10 data from the csv file. then you can use
```
df.head(10)
```
the output will be as blow:
If you want to select one column. You can use dataframe as a dictionary, use a key to refer to certain value. For example, you want all the restaurant locations.You can type:
```
df['location']
```
Then the output will be as below (the picture do not show all the locations due to the limited space):

Step4: Analysis just one dimension

One dimension is one column in a form.
You can use
df['location'].value_counts()
then the output will be as below, showing you how many likes each restaurant have got.
Then you may need to calculate certain dimension. For example, how many likes that each restaurant have got. First, you will get all "likes" column data as follow:

then, you need to to know the mean, media, percentile, min,max number of this dimension as below:
If you want to know how many restaurants having likes is 558, or less than 60, then you can use filter function:
df[df['likes'] == 558]
df[df['likes'] < 60]

the output will be as below:
then you can put these filter data into a distribution, using
df['likes'].hist()
and you can get a distribution like below:
if you want to see change parameter, you can use
df['likes'].hist(bins=20)

Step5: How to describe distribution

After you get the distribution, you can do some analysis. Compare the distribution with mean, media numbers.
If you need to compare price which is a interval.You need to pay special attention on numbers. Python recognize '$101-200'<'$51-100' because Python only compare the
numbers in sequence of each interval.

You need to convert each interval string into numbers, which means you need to choose a number to represent each interval to do comparison.
Here, we use "mapping" function
```
mapping = {
  '$101-200': 200,
  '$201-400': 400,
  '$51-100': 100,
  '$401-800': 800,
  '$50以下': 50
}
```

Now, you can use:

  original_string = '$60以下'
  mapping.get(orignal_string, 0)
  def cleaning(e):
  return mapping.get(e, 0)
  cleaning('$50以下')

Then you can use the coding below to transfer intervals into numbers.

df['price_num'].apply(cleaning)
If you want to select location of Mongkok.
```
  df1[
     df1['location'] == '旺角'
    ]
```
the output will be:
If you want to select the seafood restaurants with price number less than 100.
If you want to sort price from high to low.
```
  df.sort_values(by='price_num', ascending=False)
```
the output will be

Notes: Week 06

Week 06 1-D and 2-D analysis

1. Preperations Before Data Analysis

2. Use "Pandas" to do Data Analysis

results for ""

No results matching ""