Week 7 - Text analysis

Check the information in terminal

Run which pip3 to learn where pip3 is on your computer. You can also run which pip to find the location of pip.
ls -l lists the files in long format, with details.

The listing shows that python is the same as python3.6 and python3, but pip is not the same as pip3.
Use cat to print the content of both files and verify whether they are the same.

The contents are the same.
You can also drag the pip file and the pip3 file into Visual Studio Code, once which has told you where they are.
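
The same lookup can be made from inside Python. This is only a minimal sketch: which commands are found depends on your own machine, and the name 'no-such-command-xyz' is made up to show what a missing command looks like.

```python
import shutil
import sys

# shutil.which works like the shell's `which`:
# it returns the full path of a command, or None if it is not found
for name in ['python3', 'pip', 'pip3', 'no-such-command-xyz']:
    print(name, '->', shutil.which(name))

# The interpreter that is running this code
print(sys.executable)
```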

Simple steps to start Jupyter notebook in terminal.

4 Steps

Step1
pyvenv venv creates a virtual environment in a folder called 'venv'. You can name the folder as you like, for example pyvenv BIGDATA. On newer Python versions, where the pyvenv script has been removed, python3 -m venv venv does the same thing.
Be careful about where you create the folder.
Step2
source venv/bin/activate activates the virtual environment. You will then see '(venv)' in the prompt, which means you are inside the virtual environment.

You can leave the virtual environment by running deactivate.

Step3

pip install jupyter
pip install requests

Step4
Run jupyter notebook to open the notebook.

New

If you are on the same computer, or on your own laptop, you don't need to repeat all 4 steps.
Pili's virtual environment folder is 'environment'. So after he restarts his computer, he can just run
```
source environment/bin/activate
jupyter notebook
```

Install modules in jupyter

You can run pip install directly in a Jupyter notebook cell, for example !pip install requests.

For loop

Range()

The function range takes up to 3 parameters: from XX to XX with the step size XX. The end value itself is not included.
The 3rd parameter is the step size. If you leave it out, it defaults to 1.
The parameters can be negative.
'i' stays defined for the rest of the code, from the 'import' at the top to the end, not only inside the for loop.
If you define something inside a def, the definition only works inside that def.
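
A short sketch of the points above. The function f and the variable inside are hypothetical names used only for illustration.

```python
# range(start, stop, step): stop itself is not included
print(list(range(5)))          # step defaults to 1 -> [0, 1, 2, 3, 4]
print(list(range(10, 2, -2)))  # a negative step counts down -> [10, 8, 6, 4]

for i in range(10, 2, -2):
    pass
print(i)  # i still exists after the loop and holds the last value, 4

def f():
    inside = 'only visible inside f'
    return inside

f()
# print(inside) at this point would raise NameError,
# because 'inside' was defined inside the def
```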

Append VS Extend

In the for loop, i will be 10, 8, 6 and finally 4. If you only assign i each time, only the last value is left.

append means add. Every time you get an 'i', you add it to the 'pages' list.

When you append a list, the whole list is added as a single element.
extend takes the items out of a list and adds them one by one to the other list.
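
The difference is easiest to see side by side. A minimal sketch; the lists a and b are made-up examples.

```python
pages = []
for i in range(10, 2, -2):
    pages.append(i)      # keep every value, not just the last one
print(pages)             # [10, 8, 6, 4]

a = [1, 2]
a.append([3, 4])         # the list [3, 4] becomes one element
print(a)                 # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])         # the items 3 and 4 are added one by one
print(b)                 # [1, 2, 3, 4]
```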

Verify every step -splinter python

Fail to save content.

r=requests.get('XXXXthe website')

The above step is OK.

 r.text
 open('mypage','w').write(r.text)

Then you will find a file named 'mypage'.

If you open it, you will find that it is blank, which means the content you saved was blank.
So verify everything step by step.
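
One way to verify before saving is to check that the text is not blank. This is only a sketch: save_page is a hypothetical helper, not something from the class, and the requests call is left as a comment because the URL is not given in these notes.

```python
def save_page(text, path):
    """Write text to path only if it is not blank; return characters written."""
    if not text or not text.strip():
        print('Nothing to save: the text is blank')
        return 0
    with open(path, 'w', encoding='utf-8') as f:
        return f.write(text)

# Usage with requests (url is whatever page you are fetching):
# r = requests.get(url)
# print(r.status_code, len(r.text))   # look at these before saving
# save_page(r.text, 'mypage')
```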

Resolution: splinter

Splinter is a Python library that drives a real browser to emulate a real person, so the website cannot easily tell whether you are a human or a robot.
You can google it to learn more.

Exercise: Twitter Troll Data

Pili's GitHub
https://github.com/hupili/python-for-data-and-media-communication
Exercise Data:
https://github.com/hupili/python-for-data-and-media-communication/blob/master/w7-text/Twitter Troll data from NBC (nltk).ipynb
Researchers recovered the deleted tweets from an archive.

Download the file.

Control-click (or right-click) the link and choose 'Save link as ...'.
Then drag the file into your working folder, or download it straight into that folder.

Dataframe

import pandas as pd
df= pd.read_csv('XXXXXXXXXXXXXX.csv')

read_csv can also load the file directly from an 'https://XXXX' link.

import pandas as pd
df= pd.read_csv('https://XXXXXXXXXXXXXXXXX')

```
df.sample(10)
```
sample(10) returns 10 randomly chosen rows. It is useful when your dataset is very large and running the code on all of it would be slow.
```
df['user_key'].value_counts()
```
Count the most active users, the ones who post the largest number of messages.
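
Both calls can be tried on a toy table first. The data below is made up; only the column names 'user_key' and 'text' follow the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['alice', 'bob', 'alice', 'carol', 'alice', 'bob'],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# 3 random rows; the rows differ from run to run
print(df.sample(3))

# Number of rows per user, most frequent first
counts = df['user_key'].value_counts()
print(counts)
```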
```
'a' in 'am'
```
in checks whether one string is contained in another.
```
'abc'.find('b')
```
find returns the index of the first match, counting from 0. As you can see from [46], spaces count as characters too. If the substring is not found, find returns -1. Used as an index, -1 means the last character.
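
A small sketch of in and find; the sentence 'I am here' is a made-up example.

```python
text = 'I am here'

print('am' in text)       # True
print(text.find('am'))    # 2: 'I' is index 0, the space is index 1
print(text.find(' '))     # 1: a space is a character too
print(text.find('xyz'))   # -1: not found
print(text[-1])           # 'e': as an index, -1 is the last character
```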

Apply a function onto every element of the dataframe.

```
df.apply()
```
def defines a function called check_name, which checks whether 'amXXX' is in x. If it is, the function returns 'amXXX'.
x is just a variable name.
apply runs the function on every value in the 'text' column. In other words, x is one 'text' value at a time.
The second line raises an error because 'text' contains some dirty data: values that are not strings.

Convert text by str(),lower().

```
str(x)
```
```
.value_counts()
```
value_counts() shows how many times each value appears. Here the counts are the same, which means there are still some errors.
```
lower()
upper()
```
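
The dirty-data error and the str(x).lower() fix can be reproduced on a toy Series. This is only a sketch: the three values are made up, and check_name_raw is a hypothetical name for the version without the conversion.

```python
import pandas as pd

# One missing value (NaN is a float) among the strings
texts = pd.Series(['RT @TEN_GOP: news', float('nan'), 'hello'])

def check_name_raw(x):
    return 'ten_gop' in x               # breaks on the NaN

def check_name(x):
    return 'ten_gop' in str(x).lower()  # str() handles NaN, lower() ignores case

raised = False
try:
    texts.apply(check_name_raw)
except TypeError as e:
    raised = True
    print('dirty data:', e)

result = texts.apply(check_name).tolist()
print(result)
```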

Use the previous step as a filter

```
df[df['XXX'] ]
```
[61] is a filter. Now it works, and we can find out how many times they are retweeted.

df['text'].apply(check_name).value_counts()[True]

We extract the count for True.

Find the most one retweeted - by function.

def check_name(x):
  return 'ten_gop' in str(x).lower()
df['text'].apply(check_name).value_counts()[True]

This is the previous step.

def count_retweeted_number(name):
  def check_name(x):
    return name in str(x).lower()
  return df['text'].apply(check_name).value_counts()[True]

Now we wrap the previous step in a function. In the inner function, the hard-coded 'ten_gop' is replaced by the parameter name.
```
count_retweeted_number('XXX')
```
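
A runnable sketch of the function on a made-up table of four tweets. Only the column names and 'ten_gop' come from the exercise; the rest is assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['ten_gop', 'alice', 'bob', 'alice'],
    'text': ['hello world', 'RT @TEN_GOP: news', None, 'rt @ten_gop again'],
})

def count_retweeted_number(name):
    def check_name(x):
        # name comes from the outer function; str() copes with missing text
        return name in str(x).lower()
    return df['text'].apply(check_name).value_counts()[True]

n = count_retweeted_number('ten_gop')
print(n)  # two of the four texts mention ten_gop
```

Note that name has to be given in lower case, because the text is lowered before the check.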

Apply the function into all the names.

Try1: Fail

df['user_key']

It is a Series.
```
s_user = df['user_key'].value_counts()
```
value_counts shows how many times each user appears. 's_user' works like a dictionary: the names are the keys and the counts are the values.
```
s_user.apply(count_retweeted_number)
```
apply only passes the values of a Series to the function, not the index.
We want to apply the function to every 'user_key', but there is an error: apply works on the values of 's_user', which are the integer counts shown in [75]. So we have to turn the names into the values of a Series first. Then we can apply the function to the names.

Try2: change the name as the value of the Series

```
s_user.index
s_user.values
```
This checks the index and the values. They correspond to each other one by one.
```
df_user = s_user.to_frame().reset_index()
```
to_frame converts a Series into a DataFrame.
reset_index adds a new integer index. The former index becomes a regular column named 'index'.
```
df_user['index'].apply()
```
This step raises an error.

If no text contains the name, value_counts() has no True entry, so value_counts()[True] raises a KeyError.

Try3: succeed

As we wrote before:
```
s_user = df['user_key'].value_counts()
```
```
.get()
```
[87] looks up something that appears in the content.
[88] is the same.
[89] looks up something that does not exist in the content.
[90] and [91] show how to change what is returned in the 'false' case. By default, get returns None. We can change that with the 2nd parameter. It is better to set it to 0 in this example.
```
.get(True,0)
```
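
The effect of the 2nd parameter can be checked on two tiny Series of made-up True/False values:

```python
import pandas as pd

counts = pd.Series([True, False, True]).value_counts()
print(counts.get(True))           # 2: True appears twice

only_false = pd.Series([False, False]).value_counts()
print(only_false.get(True))       # None: there is no True entry
print(only_false.get(True, 0))    # 0: the default we asked for
```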
```
sort_values(by='user_key',ascending=False)
```
We can find out who tweeted the largest number of tweets.
```
sort_values(by='count',ascending=False)
```
We can find out who is retweeted most.

The function runs once per user, so it will execute 454 times. It really takes a long time to finish the whole code.

Save time

You can interrupt the run.
Or you can run it for the top 20 users only.
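
Put together, the whole flow might look like the sketch below, on made-up data. One assumption to note: the frame is built with explicit column names 'name' and 'tweets' (not the names used in class), because the column names produced by to_frame().reset_index() differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['ten_gop', 'alice', 'alice', 'bob'],
    'text': ['hello', 'RT @ten_gop: news', 'rt @TEN_GOP again', 'RT @alice: hi'],
})

def count_retweeted_number(name):
    def check_name(x):
        return name in str(x).lower()
    # get(True, 0) returns 0 instead of failing when nothing matches
    return df['text'].apply(check_name).value_counts().get(True, 0)

s_user = df['user_key'].value_counts()
df_user = pd.DataFrame({'name': s_user.index, 'tweets': s_user.values})

# Only the 20 most active users, to save time on a large dataset
df_user = df_user.head(20).copy()
df_user['count'] = df_user['name'].apply(count_retweeted_number)
df_user = df_user.sort_values(by='count', ascending=False)
print(df_user)
```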

Calculate the frequent terms

Get text

Get the text.
```
.split()
```
With no argument, split() splits on whitespace. Pass a separator, such as ',', to split on commas instead.
```
[:10]
```
Get the text of the first 10 items.
```
extend()
```
Take the text of the first 5 items and split each one by whitespace. Then extend adds the resulting words to the list 'all_text'.
If we drop the '[:5]' and run it for the whole text, there is an error.

We have to convert the text to str first.
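
A minimal sketch of this step with three made-up texts, one of them missing. Note the side effect of str(): the missing value ends up in the list as the word 'nan'.

```python
texts = ['RT the news', float('nan'), 'the end of the story']

all_text = []
for t in texts:
    # str() keeps split() from failing on the missing value
    all_text.extend(str(t).split())

print(all_text)
```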

Word count

```
pd.Series()
```
Convert 'word count' into a Series, and reset index.
```
.to_frame().reset_index()
```
Convert into a dataframe.
```
sort_values(ascending=False)
```
The top words are not informative, because so many of them are 'stop words'. We can remove those words manually.
```
set(['RT', 'the', 'of'])
```
A set is more efficient than a list for checking in or not in.
Search Google and you will find 'stop word' resources.
NLTK:

Stop_word

Step1

def is_stop_word(x):
  return x in stop_words

Step2

df_word_count[df_word_count['index'].apply(is_stop_word)]

Step3
```
.sort_values(by=0,ascending=False)
```
Step4
```
df_word_count[~df_word_count['index'].apply(is_stop_word)]
```
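
The four steps can be sketched on a short made-up word list. The column names 'word' and 'count' are assumptions chosen for this example; in class the columns come from reset_index().

```python
import pandas as pd

all_text = ['RT', 'the', 'news', 'of', 'the', 'day', 'news', 'news']
stop_words = set(['RT', 'the', 'of'])

def is_stop_word(x):
    return x in stop_words

# Count every word, then put the result in a DataFrame
word_count = pd.Series(all_text).value_counts()
df_word_count = pd.DataFrame({'word': word_count.index, 'count': word_count.values})

# ~ flips True and False, so only words that are NOT stop words remain
informative = df_word_count[~df_word_count['word'].apply(is_stop_word)]
informative = informative.sort_values(by='count', ascending=False)
print(informative)
```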

Word cloud

Jieba

```
jieba.cut()
```
jieba.cut() returns a generator, which means we have to convert it into a list.

Pandas plotting

Learn from others by searching Google.
Pandas can be more powerful than Excel. First of all, let's start from the Excel functions.
