which pip3
shows where pip3 is installed on your computer. You can also run which pip
to find the location of pip.
ls -l
lists the files in long format, with details for each file.
From the screenshot above, you can see that python points to python3.6, the same as python3. However, pip does not point to pip3.
cat
to check the file contents and verify whether they are the same. Here, they obviously are.
You can also drag the pip and pip3 files into Visual Studio Code to inspect them, after using which
to find their locations.
Step 1: pyvenv venv
creates a virtual environment folder called 'venv'. You can name the folder whatever you like, e.g. pyvenv BIGDATA.
Be careful about where you create the folder.
Step 2: source venv/bin/activate
activates the virtual environment. You will then see '(venv)' in your prompt, which means you are inside the virtual environment.
Run deactivate
to leave the virtual environment.
Step 3:
pip install jupyter
pip install requests
Step 4: jupyter notebook
opens the notebook.
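The four steps above can be sketched as one shell session (the folder name 'venv' is just an example; on newer Python versions the deprecated pyvenv command is replaced by python3 -m venv):

```shell
python3 -m venv venv          # Step 1: create the environment folder
source venv/bin/activate      # Step 2: activate it; the prompt now shows "(venv)"
pip install jupyter           # Step 3: install packages inside the environment
pip install requests
jupyter notebook              # Step 4: open the notebook
deactivate                    # leave the environment when you are done
```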
If you are on the same computer, or on your own laptop, you don't need to repeat all four steps every time.
Pili's virtual environment folder is called 'environment'. So after restarting his computer, he can simply run:
source environment/bin/activate
jupyter notebook
The parameters can be negative.
'i' is defined for the whole script, from the first 'import' to the end, not only inside the for loop.
If you define something inside a def
, the definition only works inside that def.
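A minimal sketch of the scope difference:

```python
# A variable defined at the top level (even inside a for loop) stays visible afterwards.
for i in range(3):
    pass
print(i)           # i still exists after the loop; its last value is 2

def f():
    j = 10         # j only exists inside this def
    return j

print(f())         # 10
# Using j out here would raise NameError: name 'j' is not defined
```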
append
means add. Every time you have an 'i', you add it into the 'pages' list. If you append a whole list, the entire list is added as a single element, or one item.
extend
means take the items out of the other list and add each of them individually into the list.
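A short sketch of the difference between the two:

```python
pages = []
for i in range(3):
    pages.append(i)      # each i is added as one element
print(pages)             # [0, 1, 2]

a = [1, 2]
a.append([3, 4])         # the whole list becomes ONE item
print(a)                 # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])         # the items are added one by one
print(b)                 # [1, 2, 3, 4]
```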
r = requests.get('XXXXthe website')
r.text
open('mypage', 'w').write(r.text)
If you open 'mypage', you will find it is blank, which means the file you saved is empty.
So try to verify everything step by step.
Splinter drives a real browser to emulate a real person, so the website cannot tell whether you are a human or a robot.
You can Google it to learn more.
Pili's GitHub:
https://github.com/hupili/python-for-data-and-media-communication
Exercise Data:
https://github.com/hupili/python-for-data-and-media-communication/blob/master/w7-text/Twitter Troll data from NBC (nltk).ipynb
Research: get the deleted data from the archive.
Control + click (right click), then choose 'Save Link As...'.
Drag the file into our working folder, or just download it into the folder.
import pandas as pd
df= pd.read_csv('XXXXXXXXXXXXXX.csv')
The file can also be opened directly from a URL ('https://XXXX links').
import pandas as pd
df= pd.read_csv('https://XXXXXXXXXXXXXXXXX')
df.sample(10)
It randomly prints 10 sample rows. This is useful when your dataset is very large and running the code on all of it would be slow.
df['user_key'].value_counts()
Count the messages per user. The top users are the ones who posted the largest number of messages.
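A toy sketch (the user names and texts here are made up, standing in for the real CSV):

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['user_a', 'user_b', 'user_a', 'user_a'],
    'text': ['tweet 1', 'tweet 2', 'tweet 3', 'tweet 4'],
})

print(df.sample(2))                    # 2 random rows, a quick preview
print(df['user_key'].value_counts())   # messages per user, most active first
```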
'a' in 'am'
in
checks whether something is contained in the text.
'abc'.find('b')
It shows the index, which starts from 0. You can see from [46] that the space also counts. find returns '-1' when the substring is not found; as an index, -1 means the last item.
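A few quick checks you can run yourself:

```python
assert 'a' in 'am'            # substring containment
assert 'abc'.find('b') == 1   # index of 'b'; counting starts at 0
assert ' abc'.find('b') == 2  # the leading space counts too
assert 'abc'.find('z') == -1  # find returns -1 when nothing is found
assert 'abc'[-1] == 'c'       # as an index, -1 means the last character
```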
df.apply()
def
defines a function called check_name, which checks whether 'amXXX' is in x; if it is, it returns 'amXXX'.
x is just a variable.
apply
makes the function run on every 'text' value in the dataframe. In other words, x takes each 'text' value in the example.
There is an error in the second line, because there is some dirty data in 'text'.
str(x)
.value_counts()
It checks how many times each value appears. The results are the same, which means there are some errors.
lower()
upper()
df[df['XXX'] ]
[61] is a filter. Now it works: we have successfully found out how many times they are retweeted.
df['text'].apply(check_name).value_counts()[True]
We extract the count for True.
def check_name(x):
    return 'ten_gop' in str(x).lower()
df['text'].apply(check_name).value_counts()[True]
It is the previous step.
def count_retweeted_number(name):
    def check_name(x):
        return name in str(x).lower()
    return df['text'].apply(check_name).value_counts()[True]
Now we wrap the previous code in a function. In the inner function, we replace the hard-coded string with the parameter name.
count_retweeted_number('XXX')
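A runnable sketch with a toy dataframe (the texts are made up; the dirty integer row shows why str() is needed):

```python
import pandas as pd

df = pd.DataFrame({'text': ['RT @TEN_GOP: hello', 'unrelated tweet',
                            'rt @ten_gop again', 12345]})

def count_retweeted_number(name):
    def check_name(x):
        # use the parameter `name` (no quotes); str() survives the dirty row
        return name in str(x).lower()
    return df['text'].apply(check_name).value_counts()[True]

print(count_retweeted_number('ten_gop'))   # 2
```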
df['user_key']
It is a Series.
s_user=df['user_key']
value_counts
just shows how many times each value appears. 's_user' behaves like a dictionary.
s_user.apply(count_retweeted_number)
apply
is a function that only works on the values.
We apply the function to every 'user_key'. But there is an error, because we are applying it to the values of 's_user', which are obviously integers (the counts) in [75]. So we have to make the names the values of the Series; then we can apply the function to the names.
s_user.index
s_user.values
It checks the index and the values. They correspond to each other.
s_user.to_frame().reset_index()
to_frame
converts a Series into a DataFrame. reset_index
adds a new numeric index; the former index becomes a column of values whose column name is 'index'.
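A small sketch of the reshaping (toy user names):

```python
import pandas as pd

s_user = pd.Series(['user_a', 'user_b', 'user_a']).value_counts()
print(s_user.index)     # the user names
print(s_user.values)    # the counts, e.g. [2, 1]

df_user = s_user.to_frame().reset_index()   # note the () after to_frame
print(df_user)          # the 'index' column holds the names, the other the counts
```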
df_user['index'].apply()
The error is in the picture below:
In this step, if the answer is False (the key does not exist), there will be an error.
As we write before:
s_user=df['user_key']
.get()
[87] is something that appears in the content.
[88] is the same.
[89] does not exist in the content.
[90] and [91] show how to change what is returned when the key is missing. By default it is empty (None); we can change it with the 2nd parameter. It is better to set it to 0 in this example.
.get(True,0)
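A minimal sketch of why .get(True, 0) is safer than [True]:

```python
import pandas as pd

counts = pd.Series(['x', 'x', 'y']).apply(lambda v: v == 'x').value_counts()
print(counts.get(True, 0))       # 2, because True appears in the index

no_match = pd.Series(['y', 'y']).apply(lambda v: v == 'x').value_counts()
# no_match[True] would raise KeyError, since nothing matched
print(no_match.get(True, 0))     # 0, the default from the 2nd parameter
```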
sort_values(by='user_key',ascending=False)
We can find out who tweeted the largest number of tweets.
sort_values(by='count',ascending=False)
We can find out who is retweeted most.
So it would execute 454 times, which takes a long time to finish. You can interrupt it and run just the top 20 instead.
Get the text.
.split()
Split on whitespace (spaces) by default.
[:10]
Get the first 10 items' text.
extend()
Take the first 5 items' text, split each by whitespace, and then add the resulting words into the list 'all_text'.
If we run it on the whole text by removing '[:5]', there is an error. We have to convert the text into str first.
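The word-counting steps above, sketched on toy rows (the float plays the role of the dirty data):

```python
import pandas as pd

texts = ['RT hello world', 'hello there', 3.14]
all_text = []
for t in texts[:2]:                  # drop the slice to process every row
    all_text.extend(str(t).split())  # split() cuts on whitespace; str() avoids the error
word_count = pd.Series(all_text).value_counts()
print(word_count)                    # 'hello' appears twice
```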
pd.Series()
Convert 'word_count' into a Series, and reset the index.
.to_frame().reset_index()
Convert into a dataframe.
sort_values(ascending=False)
The top results are not informative, because they are mostly 'stop words'. We can delete those words manually.
set(['RT', 'the', 'of'])
set
makes membership checks (in / not in) much more efficient than a list.
You can search Google to find 'stop word' resources.
NLTK:
Step1
def is_stop_word(x):
return x in stop_words
Step2
df_word_count[df_word_count['index'].apply(is_stop_word)]
Step3
.sort_values(by=0,ascending=False)
Step4
Negate the filter with '~' to keep only the words that are not stop words.
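The four steps combined on toy data (the stop-word set here is a tiny hand-made stand-in for NLTK's list; the count column is looked up by position because its name differs across pandas versions):

```python
import pandas as pd

stop_words = set(['rt', 'the', 'of'])    # toy list; NLTK provides a real one

def is_stop_word(x):
    return x in stop_words

words = ['the', 'rt', 'kremlin', 'the', 'troll', 'kremlin', 'kremlin']
df_word_count = pd.Series(words).value_counts().to_frame().reset_index()

count_col = df_word_count.columns[1]     # 0 in older pandas, 'count' in newer
informative = df_word_count[~df_word_count['index'].apply(is_stop_word)]
informative = informative.sort_values(by=count_col, ascending=False)
print(informative)                       # kremlin (3), then troll (1)
```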
jieba.cut()
It returns a generator, so we have to convert it into a list.
Please learn to search Google and learn from others.
Pandas can be more powerful than Excel. First of all, let's start from the Excel functions.