Notes: Week 07

Week 7 - Text analysis

Check the information in the terminal

  • which pip3 shows where pip3 is located on your computer. You can also run which pip to find the location of pip.

  • ls -l lists the files in long format, that is, with details.

    From the screenshot above, you can see that python is the same as python3.6, and so is python3. pip, however, is a separate file from pip3.

  • cat prints the content of each file, so you can verify whether they are the same.

    Their contents are the same.

  • You can also drag the pip file and the pip3 file into Visual Studio Code, after you use which to find their location.

Simple steps to start Jupyter Notebook in the terminal

4 Steps
  • Step1
    pyvenv venv creates a virtual environment (venv) in a folder called 'venv'. You can name the folder as you like, for example pyvenv BIGDATA. On newer Python versions, where pyvenv is no longer available, python3 -m venv venv does the same.
    Be careful where you create the folder.

  • Step2
    source venv/bin/activate activates the virtual environment. You will then see '(venv)' in the prompt, which means you are inside the virtual environment.

    You can leave the virtual environment again with deactivate.

  • Step3

    pip install jupyter
    pip install requests
    
  • Step4
    jupyter notebook opens the notebook.

New
  • If you are on the same computer, or on your own laptop, you don't need to repeat all 4 steps.

  • Pili's virtual environment folder is 'environment'. So if he restarts his computer, he can just run:

    source environment/bin/activate
    jupyter notebook
    

Install modules in jupyter

  • You can also install packages from a cell in a Jupyter notebook, for example !pip install requests.

For loop

range()
  • The function range takes up to 3 parameters: from XX, up to (but not including) XX, with step size XX.
    The 3rd parameter is the step size. If you omit it, it defaults to 1.

  • The parameters can be negative.


  • 'i' is defined for the whole script, not only inside the for loop: after the loop ends, it still exists and holds its last value.

  • If you define something inside a def, the definition only exists inside that def.
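
The points about range and about where a name is defined can be checked with a short sketch. The function set_local and the variable inside are made-up names for illustration only.

```python
# range(start, stop, step): stop is not included, step defaults to 1.
print(list(range(5)))          # [0, 1, 2, 3, 4]
print(list(range(10, 2, -2)))  # [10, 8, 6, 4]

for i in range(3):
    pass
# i still exists after the loop and holds the last value.
print(i)  # 2

def set_local():
    inside = 'only visible inside set_local'
    return inside

set_local()
# 'inside' was defined inside the def, so it does not exist out here.
print('inside' in globals())  # False
```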

Append VS Extend

  • In the for loop, i will be 10, then 8, then 6, and finally 4. After the loop, only the last value is left in i.

  • append means add. Every time you get an 'i', you add it to the 'pages' list.

  • If you append a list, the whole list is added as a single element, or one item.

  • extend takes the items out of the given list and adds them, one by one, to the existing list.
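
A minimal sketch of the difference, assuming the loop in class was range(10, 2, -2) (which gives 10, 8, 6, 4). The list names appended and extended are made up for illustration.

```python
pages = []
for i in range(10, 2, -2):
    pages.append(i)      # add one item per loop
print(pages)             # [10, 8, 6, 4]
print(i)                 # 4, only the last value is left in i

appended = [1, 2]
appended.append([3, 4])  # the whole list becomes one element
print(appended)          # [1, 2, [3, 4]]

extended = [1, 2]
extended.extend([3, 4])  # each item is added separately
print(extended)          # [1, 2, 3, 4]
```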

Verify every step - splinter python

Failed to save content.

import requests
r = requests.get('XXXXthe website')
  • The above step is OK.
r.text
open('mypage', 'w').write(r.text)
  • Then you will find a file named 'mypage'.


If you open it, you will find that it is blank, which means the content you wanted was never saved.
So verify every step as you go.
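
One way to verify the saving step is to check the text before writing and read the file back afterwards. This is only a sketch: save_and_verify is a hypothetical helper, and page_text stands in for r.text so that the example runs without a network connection.

```python
import os
import tempfile

def save_and_verify(text, path):
    """Write text to path, then read it back and check it is not blank."""
    if not text.strip():
        raise ValueError('nothing to save: the text is blank')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, encoding='utf-8') as f:
        saved = f.read()
    if saved != text:
        raise IOError('the saved file does not match the text')
    return len(saved)

# Stand-in for r.text (made-up content).
page_text = '<html><body>hello</body></html>'
path = os.path.join(tempfile.mkdtemp(), 'mypage')
print(save_and_verify(page_text, path))  # number of characters saved
```

If the text is blank, the helper raises an error at once instead of silently writing an empty file.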

Resolution: splinter
  • Splinter is a Python tool that drives a real browser, emulating a real person, so the website is less likely to know whether you are a human or a robot.

  • You can search online to learn how to use it.

Exercise: Twitter Troll Data

Download the file.
  • Control-click or right-click the link and choose 'save the file (link) as ...'.

  • Drag the file into your working folder, or download it straight into that folder.

Dataframe
  • import pandas as pd
    df = pd.read_csv('XXXXXXXXXXXXXX.csv')
    

  • read_csv can also open the file directly from an 'https://XXXX' link.

    import pandas as pd
    df = pd.read_csv('https://XXXXXXXXXXXXXXXXX')
    
  • df.sample(10)
    


    It returns 10 randomly chosen rows. This is useful when your dataset is very large and running code on all of it would be slow.

  • df['user_key'].value_counts()
    


    It counts how many rows each user has, with the largest count first, so the most active users (those who post the largest number of messages) appear at the top.
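
A small sketch of sample and value_counts. The dataframe below is made-up toy data standing in for the real file; only the column names 'user_key' and 'text' come from the notes.

```python
import pandas as pd

# Toy data standing in for the troll dataset (made-up values).
df = pd.DataFrame({
    'user_key': ['alice', 'bob', 'alice', 'carol', 'alice', 'bob'],
    'text': ['hi', 'RT @alice hi', 'news', 'RT @bob yo', 'more', 'ok'],
})

print(df.sample(3))        # 3 random rows, different on every run

counts = df['user_key'].value_counts()
print(counts)              # most frequent user first
print(counts.index[0])     # alice
print(counts.iloc[0])      # 3
```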

  • 'a' in 'am'
    


    in checks whether the first string is contained in the second.

  • 'abc'.find('b')
    


    It returns the index of the first match, counting from 0. As [46] shows, spaces are counted too. find returns -1 when the text is not found (when used as an index, -1 means the last item).
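
The two string checks can be tried directly. The strings are arbitrary examples.

```python
print('a' in 'am')        # True
print('z' in 'am')        # False

print('abc'.find('b'))    # 1, indexes start from 0
print('a bc'.find('b'))   # 2, the space counts as a character
print('abc'.find('z'))    # -1, not found
print('abc'[-1])          # c, as an index -1 means the last character
```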

Apply a function to every element of the dataframe.
  • df.apply()
    


  • def defines a function called check_name, which checks whether 'amXXX' is in x. If it is, the function returns 'amXXX'.

  • x is just a parameter name; it stands for whatever value is passed in.

  • apply runs the function on every value of 'text' in the dataframe. In other words, x is each 'text' value in turn in the example.

  • The second line raises an error, because there is some dirty data in 'text' (values that are not strings).
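
A sketch of apply and of the kind of error that dirty data causes. It assumes the dirty values are missing values (NaN), which pandas stores as floats; the actual cause in the class dataset is not shown in the notes. This check_name variant returns True or False, and the two Series are made-up.

```python
import pandas as pd

def check_name(x):
    # x is one value from the column
    return 'ten_gop' in x

clean = pd.Series(['RT @TEN_GOP hello', 'rt @ten_gop again', 'nothing'])
print(clean.apply(check_name).tolist())  # [False, True, False]

# A missing value is a float (NaN), and 'in' does not work on a float.
dirty = pd.Series(['rt @ten_gop again', float('nan')], dtype=object)
try:
    dirty.apply(check_name)
except TypeError as err:
    print('TypeError:', err)
```

The first text gives False because the check is case-sensitive at this point.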

Convert text with str() and lower().
  • str(x)
    

  • .value_counts()
    


    It checks how many times each result appears. The counts are the same, which means there are still some errors.

  • lower()
    upper()
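
A sketch of how str() and lower() make the check robust. The Series is made-up and deliberately contains a missing value and a number.

```python
import pandas as pd

def check_name(x):
    # str() makes any value searchable; lower() ignores upper/lower case.
    return 'ten_gop' in str(x).lower()

text = pd.Series(['RT @TEN_GOP hello', 'rt @ten_gop again', float('nan'), 12],
                 dtype=object)
result = text.apply(check_name)
print(result.tolist())        # [True, True, False, False]
print(result.value_counts())  # True and False both appear 2 times
```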
    


Use the previous step as a filter
  • df[df['XXX']]
    


  • [61] is a filter. Now it works. We have found out how many times they are retweeted.

  • df['text'].apply(check_name).value_counts()[True]
    

  • value_counts() counts the True and False results; [True] extracts the count of True.
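
A sketch of using the True/False result as a filter, on made-up data. The name mask is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['alice', 'bob', 'carol'],
    'text': ['RT @TEN_GOP hello', 'nothing here', 'rt @ten_gop again'],
})

def check_name(x):
    return 'ten_gop' in str(x).lower()

mask = df['text'].apply(check_name)   # one True/False per row
print(df[mask])                       # keeps only the rows where mask is True
print(mask.value_counts()[True])      # 2
```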

Find the most retweeted user - with a function.
  • def check_name(x):
      return 'ten_gop' in str(x).lower()
    df['text'].apply(check_name).value_counts()[True]
    

    This is the previous step.

  • def count_retweeted_number(name):
      def check_name(x):
          return name in str(x).lower()
      return df['text'].apply(check_name).value_counts()[True]
    

  • Now we wrap the previous step in a function. In the inner function, the fixed string 'ten_gop' is replaced by the parameter name, written without quotes so that it refers to the variable.

  • count_retweeted_number('XXX')
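
A runnable sketch of the wrapped function on made-up data. The names 'ten_gop' and 'bob' are just sample inputs.

```python
import pandas as pd

df = pd.DataFrame({'text': ['RT @TEN_GOP hello', 'rt @ten_gop again', 'RT @bob yo']})

def count_retweeted_number(name):
    def check_name(x):
        # name comes from the outer function; no quotes, it is a variable
        return name in str(x).lower()
    return df['text'].apply(check_name).value_counts()[True]

print(count_retweeted_number('ten_gop'))  # 2
print(count_retweeted_number('bob'))      # 1
```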
    

Apply the function to all the names.
Try1: Fail
  • df['user_key']


    It is a Series.

  • s_user=df['user_key']
    



    value_counts just shows how many times each user appears. 's_user' works like a dictionary: each name (the index) maps to a count (the value).

  • s_user.apply(count_retweeted_number)
    


    apply only works on the values of a Series, not on its index.
    We apply the function to all of 'user_key', but there is an error, because we are applying it to the values of 's_user', which are integers, as [75] shows. So we have to make the names the values of the Series. Then we can apply the function to the names.

Try2: make the names the values of the Series
  • s_user.index
    s_user.values
    


    This checks the index and the values. They correspond to each other one to one.

  • s_user.to_frame().reset_index()
    


    to_frame converts the Series into a DataFrame.
    reset_index adds a new index. The former index then becomes an ordinary column, whose column name is 'index'.
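
A sketch of turning the index into a column, assuming s_user holds the result of value_counts() on made-up data. The column names produced by reset_index differ between pandas versions, so this sketch sets them explicitly; 'name' and 'tweets' are illustrative names, not the ones from class.

```python
import pandas as pd

df = pd.DataFrame({'user_key': ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']})
s_user = df['user_key'].value_counts()
print(s_user.index.tolist())    # the names
print(s_user.values.tolist())   # the counts

df_user = s_user.to_frame().reset_index()
# Set the column names explicitly so the result is the same in every version.
df_user.columns = ['name', 'tweets']
print(df_user)
```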

  • df_user['index'].apply()
    



  • The error is in the picture below:

    In this step, if no text matches a name (the result contains no True), looking up [True] raises an error.

Try3: succeed
  • As we wrote before:

    s_user=df['user_key']
    


  • .get()
    





    [87] looks up something that appears in the content.
    [88] is the same.
    [89] looks up something that does not exist in the content.
    [90] and [91] show that we can change what is returned when nothing is found. By default, the result is empty (None). We can set it with the 2nd parameter. It is better to set it to 0 in this example.

  • .get(True,0)
    

  • sort_values(by='user_key',ascending=False)
    


    This shows who posted the largest number of tweets.

  • sort_values(by='count',ascending=False)
    


    This shows who was retweeted the most.

    The function runs once for every user, so it will execute 454 times. It takes a long time to finish the whole code.
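
A sketch of the whole Try3 idea on made-up data: .get(True, 0) returns 0 instead of raising an error when nothing matches, and the counts are then sorted. The dataframe layout with a 'name' column is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'user_key': ['alice', 'bob', 'alice', 'carol'],
    'text': ['RT @bob yo', 'hello', 'rt @bob again', 'RT @alice hi'],
})

def count_retweeted_number(name):
    def check_name(x):
        return name in str(x).lower()
    counts = df['text'].apply(check_name).value_counts()
    return counts.get(True, 0)   # 0 instead of an error when nothing matches

df_user = pd.DataFrame({'name': df['user_key'].unique()})
# The function runs once per name, so this gets slow with many users.
df_user['count'] = df_user['name'].apply(count_retweeted_number)
print(df_user.sort_values(by='count', ascending=False))
```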

Save time
  • You can interrupt it.

  • You can run it on the top 20 only.

Calculate the frequent terms

Get text



  • Get the text.

  • .split()
    




    split() with no argument splits on whitespace; to split on commas, pass the separator, as in split(',').

  • [:10]
    


    Get the text of the first 10 items.

  • extend()
    


    Take the text of the first 5 items and split each one on spaces. extend then adds the resulting words, one by one, to the list 'all_text'.

  • If we run it on the whole text, removing '[:5]', there is an error.


    We have to convert each text to str first.
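
A sketch of collecting all the words, with made-up texts that include one missing value to show why str() is needed.

```python
texts = ['RT hello world', 'the news of the day', float('nan')]

all_text = []
for t in texts:
    # str() guards against values that are not strings, such as NaN
    all_text.extend(str(t).split())

print(all_text)
# ['RT', 'hello', 'world', 'the', 'news', 'of', 'the', 'day', 'nan']
```

Note that the missing value ends up in the list as the word 'nan', so it may need to be removed later.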

Word count

  • pd.Series()
    



    Convert 'word count' into a Series.

  • .to_frame().reset_index()
    


    Convert the Series into a dataframe and reset the index.

  • sort_values(ascending=False)
    


    The top words are not informative, because so many of them are 'stop words'. We can remove those words manually.
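
A sketch of the word count steps on a made-up word list. Counting with collections.Counter is an assumption; the notes do not show how 'word count' was built.

```python
from collections import Counter

import pandas as pd

all_text = ['RT', 'the', 'news', 'of', 'the', 'day', 'RT', 'the']
word_count = Counter(all_text)          # assumed way to count the words

s_word_count = pd.Series(word_count)
df_word_count = s_word_count.to_frame().reset_index()
print(list(df_word_count.columns))      # ['index', 0]

sorted_df = df_word_count.sort_values(by=0, ascending=False)
print(sorted_df)                        # 'the' comes first with 3
```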

  • set(['RT', 'the', 'of'])
    


    A set is more efficient than a list for checking in or not in.

  • If you search Google, you can find 'stop word' resources.

  • NLTK:

Stop_word


  • Step1

    def is_stop_word(x):
        return x in stop_words
    
  • Step2

    df_word_count[df_word_count['index'].apply(is_stop_word)]
    
  • Step3

    .sort_values(by=0,ascending=False)
    
  • Step4

    Do the opposite: keep only the words that are not stop words.
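
The four steps can be sketched together on made-up counts. is_not_stop_word is a hypothetical name for the Step4 check, and the dataframe uses the 'index' and 0 column names from the steps above.

```python
import pandas as pd

stop_words = set(['RT', 'the', 'of'])

def is_stop_word(x):
    return x in stop_words

def is_not_stop_word(x):
    return x not in stop_words

df_word_count = pd.DataFrame({'index': ['the', 'RT', 'news', 'of', 'day'],
                              0: [3, 2, 1, 1, 1]})

# Rows that are stop words (Step2), then the rows we actually want (Step4).
print(df_word_count[df_word_count['index'].apply(is_stop_word)])
kept = df_word_count[df_word_count['index'].apply(is_not_stop_word)]
print(kept.sort_values(by=0, ascending=False))
```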
    

Word cloud

Jieba

  • jieba.cut()
    

  • jieba.cut() returns a generator, which means we have to convert it into a list.

Pandas plotting

  • Please learn to learn from others by searching Google.

  • Pandas can be more powerful than Excel. First of all, let's start from the Excel functions.
