So it is suggested to enter the virtual environment before using Jupyter Notebook (a short example follows the reference link below).
pip3 install --user requests
pip3 install --user bs4
pip3 install --user lxml
https://hupili.gitbooks.io/python-for-data-and-media-communication/content/module-jupyter.html
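For example, a minimal sketch of entering the environment and starting Jupyter, assuming your virtual environment lives in a folder named venv (the folder name is an assumption, not from the notes above):
source venv/bin/activate
jupyter notebook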
A:
Press control+C in the terminal (pay attention to the text shown). You will then be asked to confirm shutting down the notebook server; please input y within 5 seconds.
A: input
deactivate
There are some useful tips for you.
tab
If we input mypage.t and then press tab, we will find there are other commands (functions), including tag_name, text and so on.
type
Use type() to check an object's type. It is very useful when we write complicated code.
eg:
a=1
type(a)
b="hello"
type(b)
You will get:
int
str
help(str.strip)
Use help() like this to know the details of a function.
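For instance, a quick sketch of what strip does (the sample string is my own):
'  hello \n'.strip()   # returns 'hello' with the surrounding whitespace removed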
print
Print variables step by step to check where the error is.
(In Jupyter, you can just input the variable name without the print function.)
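A minimal sketch (the variable name result is my own):
result = 'hello'
print(result)   # works in any Python environment
result          # in a Jupyter cell, the last expression is displayed automatically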
shift+return
to run the code in a cell.
option+command+i
to open the Chrome developer console.
import requests
import bs4
import csv
requests
is a module containing diverse functions for working with web pages.
bs4
is the abbreviation of BeautifulSoup4. It is used to parse (analyse) web pages.
r = requests.get('http://initiumlab.com/blog/20170329-trump-and-ivanka/')
r.text
It can also be written in one line:
r = requests.get('http://initiumlab.com/blog/20170329-trump-and-ivanka/').text
Store the response as r. get means to request that web page and get its response; text shows the text (HTML source) of the page. (Note: in the one-line version, r holds the text directly.)
from bs4 import BeautifulSoup
mypage = BeautifulSoup(r.text)
BeautifulSoup
is used to extract the web page's content.
It relies on a certain engine to parse the page; lxml is one of those engines. By default, it will choose the best engine available and print a warning telling you which one it picked.
A warning is not an error, so there is no need to worry at this stage.
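If you prefer to silence that warning, you can name the engine explicitly, just as the scrape_one_page function further below does:
mypage = BeautifulSoup(r.text, 'lxml')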
<h1 class="post__title" itemprop="name headline"> 特朗普父女推特解密</h1>
The title is inside <h1>...</h1> (see the above 'html' part).
myh1 = mypage.find('h1')
mytitle = myh1.text
mytitle.strip()
find
finds what we want and outputs only the first match.
find_all
outputs all the items it finds.
strip()
deletes the meaningless characters (leading and trailing whitespace). Use help(str.strip) to see the usage of strip.
<time itemprop="dateCreated" datetime="2017-03-29T....." content="2017-03-29"> 2017-03-29 </time>
mydate = mypage.find('time').text.strip()
myauthor = mypage.find('span')
Press command+f to open the search bar in the console, and input 'span'. You can see there are more than two 'span' tags.
myspans = mypage.find_all('span')
find_all
means output all the items it finds.
find
means only output the first one.
myspans is a list.
myspans[0]
means extract the first item in the list.
myspans[1]
means extract the second item in the list.
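A quick sketch of working with that list (these are standard list and tag operations):
myspans[0]        # the first <span> tag
myspans[0].text   # only the text inside it
len(myspans)      # how many <span> tags were found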
In the HTML, we can find that the authors' parent tag is 'td'. But there are too many td tags, and it is difficult to be specific.
'tr' has a class attribute, which makes it more specific than 'td', so try to locate the higher-level tag 'tr':
len(mypage.find_all('tr'))
5
mytr = mypage.find('tr', attrs={'class': 'post__authors'})
attrs
= attributes. You can add more detailed information about the tr, which helps to locate it. Entering mytr shows:
<tr class="post__authors"> <td>.......</td> <td> <span>Li Yiming</span> <span>Li Yuqiong</span> </td> </tr>
mytr.find_all('span')
[<span>Li Yiming</span>, <span>Li Yuqiong</span>]
authors = []
for myspan in mytr.find_all('span'):
    authors.append(myspan.text)
Create a new empty list called 'authors', then add the information into it.
list1.append(x)
means add the item x into list1.
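A minimal sketch of append, using a hypothetical list:
fruits = []
fruits.append('apple')
fruits.append('pear')
fruits   # ['apple', 'pear']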
article = {
    'title': mytitle,
    'authors': authors,
    'date': mydate
}
{}
is a dictionary. The left of the colon is the key, or name; the right of the colon is the value of that key.
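For example, you can look a value up by its key:
article['title']     # the title string
article['authors']   # the list of authors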
def scrape_one_page(url):
    r = requests.get(url).text
    mypage = BeautifulSoup(r, 'lxml')
    mytitle = mypage.find('h1').text.strip()
    mydate = mypage.find('time').text.strip()
    mytr = mypage.find('tr', attrs={'class': 'post__authors'})
    authors = []
    for myspan in mytr.find_all('span'):
        authors.append(myspan.text)
    article = {
        'title': mytitle,
        'authors': authors,
        'date': mydate
    }
    return article
Create a function for future reuse.
The following is an example of the function scrape_one_page's usage:
print(scrape_one_page('http://initiumlab.com/blog/20170401-data-news/'))
urls = [
    'http://initiumlab.com/blog/20170407-open-data-hk/',
    'http://initiumlab.com/blog/20170401-data-news/',
    'http://initiumlab.com/blog/20170329-trump-and-ivanka/'
]
articles = []
for url in urls:
    article = scrape_one_page(url)
    articles.append(article)
print(articles)
Use the scrape_one_page function to extract each page, then add those results into articles.
with open('eggs.csv', 'w') as f:
    writer = csv.writer(f)
    for article in articles:
        writer.writerow([
            article['authors'],
            article['date'],
            article['title']
        ])
We can save this information in different ways; csv is one of the formats that can easily be opened and read.
with open('eggs.csv', 'w') as f:
means create a file called 'eggs.csv'; 'w' is the mode for writing. 'as f' means that afterwards, writing to f is the same as writing to that opened file.
csv.writer
creates a writer object for editing a csv file.
writer.writerow([])
means add one row of information. It can be written like this: writer.writerow(['Spam', '1', '2', '3'])
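To check the result, here is a minimal sketch that reads eggs.csv back with csv.reader, the standard companion of csv.writer:
with open('eggs.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)   # each row comes back as a list of strings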