
Import the libraries

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup

Scraping a single chapter

The novel I picked is 《你是我的城池营垒》. To download the whole book you have to open every chapter page and scrape it. At first I thought this meant simulating clicks, which is why I reached for selenium, but partway through I realized that passing each chapter's URL directly works just as well, so no clicking needs to be simulated and plain requests would also do the job.
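
For reference, here is a minimal sketch of that requests-based route (an assumption on my part that the chapter pages are plain static HTML; the selectors simply mirror the ones used with selenium below):

import requests
from bs4 import BeautifulSoup

def fetch_chapter(url):
    # Pretend to be a normal browser; the exact UA string is not important
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'                      # make sure the Chinese text decodes correctly
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.find(class_='title').get_text()
    content = soup.find(id='content').get_text()
    return title + '\n' + content + '\n\n'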

Request the page:

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
# Give PhantomJS a mobile user agent before loading the page
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)

First open one chapter and look at its structure (screenshot of the chapter page): the chapter title has the class title, and the body text sits inside a div with class="content".

Store the title and the div:
Append a "\n" newline after the title.
Also append one after the div, otherwise the chapters would run into each other.

title = driver.find_element_by_class_name('title')
title = title.text + "\n"                 # newline after the chapter title
print(title)
div = driver.find_element_by_id('content')
content = div.text + "\n\n"               # blank line between chapters

Write it to a file; the encoding has to be set to utf-8 here:

f = open("d:/a.txt", 'a+', encoding='utf-8')   # append mode, utf-8 so the Chinese text is written correctly
f.write(title)
f.write(content)
f.close()
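
As an aside, a with block would close the file automatically even if something goes wrong along the way; an equivalent sketch:

with open("d:/a.txt", 'a+', encoding='utf-8') as f:
    f.write(title)
    f.write(content)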

That completes the code for scraping one chapter; run it as a quick test:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
title = driver.find_element_by_class_name('title')
title = title.text + "\n"
div = driver.find_element_by_id('content')
f = open("d:/a.txt", 'a+', encoding='utf-8')
print(title)
content = div.text + "\n\n"
f.write(title)
f.write(content)
f.close()

Scraping all chapters

Wrap the single-chapter scraper above into a function so that it can be called in a moment.

Next, analyze the index page (screenshot of the chapter list):
The "latest chapters" block and the full chapter list below it use the same class on their divs, and it is the second one we want, so have all_li collect every div with class="section-box" and take the second element, which is the chapter list we are after.

all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]

What we need is the href attribute of the a tag inside each li, so we run all_li = all_li.find_all('a') to collect all the a tags.
Check the value of all_li:

第1章 序
第2章 上个路口遇见你 1

You can see that every href link is a string of the same length, so each chapter's link can be pulled out by slicing:

for li in all_li:
    str_0 = str(li)          # the a tag as an HTML string
    str_0 = str_0[9: 31]     # slice out the fixed-length chapter link

Then pass each link to the chapter-scraping function and the whole novel gets downloaded.
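
A small aside: instead of slicing the HTML string, the href can also be read straight from the BeautifulSoup tag, which does not rely on every link having exactly the same length. A sketch (download is the chapter function from the full code below, and this assumes the slice above was extracting exactly the href value):

for li in all_li:
    href = li.get('href')    # the link attribute of the a tag
    download(href)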

Full code

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup

def download(url_0):
    # Build the full chapter URL from the sliced link fragment
    url = 'http://www.fyhuabo.com/bqg/3805' + url_0
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    driver.get(url)
    title = driver.find_element_by_class_name('title')
    title = title.text + "\n"
    div = driver.find_element_by_id('content')
    content = div.text + "\n\n"
    print(title)
    # Append this chapter to the output file
    f = open("d:/a.txt", 'a+', encoding='utf-8')
    f.write(title)
    f.write(content)
    f.close()
    driver.quit()            # close the PhantomJS process so they don't pile up

# Open the table-of-contents page and collect the chapter links
driver = webdriver.PhantomJS()
driver.get('http://www.fyhuabo.com/bqg/3805/')
all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]                 # the second section-box div is the full chapter list
all_li = all_li.find_all('a')
driver.quit()

for li in all_li:
    str_0 = str(li)                # the a tag as an HTML string
    str_0 = str_0[9: 31]           # slice out the chapter link
    download(str_0)
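
One caveat: PhantomJS support was deprecated and later removed from Selenium, so on a recent install webdriver.PhantomJS may no longer exist. Headless Chrome can stand in for it; a sketch, assuming chromedriver is available on the machine (note that Selenium 4 also drops the find_element_by_* helpers used above in favour of find_element(By...)):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument('--headless')                      # run Chrome without a visible window
opts.add_argument('--user-agent=Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36')
driver = webdriver.Chrome(options=opts)
driver.get('http://www.fyhuabo.com/bqg/3805/4369788.html')
title = driver.find_element(By.CLASS_NAME, 'title')  # Selenium 4 form of find_element_by_class_name
div = driver.find_element(By.ID, 'content')          # Selenium 4 form of find_element_by_id
print(title.text)
driver.quit()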


