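Python novel-scraping tutorial: scraping novels with urllib2 and BeautifulSoup

A reader left a comment asking how to scrape novels with Python, so today I'll walk through scraping the free novels on Qidian (起点中文网) using two libraries, urllib2 and BeautifulSoup. Let's take a look!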

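Libraries

urllib2

Sends HTTP requests and fetches the HTML. Note that urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request.

BeautifulSoup

Retrieves DOM nodes by selector; see any CSS selector reference for the syntax.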

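Scraping logic

1. Look at the list of free novels on Qidian: https://www.qidian.com/free/all

2. First, understand how to scrape a single book

2.1 Use a selector to get the book's link and title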

 bookCover = book.select("div[class='book-mid-info'] h4 > a")[0]

The CSS selector takes us straight to the title link inside the div we need.
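To see the selector at work without touching the live site, here is a minimal sketch in Python 3 (the article's own script is Python 2). The HTML snippet, the book.example.com links, and the extract_books helper are made up for illustration; only the two selectors come from this article, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the selectors used in this article.
SAMPLE_HTML = """
<ul class="all-img-list cf">
  <li>
    <div class="book-mid-info">
      <h4><a href="//book.example.com/info/1">First Book</a></h4>
    </div>
  </li>
  <li>
    <div class="book-mid-info">
      <h4><a href="//book.example.com/info/2">Second Book</a></h4>
    </div>
  </li>
</ul>
"""


def extract_books(html):
    """Return (title, href) pairs for every book entry in the list markup."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for li in soup.select("ul[class='all-img-list cf'] > li"):
        matches = li.select("div[class='book-mid-info'] h4 > a")
        if not matches:
            continue  # skip entries that lack the expected title link
        link = matches[0]
        books.append((link.get_text(strip=True), link["href"]))
    return books


books = extract_books(SAMPLE_HTML)
for title, href in books:
    print(title, href)
```

Checking that select() actually returned something before indexing [0] keeps one malformed entry from stopping the whole run.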

2.2 Create and open the file

 bookFile = open("crawler/books/" + bookCover.string + ".txt", "a+")

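Opening in "a+" mode creates the file if it doesn't exist and appends to it if it does. The .txt file is named after the text of the DOM node we just scraped. The crawler/books/ directory must already exist, otherwise open() fails.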
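Book titles can contain characters such as / or : that are not valid in file names, and the target directory may not exist yet. The Python 3 sketch below shows one way to guard against both; safe_filename and open_book_file are hypothetical helpers, not part of the original script, and the set of replaced characters is an assumption. It opens the file with "a" rather than "a+" because the crawler only ever writes.

```python
import os
import re
import tempfile


def safe_filename(title, fallback="untitled"):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title or "").strip()
    return cleaned or fallback


def open_book_file(directory, title):
    """Create the directory if needed and open <title>.txt for appending."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + ".txt")
    return open(path, "a", encoding="utf-8")


# Demo in a temporary directory: two separate opens append to the same file.
with tempfile.TemporaryDirectory() as tmp:
    for chunk in ("chapter 1\n", "chapter 2\n"):
        with open_book_file(tmp, "My/Book") as f:
            f.write(chunk)
    with open(os.path.join(tmp, "My_Book.txt"), encoding="utf-8") as f:
        demo_text = f.read()
```

Because the file is opened in append mode, rerunning the crawler on a book that was already downloaded adds a second copy of its text; delete the old file first if that matters.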

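2.3 Jump to the book's text

First take the href of the "div[class='book-mid-info'] h4 > a" node and fetch the page it points to.

Then take the href of the "free trial read" (免费试读) node on that page and fetch its content as well.

2.4 Recursively fetch each chapter and write it to the file

Select the chapter's content node by class and write it out, then take the href of the next-chapter link and recurse to fetch each following chapter.

If the link reads "end of book" (书末页) instead of "next chapter", we've reached the last chapter: the recursion ends and the whole book has been fetched.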
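One caveat with recursion: CPython's default recursion limit is 1000 frames, so a book with more chapters than that could fail partway through. The Python 3 sketch below follows the same next-chapter chain in a plain loop instead. crawl_chapters, fetch_chapter, and the fake three-chapter book are hypothetical stand-ins so the example runs without network access; in a real crawler fetch_chapter would download and parse the page.

```python
def crawl_chapters(first_url, fetch_chapter, write, max_chapters=10000):
    """Follow next-chapter links in a loop instead of recursing.

    fetch_chapter(url) must return (text, next_url), with next_url set to
    None on the last chapter. Returns the number of chapters written.
    """
    url = first_url
    seen = set()  # guards against a chain of links that loops back on itself
    count = 0
    while url is not None and url not in seen and count < max_chapters:
        seen.add(url)
        text, next_url = fetch_chapter(url)
        write(text)
        count += 1
        url = next_url
    return count


# Fake three-chapter book: each URL maps to (chapter text, next URL).
FAKE_BOOK = {
    "/ch1": ("Chapter 1", "/ch2"),
    "/ch2": ("Chapter 2", "/ch3"),
    "/ch3": ("Chapter 3", None),
}

collected = []
chapter_count = crawl_chapters("/ch1", FAKE_BOOK.__getitem__, collected.append)
print(chapter_count, collected)
```

Passing the fetch and write steps in as callables also makes the traversal easy to test on its own.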

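3. Loop over every book on the current page

Each book is an li tag, so get all the li elements first and then process each one as in step 2.

4. Loop over the books on every page

Once every book on the current page has been scraped, take the href of the next-page button (>), fetch that page, and continue the loop.

This goes on until the last page, where the > node gains an extra class, lbf-pagination-disabled, which we can use to tell that there are no more pages.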
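The last-page check can be sketched as a small Python 3 helper. next_page_url is a hypothetical function, and the assumption that the disabled button carries a javascript: href is mine, not something this article states. urljoin from the standard library resolves the protocol-relative links (starting with //) that the site uses.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.qidian.com/free/all"


def next_page_url(href, classes, base_url=BASE_URL):
    """Return the absolute URL of the next list page, or None on the last page.

    href and classes are the attributes of the next-page link as a parser
    reports them, e.g. tag.get("href") and tag.get("class", []) in
    BeautifulSoup.
    """
    if "lbf-pagination-disabled" in classes:
        return None  # the button is disabled on the last page
    if not href or href.startswith("javascript:"):
        return None  # assumed placeholder href, not a real link
    return urljoin(base_url, href)  # "//host/path" inherits the base scheme


middle = next_page_url("//www.qidian.com/free/all?page=2", ["lbf-pagination-next"])
last = next_page_url(
    "javascript:;", ["lbf-pagination-next", "lbf-pagination-disabled"]
)
print(middle)
print(last)
```

Checking the class explicitly is a little more robust than relying only on what the href looks like.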

Full code

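# coding=utf-8
# Note: this script targets Python 2 (urllib2, print statement, reload(sys)).
import urllib2
import sys
from bs4 import BeautifulSoup

# Force UTF-8 as the default string encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')

startIndex = 0   # index of the first book to download on the start page
startPage = 0    # zero-based index of the first list page
MAX_RETRIES = 3  # give up on a chapter after this many failed retries


# Fetch one chapter, write it to the file, then follow the next-chapter link
def getChapterContent(file, url, retries=0):
    try:
        bookContentRes = urllib2.urlopen(url)
        bookContentSoup = BeautifulSoup(bookContentRes.read(), "html.parser")
        title = bookContentSoup.select("h3[class='j_chapterName']")[0].string
        # get_text() also copes with paragraphs that contain nested tags
        paragraphs = [p.get_text() for p in bookContentSoup.select(".j_readContent p")]
        chapterNext = bookContentSoup.select("a#j_chapterNext")[0]
    except Exception as e:
        # On failure, print the error and retry a limited number of times;
        # nothing has been written yet, so a retry cannot duplicate text
        print(e)
        if retries < MAX_RETRIES:
            getChapterContent(file, url, retries + 1)
        return
    file.write(title + '\n')
    for text in paragraphs:
        file.write(text + '\n')
    # "书末页" ("end of book") is the link text shown on the last chapter
    if chapterNext.string != "书末页":
        nextUrl = "https:" + chapterNext["href"]
        getChapterContent(file, nextUrl)


# Download every book on the current list page; return the next page's href
def getCurrentUrlBooks(url):
    response = urllib2.urlopen(url)
    the_page = response.read()
    soup = BeautifulSoup(the_page, "html.parser")
    bookArr = soup.select("ul[class='all-img-list cf'] > li")
    global startIndex
    if startIndex > 0:
        # Skip books before the requested start index (first page only)
        bookArr = bookArr[startIndex:]
        startIndex = 0
    for book in bookArr:
        bookCover = book.select("div[class='book-mid-info'] h4 > a")[0]
        print "Title: " + bookCover.string
        # Create the .txt file first, then fetch the text and write it
        # (the crawler/books/ directory must already exist)
        bookFile = open("crawler/books/" + bookCover.string + ".txt", "a+")
        bRes = urllib2.urlopen("https:" + bookCover['href'])
        bSoup = BeautifulSoup(bRes.read(), "html.parser")
        bookContentHref = bSoup.select("a[class='red-btn J-getJumpUrl ']")[0]["href"]
        getChapterContent(bookFile, "https:" + bookContentHref)
        bookFile.close()
    nextPage = soup.select("a.lbf-pagination-next")[0]
    return nextPage["href"]


# Decide where to start from the command-line arguments
if len(sys.argv) == 2:
    # One argument: overall book number, 20 books per page
    startPage = int(sys.argv[1]) // 20   # page to start from
    startIndex = int(sys.argv[1]) % 20   # book to start from on that page
elif len(sys.argv) > 2:
    # Two arguments: page index, then book index on that page
    startPage = int(sys.argv[1])
    startIndex = int(sys.argv[2])

url = "//www.qidian.com/free/all?orderId=&vip=hidden&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=1&page=" + str(startPage + 1)

# Keep going until the next-page href is no longer a "//..." link
while url.startswith("//"):
    url = getCurrentUrlBooks("https:" + url)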
