Contents
- Requirement
- 1. Creating the Scrapy project
- 2. Global configuration settings.py
- 3. The spider gs.py
- 4. Data structure items.py
- 5. Pipeline pipelines.py
- 6. Running the crawl start.py
Requirement
Using Python and the Scrapy framework, scrape poem data from gushiwen.cn: for each poem we want the title, author, dynasty, poem text, and translation. The crawl proceeds page by page, 4 pages in total. The first page is https://www.gushiwen.cn/default_1.aspx.
1. Creating the Scrapy project
First, create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Then change into the project directory and create a spider named gs, restricted to the domain gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
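For reference, genspider creates prose/spiders/gs.py from a template that looks roughly like this (the exact contents vary slightly between Scrapy versions); section 3 below fills in this skeleton:

import scrapy


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['http://gushiwen.cn/']

    def parse(self, response):
        # generated empty; the real parsing logic is added in section 3
        pass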
2. Global configuration settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt protocol
② Set the download delay to 1 second
③ Add default request headers and enable the item pipeline
④ Set the log level: LOG_LEVEL = "WARNING"
The resulting file:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
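Optionally, you can verify from inside the project directory that these values are picked up, using Scrapy's built-in settings command:

scrapy settings --get DOWNLOAD_DELAY
scrapy settings --get ROBOTSTXT_OBEY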
3. The spider gs.py
The first step is analyzing the page structure; that process is not detailed here.
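If you want to reproduce that analysis yourself, Scrapy's interactive shell is a convenient way to try XPath expressions against the live page; a quick sketch using the same selectors the spider relies on below:

scrapy shell "https://www.gushiwen.cn/default_1.aspx"

# then, at the shell's Python prompt:
>>> divs = response.xpath('//div[@class="left"]/div[@class="sons"]')
>>> len(divs)                              # number of poem blocks on the page
>>> divs[0].xpath('.//b/text()').get()     # title of the first poem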
This file is the core piece of code we actually have to write.
First, change the start URL from the gushiwen.cn home page to the first page of the listing we want to crawl.
Recall the requirement: for each poem we want the title, author, dynasty, poem text, and translation, crawled page by page.
On the listing page, each poem's title, author, dynasty, and text sit inside the same <div class="sons"> block; the translation lives on the poem's detail page.
To demonstrate two different ways of working:
the title, author, dynasty, and poem text are extracted directly in a for loop over those blocks, wrapped in a try…except so that indexing an empty result does not crash the spider;
the translation is handled by a separate parse_detail callback, passed to scrapy.Request() together with the detail-page URL.
For pagination, the approach is: after a full pass over the current page (i.e. once the for loop finishes), read the next-page link from the page and check whether it is empty. If it is not empty, issue another scrapy.Request() with that URL and parse as the callback; if it is empty, this was the last page and the crawl ends there.
The full spider code:
import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the listing page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                author = source[0]
                dynasty = source[1]
                # Poem text
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty,
                                 content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                # Skip entries where the expected elements are missing
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
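One optional robustness tweak, not part of the original code: if the site ever returns a relative href for the next-page link (or for detail_url), response.urljoin() turns it into an absolute URL before the request is made. A minimal sketch of the pagination part with that change:

# inside GsSpider.parse(), after the for loop
next_url = response.xpath('//a[@id="amore"]/@href').get()
if next_url:
    # urljoin is a no-op for absolute URLs and repairs relative ones
    yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)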
4. Data structure items.py
Here we define the ProseItem class used by the spider above. (Note that the spider imports this module; if your IDE cannot resolve the import, mark the appropriate directory as the project/sources root.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem text
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
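For context, a scrapy.Item behaves much like a dict restricted to the declared fields, which is exactly how the spider and the pipeline use it. A quick illustration with hypothetical values (e.g. inside a scrapy shell):

from prose.items import ProseItem

item = ProseItem(title='静夜思', author='李白', dynasty='唐代')
item['translation'] = '...'      # fields can also be set dict-style
print(dict(item))                # a plain dict, which is what the pipeline serializes
# item['notes'] = 'x' would raise KeyError, because 'notes' is not a declared field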
5. Pipeline pipelines.py
The pipeline is where the scraped items are written to storage.
import json

from itemadapter import ItemAdapter


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
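As an aside: for plain JSON-lines output like this, Scrapy's built-in feed exports could produce a similar file without a custom pipeline (the pipeline is kept here because writing one is part of the exercise). Roughly:

scrapy crawl gs -o gs.jl      # JSON Lines, one item per line
scrapy crawl gs -o gs.json    # or a single JSON array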
6. Running the crawl start.py
Define a small script that runs the crawl command:
from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())
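start.py is only a convenience wrapper around the command line; running the same command directly from the project root works too, and a -s override lets you tweak a setting for a single run, for example:

scrapy crawl gs
scrapy crawl gs -s LOG_LEVEL=INFO    # temporarily show more log output than WARNING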
After the program runs, the data we need has been saved to a text file named gs.txt.
This concludes the walkthrough of scraping gushiwen.cn with Python and Scrapy.