Contents
- Requirement
- 1. Creating the Scrapy project
- 2. Global configuration settings.py
- 3. The spider gs.py
- 4. Data structure items.py
- 5. Pipeline pipelines.py
- 6. Running the crawl start.py
Requirement
Using Python and the Scrapy framework, scrape poem data from gushiwen.cn: for each poem we want the title, author, dynasty, poem text, and translation. The crawl proceeds page by page, 4 pages in total. The first page is https://www.gushiwen.cn/default_1.aspx.
1. Creating the Scrapy project
First, create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Then change into the project directory and create a spider named gs, restricted to the domain gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
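For reference, genspider creates prose/spiders/gs.py from a template that looks roughly like this (the exact contents vary slightly between Scrapy versions); section 3 below fills in this skeleton:

import scrapy


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['http://gushiwen.cn/']

    def parse(self, response):
        # generated empty; the real parsing logic is added in section 3
        pass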
2. Global configuration settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt protocol
② Set the download delay to 1 second
③ Add default request headers and enable the item pipeline
④ Set the log level: LOG_LEVEL = "WARNING"
The resulting file:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
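Optionally, you can verify from inside the project directory that these values are picked up, using Scrapy's built-in settings command:

scrapy settings --get DOWNLOAD_DELAY
scrapy settings --get ROBOTSTXT_OBEY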
3. The spider gs.py
The first step is analyzing the page structure; that process is not detailed here.
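If you want to reproduce that analysis yourself, Scrapy's interactive shell is a convenient way to try XPath expressions against the live page; a quick sketch using the same selectors the spider relies on below:

scrapy shell "https://www.gushiwen.cn/default_1.aspx"

# then, at the shell's Python prompt:
>>> divs = response.xpath('//div[@class="left"]/div[@class="sons"]')
>>> len(divs)                              # number of poem blocks on the page
>>> divs[0].xpath('.//b/text()').get()     # title of the first poem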
This file is the core piece of code we actually have to write.
First, change the start URL from the gushiwen.cn home page to the first page of the listing we want to crawl.
Recall the requirement: for each poem we want the title, author, dynasty, poem text, and translation, crawled page by page.
On the listing page, each poem's title, author, dynasty, and text sit inside the same <div class="sons"> block; the translation lives on the poem's detail page.
To demonstrate two different ways of working:
the title, author, dynasty, and poem text are extracted directly in a for loop over those blocks, wrapped in a try…except so that indexing an empty result does not crash the spider;
the translation is handled by a separate parse_detail callback, passed to scrapy.Request() together with the detail-page URL.
For pagination, the approach is: after a full pass over the current page (i.e. once the for loop finishes), read the next-page link from the page and check whether it is empty. If it is not empty, issue another scrapy.Request() with that URL and parse as the callback; if it is empty, this was the last page and the crawl ends there.
The full spider code:
import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the listing page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                author = source[0]
                dynasty = source[1]
                # Poem text
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty,
                                 content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                # Skip entries where the expected elements are missing
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
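One optional robustness tweak, not part of the original code: if the site ever returns a relative href for the next-page link (or for detail_url), response.urljoin() turns it into an absolute URL before the request is made. A minimal sketch of the pagination part with that change:

# inside GsSpider.parse(), after the for loop
next_url = response.xpath('//a[@id="amore"]/@href').get()
if next_url:
    # urljoin is a no-op for absolute URLs and repairs relative ones
    yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)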
4. Data structure items.py
Here we define the ProseItem class used by the spider above. (Note that the spider imports this module; if your IDE cannot resolve the import, mark the appropriate directory as the project/sources root.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem text
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
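For context, a scrapy.Item behaves much like a dict restricted to the declared fields, which is exactly how the spider and the pipeline use it. A quick illustration with hypothetical values (e.g. inside a scrapy shell):

from prose.items import ProseItem

item = ProseItem(title='静夜思', author='李白', dynasty='唐代')
item['translation'] = '...'      # fields can also be set dict-style
print(dict(item))                # a plain dict, which is what the pipeline serializes
# item['notes'] = 'x' would raise KeyError, because 'notes' is not a declared field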
5. Pipeline pipelines.py
The pipeline is where the scraped items are written to storage.
import json

from itemadapter import ItemAdapter


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
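As an aside: for plain JSON-lines output like this, Scrapy's built-in feed exports could produce a similar file without a custom pipeline (the pipeline is kept here because writing one is part of the exercise). Roughly:

scrapy crawl gs -o gs.jl      # JSON Lines, one item per line
scrapy crawl gs -o gs.json    # or a single JSON array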
6. Running the crawl start.py
Define a small script that runs the crawl command:
from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())
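start.py is only a convenience wrapper around the command line; running the same command directly from the project root works too, and a -s override lets you tweak a setting for a single run, for example:

scrapy crawl gs
scrapy crawl gs -s LOG_LEVEL=INFO    # temporarily show more log output than WARNING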
After the program runs, the data we need has been saved to a text file named gs.txt.
This concludes the walkthrough of scraping gushiwen.cn with Python and Scrapy.