BeautifulSoup中find和find_all的使用详解-偶像资源网

爬虫利器BeautifulSoup中find和find_all的使用方法

二话不说，先上段HTML例子

<html>
  <head>
    <title>
      index
    </title>
  </head>
  <body>
     <div>
        <ul>
           <li id=\"flask\"class=\"item-0\"><a href=\"link1.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >first item</a></li>
          <li class=\"item-1\"><a href=\"link2.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >second item</a></li>
          <li class=\"item-inactie\"><a href=\"link3.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >third item</a></li>
          <li class=\"item-1\"><a href=\"link4.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >fourth item</a></li>
          <li class=\"item-0\"><a href=\"link5.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >fifth item</a>
         </ul>
     </div>
    <li> hello world </li>
  </body>
</html>

使用BeautifulSoup前需要先构建BeautifulSoup实例

# 构建beautifulsoup实例
soup = BeautifulSoup(html,\'lxml\')
# 第一个参数是要匹配的内容
# 第二个参数是beautifulsoup要采用的模块，即规则

需要注意的是，导入对的模块需要事先安装，此处导入的LXML事先已经安装。可以导入的模块可通过查询BeautifulSoup的文档查看

接下来是find和find_all的介绍

1. find
只返回第一个匹配到的对象
语法：

find(name, attrs, recursive, text, **wargs)　　　　
# recursive 递归的，循环的

参数：

参数名	作用
name	查找标签
text	查找文本
attrs	基于attrs参数

例子：

# find查找一次
li = soup.find(\'li\')
print(\'find_li:\',li)
print(\'li.text(返回标签的内容):\',li.text)
print(\'li.attrs(返回标签的属性):\',li.attrs)
print(\'li.string(返回标签内容为字符串):\',li.string)

运行结果：

find_li: <li class=\”item-0\” id=\”flask\”><a href=\”link1.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >first item</a></li>
li.text(返回标签的内容): first item
li.attrs(返回标签的属性): {\’id\’: \’flask\’, \’class\’: [\’item-0\’]}
li.string(返回标签内容为字符串): first item

find也可以通过‘属性=值\’的方法进行匹配

li = soup.find(id = \'flask\')
print(li,\'\\n\')

<li class=\"item-0\" id=\"flask\"><a href=\"link1.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >first item</a></li>

需要注意的是，因为class是python的保留关键字，若要匹配标签内class的属性，需要特殊的方法，有以下两种：

在attrs属性用字典的方式进行参数传递
BeautifulSoup自带的特别关键字class_

# 第一种:在attrs属性用字典进行传递参数
find_class = soup.find(attrs={\'class\':\'item-1\'})
print(\'findclass:\',find_class,\'\\n\')
# 第二种:BeautifulSoup中的特别关键字参数class_
beautifulsoup_class_ = soup.find(class_ = \'item-1\')
print(\'BeautifulSoup_class_:\',beautifulsoup_class_,\'\\n\')

运行结果

findclass: <li class=\”item-1\”><a href=\”link2.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >second item</a></li>

BeautifulSoup_class_: <li class=\”item-1\”><a href=\”link2.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >second item</a></li>

2. find_all

返回所有匹配到的结果，区别于find（find只返回查找到的第一个结果）

语法：

find_all(name, attrs, recursive, text, limit, **kwargs)

参数名	作用
name	查找标签
text	查找文本
attrs	基于attrs参数

与find一样的语法

上代码

# find_all 查找所有
li_all = soup.find_all(\'li\')
for li_all in li_all:
	print(\'---\')
	print(\'匹配到的li:\',li_all)
	print(\'li的内容:\',li_all.text)
	print(\'li的属性:\',li_all.attrs)

运行结果：

—
匹配到的li: <li class=\”item-0\” id=\”flask\”><a href=\”link1.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >first item</a></li>
li的内容: first item
li的属性: {\’id\’: \’flask\’, \’class\’: [\’item-0\’]}
—
匹配到的li: <li class=\”item-1\”><a href=\”link2.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >second item</a></li>
li的内容: second item
li的属性: {\’class\’: [\’item-1\’]}
—
匹配到的li: <li cvlass=\”item-inactie\”><a href=\”link3.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >third item</a></li>
li的内容: third item
li的属性: {\’cvlass\’: \’item-inactie\’}
—
匹配到的li: <li class=\”item-1\”><a href=\”link4.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >fourth item</a></li>
li的内容: fourth item
li的属性: {\’class\’: [\’item-1\’]}
—
匹配到的li: <li class=\”item-0\”><a href=\”link5.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >fifth item</a>
</li>
li的内容: fifth item

附上比较灵活的find_all查询方法：

# 最灵活的使用方式
li_quick = soup.find_all(attrs={\'class\':\'item-1\'})
for li_quick in li_quick:
	print(\'最灵活的查找方法:\',li_quick)

运行结果：

最灵活的查找方法: <li class=\”item-1\”><a href=\”link2.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >second item</a></li>
最灵活的查找方法: <li class=\”item-1\”><a href=\”link4.html\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” rel=\”external nofollow\” >fourth item</a></li>

完整代码：

# coding=utf8
# @Author= CaiJunxuan
# @QQ=469590490
# @Wechat:15916454524

# beautifulsoup

# 导入beautifulsoup模块
from bs4 import BeautifulSoup

# HTML例子
html = \'\'\'
<html>
  <head>
    <title>
      index
    </title>
  </head>
  <body>
     <div>
        <ul>
           <li id=\"flask\"class=\"item-0\"><a href=\"link1.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >first item</a></li>
          <li class=\"item-1\"><a href=\"link2.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >second item</a></li>
          <li cvlass=\"item-inactie\"><a href=\"link3.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >third item</a></li>
          <li class=\"item-1\"><a href=\"link4.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >fourth item</a></li>
          <li class=\"item-0\"><a href=\"link5.html\" rel=\"external nofollow\" rel=\"external nofollow\" rel=\"external nofollow\" >fifth item</a>
         </ul>
     </div>
    <li> hello world </li>
  </body>
</html>
\'\'\'

# 构建beautifulsoup实例
soup = BeautifulSoup(html,\'lxml\')
# 第一个参数是要匹配的内容
# 第二个参数是beautifulsoup要采用的模块,即规则
# html.parser是python内置的结构匹配方法，但是效率不如lxml所以不常用
# lxml 采用lxml模块
# html5lib,该模块可以将内容转换成html5对象
# 若想要以上功能,就需要具备对应的模块，比如使用lxml就要安装lxml

# 在bs4当中有很多种匹配方法,但常用有两种:

# find查找一次
li = soup.find(\'li\')
print(\'find_li:\',li)
print(\'li.text(返回标签的内容):\',li.text)
print(\'li.attrs(返回标签的属性):\',li.attrs)
print(\'li.string(返回标签内容为字符串):\',li.string)
print(50*\'*\',\'\\n\')

# find可以通过\'属性 = 值\'的方法进行select
li = soup.find(id = \'flask\')
print(li,\'\\n\')
# 因为class是python的保留关键字，所以无法直接查找class这个关键字
# 有两种方法可以进行class属性查询
# 第一种:在attrs属性用字典进行传递参数
find_class = soup.find(attrs={\'class\':\'item-1\'})
print(\'findclass:\',find_class,\'\\n\')
# 第二种:BeautifulSoup中的特别关键字参数class_
beautifulsoup_class_ = soup.find(class_ = \'item-1\')
print(\'BeautifulSoup_class_:\',beautifulsoup_class_,\'\\n\')

# find_all 查找所有
li_all = soup.find_all(\'li\')
for li_all in li_all:
	print(\'---\')
	print(\'匹配到的li:\',li_all)
	print(\'li的内容:\',li_all.text)
	print(\'li的属性:\',li_all.attrs)

# 最灵活的使用方式
li_quick = soup.find_all(attrs={\'class\':\'item-1\'})
for li_quick in li_quick:
	print(\'最灵活的查找方法:\',li_quick)

版权声明 1 本网站名称：偶像资源网
2 本站永久网址：https://www.ox520.com
3 本网站的文章部分内容可能来源于网络，仅供大家学习与参考，如有侵权，请联系站长 QQ593098775进行删除处理。
4 本站一切资源不代表本站立场，并不代表本站赞同其观点和对其真实性负责。
5 本站一律禁止以任何方式发布或转载任何违法的相关信息，访客发现请向站长举报
6 本站资源大多存储在云盘，如发现链接失效，请联系我们我们会第一时间更新。

THE END

BeautifulSoup中find和find_all的使用详解

请登录后发表评论