Python BeautifulSoup4 (bs4) Tutorial

Codefans, 2023-04-20, Python

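BeautifulSoup4 (usually shortened to bs4) is a popular Python library for pulling data out of HTML and XML files, or for stripping the markup and keeping only the text. It is not a parser itself: it wraps a parser that you choose and exposes the result as a searchable tree. This article covers how to install, import, and use bs4.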

Installation

To install bs4, use pip:

pip install beautifulsoup4
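
To confirm that the install worked, one quick check is to import the package and print its version. This is only a minimal sketch; the variable name installed_version is chosen here for illustration.

```python
import bs4

# bs4 exposes its version string as a module attribute.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one running the script.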

Import

To use bs4 in Python code, first import the BeautifulSoup class:

from bs4 import BeautifulSoup

Usage

The basic usage of the BeautifulSoup library is shown below.

Creating a BeautifulSoup object

To parse HTML or XML data, pass it to the BeautifulSoup constructor, which returns a BeautifulSoup object.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

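The code above passes an HTML string to the BeautifulSoup constructor and selects the 'html.parser' parser. The resulting BeautifulSoup object holds the parsed document as a tree of tags and strings that you can navigate and search.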
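
As a rough illustration of what the parsed object offers, the sketch below reads a few values from the same sample document. The variable names (title_text, first_p_classes, first_link, plain_text) are chosen here for illustration and are not part of the original article.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access returns the first tag with that name.
title_text = soup.title.string       # the text inside <title>
first_p_classes = soup.p["class"]    # class is multi-valued, so this is a list
first_link = soup.a                  # the first <a> tag in the document

# get_text() drops all markup and returns only the text content.
plain_text = soup.get_text()

print(title_text)
print(first_p_classes)
print(first_link["href"])
```

Attribute-style access such as soup.a only ever returns the first match; to collect every match, use find_all().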

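The second argument to the BeautifulSoup constructor selects the parser. The common choices are:

'html.parser': Python's built-in HTML parser. Moderate speed, reasonably lenient, and no extra install required.
'lxml': lxml's HTML parser. Very fast and lenient, but the lxml package must be installed separately.
'xml': lxml's XML parser. It is not built into Python; it also requires the lxml package.
'html5lib': Parses pages the way a web browser does and is the most lenient with broken markup, but it is slow and the html5lib package must be installed separately.

If you are parsing HTML and do not want extra dependencies, the built-in 'html.parser' is a good choice. If you need more speed, consider 'lxml'; if you need the most forgiving handling of broken markup, consider 'html5lib'.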
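
Because 'lxml' and 'html5lib' are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; make_soup and its preferred parameter are hypothetical helpers written for this example, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is available.

    Returns a (soup, parser_name) tuple so the caller can see which
    parser was actually used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The requested parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used_parser = make_soup("<p>hello</p>")
print(used_parser, soup.p.get_text())
```

Keep in mind that different parsers can build different trees from the same invalid markup, so a fallback like this suits well-formed input better than badly broken pages.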

Searching for tags

The find_all() method returns every tag in the document that has the given tag name.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all a tags
for link in soup.find_all('a'):
    print(link.get('href'))

The code above finds every a tag in the HTML and uses the get() method to read each tag's href attribute.
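
A few related calls are worth knowing alongside find_all(). The sketch below uses a shortened version of the sample markup in which the third link has deliberately been given no href (a change made only for this example), to show that get() returns None for a missing attribute instead of raising an error.

```python
from bs4 import BeautifulSoup

# Shortened sample; the third link intentionally has no href.
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# get() returns None when the attribute is absent; tag["href"] would raise KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# find() returns only the first match, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# A list of names matches any of them, in document order.
tag_names = [tag.name for tag in soup.find_all(["p", "a"])]

print(hrefs)
print(len(first_two), first_link["id"], missing)
print(tag_names)
```

Since find() can return None, check its result before accessing attributes on it.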

Searching by attribute

Besides tag names, you can also search by tag attributes.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all a tags whose class attribute is 'sister'
for link in soup.find_all('a', {'class': 'sister'}):
    print(link.get('href'))

The code above finds every a tag whose class attribute is sister and uses the get() method to read each tag's href attribute.
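
The attribute dictionary is one of several equivalent ways to filter. As a sketch, the example below shows the class_ keyword (with a trailing underscore, because class is a reserved word in Python), a lookup by id, a regular-expression filter on href, and a CSS selector via select(). The variable names are chosen for illustration, and select() assumes the soupsieve package is present, which pip normally installs together with beautifulsoup4.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Same filter as {'class': 'sister'}, written with the class_ keyword.
by_class = soup.find_all("a", class_="sister")

# Any attribute name can be passed as a keyword argument.
by_id = soup.find(id="link2")

# A compiled regular expression is matched against the attribute value.
by_href = soup.find_all("a", href=re.compile(r"/(elsie|tillie)$"))

# CSS selector: a.sister tags that are direct children of p.story.
by_css = soup.select("p.story > a.sister")

print([a.get_text() for a in by_class])
print(by_id.get_text())
print([a.get_text() for a in by_href])
print(len(by_css))
```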

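Conclusion

This was a brief introduction to the BeautifulSoup library. With it you can parse HTML or XML data and extract the information you need. The article covered the most common entry points: constructing a BeautifulSoup object, choosing a parser, and searching with find_all() by tag name and by attribute.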
