How to scrape web page text with a Python crawler

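In Python, you can scrape web page text by fetching the page with the requests library and parsing it with the third-party library BeautifulSoup. The basic steps are as follows: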

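Install the libraries: install BeautifulSoup with pip using the command pip install beautifulsoup4. The requests library is also a third-party package; if it is not already present, install it with pip install requests.
Import the libraries: import BeautifulSoup and requests in your Python code.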

from bs4 import BeautifulSoup
import requests

Send an HTTP request and get the page content: use the requests library to send an HTTP request and retrieve the page's HTML.

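url = "https://example.com"  # replace with the URL of the page to scrape
response = requests.get(url, timeout=10)  # a timeout keeps the request from hanging indefinitely
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html_content = response.content  # raw bytes; BeautifulSoup detects the encoding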
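
The request step can be wrapped in a small helper so that timeouts and HTTP errors are handled in one place. This is only a minimal sketch: the function name fetch_html, the optional session parameter, and the User-Agent string are assumptions made for illustration and are not part of the original steps.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the raw HTML bytes of `url`, raising on network or HTTP errors.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections; when omitted, the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(
        url,
        timeout=timeout,  # seconds to wait before giving up
        headers={"User-Agent": "example-scraper/0.1"},  # placeholder value (assumption)
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

A call such as html_content = fetch_html("https://example.com") could then replace the three request lines above. Catching requests.RequestException around the call covers both connection failures and HTTP error statuses.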

Parse the HTML: use BeautifulSoup to parse the HTML content and create a BeautifulSoup object.

soup = BeautifulSoup(html_content, 'html.parser')
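
To try the parsing step without any network access, the same constructor can be given a literal HTML string. The sample markup below is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only to demonstrate parsing offline.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main" class="article"><p>Hello, <b>world</b>!</p></div>
  </body>
</html>
"""

# 'html.parser' is the parser bundled with Python's standard library.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string     # text inside the <title> tag
first_paragraph = soup.find("p")   # first <p> element in the document
print(page_title)
print(first_paragraph.get_text())
```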

Extract the page text: use the methods BeautifulSoup provides to pull out the text you want.

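# Extract all text on the page
text = soup.get_text()

# Extract the text of a specific tag
# Note: find() returns None when nothing matches, so .text raises AttributeError in that case
text = soup.find('div').text  # text of the first <div> tag

# Extract the text of a tag with a given CSS class
text = soup.find(class_='class-name').text  # first tag whose class attribute includes class-name

# Extract the text of a tag with a given id
text = soup.find(id='id-name').text  # tag whose id attribute is id-name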
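
Because find() returns None when no element matches, chaining .text directly onto it can fail on pages that lack the expected tag. One possible way to guard against that is a small helper like the sketch below. The name extract_text and the sample markup are assumptions for illustration, and select_one() with a CSS selector is used here in place of find().

```python
from bs4 import BeautifulSoup


def extract_text(soup, selector, default=""):
    """Return the text of the first element matching a CSS selector.

    Falls back to `default` when nothing matches, instead of raising
    AttributeError on None.
    """
    element = soup.select_one(selector)
    if element is None:
        return default
    # separator=" " joins text fragments with a space;
    # strip=True trims whitespace around each fragment.
    return element.get_text(separator=" ", strip=True)


# Made-up sample markup for demonstration only.
sample = '<div id="main" class="article"><p>Hello,</p><p>world!</p></div>'
sample_soup = BeautifulSoup(sample, "html.parser")

main_text = extract_text(sample_soup, "#main")                       # match by id
article_text = extract_text(sample_soup, ".article")                 # match by class
missing_text = extract_text(sample_soup, ".missing", default="n/a")  # no match
```

In this sketch, "#main" and ".article" are the CSS-selector equivalents of find(id='main') and find(class_='article').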