Python Requests 教程¶

在本教程中，我们展示了如何使用 Python Requests 模块。我们获取数据，发布数据，流数据并连接到安全的网页。在示例中，我们使用在线服务，nginx 服务器，Python HTTP 服务器和 Flask 应用。

超文本传输协议（ HTTP ）是用于分布式协作超媒体信息系统的应用协议。 HTTP 是万维网数据通信的基础。

Python Requests¶

Requests是一个简单优雅的 Python HTTP 库。它提供了通过 HTTP 访问 Web 资源的方法。 Requests是内置的 Python 模块。

1	`$ sudo service nginx start`

我们在本地主机上运行 nginx Web 服务器。我们的一些示例使用nginx服务器。

Python Requests 版本¶

第一个程序打印请求库的版本。

version.py

#!/usr/bin/env python3

import requests

print(requests.__version__)
print(requests.__copyright__)

该程序将打印请求的版本和版权。

1
2
3

$ ./version.py
2.21.0
Copyright 2018 Kenneth Reitz

这是示例的示例输出。

Python Requests 读取网页¶

get()方法发出 GET 请求；它获取由给定 URL 标识的文档。

read_webpage.py

#!/usr/bin/env python3

import requests as req

resp = req.get("http://www.webcode.me")

print(resp.text)

该脚本获取www.webcode.me网页的内容。

1	`resp = req.get("http://www.webcode.me")`

get()方法返回一个响应对象。

1	`print(resp.text)`

text 属性包含响应的内容，以 Unicode 表示。

$ ./read_webpage.py
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My html page</title>
</head>
<body>

    <p>
        Today is a beautiful day. We go swimming and fishing.
    </p>

    <p>
         Hello there. How are you?
    </p>

</body>
</html>

这是read_webpage.py脚本的输出。

以下程序获取一个小型网页，并剥离其 HTML 标签。

strip_tags.py

#!/usr/bin/env python3

import requests as req
import re

resp = req.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)

该脚本会剥离www.webcode.me网页的 HTML 标签。

1	`stripped = re.sub('<[^<]+?>', '', content)`

一个简单的正则表达式用于剥离 HTML 标记。

HTTP 请求¶

HTTP 请求是从客户端发送到浏览器的消息，以检索某些信息或采取某些措施。

Request's request方法创建一个新请求。请注意，request模块具有一些更高级的方法，例如get()，post()或put()，为我们节省了一些输入。

create_request.py

#!/usr/bin/env python3

import requests as req

resp = req.request(method='GET', url="http://www.webcode.me")
print(resp.text)

该示例创建一个 GET 请求并将其发送到http://www.webcode.me。

Python Requests 获取状态¶

Response对象包含服务器对 HTTP 请求的响应。其status_code属性返回响应的 HTTP 状态代码，例如 200 或 404。

get_status.py

#!/usr/bin/env python3

import requests as req

resp = req.get("http://www.webcode.me")
print(resp.status_code)

resp = req.get("http://www.webcode.me/news")
print(resp.status_code)

我们使用get()方法执行两个 HTTP 请求，并检查返回的状态。

1
2
3

$ ./get_status.py
200
404

200 是成功 HTTP 请求的标准响应，而 404 则表明找不到所请求的资源。

Python Requests HEAD 方法¶

head()方法检索文档标题。标头由字段组成，包括日期，服务器，内容类型或上次修改时间。

head_request.py

#!/usr/bin/env python3

import requests as req

resp = req.head("http://www.webcode.me")

print("Server: " + resp.headers['server'])
print("Last modified: " + resp.headers['last-modified'])
print("Content type: " + resp.headers['content-type'])

该示例打印服务器，www.webcode.me网页的上次修改时间和内容类型。

$ ./head_request.py
Server: nginx/1.6.2
Last modified: Sat, 20 Jul 2019 11:49:25 GMT
Content type: text/html

这是head_request.py程序的输出。

Python Requests GET 方法¶

get()方法向服务器发出 GET 请求。 GET 方法请求指定资源的表示形式。

httpbin.org是免费提供的 HTTP 请求&响应服务。

mget.py

#!/usr/bin/env python3

import requests as req

resp = req.get("https://httpbin.org/get?name=Peter")
print(resp.text)

该脚本将具有值的变量发送到httpbin.org服务器。该变量直接在 URL 中指定。

$ ./mget.py
{
  "args": {
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  ...
}

这是示例的输出。

mget2.py

#!/usr/bin/env python3

import requests as req

payload = {'name': 'Peter', 'age': 23}
resp = req.get("https://httpbin.org/get", params=payload)

print(resp.url)
print(resp.text)

get()方法采用params参数，我们可以在其中指定查询参数。

1	`payload = {'name': 'Peter', 'age': 23}`

数据在 Python 字典中发送。

1	`resp = req.get("https://httpbin.org/get", params=payload)`

我们将 GET 请求发送到httpbin.org站点，并传递params参数中指定的数据。

1 2	`print(resp.url) print(resp.text)`

我们将 URL 和响应内容打印到控制台。

$ ./mget2.py
http://httpbin.org/get?name=Peter&age=23
{
  "args": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  ...
}

这是示例的输出。

Python Requests 重定向¶

重定向是将一个 URL 转发到另一个 URL 的过程。 HTTP 响应状态代码 301“永久移动”用于永久 URL 重定向； 302 找到临时重定向。

redirect.py

#!/usr/bin/env python3

import requests as req

resp = req.get("https://httpbin.org/redirect-to?url=/")

print(resp.status_code)
print(resp.history)
print(resp.url)

在示例中，我们向https://httpbin.org/redirect-to页面发出 GET 请求。该页面重定向到另一个页面；重定向响应存储在响应的history属性中。

$ ./redirect.py
200
[<Response [302]>]
https://httpbin.org/

对https://httpbin.org/redirect-to的 GET 请求被重定向到https://httpbin.org 302。

在第二个示例中，我们不遵循重定向。

redirect2.py

#!/usr/bin/env python3

import requests as req

resp = req.get("https://httpbin.org/redirect-to?url=/", allow_redirects=False)

print(resp.status_code)
print(resp.url)

allow_redirects参数指定是否遵循重定向。默认情况下，重定向之后。

1
2
3

$ ./redirect2.py
302
https://httpbin.org/redirect-to?url=/

这是示例的输出。

用 nginx 重定向¶

在下一个示例中，我们显示如何在 nginx 服务器中设置页面重定向。

location = /oldpage.html {

        return 301 /newpage.html;
}

将这些行添加到位于 Debian 上/etc/nginx/sites-available/default的 Nginx 配置文件中。

1	`$ sudo service nginx restart`

编辑完文件后，我们必须重新启动 nginx 才能应用更改。

oldpage.html

<!DOCTYPE html>
<html>
<head>
<title>Old page</title>
</head>
<body>
<p>
This is old page
</p>
</body>
</html>

这是位于 nginx 文档根目录中的oldpage.html文件。

newpage.html

<!DOCTYPE html>
<html>
<head>
<title>New page</title>
</head>
<body>
<p>
This is a new page
</p>
</body>
</html>

这是newpage.html。

redirect3.py

#!/usr/bin/env python3

import requests as req

resp = req.get("http://localhost/oldpage.html")

print(resp.status_code)
print(resp.history)
print(resp.url)

print(resp.text)

该脚本访问旧页面并遵循重定向。如前所述，默认情况下，请求遵循重定向。

$ ./redirect3.py
200
(<Response [301]>,)
http://localhost/files/newpage.html
<!DOCTYPE html>
<html>
<head>
<title>New page</title>
</head>
<body>
<p>
This is a new page
</p>
</body>
</html>

这是示例的输出。

$ sudo tail -2 /var/log/nginx/access.log
127.0.0.1 - - [21/Jul/2019:07:41:27 -0400] "GET /oldpage.html HTTP/1.1" 301 184
"-" "python-requests/2.4.3 CPython/3.4.2 Linux/3.16.0-4-amd64"
127.0.0.1 - - [21/Jul/2019:07:41:27 -0400] "GET /newpage.html HTTP/1.1" 200 109
"-" "python-requests/2.4.3 CPython/3.4.2 Linux/3.16.0-4-amd64"

从access.log文件中可以看到，该请求已重定向到新的文件名。通信包含两个 GET 请求。

用户代理¶

在本节中，我们指定用户代理的名称。我们创建自己的 Python HTTP 服务器。

http_server.py

#!/usr/bin/env python3

from http.server import BaseHTTPRequestHandler, HTTPServer

class MyHandler(BaseHTTPRequestHandler):

    def do_GET(self):

        message = "Hello there"

        self.send_response(200)

        if self.path == '/agent':

            message = self.headers['user-agent']

        self.send_header('Content-type', 'text/html')
        self.end_headers()

        self.wfile.write(bytes(message, "utf8"))

        return

def main():

    print('starting server on port 8081...')

    server_address = ('127.0.0.1', 8081)
    httpd = HTTPServer(server_address, MyHandler)
    httpd.serve_forever()

main()

我们有一个简单的 Python HTTP 服务器。

1
2
3

if self.path == '/agent':

    message = self.headers['user-agent']

如果路径包含'/agent'，则返回指定的用户代理。

user_agent.py

#!/usr/bin/env python3

import requests as req

headers = {'user-agent': 'Python script'}

resp = req.get("http://localhost:8081/agent", headers=headers)
print(resp.text)

该脚本向我们的 Python HTTP 服务器创建一个简单的 GET 请求。要向请求添加 HTTP 标头，我们将字典传递给headers参数。

1	`headers = {'user-agent': 'Python script'}`

标头值放置在 Python 字典中。

1	`resp = req.get("http://localhost:8081/agent", headers=headers)`

这些值将传递到headers参数。

1 2	`$ simple_server.py starting server on port 8081...`

首先，我们启动服务器。

1 2	`$ ./user_agent.py Python script`

然后我们运行脚本。服务器使用我们随请求发送的代理名称进行了响应。

Python Requests POST 值¶

post方法在给定的 URL 上调度 POST 请求，为填写的表单内容提供键/值对。

post_value.py

#!/usr/bin/env python3

import requests as req

data = {'name': 'Peter'}

resp = req.post("https://httpbin.org/post", data)
print(resp.text)

脚本使用具有Peter值的name键发送请求。 POST 请求通过post方法发出。

$ ./post_value.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Peter"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  ...
}

这是post_value.py脚本的输出。

Python Requests 上传图像¶

在以下示例中，我们将上传图片。我们使用 Flask 创建一个 Web 应用。

app.py

#!/usr/bin/env python3

import os
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    return 'This is home page'

@app.route("/upload", methods=['POST'])
def handleFileUpload():

    msg = 'failed to upload image'

    if 'image' in request.files:

        photo = request.files['image']

        if photo.filename != '':

            photo.save(os.path.join('.', photo.filename))
            msg = 'image uploaded successfully'

    return msg

if __name__ == '__main__':
    app.run()

这是具有两个端点的简单应用。 /upload端点检查是否有某些图像并将其保存到当前目录。

upload_file.py

#!/usr/bin/env python3

import requests as req

url = 'http://localhost:5000/upload'

with open('sid.jpg', 'rb') as f:

    files = {'image': f}

    r = req.post(url, files=files)
    print(r.text)

我们将图像发送到 Flask 应用。该文件在post()方法的files属性中指定。

JSON 格式¶

JSON （JavaScript 对象表示法）是一种轻量级的数据交换格式。人类很容易读写，机器也很容易解析和生成。

JSON 数据是键/值对的集合；在 Python 中，它是通过字典实现的。

读取 JSON¶

在第一个示例中，我们从 PHP 脚本读取 JSON 数据。

send_json.php

<?php

$data = [ 'name' => 'Jane', 'age' => 17 ];
header('Content-Type: application/json');

echo json_encode($data);

PHP 脚本发送 JSON 数据。它使用json_encode()功能完成该工作。

read_json.py

#!/usr/bin/env python3

import requests as req

resp = req.get("http://localhost/send_json.php")
print(resp.json())

read_json.py读取 PHP 脚本发送的 JSON 数据。

1	`print(resp.json())`

json()方法返回响应的 json 编码内容（如果有）。

1 2	`$ ./read_json.py {'age': 17, 'name': 'Jane'}`

这是示例的输出。

发送 JSON¶

接下来，我们将 JSON 数据从 Python 脚本发送到 PHP 脚本。

parse_json.php

<?php

$data = file_get_contents("php://input");

$json = json_decode($data , true);

foreach ($json as $key => $value) {

    if (!is_array($value)) {
        echo "The $key is $value\n";
    } else {
        foreach ($value as $key => $val) {
            echo "The $key is $value\n";
        }
    }
}

该 PHP 脚本读取 JSON 数据，并发送带有已解析值的消息。

send_json.py

#!/usr/bin/env python3

import requests as req

data = {'name': 'Jane', 'age': 17}

resp = req.post("http://localhost/parse_json.php", json=data)
print(resp.text)

该脚本将 JSON 数据发送到 PHP 应用并读取其响应。

1	`data = {'name': 'Jane', 'age': 17}`

这是要发送的数据。

1	`resp = req.post("http://localhost/parse_json.php", json=data)`

包含 JSON 数据的字典将传递给json参数。

1
2
3

$ ./send_json.py
The name is Jane
The age is 17

这是示例输出。

从字典中检索定义¶

在以下示例中，我们在 www.dictionary.com 上找到术语的定义。要解析 HTML，我们使用lxml模块。

1	`$ pip install lxml`

我们使用pip工具安装lxml模块。

get_term.py

#!/usr/bin/env python3

import requests as req
from lxml import html
import textwrap

term = "dog"

resp = req.get("http://www.dictionary.com/browse/" + term)
root = html.fromstring(resp.content)

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

在此脚本中，我们在www.dictionary.com上找到了术语狗的定义。 lxml模块用于解析 HTML 代码。

注意: 包含定义的标记可能会在一夜之间更改。在这种情况下，我们需要修改脚本。

1	`from lxml import html`

lxml模块可用于解析 HTML。

1	`import textwrap`

textwrap模块用于将文本包装到特定宽度。

1	`resp = req.get("http://www.dictionary.com/browse/" + term)`

为了执行搜索，我们在 URL 的末尾附加了该词。

1	`root = html.fromstring(resp.content)`

我们需要使用resp.content而不是resp.text，因为html.fromstring()隐式地希望字节作为输入。（resp.content以字节为单位返回内容，而resp.text以 Unicode 文本形式返回。

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

我们解析内容。主要定义位于span标签内部，该标签具有one-click-content属性。我们通过消除多余的空白和杂散字符来改善格式。文字宽度最大为 50 个字符。请注意，此类解析可能会更改。

$ ./get_term.py
a domesticated canid,
any carnivore of the dog family Canidae, having
prominent canine teeth and, in the wild state, a
long and slender muzzle, a deep-chested muscular
body, a bushy tail, and large, erect ears.
...

这是定义的部分列表。

Python Requests 流请求¶

流正在传输音频和/或视频数据的连续流，同时正在使用较早的部分。 Requests.iter_lines()遍历响应数据，一次一行。在请求上设置stream=True可以避免立即将内容读取到内存中以获得较大响应。

streaming.py

#!/usr/bin/env python3

import requests as req

url = "https://docs.oracle.com/javase/specs/jls/se8/jls8.pdf"

local_filename = url.split('/')[-1]

r = req.get(url, stream=True)

with open(local_filename, 'wb') as f:

    for chunk in r.iter_content(chunk_size=1024):

        f.write(chunk)

该示例流式传输 PDF 文件并将其写入磁盘。

1	`r = req.get(url, stream=True)`

在发出请求时将stream设置为True，除非我们消耗掉所有数据或调用Response.close()，否则请求无法释放回池的连接。

with open(local_filename, 'wb') as f:

    for chunk in r.iter_content(chunk_size=1024):

        f.write(chunk)

我们按 1 KB 的块读取资源，并将其写入本地文件。

Python Requests 凭证¶

auth参数提供基本的 HTTP 身份验证；它使用一个元组的名称和密码来用于领域。安全领域是一种用于保护 Web 应用资源的机制。

$ sudo apt-get install apache2-utils
$ sudo htpasswd -c /etc/nginx/.htpasswd user7
New password:
Re-type new password:
Adding password for user user7

我们使用htpasswd工具创建用于基本 HTTP 身份验证的用户名和密码。

location /secure {

        auth_basic "Restricted Area";
        auth_basic_user_file /etc/nginx/.htpasswd;
}

在 nginx /etc/nginx/sites-available/default配置文件中，我们创建一个安全页面。领域的名称是“限制区域”。

index.html

<!DOCTYPE html>
<html lang="en">
<head>
<title>Secure page</title>
</head>

<body>

<p>
This is a secure page.
</p>

</body>

</html>

在/usr/share/nginx/html/secure目录中，我们有这个 HTML 文件。

credentials.py

#!/usr/bin/env python3

import requests as req

user = 'user7'
passwd = '7user'

resp = req.get("http://localhost/secure/", auth=(user, passwd))
print(resp.text)

该脚本连接到安全网页；它提供访问该页面所需的用户名和密码。

$ ./credentials.py
<!DOCTYPE html>
<html lang="en">
<head>
<title>Secure page</title>
</head>

<body>

<p>
This is a secure page.
</p>

</body>

</html>

使用正确的凭据，credentials.py脚本返回受保护的页面。