Typical example

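import requests

# The URL must include a scheme (http:// or https://), otherwise requests raises MissingSchema
url = 'https://www.example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0',
    'referer': url,
}
# Using the response as a context manager closes the connection on exit
with requests.get(url=url, headers=headers) as response:
    if not response.ok:
        print('request error!')
    else:
        data2 = response.text
        print(data2)
        try:
            # json is a method, so it has to be called
            data = response.json()
            print(data)
        except ValueError:
            # json() raises a ValueError subclass when the body is not valid JSON
            print('response is not in json format')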
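
A slightly more defensive variant of the request above might look like the following. This is only a sketch: the helper name `safe_get` and the 10-second timeout are assumptions for illustration, not part of the original example. It relies on `requests.exceptions.RequestException` being the base class of the errors requests raises, including the `MissingSchema` error triggered by a URL without `http://` or `https://`.

```python
import requests


def safe_get(url, timeout=10):
    """Return the Response, or None if the request could not be completed.

    safe_get and the default timeout are hypothetical choices for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx answers into an HTTPError so they are handled below
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f'request failed: {error}')
        return None
    return response


# A URL without a scheme is rejected before any network traffic happens
result = safe_get('www.example.com')
print(result)
```

Passing `timeout` is worth the extra argument because `requests.get` otherwise waits indefinitely for a server that never answers.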

The response object

Member                 Description
status_code            HTTP status code (200 is OK, 404 is Not Found)
text                   The response body decoded to a str (unicode)
apparent_encoding      The encoding guessed from the body itself; it may differ from encoding
encoding               The encoding used to decode text
json()                 Parses the body as JSON and returns the result; raises an error if the body is not valid JSON
ok                     True if status_code is less than 400
close()                Closes the connection to the server
content                The response body, in bytes
cookies                A CookieJar object with the cookies sent back from the server
elapsed                A timedelta object with the time elapsed between sending the request and the arrival of the response
headers                A case-insensitive dictionary of response headers
history                A list of Response objects holding the redirect history of the request
is_permanent_redirect  True if the response is a permanent redirect, otherwise False
is_redirect            True if the response is itself a well-formed redirect that could be followed, otherwise False
iter_content()         Iterates over the response body in chunks
iter_lines()           Iterates over the lines of the response body
links                  The parsed Link headers of the response
next                   A PreparedRequest object for the next request in a redirect chain, if there is one
raise_for_status()     Raises an HTTPError if the status code indicates an error (4xx or 5xx)
reason                 The text corresponding to the status code, e.g. "OK" or "Not Found"
request                The PreparedRequest object that produced this response
url                    The final URL of the response
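
Several of these members can be tried out without a network connection by building a `Response` by hand. The sketch below does that; the helper `make_response` is hypothetical, and filling the private `_content` attribute is an assumption made purely for offline demonstration, not something normal code should do.

```python
import requests


def make_response(status_code, body, encoding='utf-8'):
    """Build a Response by hand so its members can be inspected offline.

    Hypothetical helper: setting the private _content attribute is only
    acceptable in a demonstration like this one.
    """
    response = requests.models.Response()
    response.status_code = status_code
    response._content = body
    response.encoding = encoding
    return response


ok_response = make_response(200, b'{"name": "lxml"}')
print(ok_response.ok)       # True, because 200 < 400
print(ok_response.content)  # the raw bytes
print(ok_response.text)     # the bytes decoded with ok_response.encoding
print(ok_response.json())   # the body parsed into a dict

missing = make_response(404, b'not found')
print(missing.ok)           # False
try:
    missing.raise_for_status()
except requests.exceptions.HTTPError as error:
    # raise_for_status() raises; it does not return the error
    print('HTTPError raised:', error)
```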

Status code quick reference

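2xx: Success

200 OK: The request succeeded; the server processed it and returned the requested resource.
201 Created: The request succeeded and the server created a new resource.
204 No Content: The request succeeded, but the response has no body.

3xx: Redirection

301 Moved Permanently: The requested resource has been permanently moved to a new URL.
302 Found: The requested resource is temporarily available at a different URL.
307 Temporary Redirect: The requested resource is temporarily available at a different URL; the client repeats the request there with the same method and keeps using the original URL in the future.

4xx: Client errors

400 Bad Request: The request is malformed and the server cannot understand it.
401 Unauthorized: The request requires authentication, and no valid credentials were provided.
403 Forbidden: The request was refused; the client has no permission to access the resource.
404 Not Found: The requested resource does not exist.

5xx: Server errors

500 Internal Server Error: The server hit an internal error and could not complete the request.
502 Bad Gateway: A server acting as a gateway or proxy received an invalid response from the upstream server.
503 Service Unavailable: The server cannot handle the request right now, usually because of overload or maintenance.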
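
The same lookup can be done in code with the standard library's `http.HTTPStatus`. The following is a minimal sketch; the function name `describe` and the category labels are assumptions chosen to mirror the groups above.

```python
from http import HTTPStatus

# Category labels keyed by the first digit of the status code (labels are illustrative)
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe(status_code):
    """Return a short description such as '404 Not Found (client error)'."""
    category = CATEGORIES.get(status_code // 100, 'other')
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # HTTPStatus only knows registered codes
        phrase = 'Unknown'
    return f'{status_code} {phrase} ({category})'


for code in (200, 302, 404, 503):
    print(describe(code))
```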

post

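payload = {'username': 'admin', 'password': '123456'}
# data= sends the payload form-encoded; url and headers are the ones defined in the typical example
html = requests.post(url, headers=headers, data=payload).text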
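
To see what `data=` actually puts on the wire, the request can be prepared without sending it. This is a sketch under assumptions: the `/login` path is hypothetical, and the `json=` variant is shown only for comparison with the form-encoded body.

```python
import json

import requests

payload = {'username': 'admin', 'password': '123456'}
login_url = 'https://www.example.com/login'  # hypothetical endpoint

# data= produces a form-encoded body
form_request = requests.Request('POST', login_url, data=payload).prepare()
print(form_request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_request.body)                     # username=admin&password=123456

# json= serializes the same dict as a JSON document instead
json_request = requests.Request('POST', login_url, json=payload).prepare()
print(json_request.headers['Content-Type'])  # application/json
print(json.loads(json_request.body))
```

Which of the two a server expects depends on the site, so checking the real request in the browser's network panel first is usually the safer route.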

xpath-python

For other XPath expressions, see the XPath element-selection notes in:

Playwright installation and common functions | Min's Blog (xxminxx.love)

Installation

conda install lxml

Usage

Typical scraping usage

Importing the library

from lxml import etree

Converting a string to an etree (parsed as HTML)

html = etree.HTML(text)

XPath matching

html.xpath(<xpath>)

For example:

html.xpath('//li/a')
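
Put together, the steps above can be sketched as follows. The HTML snippet is made up for illustration; `xpath()` returns a list, whose items are elements, strings, or attribute values depending on the expression.

```python
from lxml import etree

# Made-up sample markup for illustration
text = """
<ul>
  <li><a href="/first">First</a></li>
  <li><a href="/second">Second</a></li>
  <li>No link here</li>
</ul>
"""

html = etree.HTML(text)

links = html.xpath('//li/a')          # list of Element objects
titles = html.xpath('//li/a/text()')  # list of text nodes
hrefs = html.xpath('//li/a/@href')    # list of attribute values

for link in links:
    # .text is the element's text, .get() reads an attribute
    print(link.text, link.get('href'))

print(titles)  # ['First', 'Second']
print(hrefs)   # ['/first', '/second']
```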

Other

  • An etree can be built from a text file as well as from a string

Open the file ./test.html:

html = etree.parse('./test.html', etree.HTMLParser())
  • Printing the tree

Taking the etree instance html as an example: etree.tostring serializes the tree to bytes, which are then decoded into a str

print(etree.tostring(html).decode('utf-8'))
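
One detail worth knowing when the page contains non-ASCII text: by default `etree.tostring` produces ASCII bytes and writes other characters as numeric character references. The sketch below, using a made-up fragment, compares that with passing `encoding='utf-8'`, which keeps the characters themselves.

```python
from lxml import etree

# Made-up fragment; the parser adds the missing <html> and <body> wrappers
html = etree.HTML('<p>中文</p>')

# Default: ASCII bytes, non-ASCII characters become numeric references
default_output = etree.tostring(html)
print(type(default_output))            # <class 'bytes'>
print(default_output.decode('utf-8'))

# encoding='utf-8': the original characters survive the round trip
utf8_output = etree.tostring(html, encoding='utf-8').decode('utf-8')
print(utf8_output)
```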