爬虫系列: 自建维护代理ip池

多数爬虫源码运行的报错都是由于访问目标网站过于频繁，从而导致目标网站返回错误或者没有数据返回。

目前大多数网站都是有反爬措施的，如果 IP 在一定时间内请求次数超过了一定的阈值就会触发反爬措施，拒绝访问，也就是我们经常听到的“封IP”。

那么怎么解决这个问题呢？

一种解决办法就是降低访问频率，访问一次就等待一定时长，然后再次访问。这种方法对于反爬措施不严格的网站是有效的。

如果遇到反爬措施严格的网站，访问次数多了还是会被封杀。而且有时候你需要爬取数据，这种解决办法会使获取数据的周期特别长。

第二种解决办法就是使用代理 IP。我不断地切换 IP 访问，让目标网站认为是不同的用户在访问，从而绕过反爬措施。这也是最常见的方式。

接着，我们又面临一个问题：哪来这么多独立 IP 地址呢？

最省事的方式当然是花钱买服务，这种花钱买到的 IP 一般都是比较稳定可靠的。

今天我们来聊一下不花钱免费获取代理 IP 的方式。

获取ip保存

自行爬取ip。
- 通过第三方库spider_proxy_pool （推荐）。除了原有免费资源，还支持拓展
- 自行爬取存入数据库或者文件里
收费api接口

读取

服务器读取

文件读取

校验可用、更新ip池

# 测试代理IP是否可用
def check_proxy_ip():
    proxies = {
                'http': 'http://{}:{}'.format(ip, port),
                'https': 'https://{}:{}'.format(ip, port), }
    test_url = 'http://www.baidu.com/'

    try:
        res = requests.get(url=test_url, proxies=proxies, timeout=8)
        if res.status_code == 200:
            print(ip, ":", port, 'Success')
            with open('proxies.txt', 'a') as f:
                f.write(ip + ':' + port + '\n')
    except Exception as e:
        print(ip, port, 'Failed')

如何使用代理ip访问

文件中随机获取
收费代理API

通过fake_useragent 生成随机user-agent

安装

# 安装
pip install fake_useragent

# 更新
pip install -U fake-useragent

使用方法

# 每次都能随机生成一个UA表示，大大增强了爬虫的真实性。

from fake_useragent import UserAgent

def get_random_ua(self):
		# 实例化 user-agent 对象
        ua = UserAgent()        # 创建User-Agent对象
        useragent = ua.random   # 一步到位，随机生成一个 user-agent
        # 或者可以这样写
        useragent = ua.chrome   # 指定浏览器 user-agent
        return useragent

代理参数-proxies：转为使用ip准备

定义：代替你原来的IP地址去对接网络的IP地址。隐藏自身真实IP，避免被封。

语法结构

from fake_useragent import UserAgent

proxies = {'协议':'协议://IP:端口号'}
# http和https是相同的

proxies = {
  'http':'http://59.172.27.6:38380',
  'https':'https://59.172.27.6:38380'
}

def get_proxies():
		with open('proxies.txt', 'r') as f:
    		result = f.readlines()  # 读取所有行并返回列表
    		
		proxy_ip = random.choice(result)[:-1]       # 获取了所有代理IP
		L = proxy_ip.split(':')
        proxy_ip = {
                    'http': 'http://{}:{}'.format(L[0], L[1]),
                    'https': 'https://{}:{}'.format(L[0], L[1])}
	return proxy_ip
	
	
list = []
for ip in ip_list:
   try:
       response = requests.get('https://www.baidu.com', proxies=ip, timeout=2)
       if response.status_code == 200:
           list.append(ip)
   except:
       pass
   else:
       print(ip, '检测通过')

ModuleNotFoundError: No module named redis

windows 下载redis安装

http://c.biancheng.net/redis/windows-installer.html

Python默认是不支持Redis的，当引用redis时就会报错

Python安装Redis库，下载https://github.com/redis/redis-py 后

解压并安装

切换到redis-py目录，找到setup.py

执行python setup.py install 即可

进入python编辑器，可以正常导入redis

PyCharm python3.9，from lxml import etree报错

python3.7对应的 lxml版本

解决办法（在PyCharm中*更换python3.7和lxml==4.6.5版本*）：

终端：pip install lxml==4.6.5

ProxyPool 介绍

https://blog.csdn.net/weixin_48923393/article/details/124138628
https://juejin.cn/post/7064907584263684133
【Python爬虫代理IP池(proxy pool)安装配置教程-哔哩哔哩】 https://b23.tv/LJMw5bT

Python使用selenium建立代理ip池访问网站

chorme使用ip代理比较简单，使用如下代码即可

import time
from selenium import webdriver

url = "https://www.baidu.com/s?wd=ip"
proxy = "118.190.217.182:80"
chromeOptions = webdriver.ChromeOptions()  # 设置代理
chromeOptions.add_argument(f"--proxy-server=http://{proxy}")
driver = webdriver.Chrome("D:/tools/wedriver/chromedriver.exe", chrome_options=chromeOptions)
driver.get(url)
time.sleep(2)
print(driver.title)
driver.close()