A proxy pool is a standby collection of proxy IPs for coping with a site's anti-crawling measures. Anyone who has written a crawler knows that if you hit certain sites too frequently, your IP gets banned or blocked and subsequent requests fail, so you need proxies to work around it. A large, healthy proxy pool is therefore something every crawler system should have: it keeps the crawl running reliably.
Below are the small changes I made; after applying them, crawl throughput improved noticeably.
First, define a Redis client class so the pool is easy to call from the spider:
import json
import random
import sys

import redis


class RedisClient(object):
    """
    Redis client that keeps the proxy pool in a hash, keyed by proxy address.
    """
    def __init__(self, name, host, port):
        """
        :param name: name of the Redis hash that stores the proxies
        :param host: Redis server host
        :param port: Redis server port
        """
        self.name = name
        self.__conn = redis.Redis(host=host, port=port, db=0)

    def get(self):
        """
        Get a random proxy from the pool.
        :return: proxy string, or None if the pool is empty
        """
        key = self.__conn.hgetall(name=self.name)
        # return random.choice(key.keys()) if key else None
        # In Python 3, key.keys() returns a dict_keys view, which does not
        # support indexing, so it cannot be fed to random.choice directly.
        # Also, redis returns bytes under Python 3 and must be decoded.
        rkey = random.choice(list(key.keys())) if key else None
        if isinstance(rkey, bytes):
            return rkey.decode('utf-8')
        else:
            return rkey
        # return self.__conn.srandmember(name=self.name)

    def put(self, key):
        """
        Put an item into the pool.
        :param key: proxy to add (dicts/lists are serialized to JSON first)
        :return:
        """
        key = json.dumps(key) if isinstance(key, (dict, list)) else key
        return self.__conn.hincrby(self.name, key, 1)
        # return self.__conn.sadd(self.name, value)

    def getvalue(self, key):
        value = self.__conn.hget(self.name, key)
        return value if value else None

    def pop(self):
        """
        Pop a random item: fetch it, then remove it from the pool.
        :return:
        """
        key = self.get()
        if key:
            self.__conn.hdel(self.name, key)
        return key
        # return self.__conn.spop(self.name)

    def delete(self, key):
        """
        Delete an item from the pool.
        :param key:
        :return:
        """
        self.__conn.hdel(self.name, key)
        # self.__conn.srem(self.name, value)

    def inckey(self, key, value):
        self.__conn.hincrby(self.name, key, value)

    def getAll(self):
        # return self.__conn.hgetall(self.name).keys()
        # Under Python 3 redis returns bytes, so the keys need decoding.
        if sys.version_info.major == 3:
            return [key.decode('utf-8') for key in self.__conn.hgetall(self.name).keys()]
        else:
            return self.__conn.hgetall(self.name).keys()
        # return self.__conn.smembers(self.name)

    def get_status(self):
        return self.__conn.hlen(self.name)
        # return self.__conn.scard(self.name)

    def changeTable(self, name):
        self.name = name
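Before wiring the client into a spider, it can be sanity-checked on its own (a minimal sketch; the two sample proxies are placeholders, not real servers):

conn = RedisClient('useful_proxy', 'localhost', 6379)
conn.put('127.0.0.1:8080')     # placeholder proxy
conn.put('127.0.0.2:8080')     # placeholder proxy
print(conn.get_status())       # pool size, here 2
print(conn.get())              # a random proxy, e.g. '127.0.0.1:8080'
conn.delete('127.0.0.2:8080')  # retire a dead proxy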
Then use it in the spider:
from scrapy_redis.spiders import RedisSpider
import scrapy

from scrapy_redis_test.utils.proxy_ip_pool import RedisClient

class zhihuspider(RedisSpider):
    ...
    redis_conn = RedisClient('useful_proxy', 'localhost', 6379)
    ...
    def parse(self, response):
        ...
        ip = 'http://' + self.redis_conn.get()  # pick a random proxy from the pool
        print('crawling parse_JsonResponse, random ip:', ip)
        yield scrapy.Request(self.answer_start_url.format(question_id, 5, 10), callback=self.parse_answer, headers=header, meta={"question_id": question_id, 'proxy': ip}, encoding="utf8")
        ...
    def parse_answer(self, response):
        ...
        if response.status != 200:
            # The hash stores bare 'ip:port' keys, so strip the scheme
            # before deleting the dead proxy from the pool.
            self.redis_conn.delete(response.meta.get('proxy').replace('http://', ''))
            ip = 'http://' + self.redis_conn.get()  # fetch a fresh proxy
            print(response.meta.get('proxy'), 'is unusable', '\n', 'refetched ip:', ip)
        else:
            ip = response.meta.get('proxy')  # got a 200, keep using this proxy
            print('This ip is useful again:', ip)
        yield scrapy.Request(next_url, headers=header, callback=self.parse_answer, encoding="utf8", meta={'proxy': ip})
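One caveat worth noting: by default, Scrapy's HttpError middleware filters out non-2xx responses before they ever reach the callback, so the response.status != 200 branch above would never execute. To let parse_answer see failed responses and retire dead proxies, whitelist the relevant status codes on the spider (a minimal sketch; which codes to list depends on how the target site rejects you):

class zhihuspider(RedisSpider):
    # Let these error responses through to parse_answer so the
    # proxy-invalidation branch can actually run.
    handle_httpstatus_list = [403, 429, 503]
    ...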
At this point, make sure both the proxy-pool service and redis-server are running: the pool service is responsible for continuously refreshing the proxy IPs, while the crawl consumes (and weeds out) IPs from the pool as it runs.
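Refilling the pool itself is a separate component; a minimal refill loop might look like this (a sketch only: fetch_free_proxies is a hypothetical callable standing in for whatever proxy source you scrape or buy):

import time

def refill_pool(conn, fetch_free_proxies, interval=60):
    # Periodically push fresh 'ip:port' strings into the Redis hash.
    while True:
        for proxy in fetch_free_proxies():  # hypothetical proxy source
            conn.put(proxy)
        print('pool size:', conn.get_status())
        time.sleep(interval)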