# Python Crawler (2)
[TOC]
## Urllib Module: Introduction and Basic Usage

The urllib module is one of the most commonly used modules for writing crawlers. This section covers the basics of using it.
### Opening a Web Page as an Object

Within the urllib package, `urllib.request.urlopen()` opens a URL and returns a response object:
```python
import urllib.request
import re

# Fetch the page and decode it, ignoring bytes that are not valid UTF-8.
data = urllib.request.urlopen("http://www.jd.com").read().decode("utf-8", "ignore")
print(len(data))

# Extract the page title with a regular expression; re.S lets "." match newlines.
pat = "<title>(.*?)</title>"
print(re.compile(pat, re.S).findall(data))
```
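If the target can be slow to respond, the same call can be wrapped with a timeout and a context manager; a minimal sketch (the 10-second timeout is an assumption, not from the original):

```python
import urllib.request

# The response object is a context manager in Python 3; timeout is in seconds (assumed value).
with urllib.request.urlopen("http://www.jd.com", timeout=10) as resp:
    data = resp.read().decode("utf-8", "ignore")
print(len(data))
```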
### Saving a Web Page to Disk

`urllib.request.urlretrieve("page URL", filename="local file path")` downloads a page straight to disk:
```python
urllib.request.urlretrieve("http://www.jd.com", filename="C:/Users/hibis/Desktop/jd.html")
```
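For reference, `urlretrieve` also returns the local path and the response headers, which can be used to verify the download; a small sketch (the relative filename is illustrative):

```python
import urllib.request

# urlretrieve returns (local_filename, headers); headers is an http.client.HTTPMessage.
path, headers = urllib.request.urlretrieve("http://www.jd.com", filename="jd.html")
print(path, headers.get("Content-Type"))
```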
### Browser Disguise

```python
data3 = urllib.request.urlopen("https://wwww.qiushibaike.com/").read().decode("utf-8", "ignore")
```
This raises an error:
```
Traceback (most recent call last):
  File "F:/untitled/Urlib.py", line 26, in <module>
    data3 = urllib.request.urlopen("https://wwww.qiushibaike.com/").read().decode("utf-8","ignore")
  File "C:\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
```
The site has detected the crawler: any client that does not look like a browser is refused.
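Instead of letting the `HTTPError` propagate, it can be caught and inspected; a minimal sketch:

```python
import urllib.request
import urllib.error

try:
    data3 = urllib.request.urlopen("https://wwww.qiushibaike.com/").read().decode("utf-8", "ignore")
except urllib.error.HTTPError as e:
    # e.code is the HTTP status (404 here); e.reason is the server's message.
    print("Rejected:", e.code, e.reason)
```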
Opening the browser's developer tools shows what a real browser request looks like:
```
Request URL:    https://www.qiushibaike.com/
Request Method: GET
User-Agent:     Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0
```
The `User-Agent` header is the field that tells the server whether the client is a browser.
Creating an opener object with `urllib.request.build_opener()` allows advanced settings such as custom headers. First define the header as a tuple:
```python
UA = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0")
```
Attach the tuple to the opener's headers to disguise the request, then install the opener globally:
```python
urllib.request.install_opener(opener)
```
A complete example:
```python
import urllib.request

url = "https://www.qiushibaike.com"

# Build an opener carrying a browser-like User-Agent and install it globally,
# so every subsequent urlopen() call sends this header.
opener = urllib.request.build_opener()
UA = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0")
opener.addheaders = [UA]
urllib.request.install_opener(opener)

data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
print(len(data))
```
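As an aside, the same disguise can be applied per request without installing a global opener, using `urllib.request.Request`; a sketch of the alternative:

```python
import urllib.request

url = "https://www.qiushibaike.com"
# Attach the User-Agent to this one request only, instead of globally.
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
})
data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
print(len(data))
```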
### User-Agent Pool

If a single browser identity makes too many requests it will eventually be rejected, so we collect several browser identifiers into a pool and pick one at random for each request.
```python
import random
import urllib.request

url = "https://www.qiushibaike.com/"

uapools = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36",
]


def UA():
    # Install a global opener whose User-Agent is drawn at random from the pool.
    opener = urllib.request.build_opener()
    thisua = random.choice(uapools)
    ua = ("User-Agent", thisua)
    opener.addheaders = [ua]
    urllib.request.install_opener(opener)
    print("Current UA: " + str(thisua))


for i in range(0, 10):
    UA()
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    print(len(data))
```
```
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36
9869
Current UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0
9869

Process finished with exit code 0
```
## Batch-Scraping Jokes from Qiushibaike

- Target site: https://www.qiushibaike.com/
- Target data: trending jokes
- Requirement: automatic pagination
```python
import urllib.request
import re
import random
import time

uapools = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
]


def UA():
    # Rotate the global User-Agent before each page request.
    opener = urllib.request.build_opener()
    thisua = random.choice(uapools)
    ua = ("User-Agent", thisua)
    opener.addheaders = [ua]
    urllib.request.install_opener(opener)
    print("Current UA: " + str(thisua))


for i in range(0, 35):
    UA()
    thisurl = "https://www.qiushibaike.com/8hr/page/" + str(i + 1) + "/"
    try:
        data = urllib.request.urlopen(thisurl).read().decode("utf-8", "ignore")
        # Each joke sits in a <div class="content"> ... <span>text</span> block.
        pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
        rst = re.compile(pat, re.S).findall(data)
        for j in range(0, len(rst)):
            print(rst[j])
            print("-------")
    except Exception as err:
        # Skip pages that fail to load rather than aborting the whole crawl.
        pass
```
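Note that the script imports `time` but never calls it; presumably the intent was to pause between pages so the crawl looks less mechanical. A sketch of such a delay (the interval is my assumption):

```python
import time
import random

# Inside the page loop, after each request completes:
time.sleep(random.uniform(1, 3))  # pause 1-3 seconds between pages (assumed interval)
```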
## Scraping Courseware Images from a MOOC Page

- Course URL: http://mooc1.chaoxing.com/course/200903990.html
- Course name: 普通地质学 (Physical Geology)
### Analyzing the Link Structure

Start with the structure of the course's chapter list:
```
Application form:    http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=167036766
Syllabus:            http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=116724219
Course introduction: http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=116724220
Lesson plans:        http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=167037929
```
The pattern is clear: every chapter page uses the same URL template, differing only in the `knowledgeId` parameter:
```
http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=[num]
```
| Part | Name | knowledgeId |
| --- | --- | --- |
| Part 1: Course application | Demonstration-course application form | 167036766 |
| Part 2: Course overview | Syllabus | 116724219 |
| | Course introduction | 116724220 |
| | Lesson plans | 167037929 |
| Part 3: Courseware | Introduction | 116724181 |
| | Chapter 1 (Overview of the Earth), Section 1: Shape, size, and surface features of the Earth | 116724221 |
| | Chapter 1, Section 2: Physical properties of the Earth | 116724222 |
| | Chapter 1, Section 3: The Earth's layered structure | 116724223 |
| | Chapter 1, Section 4: Overview of geological processes | 116724224 |
| | Chapter 2: Minerals | 116724188 |
| | Chapter 3: Magmatism and igneous rocks | 116724189 |
| | Chapter 4: Metamorphism and metamorphic rocks | 116724200 |
| | Chapter 5: Exogenic geological processes and sedimentary rocks | 116724194 |
| | Chapter 6: Geological time | 116724201 |
| | Chapter 7: Tectonic movement and geological structures | 116724208 |
| | Chapter 8: Plate tectonics | 116724195 |
| | Chapter 9: Weathering | 116724191 |
| | Chapter 10: Geological action of wind | 116724179 |
| | Chapter 11: Geological action of rivers | 116724180 |
| | Chapter 12: Geological action of groundwater | 116724178 |
| | Chapter 13: Geological action of glaciers | 116724202 |
| | Chapter 14: Geological action of lakes and swamps | 116724209 |
| | Chapter 15: Geological action of seawater | 116724193 |
| | Chapter 17: Mass movement | 116724192 |
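With the template above and the knowledgeId values from the table, the chapter URLs can be generated programmatically; a minimal sketch (the variable names are mine, and only a few IDs are shown):

```python
# Build chapter URLs from the shared template and a sample of knowledgeId values.
base = "http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId="
knowledge_ids = [116724181, 116724221, 116724222, 116724223, 116724224, 116724188]
for kid in knowledge_ids:
    print(base + str(kid))
```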
### Analyzing the Courseware Structure

Take Chapter 2 (Minerals) as an example.
Within each page, all of the courseware image links live inside a single viewer element. Opening the page source and extracting a fragment reveals the pattern:
```html
<div class="imglook" id="img">
  <img class="documentImg" index="1" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/1.png" style="width:100%;" id="ext-gen1042">
  <img class="documentImg" index="2" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/2.png" style="width:100%;" id="ext-gen1043">
  <img class="documentImg" index="3" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/3.png" style="width:100%;" id="ext-gen1044">
  <img class="documentImg" index="4" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/4.png" style="width:100%;" id="ext-gen1045">
  <img class="documentImg" index="5" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/5.png" style="width:100%;" id="ext-gen1046">
  <img class="documentImg" index="6" src="https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/6.png" style="width:100%;" id="ext-gen1047">
  <img class="documentImg" index="7" ....
```
The skeleton of the container, and a sample XPath for a single image:

```html
<div id="img" class="imglook">
  <img />
</div>
```

```
//*[@id="ext-gen1042"]
```
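Because each slide thumbnail shares one base path and differs only in the trailing number, the images themselves can be downloaded directly once that path is known (e.g. copied from the developer tools); a sketch assuming the base path from the fragment above and a slide count of 7:

```python
import urllib.request

# Base path copied from the HTML fragment above; the slide count of 7 is an assumption.
base = "https://s3.ananas.chaoxing.com/doc/cc/1b/14/60addc1f45fa315b87420b3a43045d1e/thumb/"
for n in range(1, 8):
    urllib.request.urlretrieve(base + str(n) + ".png", filename=str(n) + ".png")
```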
Fetching the page content directly does not work: the backend uses Ajax to transfer the PDF to the browser asynchronously, so the image list is not present in the static HTML.
```python
import requests
from lxml import etree

url = "http://mooc1.chaoxing.com/nodedetailcontroller/visitnodedetail?courseId=200903990&knowledgeId=116724188"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
}

# Fetch the chapter page and parse it with lxml.
response = requests.get(url=url, headers=headers)
wb_date = response.text
html = etree.HTML(wb_date)
print(wb_date)

# Collect every img src present in the static HTML.
b = html.xpath('//img/@src')
print(b)
print(b[0])
```
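If the XPath does return usable absolute image URLs, they could be saved with the same `requests` session; a sketch continuing from the variables above (assumes the `src` values in `b` are absolute URLs):

```python
# Continuing from the code above: save whatever image URLs the static HTML exposes.
for idx, src in enumerate(b, start=1):
    img = requests.get(src, headers=headers)
    with open(str(idx) + ".png", "wb") as f:
        f.write(img.content)
```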