Python Web Crawler



3.2 urllib

Copyright notice: please credit the source when reposting: http://www.codingsoho.com/

urllib

https://docs.python.org/2.7/library/urllib.html

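In Python 3, urllib was split into several modules and renamed urllib.request, urllib.parse, and urllib.error.

urllib.request.urlopen() in Python 3 is equivalent to urllib2.urlopen() in Python 2; the old urllib.urlopen() has been removed.

The 2to3 tool can adapt these imports automatically when converting code to Python 3.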
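For code that has to run under both versions without 2to3, one common pattern is a guarded import. The following is a minimal sketch, not taken from the original post; the one-entry query dictionary is only there to show that the imported name works.

```python
# Try the Python 3 module layout first, then fall back to Python 2.
try:
    from urllib.parse import urlencode
    from urllib.request import urlopen
    from urllib.error import URLError
except ImportError:
    from urllib import urlencode
    from urllib2 import urlopen, URLError

# urlencode behaves the same way under either import path.
query = urlencode({'keyword': 'P100'})
print(query)  # keyword=P100
```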

Basic operations

GET

The target URL:

http://www.healforce.com/cn/index.php?ac=search&at=result&lng=cn&keyword=P100&countnum=1

    import urllib
    # Encode the query parameters as key=value pairs joined by '&'
    params = urllib.urlencode({'ac': 'search', 'at': 'result', 'lng': 'cn', 'keyword': 'P100'})
    # Appending the parameters to the URL sends a GET request
    f = urllib.urlopen("http://www.healforce.com/cn/index.php?%s" % params)
    print f.read()
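
To inspect the query string before any request is sent, the URL can be built and parsed back locally. This is a small sketch in Python 3 syntax (urllib.parse instead of urllib); `BASE_URL` and `build_search_url` are illustrative names, not part of the original code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://www.healforce.com/cn/index.php"

def build_search_url(keyword):
    """Return the search URL for a keyword (hypothetical helper)."""
    params = urlencode({'ac': 'search', 'at': 'result',
                        'lng': 'cn', 'keyword': keyword})
    return "%s?%s" % (BASE_URL, params)

url = build_search_url('P100')
print(url)

# Parsing the query string back shows each parameter survived encoding.
parsed = parse_qs(urlsplit(url).query)
print(parsed['keyword'])  # ['P100']
```

No network access is needed here, so it is a convenient way to check the parameters before pointing the crawler at the site.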

POST

Python 2:

    import urllib
    params = urllib.urlencode({'ac': 'search', 'at': 'result', 'lng': 'cn', 'keyword': 'EB03'})
    # Passing the encoded parameters as the second argument (data) sends a POST;
    # appending them to the URL would send a GET instead
    f = urllib.urlopen("http://www.healforce.com/cn/index.php", params)
    print f.read()
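
For comparison, here is a sketch of the same POST in Python 3, where the body must be bytes and the request can be inspected before it is sent. The actual network call is commented out, so this only demonstrates how the request is constructed.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({'ac': 'search', 'at': 'result',
                    'lng': 'cn', 'keyword': 'EB03'})
# In Python 3 the request body must be bytes, not str.
body = params.encode('ascii')

# Supplying data makes urllib issue a POST instead of a GET.
req = Request("http://www.healforce.com/cn/index.php", data=body)
print(req.get_method())  # POST

# Sending is left commented out so the sketch runs offline:
# from urllib.request import urlopen
# with urlopen(req) as f:
#     print(f.read())
```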

Using an HTTP proxy

Fiddler cannot capture the requests above; a proxy has to be configured before the traffic can be monitored.

    import urllib
    params = urllib.urlencode({'ac': 'search', 'at': 'result', 'lng': 'cn', 'keyword': 'P100'})
    # Fiddler's listening address, written without a scheme
    proxies = {'http': '127.0.0.1:8888'}
    f = urllib.urlopen("http://www.healforce.com/cn/index.php?%s" % params, proxies=proxies)
    print f.read()

This raises an error:

IOError: [Errno url error] invalid proxy for http: '127.0.0.1:8888'

The exception is raised here, in urllib:

    def open_unknown_proxy(self, proxy, fullurl, data=None):
        """Overridable interface to open unknown URL type."""
        type, url = splittype(fullurl)
        raise IOError, ('url error', 'invalid proxy for %s' % type, proxy)
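
A likely cause, offered as an assumption rather than a confirmed diagnosis: before dispatching, Python 2's URLopener.open() splits the proxy string into a type and a remainder, then looks for a method named open_<type>. With no scheme in '127.0.0.1:8888', the text before the first colon is treated as the type, no matching handler exists, and open_unknown_proxy is called. The sketch below is a simplified re-implementation of that splitting step for illustration only; `split_type` is a stand-in, not the library function itself.

```python
import re

# Same idea as urllib.splittype in Python 2: everything before the first
# ':' (with no '/' or ':' in it) is taken to be the URL type.
_TYPE_RE = re.compile(r'^([^/:]+):')

def split_type(url):
    """Split 'type:rest' into (type, rest); return (None, url) if no type."""
    match = _TYPE_RE.match(url)
    if match:
        scheme = match.group(1)
        return scheme.lower(), url[len(scheme) + 1:]
    return None, url

# Without a scheme, the host is mistaken for the type.
print(split_type('127.0.0.1:8888'))         # ('127.0.0.1', '8888')
# With a scheme, the type is 'http', for which a handler exists.
print(split_type('http://127.0.0.1:8888'))  # ('http', '//127.0.0.1:8888')
```

If this reading is right, writing the proxy as 'http://127.0.0.1:8888' should avoid the error.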

opener

For example, to set the proxy through an opener:

    import urllib
    proxies = {'http': '127.0.0.1:8888'}
    opener = urllib.FancyURLopener(proxies)
    # Note: this calls the module-level urlopen(), not opener.open()
    f = urllib.urlopen("http://www.healforce.com/cn/index.php")
    print f.read()

However, once the request actually goes through the opener, as in the following code, the same error is raised:

IOError: [Errno url error] invalid proxy for http: '127.0.0.1:8888'

    import urllib
    proxies = {'http': '127.0.0.1:8888'}
    opener = urllib.FancyURLopener(proxies)
    params = urllib.urlencode({'ac': 'search', 'at': 'result', 'lng': 'cn', 'keyword': 'P100'})
    f = opener.open("http://www.healforce.com/cn/index.php?%s" % params)

These two errors will be debugged separately later.
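
For reference, Python 3 no longer has FancyURLopener as the recommended route; a proxy is normally attached through ProxyHandler and build_opener in urllib.request. The sketch below assumes Fiddler is listening on 127.0.0.1:8888, as in the examples above, and writes the proxy with an explicit http:// scheme. The request itself is commented out so the code runs without Fiddler.

```python
from urllib.request import ProxyHandler, build_opener

# Assumption: Fiddler listens on 127.0.0.1:8888.
proxy_handler = ProxyHandler({'http': 'http://127.0.0.1:8888'})
opener = build_opener(proxy_handler)

# Requests made through this opener are routed via the proxy:
# with opener.open("http://www.healforce.com/cn/index.php") as f:
#     print(f.read())
```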