Python Crawler: Batch-Downloading Images

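I happened to need to crawl the image library of a large Chinese website (only a handful of sites in China have image libraries this large), so I took the chance to practice Python, which I had not written in a long time.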

The approach should be clear from the code. I have replaced the site's address with a_website; you know why 🙂

Python environment: ActivePython 2.7.2.5

Downloading the images requires wget; if you do not have it, please install it yourself.
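If wget is not available, the download step can probably be done with the standard library alone. The following is a minimal sketch, written for Python 3 (`urllib.request`) rather than the Python 2.7 used in this post; `download_image`, `dest_dir` and `timeout` are hypothetical names chosen for illustration and are not part of the original script.

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def download_image(url, dest_dir=".", timeout=5):
    """Save the resource at `url` into `dest_dir` and return the saved path.

    Hypothetical helper: a stand-in for the external wget call.
    """
    # Name the file after the last path segment of the URL
    filename = os.path.basename(unquote(urlsplit(url).path)) or "download.bin"
    path = os.path.join(dest_dir, filename)

    # Stream the response body to disk instead of loading it all into memory
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

A failed request raises an exception (for example `urllib.error.URLError`), so a caller that wants to skip broken links would wrap the call in `try`/`except` rather than check a return code as with wget.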

# coding: utf-8
import os
import re
import subprocess
import urllib
import urllib2
import cookielib


def a_website(keyword, count):
    """Download up to `count` images for `keyword`; return the number saved."""
    # Both URL parts are redacted placeholders; the result offset pn is appended below
    url = "a_website-1"
    url2 = "a_website-2&word=" + keyword

    # Request headers
    header = {
        "Host": "image.a_website.com",
        "Referer": url + url2 + "0",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
    }

    # Create the download directory if it does not exist, then work inside it
    dirname = "./picture"
    if not os.path.isdir(dirname):
        os.makedirs(dirname)
    os.chdir(dirname)

    # Opener that keeps cookies across requests
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Capture the image URL that follows objURL":" in the response
    regex = re.compile(r'(?<=objURL":")(http.*?\.(?:jpg|jpeg|JPG|gif|png|bmp))')

    pn = 0
    downloaded = 0
    while downloaded < count:
        request = urllib2.Request(url + url2 + str(pn))
        for key in header:
            request.add_header(key, header[key])
        html = opener.open(request).read()

        results = regex.findall(html)
        if not results:
            # Nothing left to fetch; stop instead of requesting pages forever
            break

        # Advance to the next batch of results and update the Referer to match
        pn += 30
        header["Referer"] = url + url2 + str(pn)

        # Download the images; the URL is passed as its own argument so that
        # characters such as & are not interpreted by the shell
        for picture in results:
            if subprocess.call(["wget.exe", "-q", "-t", "2", "-T", "5", picture]) == 0:
                downloaded += 1
                print downloaded, "Success! url:" + picture
                if downloaded == count:
                    break
    return downloaded


if __name__ == '__main__':
    # Read the search keyword and the number of images to download
    keyword = raw_input("Please enter the picture keyword:")
    count = raw_input("Please enter the number you want to search:")
    if keyword != '' and count.isdigit() and int(count) > 0:
        total = a_website(urllib.quote(keyword), int(count))
        print "Download complete:", total, "pictures saved."
    else:
        print "Error: the keyword must not be empty and the number must be a positive integer."
    raw_input("Press Enter to exit...")
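To make the regular-expression step easier to follow, the sketch below runs a cleaned-up form of the script's pattern on a made-up fragment of a response. The sample text and its URLs are invented for this example, since the real response format is not shown in this post; only the `objURL` key comes from the script.

```python
import re

# The lookbehind anchors on objURL":" and the capture group takes the
# shortest run of characters that ends in one of the listed image extensions.
OBJ_URL = re.compile(r'(?<=objURL":")(http.*?\.(?:jpg|jpeg|JPG|gif|png|bmp))')

# Hypothetical response fragment; the URLs are invented for illustration.
sample = (
    '{"objURL":"http://example.com/a/cat.jpg","width":640},'
    '{"objURL":"http://example.com/b/dog.png","width":800},'
    '{"thumbURL":"http://example.com/t/dog_small.png"}'
)

urls = OBJ_URL.findall(sample)
print(urls)
# ['http://example.com/a/cat.jpg', 'http://example.com/b/dog.png']
```

The `thumbURL` entry is skipped because the lookbehind only accepts text that directly follows `objURL":"`. Note that the extension list is case-sensitive apart from `JPG`; compiling with `re.IGNORECASE` would be one way to accept other uppercase extensions.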
