
Python BeautifulSoup duplicate entries

This scrapes images from 4chan's photography board. The problem is that it scrapes the same image twice. I can't figure out why I'm getting duplicate photos; if anyone could help, that would be great.

from bs4 import BeautifulSoup 
import requests 
import re 
import urllib2 
import os 


def get_soup(url,header): 
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'lxml') 

image_type = "image_name" 
url = "http://boards.4chan.org/p/" 
url = url.strip('\'"') 
print url 
header = {'User-Agent': 'Mozilla/5.0'} 
r = requests.get(url) 
html_content = r.text 
soup = BeautifulSoup(html_content, 'lxml') 
anchors = soup.findAll('a') 
links = [a['href'] for a in anchors if a.has_attr('href')] 
images = [] 
def get_anchors(links):
    for a in anchors:
        links.append(a['href'])
    return links

raw_links = get_anchors(links) 

for element in raw_links:
    if ".jpg" in str(element) or '.png' in str(element) or '.gif' in str(element):
        print element
        raw_img = urllib2.urlopen("http:" + element).read()
        DIR = "C:\\Users\\deez\\Desktop\\test\\"
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        f = open(DIR + image_type + "_" + str(cntr) + ".jpg", 'wb')
        f.write(raw_img)
        f.close()

4chan has two links to the same image in each post: the blue filename link above the image, and a link wrapped around the image itself. – grochmal


Ah, that makes sense. How can I get rid of the other one? –


You can add all the links to a list, remove the duplicates, and then download only the links from that list. – grochmal
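A minimal sketch of that suggestion, reusing raw_links, DIR and image_type from the question's code (a set-based filter is just one way to drop the repeats; treating the hrefs as protocol-relative, as the question's "http:" + element does):

seen = set()
image_links = []
for link in raw_links:
    if link.endswith(('.jpg', '.png', '.gif')) and link not in seen:
        seen.add(link)               # remember the link so later repeats are skipped
        image_links.append(link)

for cntr, link in enumerate(image_links, start=1):
    raw_img = requests.get("http:" + link).content
    with open(DIR + image_type + "_" + str(cntr) + ".jpg", 'wb') as f:
        f.write(raw_img)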

Answer


Don't pull every anchor on the page; use the class name to get only the specific links:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://boards.4chan.org/p/").content, "lxml")

# Select only the filename link inside each post's file-info block.
imgs = [a["href"] for a in soup.select("div.fileText a")]

print(imgs)

The reason you get dupes is that at least two divs carry the same link for each image:

[screenshot of the page markup showing the same image link appearing twice in a post]
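For completeness, a sketch that combines that selector with the download loop from the question (DIR and image_type are taken from the original code; the lxml parser and keeping each file's original extension are assumptions, not part of the answer above):

import os
import requests
from bs4 import BeautifulSoup

DIR = "C:\\Users\\deez\\Desktop\\test\\"
image_type = "image_name"

soup = BeautifulSoup(requests.get("http://boards.4chan.org/p/").content, "lxml")

# div.fileText holds one filename link per posted file, so each image appears once.
imgs = [a["href"] for a in soup.select("div.fileText a")]

for cntr, link in enumerate(imgs, start=1):
    raw_img = requests.get("http:" + link).content   # hrefs are protocol-relative, as in the question
    ext = os.path.splitext(link)[1]                  # keep the original file extension
    with open(DIR + image_type + "_" + str(cntr) + ext, "wb") as f:
        f.write(raw_img)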