2014-10-31 64 views
0

Simple web scraper formatting: how can I fix this? I have the following code:

import requests 
from bs4 import BeautifulSoup 



def posts_spider(): 
    url = 'http://www.reddit.com/r/nosleep/new/' 
    source_code = requests.get(url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text) 
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data(): 
    item_url = 'http://www.reddit.com/r/nosleep/new/' 
    source_code = requests.get(item_url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text) 
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider() 
get_single_item_data() 

The output is:

My light.. I'm seeing and feeling things.. what's happening? 
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/ 


Why being the first to move in a new Subdivision is not the most brilliant idea... 
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/ 


I Am Falling. 
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/ 


Heidi 
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/ 


I remember everything 
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/ 


To Lieutenant Griffin Stone 
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/ 


The woman in my room 
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/ 


Dr. Margin's Guide to New Monsters: The Guest, or, An Update 
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/ 


The Evil Woman (part 5) 
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/ 


Blood for the blood god, The first of many. 
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/ 


An introduction to the beginning of my journey 
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/ 


A hunter..of sorts. 
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/ 


Void Trigger 
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/ 


What really happened to Amelia Earhart 
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/ 


I Used To Be Fine Being Alone 
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/ 


The Green One 
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/ 


Elevator 
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/ 


Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws 
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/ 


Cranial Nerve Zero 
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/ 


Mom's Story About a Ghost Uncle 
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/ 


It snowed. 
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/ 


The pocket watch I found at a store 
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/ 


You’re Going To Die When You Are 23 
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/ 


The Customer: Part Two 
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/ 


Dimenhydrinate 
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/ 


• 
• 
• 
• 
• 
12 
12 
76 
4 
2 
4 
6 
4 
18 
2 
6 
13 
5 
16 
2 
2 
14 
48 
1 
13 

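What I want to do is print the matching rating right next to each post, so I can tell at a glance what rating a post has, instead of printing the titles and links in one "block" and the rating numbers in another "block". Thanks in advance for your help!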

+0

Have you tried this: http://www.reddit.com/dev/api? – 2014-10-31 15:36:26

+0

Specifically: http://www.reddit.com/r/python/new.json – 2014-10-31 15:37:32
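
For anyone taking the JSON route suggested in the comments above, here is a minimal sketch of how a listing could be turned into (title, url, score) rows. The field layout (`data` → `children` → `data` with `title`, `score`, `permalink`) is an assumption about the listing format, the helper name `posts_from_listing` is hypothetical, and the sample scores are made up for illustration. A hard-coded sample is parsed instead of a live request so the snippet runs offline.

```python
import json

# Hypothetical, trimmed-down sample of a listing response. The nesting
# (data -> children -> data) and the field names are assumptions, and the
# scores are invented for illustration.
SAMPLE = """
{"kind": "Listing", "data": {"children": [
  {"kind": "t3", "data": {"title": "Heidi", "score": 4,
                          "permalink": "/r/nosleep/comments/2kvrnf/heidi/"}},
  {"kind": "t3", "data": {"title": "Elevator", "score": 2,
                          "permalink": "/r/nosleep/comments/2kuwxu/elevator/"}}
]}}
"""


def posts_from_listing(listing):
    """Return (title, url, score) tuples from a parsed listing dict."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        url = "http://www.reddit.com" + post["permalink"]
        posts.append((post["title"], url, post["score"]))
    return posts


# Title, link and score come from the same record, so they print together.
for title, url, score in posts_from_listing(json.loads(SAMPLE)):
    print(score, title, url)
```

With a live response, the same function would presumably be fed the parsed body of the `.json` URL instead of `SAMPLE`.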

Answers

1

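You can do it in one pass by iterating over the div elements with class="thing" (think of it as iterating over the posts). For each div, get the link and the rating: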

from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Each div.thing wraps one post, so its title and score stay together
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))

        print(link.string, href, rating.string)

posts_spider()

FYI, div.thing is a CSS selector that matches all divs with class="thing".
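
To try the per-post iteration without hitting the network, here is a minimal sketch that runs the same idea on a hard-coded snippet. The markup in `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the real page, and `parse_posts` is an illustrative helper name, not part of the answer above. It also guards against a post that has no score element, which is an assumption about what the live page might contain.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has far more nesting.
SAMPLE_HTML = """
<div class="thing">
  <div class="score unvoted">4</div>
  <a class="title" href="/r/nosleep/comments/2kvrnf/heidi/">Heidi</a>
</div>
<div class="thing">
  <a class="title" href="/r/nosleep/comments/2kuwxu/elevator/">Elevator</a>
</div>
<div class="other"><a class="title" href="/ignored/">Ignored</a></div>
"""


def parse_posts(html):
    """Return (title, url, score) tuples, one per div.thing element."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for thing in soup.select("div.thing"):  # only divs with class "thing"
        link = thing.find("a", {"class": "title"})
        if link is None:
            continue  # skip blocks that carry no title link
        rating = thing.find("div", {"class": "score"})
        # Keep None when the score element is missing instead of crashing
        score = rating.get_text(strip=True) if rating is not None else None
        href = urljoin("http://www.reddit.com", link.get("href"))
        posts.append((link.get_text(strip=True), href, score))
    return posts


for title, href, score in parse_posts(SAMPLE_HTML):
    print(score, title, href)
```

Because the search for the link and the score is scoped to each `thing`, a title can never be paired with another post's score, which is the problem with collecting the two lists separately.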

+0

You literally posted this a minute before I was going to post the same thing. As a side note, I believe the rating should be `find('span', {'class': 'rank'})` – Anzel 2014-10-31 15:53:35

+0

@Anzel Yes, I was considering that, but then I noticed the OP is using `score`, and I think that is what the OP really means. Let's wait and see. Thanks. – alecxe 2014-10-31 15:54:37

+0

You're right! The OP is using 'score unvoted', how unusual – Anzel 2014-10-31 16:01:21