2016-11-11 81 views
1

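I'm trying to get all the links in a web page using urllib.request. When I test my function, it always prints (None, 0), even though the URL I'm using contains several <a href= tags.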

import urllib.request as ur 
def getNextlink(url): 
    sourceFile = ur.urlopen(url) 
    sourceText = sourceFile.read() 
    page = str(sourceText) 

    startLink = page.find('<a href=') 
    if startLink == -1: 
        return None, 0 
    startQu = page.find('"', startLink) 
    endQu = page.find('"', startQu+1) 
    url = page[startQu +1:endQu] 
    return url, endQu 

Answers

3

You should use Beautiful Soup together with requests for this; it makes the job much smoother. Here is an example:

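from bs4 import BeautifulSoup 
import requests 

def links(url): 
    # Download the page and parse it (the 'lxml' parser must be installed separately)
    html = requests.get(url).content 
    bsObj = BeautifulSoup(html, 'lxml') 

    # href=True skips <a> tags that have no href attribute
    finalLinks = set() 
    for link in bsObj.find_all('a', href=True): 
        finalLinks.add(link['href']) 
    return finalLinks 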

If this helps, please upvote the answer.

+0

I forgot to mention that I can't use any third-party modules. – Anymee
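
Given the no-third-party constraint in the comment above, one possible route is the standard library's html.parser module. The following is only a minimal sketch: the names LinkCollector and extract_links and the sample string are made up for illustration and do not come from any answer in this thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup, used instead of a live page
sample = (
    '<p><a class="nav" href="/home">Home</a> '
    "<A HREF='about.html'>About</A> "
    '<a name="x">no link</a></p>'
)
print(extract_links(sample))
```

For a real page, the text passed to extract_links could come from urlopen(url).read().decode(...) with whatever encoding the page actually uses. Unlike a search for the literal string '<a href=', this also picks up anchors where href is not the first attribute, or where the tag is written in upper case.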

0

Here is another solution:

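from urllib.request import urlopen 

url = ''  # set this to the page to scan; an empty string raises ValueError
# Decode the bytes (assumes UTF-8); str() on bytes would give the "b'...'" representation
html = urlopen(url).read().decode('utf-8', errors='replace') 

for i in range(len(html) - 3): 
    if html[i] == '<' and html[i+1] == 'a' and html[i+2] == ' ': 
        pos = html[i:].find('</a>') 
        if pos != -1:  # skip anchors that have no closing tag
            print(html[i: i+pos+4]) 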

Set the url variable to your page first. Hope this helps; don't forget to upvote and accept.

+0

I'm using Python 3, so I changed it a little to make it run, but it still doesn't work. It raises ValueError: unknown url type: '' – Anymee

+0

I have modified it for Python 3 –
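
For reference, the string-search idea from the question can also be extended to walk the whole page by passing a start position into each search. This is a rough sketch under stated assumptions: the names get_next_link and get_all_links and the sample string are hypothetical, and the function works on already-decoded text rather than fetching a URL.

```python
def get_next_link(page, start=0):
    """Return (url, end_index) for the next '<a href="...">' at or after start.

    Returns (None, -1) when no further link is found.
    """
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, -1
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None, -1
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None, -1
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link by resuming the search after the previous match."""
    links = []
    position = 0
    while True:
        url, end = get_next_link(page, position)
        if url is None:
            break
        links.append(url)
        position = end + 1
    return links


# Hypothetical sample markup, used instead of a live page
sample = '<a href="a.html">A</a> text <a href="http://example.com/b">B</a>'
print(get_all_links(sample))
```

Note the limits of this approach: it only matches the exact lowercase text '<a href=' followed by a double-quoted value, so an anchor such as <a class="x" href="..."> is missed. That may be one reason a page with visible links still yields (None, 0).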

0

How about this?

import requests 
from bs4 import BeautifulSoup 

research_later = "giraffe" 
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later 

r = requests.get(goog_search) 
print(r)  # Python 3 print function; shows the response status

soup = BeautifulSoup(r.text, "html.parser") 
print(soup) 

import requests 
from bs4 import BeautifulSoup 
r = requests.get("http://www.flashscore.com/soccer/netherlands/eredivisie/results/") 
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser") 
htmltext = soup.prettify() 
print(htmltext) 

import requests 
from bs4 import BeautifulSoup 

url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings" 
r = requests.get(url) 
soup = BeautifulSoup(r.content, "html.parser") 

maindiv = soup.find_all("div", {"class": "text-center"}) 
for div in maindiv: 
    print(div.text) 