Python: grab all links from an HTML page and display only the links

I am trying to grab the title of a web page with the following statement:

titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)

With this I get ['random webpage example1']. How do I remove the quotes and brackets?

I also want to grab a set of links that change every hour (which is why I need the wildcard), using:

links = re.findall(r'(file=(.*?).mp3)',the_webpage)

I get

[('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521')]

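How do I get the MP3 links without the file= part?

I also want to download the MP3 files and name them after the page title, so the file would show up as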

random webpage example1.mp3

How would I do this? I am still learning Python and regex, and this has me a bit confused.

Source

2012-08-01 jokajinx

[Regex is generally not a good candidate for parsing XML/HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You may find [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) useful - grabbing all the links is as simple as `soup.find_all('a')`. Take a look at the [docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). – 2012-08-01 20:59:18

You should look at BeautifulSoup, which is better suited to URL parsing. – xbb 2012-08-01 20:59:50

Oh.. you might find this helpful for formatting your question: http://stackoverflow.com/editing-help – 2012-08-01 21:02:09
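To make the parser suggestion in these comments concrete without installing anything, here is a rough sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup (so it is not the `soup.find_all('a')` call itself). The `LinkCollector` class and the `sample_html` string are made up for illustration, and the thread does not confirm that the MP3 URLs on the real page sit inside `<a>` tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical markup, not taken from the page in the question.
sample_html = '<p><a href="http://example.com/a.mp3">A</a> <a href="/b">B</a></p>'

collector = LinkCollector()
collector.feed(sample_html)
for link in collector.links:
    print(link)
```

If BeautifulSoup is available, the same loop would presumably run over the tags returned by `soup.find_all('a')` instead.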

For part 1 at least, you can do

>>> mytitle = titl1[0]
>>> print mytitle
random webpage example1

The regex returns a list of matched strings, so you just need to grab the first item in the list.
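A minimal Python 3 sketch of the same idea, run on a made-up HTML string (`sample_html` and the `first_title` helper are assumptions for illustration, not part of the question). It also covers the case where nothing matches, where indexing with `[0]` would raise an `IndexError`.

```python
import re

# Hypothetical page content standing in for the_webpage.
sample_html = "<html><head><title>random webpage example1</title></head></html>"


def first_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    titles = re.findall(r"<title>(.*?)</title>", html)
    return titles[0] if titles else None


print(first_title(sample_html))
```

Printing the string itself, rather than the list that contains it, is what makes the brackets and quotes disappear.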

Similarly, for the second part, the regex returns a list of tuples. You can do:

>>> download_links = [href for (discard, href) in links] 
>>> print download_links 
['http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521']

As for downloading the files, use urllib2 (at least for Python 2.x; I am not sure about Python 3.x). See this question for details.

Source

2012-08-01 21:09:52 Michael0x2a

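For the first part, titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) returns a list, and when you print a list, the brackets and quotes are printed too. So if you are sure there will always be exactly one match, try print titl1[0]. (You could also try re.search instead.)

For the second part, if you change your regex pattern from "(file=(.*?)\.mp3)" to "file=(.*?)\.mp3" you will get only the 'http://linkInThisPart/path/etc/etc' part; you will need to add the .mp3 extension back on, though.

i.e.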

audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)]
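To see how the number of groups changes what `re.findall` returns, here is a small Python 3 sketch. The URL is the one from the question; the trailing `&autostart=false` is a made-up suffix, added only so that the match has something to stop before. The third pattern, which moves `\.mp3` inside the group, is a variation of my own and not something the answer proposes.

```python
import re

# Hypothetical fragment; only the URL itself comes from the question.
web_page = (
    "file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3"
    "&autostart=false"
)

# Two groups: findall returns one tuple per match, one item per group.
nested = re.findall(r"(file=(.*?)\.mp3)", web_page)

# One group: findall returns plain strings, without 'file=' and '.mp3'.
single = re.findall(r"file=(.*?)\.mp3", web_page)

# One group that includes the extension, so nothing has to be re-appended.
whole = re.findall(r"file=(.*?\.mp3)", web_page)

print(nested)
print(single)
print(whole)
```

With the third form there is no need for the `x + '.mp3'` step.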

For downloading the files, you may want to look at urllib and urllib2:

import urllib2
url = 'http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3'
req = urllib2.Request(url)
temp_file = open('random webpage example1.mp3', 'wb')  # binary mode for audio data
data = urllib2.urlopen(req).read()  # read the whole response body
temp_file.write(data)
temp_file.close()
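For readers on Python 3, where `urllib2` no longer exists, a rough equivalent can be sketched with `urllib.request`. The `download` and `filename_from_title` helpers are hypothetical names introduced here, and replacing path separators in the title is my own assumption about what a safe file name needs.

```python
import shutil
import urllib.request


def download(url, dest_path):
    """Stream the resource at url into dest_path and return dest_path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path


def filename_from_title(title, extension=".mp3"):
    """Build a file name from a page title, replacing path separators."""
    safe = title.replace("/", "_").replace("\\", "_").strip()
    return safe + extension


# Example call (left commented out because it needs network access):
# download(
#     "http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3",
#     filename_from_title("random webpage example1"),
# )
```

Streaming with `shutil.copyfileobj` avoids holding the whole file in memory, which matters more for audio than for HTML.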

Source

2012-08-01 21:21:44 ffledgling

So when I use audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)] all I get back is ['', '', ''] – jokajinx 2012-08-03 13:15:49

The title works fine, thanks – jokajinx 2012-08-03 13:17:15

Try just '.' instead of '\.'? – ffledgling 2012-08-03 18:15:20

Code:

#!/usr/bin/env python 

import re,urllib,urllib2 

Url = "http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000" 
print Url 
print 'test .............' 
req = urllib2.Request(Url) 
print "1" 
response = urllib2.urlopen(req) 
print "2" 
the_webpage = response.read() 
print "3" 
titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) 
print "4" 
a2 = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',the_webpage)] 
print "5" 
a2 = [x[0][5:] for x in a2] 
print "6" 
ti = titl1[0] 
print ti 
print "7" 
print a2 
print "8" 

print "9" 
#print the_page 
print "10" 

req=urllib2.Request(a2) 
print "11" 
temp_file=open(ti) 
print "12" 
buffer=urllib2.urlopen(req).read() 
print "13" 
temp_file.write(buff) 
print "14" 
temp_file.close() 
print "15" 
print "16"

Result

http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000 
test ............. 
1 
2 
3 
4 
5 
6 
Rick Ross - Sixteen (feat. Andre 3000) 
7 
['', '', ''] 
8 
9 
10 
Traceback (most recent call last): 
    File "grub.py", line 29, in <module> 
    req=urllib2.Request(a2) 
    File "/usr/lib/python2.7/urllib2.py", line 198, in __init__ 
    self.__original = unwrap(url) 
    File "/usr/lib/python2.7/urllib.py", line 1056, in unwrap 
    url = url.strip() 
AttributeError: 'list' object has no attribute 'strip'
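Reading the script against its output suggests where it goes wrong. The second assignment to a2 indexes into each string: x[0] is the single character 'h', and 'h'[5:] is empty, which gives ['', '', '']. urllib2.Request(a2) is then handed the whole list instead of one URL, which is the AttributeError in the traceback. Further down, open(ti) opens the file for reading rather than writing, and buff is never defined. A Python 3 sketch of just the link extraction follows; the `extract_mp3_links` helper and the `page` fragment are made up for illustration, and dropping the repeated links is my own choice.

```python
import re


def extract_mp3_links(html):
    """Return the unique .mp3 URLs that follow 'file=' in html, in page order."""
    links = re.findall(r"file=(.*?\.mp3)", html)
    return list(dict.fromkeys(links))  # drop repeats, keep first-seen order


# What the second comprehension did to each already-complete URL.
url = "http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3"
print(repr(url[0][5:]))

# Hypothetical page fragment with the same link repeated three times.
page = ("file=" + url + "&x=1 ") * 3
for link in extract_mp3_links(page):
    print(link)
```

Each link would then be requested and written out on its own, one URL per call, with the output file opened in 'wb' mode.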

Source

2012-08-03 18:40:08 jokajinx

Try formatting your code. – ffledgling 2012-08-06 02:45:53

Python 3:

import requests 
import re 
from urllib.request import urlretrieve

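- First get the HTML text

html_text = requests.get('url').text

- Use a regex to find the URLs

re.match('pattern', 'text', flags)

In the pattern, '()' groups the content you want. In this case we group the "http://*****.mp3" part, which can then be referenced with .group(1) or groups().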
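A short sketch of that grouping on a hard-coded string, assuming the link looks like the one in the question. It uses `re.search` rather than `re.match`, because `re.match` only matches at the very start of the text, and a real page would not start with `file=`.

```python
import re

# The URL is the one from the question; on a real page it would be
# surrounded by other markup.
text = "file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3"

match = re.search(r"file=(http://.*?\.mp3)", text)
if match:
    print(match.group(0))  # the whole match, including 'file='
    print(match.group(1))  # only the part inside the parentheses
    print(match.groups())  # a tuple of all groups
```

`re.findall` with a single group returns the `group(1)` value of every match directly.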

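# the group captures the whole URL, from http:// through .mp3
url_matches = re.findall(r'file=(http://media.*?\.mp3)', html_text)
index = 0
for url_match in url_matches:
    index += 1
    print(url_match)
    # the ./graber/mp3/ directory must already exist
    urlretrieve(url_match, './graber/mp3/user' + str(index) + '.mp3')

This is how I got it done; I hope it helps. (There are several ways to download things; in this case I used urlretrieve.)

Source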

2017-06-01 05:14:12 tyrantqiao
