Python Web Scraping Question

Basically, I have a large HTML document that I want to scrape. A very simplified example of a similar document is as follows:

<a name = 'ID_0'></a> 
<span class='c2'>Date</span> 
<span class='c2'>December 12,2005</span> 
<span class='c2'>Source</span> 
<span class='c2'>NY Times</span> 
<span class='c2'>Author</span> 
<span class='c2'>John</span> 

<a name = 'ID_1'></a> 
<span class='c2'>Date</span> 
<span class='c2'>January 21,2008</span> 
<span class='c2'>Source</span> 
<span class='c2'>LA Times</span> 

<a name = 'ID_2'></a> 
<span class='c2'>Source</span> 
<span class='c2'>Wall Street Journal</span> 
<span class='c2'>Author</span> 
<span class='c2'>Jane</span>

The file has roughly 3,500 'a' tags, and at first I assumed that every one of them had the same layout. So I wrote something along the lines of:

a_list = soup.find_all('a') 
data2D = [] 
for i in range(0,len(a_list)): 
    data=[] 
    data.append(a_list[i]['name']) 
    data.append(a_list[i].find_next(text='Date').find_next().text) 
    data.append(a_list[i].find_next(text='Source').find_next().text) 
    data.append(a_list[i].find_next(text='Author').find_next().text) 
    data2D.append(data)

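However, since some IDs are missing an author or a date, the scraper picks up the next available author or date, which belongs to the next ID: ID_1 ends up with ID_2's author, and ID_2 ends up with ID_3's date. My first thought was to somehow keep track of the index of each tag and, if that index exceeds the index of the next 'a' tag, append a null value instead. Is there a better solution?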
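The leak described here can be reproduced on a trimmed copy of the sample. This is a minimal sketch, assuming BeautifulSoup 4 with the built-in "html.parser" backend; SAMPLE_HTML, leaked_author and missing_date are illustrative names, and string= is the newer spelling of the text= argument used in the question's code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document: ID_1 has no Author, ID_2 has no Date.
SAMPLE_HTML = """
<a name='ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name='ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first, second = soup.find_all("a")

# find_next() searches the rest of the document, not just this record,
# so the lookup for ID_1 runs on into ID_2's block.
leaked_author = first.find_next(string="Author").find_next().text

# Nothing labelled "Date" follows ID_2, so the search returns None here;
# chaining .find_next() onto that result would raise AttributeError.
missing_date = second.find_next(string="Date")

print(first["name"], "->", leaked_author)
print(second["name"], "->", missing_date)
```

Because find_next() has no notion of where a record ends, any fix has to stop the search at the next 'a' tag explicitly.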

Source

2015-11-04 Jay

Use lxml and XPath. – SIslam
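The comment gives no code, so the following is only one possible reading of it: a sketch that assumes the lxml package is installed, that the 'a' and 'span' elements are siblings under a common parent (a wrapper div is added here for that purpose), and that spans strictly alternate between label and value. The names SAMPLE_HTML, FIELDS and parse_with_xpath are hypothetical.

```python
from lxml import html

# The sample from the question, wrapped in a <div> so that all
# anchors and spans share one parent (an assumption of this sketch).
SAMPLE_HTML = """<div>
<a name='ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name='ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name='ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
</div>"""

FIELDS = ("Date", "Source", "Author")


def parse_with_xpath(markup):
    """Return one [name, Date, Source, Author] row per named anchor."""
    root = html.fromstring(markup)
    rows = []
    for anchor in root.xpath("//a[@name]"):
        name = anchor.get("name")
        # Keep only the spans whose nearest preceding <a> is this anchor,
        # so values from the next record can never be picked up.
        spans = anchor.xpath(
            "following-sibling::span[preceding-sibling::a[1]/@name = $name]",
            name=name,
        )
        texts = [span.text_content().strip() for span in spans]
        record = dict.fromkeys(FIELDS)  # missing fields stay None
        # Assumes label/value pairs: even positions are labels.
        for label, value in zip(texts[0::2], texts[1::2]):
            if label in record:
                record[label] = value
        rows.append([name] + [record[field] for field in FIELDS])
    return rows


xpath_rows = parse_with_xpath(SAMPLE_HTML)
for row in xpath_rows:
    print(row)
```

The predicate on preceding-sibling::a[1] is what bounds each record; if anchor names are not unique in the real document, a different boundary test would be needed.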

Instead of find_next(), I would use .find_next_siblings() (or .find_all_next()) and get all the tags until the next a link or the end of the document. Something along these lines:

links = soup.find_all('a', {"name": True})
data = []
columns = {'Date', 'Source', 'Author'}

for link in links:
    item = [link["name"]]
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # hit the next "a" element - break

        if elm.text in columns:
            item.append(elm.find_next().text)

    data.append(item)
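One thing the loop above leaves open is which column a value belongs to when a field is missing, since each item simply grows by whatever was found. A possible variation, sketched here under the assumption that every row should have the fixed order name, Date, Source, Author with None for gaps, keys the values by label first. The names SAMPLE_HTML, FIELDS and parse_records are illustrative and not part of the original answer, and the sketch assumes a value never equals one of the label strings.

```python
from bs4 import BeautifulSoup

# Sample document from the question.
SAMPLE_HTML = """
<a name='ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name='ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name='ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
"""

FIELDS = ("Date", "Source", "Author")


def parse_records(markup):
    """Return one [name, Date, Source, Author] row per named anchor."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for link in soup.find_all("a", attrs={"name": True}):
        record = dict.fromkeys(FIELDS)  # every field starts as None
        for elm in link.find_next_siblings():
            if elm.name == "a":
                break  # reached the next record
            label = elm.get_text(strip=True)
            if label in record:
                value = elm.find_next_sibling()
                # Guard against a label that is the last tag of a record.
                if value is not None and value.name != "a":
                    record[label] = value.get_text(strip=True)
        rows.append([link["name"]] + [record[field] for field in FIELDS])
    return rows


records = parse_records(SAMPLE_HTML)
for row in records:
    print(row)
```

With the sample input, the ID_1 row ends in None for the author and the ID_2 row has None in the date position, so every row keeps the same four columns.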

Source

2015-11-04 14:45:04 alecxe
