*Updated: How to parse HTML with Python/BeautifulSoup

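First of all, I'm very new to Python. I'm trying to scrape contact information from an offline website and write it out to a CSV file. I want to grab the page URL (I'm not sure how to get that from the HTML), email, phone, location data (if possible), any names, any phone numbers, and the site's tagline (if it exists).

Update #2 code: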

import os, csv, re
from bs4 import BeautifulSoup

topdir = 'C:\\projects\\training\\html'
output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

            def mailto_link(soup):
                if soup.name != 'a':
                    return None
                for key, value in soup.attrs:
                    if key == 'href':
                        m = re.search('mailto:(.*)',value)
                        if m:
                            all_contacts.append(m)
                            return m.group(1)
                return None

            for ul in soup.findAll('ul'):
                contact = []
                for li in soup.findAll('li'):
                    s = li.find('span')
                    if not (s and s.string):
                        continue
                    if s.string == 'Email:':
                        a = li.find(mailto_link)
                        if a:
                            contact['email'] = mailto_link(a)
                    elif s.string == 'Website:':
                        a = li.find('a')
                        if a:
                            contact['website'] = a['href']
                    elif s.string == 'Phone:':
                        contact['phone'] = unicode(s.nextSibling).strip()
                all_contacts.append(contact)
                output.writerow([all_contacts])

print "Finished"

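This currently outputs nothing other than the header row. What am I missing here? It should return at least some information from the HTML file, which is this page: http://bendoeslife.tumblr.com/about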


2013-05-13 user2338089

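You generally can't get the page URL from the page's HTML; you need to save it at fetch time. As for the rest... we'd need to see some sample data to tell you what's going wrong with your parser. – abarnert 2013-05-13 18:14:11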
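One way to act on this for files that are already on disk is to record where each file came from while walking the directory. The sketch below is only an illustration under stated assumptions: it is written for Python 3, it stores the local file path as a stand-in for the URL (the original URL would have to be saved when the page is downloaded), and the helper names `iter_html_files` and `write_index` are made up for this example.

```python
import csv
import os
import tempfile


def iter_html_files(topdir):
    """Yield the full path of every .html/.htm file under topdir."""
    for root, _dirs, files in os.walk(topdir):
        for name in sorted(files):
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(root, name)


def write_index(topdir, csv_path):
    """Write one CSV row per HTML file, recording where it was found."""
    # On Python 3 the csv module wants a text-mode file opened with newline="".
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source_path"])
        for path in iter_html_files(topdir):
            writer.writerow([path])


# Demo on a throwaway directory (hypothetical file names).
demo_dir = tempfile.mkdtemp()
for name in ("about.HTML", "notes.txt"):
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as fh:
        fh.write("<p>about</p>")

index_path = os.path.join(demo_dir, "index.csv")
write_index(demo_dir, index_path)

with open(index_path, newline="", encoding="utf-8") as fh:
    rows = list(csv.reader(fh))
print(rows)
```

Keeping the path (or URL) in the same row as the scraped fields means it never has to be recovered from the markup later.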

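There are at least two problems here.

First, f is a filename, not the file's contents, or a soup made from those contents. So f.find('h2') looks for 'h2' in the filename, which isn't very useful.

Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index just gives you the string version of a number. For example: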

>>> s = 'A string with an h2 in it' 
>>> i = s.find('h2') 
>>> str(i) 
'17'

So your code is doing something like this:

>>> f = 'C:\\python\\training\\offline\\somehtml.html' 
>>> headline = f.find('h2') 
>>> str(headline) 
'-1'

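You probably want to call methods on the soup object rather than on f. BeautifulSoup.find returns a "subtree" of the soup, which is exactly the thing you want to convert to a string here.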
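A minimal sketch of that difference, assuming Python 3 with bs4 installed and the built-in "html.parser" backend. The helper name `load_soup` and the sample markup are hypothetical; the point is only that the file's contents, not its name, go into BeautifulSoup.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path):
    """Parse the contents of the file at path, not the path string itself."""
    with open(path, encoding="utf-8") as fh:
        return BeautifulSoup(fh, "html.parser")


# Write a small sample file to parse (hypothetical content).
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "somehtml.html")
with open(sample_path, "w", encoding="utf-8") as fh:
    fh.write("<html><body><h2>About</h2></body></html>")

soup = load_soup(sample_path)
headline = soup.find("h2")      # a Tag (a subtree of the soup), or None
name_index = os.path.basename(sample_path).find("h2")  # str.find: an index

print(name_index)        # -1, because the file name contains no "h2"
print(str(headline))     # the <h2> element serialized back to markup
```

Note that soup.find returns None when nothing matches, so real code should check for that before calling methods on the result.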

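However, it's impossible to test this without your sample input, so I can't promise that's the only problem in your code.

Meanwhile, when you run into something like this, you should try printing out the intermediate values. If you print out f, headline, and headline2, it will be much more obvious why headline3 is wrong.

After just replacing f with soup in the find calls and fixing your indentation errors, the code now runs against your sample file, http://bendoeslife.tumblr.com/about.

However, it doesn't do anything useful. Since there is no h2 tag anywhere in the file, headline ends up as None. The same goes for most of the other fields. The only call that finds anything is the one for url, because you're asking it to find an empty string, which matches something arbitrary. With three different parsers, I get <p>about</p>, <html><body><p>about</p></body></html>, and <html><body></body></html>...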
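As a rough illustration of that kind of debugging, the sketch below prints the type and repr of a few intermediate values. It assumes Python 3 with bs4 installed; the file name and markup are hypothetical stand-ins, not the asker's actual data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the real file name and file contents.
filename = "somehtml.html"
html = "<html><body><p>about</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

checks = {
    "filename.find('h2')": filename.find("h2"),  # str.find: an index, or -1
    "soup.find('h2')": soup.find("h2"),          # Tag, or None if absent
    "soup.find('p')": soup.find("p"),
}

# Printing both the type and the repr makes an int or None stand out
# immediately where a Tag was expected.
for label, value in checks.items():
    print(label, "->", type(value).__name__, repr(value))
```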

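You need to actually understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with the title "Email", inside an <li> element labeled "email". So you need to write a find that locates it by one of those criteria, or by something else.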
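A hedged sketch of such a find, assuming Python 3 with bs4 installed. The sample markup is made up to resemble the description above (in particular, `class="email"` on the `<li>` and the example.com address are assumptions, not taken from the real page), and `extract_email` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the structure described above.
SAMPLE = """
<ul>
  <li class="email"><a href="mailto:ben@example.com" title="Email">Email</a></li>
  <li class="website"><a href="http://example.com/" title="Website">Website</a></li>
</ul>
"""


def extract_email(soup):
    """Return the address from a mailto: link, or None if there isn't one."""
    # First criterion: an <a> whose title attribute is "Email".
    link = soup.find("a", attrs={"title": "Email"})
    if link is None:
        # Fallback criterion: the first <a> inside <li class="email">.
        li = soup.find("li", class_="email")
        link = li.find("a") if li is not None else None
    if link is None:
        return None
    href = link.get("href", "")
    if href.startswith("mailto:"):
        return href[len("mailto:"):]
    return None


email = extract_email(BeautifulSoup(SAMPLE, "html.parser"))
missing = extract_email(BeautifulSoup("<p>about</p>", "html.parser"))
print(email, missing)
```

Returning None when nothing matches keeps the CSV-writing code simple: a missing field can be written as an empty cell instead of raising an exception halfway through a directory walk.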


2013-05-13 18:16:52 abarnert
