I have almost finished a web crawler that scrapes a table. It only outputs the first row of the table. Can anyone help identify why this doesn't return all the rows in the table? Please ignore the while loop, as it will eventually become a proper loop. Why does the BeautifulSoup row loop only run once?
import urllib
from bs4 import BeautifulSoup
#file_name = "/user/joe/uspc-cpc.txt
#file = open(file_name,"w")
i = 125
while i == 125:
    url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
    print url + '\n'
    i += 1
    data = urllib.urlopen(url).read()
    print data
    #get the table data from dump
    #append to csv file
    soup = BeautifulSoup(data)
    table = soup.find("table", width='80%')
    for tr in table.findAll('tr')[2:]:
        col = row.findAll('td')
        uspc = col[0].get_text().encode('ascii','ignore')
        cpc1 = col[1].get_text().encode('ascii','ignore')
        cpc2 = col[2].get_text().encode('ascii','ignore')
        cpc3 = col[3].get_text().encode('ascii','ignore')
        print uspc + ',' + cpc1 + ',' + cpc2 + ',' + cpc3 + '\n'
#file.write(record)
#file.close()
Code I ran:
import urllib
from bs4 import BeautifulSoup
#file_name = "https://stackoverflow.com/users/ripple/uspc-cpc.txt"
#file = open(file_name,"w")
i = 125
while i == 125:
    url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
    print 'Grabbing from: ' + url + '\n'
    i += 1
    #get the table data from the page
    data = urllib.urlopen(url).read()
    #send to beautiful soup
    soup = BeautifulSoup(data)
    table = soup.find("table", width='80%')
    for tr in table.findAll('tr')[2:]:
        col = tr.findAll('td')
        uspc = col[0].get_text().encode('ascii','ignore').replace(" ","")
        cpc1 = col[1].get_text().encode('ascii','ignore').replace(" ","")
        cpc2 = col[2].get_text().encode('ascii','ignore').replace(" ","")
        cpc3 = col[3].get_text().encode('ascii','ignore').replace(" ","").replace("more...", "")
        record = uspc + ',' + cpc1 + ',' + cpc2 + ',' + cpc3 + '\n'
        print record
#file.write(record)
#file.close()
What does it print? – 2013-04-09 17:34:47
You did not define "row". – 2013-04-09 17:36:17
@Marjin Pieters: How do I define row? The output is one row: 125/901,H 03H 3/02,B 28D 5/00,H 03H 3/04,B 23D 47/005,B 24B 37/08 more... – 2013-04-09 17:37:36
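For reference, the fix the comments point at is to use the loop variable (`tr`) instead of the undefined name `row` inside the row loop. A minimal Python 3 sketch of the corrected loop, using an inline HTML table as a stand-in for the USPTO page (the table layout here is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the fetched page (hypothetical data, not from USPTO)
html = """
<table width="80%">
  <tr><th>USPC</th><th>CPC</th><th>CPC</th><th>CPC</th></tr>
  <tr><td colspan="4">second header row</td></tr>
  <tr><td>125/1</td><td>A 01B 1/00</td><td>A 01B 2/00</td><td>A 01B 3/00</td></tr>
  <tr><td>125/2</td><td>B 02C 1/00</td><td>B 02C 2/00</td><td>B 02C 3/00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", width="80%")

records = []
for tr in table.find_all("tr")[2:]:   # skip the two header rows, as in the question
    col = tr.find_all("td")           # use the loop variable, not the undefined "row"
    records.append(",".join(td.get_text().strip() for td in col))

for record in records:
    print(record)
```

With `row`, the loop body raises a NameError on the first iteration (or, if a stale `row` happens to exist from earlier code, it reprocesses that same row every time), which is why only one row ever appears.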