2016-12-05 39 views

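I am trying to pull HTML information from a website using BeautifulSoup, but for some reason the output comes out malformed: every character in each row is split into its own cell. How can I combine each row into a single value, using pandas or another module?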

My current code is:

from bs4 import BeautifulSoup 
import urllib 
import csv 
import pandas as pd 
url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM' 

html = urllib.urlopen(url) 
soup = BeautifulSoup(html,'html.parser') 

r0 = soup.find_all("tr", class_="row0") 
#removed r1 just to make sure everything works first 
#r1 = soup.find_all("tr", class_="row1") 


f = csv.writer(open('news.csv','w')) 


for a in r0: 
    f.writerow(a.encode('utf-8')) 

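First, I'm not sure how to merge each row into a single cell. Second, is there another way to pull this information that doesn't require merging at all?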

Answer

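import requests
from bs4 import BeautifulSoup

url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser needs the lxml package installed

# class_ takes plain class names, not CSS selectors, so 'row1' has no leading dot
rows = soup.find_all(class_=['row0', 'row1'])
for row in rows:
    # collect the text of every <td> in the row into one list
    cell = [i.text for i in row.find_all('td')]
    print(cell)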

Output:

['06/12/201608:41', '01159', 'JIMEI INT ENT', 'Announcements and Notices - [Resumption]EXCHANGE NOTICE - RESUMPTION OF TRADING\xa0(1KB, HTM)'] 
['06/12/201608:15', '03933', 'UNITED LAB', 'Announcements and Notices - [Issue of Convertible Securities]COMPLETION OF THE ISSUE OF U.S.$130,000,000 CONVERTIBLE BONDS DUE 2021\xa0(80KB, PDF)'] 
['06/12/201608:10', '00005', 'HSBC HOLDINGS', 'Announcements and Notices - [Overseas Regulatory Announcement - Other]Transaction in own shares\xa0(860KB, PDF)'] 
['06/12/201607:59', '00763', 'ZTE', 'Announcements and Notices - [Overseas Regulatory Announcement - Board/Supervisory Board Resolutions]Announcement Resolutions of the Eleventh Meeting of the Seventh Session of the Board of Directors\xa0(186KB, PDF)'] 
['06/12/201607:08', '01378', 'CHINAHONGQIAO', 'Announcements and Notices - [Major Transaction]MAJOR TRANSACTION-(1) SUBSCRIPTION OF SHARES OF LOFTEN; AND (2) ACQUISITION OF THE ENTIRE EQUITY INTEREST IN INNOVATIVE METAL\xa0(75KB, PDF)'] 
['06/12/201607:04', '01345', 'PIONEER PHARM', 'Circulars - [Connected Transaction](1) DISCLOSEABLE AND CONNECTED TRANSACTION DISPOSAL OF 100% INTEREST IN A WHOLLY-OWNED SUBSIDIARY AND (2) NOTICE OF EGM\xa0(220KB, PDF)'] 
['06/12/201606:11', '00993', 'HUARONG INT FIN', 'Announcements and Notices - [Discloseable Transaction]DISCLOSEABLE TRANSACTION IN RELATION TO\r\nSUBSCRIPTION FOR NOTES\xa0(144KB, PDF)'] 
['06/12/201606:08', '00300', 'KUNMING MACHINE', 'Announcements and Notices - [Overseas Regulatory Announcement - Other]Announcement on Receiving An Enquiry Letter on \r\nRelated Supplemental Announcement from Shanghai Stock Exchange\xa0(394KB, PDF)'] 
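
In the output above the date and time come back fused into one string (e.g. '06/12/201608:41'), since `.text` simply concatenates all the text inside the cell. If you want to keep this version of the code, here is a minimal sketch of splitting that string with the standard library. It assumes the date is day-first and the time is 24-hour; the helper name `split_timestamp` is made up for illustration.

```python
from datetime import datetime


def split_timestamp(raw):
    """Split a fused 'DD/MM/YYYYHH:MM' string into (date, time) strings."""
    # Assumption: day-first date followed directly by a 24-hour time.
    parsed = datetime.strptime(raw.strip(), '%d/%m/%Y%H:%M')
    return parsed.strftime('%d/%m/%Y'), parsed.strftime('%H:%M')


print(split_timestamp('06/12/201608:41'))  # ('06/12/2016', '08:41')
```

`strptime` raises `ValueError` if a cell does not match the pattern, so you may want to wrap the call in a `try`/`except` for header rows or empty cells.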

Update:

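import requests
from bs4 import BeautifulSoup

url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser needs the lxml package installed

# class_ takes plain class names, not CSS selectors, so 'row1' has no leading dot
rows = soup.find_all(class_=['row0', 'row1'])
for row in rows:
    # join the row's text fragments with tabs, then split into at most 6 fields
    data = row.get_text(separator='\t').split('\t', 5)
    print(data)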

Output:

['07/12/2016', '17:42', '00207', 'JOY CITY PPT', 'Announcements and Notices - [List of Directors and their Role and Function]', 'List of Directors and their Roles and Functions\t\xa0(62KB, PDF)'] 
['07/12/2016', '17:40', '02880', 'DALIAN PORT', 'Announcements and Notices - [Overseas Regulatory Announcement - Corporate Governance Related Matters]', 'An announcement has just been published by the issuer in the Chinese section of this website, a corresponding version of which may or may not be published in this section\t\xa0(1KB, HTM)'] 
['07/12/2016', '17:38', '00193', 'CAPITAL ESTATE', 'Announcements and Notices - [Results of AGM]', 'POLL RESULTS OF THE ANNUAL GENERAL\r\nMEETING HELD ON 7 DECEMBER, 2016\t\xa0(95KB, PDF)'] 
['07/12/2016', '17:35', '00207', 'JOY CITY PPT', 'Announcements and Notices - [Dividend or Distribution/Closure of Books or Change of Book Closure Period]', 'SPECIAL DIVIDEND AND CLOSURE OF REGISTER OF MEMBERS\t\xa0(133KB, PDF)'] 
['07/12/2016', '17:29', '00052', 'FAIRWOOD HOLD', 'Next Day Disclosure Returns - [Share Buyback]', 'Next Day Disclosure Return\t\xa0(125KB, PDF)'] 
['07/12/2016', '17:21', '00756', 'TIANYI SUMMI', 'Announcements and Notices - [Other - Miscellaneous]', 'VOLUNTARY ANNOUNCEMENT - INCREASE IN SHAREHOLDING OF A CONTROLLING SHAREHOLDER\t\xa0(120KB, PDF)'] 
['07/12/2016', '17:16', '00702', 'SINO OIL & GAS', 'Next Day Disclosure Returns - [Share Buyback]', 'NEXT DAY DISCLOSURE RETURN\t\xa0(294KB, PDF)'] 
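
As for the per-character cells in the question: `csv.writer.writerow` expects a sequence of fields, and a string is itself a sequence of characters, so passing a string writes one character per cell. Passing a list like the ones printed above gives one cell per field. A small sketch of the difference, written for Python 3 and using an in-memory buffer instead of a real file; the helper names and the shortened sample rows are only for illustration.

```python
import csv
import io

# Shortened sample rows in the same shape as the lists printed above.
rows = [
    ['07/12/2016', '17:42', '00207', 'JOY CITY PPT'],
    ['07/12/2016', '17:40', '02880', 'DALIAN PORT'],
]


def to_csv_text(rows):
    """Write each list as one CSV row and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)  # a list: one cell per element
    return buffer.getvalue()


def string_as_row(text):
    """Show what happens when a plain string is passed to writerow."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(text)  # a string: one cell per character
    return buffer.getvalue()


print(to_csv_text(rows))
print(string_as_row('abc'))
```

To write to disk instead, the same loop should work with a file opened as `open('news.csv', 'w', newline='', encoding='utf-8')`.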

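Thanks for the reply. I was wondering whether there is a way to separate the date and time values, and also to add the link associated with each headline? – kimpster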


I'm not sure what you mean by the headline, but I updated the code to separate the date and time values. –
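
On the link part of the comment above: if the headline in each row is wrapped in an `<a>` tag, one way to get its link would be to read the tag's `href` and resolve it against the page URL with `urljoin`. This is only a sketch: the sample markup and the `/example/...` paths below are invented so the code runs offline, and the real page's structure may differ. The function name `extract_links` is also hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'

# Invented sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = '''
<table>
  <tr class="row0">
    <td>07/12/2016<br>17:42</td><td>00207</td><td>JOY CITY PPT</td>
    <td><a href="/example/doc1.pdf">List of Directors</a></td>
  </tr>
  <tr class="row1">
    <td>07/12/2016<br>17:40</td><td>02880</td><td>DALIAN PORT</td>
    <td><a href="doc2.htm">Overseas Regulatory Announcement</a></td>
  </tr>
</table>
'''


def extract_links(html, base_url):
    """Return (headline text, absolute URL) for each row0/row1 table row."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for row in soup.find_all('tr', class_=['row0', 'row1']):
        anchor = row.find('a', href=True)  # first link in the row, if any
        if anchor is None:
            continue  # skip rows without a link
        results.append((anchor.get_text(strip=True),
                        urljoin(base_url, anchor['href'])))
    return results


for title, link in extract_links(SAMPLE_HTML, BASE_URL):
    print(title, link)
```

For the live page you would pass `r.text` from the `requests.get(url)` call in place of `SAMPLE_HTML`.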