2017-07-03 59 views

I am trying to extract selected text from the website [https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal%20asc%2C%20score%20desc%2C%20metadata_modified%20desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0] with Beautiful Soup and write it to a CSV file.

I have written this code: `

import re
import urllib.request

from bs4 import BeautifulSoup

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")
data2 = soup.find_all('h3', class_="dataset-heading")

data3 = []
getdata = []
for link in data2:
    # Collect every link whose href contains '/dataset/'
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
for data in data3:
    getdata = data.text
    print(getdata)

len(getdata)
` 

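My HTML looks like:

<a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>

When I run the code above, I get the text I want, but the word 'XLS' also comes through. I want to remove 'XLS' and write the remaining text to a CSV file in a single column. My output is: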

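  • Banks - Assets
  • XLS
  • Consolidated Exposures - Immediate and Ultimate Risk Basis
  • XLS
  • Foreign Exchange Transactions and Holdings of Official Reserve Assets
  • XLS
  • Finance Companies and General Financiers - Selected Assets and Liabilities
  • XLS
  • Liabilities and Assets - Monthly
  • XLS
  • Consolidated Exposures - Immediate Risk Basis - International Claims by Country
  • XLS
  • and so on...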

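I checked whether the output above is a list. It is reported as a list, but it has only one element, even though, as shown above, my output contains many pieces of text. Please help me solve this.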

Answers

1

If the goal is only to remove the XLS rows from the resulting column, that can be done, for example, like this:

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")

# Every link whose href contains '/dataset/' (dataset titles and 'XLS' format labels)
data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))

getdata = []
for data in data3:
    # Skip the 'XLS' format labels and keep the dataset titles
    if data.text.upper() != 'XLS':
        getdata.append(data.text)
print(getdata)
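
The important difference from the question's code is that each text is appended to the list instead of being assigned to the variable. A minimal offline sketch of that difference, assuming a hypothetical `texts` list that stands in for the scraped `data.text` values:

```python
# Hypothetical sample standing in for the link texts scraped from the page.
texts = ["Banks - Assets", "XLS", "Liabilities and Assets - Monthly", "XLS"]

# Pattern from the question: the name is rebound on every pass, so only
# the last string survives and len() counts its characters.
overwritten = []
for text in texts:
    overwritten = text

# Pattern from the answer: keep every text except the 'XLS' format labels.
collected = []
for text in texts:
    if text.upper() != "XLS":
        collected.append(text)

print(type(overwritten).__name__, len(overwritten))  # str 3
print(collected)
```

With assignment, `len(getdata)` is the length of the last string printed; with `append`, it is the number of collected titles.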

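You will get a list containing the text you need. It can then easily be converted, for example, into a DataFrame in which the data appear as a single column.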

import pandas as pd 
df = pd.DataFrame(columns=['col1'], data=getdata) 

Output:

    col1
0   Banks – Assets
1   Consolidated Exposures – Immediate and Ultimat...
2   Foreign Exchange Transactions and Holdings of ...
3   Finance Companies and General Financiers – Sel...
4   Liabilities and Assets – Monthly
5   Consolidated Exposures – Immediate Risk Basis ...
6   Consolidated Exposures – Ultimate Risk Basis
7   Banks – Consolidated Group off-balance Sheet B...
8   Liabilities of Australian-located Operations
9   Building Societies – Selected Assets and Liabi...
10  Consolidated Exposures – Immediate Risk Basis ...
11  Banks – Consolidated Group Impaired Assets
12  Assets and Liabilities of Australian-Located O...
13  Managed Funds
14  Daily Net Foreign Exchange Transactions
15  Consolidated Exposures-Immediate Risk Basis
16  Public Unit Trust
17  Securitisation Vehicles
18  Assets of Australian-located Operations
19  Banks – Consolidated Group Capital

To write it to CSV:

# Raw string, so the backslashes in the Windows path are not read as escapes
df.to_csv(r'C:\Users\Username\output.csv')
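
If pandas is not available, the standard-library `csv` module can probably do the same job. This is only a sketch: the `write_column` helper, the sample values, and the `output.csv` file name are illustrative and not part of the original answer, and `getdata` is assumed to be the list of titles built earlier.

```python
import csv


def write_column(values, path, header="col1"):
    """Write the values as a single CSV column under a header row."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        writer.writerows([value] for value in values)


# Sample stand-in for the scraped list of titles.
getdata = ["Banks - Assets", "Managed Funds"]
write_column(getdata, "output.csv")
```

Unlike `df.to_csv` with its default settings, this writes no index column, only the header and one title per row.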

@Arti, any feedback? Comments? Was this helpful to you?

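@Dmitry, this is helpful because I learned something new, but I had already converted things into a dictionary, as you can see in my answer above. I am also stuck on putting things into the CSV. Please check it and help me get past this. – Arti123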

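@Arti, the simplest way to put the data into a CSV is the method I added in the last line of my solution =) I will check your code later when I have time. Also, if my answer works for you, could you accept it by clicking the green check mark?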