Pandas can read a CSV directly from an HTTP link:

Example:
import pandas as pd

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)
Result:
Unnamed: 0 decrease rowpos colpos treatment
0 1 57 1 1 D
1 2 95 2 1 E
.. ... ... ... ... ...
62 63 3 7 8 A
63 64 19 8 8 C
[64 rows x 5 columns]
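Since the Rdatasets CSV URLs follow a predictable pattern, the link can also be built from the dataset name instead of hard-coded. This is a sketch; `dataset_url` is a hypothetical helper, not part of pandas, and it assumes the site keeps its current `csv/<package>/<name>.csv` layout:

```python
base_url = 'https://vincentarelbundock.github.io/Rdatasets/'

def dataset_url(name, package='datasets'):
    # Build the CSV URL for a dataset, following the Rdatasets site layout.
    return '%scsv/%s/%s.csv' % (base_url, package, name)

url = dataset_url('OrchardSprays')
print(url)
# df = pd.read_csv(url)  # fetches the CSV over HTTP, as above
```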
Getting the links by scraping:

To get the links themselves from the index page, we can also use pandas to scrape the web page. For example:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
This returns the data from the table on the page. Unfortunately, it does not work for our purposes here, because pandas scrapes the text on the page, not the links.
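If patching pandas internals is undesirable, the links can instead be pulled out directly with BeautifulSoup. A minimal sketch, run here on a toy HTML fragment that mimics one row of the Rdatasets index table (the real page would be fetched first, e.g. with `urllib.request`):

```python
from bs4 import BeautifulSoup

# Toy HTML fragment mimicking the structure of the Rdatasets index table.
html = """
<table class="dataframe">
  <tr><th>Item</th><th>csv</th></tr>
  <tr><td>AirPassengers</td>
      <td><a href="csv/datasets/AirPassengers.csv">CSV</a></td></tr>
</table>
"""

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
soup = BeautifulSoup(html, 'html.parser')
# Map each dataset name to the absolute URL of its CSV link.
links = {row.find_all('td')[0].text: base_url + row.find('a')['href']
         for row in soup.find_all('tr') if row.find('a')}
print(links['AirPassengers'])
```

This trades the monkey patch for an explicit second pass over the HTML.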
Monkey-patching the scraper to get the links:

To get the URLs, we can monkey-patch the library like this:
def _text_getter(self, obj):
    # Note: relies on the global base_url defined in the test code below.
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp
bsp._text_getter = _text_getter
Test code:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))
Result:
AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2 : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
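With the links in the DataFrame, a dataset can then be looked up by name and fed straight back to `pd.read_csv`. A sketch using a toy stand-in for the scraped table (the column names `Item` and `csv` match the output above):

```python
import pandas as pd

# Toy stand-in for the scraped table: dataset names mapped to CSV URLs.
df = pd.DataFrame({
    'Item': ['AirPassengers', 'BOD'],
    'csv': ['https://vincentarelbundock.github.io/Rdatasets/'
            'csv/datasets/AirPassengers.csv',
            'https://vincentarelbundock.github.io/Rdatasets/'
            'csv/datasets/BOD.csv'],
})

# Look up the CSV URL for one dataset by name...
url = df.loc[df.Item == 'BOD', 'csv'].iloc[0]
print(url)
# ...and load it directly (network access required):
# bod = pd.read_csv(url)
```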
It seems today is a bad day for me. Going by the rules for questions, and doubting myself, I took the time to read the question again... The question describes the problem, but is missing this part: what have you already done to try to solve the problem you describe? – Claudio
@Claudio, can you show me where it says that you must show what you have tried? –
Do you want to read all the datasets on this page? – Hackaholic