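Problem: reliably scraping a stock price table
My goal is to automatically scrape the currency price tables from this website (stock prices). Because the broker does not provide an API, I have to find a workaround.
To avoid reinventing the wheel and wasting time and money, I looked for an existing application that does this, but unfortunately I did not find one that works with this website.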
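I have tried:
R with rvest
I chose R because it is simple and straightforward to use. Here is the code, which is essentially an example copied and pasted from a textbook: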
library("rvest")
url <- "https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0"
population <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mCSB_3_container"]/table') %>%
html_table()
population
population <- population[[1]]
head(population)
This returns an empty table.
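I also tried the following tools; the code and the result for each are given further down:
JavaScript with CasperJS
JavaScript with PhantomJS
Python with BeautifulSoup
pandas with its read_html()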
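- Can you explain why I get empty tables when I try different web scraping and HTML parsing tools?
- What is the most reliable way to scrape this particular stock price website?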
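CasperJS has been the best option so far: I was actually able to extract data, but it is very slow and eventually crashes with a "memory exhausted" error: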
var casper = require('casper').create({
    logLevel: 'debug',
    verbose: true,
    loadImages: false,
    loadPlugins: false,
    webSecurityEnabled: false,
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11"
});
var url = 'https://eu.iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=60&date=2016-12-19-21-0';
var length;
var fs = require('fs');
var sep = ';';
//var count = 0;
casper.start(url);

// Build a timestamp for the output file name
var now = new Date();
var dd = now.getDate();
var mm = now.getMonth() + 1; // January is 0!
var hh = now.getHours();
var fff = now.getMilliseconds();
var MM = now.getMinutes();
var yyyy = now.getFullYear();
if (dd < 10) {
    dd = '0' + dd;
}
if (mm < 10) {
    mm = '0' + mm;
}
var today = yyyy + '_' + mm + '_' + dd + '_' + hh + '_' + MM + '_' + fff;
casper.echo(today);

// Returns the trimmed text of one table cell, read inside the page context
function getCellContent(row, cell) {
    return casper.evaluate(function(row, cell) {
        return document.querySelectorAll('table tbody tr')[row].childNodes[cell].innerText.trim();
    }, row, cell);
}

// Stores the number of table rows in `length`
function moveNext() {
    // Return only the count: DOM nodes do not serialize reliably out of casper.evaluate
    length = casper.evaluate(function() {
        return document.querySelectorAll('table tbody tr').length;
    });
    this.echo("table length: " + length);
}

// get 3 tables
for (var mins = 0; mins < 3; mins++) {
    url = 'https://eu.iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=60&date=2016-12-19-21-' + mins;
    casper.echo(url);
    casper.thenOpen(url);
    casper.then(function() {
        this.waitForSelector('#mCSB_3_container table tbody tr');
    });
    casper.then(moveNext);
    casper.then(function() {
        for (var i = 0; i < length; i++) {
            //this.echo("Date: " + getCellContent(i, 0));
            //this.echo("Bid: " + getCellContent(i, 1));
            //this.echo("Ask: " + getCellContent(i, 2));
            //this.echo("Quotes: " + getCellContent(i, 4));
            fs.write('prices_' + today + '.csv', getCellContent(i, 0) + sep + getCellContent(i, 1) + sep + getCellContent(i, 2) + sep + getCellContent(i, 4) + "\n", "a");
        }
    });
}

// casper.run() returns immediately, so report completion from its callback
casper.run(function() {
    this.echo("finished with processing");
    this.exit();
});
With PhantomJS I only get a single table:
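var webPage = require('webpage');
var page = webPage.create();
page.open('https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0', function(status) {
    // Return plain strings: DOM nodes cannot cross the page.evaluate boundary
    var rows = page.evaluate(function() {
        return Array.prototype.map.call(document.querySelectorAll('table tbody tr'), function(tr) {
            return tr.innerText;
        });
    });
    console.log(rows.join('\n'));
    phantom.exit();
});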
BeautifulSoup returns an empty table:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; originally urllib2 (Python 2)
url = "https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0"
soup = BeautifulSoup(urlopen(url), "lxml")
table = soup.findAll('table', attrs={ "class" : "quotes-table-result"})
print("table length is: "+ str(len(table)))
I also tried the Scrapy shell, but got an empty table.
With pandas I get the following error:
ValueError: No tables found matching pattern '.+'
The code:
import pandas as pd
import html5lib
f_states = pd.read_html("https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0")
print(f_states)
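One possible reason the static parsers above (rvest, BeautifulSoup, pandas) all come back empty, offered as an assumption and not verified against the site: the table rows may be inserted by JavaScript after the page loads, so the HTML that a plain HTTP client receives contains no rows yet. The sketch below uses made-up markup and made-up values to show the difference between what such a client would see and what a browser would see after running the script.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    """Returns how many table rows a static parser can see in `html`."""
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


# Hypothetical markup as a plain HTTP client might receive it: the table body
# is empty and a script is expected to fill it in later.
served_html = """
<div id="quotes"><table class="quotes-table-result"><tbody></tbody></table></div>
<script>/* rows are added here by the browser */</script>
"""

# Hypothetical markup after a browser has run the script (made-up values).
rendered_html = """
<div id="quotes"><table class="quotes-table-result"><tbody>
<tr><td>19:00:00</td><td>1.0401</td><td>1.0403</td></tr>
<tr><td>19:00:01</td><td>1.0402</td><td>1.0404</td></tr>
</tbody></table></div>
"""

print(count_rows(served_html))    # rows visible without running JavaScript
print(count_rows(rendered_html))  # rows visible after rendering
```

If this assumption holds, it would also explain why CasperJS and PhantomJS, which execute JavaScript, are the only tools that return rows.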
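A note on the question: the website may be trying to block web scraping. I studied robots.txt, but it appears to contain only instructions specific to browser support and to Googlebot.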
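The robots.txt check can also be done programmatically with Python's standard urllib.robotparser. This is a minimal sketch: the robots.txt content, the example.com host, and the "my-scraper" user agent are hypothetical and are not taken from the real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT copied from the real site.
robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /en/historical-financial-quotes
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URL with the same path shape as the quotes page.
page = "https://example.com/en/historical-financial-quotes?active_id=1"

print(parser.can_fetch("Googlebot", page))   # rules in the Googlebot section apply
print(parser.can_fetch("my-scraper", page))  # rules in the "*" section apply
```

To check the real file instead, set_url() followed by read() would fetch it over the network. Note that robots.txt only states what crawlers are asked to do; it does not by itself explain an empty table.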
Try Python's 'selenium'. Reference: http://selenium-python.readthedocs.io/installation.html – Prabhakar
Try Scrapy + Splash in Python. @Prabhakar Selenium is good, but too slow. – parik
Also, Python + pandas' 'read_html' is good. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html –
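The Selenium suggestion from the comments could look roughly like the sketch below. It is untested against the site and rests on several assumptions: Selenium 4 with a working Firefox driver is installed, the rows match the "table tbody tr" selector used in the CasperJS script, and the URL pattern is the one from that script. The helper names build_url, fetch_rows, and main are made up for illustration.

```python
def build_url(minute, base="https://eu.iqoption.com/en/historical-financial-quotes"):
    """Builds the per-minute URL in the same form as the CasperJS loop."""
    return "{}?active_id=1&tz_offset=60&date=2016-12-19-21-{}".format(base, minute)


def fetch_rows(url, timeout=20):
    """Loads the page in a real browser and returns each row as a list of cell texts."""
    # Imported here so that build_url stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one row.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
        )
        rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
        return [[td.text for td in row.find_elements(By.TAG_NAME, "td")] for row in rows]
    finally:
        # Always close the browser so repeated runs do not accumulate processes.
        driver.quit()


def main():
    """Prints the rows of the first three per-minute tables, semicolon-separated."""
    for minute in range(3):
        for cells in fetch_rows(build_url(minute)):
            print(";".join(cells))
```

Calling main() starts the browser. As parik's comment notes, driving a full browser is slow, so this trades speed for seeing the rendered table.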