這是一個使用一些tidyverse
軟件包和您的數據集的解決方案。
library(rvest) # for scrapping the data
#> Le chargement a nécessité le package : xml2
library(dplyr, warn.conflicts = F)
library(tidyr)
library(purrr, warn.conflict = F)
最前一頁,我們直接從你的例子網址,以獲取你的數據。
tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
html_node("table.snapshot-table2") %>%
html_table(header = F) %>%
as_data_frame()
tab
#> # A tibble: 12 x 12
#> X1 X2 X3 X4 X5 X6
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Index DJIA S&P500 P/E 20.77 EPS (ttm) 11.42
#> 2 Market Cap 141.89B Forward P/E 22.14 EPS next Y 10.71
#> 3 Income 7.12B PEG 1.13 EPS next Q 2.62
#> 4 Sales 90.90B P/S 1.56 EPS this Y 2.30%
#> 5 Book/sh -3.34 P/B - EPS next Y 7.28%
#> 6 Cash/sh 17.26 P/C 13.74 EPS next 5Y 18.36%
#> 7 Dividend 5.68 P/FCF 17.94 EPS past 5Y 7.40%
#> 8 Dividend % 2.39% Quick Ratio 0.40 Sales past 5Y 6.60%
#> 9 Employees 150500 Current Ratio 1.20 Sales Q/Q -8.10%
#> 10 Optionable Yes Debt/Eq - EPS Q/Q 885.50%
#> 11 Shortable Yes LT Debt/Eq - Earnings Jul 26 BMO
#> 12 Recom 2.20 SMA20 -0.16% SMA50 8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> # X11 <chr>, X12 <chr>
頁眉是在每一個偶數列每一個奇數列和數據,我們 創建按行結合的子集兩列整齊tibble。對於 ,我們生成奇數和偶數索引。然後, purrr::map_dfr
允許我們迭代這兩個列表,應用一個函數並將行綁定結果。該功能包括從表[ ]
中選擇2列,並用set_names
重命名這兩列。
col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)
tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#> header value
#> <chr> <chr>
#> 1 Index DJIA S&P500
#> 2 Market Cap 141.89B
#> 3 Income 7.12B
#> 4 Sales 90.90B
#> 5 Book/sh -3.34
#> 6 Cash/sh 17.26
#> 7 Dividend 5.68
#> 8 Dividend % 2.39%
#> 9 Employees 150500
#> 10 Optionable Yes
#> # ... with 62 more rows
你有一個很好的2列長的表與所有的數據。現在,如果您希望 表格採用寬格式而不是長格式,則必須進行轉置。 但首先,我們必須處理標題爲 列中的一些重複名稱。您不能有重複的列名稱。
tab2 %>%
filter(header == header[duplicated(header)])
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%
我們只是重新命名第二occurence加入_2
tab3 <- tab2 %>%
mutate(header = case_when(
duplicated(header) ~ paste(header, 2, sep = "_"),
TRUE ~ header)
)
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
可以在寬格式傳遞,並有72列,而不是72行。
tab3 %>%
spread(header, value)
#> # A tibble: 1 x 72
#> `52W High` `52W Low` `52W Range` ATR `Avg Volume` Beta `Book/sh`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -3.78% 87.78% 126.31 - 246.49 3.77 3.46M 1.18 -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> # Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> # Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> # 5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> # Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> # `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> # `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> # Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> # Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> # `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> # `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> # Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> # Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> # `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> # Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> # Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> # `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> # Price` <chr>, Volatility <chr>, Volume <chr>
想法:您還可以替換標題列中的所有空格_
以使列名不含空格。通常更容易處理。
你可以試試't(df)[1:2,]'?這將'轉置'數據幀,然後抓住'1,2行' – CPak
乾杯隊友,我正在研究這種方法!只是爲了澄清,你抓住1,2排是什麼意思? – DavimusPrime