2014-10-29

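I want to use the SPLIT function to do word analysis on various text entries, in this case git commit comments. Words are usually separated by spaces, but I also want the delimiter list to include commas, semicolons, colons, periods, question marks, exclamation marks, tabs, and newlines. Essentially, I want to specify the delimiters with a REGEX pattern, so that a match on any of them is treated as a delimiter. Can SPLIT be used with multiple delimiters?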

For example:

SELECT 
    split(commit_message, " ") as words, 
FROM [project:dataset.table] 
LIMIT 1000 

If the input data looks like this:

"Commit message XYX: Hello. This is a test. This is a fun test! First, we'll run a test, then we'll check the results. A test is currently running." 

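If we then do a GROUP BY, I would expect the word "test" to have a COUNT of 4, but with the query above "test" is counted only once, because the other three occurrences keep their trailing punctuation ("test.", "test!", "test,"). It would be nice if the delimiter argument accepted a REGEXP like the one below, but I believe this is either unavailable or the syntax is undocumented.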

SELECT 
    split(commit_message, "[\W]+") as words, 
FROM [project:dataset.table] 
LIMIT 1000 

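In the example above, any run of one or more non-word characters would be treated as a single delimiter. If this feature does not exist, could it be considered as a future enhancement? For now, I have to take the results in the "words" column and strip out all non-word characters to get what I want (see below).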

SELECT 
    LOWER(REGEXP_EXTRACT(words, r'(\w+)')) as words 
FROM 
    (
    SELECT 
     split(commit_message, " ") as words, 
    FROM [project:dataset.table] 
    ) 
LIMIT 1000 

I would appreciate any suggestions for avoiding this extra step of stripping the non-word characters.
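The counting problem described above can be reproduced outside BigQuery. The following is only a Python sketch of the two behaviours (splitting on a single space versus splitting on runs of non-word characters); it is not BigQuery syntax, and the variable names are made up for illustration.

```python
import re
from collections import Counter

message = ("Commit message XYX: Hello. This is a test. This is a fun test! "
           "First, we'll run a test, then we'll check the results. "
           "A test is currently running.")

# Splitting on a single space leaves punctuation attached ("test.", "test!", "test,").
space_tokens = message.split(" ")
space_counts = Counter(space_tokens)

# Splitting on runs of non-word characters treats all punctuation as delimiters.
# The trailing "." produces an empty final token, so empty strings are dropped.
regex_tokens = [t for t in re.split(r"\W+", message) if t]
regex_counts = Counter(regex_tokens)

print(space_counts["test"])
print(regex_counts["test"])
```

Note that a pattern like `\W+` also splits on the apostrophe, so "we'll" becomes "we" and "ll".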


See the answer below, and please share your results! – 2014-10-29 22:30:23

Answers


The SPLIT function accepts only a constant string as the delimiter. There is no hidden syntax for regular-expression delimiters.

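As an alternative, you can use REGEXP_REPLACE to replace all the delimiters you want with a space (or any other single delimiter), and then split on that, like this: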

SPLIT(REGEXP_REPLACE(message, ",|;|:|\\.|\\?|!|\t|\n", " "), " ") 
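To show what the replace-then-split idea does to a string, here is a small Python sketch using the same alternation of delimiters. It illustrates the logic only and says nothing about how BigQuery itself behaves; `split_words` and `DELIMITERS` are hypothetical names. In Python, two adjacent delimiters (such as a period followed by a space) leave empty strings after the split, so the sketch filters them out.

```python
import re

# Same alternation as the pattern in the answer, written as a Python raw string.
DELIMITERS = r",|;|:|\.|\?|!|\t|\n"


def split_words(message):
    """Replace every listed delimiter with a space, then split on spaces."""
    normalized = re.sub(DELIMITERS, " ", message)
    # Adjacent delimiters (for example ". ") leave empty strings; drop them.
    return [w for w in normalized.split(" ") if w]


words = split_words(
    "This is a test. This is a fun test! First, we'll run a test,\tthen check."
)
print(words.count("test"))
```

Because the apostrophe is not in the delimiter list, "we'll" stays a single word here, unlike with a `\W+` pattern.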

Update: see the full post at http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/


What @sprocket said: use REGEXP_REPLACE first, then SPLIT().

See http://www.reddit.com/r/bigquery/comments/2ep8np/mining_the_top_news_words_for_each_day_with_gdelt for a similar analysis.

A working query that finds what Python developers say that JavaScript developers don't:

SELECT word, c 
FROM (
    SELECT word, COUNT(*) c 
    FROM (
        SELECT SPLIT(msg, ' ') word 
        FROM (
            SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg 
            FROM [githubarchive:github.timeline] 
            WHERE 
                repository_language == 'Python' 
                AND payload_commit_msg != '' 
            GROUP EACH BY msg 
        ) 
    ) 
    GROUP BY word 
    ORDER BY c DESC 
    LIMIT 500 
) 
WHERE word NOT IN (
    SELECT x FROM (
        SELECT word x, COUNT(*) c 
        FROM (
            SELECT SPLIT(msg, ' ') word 
            FROM (
                SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg 
                FROM [githubarchive:github.timeline] 
                WHERE 
                    repository_language == 'JavaScript' 
                    AND payload_commit_msg != '' 
                GROUP EACH BY msg 
            ) 
        ) 
        GROUP BY x 
        ORDER BY c DESC 
        LIMIT 1000
    ) 
); 
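The shape of the query (take the top 500 words of one language, then exclude anything in the top 1000 of the other) can be sketched in Python on a handful of made-up commit messages. This is a rough illustration of the logic rather than a translation of the query: the function names and sample messages are hypothetical, and dropping empty strings after the split is an assumption of the sketch.

```python
import re
from collections import Counter


def top_words(messages, limit):
    """Return the `limit` most frequent words across distinct normalized messages."""
    # Lowercase, replace every non-letter with a space, and deduplicate the
    # normalized messages (the role GROUP EACH BY msg plays in the query).
    normalized = {re.sub(r"[^a-z]", " ", m.lower()) for m in messages if m != ""}
    counts = Counter()
    for msg in normalized:
        # Assumption of this sketch: empty strings from repeated spaces are dropped.
        counts.update(w for w in msg.split(" ") if w)
    return [word for word, _ in counts.most_common(limit)]


def distinctive_words(messages_a, messages_b, limit_a=500, limit_b=1000):
    """Top words of A that do not appear among the top words of B."""
    common_b = set(top_words(messages_b, limit_b))
    return [w for w in top_words(messages_a, limit_a) if w not in common_b]


# Made-up sample data; digits are stripped, so "pep8" is counted as "pep".
python_msgs = ["Fix pep8 warnings", "Add pep8 check to tests", "Fix typo"]
js_msgs = ["Fix typo", "Add jshint to tests", "Fix lint warnings"]
result = distinctive_words(python_msgs, js_msgs)
print(sorted(result))
```

Using a larger limit for the excluded side (1000 versus 500) means a word is filtered out even if it is only moderately common in the other language.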

