2016-10-22 70 views
0

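Using a regular expression in Big Query to extract URLs

I have been trying to extract any URLs present in my "Text" column in Big Query. The column contains a mixture of text and URLs scattered throughout (one cell may contain several URLs). I am trying to use this regular expression: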

SELECT 

    REGEXP_EXTRACT (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*') 
FROM 
Data.Text_Files 

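I currently get "Failed to parse regular expression" when I try to run the query. I have tried modifying it, but to no avail.

The regular expression works in an online builder, but I am not sure how to incorporate it into Big Query.

Any help would be much appreciated - or at least a pointer on how to incorporate regular expressions into Big Query!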

Answers

4

Try the following - it is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL).

WITH YourTable AS (
    SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL 
    SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL 
    SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
) 
SELECT 
id, 
REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL 
FROM YourTable 

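This gives you output with the id field and a repeated field holding all of the corresponding URLs.

If you need flattened results, you can use the variation below: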

WITH YourTable AS (
    SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL 
    SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL 
    SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
) 
SELECT 
    id, URL  
FROM (
    SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL 
    FROM YourTable 
), UNNEST(URL) as URL 
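As a rough illustration of the difference between the two result shapes, here is a minimal Python sketch. It is not BigQuery code: the `rows` data and the simplified `URL_PATTERN` are hypothetical stand-ins, and Python's `re` module is used only to mimic what `REGEXP_EXTRACT_ALL` followed by `UNNEST` produces.

```python
import re

# Simplified, hypothetical URL pattern used only for this illustration;
# it is NOT the regular expression from the answer above.
URL_PATTERN = re.compile(r"https?://[^\s)]+")

# Hypothetical sample rows of (id, Text).
rows = [
    (1, "Read [How to Ask](http://stackoverflow.com/help/how-to-ask) first."),
    (2, "See http://example.com/a and http://example.com/b for details"),
    (3, "No links here."),
]

# Nested shape: one row per id, with a list of URLs
# (comparable to the repeated field returned by REGEXP_EXTRACT_ALL).
nested = [(row_id, URL_PATTERN.findall(text)) for row_id, text in rows]

# Flattened shape: one row per (id, URL) pair
# (comparable to joining the table with UNNEST(URL)).
flattened = [(row_id, url) for row_id, urls in nested for url in urls]

for row in flattened:
    print(row)
```

Note that id 3 has no URLs and therefore disappears from the flattened list. The comma join with `UNNEST` in the query above should behave the same way for rows whose array is empty.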

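Note: you can use any regular expression you find online here, with one requirement: only one capturing group is allowed! So all inner groups should be made non-capturing with ?:, as you can see in the example above. The only group you want to see in the output should be left as is, without ?: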
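The effect of `?:` can be checked outside BigQuery. The sketch below uses Python's `re` module, which is an assumption made for convenience: it is not the RE2 engine BigQuery uses, and where BigQuery rejects a pattern with more than one capturing group, Python's `findall` returns tuples instead. The two patterns and the helper `count_capturing_groups` are hypothetical and deliberately simpler than the ones in this thread.

```python
import re

def count_capturing_groups(pattern: str) -> int:
    """Return how many capturing groups the compiled pattern contains."""
    return re.compile(pattern).groups

# Hypothetical, simplified patterns: same matching logic, different grouping.
with_inner_groups = r"(http(s)?://)?(www\.)?[a-z0-9.-]+\.[a-z]{2,6}"
non_capturing = r"(?:http(?:s)?://)?(?:www\.)?[a-z0-9.-]+\.[a-z]{2,6}"

text = "docs at https://www.example.com and http://example.org"

print(count_capturing_groups(with_inner_groups))  # three capturing groups
print(count_capturing_groups(non_capturing))      # no capturing groups

# With capturing groups, findall returns the group contents, not the URLs.
print(re.findall(with_inner_groups, text))

# With only non-capturing groups, findall returns the full matches.
print(re.findall(non_capturing, text))
```

Counting groups this way before pasting a pattern into a query is a cheap check that every inner group has been converted.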

+0

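Hi Mikhail - thank you very much for your input! I have been playing with this, and while your query runs fine, I am still trying to understand how to modify it so that it queries an existing column in my table. Is the WITH YourTable AS ( SELECT 1 AS id ... part only there to illustrate your example? That is, in my own query, do I just start from SELECT id, REGEXP_EXTRACT ...?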

+0

That is correct. Just start with SELECT id, REG ... and make sure the field names you use actually exist in your table.

+0

It works! That is great - I had been struggling with this for a while! Thank you very much.

2

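Your regular expression has an unclosed capturing group and 2 unescaped characters. I don't know which online regex builder you are using, but maybe you forgot to put the new regex into it?

The problems are as follows: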

(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]* 
POINTERS TO PROBLEMS ON THIS LINE --->                             ^1                    ^^2
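  1. This is the start of a capturing group that is never closed. You probably want a ) before the *.
  2. All slashes need to be escaped. This should be \/ or even \/\/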

Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1

Good luck fixing it!
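The two fixes can also be checked locally. The following is a minimal sketch that assumes Python's `re` module as a stand-in; it is not the engine BigQuery uses, so it only shows whether each pattern parses, not how BigQuery will treat it. `BROKEN` is the pattern from the question, `FIXED` applies both suggestions, and the helper `parses` is hypothetical.

```python
import re

# The pattern from the question: the last group is opened but never closed.
BROKEN = (
    r"(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}"
    r"\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*"
)

# The same pattern with ')' added before '*' and the slashes escaped.
FIXED = (
    r"(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}"
    r"\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&\/\/=])*"
)

def parses(pattern: str) -> bool:
    """Return True if the regex engine can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(parses(BROKEN))  # the unclosed group makes compilation fail
print(parses(FIXED))   # compiles once the group is closed

sample = "see http://stackoverflow.com/help/mcve now"
match = re.search(FIXED, sample)
print(match.group(0) if match else None)
```

In this sketch only the missing parenthesis triggers the error; Python accepts the slashes with or without the backslash.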

+0

Thanks for the pointers, Addison, but I still get "Failed to parse expression" when I run it through Big Query. I can see it working in the regex builder, but I think I am missing something when I run the query: `SELECT REGEXP_EXTRACT(Text, r'http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&\/\/=])*') FROM Data.Content`

+0

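You missed the bracket at the start of the string (it is there in the original). Please make sure you copy and paste the code carefully, as most browsers don't make that easy. – Addison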

+0

Alternatively, if you want to use a regex that I just modified from [another answer](http://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string), try: `https:\/\/[\w_-]+(?:\.[\w_-]+)+([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?` – Addison