2017-04-08

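Extract data from files to populate a database using a bash script

I have a dataset containing many files. Each file contains many reviews, separated from one another by a blank line: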

<Author>bigBob 
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES 
<Date>Jan 2, 2009 
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/> 
<No. Reader>-1 
<No. Helpful>-1 
<Overall>4 
<Value>4 
<Rooms>4 
<Location>4 
<Cleanliness>5 
<Check in/front desk>4 
<Service>3 
<Business service>4 

<Author>rickMN... next review goes on 

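For each review, I need to extract the data after each tag and put it into something like this (I plan to write a .sql file so that when I run ".read" it populates my database):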

INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...) 

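My question is: how do I extract the data after each tag and put it into an INSERT statement using bash?

Edit: the text after the <Content> tag is usually a paragraph spanning multiple lines.

Answers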

1

This is the right approach for what you want to do:

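$ cat tst.awk
# Non-empty line: collect one name/value pair per tag.
NF {
    if (match($0,/^<img\s+src="([^"]+)/,a)) {
        # The image line is not in <Tag>value form; store its URL as "Image".
        name="Image"
        value=a[1]
    }
    else if (match($0,/^<([^>"]+)>(.*)/,a)) {
        name=a[1]
        value=a[2]
        # "Check in/front desk" -> "Check in", "No. Reader" -> "No Reader"
        sub(/ ?\/.*|\./,"",name)
        gsub(/ /,"_",name)
    }
    else {
        # Untagged line: continuation of the previous value (multi-line <Content>).
        values[numNames] = values[numNames] " " $0
        next
    }

    names[++numNames] = name
    values[numNames] = value
    next
}

# Empty line: the current review is complete.
{ prt() }
END { prt() }

function prt(    nameNr) {
    if (numNames == 0) return    # nothing collected, e.g. consecutive blank lines

    printf "INSERT INTO [HotelReviews] ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }

    printf ") VALUES ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }

    print ""

    numNames = 0
    delete names
    delete values
}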

$ awk -f tst.awk file 
INSERT INTO [HotelReviews] ([Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ('bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4' 
INSERT INTO [HotelReviews] ([Author]) VALUES ('rickMN... next review goes on' 

The above uses GNU awk for the third argument to match(). Massage it to get the precise format/output you want.
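As one possible way to do that massaging, here is a minimal sketch in plain bash (not the awk script above) of how the pieces could be joined with commas, closed with a parenthesis, and made safe for values that contain a single quote. The helper names sql_quote, join_commas and build_insert are hypothetical, and the sample values are made up for illustration.

```shell
#!/bin/bash
# Sketch only: sql_quote, join_commas and build_insert are hypothetical helpers.

# Wrap a value in single quotes, doubling embedded single quotes (SQL escaping).
sql_quote() {
    local q="'"
    printf "'%s'" "${1//$q/$q$q}"
}

# Join all arguments with ", ".
join_commas() {
    local out=$1 item
    shift
    for item in "$@"; do
        out+=", $item"
    done
    printf '%s' "$out"
}

# Build one statement from the parallel global arrays "names" and "values".
build_insert() {
    local cols=() vals=() i
    for i in "${!names[@]}"; do
        cols+=("[${names[i]}]")
        vals+=("$(sql_quote "${values[i]}")")
    done
    printf 'INSERT INTO [HotelReviews] (%s) VALUES (%s);\n' \
        "$(join_commas "${cols[@]}")" "$(join_commas "${vals[@]}")"
}

# Made-up sample data; the apostrophe shows why quoting matters.
names=(Author Content Overall)
values=("bigBob" "it's above average" "4")
build_insert
```

Doubling the quote character is the standard SQL way to embed a single quote in a string literal, so review text containing apostrophes does not end the literal early.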

+0

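Thank you very much! However, there is a bug in the code. I think it is related to one line and causes a duplicate column ([Date] x 2). I would really appreciate it if you could help me fix it (I am not familiar with those regular expressions). Thanks again!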

+0

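I plan to use the approach you proposed, since it is almost exactly what I want to do, but I don't know how to fix it. Please help me solve the problem, and thanks again.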

+0

OK, I tweaked the img regex to fix that and tidied up the code a bit.

1

Example:

#!/bin/bash

Author="" Content=""
while IFS= read -r line; do
    [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
    [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"

    # capture lines not starting with < and append them to Content,
    # separated by a space so words on adjacent lines do not run together
    [[ $line =~ ^[^\<] ]] && Content+=" $line"

    # an empty line ends a review: print it, then reset for the next one
    if [[ $line =~ ^$ ]]; then
        echo "${Author}, ${Content}"
        Author="" Content=""
    fi
done < file

Output with your file:

 
bigBob, definitely above average! we had a really nice stay there last year when I and ... 
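The same idea can be extended to cover two cases the question raises: Content paragraphs that span several lines, and a last review that is not followed by a blank line. The following is only a sketch under those assumptions. The function names print_reviews and flush_review and the inline sample data are hypothetical, and the regexes are kept in variables so that < and > need no escaping.

```shell
#!/bin/bash
# Sketch only: print_reviews and flush_review are hypothetical names.

author_re='^<Author>(.*)'
content_re='^<Content>(.*)'

# Print the collected review, if any, then reset for the next one.
flush_review() {
    if [[ -n $Author || -n $Content ]]; then
        echo "${Author}, ${Content}"
    fi
    Author=""
    Content=""
}

# Read reviews from stdin and print "Author, Content" for each one.
print_reviews() {
    local line in_content=0
    Author=""
    Content=""
    # "|| [[ -n $line ]]" also processes a final line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $author_re ]]; then
            Author="${BASH_REMATCH[1]}"
            in_content=0
        elif [[ $line =~ $content_re ]]; then
            Content="${BASH_REMATCH[1]}"
            in_content=1
        elif [[ -z $line ]]; then
            flush_review            # a blank line ends the review
            in_content=0
        elif [[ $line == \<* ]]; then
            in_content=0            # any other tag ends the Content paragraph
        elif (( in_content )); then
            Content+=" $line"       # continuation line of a multi-line Content
        fi
    done
    flush_review                    # the last review may have no trailing blank line
}

# Made-up sample input for illustration.
print_reviews <<'EOF'
<Author>bigBob
<Content>definitely above average!
we had a really nice stay
<Date>Jan 2, 2009

<Author>rickMN
<Content>fine
EOF
```

To run it against a real file, replace the here-document with a redirection such as `print_reviews < file`.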

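=~: matches a regular expression (string on the left, unquoted regex on the right)

^: matches the start of the line

\< and \>: match a literal < and >

.*: matches the rest of the line

(.*): captures the rest of the line into BASH_REMATCH[1] (index 0 of the array holds the whole match)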
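To see these pieces working together, here is a small sketch that splits any tagged line into its tag name and value using two capture groups. The helper name split_tag is hypothetical. Storing the regex in a variable is a common bash idiom that avoids backslash-escaping < and > inside [[ ]].

```shell
#!/bin/bash
# Group 1 captures the tag name, group 2 the rest of the line.
re='^<([^>]+)>(.*)'

# Hypothetical helper: prints "tag=value" for a tagged line, fails otherwise.
split_tag() {
    if [[ $1 =~ $re ]]; then
        # BASH_REMATCH[0] is the whole match; [1] and [2] are the groups.
        printf '%s=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

split_tag '<Overall>4'
split_tag '<No. Helpful>-1'
split_tag 'plain text' || echo "no tag"
```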

See: The Stack Overflow Regular Expressions FAQ