Extracting data from files to populate a database using a bash script

I have a dataset made up of many files. Each file contains many reviews separated by blank lines:

<Author>bigBob 
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES 
<Date>Jan 2, 2009 
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/> 
<No. Reader>-1 
<No. Helpful>-1 
<Overall>4 
<Value>4 
<Rooms>4 
<Location>4 
<Cleanliness>5 
<Check in/front desk>4 
<Service>3 
<Business service>4 

<Author>rickMN... next review goes on

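For each review, I need to extract the data after each tag and put it into something like this (I plan to write a .sql file so that when I run ".read" it populates my database):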

INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)

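My question is: how do I extract the data after each tag and put it into an INSERT statement using bash?

Edit: the text after the <Content> tag is usually a paragraph that spans multiple lines.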

2017-04-08 Mr Wondeful

This is the right approach for what you want to do:

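$ cat tst.awk
NF {
    if (match($0,/^<img\s+src="([^"]+)/,a)) {
        # image line: store the src URL under the column name "Image"
        name="Image"
        value=a[1]
    }
    else if (match($0,/^<([^>"]+)>(.*)/,a)) {
        # <Tag>value line: derive the column name from the tag
        name=a[1]
        value=a[2]
        sub(/ ?\/.*|\./,"",name)   # drop "/front desk" or the "." in "No."
        gsub(/ /,"_",name)
    }
    else {
        # continuation of a multi-line value: append it to the previous value
        values[numNames] = values[numNames] " " $0
        next
    }

    names[++numNames] = name
    values[numNames] = value
    next
}

# a blank line ends the current review
{ prt() }
END { prt() }

function prt() {
    printf "INSERT INTO [HotelReviews] ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }

    printf ") VALUES ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }

    print ""

    # reset the state for the next review
    numNames = 0
    delete names
    delete values
}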

$ awk -f tst.awk file 
INSERT INTO [HotelReviews] ([Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ('bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4' 
INSERT INTO [HotelReviews] ([Author]) VALUES ('rickMN... next review goes on'

The above uses GNU awk for the third argument to match(). Massage it to get the precise formatting/output you want.
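As one possible way to do that massaging, here is a minimal sketch that emits complete statements (comma separators, closing parenthesis, terminating semicolon) and doubles single quotes inside values. It sticks to POSIX awk so it does not need gawk's three-argument match(). The function name reviews_to_sql is made up for this illustration, and the sketch assumes that any line not starting with "<" continues the previous value.

```shell
#!/bin/bash
# Sketch only: reviews_to_sql is a hypothetical name, not from the answer above.
# Assumption: a line that does not start with "<" continues the previous value.
reviews_to_sql() {
    awk -v q="'" '
    function emit(   i, cols, vals) {
        if (n == 0) return                    # nothing collected, e.g. repeated blank lines
        for (i = 1; i <= n; i++) {
            gsub(q, q q, value[i])            # double single quotes for SQL
            cols = cols (i > 1 ? ", " : "") "[" name[i] "]"
            vals = vals (i > 1 ? ", " : "") q value[i] q
        }
        printf "INSERT INTO [HotelReviews] (%s) VALUES (%s);\n", cols, vals
        n = 0
    }
    FNR == 1 { emit() }                       # a new input file also ends the previous review
    { sub(/[ \t\r]+$/, "") }                  # strip trailing whitespace
    /^$/ { emit(); next }                     # a blank line ends a review
    /^<img / {                                # image line: keep only the src URL
        url = $0
        sub(/^[^"]*"/, "", url)
        sub(/".*/, "", url)
        name[++n] = "Image"; value[n] = url
        next
    }
    /^<[^>]+>/ {                              # ordinary <Tag>value line
        tag = substr($0, 2, index($0, ">") - 2)
        sub(/ *\/.*/, "", tag)                # "Check in/front desk" -> "Check in"
        gsub(/\./, "", tag)                   # "No. Reader" -> "No Reader"
        gsub(/ /, "_", tag)
        name[++n] = tag; value[n] = substr($0, index($0, ">") + 1)
        next
    }
    n > 0 { value[n] = value[n] " " $0 }      # continuation of a multi-line value
    END { emit() }
    ' "$@"
}
```

It could be used along the lines of "reviews_to_sql file1 file2 > reviews.sql" (file names here are placeholders), after which the resulting file can be loaded with ".read" as described in the question.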

2017-04-08 15:28:38

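Thank you very much! However, there is a bug in the code. I think it is related to the <img> line, and it causes a duplicate ([Date] x 2). I would really appreciate it if you could help me fix this (I am not familiar with those regular expressions). Thanks again!

I plan to use the approach you proposed, since it is almost exactly what I want to do, but I do not know how to fix it myself. Please help me solve the problem, and thank you again.

OK, I tweaked the img regular expression to fix that and tidied up the code a bit.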

Example:

#!/bin/bash

while IFS= read -r line; do
    [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
    [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"

    # capture lines not starting with < and append them to Content,
    # separated by a space so that words do not run together
    [[ $line =~ ^[^\<] ]] && Content+=" $line"

    # match an empty line: print the review and reset for the next one
    if [[ $line =~ ^$ ]]; then
        echo "${Author}, ${Content}"
        Author= Content=
    fi
done < file

Output with your file:

 
bigBob, definitely above average! we had a really nice stay there last year when I and ...
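To turn such a pair into an INSERT statement rather than a comma-separated line, each value has to be wrapped in single quotes and any single quote inside it doubled. The following is a rough sketch of that step for the two columns handled above; sql_quote and make_insert are hypothetical helper names, not part of the script above.

```shell
#!/bin/bash
# Sketch only: sql_quote and make_insert are made-up helper names.

# Wrap a value in single quotes, doubling any single quote inside it.
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Build an INSERT statement from an author ($1) and a content string ($2).
make_insert() {
    printf 'INSERT INTO [HotelReviews] ([Author], [Content]) VALUES (%s, %s);\n' \
        "$(sql_quote "$1")" "$(sql_quote "$2")"
}
```

Inside the loop, the echo could then be swapped for something like make_insert "$Author" "$Content".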

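=~: match a regular expression (string on the left, unquoted regex on the right)

^: match the start of the line

\< or \>: match a literal < or >

.*: here, match the rest of the line

(.*): capture the rest of the line into BASH_REMATCH[1], the first capture group of the array BASH_REMATCH (index 0 holds the whole match)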
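A small illustration of how the pieces of BASH_REMATCH line up, using a generic tag pattern instead of one test per tag. The variable names and the pattern are chosen for this sketch only; keeping the regex in a variable avoids having to escape < and > inside [[ ]].

```shell
#!/bin/bash
# Sketch only: the variable names and the generic pattern are illustrative.
line='<Author>bigBob'
re='^<([^>]+)>(.*)'            # group 1: tag name, group 2: rest of the line

if [[ $line =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # the entire matched text
    tag=${BASH_REMATCH[1]}     # first capture group
    value=${BASH_REMATCH[2]}   # second capture group
fi

printf '%s=%s\n' "$tag" "$value"
```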

See: The Stack Overflow Regular Expressions FAQ

2017-04-08 09:41:51 Cyrus
