2017-10-07

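Matching JSON Lines from a listing into a new JSON list

I'm trying to match products from one file in JSON format against a list of products in JSON Lines format. This is sometimes called record linkage, entity resolution, reference reconciliation, or just matching.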

The goal is to match product listings from a third-party retailer, e.g. "Nikon D90 12.3MP Digital SLR Camera (Body Only)", against a set of known products, e.g. "Nikon D90".

Details

Data objects

Product

{ 
"product_name": String // A unique id for the product 
"manufacturer": String 
"family": String // optional grouping of products 
"model": String 
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00 
} 

Listing

{ 
"title": String // description of product for sale 
"manufacturer": String // who manufactures the product for sale 
"currency": String // currency code, e.g. USD, CAD, GBP, etc. 
"price": String // price, e.g. 19.99, 100.00 
} 

Result

{ 
"product_name": String 
"listings": Array[Listing] 
} 

Data

There are two files:

products.txt - contains around 700 products
listings.txt - contains about 20,000 product listings
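For illustration only, the three shapes above could be read and assembled in Python roughly as follows. This is a minimal sketch using just the standard library: the helper names `read_jsonl` and `make_result` and the sample rows are assumptions made for the example, and in-memory streams stand in for the two data files.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-blank line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def make_result(product, listings):
    """Build a Result object: one product plus the listings matched to it."""
    return {"product_name": product["product_name"], "listings": list(listings)}


# Hypothetical sample rows standing in for products.txt and listings.txt.
products_file = io.StringIO(
    '{"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90", '
    '"announced-date": "2011-04-28T19:00:00.000-05:00"}\n'
)
listings_file = io.StringIO(
    '{"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)", '
    '"manufacturer": "Nikon", "currency": "USD", "price": "19.99"}\n'
)

products = list(read_jsonl(products_file))
listings = list(read_jsonl(listings_file))

# Pretend a matcher has paired the single product with the single listing.
result = make_result(products[0], listings)
print(json.dumps(result))
```

Writing one `json.dumps(result)` per line keeps the output in the same JSON Lines layout as the inputs.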

Current code (using Python):

import json
import logging
import re
import sys

import jsonlines

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

with jsonlines.open('products.jsonl') as products:
    for prod in products:
        jdump = json.dumps(prod)
        jload = json.loads(jdump)
        # Raw string so that \s reaches the regex engine unchanged.
        regpat = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
        prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
        manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
        modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
        wordmatch = prodmatch + manumatch + modelmatch
        #print (wordmatch)
        #logging.debug('product first output')
        # The listings file is reopened and rescanned for every product.
        with jsonlines.open('listings.jsonl') as listings:
            for entry in listings:
                jdump2 = json.dumps(entry)
                jload2 = json.loads(jdump2)
                wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
                #print (wordmatch2)
                #logging.debug('listing first output')
                contained = [x for x in wordmatch2 if x in wordmatch]
                if contained:
                    print(contained)
                #logging.debug('contained first match')

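The code above splits the product name, model, and manufacturer from the products file into words and tries to match them against the strings in the listings file, but I feel this is too slow and there must be a better way to do it. Any help is appreciated.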

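What works and what doesn't? If you want an answer, you have to ask a question. –

The nested for loops iterate over all the data, but my matching isn't very accurate or precise. It also takes a long time to parse through. –

You may want to find a database that does full-text search and use it. There are also online resources on text normalization that could help you improve this code or make use of a full-text search database. I know this is open-ended, but this is a big field; pick a corner and start reading. :) – ldrg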

Answer

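First, I'm not sure what the dumps() followed by loads() is for. If you can find a way to avoid serializing and deserializing everything on every iteration, that will be a big win, because from the code you've posted here it looks completely redundant. Second, the listings: since they don't change, why not parse them once before the loop into a data structure (perhaps a dict mapping the contents of wordmatch2 to the listing they came from) and reuse that structure while parsing products.json?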
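One way to read the second suggestion is as an inverted index: tokenize every listing title once, then look products up in the resulting dict instead of rescanning the listings file. The sketch below is only an illustration. The names `tokenize`, `build_index`, and `candidates` and the sample records are made up, and the rule that a listing must contain every manufacturer and model token is an assumption, not something stated in the question.

```python
import re
from collections import defaultdict

# Same splitting rule as the question's code, written as a raw string.
SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokenize(text):
    """Lower-case the text and split it into non-empty tokens."""
    return [tok for tok in SPLITTER.split(text.lower()) if tok]


def build_index(listings):
    """Map each token to the positions of the listings whose title contains it."""
    index = defaultdict(set)
    for pos, listing in enumerate(listings):
        for tok in tokenize(listing["title"]):
            index[tok].add(pos)
    return index


def candidates(product, index):
    """Positions of listings containing every manufacturer and model token."""
    wanted = tokenize(product["manufacturer"]) + tokenize(product["model"])
    if not wanted:
        return set()
    # .get() avoids adding empty entries to the defaultdict for unknown tokens.
    return set.intersection(*(index.get(tok, set()) for tok in wanted))


# Hypothetical sample data shaped like the question's records.
listings = [
    {"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)"},
    {"title": "Nikon D3000 kit"},
    {"title": "Canon EOS 7D body"},
]
product = {"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}

index = build_index(listings)  # built once, before looping over products
for pos in sorted(candidates(product, index)):
    print(product["product_name"], "->", listings[pos]["title"])
```

Each product lookup then touches only the listings that share a token with it, rather than all of them.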

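Next: if there is a way to adapt this to use multiprocessing, I strongly recommend you do so. You are entirely CPU-bound here, and you can easily run this in parallel across all your cores.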
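A rough sketch of what the multiprocessing suggestion could look like, assuming each product can be matched independently of the others. The function names, the subset-of-tokens matching rule, and the sample data are all assumptions made for the example; only `multiprocessing.Pool` and `functools.partial` come from the standard library.

```python
import re
from functools import partial
from multiprocessing import Pool

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokenize(text):
    """Lower-case the text and return its set of non-empty tokens."""
    return {tok for tok in SPLITTER.split(text.lower()) if tok}


def match_product(product, listing_tokens):
    """Return (product_name, positions of listings holding all wanted tokens)."""
    wanted = tokenize(product["manufacturer"]) | tokenize(product["model"])
    hits = [
        pos for pos, toks in enumerate(listing_tokens)
        if wanted and wanted <= toks  # set subset test
    ]
    return product["product_name"], hits


def match_all(products, listings, workers=None):
    """Tokenize the listings once, then spread the products over a process pool."""
    listing_tokens = [tokenize(listing["title"]) for listing in listings]
    worker = partial(match_product, listing_tokens=listing_tokens)
    with Pool(processes=workers) as pool:
        return dict(pool.map(worker, products))


if __name__ == "__main__":
    # Hypothetical sample data shaped like the question's records.
    products = [
        {"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"},
        {"product_name": "Canon_EOS_7D", "manufacturer": "Canon", "model": "7D"},
    ]
    listings = [
        {"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)"},
        {"title": "Canon EOS 7D body"},
    ]
    print(match_all(products, listings, workers=2))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes re-import the main module. Passing the token list through `partial` means it is pickled along with each batch of work, so for large inputs a pool initializer that loads the listings once per worker may be the better choice.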

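Finally, I gave it a shot with some fancy regex shenanigans. The goal here is to push as much of the logic as possible into the regex, because I believe re is implemented in C and is therefore more performant than doing all this string work in Python.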

import json 
import re 

PRODUCTS = """ 
[ 
{ 
"product_name": "Puppersoft Doggulator 5000", 
"manufacturer": "Puppersoft", 
"family": "Doggulator", 
"model": "5000", 
"announced-date": "ymd" 
}, 
{ 
"product_name": "Puppersoft Doggulator 5001", 
"manufacturer": "Puppersoft", 
"family": "Doggulator", 
"model": "5001", 
"announced-date": "ymd" 
}, 
{ 
"product_name": "Puppersoft Doggulator 5002", 
"manufacturer": "Puppersoft", 
"family": "Doggulator", 
"model": "5002", 
"announced-date": "ymd" 
} 
] 
""" 


LISTINGS = """ 
[ 
{ 
"title": "Doggulator 5002", 
"manufacturer": "Puppersoft", 
"currency": "Pupper Bux", 
"price": "420" 
}, 
{ 
"title": "Doggulator 5005", 
"manufacturer": "Puppersoft", 
"currency": "Pupper Bux", 
"price": "420" 
}, 
{ 
"title": "Woofer", 
"manufacturer": "Shibasoft", 
"currency": "Pupper Bux", 
"price": "420" 
} 
] 
""" 

# Raw string so that \s reaches the regex engine unchanged.
SPLITTER_REGEX = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
product_re_map = {}
product_re_parts = []

# get our matching keywords from products.json
for idx, product in enumerate(json.loads(PRODUCTS)):
    matching_parts = [x for x in SPLITTER_REGEX.split(product["product_name"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["manufacturer"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["model"]) if x]

    # store the product object for outputting later if we get a match
    group_name = 'i{idx}'.format(idx=idx)
    product_re_map[group_name] = product
    # create a giganto-regex that matches anything from a given product.
    # the group name is a reference back to the matching product.
    # I use set() here to deduplicate repeated words in matching_parts.
    # re.escape() keeps characters such as "." or "+" in a word literal.
    words = "|".join(re.escape(w) for w in set(matching_parts))
    product_re_parts.append("(?P<{group_name}>{words})".format(group_name=group_name, words=words))
# Do the case-insensitive matching in C code
product_re = re.compile("|".join(product_re_parts), re.I)

for listing in json.loads(LISTINGS):
    # we match against split words in the regex created above so we need to
    # split our source input in the same way
    matching_listings = []
    for word in SPLITTER_REGEX.split(listing['title']):
        if word:
            product_match = product_re.match(word)
            if product_match:
                # groupdict() lists every named group, with None for groups
                # that did not take part in the match, so skip those.
                # Only the first alternative that matches is reported, so a
                # word shared by several products points at the earliest one.
                for k, v in product_match.groupdict().items():
                    if v is None:
                        continue
                    matching_listing = product_re_map[k]
                    if matching_listing not in matching_listings:
                        matching_listings.append(matching_listing)
    print(listing['title'], matching_listings)