Finding words that match n-grams

Dataset: text descriptions of property/land features, to be classified without supervision.

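from nltk import ngrams, word_tokenize  # imports the snippet relies on
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]  # show the Id and bigram columns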

Id  bigram 
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top), 
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11), 
1645751 [(Flat,available),(available,sale),(sale,Medavakkam), 
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks), 
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm),
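
For readers who want to see what the bigram column contains without installing NLTK, here is a minimal sketch in plain Python. It assumes simple whitespace splitting, which is not the same as NLTK's `word_tokenize` (punctuation is not separated), and the helper name `bigrams` is made up for illustration.

```python
def bigrams(text):
    """Return consecutive word pairs from a whitespace-separated string."""
    tokens = text.split()
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return list(zip(tokens, tokens[1:]))

pairs = bigrams("Swimming Pool in the roof top")
print(pairs)
```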

I have a Python file (Categories.py).

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'), 
     ('Swimming Pool', 'IN','Recreation_Ammenities'), 
     ('Toddler Pool', 'IN', 'Recreation_Ammenities'), 
     ('Jogging Tracks', 'IN', 'Recreation_Ammenities')] 
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']
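
If more categories are added to the same list of triples later, it may be easier to group the phrases by label once instead of writing one list comprehension per category. The following is only a sketch; the name `phrases_by_category` is an assumption and does not appear in the original file.

```python
from collections import defaultdict

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
            ('Swimming Pool', 'IN', 'Recreation_Ammenities'),
            ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
            ('Jogging Tracks', 'IN', 'Recreation_Ammenities')]

# Map each category label to the list of phrases that belong to it.
phrases_by_category = defaultdict(list)
for phrase, _rel, label in category:
    phrases_by_category[label].append(phrase)

# Equivalent to the Recreation list built above.
Recreation = phrases_by_category['Recreation_Ammenities']
```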

To find the entries in the bigram column that match the category list:

tokens=pd.Series(df["bigram"]) 
Lid=pd.Series(df["Id"]) 
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))

When I run the code above, I get this error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

I need help with this.

My desired output is:

Id  bigram         Recreation_Amenities 
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool 
1918916 [(Luxury,Apartments),(Apartments,..  Luxury Apartments 
1645751 [(Flat,available),(available,sale)..  
1270503 [(Toddler,Pool),(Jogging,Tracks)..  Toddler Pool,Jogging Tracks 
1495638 [(near,medavakkam),..

Source

2017-08-27 Rajitha Naik

Something along these lines should work for you:

def match_bigrams(row):
    categories = []

    for bigram in row.bigram:
        # Join the two words of the bigram with a space, e.g. 'Swimming Pool'
        joined = ' '.join(list(bigram))
        if joined in Recreation:
            categories.append(joined)

    return categories

df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
print(df)


Id bigram Recreation_Amenities 
0 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the... [Swimming Pool] 
1 1918916 [(Luxury, Apartments), (Apartments, consisting... [Luxury Apartments] 
2 1645751 [(Flat, available), (available, sale), (sale, ... [] 
3 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging... [Toddler Pool, Jogging Tracks] 
4 1495638 [(near, medavakkam), (medavakkam, junction), (... []

Each bigram is joined with a space, so you can test whether the resulting string is contained in your category list (i.e. if joined in Recreation).

Source

2017-08-27 08:05:33

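Can you explain how the 'row' argument gets passed to the function? Also, I want to use this function several times, once per category (Recreation, Healthcare, Security, etc.), so that I can call the same function for n categories. How can I do that?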

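The function 'match_bigrams' is applied row by row (each row of the dataframe is passed into the function). As for your second question, it depends: the function matches against the entries in the 'Recreation' list, so if you extend that list with phrases from other categories, it should work for n categories.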
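
To illustrate the row-by-row behaviour described in this comment, here is a small sketch. With `axis=1`, `DataFrame.apply` calls the function once per row and passes that row as a pandas Series indexed by column name. The helper `describe_row` and the two-row sample frame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1952043, 1918916],
    "bigram": [[("Swimming", "Pool"), ("Pool", "in")],
               [("Luxury", "Apartments")]],
})

def describe_row(row):
    # `row` is one row of df; columns are reachable as row["Id"] or row.bigram
    return f"{row['Id']}: {len(row['bigram'])} bigrams"

# One call per row; the results are collected into a new Series.
summary = df.apply(describe_row, axis=1)
print(summary.tolist())
```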

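Yes, but at the moment the condition inside the function is 'if joined in Recreation:'. I have several categories like this, and I want to avoid writing the whole function once per category. Can I call the same function and pass the category name in the call, here: df.apply(match_bigrams, axis=1)?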
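
One possible way to do what this comment asks is to give the function a second parameter for the phrase list and pass it through `Series.apply` with `args`. This is a sketch, not something confirmed in the thread: the `categories` dict is an assumption, and its "Location" entry is invented purely to show a second category.

```python
import pandas as pd

# Hypothetical mapping of output column -> phrases to look for.
# Only the Recreation phrases come from the question; "Location" is made up.
categories = {
    "Recreation_Amenities": ["Luxury Apartments", "Swimming Pool",
                             "Toddler Pool", "Jogging Tracks"],
    "Location": ["medavakkam junction"],
}

def match_bigrams(bigrams, phrases):
    """Return the space-joined bigrams that appear in `phrases`."""
    wanted = set(phrases)
    joined = (" ".join(pair) for pair in bigrams)
    return [phrase for phrase in joined if phrase in wanted]

df = pd.DataFrame({
    "Id": [1952043, 1270503, 1495638],
    "bigram": [
        [("Swimming", "Pool"), ("Pool", "in"), ("in", "the")],
        [("Toddler", "Pool"), ("Pool", "with"), ("with", "Jogging"), ("Jogging", "Tracks")],
        [("near", "medavakkam"), ("medavakkam", "junction"), ("junction", "calm")],
    ],
})

# The same function is reused for every category; only `phrases` changes.
for column, phrases in categories.items():
    df[column] = df["bigram"].apply(match_bigrams, args=(phrases,))

print(df[["Id", "Recreation_Amenities", "Location"]])
```

Applying the function to the bigram column alone (rather than to whole rows with axis=1) keeps the function independent of column names, so it only needs the list of bigrams and the phrases to match.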

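You can join each tuple with a space, then use a nested list comprehension to find the phrases that are present in Recreation, and apply it, i.e.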

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])

Let's say you have a dataframe

 
    Id  bigram 
0 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging), (Jogging, Tracks)] 
1 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top)] 
2 1918916 [(Luxury, Apartments), (Apartments, consisting), (consisting, 11)] 
3 1495638 [(near, medavakkam), (medavakkam, junction), (junction, calm)] 
4 1645751 [(Flat, available), (available, sale), (sale, Medavakkam)]

And you have the Recreation list, i.e.

Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']

Then

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])

Output: df['Recreation_Amenities']

 

0 [Toddler Pool, Jogging Tracks] 
1 [Swimming Pool]    
2 [Luxury Apartments]   
3 []        
4 []        
Name: Recreation_Amenities, dtype: object
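
The desired output in the question shows comma-separated text rather than Python lists. If that form is needed, each list can be joined into a string afterwards. A minimal sketch, using a hand-built Series in place of the real column:

```python
import pandas as pd

# Stand-in for df['Recreation_Amenities'] as produced above.
matches = pd.Series(
    [["Toddler Pool", "Jogging Tracks"], ["Swimming Pool"], []],
    name="Recreation_Amenities",
)

# Join each list into one string; rows with no match become an empty string.
as_text = matches.apply(",".join)
print(as_text.tolist())
```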

Source

2017-08-27 08:44:03 Dark
