2017-08-27 88 views
2

Dataset: property/land features for unsupervised classification. Finding words that match n-grams

from nltk import ngrams, word_tokenize
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]

Id  bigram 
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top), 
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11), 
1645751 [(Flat,available),(available,sale),(sale,Medavakkam), 
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks), 
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm), 

I have a Python file (Categories.py).

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'), 
     ('Swimming Pool', 'IN','Recreation_Ammenities'), 
     ('Toddler Pool', 'IN', 'Recreation_Ammenities'), 
     ('Jogging Tracks', 'IN', 'Recreation_Ammenities')] 
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities'] 

To find the words in the bigram column that match the category list:

tokens=pd.Series(df["bigram"]) 
Lid=pd.Series(df["Id"]) 
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation]))) 

When I run the code above, I get this error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas 

I need help with this.

My desired output is:

Id  bigram         Recreation_Amenities 
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool 
1918916 [(Luxury,Apartments),(Apartments,..  Luxury Apartments 
1645751 [(Flat,available),(available,sale)..  
1270503 [(Toddler,Pool),(Jogging,Tracks)..  Toddler Pool,Jogging Tracks 
1495638 [(near,medavakkam),.. 

Answers

1

Something along these lines should work for you:

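def match_bigrams(row):
    # Collect every bigram in this row that names a Recreation category.
    categories = []

    for bigram in row.bigram:
        # Join the two words with a space, e.g. ('Swimming', 'Pool') -> 'Swimming Pool'.
        joined = ' '.join(bigram)
        if joined in Recreation:
            categories.append(joined)

    return categories

# axis=1 passes each row of the DataFrame to match_bigrams.
df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
print(df)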


Id bigram Recreation_Amenities 
0 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the... [Swimming Pool] 
1 1918916 [(Luxury, Apartments), (Apartments, consisting... [Luxury Apartments] 
2 1645751 [(Flat, available), (available, sale), (sale, ... [] 
3 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging... [Toddler Pool, Jogging Tracks] 
4 1495638 [(near, medavakkam), (medavakkam, junction), (... [] 

Each bigram is joined with a space, so you can test whether the joined bigram is contained in your category list (i.e. if joined in Recreation).

+0

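Can you explain the 'row' parameter that is passed to the def function? I would also like to use this function several times, once for each category such as Recreation, Healthcare, Safety and so on, so that I can call the same function for n categories. How can I do that?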

+1

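The function 'match_bigrams' is applied row by row (each row of the DataFrame is passed into this function). As for your second question, it depends: the function matches against the entries of the 'Recreation' list, so if you extend that list with other categories, it should work for n categories.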

+0

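Yes, but at the moment the condition inside the function is 'if joined in Recreation:'. I have several such categories, and I want to avoid writing the whole function once per category. Can I call the same function by passing the category name in the calling code, here: df.apply(match_bigrams, axis=1)?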
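One way to address the last comment, as a minimal sketch rather than the answerer's own method: give the matching function a second parameter for the category list, so the same function serves every category. The name match_category, the parameter names, and the Healthcare list with its single entry are hypothetical and only there for illustration.

```python
def match_category(bigrams, category_terms):
    """Return the space-joined bigrams that appear in category_terms."""
    matches = []
    for bigram in bigrams:
        joined = ' '.join(bigram)  # ('Toddler', 'Pool') -> 'Toddler Pool'
        if joined in category_terms:
            matches.append(joined)
    return matches

# Recreation comes from the question; Healthcare is a made-up example list.
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']
Healthcare = ['Medical Centre']

row_bigrams = [('Toddler', 'Pool'), ('Pool', 'with'),
               ('with', 'Jogging'), ('Jogging', 'Tracks')]

recreation_matches = match_category(row_bigrams, Recreation)
healthcare_matches = match_category(row_bigrams, Healthcare)
print(recreation_matches)
print(healthcare_matches)
```

With the DataFrame from the question, something like df['bigram'].apply(match_category, args=(Recreation,)) should then fill one column per category, because Series.apply forwards the extra positional arguments in args to the function.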

1

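You can join each tuple with a space, then use a nested list comprehension to find the entries of Recreation that are present, and apply it, i.e.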

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]]) 

Suppose you have a DataFrame:

 
    Id  bigram 
0 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging), (Jogging, Tracks)] 
1 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top)] 
2 1918916 [(Luxury, Apartments), (Apartments, consisting), (consisting, 11)] 
3 1495638 [(near, medavakkam), (medavakkam, junction), (junction, calm)] 
4 1645751 [(Flat, available), (available, sale), (sale, Medavakkam)] 

And you have the list Recreation, i.e.

Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks'] 

Then

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]]) 

Output: df['Recreation_Amenities']

 

0 [Toddler Pool, Jogging Tracks] 
1 [Swimming Pool]    
2 [Luxury Apartments]   
3 []        
4 []        
Name: Recreation_Amenities, dtype: object
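The comprehension above rebuilds the joined list for every entry in Recreation, and it returns lists, whereas the desired output in the question shows comma-separated text. A possible variant, sketched under those assumptions, joins each row's bigrams once into a set and returns a single string. The function name matches_as_text and its parameter names are hypothetical.

```python
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']

def matches_as_text(bigrams, terms):
    """Join each bigram once, then return the matching terms as one string."""
    joined = {' '.join(pair) for pair in bigrams}  # built once per row
    # Results follow the order of `terms`, as in the comprehension above.
    return ','.join(term for term in terms if term in joined)

row = [('Toddler', 'Pool'), ('Pool', 'with'),
       ('with', 'Jogging'), ('Jogging', 'Tracks')]
no_match_row = [('Flat', 'available'), ('available', 'sale'), ('sale', 'Medavakkam')]

text = matches_as_text(row, Recreation)
empty = matches_as_text(no_match_row, Recreation)
print(text)
print(repr(empty))
```

Rows without a match yield an empty string, which lines up with the blank cells in the question's desired output.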