2017-06-20 53 views
2

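I want to scrape synonyms from thesaurus.com using Python. The synonyms are listed on the page in an unordered list. How do I scrape an unordered list from HTML in Python?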

from lxml import html 
import requests 
term = (input("Enter in a term to find the synonyms of: ")) 
page = requests.get('http://www.thesaurus.com/browse/' + term.lower(),allow_redirects=True) 
if page.status_code == 200: 
    tree = html.fromstring(page.content) 
    synonyms = tree.xpath('//div[@class="relevancy-list"]/text()') 
    print(synonyms) 
else: 
    print("No synonyms found!") 

My code only outputs whitespace instead of the synonyms. How do I scrape the actual synonyms rather than the whitespace?

Answers

1

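/text() selects only the text nodes directly under the current tag. Your current code therefore does not print the synonyms, because they sit under another tag nested inside the div.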

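You could use //text() to get all the text under the current tag, at any depth. However, that returns every piece of text, including text you do not need.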
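The difference between the two forms can be seen offline on a small document. This is only a sketch: the markup below is a hypothetical fragment modeled on the class names mentioned in this thread (the `note` span is invented for illustration), not the real page.

```python
from lxml import html

# Hypothetical markup; the real thesaurus.com page may look different.
page = """<html><body>
<div class="relevancy-list">
  <ul>
    <li><span class="text">firm</span><span class="note">informal</span></li>
    <li><span class="text">bent</span></li>
  </ul>
</div>
</body></html>"""

tree = html.fromstring(page)

# /text() returns only text nodes that are direct children of the div,
# which here is just the indentation around the <ul>.
direct = tree.xpath('//div[@class="relevancy-list"]/text()')

# //text() returns every text node below the div, wanted or not.
nested = tree.xpath('//div[@class="relevancy-list"]//text()')

# Drop whitespace-only entries to see what real text was captured.
nested_clean = [t.strip() for t in nested if t.strip()]

print(direct)
print(nested_clean)
```

The first list contains no synonyms at all, while the second also picks up the extra `note` text alongside them.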

For your use case, since the synonyms are inside <span class="text"> tags, you can use this XPath:

//div[@class="relevancy-list"]//span[@class="text"]/text() 

It selects all text found inside a span with class "text" that is itself inside a div with class "relevancy-list".

For the input term set, the output of this XPath is:

 
['firm', 'bent', 'stated', 'specified', 'rooted', 'established', 'confirmed', 'pat', 'immovable', 'obstinate', 'ironclad', 'predetermined', 'intent', 'entrenched', 'appointed', 'regular', 'prescribed', 'determined', 'scheduled', 'fixed', 'settled', 'certain', 'customary', 'decisive', 'definite', 'inveterate', 'pigheaded', 'resolute', 'rigid', 'steadfast', 'stubborn', 'unflappable', 'usual', 'concluded', 'agreed', 'resolved', 'stipulated', 'arranged', 'prearranged', 'dead set on', 'hanging tough', 'locked in', 'set in stone', 'solid as a rock', 'stiff-necked', 'well-set', 'immovable', 'entrenched', 'located', 'solid', 'situate', 'stiff', 'placed', 'stable', 'fixed', 'settled', 'situated', 'rigid', 'strict', 'stubborn', 'unyielding', 'hidebound', 'positioned', 'sited', 'jelled', 'hard and fast', 'deportment', 'comportment', 'fit', 'presence', 'mien', 'hang', 'carriage', 'air', 'turn', 'attitude', 'address', 'demeanor', 'position', 'inclination', 'port', 'posture', 'setting', 'scene', 'scenery', 'flats', 'stage set', u'mise en sc\xe8ne', 'series', 'array', 'lot', 'collection', 'batch', 'crowd', 'cluster', 'gang', 'bunch', 'crew', 'circle', 'body', 'coterie', 'faction', 'company', 'bundle', 'outfit', 'band', 'clique', 'mob', 'kit', 'class', 'clan', 'compendium', 'clutch', 'camp', 'sect', 'push', 'organization', 'clump', 'assemblage', 'pack', 'gaggle', 'rat pack', 'locate', 'head', 'prepare', 'fix', 'introduce', 'turn', 'settle', 'lay', 'install', 'put', 'apply', 'post', 'establish', 'wedge', 'point', 'lock', 'affix', 'direct', 'rest', 'seat', 'station', 'plop', 'spread', 'lodge', 'situate', 'plant', 'park', 'bestow', 'train', 'stick', 'plank', 'arrange', 'insert', 'level', 'plunk', 'mount', 'aim', 'cast', 'deposit', 'ensconce', 'fasten', 'embed', 'anchor', 'make fast', 'make ready', 'zero in', 'appoint', 'name', 'schedule', 'make', 'impose', 'stipulate', 'settle', 'determine', 'establish', 'fix', 'specify', 'designate', 'decree', 'resolve', 'rate', 'conclude', 'price', 
'prescribe', 'direct', 'value', 'ordain', 'allocate', 'instruct', 'allot', 'dictate', 'estimate', 'regulate', 'assign', 'arrange', 'lay down', 'agree upon', 'fix price', 'fix', 'stiffen', 'thicken', 'condense', 'jelly', 'clot', 'congeal', 'solidify', 'cake', 'coagulate', 'jell', 'gelatinize', 'crystallize', 'jellify', 'gel', 'become firm', 'gelate', 'drop', 'subside', 'sink', 'vanish', 'dip', 'disappear', 'descend', 'go down', 'initiate', 'begin', 'raise', 'abet', 'provoke', 'instigate', 'commence', 'foment', 'whip up', 'put in motion', 'set on', 'stir up'] 

Note that you will get the synonyms for every sense of the word.

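To keep the senses apart, you may need to loop manually over the results of //div[@class="relevancy-list"] and extract .//span[@class="text"]/text() from each div found. The leading dot matters: it keeps the search relative to that div, whereas a bare //span would search the whole document again.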
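A minimal sketch of that loop, run against a hypothetical two-sense fragment rather than the live site (the markup and the name `synonyms_by_sense` are assumptions made for illustration):

```python
from lxml import html

# Hypothetical markup with one "relevancy-list" div per sense of the word.
page = """<html><body>
<div class="relevancy-list"><ul>
  <li><span class="text">firm</span></li>
  <li><span class="text">fixed</span></li>
</ul></div>
<div class="relevancy-list"><ul>
  <li><span class="text">series</span></li>
  <li><span class="text">collection</span></li>
</ul></div>
</body></html>"""

tree = html.fromstring(page)

synonyms_by_sense = []
for block in tree.xpath('//div[@class="relevancy-list"]'):
    # The leading dot restricts the search to this div only.
    synonyms_by_sense.append(block.xpath('.//span[@class="text"]/text()'))

print(synonyms_by_sense)
```

Each inner list then holds the synonyms of one sense, instead of one flat list that mixes them all.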

0
import requests
from bs4 import BeautifulSoup

term = input("Enter in a term to find the synonyms of: ")
page = requests.get('http://www.thesaurus.com/browse/' + term.lower(), allow_redirects=True)

if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')
    # find() returns the first matching div, or None if there is none
    get_syn_tag = soup.find('div', {'class': 'relevancy-list'})
    list_items = get_syn_tag.find_all('li') if get_syn_tag else []
    synonyms = []  # collects every synonym so the list can be reused later
    for i in list_items:
        span = i.find('span', {'class': 'text'})
        if span is None:
            continue  # skip list items that carry no synonym span
        synonym = span.text
        print(synonym)  # prints a single synonym on each iteration
        synonyms.append(synonym)  # appends the synonym to the list
else:
    print("No synonyms found!")

Finding all the li tags first is more precise, but in this case the line below will also work:

# Builds a list of all available synonyms, provided no other `span` tag
# with the same class `text` exists in the specified `div`.
synonym_list = [i.text for i in get_syn_tag.find_all('span', {'class': 'text'})]
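One thing worth knowing about this approach is that `find` returns only the first matching div, so the one-liner covers a single sense of the word. The sketch below shows this on a hypothetical two-sense fragment (the markup is an assumption, not the real page) and compares it with BeautifulSoup's CSS selector method `select`, which matches across every such div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two "relevancy-list" divs, one per sense.
fragment = """
<div class="relevancy-list"><ul>
  <li><span class="text">firm</span></li>
  <li><span class="text">bent</span></li>
</ul></div>
<div class="relevancy-list"><ul>
  <li><span class="text">series</span></li>
  <li><span class="text">array</span></li>
</ul></div>
"""

soup = BeautifulSoup(fragment, 'html.parser')

# find() stops at the first matching div, so only the first sense is covered.
first_div = soup.find('div', {'class': 'relevancy-list'})
first_sense = [span.text for span in first_div.find_all('span', {'class': 'text'})]

# select() takes a CSS selector and matches spans inside every such div.
all_senses = [span.text for span in soup.select('div.relevancy-list span.text')]

print(first_sense)
print(all_senses)
```

Which of the two is preferable depends on whether only the primary sense or every sense of the term is wanted.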