Extracting web page elements from a website using Python

I want to extract various elements from the tables and paragraph text on this website.

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

This is the code I am using:

from lxml import etree
from urllib.request import urlopen  # Python 3 replacement for urllib2

url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1'
source = urlopen(url).read()
x = etree.HTML(source)
# Single quotes around the XPath so the double-quoted id does not end the string
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
print(growth)

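What is the best way to extract the elements I want from a website without having to change the XPath every time? They publish new data on the same site every month, but the XPath sometimes changes slightly.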

Source

2017-02-26 prashanth manohar

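Can you give an example of an element you want? Your XPath is invalid, so it can't be tested on this page.

I changed the XPath. I need the elements in the "Manufacturing at a Glance" table, and also the paragraph text.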

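If the position of the items changes often, try retrieving them by name instead. For example, here is how to extract the values from the "New Orders" row of the table.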

import requests #better than urllib 
from lxml import html, etree 

url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1' 
page = requests.get(url) 
tree = html.fromstring(page.content) 

neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()') 

print(neworders)
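
To see why selecting by label holds up better than a positional path, here is a minimal offline sketch. The two HTML snippets, the numbers in them, and the helper name `row_values` are hypothetical stand-ins, not the real page; the sketch only assumes that the label and its values sit in sibling cells of the same row.

```python
from lxml import html

# Hypothetical sample markup, loosely modeled on a report table.
LAYOUT_A = """
<table>
  <tr><th>Index</th><th>Value</th><th>Direction</th></tr>
  <tr><td><p><strong>New Orders</strong></p></td><td><p>60.4</p></td><td><p>Growing</p></td></tr>
  <tr><td><p><strong>Production</strong></p></td><td><p>61.4</p></td><td><p>Growing</p></td></tr>
</table>
"""

# Same data, but nested in extra containers, with rows reordered
# and stray whitespace around the label.
LAYOUT_B = """
<div><div>
<table>
  <tr><th>Index</th><th>Value</th><th>Direction</th></tr>
  <tr><td><p><strong>Production</strong></p></td><td><p>61.4</p></td><td><p>Growing</p></td></tr>
  <tr><td><p><strong> New Orders </strong></p></td><td><p>60.4</p></td><td><p>Growing</p></td></tr>
</table>
</div></div>
"""


def row_values(markup, label):
    """Return the text of the cells that follow the cell labeled `label`."""
    tree = html.fromstring(markup)
    # normalize-space() tolerates whitespace around the label text;
    # $label is an XPath variable filled in by the keyword argument.
    cells = tree.xpath(
        '//td[normalize-space(.)=$label]/following-sibling::td',
        label=label,
    )
    return [cell.text_content().strip() for cell in cells]


print(row_values(LAYOUT_A, "New Orders"))
print(row_values(LAYOUT_B, "New Orders"))
```

Both calls return the same two values even though the table moved and the rows changed order, which is the situation a path like `table[2]/tbody/tr[3]/td[2]` would not survive.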

Or, if you want the whole HTML table:

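data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')

for element in data:
    # encoding='unicode' returns str instead of bytes, so print() shows readable markup
    print(etree.tostring(element, pretty_print=True, encoding='unicode'))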

Using BeautifulSoup:

from bs4 import BeautifulSoup 
import requests 

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1" 

content = requests.get(url).content 

soup = BeautifulSoup(content, "lxml") 

table = soup.find_all('table')[1] 

table_body = table.find('tbody') 

data = []
rows = table_body.find_all('tr') 
for row in rows: 
    cols = row.find_all('td') 
    cols = [ele.text.strip() for ele in cols] 
    data.append([ele for ele in cols if ele]) 

print(data)

Source

2017-02-26 01:50:39

Hey Ettore, there is a small problem. I have described it here: http://stackoverflow.com/q/42592948/4399016 Thanks!

BeautifulSoup to the rescue:

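from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

r = urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655')
soup = BeautifulSoup(r, 'lxml')  # name the parser explicitly
# Find the container by id, then look for the <h4> inside it
container = soup.find('div', {'id': 'home_feature_container'})
print(container.find('h4') if container else None)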

This code is a starting point toward what you describe. If you use soup.find().contents, you get a list of every item contained in the element.
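
As a small illustration of what `.contents` returns, here is a sketch on a hypothetical fragment (the markup is made up; only the `home_feature_container` id comes from the question). It uses Python's built-in `html.parser` so that it runs without fetching the page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page.
SAMPLE = '<div id="home_feature_container"><h4>Report</h4><p>Summary text.</p></div>'

soup = BeautifulSoup(SAMPLE, "html.parser")
container = soup.find("div", {"id": "home_feature_container"})

# .contents is a list of the element's direct children (tags and strings).
children = container.contents
print([child.name for child in children])

# A child found inside the container can then be read as plain text.
print(container.find("h4").get_text())
```

On a real page the list usually also contains whitespace-only strings between tags, so filtering the children by type or name is often needed.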

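As for coping with changes to the page, it really depends. If the changes are large, you will have to change the soup.find() call. Otherwise, you may be able to write code general enough that it always applies. (For example, if the div called home_feature_container is always present, you would never need to change that part.)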
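
One way to make the lookup more general, sketched here under assumptions: find the table by its heading text instead of by index, then key each row by its first cell. The sample markup, its numbers, and the helper names `table_by_heading` and `rows_by_label` are hypothetical; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
SAMPLE = """
<div id="home_feature_container">
  <table><tr><td>Unrelated layout table</td></tr></table>
  <table>
    <tr><th colspan="3">MANUFACTURING AT A GLANCE</th></tr>
    <tr><td>New Orders</td><td>60.4</td><td>Growing</td></tr>
    <tr><td>Production</td><td>61.4</td><td>Growing</td></tr>
  </table>
</div>
"""


def table_by_heading(soup, heading):
    """Return the first table with a header cell containing `heading`, else None."""
    for th in soup.find_all("th"):
        if heading.lower() in th.get_text(strip=True).lower():
            return th.find_parent("table")
    return None


def rows_by_label(table):
    """Map each row's first cell text to the list of its remaining cell texts."""
    result = {}
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) > 1:  # skip header-only or single-cell rows
            result[cells[0]] = cells[1:]
    return result


soup = BeautifulSoup(SAMPLE, "html.parser")
table = table_by_heading(soup, "Manufacturing at a Glance")
data = rows_by_label(table)
print(data["New Orders"])
```

With this shape, a new table added above the target or a reordered row does not require a code change; only a renamed heading or row label would.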

Source

2017-02-26 01:20:15 celestialroad

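Hi, could you show a code example that returns some values? There is a table called "Manufacturing at a Glance". Could you show some elements being extracted and displayed with your technique? Many thanks!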
