2016-08-19 85 views
4

我在美麗的湯中選擇這個'div'對象然後解析其中的數據時遇到了麻煩。用美麗的湯解析數據綁定HTML中的標記

首先,我必須解碼HTML網頁上的實體(如https://mothereff.in/html-entities)。

我將採取哪些步驟來,例如,以編程方式選擇

(海: '/ S3/fhphotos/CIRD-72K6-H9_SID_1.jpg,寬度= 1000 &高度= 1000 &模式= MAX')

從下面

<div data-bind="component: { name: &#39;product-detail&#39;, params: {hasVariants:true,name:&#39;BROOKS LOUNGE CHAIR&#39;,hasCategory:true,superCategoryName:&#39;Furniture&#39;,categoryDisplayName:&#39;Living Room&#39;,categorySlug:&#39;living-room&#39;,subcategoryDisplayName:&#39;Chairs&#39;,subcategorySlug:&#39;chairs&#39;,collection:{id:1529,name:&#39;Irondale&#39;,description:&#39;Each piece is a striking conversation-starter. Tables are made from reclaimed doors paired with salvaged architecture or old machine parts. Storage solutions are inspired by libraries of the 1940’s. Cast iron beds with linen panels as well as seating in linen, lush velvet and top-grain leather offer a distinctive found feel.&#39;,isFeatured:true,isNew:false,image:&#39;/FourHandsMarketplace/media/General/Featured%20Collections/IRONDALE.jpg?width=500&#39;,shortDescription:&#39;Moving from Parisian flea market to modern to industrial, understated elegance is a common theme. Waxed leathers and distressed irons mix with fabrics for an intriguing style blend.\r\n&#39;,uri:&#39;/collections/irondale&#39;},attributes:[{id:384,name:&#39;COVER&#39;,displayOrder:30,swatches:true,values:[{id:12710,name:&#39;EBONY&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-G6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12711,name:&#39;STONEWASH DARK GREEN&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]},{id:385,name:&#39;FINISH&#39;,displayOrder:40,swatches:true,values:[{id:12712,name:&#39;BLACK WASH WEATHERED&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-K5_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12713,name:&#39;DISTRESSED WASHED OLD OAK&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-K6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]}],products:[{attributeValueIds:[12710,12712],description:&#39;Our take on the classic Adirondack emphasizes comfort with thick, top-grain leather cushioning. Wire-brushed oak is finished in black and hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.75&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >88&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Black Washed Weathered&#39;,&#39;Ebony&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_3.jpg&#39;},{order:11,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_4.jpg&#39;}],priceHtml:&#39;$520.00&#39;,itemNumber:&#39;CIRD-72K5-G6H6&#39;,name:&#39;Brooks Lounge Chair-Ebony, Blk Wsh Weath&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false},{attributeValueIds:[12711,12713],description:&#39;Our take on the classic Adirondack emphasizes comfort with green, stonewashed cotton canvas cushioning. Wire-brushed oak is hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.5&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >147&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Distressed Washed Old Oak&#39;,&#39;Stonewash Dark Green&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_3.jpg&#39;}],priceHtml:&#39;$290.00&#39;,itemNumber:&#39;CIRD-72K6-H9&#39;,name:&#39;Brooks Lounge Chair-Stonewsh Drk Green&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false}],activeItemNumber:&#39;CIRD-72K5-G6H6&#39;,priceDescription:&#39;Wholesale Price&#39;} }"></div> 

的代碼?

回答

0

哪裏這個網站字符串的來源和究竟你有興趣的提取,但對於美麗的湯的一部分,你只需要它並不完全清楚:

soup = BeautifulSoup(s) 
text = soup.div['data-bind'] 

s是串在你的問題。在獲得'數據綁定'attribute之前,我們首先得到'div'tag

該格式讓我感到困惑,因爲它與json類似,類似於python字典,但沒有一個解析器喜歡輸入。我猜它的JavaScript?我寫這個question激發了快速和骯髒的括號計數循環:

nest_lvl = 0 
lvl_string = list() 
for char in text: 
    if char == '{': 
     nest_lvl += 1 
    elif char == '}': 
     nest_lvl -= 1 

    try: 
     lvl_string[nest_lvl] += char 
    except IndexError:   # first iter 
     lvl_string.append(char) 

    if char == '}': 
     print nest_lvl, lvl_string[nest_lvl] 
     lvl_string[nest_lvl] = '' 

將有望讓你開始它。同樣,解析部分實際上取決於解析器的普遍程度以及您想要提取的內容。