2017-05-24 70 views

Customizing an HTML snippet in Python

Given an HTML snippet like the one below, how can I get the desired output shown further down using Python?

Sample HTML snippet:

<td width="10" class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">&gt;</a></td> 

      <td class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">002396653</a></td> 

      <td class="data1">IMPORT EXPRESS RECYCLE</td> 

      <td class="data1">961879066</td> 

     <td class="data1">11/23/2016</td> 

      <td class="data1"></td>  <!--SARA--> 

      <td class="data1" align="center">CN</td> 

      <td class="data1" align="center">PVG</td> 

Output:

961879066 | CN

My code so far:

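def reading():
    # Print each line of the file with tabs and surrounding whitespace removed.
    with open("C:\\Users\\John\\Desktop\\test.txt") as f:
        for raw_line in f:
            line = raw_line.replace("\t", "").strip()
            print(line)
    # No f.close() is needed: the with block closes the file on exit.

reading()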

Thanks,


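You should use BeautifulSoup to parse the HTML content. By the way, it would help if you posted a link to the site you want to scrape. For example, you can call soup.find_all('td', {'class': 'data1'}) to get all 'td' tags whose 'class' attribute equals 'data1'.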


I agree with @dot.Py, we need the link (or the full HTML page). Am I right in thinking that you want the text of the fourth and sixth 'td' tags?
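The find_all suggestion from the comments can be sketched as follows. This is only an illustration under assumptions: it needs the third-party beautifulsoup4 package, it uses Python's built-in "html.parser" backend, and it assumes the wanted values always sit in the fourth and seventh 'data1' cells, as they do in the sample row. The names sample, cells, and result are made up for this example.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The sample row from the question (assumed to be representative).
sample = """
<td width="10" class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">&gt;</a></td>
<td class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">002396653</a></td>
<td class="data1">IMPORT EXPRESS RECYCLE</td>
<td class="data1">961879066</td>
<td class="data1">11/23/2016</td>
<td class="data1"></td>  <!--SARA-->
<td class="data1" align="center">CN</td>
<td class="data1" align="center">PVG</td>
"""

soup = BeautifulSoup(sample, "html.parser")

# All <td> tags whose class attribute is "data1", in document order.
cells = soup.find_all("td", {"class": "data1"})

# Assumption: the fourth and seventh cells (indexes 3 and 6) hold the
# values requested in the question.
result = f"{cells[3].get_text(strip=True)} | {cells[6].get_text(strip=True)}"
print(result)
```

If the real page contains many such rows, the same indexing would have to be applied per row, which is why the commenters ask for the full HTML.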

Answer


You can try the code below to get the desired output:

import lxml.html 

html = lxml.html.fromstring("""<td width="10" class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">&gt;</a></td> 
<td class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">002396653</a></td> 
<td class="data1">IMPORT EXPRESS RECYCLE</td> 
<td class="data1">961879066</td> 
<td class="data1">11/23/2016</td> 
<td class="data1"></td>  <!--SARA--> 
<td class="data1" align="center">CN</td> 
<td class="data1" align="center">PVG</td>""") 

output = html.xpath('concat(//td[4], " | ", //td[7])') 
print(output) # '961879066 | CN' 

Pass your original HTML code to the 'html' variable.
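If installing lxml or BeautifulSoup is not an option, the same extraction can be approximated with the standard library's html.parser module. This is a swapped-in technique, not the answer's method, and the names CellCollector and extract are invented for this sketch. It again assumes the wanted values are the fourth and seventh 'data1' cells.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td class="data1"> cell, in document order.

    Hypothetical helper for illustration; not part of the original thread.
    """

    def __init__(self):
        super().__init__()
        self.cells = []         # one string per matching <td>
        self._in_cell = False   # True while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "data1":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> is appended to the current cell.
        if self._in_cell:
            self.cells[-1] += data


def extract(html_text):
    """Return '<4th cell> | <7th cell>' for the single row in html_text."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    return f"{cells[3]} | {cells[6]}"


# The sample row from the question.
sample = """
<td width="10" class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">&gt;</a></td>
<td class="data1"><a class="datalink" href="m01_detail.asp?key=002396653&amp;itemNumber=0">002396653</a></td>
<td class="data1">IMPORT EXPRESS RECYCLE</td>
<td class="data1">961879066</td>
<td class="data1">11/23/2016</td>
<td class="data1"></td>  <!--SARA-->
<td class="data1" align="center">CN</td>
<td class="data1" align="center">PVG</td>
"""

print(extract(sample))
```

To run it on the file from the question, read the whole file into a string first (for example with open(...).read() inside a with block) and pass that string to extract instead of sample.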