How to delete part of a text in Python

I am very new to Python, so I got stuck on this problem.

I have a text file like

blahh 
blah 
blah 
... 
<start> 
some stuff 
</start> 
even more blah blah blah

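I want to delete all the blah parts of the txt file that come before <start> and after </start>. (Mostly it comes from this link.) I want to use bs4 to build an HTML file from the page, so I think I have to delete all the non-HTML parts first.

Can someone tell me the best way to do this? Thanks for any help.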

2015-02-06 novice_007

@AJ: Please don't suggest using regular expressions to parse HTML. Please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (and linking to a tag is just useless.) – geoffspear 2015-02-06 17:10:52

No, you don't need to remove the irrelevant parts of the file. Let BeautifulSoup parse the complete file as is, then find the tags you need:

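from urllib.request import urlopen  # Python 3 (urllib2.urlopen on Python 2)
from bs4 import BeautifulSoup

url = 'http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt'
# Parse the whole file, including the plain text around the tags
soup = BeautifulSoup(urlopen(url), 'html.parser')
# Attribute access returns the first <document> tag
print(soup.document)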
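
The same idea can be tried without a network request. The following is only a minimal sketch: it assumes the sample text from the question is held in a string (the `text` variable is made up for illustration) and uses the standard-library `html.parser` backend, so the exact output may differ with other parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded file: plain text around one tag
text = """blahh
blah
<start>
some stuff
</start>
even more blah blah blah
"""

# Parse everything; the surrounding plain text does not need to be removed first
soup = BeautifulSoup(text, "html.parser")

# find() returns the first <start> tag, or None if the tag is missing
start = soup.find("start")

# Text inside the tag, with the surrounding newlines stripped
inner = start.get_text().strip()
print(inner)
```

Here `start` holds only the `<start>...</start>` element, so the blah lines before and after it are simply ignored rather than deleted by hand.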

2015-02-06 17:09:48 alecxe

Thank you very much, alecxe. This really helped me! – 2015-02-06 18:54:30
