Download a web page to a text file

I have the following code, which works. It downloads a web page to a text file:

Imports System.IO 
Imports System.Net 

Module Module1 

    Sub Main() 

        Dim webClient1 As New WebClient()
        webClient1.Encoding = System.Text.Encoding.ASCII
        webClient1.DownloadFile("http://www.bmreports.com/servlet/com.logica.neta.bwp_MarketIndexServlet?displayCsv=true", "C:\temp\stream.txt")
    End Sub 

End Module

This creates the text file, but it also downloads all of the HTML. How can I leave that out and get only the text displayed on the page?

Source

2013-10-16 Silentbob

Then you need to parse the whole HTML text, extract the text you need (with a regular expression or manually) and write it to the text file. – mit

Use 'HtmlAgilityPack' to parse the HTML. There is no "plain text mode" for an HTML file. –
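The HtmlAgilityPack suggestion above could look roughly like this. It is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, reuses the URL and output path from the question, and the helper name ExtractText is made up for illustration.

```vbnet
Imports System.IO
Imports System.Net
Imports HtmlAgilityPack

Module AgilityPackExample

    ' Hypothetical helper (not from the thread): returns the text content of an HTML string.
    Function ExtractText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when nothing matches, so guard before looping.
        Dim hidden As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//style")
        If hidden IsNot Nothing Then
            For Each node As HtmlNode In hidden
                node.Remove()
            Next
        End If

        ' InnerText concatenates the text nodes; DeEntitize decodes entities such as &amp;.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
    End Function

    Sub Main()
        Using client As New WebClient()
            ' URL and path are the ones from the question.
            Dim html As String = client.DownloadString("http://www.bmreports.com/servlet/com.logica.neta.bwp_MarketIndexServlet?displayCsv=true")
            File.WriteAllText("C:\temp\stream.txt", ExtractText(html))
        End Using
    End Sub

End Module
```

Compared with a regex, a parser copes with tags that span lines and with script or style content, at the cost of an extra dependency.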

You can use a regular expression to remove all HTML tags from the document:

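Imports System.IO
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim source As String = File.ReadAllText("C:\temp\stream.txt")

        ' Clean HTML tags
        source = StripTagsRegex(source)
    End Sub

    ' Strip function: removes anything that looks like an HTML tag
    Private Function StripTagsRegex(source As String) As String
        Return Regex.Replace(source, "<.*?>", String.Empty)
    End Function

End Module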

Here is an example of this regular expression extracting the plain text:

http://regexr.com?36ori
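The download and the tag stripping can also be combined so that the raw HTML never reaches the disk. A minimal sketch, assuming the same URL and output path as in the question. The helper name HtmlToText, the extra pass over script/style blocks and the entity decoding are illustrative additions, not part of the answer above. A regex only approximates HTML parsing, so unusual markup may still slip through.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Module DownloadAsText

    ' Hypothetical helper (not from the original answer): reduces HTML to rough plain text.
    Function HtmlToText(html As String) As String
        ' Drop <script> and <style> blocks together with their contents.
        Dim result As String = Regex.Replace(html, "<(script|style)\b[^>]*>.*?</\1\s*>", String.Empty,
                                             RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' Remove the remaining tags; Singleline lets a tag span line breaks.
        result = Regex.Replace(result, "<.*?>", String.Empty, RegexOptions.Singleline)
        ' Turn entities such as &amp; back into characters.
        Return WebUtility.HtmlDecode(result)
    End Function

    Sub Main()
        Using client As New WebClient()
            ' DownloadString keeps the page in memory instead of writing it to disk first.
            Dim html As String = client.DownloadString("http://www.bmreports.com/servlet/com.logica.neta.bwp_MarketIndexServlet?displayCsv=true")
            File.WriteAllText("C:\temp\stream.txt", HtmlToText(html))
        End Using
    End Sub

End Module
```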

Source

2013-10-16 09:21:55

Carlos, can you chain more than one replacement onto the regex? I would like to replace \ with a carriage return. – Silentbob

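You can do it like this: Regex.Replace(source, "<.*?>", String.Empty).Replace("\", vbCrLf) (VB.NET string literals have no escape sequences, so "\n" would not produce a line break) –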

Cheers, I'll try that. – Silentbob
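The chained replacement discussed in these comments can be sketched as below. It assumes the goal is to turn every backslash into a line break; the function name StripAndBreak is hypothetical. Environment.NewLine is used because a VB.NET literal such as "\n" is just a backslash followed by the letter n.

```vbnet
Imports System.Text.RegularExpressions

Module ReplaceExample

    ' Hypothetical helper: strips tags, then turns every backslash into a line break.
    Function StripAndBreak(source As String) As String
        Dim stripped As String = Regex.Replace(source, "<.*?>", String.Empty)
        ' String.Replace is a plain text replacement, so the backslash needs no escaping.
        Return stripped.Replace("\", Environment.NewLine)
    End Function

    Sub Main()
        ' Prints "first" and "second" on separate lines.
        Console.WriteLine(StripAndBreak("<b>first</b>\second"))
    End Sub

End Module
```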
