2016-07-28 73 views
0

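Parsing table data with Java

I want to extract some HTML data from a page's source. For reference, the page is http://www.4icu.org/reviews/index2.htm (use view-source to see the HTML). How can I extract the university names and the country names with Java? I know how to extract the university names, but how can I speed the program up by scanning only the table cells with class="i", and also extract the country (e.g. United States) from <...... alt="United States" />?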

<tr> 
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2> 
</tr> 

<tr> 
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td> 
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td> 
</tr> 

Thanks in advance.

EDIT: Following what @11thdimension suggested, here is my Java file.

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives me the following error.

Started 
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm 

EDIT2: I created the following program to get the HTTP response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.4icu.org/reviews/index2.htm");
        URLConnection connection = url.openConnection();

        // Header name -> values; the HTTP status line is stored under the null key
        Map<String, List<String>> responseMap = connection.getHeaderFields();
        for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
            System.out.println(entry.getKey() + " = ");

            for (String value : entry.getValue()) {
                System.out.println(value + ", ");
            }
        }
    }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 

I can get the headers, but how should I combine the code from EDIT and EDIT2 into one complete program? Thanks.

+0

Do you need to do this once, or will it be a repetitive task? – 11thdimension

+0

How long does the solution need to remain valid? – 11thdimension

+0

I have edited the question to narrow down my problem. Thanks –

Answers

1

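If this is going to be a one-time task, you should use JavaScript for it.

The following code logs the required names to the console. You have to run it in the browser console.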

(function() { 
    var a = []; 
    document.querySelectorAll("td.i a").forEach(function (anchor) { a.push(anchor.textContent.trim());}); 

    console.log(a.join("\n")); 
})(); 

The following is a Java example using Jsoup selectors.

Maven dependency

<dependencies> 
    <dependency> 
     <groupId>org.jsoup</groupId> 
     <artifactId>jsoup</artifactId> 
     <version>1.8.3</version> 
    </dependency> 
</dependencies> 

Java code

import java.io.File; 
import java.util.Iterator; 

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class TestJsoup { 
    public static void main(String[] args) throws Exception { 
     System.out.println("Started"); 

     File file = new File("A-Z list of 11930 World Colleges & Universities.html"); 
     Document doc = Jsoup.parse(file, "UTF-8"); 

     Elements cells = doc.select("td.i"); 

     Iterator<Element> iterator = cells.iterator(); 

     while (iterator.hasNext()) { 
      Element cell = iterator.next(); 
      String university = cell.select("a").text(); 
      String country = cell.nextElementSibling().select("img").attr("alt"); 

      System.out.printf("country : %s, university : %s %n", country, university); 
     } 
    } 
} 
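
To read the page straight from the URL instead of a saved file, Jsoup.connect can send its own User-Agent. A User-Agent set on a separate URLConnection has no effect on Jsoup, because Jsoup.connect opens its own connection, and Jsoup.parse(File, ...) only reads local files. The following is a minimal sketch, assuming a browser-like User-Agent is enough for the server to stop answering 403 (not verified against the site). The class name UniversityScraper, the helper extract, the User-Agent string, and the timeout are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UniversityScraper {

    // Hypothetical helper: collects {university, country} pairs from the document.
    static List<String[]> extract(Document doc) {
        List<String[]> rows = new ArrayList<>();
        for (Element cell : doc.select("td.i")) {
            Element next = cell.nextElementSibling();
            if (next == null) {
                continue; // skip rows that have no country cell
            }
            String university = cell.select("a").text();
            String country = next.select("img").attr("alt");
            rows.add(new String[] { university, country });
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.4icu.org/reviews/index2.htm";

        // Assumption: the server rejects Jsoup's default User-Agent,
        // so a browser-like one is sent on Jsoup's own request.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(10000) // milliseconds
                .get();

        for (String[] row : extract(doc)) {
            System.out.printf("country : %s, university : %s %n", row[1], row[0]);
        }
    }
}
```

Keeping the extraction in its own method means the same code works on a Document built from a saved file or from an HTML string, so the selectors can be checked without hitting the site. If the server still answers 403, it may be checking more than the User-Agent.
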
+0

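Thanks. The program will be run many times, because the index number in the HTTP link has to change. I'm just curious how I can use Java to grab the country data from alt="United States". Thanks –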

+0

Added code to extract the country. – 11thdimension

+0

Thanks for your help. However, when I put the link http://www.4icu.org/reviews/index2.htm in place of "A-Z list of 11930 World Colleges & Universities.html", it gives me: Exception in thread "main" java.io.FileNotFoundException: www.4icu.org\reviews\index2.htm. I have revised my question to make it clearer. –