2013-02-22 56 views
1

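java.lang.IllegalArgumentException when using Jsoup in Java

I wrote code to crawl a web page and save its images. For some reason I get an error that I don't know how to fix.

I use a method to make sure that every image I index actually exists, so I don't understand why this is happening.

Here is my code: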

import org.jsoup.Jsoup; 
import org.jsoup.helper.Validate; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
import java.net.*; 
import java.awt.Image; 
import java.awt.image.RenderedImage; 
import java.io.*; 

import java.io.IOException; 

import javax.imageio.ImageIO; 
import javax.imageio.ImageReader; 
import javax.imageio.stream.ImageInputStream; 

public class jsoup { 
    public static void main(String[] args) throws IOException { 
    crawl("http://www.istockphoto.com/photo"); 
} 

public static void crawl(String crawlurl) throws IOException{ 
    Document doc = Jsoup.connect(crawlurl).get(); 
    getImgFromLinks(doc); 
} 

public static void getImgFromLinks(Document doc) throws IOException{ 
    Elements links = doc.select("a[href]"); 
    //System.out.println(links); 

    for(int i=0;i<links.size();i++){ 
     if(exists(links.get(i).attr("href"))){ 
      System.out.println("crawled: " + links.get(i).attr("href")); 
      getImages(doc, links.get(i).attr("href")); 
     }else{ 
      System.out.println("I couldnt crawl: " + links.get(i).attr("href")); 
     } 
    } 
} 

public static String smartUrl(String url, String src) { 
    if(exists(src)){ 
     return(src); 
    }else{ 
     return(url + src); 
    } 
} 


public static void getImages(Document doc, String url) throws IOException{ 



     for(int i=0; i<doc.getElementsByTag("img").size();i++){ 
      Element image = doc.select("img").get(i); 
      String imgsrc = image.attr("src"); 
      if(imgsrc.toLowerCase().contains("png") || imgsrc.toLowerCase().contains("jpg") || imgsrc.toLowerCase().contains("jpeg") || imgsrc.toLowerCase().contains("gif")){ 

      int slashIndex = smartUrl(url, imgsrc).lastIndexOf('/'); 
      String finalUrl = smartUrl(url, imgsrc).substring(slashIndex); 

      URL imgurl = new URL(smartUrl(url, imgsrc)); 

      if(exists(imgurl.toString())){ 
      Image crawledimg = ImageIO.read(imgurl); 


      ImageIO.write((RenderedImage) crawledimg, "gif",new File("/Users/Jonathan/Desktop/crawledimages" + finalUrl)); 


      System.out.println("I got an image from:" + url + " Image Name: " + finalUrl); 
      } 

     } 
     } 


} 


public static boolean exists(String URLName) { 
    try { 
     HttpURLConnection.setFollowRedirects(false); 

    //HttpURLConnection.setInstanceFollowRedirects(false); 
     HttpURLConnection con = 
     (HttpURLConnection) new URL(URLName).openConnection(); 
     con.setRequestMethod("HEAD"); 
     return (con.getResponseCode() == HttpURLConnection.HTTP_OK); 
    } 
    catch (Exception e) { 
     return false; 
    } 
    } 
} 

Here is the output:

crawled: http://www.istockphoto.com/ 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /facebook.png 
I got an image from:http://www.istockphoto.com/ Image Name: /twitter.png 
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/ Image Name: /cartWhite.png 
I couldnt crawl: # 
I couldnt crawl: http://www.istockphoto.com/sign-in/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw== 
I couldnt crawl: http://www.istockphoto.com/join/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw== 
crawled: http://www.istockphoto.com/photo 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
I got an image from:http://www.istockphoto.com/photo Image Name: /facebook.png 
I got an image from:http://www.istockphoto.com/photo Image Name: /twitter.png 
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif 
Exception in thread "main" java.lang.IllegalArgumentException: im == null! 
at javax.imageio.ImageIO.write(ImageIO.java:1457) 
at javax.imageio.ImageIO.write(ImageIO.java:1527) 
at jsoup.getImages(jsoup.java:68) 
at jsoup.getImgFromLinks(jsoup.java:34) 
at jsoup.crawl(jsoup.java:24) 
at jsoup.main(jsoup.java:19) 

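The images are saved until the error occurs.

Does anyone know how to fix this?

Also, for some reason, the same images on the page are being saved multiple times.

Thanks for your time,

Jonathan Oren.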

+2

Have you tried running the code in a debugger to find out how you end up with the null value? – jtahlborn 2013-02-22 19:51:52

Answers

1

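It looks like null is being passed to ImageIO.write().

The smartUrl function has a flaw that you will need to fix. It does not build the expected URL from the image URLs found on the page.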

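For example, /static/images/cartWhite.png is converted by smartUrl to http://www.istockphoto.com/photo/static/images/cartWhite.png, which is not an image but is not an error page either. The exists() check therefore passes, but ImageIO.read() returns null when no registered reader can decode the response, so crawledimg is null and ImageIO.write() throws the IllegalArgumentException.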
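To make the failure mode concrete, here is a minimal sketch of what happens when ImageIO.read() is given data that is not an image, and of a null guard before writing. The class name ImageDecodeCheck and the helpers decodeOrNull and tinyPng are hypothetical names made up for this illustration; the in-memory byte arrays stand in for the HTTP responses so the example needs no network access.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class ImageDecodeCheck {

    // Returns the decoded image, or null when the bytes cannot be decoded as an image.
    static BufferedImage decodeOrNull(byte[] data) {
        try {
            // ImageIO.read returns null if no registered ImageReader can read the stream.
            return ImageIO.read(new ByteArrayInputStream(data));
        } catch (IOException e) {
            return null;
        }
    }

    // Builds a 1x1 PNG in memory to stand in for a real image response.
    static byte[] tinyPng() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB), "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stands in for a page that answers 200 OK but serves HTML instead of an image.
        byte[] html = "<html><body>not an image</body></html>".getBytes(StandardCharsets.UTF_8);

        BufferedImage fromHtml = decodeOrNull(html);
        BufferedImage fromPng = decodeOrNull(tinyPng());

        System.out.println("HTML decoded: " + (fromHtml != null));
        System.out.println("PNG decoded: " + (fromPng != null));

        // Guard before writing: ImageIO.write(null, ...) throws IllegalArgumentException.
        if (fromHtml == null) {
            System.out.println("Skipping: response was not a decodable image");
        }
    }
}
```

In getImages(), the same guard would be a check that crawledimg is not null before the call to ImageIO.write().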

A quick fix is to build the URL inside getImages() from the site root only, http://www.istockphoto.com, rather than from the full page URL.
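As an alternative to hard-coding the site root, the src value can be resolved against the page URL with the standard relative-reference rules. The sketch below uses java.net.URI for this; the class name ImageUrlResolver and the method resolve are hypothetical names for illustration. jsoup also offers Element.absUrl("src"), which does a similar resolution against the document's base URI.

```java
import java.net.URI;

// Hypothetical helper class for illustration only.
class ImageUrlResolver {

    // Resolves an <img src> value against the URL of the page it was found on.
    // Note: URI.create throws IllegalArgumentException for malformed input (for example, raw spaces).
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.istockphoto.com/photo";

        // A root-relative path replaces the whole path of the page URL.
        System.out.println(resolve(page, "/static/images/cartWhite.png"));
        // prints http://www.istockphoto.com/static/images/cartWhite.png

        // An absolute URL is returned unchanged.
        System.out.println(resolve(page, "http://www.istockphoto.com/static/images/cartWhite.png"));
        // prints http://www.istockphoto.com/static/images/cartWhite.png
    }
}
```

Unlike string concatenation, this does not produce http://www.istockphoto.com/photo/static/images/cartWhite.png for a root-relative src.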

The images are saved multiple times because every page contains them. You can keep a list of the images you have already saved to avoid this.
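One way to keep such a list is a Set of the image URLs already saved, checked before each download. This is only a sketch; the class name SeenImages and the method markIfNew are hypothetical names for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only.
class SeenImages {

    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a URL is offered and false on every later call,
    // because Set.add reports whether the element was newly added.
    boolean markIfNew(String imageUrl) {
        return seen.add(imageUrl);
    }

    public static void main(String[] args) {
        SeenImages saved = new SeenImages();
        String[] found = {
            "http://www.istockphoto.com/static/images/cartWhite.png",
            "http://www.istockphoto.com/static/images/cartWhite.png"
        };
        for (String url : found) {
            if (saved.markIfNew(url)) {
                System.out.println("saving " + url);
            } else {
                System.out.println("already saved " + url);
            }
        }
    }
}
```

In the crawler, the download and ImageIO.write() call would go inside the branch where markIfNew returns true.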

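I found another showstopper in the code that will prevent you from retrieving any other images from the pages you crawl: the image URLs on the site do not end in .jpg, .png, and so on. You will need to study the format of the site's image URLs before you start.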
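One possible way around the extension check, assuming the server sends a meaningful Content-Type header, is to classify a response by its media type instead of by the text of its URL. The sketch below shows only the classification step. The class name ContentTypeCheck and the method isImageContentType are hypothetical names for illustration, and the header value would come from something like URLConnection.getContentType(), which can return null.

```java
import java.util.Locale;

// Hypothetical helper class for illustration only.
class ContentTypeCheck {

    // Returns true when a Content-Type header value names an image media type.
    // Accepts null, since URLConnection.getContentType() may return null.
    static boolean isImageContentType(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    public static void main(String[] args) {
        System.out.println(isImageContentType("image/png"));                // true
        System.out.println(isImageContentType("text/html; charset=UTF-8")); // false
        System.out.println(isImageContentType(null));                       // false
    }
}
```

This is a heuristic, not a guarantee: a server can mislabel its responses, so a null check on the result of ImageIO.read() is still needed.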

+0

Thank you for your answer! I was able to fix the problem. – yonatano 2013-02-23 22:50:14