2014-09-19 69 views

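Parsing a text file in Java to get a HashMap of fields

I am trying to parse multiple files and split each of them into a set of fields stored in a HashMap. Here is a sample file.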

COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS 

    ROTTERDAM, March 18 - Contract terms for trade in coconut 
oil are to be changed from long tons to tonnes with effect from 
the Aug/Sep contract onwards, Dutch vegetable oil traders said. 
    Operators have already started to take account of the 
expected change and reported at least one trade in tonnes for 
Aug/Sept shipment yesterday. 

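I need the program to parse this document into a custom Document class that has fields for key, file ID, title, place, date, author, content, and category.

This is what I have tried.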

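// Requires: java.io.File, java.io.FileInputStream, java.io.IOException, java.io.InputStream 
public static Document parse(String filename) throws IOException { 

     File f = new File(filename); 

     if (f.isFile()) { 

      // File ID: the file name without its extension (the full name if there is none) 
      String fileId = filename; 
      if (filename.indexOf(".") > 0) { 
       fileId = filename.substring(0, filename.lastIndexOf(".")); 
      } 
      // Category: the parent directory of the file 
      String category = f.getParent(); 

      // try-with-resources closes the stream even if read() throws 
      try (InputStream in = new FileInputStream(f)) { 
       byte[] buf = new byte[1024]; 
       int len; 
       // Read the next chunk on every iteration until the stream is exhausted 
       while ((len = in.read(buf)) > 0) { 
        .......... 
       } 
      } 
     } 

     return null; 
    } 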

I'm sorry, what are you trying to accomplish here? :O – 2014-09-19 19:18:44


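Well, it's a start, but it will be hard to carry on in the same way. If I were you, I would stop writing code now and first work out the high-level steps that need to be taken. Write those steps down on a piece of paper: '1. Read the file completely into a String. 2. Extract the file title...' and so on. Then you can start coding step by step, testing the result after each step. – biziclop 2014-09-19 19:20:17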
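
The first two steps from that comment can be sketched roughly as follows. This is only an illustration: the class name `ParseSteps`, the helper names, the UTF-8 encoding, and the rule that the title is the first non-blank line are all assumptions, not something stated in the thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
class ParseSteps {

    // Step 1: read the file completely into a String (UTF-8 is an assumption).
    static String readWholeFile(String filename) throws IOException {
        return new String(Files.readAllBytes(Paths.get(filename)), StandardCharsets.UTF_8);
    }

    // Step 2: extract the title, assumed here to be the first non-blank line.
    static String extractTitle(String text) {
        for (String line : text.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return ""; // no non-blank line found
    }

    public static void main(String[] args) throws IOException {
        String text = readWholeFile(args[0]);
        System.out.println("Title: " + extractTitle(text));
    }
}
```

Each step is a small method, so it can be tested on its own before the next one is written.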

Answer

The following code may help you:

// Requires: java.io.BufferedReader, java.io.FileInputStream, java.io.InputStreamReader 
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("myFile.txt")))) { 
     StringBuilder contentBuffer = new StringBuilder(); 
     String line; 
     boolean foundTitle = false; 
     boolean foundPlaceAndDate = false; 
     String date = ""; 
     while ((line = br.readLine()) != null) { 
      String trimmed = line.trim(); 
      if (line.matches("^[a-zA-Z0-9].*") && !foundTitle) { 
       // If the line starts with a letter or number and no title has been found yet, that's the title 
       System.out.println("Title: " + line); 
       foundTitle = true; 
      } else if (line.matches("^[ \t].*") && !foundPlaceAndDate) { 
       // If the line starts with a space or tab and it's our first paragraph, then this paragraph has the place and date 
       System.out.println("Place: " + trimmed.substring(0, trimmed.indexOf(","))); 
       date = trimmed.substring(trimmed.indexOf(",") + 1, trimmed.indexOf("-")).trim(); 
       System.out.println("Date: " + date); 
       foundPlaceAndDate = true; 
      } 
      // Join non-blank lines with a single space so words at line ends do not run together 
      if (!trimmed.isEmpty()) { 
       contentBuffer.append(trimmed).append(' '); 
      } 
     } 

     // The content is everything after the date and the " -" separator that follows it 
     String joined = contentBuffer.toString(); 
     String content = joined.substring(joined.indexOf(date) + date.length() + 2).trim(); 
     System.out.println("Content: " + content); 
    } catch (Exception e) { 
     System.err.println("Oh no! I got the following error: " + e.getMessage()); 
    } 

The output will be:

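Title: COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

Place: ROTTERDAM

Date: March 18

Content: Contract terms for trade in coconut oil are to be changed from long tons to tonnes with effect from the Aug/Sep contract onwards, Dutch vegetable oil traders said. Operators have already started to take account of the expected change and reported at least one trade in tonnes for Aug/Sept shipment yesterday.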
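
Since the question asks for the fields in a map rather than printed, here is a rough alternative sketch that returns them as a `Map<String, String>`. It swaps the `indexOf` arithmetic for a regular expression. The class name `DatelineParser`, the map keys, and the assumed layout (title, a blank line, then "PLACE, Date - body") are illustrative assumptions based on the one sample file, so other files may need a different pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the layout: title, blank line, "PLACE, Date - body text".
class DatelineParser {

    // group(1) = place, group(2) = date, group(3) = the rest of the text
    private static final Pattern DATELINE =
            Pattern.compile("^\\s*([^,\\n]+),\\s*([^-\\n]+?)\\s+-\\s+(.*)$", Pattern.DOTALL);

    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Split on the first blank line: title above it, body below it.
        String[] parts = text.trim().split("\\r?\\n[ \\t]*\\r?\\n", 2);
        fields.put("title", parts[0].trim());
        if (parts.length < 2) {
            return fields; // no body found
        }
        Matcher m = DATELINE.matcher(parts[1]);
        if (m.matches()) {
            fields.put("place", m.group(1).trim());
            fields.put("date", m.group(2).trim());
            // Collapse line breaks and indentation into single spaces.
            fields.put("content", m.group(3).replaceAll("\\s+", " ").trim());
        }
        return fields;
    }
}
```

If the dateline does not match the assumed pattern, only the title is filled in, so the caller can detect files that need special handling.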


This does get me started, but I need to parse the file into the Document class, which looks like this: 'public class Document { ... public Document() { map = new HashMap<FieldNames, String[]>(); } public void setField(FieldNames fn, String... o) { map.put(fn, o); } public String[] getField(FieldNames fn) { return map.get(fn); } }' – 2014-09-19 19:52:27


Now you just need to fill in the fields of the Document class. For example: 'Document document = new Document(); document.setField("title", title);' – shimatai 2014-09-22 18:10:59
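
A minimal sketch of how that could look with the Document class from the comments. `FieldNames` is never shown in the thread, so the enum and its constants below are assumptions, as are `DocumentDemo` and `build`. Note that `setField` takes a `FieldNames` value rather than a plain string such as "title".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum: the thread never shows FieldNames, so these constants are assumed.
enum FieldNames { FILEID, CATEGORY, TITLE, PLACE, DATE, AUTHOR, CONTENT }

// Document class following the shape given in the comments.
class Document {
    private final Map<FieldNames, String[]> map = new HashMap<>();

    public void setField(FieldNames fn, String... o) {
        map.put(fn, o);
    }

    public String[] getField(FieldNames fn) {
        return map.get(fn);
    }
}

// Hypothetical demo class showing how already-parsed values could be stored.
class DocumentDemo {
    static Document build(String title, String place, String date, String content) {
        Document document = new Document();
        document.setField(FieldNames.TITLE, title);
        document.setField(FieldNames.PLACE, place);
        document.setField(FieldNames.DATE, date);
        document.setField(FieldNames.CONTENT, content);
        return document;
    }

    public static void main(String[] args) {
        Document d = build("COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS",
                "ROTTERDAM", "March 18", "Contract terms for trade in coconut oil ...");
        System.out.println("Title: " + d.getField(FieldNames.TITLE)[0]);
    }
}
```

Fields that were never set, such as the author in this sample, come back as null from `getField`, so callers should check for that.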