2017-02-10 69 views
0

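My code parses data from HTML tables and then exports it to my Oracle database. For some reason, when I run the code on certain tables, I sometimes get the error message: TypeError: expecting string or bytes object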

Traceback (most recent call last): 
    File "Z:\Code\successfullest_html_code.py", line 122, in <module> 
    cursor.executemany(sql_query, exported_data) 
TypeError: expecting string or bytes object 

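On most tables my code works perfectly, and for the ones that produced this error I just entered the data by hand. But now these errors are happening more often. I just want to know why this happens on some tables and not on others that look exactly the same.

I have read that this error is raised when you try to pass something other than a string (or bytes object) to a command. But these tables are almost identical, so it confuses me why the error only shows up some of the time.

Here is my code: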

from bs4 import BeautifulSoup, NavigableString, Tag 
import pandas as pd 
import numpy as np 
import os 
import re 
import email 
import cx_Oracle 

dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole") 
con = cx_Oracle.connect(user="username", password="password$", dsn=dsnStr) 

def celltext(cell): 
    '''  
     textlist=[] 
     for br in cell.findAll('br'): 
      next = br.nextSibling 
      if not (next and isinstance(next,NavigableString)): 
       continue 
      next2 = next.nextSibling 
      if next2 and isinstance(next2,Tag) and next2.name == 'br': 
       text = str(next).strip() 
       if text: 
        textlist.append(next) 
     return (textlist) 
    ''' 
    textlist=[] 
    y = cell.find('span') 
    for a in y.childGenerator(): 
     if isinstance(a, NavigableString): 
      textlist.append(str(a)) 
    return (textlist) 

path = 'Z:\\bins_html_yes' 

for filename in os.listdir(path): 
    file_path = os.path.join(path, filename) 
    if os.path.isfile(file_path): 
     with open(file_path,'r') as w: 
      html=w.read() 
     #html=open(file_path,'r').read() 
      soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string 
      table = soup.find_all('table')[1] # Grab the second table 

df_Quota = pd.DataFrame() 

for filename in os.listdir(path): 
    file_path = os.path.join(path, filename) 
    if os.path.isfile(file_path): 
     with open(file_path, 'r') as f: 
      pattern = re.compile(r'Sent:.*?\b(\d{4})\b') 
      email = f.read() 
      dates = pattern.findall(email) 
      if dates: 
       export_year = (''.join(dates)) 
       print("export_year:", export_year) 

for row in table.find_all('tr'):  
    columns = row.find_all('td') 
    try: 
     if columns[0].get_text().strip()!='ID':# skip header 
      #print("First Column:", columns[0].get_text().strip()) 
      Quota = celltext(columns[1]) 
      Weight = celltext(columns[2]) 
      price = celltext(columns[3]) 

      Nrows= max([len(Quota),len(Weight),len(price)]) #get the max number of rows 

      IDList = [columns[0].get_text()] * Nrows 
      DateList = [columns[4].get_text()] * Nrows 

      if price[0].strip()=='Package': 
       price = [columns[3].get_text()] * Nrows 

      if len(Quota)<len(Weight):#if Quota has fewer items, extend with NaN 
       lstnans= [np.nan]*(len(Weight)-len(Quota)) 
       Quota.extend(lstnans) 

      if len(price) < len(Quota): #if price column has less items than quota column, 
       val = [columns[3].get_text()] * (len(Quota)-len(price)) #extend with 
       price.extend(val)          #whatever is in 
                     #price column 

      #if len(DateList) > len(Quota): #if DateList is longer than Quota, 
       #print("it's longer than") 
       #value = [columns[4].get_text()] * (len(DateList)-len(Quota)) 
       #DateList = value * Nrows 

      if len(Quota) < len(DateList): #if Quota is less than DateList (due to gap), 
       stu = [columns[1].get_text()] * (len(DateList)-len(Quota)) #extend with what exists 
       #stu = [np.nan]*(len(DateList)-len(Quota)) #extend with NaN 
       Quota.extend(stu) 

      if len(Weight) < len(DateList): 
       dru = [columns[2].get_text()] * (len(DateList)-len(Weight)) #extend with what exists 
       #dru = [np.nan]*(len(DateList)-len(Weight)) #extend with Nan 
       Weight.extend(dru) 

      FinalDataframe = pd.DataFrame(
      { 
      'ID':IDList,  
      'AvailableQuota': Quota, 
      'LiveWeightPounds': Weight, 
      'price':price, 
      'DatePosted':DateList 
      }) 
      #print("ID:", IDList) 
      #print("Price:", price) 

      df_Quota = df_Quota.append(FinalDataframe, ignore_index=True) 
      #df_Q = df_Quota['DatePosted'].iloc[0] #capture only most recent 
      #df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q] #date's data 
    except IndexError: 
     continue 

df_Quota['year'] = export_year 

print ("Dataframe is:", df_Quota) 

cursor = con.cursor() 
exported_data = [tuple(x) for x in df_Quota.values] 
sql_query = ("INSERT INTO FISHTABLE(species, date_posted, stock_id, pounds, advertised_price, year_posted, sector_name, ask)" "VALUES(:1, :2, :3, :4, :5, :6, 'NEFS 2', '1')") 
cursor.executemany(sql_query, exported_data) 
con.commit() #commit to database 

cursor.close() 
con.close() 

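You can ignore most of the code above; the error occurs on the line cursor.executemany(sql_query, exported_data).

Here is a table that exported successfully:

[image: table that exported successfully]

And here is a table that failed:

[image: table that failed]

Here is the printout of the dataframe (the \n entries themselves actually mess up the export, by the way):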

Dataframe is:  AvailableQuota DatePosted  ID LiveWeightPounds price year 
0   White Hake \n4/15\n \n002\n   50,000 $0.10 2015 
1   GOM COD \n3/23\n \n1493\n   3,600 $0.60 2015 
2   \nGreysole \n3/23\n \n1493\n   \n350 \n$1.25 2015 
3   GBE COD \n3/20\n \n1878\n   1,113 $0.60 2015 
4    Dabs \n3/18\n \n1043\n   3,000 $0.50 2015 
5   \nGreysole \n3/18\n \n1043\n   \n700 \n$.85 2015 
6   GOM HADD \n3/13\n \n011\n    790 $0.50 2015 
7    Dabs \n3/13\n \n370\n   2,100  $.60 2015 
8   \nGreySole \n3/13\n \n370\n   \n4,700 \n$.85 2015 
9   GOM COD \n3/13\n \n1734\n   1,900 $0.90 2015 
10  \nGOM HADD \n3/13\n \n1734\n   \n1,000 \n$1.00 2015 
11  \nGreysole \n3/13\n \n1734\n   \n3,000 \n$1.50 2015 
12   \nDabs \n3/13\n \n1734\n   \n2,700 \n$1.00 2015 
13   GBW Cod \n3/13\n \n816\n   12,000 $0.40 2015 
14   \nDabs \n3/13\n \n816\n   \n2,000 \n$0.60 2015 
15  \nGreysole \n3/13\n \n816\n   \n2,000 \n$0.90 2015 
16   GOM COD \n3/13\n \n373\n    300 $0.90 2015 
17 \nGOM YellowTail \n3/13\n \n373\n   \n3,300 \n$0.20 2015 
18  \nGOM Hadd \n3/13\n \n373\n   \n1,000 \n$0.50 2015 
19   GOM Hadd \n3/11\n \n001\n    2500 $0.40 2015 
20   GOM HADD \n3/9\n \n187\n   1,100 $0.50 2015 
21  \nGreysole  \n3/9\n \n187\n   \n900 \n$0.85 2015 
22   \nDabs \n3/9\n \n187\n   \n450 \n$0.50 2015 
23   GOM COD \n3/5\n \n255\n    500 $0.40 2015 
24  \nGOM Hadd \n3/5\n \n255\n   \n1,000 \n$0.40 2015 
25 \nGOM Yellowtail \n3/5\n \n255\n   \n3,000 \n$0.20 2015 
26   Gom Hadd \n2/12\n \n485\n   5,800 $0.40 2015 
27 \nGom Yellowtail \n2/12\n \n485\n   \n1100 \n$0.20 2015 
28   GOM HADD \n1/26\n \n314\n    439 $1.50 2015 
29 \nGOM Yellowtail \n1/26\n \n314\n   \n2,274 \n$0.25 2015 
30   GOM HADD \n1/26\n \n1610\n   2,950 $0.70 2015 
31    NaN \n1/26\n \n1610\n   \n500  \n 2015 
32    NaN \n1/26\n \n1610\n   \n2,550 \n$0.25 2015 
33 GOM Yellowtail \n1/23\n \n347\n   4,780 $0.25 2015 
34 GOM Yellowtail \n1/23\n \n802\n   2,141 $0.25 2015 
35    POLL \n12/8\n \n310B\n   65234 $0.01 2015 
36    \nRED \n12/8\n \n310B\n   \n76610 \n$0.01 2015 
37   \nSNE BB \n12/8\n \n310B\n   \n2121 \n$0.30 2015 
38   \nGOM BB \n12/8\n \n310B\n   \n7285 \n$0.05 2015 
39   GOM BB \n5/29\n \n588\n    9989 $0.10 2015 
40   \nGOM YT \n5/29\n \n588\n   \n6172 \n$0.25 2015 
41   \nPOLL \n5/29\n \n588\n   \n10314 \n$0.01 2015 
42   \nREDFISH \n5/29\n \n588\n   \n2705 \n$0.01 2015 

And here is the printout of exported_data:

[('White Hake', '\n4/15\n', '\n002\n', '50,000', '$0.10', '2015'), ('GOM COD', '\n3/23\n', '\n1493\n', '3,600', '$0.60', '2015'), ('\nGreysole', '\n3/23\n', '\n1493\n', '\n350', '\n$1.25', '2015'), ('GBE COD', '\n3/20\n', '\n1878\n', '1,113', '$0.60', '2015'), ('Dabs', '\n3/18\n', '\n1043\n', '3,000', '$0.50', '2015'), ('\nGreysole', '\n3/18\n', '\n1043\n', '\n700', '\n$.85', '2015'), ('GOM HADD', '\n3/13\n', '\n011\n', '790', '$0.50', '2015'), ('Dabs', '\n3/13\n', '\n370\n', '2,100', '$.60', '2015'), ('\nGreySole', '\n3/13\n', '\n370\n', '\n4,700', '\n$.85', '2015'), ('GOM COD', '\n3/13\n', '\n1734\n', '1,900', '$0.90', '2015'), ('\nGOM HADD', '\n3/13\n', '\n1734\n', '\n1,000', '\n$1.00', '2015'), ('\nGreysole', '\n3/13\n', '\n1734\n', '\n3,000', '\n$1.50', '2015'), ('\nDabs', '\n3/13\n', '\n1734\n', '\n2,700', '\n$1.00', '2015'), ('GBW Cod', '\n3/13\n', '\n816\n', '12,000', '$0.40', '2015'), ('\nDabs', '\n3/13\n', '\n816\n', '\n2,000', '\n$0.60', '2015'), ('\nGreysole', '\n3/13\n', '\n816\n', '\n2,000', '\n$0.90', '2015'), ('GOM COD', '\n3/13\n', '\n373\n', '300', '$0.90', '2015'), ('\nGOM YellowTail', '\n3/13\n', '\n373\n', '\n3,300', '\n$0.20', '2015'), ('\nGOM Hadd', '\n3/13\n', '\n373\n', '\n1,000', '\n$0.50', '2015'), ('GOM Hadd', '\n3/11\n', '\n001\n', '2500', '$0.40', '2015'), ('GOM HADD', '\n3/9\n', '\n187\n', '1,100', '$0.50', '2015'), ('\nGreysole ', '\n3/9\n', '\n187\n', '\n900', '\n$0.85', '2015'), ('\nDabs', '\n3/9\n', '\n187\n', '\n450', '\n$0.50', '2015'), ('GOM COD', '\n3/5\n', '\n255\n', '500', '$0.40', '2015'), ('\nGOM Hadd', '\n3/5\n', '\n255\n', '\n1,000', '\n$0.40', '2015'), ('\nGOM Yellowtail', '\n3/5\n', '\n255\n', '\n3,000', '\n$0.20', '2015'), ('Gom Hadd', '\n2/12\n', '\n485\n', '5,800', '$0.40', '2015'), ('\nGom Yellowtail', '\n2/12\n', '\n485\n', '\n1100', '\n$0.20', '2015'), ('GOM HADD', '\n1/26\n', '\n314\n', '439', '$1.50', '2015'), ('\nGOM Yellowtail', '\n1/26\n', '\n314\n', '\n2,274', '\n$0.25', '2015'), ('GOM HADD', '\n1/26\n', 
'\n1610\n', '2,950', '$0.70', '2015'), (nan, '\n1/26\n', '\n1610\n', '\n500', '\n', '2015'), (nan, '\n1/26\n', '\n1610\n', '\n2,550', '\n$0.25', '2015'), ('GOM Yellowtail', '\n1/23\n', '\n347\n', '4,780', '$0.25', '2015'), ('GOM Yellowtail', '\n1/23\n', '\n802\n', '2,141', '$0.25', '2015'), ('POLL', '\n12/8\n', '\n310B\n', '65234', '$0.01', '2015'), ('\nRED', '\n12/8\n', '\n310B\n', '\n76610', '\n$0.01', '2015'), ('\nSNE BB', '\n12/8\n', '\n310B\n', '\n2121', '\n$0.30', '2015'), ('\nGOM BB', '\n12/8\n', '\n310B\n', '\n7285', '\n$0.05', '2015'), ('GOM BB', '\n5/29\n', '\n588\n', '9989', '$0.10', '2015'), ('\nGOM YT', '\n5/29\n', '\n588\n', '\n6172', '\n$0.25', '2015'), ('\nPOLL', '\n5/29\n', '\n588\n', '\n10314', '\n$0.01', '2015'), ('\nREDFISH', '\n5/29\n', '\n588\n', '\n2705', '\n$0.01', '2015')] 

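If nothing else, the line where the error occurs really confuses me... cursor.executemany() is just supposed to execute the SQL query from the line above, right? It works on some tables but fails on others, and I really don't know why. Any help explaining and solving this is appreciated, thanks.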

+0

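Can you post the full stack trace? Also, I would suggest putting a debug print of 'exported_data' right before the call. I suspect one of the tuples in your list is malformed. – TemporalWolf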
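The debug check suggested in this comment could look roughly like the sketch below. The helper name find_non_string_fields is made up for illustration, and the two sample rows are copied from the exported_data printout in the question, with float('nan') standing in for the nan entry.

```python
def find_non_string_fields(rows):
    """Return (row_index, column_index, value) for every field that is not str or bytes."""
    problems = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not isinstance(value, (str, bytes)):
                problems.append((i, j, value))
    return problems


# Two rows from the printout in the question; float('nan') stands in for nan.
sample = [
    ('GOM HADD', '\n1/26\n', '\n1610\n', '2,950', '$0.70', '2015'),
    (float('nan'), '\n1/26\n', '\n1610\n', '\n500', '\n', '2015'),
]

for i, j, value in find_non_string_fields(sample):
    print("row", i, "column", j, "is a", type(value).__name__, repr(value))
```

Running something like this on the real exported_data just before cursor.executemany() should point at the rows whose first field is a float rather than a string.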

+0

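Sorry, forgive me for being a noob, but what is a stack trace? And I have never really used a debugger... Is it the bunch of other text that comes along with 'TypeError: expecting string or bytes object'? – theprowler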

+1

Post the whole thing (though I would replace any personal data with placeholder values like "user" and "company") – TemporalWolf

Answers

1
if len(Quota)<len(Weight): #if Quota has fewer items, extend with NaN 
    lstnans= [np.nan]*(len(Weight)-len(Quota)) 
    Quota.extend(lstnans) 

You are deliberately adding nans to your list to paper over some parsing error. The root cause lies in how Quota is built.
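One way to surface the parsing problem instead of hiding it is to fail loudly when the columns disagree in length. This is only a sketch: require_same_length is a hypothetical helper, not part of the original script, and whether raising is appropriate depends on how the tables are meant to be handled.

```python
def require_same_length(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: %r" % (lengths,))
    # All lengths are equal here; return that common length (0 if no columns).
    return next(iter(lengths.values()), 0)


try:
    # Lengths as they might come out of a badly parsed row (made-up values).
    require_same_length(Quota=['GOM HADD'], Weight=['2,950', '\n500', '\n2,550'])
except ValueError as exc:
    print("parse problem:", exc)
```

Called in place of the NaN padding, a check like this would show which row of which table came back with too few Quota entries.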

And to answer your question:

why should NaNs cause the export fail? My Oracle is set to allow Null in a cell so shouldn't it accept a NaN result?

>>> float('NaN') == None 
False 

nan is not None/Null.

+0

OK, so 'NaN' makes it fail because 'NaN' is not equal to 'Null'. Is there a way around this? Can I replace those 'NaN's with something that fits in an Oracle table cell? – theprowler

+1

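@theprowler If the table you are parsing is the one that causes the problem, then your issue is, as you mentioned, that those fields do have values in the original data. You need to figure out why they are not being parsed. You could change them to 'None', but that will not fix the parsing problem. – TemporalWolf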
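For reference, swapping the NaN values for None could be sketched as below. The helper name nan_to_none is made up, and the sample row is taken from the printout in the question with float('nan') standing in for nan. As far as I know cx_Oracle binds None as NULL, but treat that as an assumption to verify against your setup; as the comment says, this only hides the missing values and does not repair the parsing.

```python
import math


def nan_to_none(rows):
    """Return new row tuples with every float NaN replaced by None."""
    cleaned = []
    for row in rows:
        cleaned.append(tuple(
            None if isinstance(value, float) and math.isnan(value) else value
            for value in row
        ))
    return cleaned


# One failing row from the question; float('nan') stands in for nan.
rows = [(float('nan'), '\n1/26\n', '\n1610\n', '\n500', '\n', '2015')]
print(nan_to_none(rows))
```

The isinstance check is there because math.isnan() raises TypeError when handed a string, and np.nan is an ordinary Python float, so it is caught by the same test.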

+0

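Right, right. That makes sense. I don't know why it decides to fail. I think whoever creates the table each week randomly adds \n to the cells, and my code captures that instead of the string I want it to capture. It only happens sometimes, so I guess I will have to live with it, though. – theprowler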
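If stray \n strings really are what the parser is picking up, one possible cleanup is to strip each captured string and drop the ones that end up empty. The sketch below assumes a list of strings like the one celltext() returns in the question; clean_strings is a hypothetical helper, and dropping blanks is only safe if a blank never stands for a genuinely empty cell.

```python
def clean_strings(strings):
    """Strip surrounding whitespace and drop strings that are empty afterwards."""
    stripped = (s.strip() for s in strings)
    return [s for s in stripped if s]


# Made-up cell contents in the style of the printout in the question.
print(clean_strings(['GOM HADD', '\n', '\nGOM Yellowtail']))
```

Applied to the Quota, Weight and price lists before their lengths are compared, this would also remove the leading \n seen in values such as '\nGreysole'.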
