2016-09-07 94 views

BeautifulSoup: not JSON serializable

I have this code, written by someone else for Python 2, and I am converting it to Python 3:

url = self.lodestone_url + '/topics/'
r = self.make_request(url)

news = []
soup = bs4.BeautifulSoup(r.content)
for tag in soup.select('.news__content__list__topics li'):
    entry = {}
    title_tag = tag.select('.ic_topics a')[0]
    script = str(tag.select('script')[0])
    entry['timestamp'] = int(re.findall(r"1[0-9]{9},", script)[0].rstrip(','))
    entry['link'] = '//' + self.lodestone_domain + title_tag['href']
    entry['id'] = entry['link'].split('/')[-1]
    entry['title'] = title_tag.string.strip()
    body = tag.select('.news__content__list__topics--body')[0]
    for a in body.findAll('a'):
        if a['href'].startswith('/'):
            a['href'] = '//' + self.lodestone_domain + a['href']
    print(type(body))
    entry['body'] = body.encode('utf-8').strip()
    #entry['body'] = ""
    entry['lang'] = 'en'
    news.append(entry)

The last piece I can't figure out is this line from the code above:

 entry['body'] = body.encode('utf-8').strip() 

because it gives this error:

Traceback (most recent call last):
  File "lodestoner", line 48, in <module>
    print(json.dumps(ret, indent=4))
  File "/usr/local/lib/python3.5/json/__init__.py", line 237, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.5/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/local/lib/python3.5/json/encoder.py", line 427, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/local/lib/python3.5/json/encoder.py", line 324, in _iterencode_list
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 403, in _iterencode_dict
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 436, in _iterencode
    o = _default(o)
  File "/usr/local/lib/python3.5/json/encoder.py", line 180, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'<div class="news__content__list__topics--body"><a class="news__content__list__topics__link_banner" href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38"><img alt="" height="149" src="http://img.finalfantasyxiv.com/t/f05649918007c827f44000ef5462461cec1e8b38.png?1473152734" width="570"/></a>FINAL FANTASY XIV will be attending Tokyo Game Show 2016 at Makuhari Messe in Chiba in full force, and we\xe2\x80\x99ll be a larger than Hydaelyn presence as we\xe2\x80\x99ll be occupying space at our own Square Enix booth as well as the Intel booth! Additionally, we\xe2\x80\x99ll be broadcasting the next Letter from the Producer LIVE straight from the show floor, so be sure to mark your calendars as this is the second part of the Patch 3.4 special which you won\xe2\x80\x99t want to miss!<br><br><a href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38" rel="f05649918007c827f44000ef5462461cec1e8b38">Read on</a> for more details.</br></br></div>' 
is not JSON serializable 

In the code above, the body variable is of type <class 'bs4.element.Tag'>.

So when I remove the encode part, so that the line looks like this:

 entry['body'] = body.strip() 

then I get this error:

TypeError: 'NoneType' object is not callable 

What am I missing? In most cases like this, removing encode has worked.

Do you just want 'entry["body"]' to hold the text content of the news entry? i.e. '"FINAL FANTASY XIV will be attending Tokyo Game Show..."' – SuperShoot

@SuperShoot Yes, I think that was the original author's intent. – Zeno

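As the script stands, you are calling '.encode("utf-8").strip()' on an instance of a bs4 'Tag' object, but those are string operations. Try 'unicode(body.string)': according to the docs, that returns a unicode representation of any text in the tag. – SuperShoot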

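Answer

The original author is not extracting the text; they are dumping the HTML content. To do the same with Python 3 you need to pass a str to json.dumps rather than bytes: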

In [10]: soup = BeautifulSoup("<div>foo</div>","html.parser") 

In [11]: print(json.dumps(soup.div.encode("utf-8"))) 
..................................... 

/usr/lib/python3.5/json/encoder.py in default(self, o) 
    177 
    178   """ 
--> 179   raise TypeError(repr(o) + " is not JSON serializable") 
    180 
    181  def encode(self, o): 

TypeError: b'<div>foo</div>' is not JSON serializable 

In [12]: print(json.dumps(str(soup.div.encode("utf-8"),"utf-8"))) 
"<div>foo</div>" 

Which is exactly what you get with Python 2:

In [4]: soup = BeautifulSoup("<div>foo</div>","html.parser") 

In [5]: print(json.dumps(soup.div.encode("utf-8"))) 
"<div>foo</div>"
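
Applied to the loop in the question, one way this could look is sketched below. The sample HTML and the .body class are made up for illustration, and str(body) is used instead of the encode/decode round trip on the assumption that the markup is wanted as text. The last lines also show why body.strip() fails: a Tag has no strip method, so bs4 treats the attribute access as a search for a child <strip> tag and returns None, and calling None raises the 'NoneType' object is not callable error.

```python
import json

import bs4

# Hypothetical stand-in for one news entry body from the real page.
html = '<div class="body"><a href="/lodestone/topics">Read on</a> for more details.</div>'
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.select(".body")[0]

# Tag.encode() returns bytes, which json.dumps() rejects on Python 3.
as_bytes = body.encode("utf-8")

# str(tag) returns the same markup as a str, which json.dumps() accepts.
as_str = str(body).strip()

entry = {"body": as_str}
dumped = json.dumps(entry)
print(dumped)

# Attribute access on a Tag looks for a child tag of that name;
# there is no <strip> element, so this is None rather than a method.
missing = body.strip
print(missing)
```

If only the readable text is wanted rather than the HTML, body.get_text() would be the call to look at instead of str(body).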