,如果你寫你的文件時利用Python的CSV庫這將有助於。這可以將項目列表轉換爲正確的逗號分隔值。
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
myUrl = 'https://www.youtube.com/user/HolaSoyGerman/about'
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("h1",{"class":"branded-page-header-title"})
filename = "Products2.csv"
with open(filename, "w", newline='') as f:
csv_output = csv.writer(f)
headers = ["Channel Name", "Channel Description", "Channel Social Media Links"]
csv_output.writerow(headers)
channel_name = containers[0].a.text
print("Channel Name :" + channel_name)
# For About Section Info
aboutUrl = 'https://www.youtube.com/user/HolaSoyGerman/about'
uClient1 = uReq(aboutUrl)
page_html1 = uClient1.read()
uClient1.close()
page_soup1 = soup(page_html1, "html.parser")
description_div = page_soup.findAll("div",{"class":"about-description branded-page-box-padding"})
channel_description = description_div[0].pre.text
print("Channel Description :" + channel_description)
links = [link.a.get('href') for link in page_soup.findAll("li",{"class":"channel-links-item"})]
csv_output.writerow([channel_name, channel_description, links[0]])
for link in links[1:]:
csv_output.writerow(['', '', link])
這會給你一個單行每個在最後一列的HREFs,例如:
Channel Name,Channel Description,Channel Social Media Links
HolaSoyGerman.,Los Hombres De Verdad Usan Pantuflas De Perrito,http://www.twitter.com/germangarmendia
,,http://instagram.com/germanchelo
,,http://www.youtube.com/juegagerman
,,http://www.youtube.com/juegagerman
,,http://www.twitter.com/germangarmendia
,,http://instagram.com/germanchelo
,,https://plus.google.com/108460714456031131326
每個writerow()
通話將寫值的列表,以該文件爲逗號分隔值並在最後自動爲你添加換行符。所需要的就是構建每行的值列表。首先將您的鏈接的第一個,並在您的頻道描述後,使其成爲列表中的最後一個值。其次,爲前兩列有空值的剩餘鏈接寫一行。
爲了回答您的評論,下面應該讓你開始:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
def get_data(url, csv_output):
if not url.endswith('/about'):
url += '/about'
print("URL: {}".format(url))
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("h1", {"class":"branded-page-header-title"})
channel_name = containers[0].a.text
print("Channel Name :" + channel_name)
description_div = page_soup.findAll("div", {"class":"about-description branded-page-box-padding"})
channel_description = description_div[0].pre.text
print("Channel Description :" + channel_description)
links = [link.a.get('href') for link in page_soup.findAll("li", {"class":"channel-links-item"})]
csv_output.writerow([channel_name, channel_description, links[0]])
for link in links[1:]:
csv_output.writerow(['', '', link])
#TODO - get list of links for the related channels
return related_links
my_url = 'https://www.youtube.com/user/HolaSoyGerman'
filename = "Products2.csv"
with open(filename, "w", newline='') as f:
csv_output = csv.writer(f)
headers = ["Channel Name", "Channel Description", "Channel Social Media Links"]
csv_output.writerow(headers)
for _ in range(5):
next_links = get_data(my_url, csv_output)
my_url = next_links[0] # e.g. follow the first of the related links
你不包括換行符f.write後'(CHANNEL_NAME + 「」 + CHANNEL_DESCRIPTION)',所以當然第一行將會進一步結束。另外請注意'',「+」,「==」,,「',並且CSV模塊支持從序列中寫入,而不是自己添加逗號。 – jonrsharpe
那麼我怎樣才能實現這個請給我例子。我希望社交媒體鏈接不以逗號分隔。我希望在社交媒體鏈接中的所有鏈接應寫入社交媒體鏈接列的新下一行單元 –