2016-10-22

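Scraping data in Rails using threads

I am scraping data from a website into my database in Rails. The script fetches 32,000 records without any problem, but I want to fetch the data faster, so I added threads to my rake task. Now, when I run the rake task, some of the data is scraped and then the task aborts.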

I don't know how to fix this task, and I would really appreciate any help. Here is the code of my scraping task.

task scratch_to_database: :environment do 
    time2 = Time.now 
    puts "Current Time : " + time2.inspect 
    client = Mechanize.new 
    giftcard_types=Giftcard.card_types 
    find_all_merchant=Merchant.all.pluck(:id, :name).to_h 

    #first index page of the merchant 
    index_page = client.get('https://www.twitter.com//') 
    document_page_index = Nokogiri::HTML::Document.parse(index_page.body) 
    # mark all merchants as deleted
    # set_merchant_as_deleted = Merchant.update_all(is_deleted: true) if Merchant.exists? 
    # set_giftcard_as_deleted = Giftcard.update_all(is_deleted: true) if Giftcard.exists? 
    update_all_merchant_record = [] 
    update_all_giftcard_record = [] 
    threads = [] 
    #Merchant inner page pagination loop 
    page_no_merchant = document_page_index.css('.pagination.pagination-centered ul li:nth-last-child(2) a').text.to_i 
    1.upto(page_no_merchant) do |page_number| 
     threads << Thread.new do 
     client.get("https://www.twitter.com/buy-gift-cards?page=#{page_number}") do |page| 
      document = Nokogiri::HTML::Document.parse(page.body) 

      #Generate the name of the merchant and image of the merchant loop 
      document.css('.product-source').each do |item| 
       merchant_name= item.children.css('.name').text.gsub("Gift Cards", "") 
       href = item.css('a').first.attr('href') 
       image_url=item.children.css('.img img').attr('data-src').text.strip 
        #image url to parse the url of the image 
       image_url=URI.parse(image_url) 
       #saving the record of the merchant 
       # @merchant=Merchant.create(name: merchant_name , image_url:image_url) 
        if find_all_merchant.has_value?(merchant_name) 
        puts "this if" 
        merchant_id=find_all_merchant.key(merchant_name) 
        puts merchant_id 
        else 
        @merchant= Merchant.create(name: merchant_name , image_url:image_url) 
        update_all_merchant_record << @merchant.id 
        merchant_id=@merchant.id 
        end 
       # @merchant.update_attribute(:is_deleted, false) 
       # mark all gift cards of this merchant as deleted
       # set_giftcard_as_deleted = Giftcard.where(merchant_id: @merchant.id).update_all(is_deleted: true) if Giftcard.where(merchant_id: @merchant.id).exists? 
       #first page of the giftcard details page 
       first_page = client.get("https://www.twitter.com#{href}") 
       document_page = Nokogiri::HTML::Document.parse(first_page.body) 
       page_no = document_page.css('.pagination.pagination-centered ul li:nth-last-child(2) a').text.to_i 
       hrefextra =document_page.css('.dropdown-menu li a').last.attr('href') 

       #generate the giftcard details loop with the pagination 
       # update_all_record = [] 
       find_all_giftcard=Giftcard.where(merchant_id:merchant_id).pluck(:row_id) 
       puts merchant_name 
       # puts find_all_giftcard.inspect 


        card_page = client.get("https://www.twitter.com#{hrefextra}") 
        document_page = Nokogiri::HTML::Document.parse(card_page.body) 

        #table details to generate the details of the giftcard with price ,per_off and final value of the giftcard 

        document_page.xpath('//table/tbody/tr[@class="toggle-details"]').collect do |row| 
         type1=[] 
         row_id = row.attr("id").to_i 

         row.at("td[2] ul").children.each do |typeli| 
         type = typeli.text.strip if typeli.text.strip.length != 0 
         type1 << type if typeli.text.strip.length != 0 
         end 

         value = row.at('td[3]').text.strip 
         value = value.to_s.tr('$', '').to_f 

         per_discount = row.at('td[4]').text.strip 
         per_discount = per_discount.to_s.tr('%', '').to_f 

         final_price = row.at('td[5] strong').text.strip 
         final_price = final_price.to_s.tr('$', '').to_f 

         type1.each do |type| 
          if find_all_giftcard.include?(row_id) 
           update_all_giftcard_record<<row_id 
           puts "exists" 
          else 
           puts "new" 
          @giftcard= Giftcard.create(card_type: giftcard_types.values_at(type.to_sym)[0], card_value:value, per_off:per_discount, card_price: final_price, merchant_id: merchant_id , row_id: row_id) 
          update_all_giftcard_record << @giftcard.row_id 
          end 
         end 
         #saving the record of the giftcard 
          # @giftcard=Giftcard.create(card_type:1, card_value:value, per_off:per_discount, card_price: final_price, merchant_id: @merchant.id , gift_card_type: type1) 
        end 
        # Giftcard.where(:id =>update_all_record).update_all(:is_deleted => false) 

       #delete all giftcard which is not present 
       # giftcard_deleted = Giftcard.where(:is_deleted => true,:merchant_id => @merchant.id).destroy_all if Giftcard.where(merchant_id: @merchant.id).exists? 
      time2 = Time.now 
      puts "Current Time : " + time2.inspect 
      end 
     end 
     end 
    end 
    threads.each(&:join) 
     puts "-------" 
     puts threads 
    # merchant_deleted = Merchant.where(:is_deleted => true).destroy_all if Merchant.exists? 
    merchant_deleted = Merchant.where('id NOT IN (?)',update_all_merchant_record).destroy_all if Merchant.exists? 
    giftcard_deleted = Giftcard.where('row_id NOT IN (?)',update_all_giftcard_record).destroy_all if Giftcard.exists? 
end 

The error I am getting: ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.001 seconds); all pooled connections were in use

Answer


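Each thread needs its own connection to the database. You need to increase the connection pool size (the pool setting) that your application can use in your database.yml file, so that it is at least as large as the number of threads running at once.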

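But your database must also be able to handle the incoming connections. If you are using MySQL, you can check the limit by running select @@MAX_CONNECTIONS in the MySQL console.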
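To illustrate the point about per-thread connections, here is a minimal sketch of a thread body that borrows a connection and gives it back when it is done. The helper name with_db_connection and the three placeholder pages are assumptions made for illustration; they are not part of the original task. Inside Rails the helper relies on ActiveRecord::Base.connection_pool.with_connection; outside Rails it simply runs the block so the sketch can be tried on its own.

```ruby
# Hypothetical helper (not in the original task): runs the block with a
# database connection that is returned to the pool when the block finishes.
def with_db_connection
  if defined?(ActiveRecord::Base)
    # with_connection checks a connection out for the current thread and
    # checks it back in afterwards, so finished threads free their pool slot.
    ActiveRecord::Base.connection_pool.with_connection { yield }
  else
    # Outside Rails there is no pool; just run the block.
    yield
  end
end

results = Queue.new

# Three placeholder pages stand in for 1.upto(page_no_merchant).
threads = (1..3).map do |page_number|
  Thread.new do
    with_db_connection do
      # Placeholder for the per-page scraping and Merchant/Giftcard writes.
      results << page_number
    end
  end
end

threads.each(&:join)
puts results.size
```

Even with this wrapper, the number of threads that touch the database at the same time should not exceed the pool size in database.yml.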


That command does not work in the console, abhishek. Is there any other way to speed up the scraping so that the data can be fetched in the shortest possible time?


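Otherwise, can I limit the number of threads that run at a time? That is, run 10 threads, then sleep, then create new threads. Is there any way to do this?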
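The batching idea in this comment could be sketched roughly as follows. The method name scrape_in_batches and its batch_size parameter are hypothetical, and the block stands in for the real per-page scraping. No explicit sleep is needed in this sketch, because joining a batch already waits until all of its threads have finished.

```ruby
# Hypothetical helper: runs the block for every page, but never with more
# than batch_size threads alive at the same time.
def scrape_in_batches(page_numbers, batch_size: 10, &block)
  results = Queue.new

  page_numbers.each_slice(batch_size) do |batch|
    # Start one thread per page in the current batch.
    threads = batch.map do |page_number|
      Thread.new { results << block.call(page_number) }
    end
    # Wait for the whole batch before starting the next one.
    threads.each(&:join)
  end

  # Drain the thread-safe queue into a plain array (order is not guaranteed).
  Array.new(results.size) { results.pop }
end

# Usage with a stand-in block instead of real scraping:
doubled = scrape_in_batches(1..25, batch_size: 10) { |page_number| page_number * 2 }
puts doubled.size
```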


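What you should do is create a fixed number of threads (say 10), then divide the work (the page_no_merchant pages) evenly among them. You only need to change your outer loop. Try it once and let me know if you need any help.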
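One possible reading of this suggestion, as a sketch only: split the page numbers into at most thread_count chunks and let each thread work through its own chunk. The method name scrape_with_fixed_threads and the thread_count parameter are assumptions, and the block stands in for the body of the original outer loop.

```ruby
# Hypothetical helper: divides the pages evenly among at most thread_count
# threads; each thread processes its own chunk sequentially.
def scrape_with_fixed_threads(page_numbers, thread_count: 10, &block)
  pages = page_numbers.to_a
  return [] if pages.empty?

  # Round up so that no more than thread_count chunks are produced.
  chunk_size = (pages.size / thread_count.to_f).ceil

  threads = pages.each_slice(chunk_size).map do |chunk|
    Thread.new { chunk.map { |page_number| block.call(page_number) } }
  end

  # Thread#value joins the thread and returns the value of its block.
  threads.flat_map(&:value)
end

# Usage with a stand-in block instead of real scraping:
pages_done = scrape_with_fixed_threads(1..25, thread_count: 10) { |page_number| page_number * 2 }
puts pages_done.size
```

In the rake task, 1..page_no_merchant would take the place of 1..25, and thread_count would be kept at or below the pool size configured in database.yml.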