We have three services that have to run in a cluster, so we use Infinispan to cluster the nodes and share data between these services. After a successful restart, I sometimes get the exception below and the other nodes receive a "view changed" event, even though all nodes are actually running. I cannot figure out the cause. org.infinispan.util.concurrent.TimeoutException: Replication timeout for "node name"

I am using an Infinispan 8.1.3 distributed cache with JGroups 3.4.
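
For context, a minimal sketch of how such a setup is usually wired together is shown below: an embedded Infinispan configuration that points the transport at the jgroups.xml posted further down and declares one of the caches named in the logs. This is an assumption for illustration only; the schema version, cluster name and cache mode are guesses, and only the cache name mediaProxyResponseCache comes from the log output.

    <!-- Illustrative sketch, not the actual configuration used -->
    <infinispan xmlns="urn:infinispan:config:8.1">
        <jgroups>
            <!-- transport stack taken from the jgroups.xml shown below -->
            <stack-file name="tcp" path="jgroups.xml"/>
        </jgroups>
        <cache-container>
            <!-- cluster name matches the channel "ISPN" seen in the logs -->
            <transport stack="tcp" cluster="ISPN"/>
            <!-- cache name appears in the logs; the mode is an assumption -->
            <distributed-cache name="mediaProxyResponseCache" mode="SYNC"/>
        </cache-container>
    </infinispan>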

org.infinispan.util.concurrent.TimeoutException: Replication timeout for sipproxy-16964 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765) 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:599) 
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport$$Lambda$9/1547262581.apply(Unknown Source) 
      at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:717) 
      at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193) 
      at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2345) 
      at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46) 
      at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17) 
      at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) 
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 
    2017-08-22 04:44:52,902 INFO [JGroupsTransport] (ViewHandler,ISPN,transport_manager-48870) ISPN000094: Received new cluster view for channel ISPN: [transport_manager-48870|3] (2) [transport_manager-48870, mediaproxy-47178] 
    2017-08-22 04:44:52,949 WARN [PreferAvailabilityStrategy] (transport-thread-transport_manager-p4-t24) ISPN000313: Cache mediaProxyResponseCache lost data because of abrupt leavers [sipproxy-16964] 
    2017-08-22 04:44:52,951 WARN [ClusterTopologyManagerImpl] (transport-thread-transport_manager-p4-t24) ISPN000197: Error updating cluster member list 
    java.lang.IllegalArgumentException: There must be at least one node with a non-zero capacity factor 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.checkCapacityFactors(DefaultConsistentHashFactory.java:57) 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:74) 
      at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:26) 
      at org.infinispan.topology.ClusterCacheStatus.updateCurrentTopology(ClusterCacheStatus.java:431) 
      at org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy.onClusterViewChange(PreferAvailabilityStrategy.java:56) 
      at org.infinispan.topology.ClusterCacheStatus.doHandleClusterView(ClusterCacheStatus.java:337) 
      at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:397) 
      at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:314) 
      at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:571) 
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
      at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
      at java.lang.Thread.run(Thread.java:745) 

jgroups.xml:

<config xmlns="urn:org:jgroups" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd"> 
    <TCP bind_addr="131.10.20.16" 
     bind_port="8010" port_range="10" 
     recv_buf_size="20000000" 
     send_buf_size="640000" 
     loopback="false" 
     max_bundle_size="64k" 
     bundler_type="old" 
     enable_diagnostics="true" 
     thread_naming_pattern="cl" 
     timer_type="new" 
     timer.min_threads="4" 
     timer.max_threads="30" 
     timer.keep_alive_time="3000" 
     timer.queue_max_size="100" 
     timer.wheel_size="200" 
     timer.tick_time="50" 
     thread_pool.enabled="true" 
     thread_pool.min_threads="2" 
     thread_pool.max_threads="30" 
     thread_pool.keep_alive_time="5000" 
     thread_pool.queue_enabled="true" 
     thread_pool.queue_max_size="100" 
     thread_pool.rejection_policy="discard" 

     oob_thread_pool.enabled="true" 
     oob_thread_pool.min_threads="2" 
     oob_thread_pool.max_threads="30" 
     oob_thread_pool.keep_alive_time="5000" 
     oob_thread_pool.queue_enabled="false" 
     oob_thread_pool.queue_max_size="100" 
     oob_thread_pool.rejection_policy="discard"/> 
     <TCPPING initial_hosts="131.10.20.16[8010],131.10.20.17[8010],131.10.20.182[8010]" port_range="2" 
     timeout="3000" num_initial_members="3" /> 

    <MERGE3 max_interval="30000" 
      min_interval="10000"/> 

    <FD_SOCK/> 
    <FD_ALL interval="3000" timeout="10000" /> 
    <VERIFY_SUSPECT timeout="500" /> 
    <BARRIER /> 
    <pbcast.NAKACK use_mcast_xmit="false" 
        retransmit_timeout="100,300,600,1200" 
        discard_delivered_msgs="true" /> 
    <UNICAST3 conn_expiry_timeout="0"/> 

    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" 
        max_bytes="10m"/> 
    <pbcast.GMS print_local_addr="true" join_timeout="5000" 
       max_bundling_time="30" 
       view_bundling="true"/> 
    <UFC max_credits="2M" 
     min_threshold="0.4"/> 
    <MFC max_credits="2M" 
     min_threshold="0.4"/> 
    <FRAG2 frag_size="60000" /> 
    <pbcast.STATE_TRANSFER/> 
</config> 

Answers

The TimeoutException only says that a response to an RPC was not received within the timeout, nothing more. This can happen when the server is under stress, but that is probably not the case here: the log below shows that the node was "suspected", meaning the node was likely unresponsive for more than 10 seconds (the configured limit, see FD_ALL).
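
For reference, the two timeouts involved are sketched below: the FD_ALL line mirrors the one in the posted jgroups.xml (a node that stays silent longer than the timeout gets suspected), while the remote-timeout attribute on the cache definition is an illustrative placeholder, since the actual cache configuration was not posted. Raising either value only hides the symptom when the real cause is a long pause.

    <!-- jgroups.xml: failure detection; a silent node is suspected after "timeout" ms -->
    <FD_ALL interval="3000" timeout="10000"/>

    <!-- Infinispan cache definition (placeholder values): how long a synchronous RPC
         waits before the replication TimeoutException above is thrown -->
    <distributed-cache name="mediaProxyResponseCache" mode="SYNC"
                       remote-timeout="15000"/>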

First, check the logs on that server for errors, and check the GC logs for any stop-the-world pauses.

OK, thanks. I will check whether there was a GC at that time. –

You were right! A full GC caused this :) –

As @flavius suggested, the main reason is that one of your nodes stopped for some reason and failed to reply to the RPC.

I suggest changing the JGroups logging level so that you can see why a node was suspected (this can be triggered by the FD_SOCK or FD_ALL protocols) and why it was removed from the view (most likely by the VERIFY_SUSPECT protocol).
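
For example, assuming log4j2 is the logging backend your services use (an assumption; adjust to whatever framework is actually configured), the relevant categories could be raised like this:

    <!-- Sketch, assuming a log4j2 backend; the logger names are the real JGroups protocol classes -->
    <Configuration>
        <Appenders>
            <Console name="CONSOLE" target="SYSTEM_OUT">
                <PatternLayout pattern="%d %-5p [%c{1}] (%t) %m%n"/>
            </Console>
        </Appenders>
        <Loggers>
            <!-- TRACE on failure detection and membership shows why a node
                 was suspected and why it was dropped from the view -->
            <Logger name="org.jgroups.protocols.FD_ALL" level="TRACE"/>
            <Logger name="org.jgroups.protocols.FD_SOCK" level="TRACE"/>
            <Logger name="org.jgroups.protocols.VERIFY_SUSPECT" level="TRACE"/>
            <Logger name="org.jgroups.protocols.pbcast.GMS" level="TRACE"/>
            <Root level="INFO">
                <AppenderRef ref="CONSOLE"/>
            </Root>
        </Loggers>
    </Configuration>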

You can also check why that happened. In most cases it is caused by a long GC pause, but your VM might also have been paused by the host for other reasons. I suggest running jHiccup on both machines, attached as a Java agent to your processes. That way you should be able to tell whether the stop-the-world pause was caused by the JVM or by the operating system.

OK, thanks. I will try this. –

You were right! A full GC caused it :) –

I'm glad you found it! – altanis