我們有一個系統在3.5版本中使用Hazelcast IExecutor Service和IMap。我們最近遇到了Hazelcast集羣成員一個接一個地出現內存不足,最後所有節點都與OOM崩潰。
在進行因果分析時,我們發現有數千個以下的日誌條目和日誌文件大小呈指數級增長。另外存在日誌的存儲空間也已經空間不足。Hazelcast羣集成員因「IsStillRunningService」對象數量龐大而導致內存不足
WARNING: [10.7.90.189]:30103 [FB] [3.5] Asking if operation execution has been started: com.hazelcast.spi.impl.operationservice.i[email protected]48b3ac3b
Mar 30, 2016 11:09:29 AM com.hazelcast.spi.impl.operationservice.impl.Invocation
WARNING: [10.7.90.189]:30103 [FB] [3.5] While asking 'is-executing': Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:core:partition
Service', partitionId=-1, callId=59834, invocationTime=1459349279980, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[1.2.3.4]:30102, b
ackupsExpected=0, backupsCompleted=0}
com.hazelcast.core.OperationTimeoutException: No response for 10000 ms. Aborting invocation! Invocation{ serviceName='hz:core:partitionService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{servic
eName='hz:core:partitionService', partitionId=-1, callId=268177, invocationTime=1459349295209, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Addr
ess[10.7.90.190]:30102, backupsExpected=0, backupsCompleted=0} No response has been received! backups-expected:0 backups-completed: 0
at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.setOperationTimeout(IsStillRunningService.java:224)
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.onFailure(IsStillRunningService.java:219)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:137)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
我的理解是,集羣成員將不斷的心跳,以確保所有成員都活着,我相信默認爲10秒。現在的問題是,如果任何成員無反應或休息狀態,其餘成員將繼續進行正在執行的呼叫。在查看堆轉儲後,知道堆中有73%的堆滿了「IsStillRunningService」對象。
問題:
- 如何去了解究竟出在哪裏?
- 用完存儲空間只是一個共同發生或可能有任何相關性?我們懷疑這可能導致其他人,因爲它在一週內發生過兩次。
Hazelcast XML配置:
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config http://www.hazelcast.com/schema/config/hazelcast-config-3.5.xsd"
xmlns="http://www.hazelcast.com/schema/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<map name="myMap">
<backup-count>0</backup-count>
<time-to-live-seconds>43200</time-to-live-seconds>
<eviction-policy>LRU</eviction-policy>
<max-size policy="USED_HEAP_PERCENTAGE">75</max-size>
<eviction-percentage>10</eviction-percentage>
<in-memory-format>OBJECT</in-memory-format>
</map>
<executor-service name="calculation">
<pool-size>10</pool-size>
<queue-capacity>400</queue-capacity>
</executor-service>
<executor-service name="loader">
<pool-size>5</pool-size>
<queue-capacity>400</queue-capacity>
</executor-service>
<properties>
<property name="hazelcast.icmp.timeout">5000</property>
<property name="hazelcast.initial.wait.seconds">10</property>
<property name="hazelcast.connection.monitor.interval">5000</property>
</properties>
<network>
<port auto-increment="true" port-count="100">30101</port>
<join>
<multicast enabled="false">
<multicast-group>224.2.2.3</multicast-group>
<multicast-port>54327</multicast-port>
</multicast>
<tcp-ip enabled="true">
<interface>1.2.3.4</interface>
<interface>1.2.3.5</interface>
<interface>1.2.3.6</interface>
</tcp-ip>
<aws enabled="false"/>
</join>
<interfaces enabled="false">
<interface>127.0.0.1</interface>
</interfaces>
</network>
</hazelcast>
StackTrace
LinkedBlockingQueue which holds IsStillRunningService Objects
謝謝。我們繼續使用3.6.2版本,並且在過去的幾周內沒有看到任何此類錯誤。 –