2017-10-18

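We have a class that broadcasts a message to a Service Fabric stateless service. This stateless service has a single partition but many replicas. The message should be sent to all replicas in the system, so we query the FabricClient for the single partition and for all replicas of that partition. We use standard HTTP communication (the stateless service has a communication listener with a self-hosted OWIN listener, using WebListener/HttpSys) and a shared HttpClient instance. During a load test we get many errors while sending the message. Note that other services in the same application also communicate (WebListener/HttpSys, ServiceProxy and ActorProxy). The error during the load test is a FabricTransientException: "Could not ping any of the provided Service Fabric gateway endpoints."

The code where we see the exception is below (the stack trace follows the code sample):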

private async Task SendMessageToReplicas(string actionName, string message)
{
    var fabricClient = new FabricClient();
    var eventNotificationHandlerServiceUri = new Uri(ServiceFabricSettings.EventNotificationHandlerServiceName);

    var promises = new List<Task>();
    // There is only one partition of this service, but there are many replicas
    Partition partition = (await fabricClient.QueryManager.GetPartitionListAsync(eventNotificationHandlerServiceUri).ConfigureAwait(false)).First();

    string continuationToken = null;
    do
    {
        var replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id, continuationToken).ConfigureAwait(false);
        foreach (Replica replica in replicas)
        {
            promises.Add(SendMessageToReplica(replica, actionName, message));
        }

        continuationToken = replicas.ContinuationToken;
    } while (continuationToken != null);

    await Task.WhenAll(promises).ConfigureAwait(false);
}


private async Task SendMessageToReplica(Replica replica, string actionName, string message)
{
    if (replica.TryGetEndpoint(out Uri replicaUrl))
    {
        Uri requestUri = UriUtility.Combine(replicaUrl, actionName);
        using (var response = await _httpClient.PostAsync(requestUri, message == null ? null : new JsonContent(message)).ConfigureAwait(false))
        {
            string responseContent = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            if (!response.IsSuccessStatusCode)
            {
                throw new Exception();
            }
        }
    }
    else
    {
        throw new Exception();
    }
}

The following exception is thrown:

System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C49 
at System.Fabric.Interop.NativeClient.IFabricQueryClient9.EndGetPartitionList2(IFabricAsyncOperationContext context) 
at System.Fabric.FabricClient.QueryClient.GetPartitionListAsyncEndWrapper(IFabricAsyncOperationContext context) 
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously) 
--- End of inner exception stack trace --- 
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 
at Company.ServiceFabric.ServiceFabricEventNotifier.<SendMessageToReplicas>d__7.MoveNext() in c:\work\ServiceFabricEventNotifier.cs:line 138 

During the same period we also see this exception being thrown:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.) ---> System.ComponentModel.Win32Exception (0x80004005): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full 
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection) 
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) 
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry) 
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry) 
at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken) 

The event logs on the machines in the cluster show these warnings:

Event ID: 4231 
Source: Tcpip 
Level: Warning 
A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use. 

Event ID: 4227 
Source: Tcpip 
Level: Warning 
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint. 

Finally, the Microsoft-Service-Fabric admin log shows hundreds of warnings similar to:

Event 4121 
Source Microsoft-Service-Fabric 
Level: Warning 
client-02VM4.company.nl:19000/192.168.10.36:19000: error = 2147942452, failureCount=160522. Filter by (type~Transport.St && ~"(?i)02VM4.company.nl:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting. 

Event 4097 
Source Microsoft-Service-Fabric 
Level: Warning 
client-02VM4.company.nl:19000 : connect failed, having tried all addresses 

After a while, the warnings turn into errors:

Event 4096 
Source Microsoft-Service-Fabric 
Level: Error 
client-02VM4.company.nl:19000 failed to bind to local port for connecting: 0x80072747 

Can anyone tell us why this is happening, and what we can do to fix it? Are we doing something wrong?

Answers

1

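We (I work with the OP) have been testing this, and it turned out to be the FabricClient, as Esben Bach suggested.

The FabricClient documentation also states:

It is highly recommended that you share FabricClients as much as possible. This is because the FabricClient has multiple optimizations such as caching and batching that you would not be able to fully utilize otherwise.

It looks like FabricClient behaves like the HttpClient class: you should share the instance, and when you don't, you run into the same problem, port exhaustion.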

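However, the documentation on working with FabricClient common exceptions also mentions that when a FabricObjectClosedException occurs, you should:

Dispose of the FabricClient object you are using and instantiate a new FabricClient object.

Sharing the FabricClient fixes the port exhaustion problem.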
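A minimal sketch of what sharing could look like, assuming a small wrapper class. SharedFabricClient and QueryAsync are hypothetical names, not part of the Service Fabric SDK or of our actual code. It keeps one FabricClient for the whole process and only replaces it after a FabricObjectClosedException, following the documentation quoted above. It retries the query once and does not protect other callers that are still using the old client at the moment it is disposed.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (an assumption, not SDK code): one FabricClient per process.
public sealed class SharedFabricClient
{
    private FabricClient _client = new FabricClient();

    public async Task<T> QueryAsync<T>(Func<FabricClient, Task<T>> query)
    {
        FabricClient client = Volatile.Read(ref _client);
        try
        {
            return await query(client).ConfigureAwait(false);
        }
        catch (FabricObjectClosedException)
        {
            // Quoted guidance: dispose the closed client and create a new one.
            var fresh = new FabricClient();
            if (Interlocked.CompareExchange(ref _client, fresh, client) == client)
            {
                client.Dispose();
            }
            else
            {
                fresh.Dispose(); // another caller already swapped in a new client
            }

            // Retry once with whichever client is current now.
            return await query(Volatile.Read(ref _client)).ConfigureAwait(false);
        }
    }
}

// Possible usage inside SendMessageToReplicas, replacing "new FabricClient()":
//
//   private static readonly SharedFabricClient Fabric = new SharedFabricClient();
//   ...
//   var partitions = await Fabric
//       .QueryAsync(c => c.QueryManager.GetPartitionListAsync(serviceUri))
//       .ConfigureAwait(false);
```

If recreating the client is not a concern, a plain static readonly FabricClient field is the simplest variant of the same idea.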

1

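It looks like you have a port exhaustion problem. Assuming that is the case, you either have to figure out how to reuse your connections, or you will have to implement some kind of throttling mechanism so that you don't use up all available ports.

I don't know how the fabric client behaves. It could be responsible for the exhaustion, or it could be the SQL Server part, for which we can't see the code (but since you only posted it among the logs, I assume it may well be unrelated to your ping test).

Looking at the reference source for HttpWebResponse (https://github.com/Microsoft/referencesource/blob/master/System/net/System/Net/HttpWebResponse.cs), it could also be that disposing the response (i.e. the using statement around your PostAsync call) closes the HttpClient's connection. That would mean you are not reusing connections but opening new ones all the time.

I guess it would be fairly easy to test a variant that does not dispose your HttpWebResponse.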
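As a rough illustration of the throttling option, here is a sketch that caps the number of sends in flight with a SemaphoreSlim. Throttle and ForEachAsync are hypothetical names, and the right value for maxConcurrency is an assumption you would have to find by measuring. It only limits concurrency; it does not by itself make connections reusable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs send(item) for every item with a bounded
// number of operations in flight at any time.
public static class Throttle
{
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> send)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Wait for a free slot before starting the send.
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await send(item).ConfigureAwait(false);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            // WhenAll completes only after every task has finished,
            // so the semaphore is no longer in use when it is disposed.
            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}

// Possible usage with the code from the question (32 is an arbitrary limit):
//
//   await Throttle.ForEachAsync(replicas, 32,
//       replica => SendMessageToReplica(replica, actionName, message));
```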

+0

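Indeed, we think this is a port exhaustion problem. Not disposing the HttpWebResponses seems strange, though. We will try some other variants to determine whether the problem lies in the Service Fabric client or in the HttpClient usage, as you suggest.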

+0

Agreed, it seems strange, but looking at the reference source it does seem to close the connection group during close: ConnectStream connectStream = m_ConnectStream as ConnectStream; if (connectStream != null && connectStream.Connection != null) { connectStream.Connection.ServicePoint.CloseConnectionGroup(ConnectionGroupName); } So maybe

1

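What is the reason for calling every existing service instance?

Normally you should call only one service instance, supplied by the SF runtime (it will try to pick one from the same node/process, or from another node if this node is overloaded).

If you need to signal some kind of state change/event to all service instances, this should probably be done inside the service implementation, so that it checks for this state change (possibly in a stateful service) or a pub-sub event queue each time it needs this information (see https://github.com/loekd/ServiceFabric.PubSubActors).

Another idea is to send many messages to a service instance at once, through a separate operation that supports batched data.

If you have to send single messages from a single source at a high frequency, then keeping the connection open, as described in the previous answer, is a good solution.

Also, the caller should implement connection resiliency; see for example https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication#communicating-with-a-service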

+0

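We are looking into using Actor Events to implement this piece of communication. However, even though this may not be the best solution, we expected that it should technically work.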