查找字符串中的重複序列

我想找到在VB.Net字符串重複序列，是這樣的：查找字符串中的重複序列

點心測試的String =「EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB」

我想要的程序來檢測在EDCRFVTGB的情況下重複序列，並計算重複次數。我的問題是要找到字符串中的重複序列，我搜索了幾種方法來做到這一點，但我沒有得到解決方案，我嘗試了快速排序算法，重複算法，但其中有幾個不能使用字符串。

我雖然關於創建子字符串並檢查它們在字符串中的存在，但我不知道如何獲得子字符串，因爲沒有字符串模式，也有可能沒有重複序列的串。

來源

2015-06-28 GoldSpy98

你的問題是要找到最長的重複子？ – 1010

你需要做這樣的代碼可以和猜測密碼的「暴力」程序相媲美，並且它將非常緩慢地運行所有可能的子字符串並檢查所有其他子字符串。 – maksymiuk

與此格式完全相同還是在重複字符串之前/之後有/未使用過的文本片段？ – maraca

你知道字符串的起始位置嗎？你知道它有多長時間嗎？

的簡單算法是：

for each character index i 
    for each character index after that j 
     compare substring(i, j-i) to substring(j, j-i) 
     if equal, record as a found repeating substring

有優化，如明知字符串不能超越字符串的結尾（中J上限），並只想找子比你長只有找到了。

這不是超高效的（N-squared），但是一個相關的廣義問題（「編輯距離」）也被認爲不比N平方好，所以你去了。

來源

2015-06-28 00:55:15

當你達到字符串長度的一半時，你也停止搜索子字符串;） – maksymiuk

我不知道字符串有多長，字符串可能是隨機字符，沒有特定的長度，可能包含空格或不包含，並且在這些隨機字符中，我需要查找是否有重複的字符序列，字符串不會超過400-500個字符。根據Andrew的說法，代碼只檢測重複序列直到字符串的一半，但是我需要找到重複序列，即使它在字符串的末尾，它也可能是最長的一個。 – GoldSpy98

安德魯建議優化;他沒有建議僞代碼描述不正確的算法。該算法是正確的，並且無論您在哪裏找到它們，都會找到長度爲1或更長的子字符串。主要優化是什麼使你從運行時O（N^3）-ish到保證O（N^2），但是按照書面的方式，它可以工作。 –

首先檢查一半目標字符串是否重複兩次。如果不是，檢查三分之一的字符串是否重複三次。如果不是，檢查四分之一的字符串是否重複四次。直到找到匹配的序列爲止。如果商數不是整數，則跳過任何除數，以使其表現更好。此代碼應達到目的並且在本說明書中沒有任何澄清填充間隙：

Public Function DetermineSequence(ByVal strTarget As String) As String 

    Dim strSequence As String = String.Empty 

    Dim intLengthOfTarget As Integer = strTarget.Length 

    'Check for a valid Target string. 
    If intLengthOfTarget > 2 Then 

     'Try 1/2 of Target, 1/3 of Target, 1/4 of Target, etc until sequence is found. 
     Dim intCursor As Integer = 2 

     Do Until strSequence.Length > 0 OrElse intCursor = intLengthOfTarget 

      'Don't even test the string if its length is not a divisor (to an Integer) of the length of the target String. 
      If IsDividendDivisibleByDivisor(strTarget.Length, intCursor) Then 

       'Get the possible sequence. 
       Dim strPossibleSequence As String = strTarget.Substring(0, (intLengthOfTarget/intCursor)) 

       'See if this possible sequence actually is the repeated String. 
       If IsPossibleSequenceRepeatedThroughoutTarget(strPossibleSequence, strTarget) Then 

        'The repeated sequence has been found. 
        strSequence = strPossibleSequence 

       End If 

      End If 

      intCursor += 1 

     Loop 

    End If 

    Return strSequence 

End Function 

Private Function IsDividendDivisibleByDivisor(ByVal intDividend As Integer, ByVal intDivisor As Integer) As Boolean 

    Dim bolDividendIsDivisbleByDivisor As Boolean = False 

    Dim intOutput As Integer 

    If Integer.TryParse((intDividend/intDivisor), intOutput) Then 

     bolDividendIsDivisbleByDivisor = True 

    End If 

    Return bolDividendIsDivisbleByDivisor 

End Function 

Private Function IsPossibleSequenceRepeatedThroughoutTarget(ByVal strPossibleSequence As String, ByVal strTarget As String) As Boolean 

    Dim bolPossibleSequenceIsRepeatedThroughoutTarget As Boolean = False 

    Dim intLengthOfTarget As Integer = strTarget.Length 
    Dim intLengthOfPossibleSequence As Integer = strPossibleSequence.Length 

    Dim bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated As Boolean = False 

    Dim intCursor As Integer = 1 

    Do Until (intCursor * intLengthOfPossibleSequence) = strTarget.Length OrElse bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated 

     If strTarget.Substring((intCursor * intLengthOfPossibleSequence), intLengthOfPossibleSequence) <> strPossibleSequence Then 

      bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated = True 

     End If 

     intCursor += 1 

    Loop 

    If Not bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated Then 

     bolPossibleSequenceIsRepeatedThroughoutTarget = True 

    End If 

    Return bolPossibleSequenceIsRepeatedThroughoutTarget 

End Function

來源

2015-06-28 02:26:13 N0Alias

這不是他想要的，看到他的問題中的評論，我現在提出了評論，以獲得更好的知名度。 – maraca

我喜歡這個，因爲它會測試一個字符串是否是一個序列，但是我不得不同意，這樣的輸出不會產生請求的結果。但我想補充一點，我認爲這非常有創意，我喜歡它的功能。 –

-1

這是產生以這樣的方式使得它們被長度和第一次出現有序增量的全部重複序列的算法。它基於一個簡單的想法：在一個句子中兩次找到一個單詞，相同的起始字母必須出現兩次。有一些解釋（算法保持不變）的Java代碼，它將輸出交織的重複，例如， BANANA => A，N，AN，NA，ANA（1,3），如果與前一個的距離小於字符串長度，則可以消除索引，以便在此算法中進行更正（下面的代碼是樣本運行，這應該更好地解釋它）：

public List<String> getRepetitions(String string) { 
    List<String> repetitions = new ArrayList<String>(); 
    Map<String, List<Integer>> rep = new HashMap<String, List<Integer>>(), repOld; 
    // init rep, add start position of all single character length strings 
    for (int i = 0; i < string.length(); i++) { 
     String s = string.substring(i, i + 1); // startIndex inclusive, endIndex exclusive 
     if (rep.containsKey(s)) { 
     rep.get(s).add(new Integer(i)); 
     } else { 
     List<Integer> l = new ArrayList<Integer>(); 
     l.add(new Integer(i)); 
     rep.put(l); 
     } 
    } 
    // eliminate those with no repetitions and add the others to the solution 
    for (Map.Entry<String, Integer> e : rep.entrySet()) { 
     if (e.getValue().size() < 2) { 
     rep.remove(e.getKey()); 
     } else { 
     repetitions.add(e.getKey()); 
     } 
    } 
    for (int len = 1; rep.size() > 0; len++) { 
     repOld = rep; 
     rep = new HashMap<String, List<Integer>>(); 
     for (Map.EntrySet<String, List<Integer>> e : repOld.entrySet()) { 
     for (Integer i : e.getValue()) { // for all start indices 
      if (i.intValue() + len + 1 >= string.length()) 
       break; 
      String s = e.getKey() + string.charAt(i.intValue() + len + 1); 
      if (rep.containsKey(s)) { 
       rep.get(s).add(i); 
      } else { 
       List<Integer> l = new ArrayList<Integer>(); 
       l.add(i); 
       rep.put(l); 
      } 
     } 
     } 
     // eliminate repetitions and add to solution 
     for (Map.Entry<String, Integer> e : rep.entrySet()) { 
     if (e.getValue().size() < 2) { 
      rep.remove(e.getKey()); 
     } else { 
      repetitions.add(e.getKey()); 
     } 
     } 
    } 
    return repetitions; // ordered by length, so last = longest 
}

樣品運行香蕉：

單個字母添加到rep =>乙 - > [0]，A - > [1，3，5] ，N - > [2,4]
消除少於2次出現的那些（B），將其他添加到溶液（A，N）
add下列信剩餘occurances（創建新rep）：AN - > [1，3]，NA - > [2,4]
消除（ - ），並添加（AN，NA）
重複步驟3和4。：ANA - > [1,3]
在週期rep將成爲空的，並且該算法結束

來源

2015-06-28 03:58:57 maraca

由於這個問題是要求VB.Net解決方案 –

@PaulIshak在這裏算法重要...不是語言，我敢肯定這是最高性能的方法。我低估了你的意思，因爲你沒有解釋任何事情......把它做成你想要的......好極了，他會從你身上學到很多東西。 – maraca

如果他需要更多的信息，從經驗，我知道他會問。我的命名約定解釋了這一切，如果他們不這樣做，我會按他的要求澄清。 –

這裏是一個例子，這將允許用戶指定的序列和返回的最小和最大長度一個名爲sequence的自定義類的列表，其中有多個事件。序列類將包含找到的模式和模式出現的索引列表。

enter image description here

Option Strict On 
Option Explicit On 
Option Infer Off 
Public Class Form1 
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click 

     ListView1.Items.Clear() 
     ListView1.Columns.Clear() 
     ListView1.Columns.Add("Sequence") 
     ListView1.Columns.Add("Indexes of occurrence") 
     Dim sequences As List(Of Sequence) = DetectSequences("EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB") 
     For Each s As Sequence In sequences 
      Dim item As New ListViewItem(s.Sequence) 
      item.Tag = s 
      item.SubItems.Add(s.IndexesToString) 
      ListView1.Items.Add(item) 
     Next 
     ListView1.AutoResizeColumns(ColumnHeaderAutoResizeStyle.HeaderSize) 
    End Sub 
    Function DetectSequences(s As String, Optional minLength As Integer = 5, Optional MaxLength As Integer = 8) As List(Of Sequence) 
     Dim foundPatterns As New List(Of String) 
     Dim foundSequences As New List(Of Sequence) 
     Dim potentialPattern As String = String.Empty, potentialMatch As String = String.Empty 
     For start As Integer = 0 To s.Length - 1 
      For length As Integer = 1 To s.Length - start 
       potentialPattern = s.Substring(start, length) 
       If potentialPattern.Length < minLength Then Continue For 
       If potentialPattern.Length > MaxLength Then Continue For 
       If foundPatterns.IndexOf(potentialPattern) = -1 Then 
        foundPatterns.Add(potentialPattern) 
       End If 
      Next 
     Next 
     For Each pattern As String In foundPatterns 
      Dim sequence As New Sequence With {.Sequence = pattern} 
      For start As Integer = 0 To s.Length - pattern.Length 
       Dim length As Integer = pattern.Length 
       potentialMatch = s.Substring(start, length) 
       If potentialMatch = pattern Then 
        sequence.Indexes.Add(start) 
       End If 
      Next 
      If sequence.Indexes.Count > 1 Then foundSequences.Add(sequence) 
     Next 
     Return foundSequences 
    End Function 
    Public Class Sequence 
     Public Sequence As String = "" 
     Public Indexes As New List(Of Integer) 
     Public Function IndexesToString() As String 
      Dim sb As New System.Text.StringBuilder 
      For i As Integer = 0 To Indexes.Count - 1 
       If i = Indexes.Count - 1 Then 
        sb.Append(Indexes(i).ToString) 
       Else 
        sb.Append(Indexes(i).ToString & ", ") 
       End If 
      Next 
      Return sb.ToString 
     End Function 
    End Class 
    Private Sub ListView1_SelectedIndexChanged(sender As Object, e As EventArgs) Handles ListView1.SelectedIndexChanged 
     If ListView1.SelectedItems.Count = 0 Then Exit Sub 
     RichTextBox1.Clear() 
     RichTextBox1.Text = "EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB" 
     Dim selectedSequence As Sequence = DirectCast(ListView1.SelectedItems(0).Tag, Sequence) 
     For Each i As Integer In selectedSequence.Indexes 
      RichTextBox1.SelectionStart = i 
      RichTextBox1.SelectionLength = selectedSequence.Sequence.Length 
      RichTextBox1.SelectionBackColor = Color.Red 
     Next 
    End Sub 
End Class

來源

2015-06-28 17:53:45

查找字符串中的重複序列

回答

相關問題