
Answers

34

This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

$upperBound = 50MB # PowerShell expands this to 52428800 bytes 
$ext = "log" 
$rootName = "log_" 

$reader = new-object System.IO.StreamReader("C:\Exceptions.log") 
$count = 1 
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext) 
while(($line = $reader.ReadLine()) -ne $null) 
{ 
    Add-Content -path $fileName -value $line 
    # roll over to the next indexed file once the current one reaches the limit 
    if((Get-ChildItem -path $fileName).Length -ge $upperBound) 
    { 
     ++$count 
     $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext) 
    } 
} 

$reader.Close() 
+1

This is exactly what I was looking for; thanks for confirming my hunch that Get-Content isn't great for large files. – 2009-06-16 19:53:25

+3

Useful tip: you can express numbers like this: $upperBound = 5MB – Lee 2009-06-16 20:02:01

+3

For those too lazy to read the next answer, you can set up the $reader object with $reader = new-object System.IO.StreamReader($inputFile) – lmsurprenant 2011-07-14 12:27:03

15

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.

############################################################################## 
#.SYNOPSIS 
# Breaks a text file into multiple text files in a destination, where each 
# file contains a maximum number of lines. 
# 
#.DESCRIPTION 
# When working with files that have a header, it is often desirable to have 
# the header information repeated in all of the split files. Split-File 
# supports this functionality with the -rc (RepeatCount) parameter. 
# 
#.PARAMETER Path 
# Specifies the path to an item. Wildcards are permitted. 
# 
#.PARAMETER LiteralPath 
# Specifies the path to an item. Unlike Path, the value of LiteralPath is 
# used exactly as it is typed. No characters are interpreted as wildcards. 
# If the path includes escape characters, enclose it in single quotation marks. 
# Single quotation marks tell Windows PowerShell not to interpret any 
# characters as escape sequences. 
# 
#.PARAMETER Destination 
# (Or -d) The location in which to place the chunked output files. 
# 
#.PARAMETER Count 
# (Or -c) The maximum number of lines in each file. 
# 
#.PARAMETER RepeatCount 
# (Or -rc) Specifies the number of "header" lines from the input file that will 
# be repeated in each output file. Typically this is 0 or 1 but it can be any 
# number of lines. 
# 
#.EXAMPLE 
# Split-File bigfile.csv 3000 -rc 1 
# 
#.LINK 
# Out-TempFile 
############################################################################## 
function Split-File { 

    [CmdletBinding(DefaultParameterSetName='Path')] 
    param(

     [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] 
     [String[]]$Path, 

     [Alias("PSPath")] 
     [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)] 
     [String[]]$LiteralPath, 

     [Alias('c')] 
     [Parameter(Position=2,Mandatory=$true)] 
     [Int32]$Count, 

     [Alias('d')] 
     [Parameter(Position=3)] 
     [String]$Destination='.', 

     [Alias('rc')] 
     [Parameter()] 
     [Int32]$RepeatCount 

    ) 

    process { 

     # yeah! the cmdlet supports wildcards 
     if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} } 
     elseif ($Path) { $ResolveArgs = @{Path=$Path} } 

     Resolve-Path @ResolveArgs | %{ 

      $InputName = [IO.Path]::GetFileNameWithoutExtension($_) 
      $InputExt = [IO.Path]::GetExtension($_) 

      if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount } 

      # get the input file in manageable chunks 

      $Part = 1 
      Get-Content $_ -ReadCount:$Count | %{ 

       # make an output filename with a suffix 
       $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt)) 

       # The header ends up in the first output file as part of the 
       # first chunk; for subsequent parts we have to write it explicitly 
       if ($RepeatCount -and $Part -gt 1) { 
        Set-Content $OutputFile $Header 
       } 

       # write this chunk to the output file 
       Write-Host "Writing $OutputFile" 
       Add-Content $OutputFile $_ 

       $Part += 1 

      } 

     } 

    } 

} 
2

I made some modifications to split files based on the size of each part instead.

############################################################################## 
#.SYNOPSIS 
# Breaks a text file into multiple text files in a destination, where each 
# file contains a maximum number of lines. 
# 
#.DESCRIPTION 
# When working with files that have a header, it is often desirable to have 
# the header information repeated in all of the split files. Split-File 
# supports this functionality with the -rc (RepeatCount) parameter. 
# 
#.PARAMETER Path 
# Specifies the path to an item. Wildcards are permitted. 
# 
#.PARAMETER LiteralPath 
# Specifies the path to an item. Unlike Path, the value of LiteralPath is 
# used exactly as it is typed. No characters are interpreted as wildcards. 
# If the path includes escape characters, enclose it in single quotation marks. 
# Single quotation marks tell Windows PowerShell not to interpret any 
# characters as escape sequences. 
# 
#.PARAMETER Destination 
# (Or -d) The location in which to place the chunked output files. 
# 
#.PARAMETER Size 
# (Or -s) The maximum size of each file. Size must be expressed in MB. 
# 
#.PARAMETER RepeatCount 
# (Or -rc) Specifies the number of "header" lines from the input file that will 
# be repeated in each output file. Typically this is 0 or 1 but it can be any 
# number of lines. 
# 
#.EXAMPLE 
# Split-File bigfile.csv -s 20 -rc 1 
# 
#.LINK 
# Out-TempFile 
############################################################################## 
function Split-File { 

    [CmdletBinding(DefaultParameterSetName='Path')] 
    param(

     [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] 
     [String[]]$Path, 

     [Alias("PSPath")] 
     [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)] 
     [String[]]$LiteralPath, 

     [Alias('s')] 
     [Parameter(Position=2,Mandatory=$true)] 
     [Int32]$Size, 

     [Alias('d')] 
     [Parameter(Position=3)] 
     [String]$Destination='.', 

     [Alias('rc')] 
     [Parameter()] 
     [Int32]$RepeatCount 

    ) 

    process { 

    # yeah! the cmdlet supports wildcards 
     if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} } 
     elseif ($Path) { $ResolveArgs = @{Path=$Path} } 

     Resolve-Path @ResolveArgs | %{ 

      $InputName = [IO.Path]::GetFileNameWithoutExtension($_) 
      $InputExt = [IO.Path]::GetExtension($_) 

      if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount } 

      # get the input file in manageable chunks 

      $Part = 1 
      $buffer = "" 

      Get-Content $_ -ReadCount:1 | %{ 

       # accumulate lines; CRLF keeps the output readable on Windows 
       $buffer += $_ + "`r`n" 

       # dump the buffer to a part file once it exceeds the requested size 
       if ($buffer.Length -gt ($Size * 1MB)) { 

        # make an output filename with a suffix 
        $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt)) 

        # the header has to be repeated explicitly in every part 
        # after the first 
        if ($RepeatCount -and $Part -gt 1) { 
         Set-Content $OutputFile $Header 
        } 

        # write this chunk to the output file 
        Write-Host "Writing $OutputFile" 
        Add-Content $OutputFile $buffer 

        $Part += 1 
        $buffer = "" 
       } 
      } 

      # flush whatever is left in the buffer into a final part file 
      if ($buffer.Length -gt 0) { 
       $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt)) 
       if ($RepeatCount -and $Part -gt 1) { 
        Set-Content $OutputFile $Header 
       } 
       Write-Host "Writing $OutputFile" 
       Add-Content $OutputFile $buffer 
      } 

     } 

    } 

} 
14

I found this while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object, and I changed null to $null.

$reader = new-object System.IO.StreamReader("C:\Contacts.vcf") 
$count = 1 
$filename = "C:\Contacts\{0}.vcf" -f ($count) 

while(($line = $reader.ReadLine()) -ne $null) 
{ 
    Add-Content -path $filename -value $line 

    if($line -eq "END:VCARD") 
    { 
     ++$count 
     $filename = "C:\Contacts\{0}.vcf" -f ($count) 
    } 
} 

$reader.Close() 
38

A word of warning about some of the existing answers: they will run very slowly for really big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.

Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for new lines also slows things down, but my guess is that Add-Content is the main culprit.
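One way around the first issue is to keep a single StreamWriter open per output file instead of letting Add-Content reopen the target for every line; the StreamReader/StreamWriter answer further down does exactly this. A minimal sketch of the idea (the path and chunk size here are placeholders):

$reader = new-object System.IO.StreamReader("C:\temp\large_log.txt") 
$maxLines = 1000000 # placeholder chunk size 
$count = 0 
$part = 1 
$writer = new-object System.IO.StreamWriter("C:\temp\large_log_part$part.txt") 
while(($line = $reader.ReadLine()) -ne $null) 
{ 
    $writer.WriteLine($line) 
    $count++ 
    # roll over to a new writer once the chunk is full 
    if($count -ge $maxLines) 
    { 
     $writer.Close() 
     $part++ 
     $writer = new-object System.IO.StreamWriter("C:\temp\large_log_part$part.txt") 
     $count = 0 
    } 
} 
$writer.Close() 
$reader.Close() 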

The following variant produces slightly less pleasant output (it will split files in the middle of lines), but it splits my 1.6 GB log in under a minute:

$from = "C:\temp\large_log.txt" 
$rootName = "C:\temp\large_log_chunk" 
$ext = "txt" 
$upperBound = 100MB 


$fromFile = [io.file]::OpenRead($from) 
$buff = new-object byte[] $upperBound 
$count = $idx = 0 
try { 
    do { 
     "Reading $upperBound" 
     $count = $fromFile.Read($buff, 0, $buff.Length) 
     if ($count -gt 0) { 
      $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext) 
      $toFile = [io.file]::OpenWrite($to) 
      try { 
       "Writing $count to $to" 
       $tofile.Write($buff, 0, $count) 
      } finally { 
       $tofile.Close() 
      } 
     } 
     $idx ++ 
    } while ($count -gt 0) 
} 
finally { 
    $fromFile.Close() 
} 
3

There's also this quick (and somewhat dirty) one-liner:

$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$i++; $linecount=0 } } 

You can tweak the number of lines per batch by changing the hard-coded 3000 value.
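For example, one way to parameterise it, pulling the batch size out front (same logic, and the 3000 is still arbitrary):

$batch = 3000; $linecount = 0; $i = 0 
Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq $batch) { $i++; $linecount = 0 } } 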

2

Do it like this:

FILE 1

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII 

FILE 2

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII 

FILE 3

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII 

And so on...
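The Select -Skip/-First pattern generalises to a loop; a rough sketch (the chunk size is a placeholder, and note that every pass re-reads the source file from the beginning, so this scales poorly as the number of chunks grows):

$src = "C:\TEMP\DATA\split\splitme.txt" 
$chunk = 5000 
$total = (Get-Content $src).Count 
for ($i = 0; $i * $chunk -lt $total; $i++) { 
    Get-Content $src | Select -Skip ($i * $chunk) | Select -First $chunk | 
     Out-File "C:\temp\file$($i + 1).txt" -Encoding ASCII 
} 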

22

A simple one-liner to split based on line count (100 in this case). With -ReadCount 100, each $_ coming down the pipeline is an array of up to 100 lines, which Out-File writes in one go:

$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt} 
5

Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.

I found some of the previous answers that use Add-Content to be quite slow; waiting a long time for the split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it looks to only split by file size, not by line count.

The following suited my purposes.

$sw = new-object System.Diagnostics.Stopwatch 
$sw.Start() 
Write-Host "Reading source file..." 
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql") 
$totalLines = $lines.Length 

Write-Host "Total Lines :" $totalLines 

$skip = 0 
$count = 100000; # Number of lines per file 

# File counter, with sort friendly name 
$fileNumber = 1 
$fileNumberString = $filenumber.ToString("000") 

# Use -lt, not -le: with -le, a bogus extra file would be written when 
# the line count is an exact multiple of $count 
while ($skip -lt $totalLines) { 
    $upper = $skip + $count - 1 
    if ($upper -gt ($lines.Length - 1)) { 
     $upper = $lines.Length - 1 
    } 

    # Write the lines 
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)]) 

    # Increment counters 
    $skip += $count 
    $fileNumber++ 
    $fileNumberString = $filenumber.ToString("000") 
} 

$sw.Stop() 

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds" 

For a 54 MB file, I get the output...

Reading source file... 
Total Lines : 910030 
Split complete in 1.7056578 seconds 

Hopefully others looking for a simple, line-count-based splitting script that matches my requirements will find this useful.

20

Same idea as the other answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.

Note: I do very little error checking, so I can't guarantee it will work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split into files of 100,000 lines each, in 95 seconds).

#split test 
$sw = new-object System.Diagnostics.Stopwatch 
$sw.Start() 
$filename = "C:\Users\Vincent\Desktop\test.txt" 
$rootName = "C:\Users\Vincent\Desktop\result" 
$ext = ".txt" 

$linesperFile = 100000 # 100k 
$filecount = 1 
$reader = $null 
try{ 
    $reader = [io.file]::OpenText($filename) 
    try{ 
     "Creating file number $filecount" 
     $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext)) 
     $filecount++ 
     $linecount = 0 

     while($reader.EndOfStream -ne $true) { 
      "Reading $linesperFile" 
      while(($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){ 
       $writer.WriteLine($reader.ReadLine()); 
       $linecount++ 
      } 

      if($reader.EndOfStream -ne $true) { 
       "Closing file" 
       $writer.Dispose(); 

       "Creating file number $filecount" 
       $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext)) 
       $filecount++ 
       $linecount = 0 
      } 
     } 
    } finally { 
     $writer.Dispose(); 
    } 
} finally { 
    $reader.Dispose(); 
} 
$sw.Stop() 

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds" 

Output splitting a 1.7 GB file:

... 
Creating file number 45 
Reading 100000 
Closing file 
Creating file number 46 
Reading 100000 
Closing file 
Creating file number 47 
Reading 100000 
Closing file 
Creating file number 48 
Reading 100000 
Split complete in 95.6308289 seconds 
0

My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data, and they are really big, so I need to split them into manageable parts (while preserving the header row).

So, I reverted to my classic VBScript approach and bashed together a small .vbs script that can run on any Windows computer (it gets executed automatically by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and doesn't need much memory to run. The test file I split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines): the processing took about 2 minutes and never used more than 3 MB of memory.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF), as the text stream object uses the ReadLine function to process one line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

Option Explicit 

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt" 
Private Const REPEAT_HEADER_ROW = True     
Private Const LINES_PER_PART = 500000     

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart 

sStart = Now() 

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1) 
iLineCounter = 0 
iOutputFile = 1 

Set oFileSystem = CreateObject("Scripting.FileSystemObject") 
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False) 
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True) 

If REPEAT_HEADER_ROW Then 
    iLineCounter = 1 
    sHeaderLine = oInputFile.ReadLine() 
    Call oOutputFile.WriteLine(sHeaderLine) 
End If 

Do While Not oInputFile.AtEndOfStream 
    sLine = oInputFile.ReadLine() 
    Call oOutputFile.WriteLine(sLine) 
    iLineCounter = iLineCounter + 1 
    If iLineCounter Mod LINES_PER_PART = 0 Then 
     iOutputFile = iOutputFile + 1 
     Call oOutputFile.Close() 
     Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True) 
     If REPEAT_HEADER_ROW Then 
      Call oOutputFile.WriteLine(sHeaderLine) 
     End If 
    End If 
Loop 

Call oInputFile.Close() 
Call oOutputFile.Close() 
Set oFileSystem = Nothing 

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now()) 
0

Sounds like a job for the UNIX split command:

split MyBigFile.csv 

It split my 55 GB csv file into 21k chunks in less than 10 minutes.

It's not native PowerShell, though, but it comes with, for example, the git for Windows package: https://git-scm.com/download/win
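By default split produces fixed-size pieces; assuming the usual GNU coreutils flags, the chunk size can also be chosen explicitly by lines or by bytes, with an output-name prefix as the second argument:

split -l 100000 MyBigFile.csv part_  # 100,000 lines per piece 
split -b 50m MyBigFile.csv part_  # roughly 50 MB per piece 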

0

Since lines can be variable-length in logs, I thought it best to take a lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:

$sourceFile = "c:\myfolder\mylargeTextyFile.csv" 
$partNumber = 1 
$batchSize = 500000 
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv" 

[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one 

$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None") 
$streamIn=New-Object System.IO.StreamReader($fs, $enc) 
$streamout = new-object System.IO.StreamWriter $pathAndFilename 

$line = $streamIn.readline() 
$counter = 0 
while ($line -ne $null) 
{ 
    $streamout.writeline($line) 
    $counter +=1 
    if ($counter -eq $batchsize) 
    { 
     $partNumber+=1 
     $counter =0 
     $streamOut.close() 
     $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv" 
     $streamout = new-object System.IO.StreamWriter $pathAndFilename 

    } 
    $line = $streamIn.readline() 
} 
$streamin.close() 
$streamout.close() 

This can easily be turned into a function or a script file with parameters to make it more versatile. It uses a StreamReader and a StreamWriter to achieve its speed and tiny memory footprint.
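As a rough sketch of what that parameterisation might look like (the function name, parameters and defaults here are illustrative, not part of the original answer):

function Split-TextFile { 
    param( 
     [Parameter(Mandatory=$true)][string]$SourceFile, 
     [int]$BatchSize = 500000 
    ) 
    # derive part-file names from the source name, e.g. "foo part 2.csv" 
    $dir = [IO.Path]::GetDirectoryName($SourceFile) 
    $name = [IO.Path]::GetFileNameWithoutExtension($SourceFile) 
    $ext = [IO.Path]::GetExtension($SourceFile) 
    $partNumber = 1 
    $counter = 0 
    $streamIn = new-object System.IO.StreamReader($SourceFile) 
    $streamOut = new-object System.IO.StreamWriter((Join-Path $dir "$name part $partNumber$ext")) 
    while (($line = $streamIn.ReadLine()) -ne $null) 
    { 
     $streamOut.WriteLine($line) 
     $counter++ 
     if ($counter -eq $BatchSize) 
     { 
      $streamOut.Close() 
      $partNumber++ 
      $counter = 0 
      $streamOut = new-object System.IO.StreamWriter((Join-Path $dir "$name part $partNumber$ext")) 
     } 
    } 
    $streamIn.Close() 
    $streamOut.Close() 
} 

# e.g. Split-TextFile -SourceFile "c:\myfolder\mylargeTextyFile.csv" -BatchSize 250000 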

0

Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1,000 lines each. It's not quick, but it does the job.

$infile = "D:\Malcolm\Test\patch6.txt" 
$path = "D:\Malcolm\Test\" 
$lineCount = 0 
$fileCount = 1 

foreach ($computername in get-content $infile) 
{ 
    # brace the first variable name: a bare $path_$fileCount would make 
    # PowerShell look for an (undefined) variable called $path_ 
    write $computername | out-file -Append "${path}$fileCount.txt" 
    $lineCount++ 

    # count from 0 and reset to 0 so each file gets exactly 1000 lines 
    if ($lineCount -eq 1000) 
    { 
     $fileCount++ 
     $lineCount = 0 
    } 
}