Splitting an XML document by predicted byte size

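I have an XML message, xmlStr, that must be split into smaller XML messages, each no larger than maxSizeBytes. This is done by taking the document's root and its first child as the base of each smaller XML, and placing some number of Smt elements into the newly formed (smaller) XML message.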

<?xml version="1.0"?> 
<Bas> 
    <Hdr> 
    <Smt>...</Smt> 
    <Smt>...</Smt> 
    <Smt>...</Smt> 
    </Hdr> 
</Bas>

Currently I measure the size of the whole message, int smtNodesPerMessage = (int)Math.Ceiling((double)ASCIIEncoding.ASCII.GetByteCount(xmlStr)/(double)maxSizeBytes);, and then split the Smt nodes into smaller XML messages based on smtNodesPerMessage:

//doc is original XDocument message 
XDocument splitXML = new XDocument(new XElement(doc.Root.Name,            
            doc.Root.Descendants("Hdr"))); 
splitXML.Root.Add(batchOfSmt);

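I quickly discovered that the byte size of the smaller XML documents is larger than maxSizeBytes, because XDocument adds extra characters to each message, which increases the byte size.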


2017-04-10 newprint

Interesting. Let us know how you go – MickyD

The code is probably adding the XML declaration to every message: <?xml version="1.0"?> – jdweng

@jdweng, I do set 'splitXML.Declaration = doc.Declaration;', but it is not in the code above. – newprint

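The basic algorithm is:

1. Get the size of the document with an empty Hdr element.
2. Clone this empty-Hdr document for each sub-document.
3. For each Smt element, check before adding it whether the sub-document size would exceed the maximum.

Note that the default XML encoding is UTF-8. I use Encoding.Default.GetByteCount to calculate the size of the document and of its elements; on .NET Framework, Encoding.Default is the system ANSI code page rather than UTF-8, so Encoding.UTF8 is the safer choice there.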

Code with comments:

using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var doc = XDocument.Load("data.xml");
var hdr = doc.Root.Element("Hdr"); // the header element that holds the Smt children
var elements = hdr.Elements().ToList();
hdr.RemoveAll(); // we can remove child elements, because they are stored in a list
hdr.Value = ""; // otherwise the empty element is serialized as <Hdr />

// calculating size of sub-document 'template' 
var sb = new StringBuilder(); 
using (XmlWriter writer = XmlWriter.Create(sb)) 
    doc.Save(writer); 
var outerSizeInBytes = Encoding.Default.GetByteCount(sb.ToString()); 

var maxSizeInBytes = 100; 
var subDocumentIndex = 0; // used just for naming sub-document files 
var subDocumentSizeBytes = outerSizeInBytes; // initial size of any sub-document 
var subDocument = new XDocument(doc); // clone 'template' 

foreach (var smt in elements) 
{ 
    var currentElementSizeBytes = Encoding.Default.GetByteCount(smt.ToString()); 

    if (maxSizeInBytes < subDocumentSizeBytes + currentElementSizeBytes
        && subDocumentSizeBytes != outerSizeInBytes) // case when first element is too big
    {
        subDocument.Save($"doc{++subDocumentIndex}.xml");
        subDocument = new XDocument(doc);
        subDocumentSizeBytes = outerSizeInBytes;
    }

    subDocument.Root.Element("Hdr").Add(smt); 
    subDocumentSizeBytes += currentElementSizeBytes; 
} 

// if current sub-document has elements added, save it too 
if (outerSizeInBytes < subDocumentSizeBytes) 
    subDocument.Save($"doc{++subDocumentIndex}.xml");

For the following source and a maximum size of 250 bytes, you get three documents:

<?xml version="1.0"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt> 
    <Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt> 
    <Smt>It has survived not only five centuries, 
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt> 
    <Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt> 
    </Hdr> 
</Bas>

doc1 (223 bytes):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt> 
    <Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt> 
    </Hdr> 
</Bas>

doc2 (259 bytes, single element):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>It has survived not only five centuries, 
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt> 
    </Hdr> 
</Bas>

doc3 (128 bytes, the last one):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt> 
    </Hdr> 
</Bas>
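
The predicted sizes can be checked against what is actually written by serializing a sub-document into memory and reading the stream length. This is a minimal sketch, not part of the original answer: XmlSizeProbe and SerializedSizeInBytes are hypothetical names, and it assumes UTF-8 output with indentation and without a byte-order mark (a UTF-8 BOM, if the real output has one, adds three bytes).

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlSizeProbe
{
    // Hypothetical helper: returns the number of bytes the document occupies
    // when serialized as indented UTF-8 without a byte-order mark.
    public static long SerializedSizeInBytes(XDocument document)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // false = do not emit a BOM
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            // Disposing the writer flushes it; the stream stays open because
            // XmlWriterSettings.CloseOutput is false by default.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }
            return stream.Length;
        }
    }
}
```

Calling XmlSizeProbe.SerializedSizeInBytes(subDocument) just before saving shows how much the XML declaration and the indentation add on top of the summed element sizes.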


2017-04-10 16:27:22

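If you use ASCII.GetByteCount, it is better to declare the XML encoding as ASCII (in the XML declaration). – Evk

@Evk Agreed, I just copied the byte-counting method from the question. Actually, I believe Unicode should be used there –

Yes, I think UTF-8 should be used (it is the XML default if no other encoding is specified). – Evk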
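
To illustrate why the choice of encoding matters for the byte count, here is a small sketch that measures the same string with three encodings. EncodingByteCounts and Measure are hypothetical names introduced only for this example.

```csharp
using System.Text;

public static class EncodingByteCounts
{
    // Hypothetical helper: byte counts of the same text under three encodings.
    public static (int Ascii, int Utf8, int Utf16) Measure(string text)
    {
        return (
            Encoding.ASCII.GetByteCount(text),   // non-ASCII chars are replaced by '?', one byte each
            Encoding.UTF8.GetByteCount(text),    // one to four bytes per code point
            Encoding.Unicode.GetByteCount(text)  // UTF-16: two bytes per char
        );
    }
}
```

For example, EncodingByteCounts.Measure("<Smt>caf\u00E9</Smt>") returns (15, 16, 30): the ASCII count undercounts as soon as the text contains a non-ASCII character, so the measurement should use the same encoding as the output file.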
