Problem description
I am marshalling objects to an XML file using the encoding "UTF-8". The file is generated successfully, but when I try to unmarshal it back, I get this error:
An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "0"
The character is 0x1A (U+001A), which can be encoded in UTF-8 but is illegal in XML. The JAXB Marshaller writes this character into the XML file, but the Unmarshaller cannot parse it back. I tried other encodings (UTF-16, ASCII, etc.) but still get the error.
The common solution is to remove or replace the invalid character before XML parsing. But if we need this character, how can we get the original character back after unmarshalling?
While looking for a solution, I decided to replace the invalid characters with a substitute character (for example, a dot ".") before unmarshalling.
I have created this class:
    public class InvalidXMLCharacterFilterReader extends FilterReader {

        public static final char substitute = '.';

        public InvalidXMLCharacterFilterReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int read = super.read(cbuf, off, len);

            if (read == -1)
                return -1;

            for (int readPos = off; readPos < off + read; readPos++) {
                if (!isValid(cbuf[readPos])) {
                    cbuf[readPos] = substitute;
                }
            }

            return readPos - off + 1;
        }

        public boolean isValid(char c) {
            if ((c == 0x9)
                    || (c == 0xA)
                    || (c == 0xD)
                    || ((c >= 0x20) && (c <= 0xD7FF))
                    || ((c >= 0xE000) && (c <= 0xFFFD))
                    || ((c >= 0x10000) && (c <= 0x10FFFF))) {
                return true;
            } else
                return false;
        }
    }
This is how I read and unmarshal the file:
    FileReader fileReader = new FileReader(this.getFile());
    Reader reader = new InvalidXMLCharacterFilterReader(fileReader);
    Object o = (Object) um.unmarshal(reader);
Somehow the reader does not replace the invalid characters with the character I want. The result is malformed XML data that cannot be unmarshalled. Is there something wrong with my InvalidXMLCharacterFilterReader class?
Recommended answer
The Unicode character U+001A is illegal in XML 1.0. The encoding used to represent it does not matter in this case; the character is simply not allowed in XML content.
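To make the rule concrete, here is a minimal sketch of a check against the Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names are hypothetical and only serve as an illustration. The check works on code points rather than on Java char values, because the last range lies above what a single char can hold.

```java
// Hypothetical helper illustrating the XML 1.0 "Char" production.
class Xml10Chars {

    // True if the code point may appear in XML 1.0 content.
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Index of the first code point that is not allowed, or -1 if all are allowed.
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "ab" + (char) 0x1A + "cd";
        System.out.println(isXml10Char(0x1A));
        System.out.println(firstInvalidIndex(text));
    }
}
```

A check like this could be run on field values before marshalling, so the problem is detected when the file is written rather than when it is read back.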
XML 1.1 allows some of the restricted characters (including U+001A) to be included, but they must be present as numeric character references (&#26; or &#x1A; in this case).
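The difference between the two XML versions can be tried out with the JAXP parser that ships with the JDK. The sketch below assumes that the default DocumentBuilderFactory implementation supports XML 1.1, which is the case for the parser bundled with the JDK as far as I know; the class name, the method name and the sample documents are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: the same character reference under XML 1.0 and XML 1.1.
class Xml11ReferenceDemo {

    // Returns the "value" attribute of the root element, or null if parsing fails.
    static String rootValue(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("value");
        } catch (Exception e) {
            // A SAXParseException here means the document is not well-formed.
            return null;
        }
    }

    public static void main(String[] args) {
        String xml10 = "<?xml version=\"1.0\"?><root value=\"&#x1A;\"/>";
        String xml11 = "<?xml version=\"1.1\"?><root value=\"&#x1A;\"/>";
        System.out.println("XML 1.0 parsed: " + (rootValue(xml10) != null));
        System.out.println("XML 1.1 parsed: " + (rootValue(xml11) != null));
    }
}
```

Note that this only shows the parsing side. Whether a given JAXB Marshaller can be made to emit an XML 1.1 declaration and write the character as a reference is a separate question that this sketch does not answer.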
Wikipedia has a good summary of the situation.
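As for the reader in the question, two things stand out: read(char[], int, int) returns readPos - off + 1, although readPos is not in scope after the loop, and the last range in isValid can never match because a char cannot exceed 0xFFFF. Below is a minimal sketch of a corrected reader, under some assumptions: the class name and the filter helper are hypothetical, the method returns the count reported by the wrapped reader, and surrogate halves are passed through untouched so that supplementary characters survive (validating surrogate pairs is left to the parser).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical corrected variant of the reader from the question.
class SubstitutingXmlReader extends FilterReader {

    static final char SUBSTITUTE = '.';

    SubstitutingXmlReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) {
            return -1;
        }
        return isAllowed((char) c) ? c : SUBSTITUTE;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int count = super.read(cbuf, off, len);
        if (count == -1) {
            return -1;
        }
        for (int i = off; i < off + count; i++) {
            if (!isAllowed(cbuf[i])) {
                cbuf[i] = SUBSTITUTE;
            }
        }
        // Report exactly as many chars as the wrapped reader delivered.
        return count;
    }

    // Char-level check; surrogate halves are let through (assumption, see lead-in).
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Small helper for trying the reader on an in-memory string.
    static String filter(String input) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new SubstitutingXmlReader(new StringReader(input))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With JAXB this would be used as in the question, for example um.unmarshal(new SubstitutingXmlReader(fileReader)). Keep in mind that the substitution is lossy: once U+001A has been replaced by a dot, the original character cannot be recovered after unmarshalling.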