VBA School: Detecting the Encoding of Text Without a BOM

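The previous article covered detecting a text file's encoding, but that approach relied entirely on the file having a BOM. For text without a BOM, the result is obviously wrong. This article therefore builds on the previous one and adds detection for BOM-less text. Identifying a file's encoding with 100% accuracy is hard, but checking whether the text is valid UTF-8 is relatively simple: a regular expression run over the data can do it, and reference code is available online.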
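Why UTF-8 is comparatively easy to check: the first byte of each character announces how many bytes follow, and every following byte must fall in the range 0x80 to 0xBF. The sketch below illustrates that rule. `Utf8SeqLen` is a hypothetical helper, not part of the article's code, and it uses the ranges of current UTF-8 (RFC 3629), which are somewhat stricter than the regular expression used later in this article.

```vba
' Hypothetical helper (not from the original article):
' expected length of a UTF-8 sequence, judged from its first byte.
' Returns 0 when the byte cannot start a sequence.
Function Utf8SeqLen(ByVal b As Byte) As Long
    If b < &H80 Then
        Utf8SeqLen = 1          ' 0xxxxxxx: plain ASCII
    ElseIf b >= &HC2 And b <= &HDF Then
        Utf8SeqLen = 2          ' 110xxxxx: one continuation byte follows
    ElseIf b >= &HE0 And b <= &HEF Then
        Utf8SeqLen = 3          ' 1110xxxx: two continuation bytes follow
    ElseIf b >= &HF0 And b <= &HF4 Then
        Utf8SeqLen = 4          ' 11110xxx: three continuation bytes follow
    Else
        Utf8SeqLen = 0          ' continuation byte (80-BF) or invalid lead byte
    End If
End Function
```

For example, the character "中" is stored in UTF-8 as E4 B8 AD: `Utf8SeqLen(&HE4)` gives 3, and the two bytes after it are both continuation bytes. Text saved in an ANSI code page such as GBK rarely follows this pattern by accident, which is what makes the check useful.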


The cleaned-up code is as follows:

Sub Gc()
 Dim myFileName$
 myFileName = ThisWorkbook.Path & "\UTF8NoBom.txt"
 MsgBox GetCode(myFileName)
End Sub
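Function GetCode(ByVal myFileName As String) As String
    Dim i As Long, n As Long, f As Integer
    Dim Tmp() As Byte, tp() As String
    Dim str3 As String

    f = FreeFile                        ' use a free file number instead of hard-coding #1
    Open myFileName For Binary Access Read As #f
    n = LOF(f) - 1                      ' index of the last byte
    If n < 0 Then                       ' empty file: treated like pure ASCII
        Close #f
        GetCode = "UTF8_NOBOM"
        Exit Function
    End If
    ReDim Tmp(n)
    Get #f, , Tmp
    Close #f

    ' Check for a BOM by comparing the leading bytes directly
    If n >= 1 Then
        If Tmp(0) = 255 And Tmp(1) = 254 Then GetCode = "Unicode": Exit Function
        If Tmp(0) = 254 And Tmp(1) = 255 Then GetCode = "Unicode Big Endian": Exit Function
    End If
    If n >= 2 Then
        If Tmp(0) = 239 And Tmp(1) = 187 And Tmp(2) = 191 Then GetCode = "UTF-8": Exit Function
    End If

    ' No BOM: map each byte to the character with the same code (0-255)
    ' so that the regular expression can inspect the raw byte values
    ReDim tp(n)
    For i = 0 To n
        tp(i) = ChrW(Tmp(i))
    Next
    str3 = Join(tp, "")

    If is_valid_utf8(str3) Then         ' check whether the data is valid UTF-8
        GetCode = "UTF8_NOBOM"
    Else
        GetCode = "ANSI"
    End If
End Function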

The function below checks whether the data is valid UTF-8:

Function is_valid_utf8(ByRef str) 'ByRef for efficiency
 Dim s, mRegExp
 Set mRegExp = CreateObject("VbScript.regexp")
 
 s = "[\xC0-\xDF]([^\x80-\xBF]|$)"
 s = s & "|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)"
 s = s & "|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)"
 s = s & "|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)"
 s = s & "|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)"
 s = s & "|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)"
 s = s & "|[\x00-\x7F][\x80-\xBF]"
 s = s & "|[\xC0-\xDF].[\x80-\xBF]"
 s = s & "|[\xE0-\xEF]..[\x80-\xBF]"
 s = s & "|[\xF0-\xF7]...[\x80-\xBF]"
 s = s & "|[\xF8-\xFB]....[\x80-\xBF]"
 s = s & "|[\xFC-\xFD].....[\x80-\xBF]"
 s = s & "|[\xFE-\xFE]......[\x80-\xBF]"
 s = s & "|^[\x80-\xBF]"
 mRegExp.Pattern = s
 is_valid_utf8 = (Not mRegExp.test(str))
End Function
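A quick way to see what the pattern accepts and rejects is to feed it a few known byte sequences. The sketch below is only an illustration: `BytesToStr` and `TestIsValidUtf8` are hypothetical names, and the helper builds the string the same way `GetCode` does, one character per byte.

```vba
' Hypothetical helper: turn a list of byte values into the
' one-character-per-byte string that is_valid_utf8 expects.
Function BytesToStr(ParamArray vals() As Variant) As String
    Dim i As Long, s As String
    For i = LBound(vals) To UBound(vals)
        s = s & ChrW(vals(i))
    Next
    BytesToStr = s
End Function

Sub TestIsValidUtf8()
    Dim s As String

    s = BytesToStr(&HE4, &HB8, &HAD)    ' "中" encoded as UTF-8
    Debug.Print is_valid_utf8(s)        ' should print True

    s = BytesToStr(&HD6, &HD0)          ' "中" encoded as GBK (ANSI on Chinese Windows)
    Debug.Print is_valid_utf8(s)        ' should print False

    s = BytesToStr(&H41, &H42, &H43)    ' "ABC", pure ASCII
    Debug.Print is_valid_utf8(s)        ' should print True
End Sub
```

The GBK case fails because D6 looks like the lead byte of a two-byte sequence, yet D0 is not a continuation byte, so the first alternative of the pattern matches. Note that the pattern also tolerates some byte sequences that current UTF-8 no longer allows, such as five- and six-byte forms, so it is a practical heuristic rather than a strict validator.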

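The code still has one minor quirk: if the text contains only English letters and digits, that is, pure ASCII, it is reported as UTF8_NOBOM. Since pure ASCII is compatible with UTF-8, nothing will be garbled, so this can be ignored.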
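If a separate result for pure ASCII is wanted anyway, one possible approach is to scan the byte array for any value above 127 before running the UTF-8 check. This is a minimal sketch: `IsPureAscii` is a hypothetical helper, and it assumes the array has already been dimensioned and filled.

```vba
' Hypothetical helper: True when no byte has its high bit set.
' Assumes bytes() is already dimensioned (non-empty).
Function IsPureAscii(ByRef bytes() As Byte) As Boolean
    Dim i As Long
    For i = LBound(bytes) To UBound(bytes)
        If bytes(i) > 127 Then Exit Function    ' leaves the default result, False
    Next
    IsPureAscii = True
End Function
```

Inside `GetCode` this could be called as `IsPureAscii(Tmp)` just before `is_valid_utf8`, returning a label such as "ASCII" (a name chosen here for illustration) when it is True.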

