The string and bytes types in protobuf

June 07, 2017 in Serialization and Deserialization

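From the introduction to protobuf in the previous section, we know that protobuf offers two types for string-like data: string and bytes. How do the two differ? When should you use string, and when bytes? And which C++ types do they map to? This post clears things up.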

Differences between string and bytes

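Experience tells us that bytes is generally meant for storing binary data. In C++, however, std::string can hold ASCII text as well as binary sequences containing any number of \0 bytes. So where does the difference between the two lie?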
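The claim about std::string holding binary data can be checked with a minimal sketch. The function names `MakeBinaryPayload` and `DemoEmbeddedNul` are hypothetical and exist only for illustration; the point is that `size()` counts every byte, while C-style functions such as `strlen` stop at the first \0.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: builds a std::string holding 5 bytes, two of them '\0'.
std::string MakeBinaryPayload() {
  const char raw[] = {'a', '\0', 'b', '\0', 'c'};
  // Pass the length explicitly so the embedded '\0' bytes are kept.
  return std::string(raw, sizeof(raw));
}

// Hypothetical demo: std::string tracks its own length, strlen does not.
void DemoEmbeddedNul() {
  const std::string payload = MakeBinaryPayload();
  assert(payload.size() == 5);                    // all 5 bytes are stored
  assert(std::strlen(payload.c_str()) == 1);      // C view stops at first '\0'
}
```

This is why a std::string must always be built and read together with its length when it carries binary data.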

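The string type (protobuf's string, as opposed to the C++ one) must not hold invalid UTF-8. If a string field contains such data, serialization reports an error like the following: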

[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.

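### Cause of the error above
Let's analyze this from the protobuf source code. During serialization, protobuf calls the function SerializeFieldWithCachedSizes. Let's compare how it handles string fields and bytes fields.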

For the string type:

case FieldDescriptor::TYPE_STRING: {
  string scratch;
  const string& value = field->is_repeated() ?
    message_reflection->GetRepeatedStringReference(
      message, field, j, &scratch) :
    message_reflection->GetStringReference(message, field, &scratch);
  VerifyUTF8StringNamedField(value.data(), value.length(), SERIALIZE,
                             field->name().c_str());
  WireFormatLite::WriteString(field->number(), value, output);
  break;
}

For the bytes type:

case FieldDescriptor::TYPE_BYTES: {
  string scratch;
  const string& value = field->is_repeated() ?
    message_reflection->GetRepeatedStringReference(
      message, field, j, &scratch) :
    message_reflection->GetStringReference(message, field, &scratch);
  WireFormatLite::WriteBytes(field->number(), value, output);
  break;
}

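As the two listings show, the main difference between serializing string and bytes is that the string path calls VerifyUTF8StringNamedField to check whether the value contains invalid UTF-8. The check is implemented as follows (in this listing the function is named VerifyUTF8StringFallback):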

void WireFormat::VerifyUTF8StringFallback(const char* data,
                                          int size,
                                          Operation op,
                                          const char* field_name) {
  if (!IsStructurallyValidUTF8(data, size)) {
    const char* operation_str = NULL;
    switch (op) {
      case PARSE:
        operation_str = "parsing";
        break;
      case SERIALIZE:
        operation_str = "serializing";
        break;
      // no default case: have the compiler warn if a case is not covered.
    }
    string quoted_field_name = "";
    if (field_name != NULL) {
      quoted_field_name = StringPrintf(" '%s'", field_name);
    }
    // no space below to avoid double space when the field name is missing.
    GOOGLE_LOG(ERROR) << "String field" << quoted_field_name << " contains invalid "
               << "UTF-8 data when " << operation_str << " a protocol "
               << "buffer. Use the 'bytes' type if you intend to send raw "
               << "bytes. ";
  }
}
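To make the idea of a structural UTF-8 check concrete, here is a simplified sketch. It is not protobuf's IsStructurallyValidUTF8; the name `IsValidUtf8Sketch` is hypothetical, and the logic only assumes the standard UTF-8 rules: correct lead and continuation bytes, no overlong encodings, no surrogates, and no code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, simplified structural UTF-8 check (not protobuf's code).
bool IsValidUtf8Sketch(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = p[i];
    if (c < 0x80) {  // single-byte ASCII, including '\0'
      ++i;
      continue;
    }
    std::size_t len = 0;  // total bytes in this sequence
    unsigned int cp = 0;  // decoded code point
    if ((c & 0xE0) == 0xC0) {
      len = 2; cp = c & 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3; cp = c & 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4; cp = c & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not 10xxxxxx
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (len == 2 && cp < 0x80) return false;      // overlong encoding
    if (len == 3 && cp < 0x800) return false;     // overlong encoding
    if (len == 4 && cp < 0x10000) return false;   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;              // beyond Unicode range
    i += len;
  }
  return true;
}
```

A value for which such a check fails, for example a lone 0xFF byte or a truncated multi-byte sequence, is the kind of data that belongs in a bytes field rather than a string field.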

Differences between string and bytes in C++ and Java

The protobuf types map to C++ and Java types as follows:

In C++, both string and bytes are implemented as std::string.
In Java, string and bytes are implemented as String and ByteString, respectively.
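On the wire the two types also look the same: both are written as a length-delimited field (wire type 2), that is, a tag, a varint length, and then the raw bytes. The sketch below hand-rolls that layout to illustrate it. `AppendVarint` and `EncodeLengthDelimited` are hypothetical helpers written for this post, not protobuf APIs.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends v to out as a base-128 varint.
void AppendVarint(std::uint32_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));  // more bytes follow
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));  // final byte, high bit clear
}

// Hypothetical helper: encodes one length-delimited field (wire type 2),
// the layout shared by string and bytes fields.
std::string EncodeLengthDelimited(std::uint32_t field_number,
                                  const std::string& value) {
  std::string out;
  AppendVarint((field_number << 3) | 2u, &out);                  // tag
  AppendVarint(static_cast<std::uint32_t>(value.size()), &out);  // length
  out.append(value);                                             // payload
  return out;
}
```

For field number 2 and the value "testing", this produces the bytes 12 07 followed by the seven ASCII characters. Nothing in the encoding records whether the field was declared as string or bytes, which is consistent with the UTF-8 check being the only difference in the serialization code shown earlier.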

If bytes can represent anything string can, why is string still needed?

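According to forum discussions, Java offers many more APIs for String than for ByteString, so any field that can be defined as string should be. If a field's value definitely or possibly contains invalid UTF-8, use the bytes type instead.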

魏传柳

魏传柳(2824759538@qq.com)

Tencent

ShenZhen,China