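Storing Null Values in Avro Files

1. Introduction

In this tutorial, we'll explore two ways to handle null values and write them to a file when using Apache Avro in Java. Along the way, we'll also discuss best practices for dealing with nullable fields.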

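2. The Problem With Null Values in Avro

Apache Avro is a data serialization framework that provides rich data structures and a compact, fast binary data format. However, working with null values in Avro requires special attention.

Let's review a common scenario where we might run into trouble: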

GenericRecord record = new GenericData.Record(schema);
record.put("email", null);
// This might throw NullPointerException when writing to file

By default, Avro fields aren't nullable. Attempting to store a null value in such a field causes a NullPointerException during serialization.
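
One way to catch this before any file is touched is to validate the record against its schema. The following is a minimal sketch, not part of the original project: the NullCheckSketch class and its trimmed-down schema are hypothetical, and it relies on GenericData.validate() to report whether every field value fits its declared type.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class NullCheckSketch {

    // Hypothetical schema for illustration: "email" is a plain, non-nullable string
    static Schema strictSchema() {
        return SchemaBuilder.record("User")
            .fields()
            .requiredLong("id")
            .requiredString("email")
            .endRecord();
    }

    // Returns true only when every field value is acceptable for the schema
    static boolean isValid(Schema schema, GenericRecord record) {
        return GenericData.get().validate(schema, record);
    }

    public static void main(String[] args) {
        Schema schema = strictSchema();
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);

        record.put("email", null);
        System.out.println(isValid(schema, record)); // false: null doesn't fit "string"

        record.put("email", "a@example.com");
        System.out.println(isValid(schema, record)); // true
    }
}
```

A check like this only reports the mismatch; making the field accept null still requires changing the schema.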

Before we look at the first solution, let's set up our project with the correct dependency:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.12.0</version>
</dependency>

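3. Solutions for Handling Null Values

In this section, we'll explore the two main approaches to handling null values in Avro: schema definition and annotations.

3.1. Three Ways to Define the Schema

We can define an Avro schema that accepts null values in three ways. First, let's look at the JSON string approach: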

private static final String SCHEMA_JSON = """
    {
        "type": "record",
        "name": "User",
        "namespace": "com.baeldung.apache.avro.storingnullvaluesinavrofile",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "active", "type": "boolean"},
            {"name": "lastUpdatedBy", "type": ["null", "string"], "default": null},
            {"name": "email", "type": "string"}
        ]
    }""";

public static Schema createSchemaFromJson() {
    return new Schema.Parser().parse(SCHEMA_JSON);
}

Here, we define the nullable lastUpdatedBy field using the union type syntax: ["null", "string"].
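
To see what the parser makes of that union, we can inspect the field's schema. The sketch below uses a shortened, hypothetical version of the schema (the UnionInspectionSketch class isn't part of the original code) and a concatenated string instead of a text block so it compiles on older Java versions. Note that "null" is listed first in the union: Avro expects a field's default to match the first branch, so a null default goes with a null-first union.

```java
import org.apache.avro.Schema;

class UnionInspectionSketch {

    // Shortened, hypothetical schema with a single nullable field
    static final String JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"lastUpdatedBy\",\"type\":[\"null\",\"string\"],\"default\":null}"
        + "]}";

    // Parses the record schema and returns the schema of one of its fields
    static Schema fieldSchema(String fieldName) {
        Schema record = new Schema.Parser().parse(JSON);
        return record.getField(fieldName).schema();
    }

    public static void main(String[] args) {
        Schema nullable = fieldSchema("lastUpdatedBy");
        System.out.println(nullable.getType());     // UNION
        for (Schema branch : nullable.getTypes()) {
            System.out.println(branch.getType());   // NULL, then STRING
        }
        System.out.println(nullable.isNullable());  // true
        System.out.println(fieldSchema("id").isNullable()); // false
    }
}
```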

Next, we'll use SchemaBuilder to define our schema in a more programmatic way:

public static Schema createSchemaWithOptionalFields1() {
    return SchemaBuilder
        .record("User")
        .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
        .fields()
        .requiredLong("id")
        .requiredString("name")
        .requiredBoolean("active")
        .name("lastUpdatedBy")
        .type() // Start of configuration
        .unionOf()
        .nullType()
        .and()
        .stringType()
        .endUnion()
        .nullDefault() // End of configuration
        .requiredString("email")
        .endRecord();
}

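In this example, we use SchemaBuilder to create a schema in which the lastUpdatedBy field can hold either null or a string value.

Finally, let's create another schema, similar to the one above but using a different approach: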

public static Schema createSchemaWithOptionalFields2() {
    return SchemaBuilder
        .record("User")
        .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
        .fields()
        .requiredLong("id")
        .requiredString("name")
        .requiredBoolean("active")
        .requiredString("lastUpdatedBy")
        .optionalString("email") // Using optional field
        .endRecord();
}

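Instead of the type().unionOf().nullType().and().stringType().endUnion().nullDefault() chain, we used optionalString().

Let's quickly compare the last two ways of defining the schema, since they're very similar.

The longer version offers more control when configuring null values. The shorter version is a convenience shortcut provided by SchemaBuilder. In essence, they do the same thing.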
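
The comparison can be made concrete with a small sketch. The BuilderComparisonSketch class and its single-field schemas are hypothetical and only serve the illustration. The first two methods build the same nullable field in both styles, and the third shows something the long form can express that optionalString() can't: a union that lists string first and carries a non-null default.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

class BuilderComparisonSketch {

    // Long form: spell out the union and its default explicitly
    static Schema longForm() {
        return SchemaBuilder.record("User").fields()
            .name("email").type().unionOf().nullType().and().stringType().endUnion().nullDefault()
            .endRecord();
    }

    // Short form: let SchemaBuilder create the ["null", "string"] union
    static Schema shortForm() {
        return SchemaBuilder.record("User").fields()
            .optionalString("email")
            .endRecord();
    }

    // Only the long form allows a string-first union with a non-null default
    static Schema stringFirstWithDefault() {
        return SchemaBuilder.record("User").fields()
            .name("email").type().unionOf().stringType().and().nullType().endUnion()
            .stringDefault("unknown")
            .endRecord();
    }

    public static void main(String[] args) {
        Schema a = longForm().getField("email").schema();
        Schema b = shortForm().getField("email").schema();
        System.out.println(a.equals(b)); // true: both are a union of null and string
        System.out.println(stringFirstWithDefault().getField("email").schema().getTypes().get(0).getType()); // STRING
    }
}
```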

3.2. Using the @Nullable Annotation

The next approach uses Avro's built-in @Nullable annotation (org.apache.avro.reflect.Nullable):

public class AvroUser {
    private long id;
    private String name;
    @Nullable
    private Boolean active;
    private String lastUpdatedBy;
    private String email;

    // rest of code
}

This annotation tells Avro's reflection-based schema generation that the field can accept null values.
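
To illustrate the effect, we can ask ReflectData to derive a schema from an annotated class and check which fields come out nullable. This is a sketch under assumptions: the Account class is a hypothetical stand-in shaped like AvroUser, and ReflectNullableSketch isn't part of the original code.

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

// Hypothetical POJO for illustration, shaped like AvroUser
class Account {
    private long id;
    private String name;
    @Nullable
    private Boolean active;
}

class ReflectNullableSketch {

    // Derives an Avro schema from the class's fields via reflection
    static Schema schemaFor(Class<?> type) {
        return ReflectData.get().getSchema(type);
    }

    public static void main(String[] args) {
        Schema schema = schemaFor(Account.class);
        System.out.println(schema.getField("active").schema().isNullable()); // true
        System.out.println(schema.getField("name").schema().isNullable());   // false
    }
}
```

Only the annotated field becomes a union with null. The other fields remain non-nullable, so a null in name would still fail at write time.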

4. Implementing the File Write

Now, let's see how to serialize a record that contains null values:

public static void writeToAvroFile(Schema schema, GenericRecord record, String filePath) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
        dataFileWriter.create(schema, new File(filePath));
        dataFileWriter.append(record);
    }
}

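We initialize a GenericDatumWriter, the DatumWriter implementation that works with GenericRecord objects. We pass the schema as a constructor argument so that it knows how to serialize the data.

Then, we initialize a DataFileWriter, the class that writes Avro records to a file. It also handles the file's metadata and compression.

Next, using the create() method, we create the Avro file with the specified schema. This step writes the file header, which includes the schema and other metadata.

Finally, we write the actual record to the file. If fields declared as @Nullable or as a union with null contain null values, they're serialized correctly.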
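
To confirm that a null value survives the trip to disk, we can read the file back with DataFileReader. The sketch below is self-contained and makes its own assumptions: RoundTripSketch and its two-field schema are hypothetical, and the reader takes its schema from the file header rather than from our code.

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

class RoundTripSketch {

    // Writes one record with a null "email", reads it back, and returns what was read
    static GenericRecord writeAndReadBack(File file) throws IOException {
        Schema schema = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .optionalString("email")
            .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Alice");
        record.put("email", null);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
            writer.create(schema, file);
            writer.append(record);
        }

        // No schema is passed here: the reader uses the one stored in the file header
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, datumReader)) {
            return reader.next();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("users", ".avro");
        file.deleteOnExit();
        GenericRecord readBack = writeAndReadBack(file);
        System.out.println(readBack.get("name"));  // Alice
        System.out.println(readBack.get("email")); // null
    }
}
```

String values come back as Avro's Utf8 type rather than java.lang.String, so comparisons are safest after calling toString().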

5. Testing Our Solution

Now, let's check that our implementation works correctly:

@Test
void whenSerializingUserWithNullPropFromStringSchema_thenSuccess(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    // Use the JSON-string schema for both the record and the writer
    schema = AvroUser.createSchemaFromJson();

    String filePath = tempDir.resolve("test.avro").toString();
    GenericRecord record = AvroUser.createRecord(schema, user);

    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));

    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

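In this test, we first set the lastUpdatedBy field to null. Then, we created a schema from the JSON string declared at the beginning.

As the test shows, the record containing a null value is serialized successfully. The next test follows the same pattern with the first SchemaBuilder schema: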

@Test
void givenSchemaBuilderWithOptionalFields1_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    String filePath = tempDir.resolve("test.avro").toString();

    schema = AvroUser.createSchemaWithOptionalFields1();
    GenericRecord record = AvroUser.createRecord(schema, user);

    assertTrue(schema.getField("lastUpdatedBy").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));

    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

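Here, we used the SchemaBuilder schema with the longer configuration for the nullable field, and we also asserted that the field's schema is nullable.

Finally, let's test the second SchemaBuilder version, which has the shorter nullable-field configuration: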

@Test
void givenSchemaBuilderWithOptionalFields2_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setEmail(null);
    String filePath = tempDir.resolve("test.avro").toString();

    schema = AvroUser.createSchemaWithOptionalFields2();
    GenericRecord record = AvroUser.createRecord(schema, user);

    assertTrue(schema.getField("email").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));

    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

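6. Conclusion

In this article, we explored the two main approaches to handling null values in Apache Avro. First, we saw how to define a schema in three ways. Then, we applied the @Nullable annotation directly to a class field.

Both approaches are valid. However, the schema-based approach offers finer-grained control and is generally preferred for production systems.

As always, the code is available over on GitHub.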

This work is an original or a translation, licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International license.