Pitfalls: Assorted Errors in Druid + S3 Batch Ingestion Tasks

Background

Apache Druid: 26.0.0
Batch ingestion task information:
- SQL-based ingestion
- S3 input source

Duplicate column entries found

Error details

{
  "errorCode": "CannotParseExternalData",
  "errorMessage": "Duplicate column entries found : [0, Facebook]"
}

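Solution

Druid is a column-oriented store, and this error means that two columns share the same name. Locate the columns with identical names and fix them by hand.

In my case, the raw first-party data that the MMP wrote to S3 was itself broken: the header row was missing, so Druid auto-detected three columns whose names were all empty. Compare the two headers below: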

[Image: a normal header]
[Image: a problematic header]
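
Before submitting the task, it can help to inspect the first line of the raw file for blank or repeated header cells. The following is a minimal sketch, assuming a plain comma-separated file with no quoted commas in the header; the function name `check_csv_header` is hypothetical and not part of Druid.

```shell
# Hypothetical helper: report empty or duplicated names in the first line of a CSV file.
# Assumes a plain comma-separated header without quoted commas.
check_csv_header() {
  local file="$1"
  head -n 1 "$file" | tr -d '\r' | awk -F',' '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "") { printf "empty name at column %d\n", i; bad = 1 }
        else if (seen[$i]++) { printf "duplicate name \"%s\" at column %d\n", $i, i; bad = 1 }
      }
    }
    END { exit bad }   # non-zero exit status when any problem was found
  '
}
```

A possible use is `check_csv_header raw_data.csv || echo "fix the header before ingesting"`, where `raw_data.csv` stands for a local copy of the S3 object. If the header row is missing altogether, the first data row is what gets checked, which is how values such as `0` and `Facebook` can show up as "column names".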

InsertTimeOutOfBounds

Error details

{
  "errorCode": "InsertTimeOutOfBoundsFault",
  "interval": "2023-06-09T00:00:00.000Z/2023-06-10T00:00:00.000Z",
  "errorMessage": "Query generated time chunk [2023-06-09T00:00:00.000Z/2023-06-10T00:00:00.000Z] out of bounds specified by replaceExistingTimeChunks"
}

Solution

This error usually occurs when running REPLACE on specific time ranges, that is, in tasks like the following:

REPLACE INTO <target table>
OVERWRITE WHERE __time >= TIMESTAMP '<lower bound>' AND __time < TIMESTAMP '<upper bound>'
< SELECT query >
PARTITIONED BY <time granularity>
[ CLUSTERED BY <column list> ]

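The cause is that the query generated a time chunk outside the bounds specified by replaceExistingTimeChunks, so check and fix the date fields.

In my case, the MILLIS_TO_TIMESTAMP("{created_at}" * 1000) conversion in the WHERE clause of the task above was wrong (specifically, the value was converted to a timestamp directly, without the * 1000), so the resulting timestamps corresponded to -146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z.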
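
To see what the scaling does, one can compare the same raw value interpreted as seconds and as milliseconds. This is a minimal sketch, assuming GNU `date` (the `-d @<seconds>` form) and assuming `created_at` holds epoch seconds; the sample value 1686268800 and the function names are illustrative and do not come from the original data.

```shell
# Render an epoch value given in seconds as a UTC ISO-8601 timestamp (GNU date).
epoch_seconds_to_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Render an epoch value given in milliseconds, the unit MILLIS_TO_TIMESTAMP expects.
epoch_millis_to_utc() {
  epoch_seconds_to_utc "$(( $1 / 1000 ))"
}

created_at=1686268800   # illustrative value, assumed to be epoch seconds

epoch_millis_to_utc "$(( created_at * 1000 ))"   # scaled to milliseconds first
epoch_millis_to_utc "$created_at"                # unscaled: seconds misread as milliseconds
```

With this sample value, the scaled call lands on 2023-06-09, while the unscaled call lands in January 1970, far outside any 2023 OVERWRITE range.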

Worker did not start in timeout

Error details

Other sensitive information has been omitted below:

{
  "type": "query_controller",
  "errorMsg": "The worker that this task is assigned did not start it in timeout[PT5M]. See overlord and middleMana..."
}

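Solution

I hit this when running a batch ingestion task directly from the Druid web console. It is usually caused by the server running out of disk space. (🙊 a quiet grumble from a small company)

Here are some common ways to free up disk space.

👉 Clear log files regularly; see my other post, "Adding Scheduled Tasks with Crontab".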

# Check overall disk usage, then list the ten largest log entries
df -h
du -sh /var/log/* | sort -hr | head -n 10
du -sh /opt/druid/apache-druid-26.0.0/log/* | sort -hr | head -n 10

# Remove all Druid log files
sudo rm /opt/druid/apache-druid-26.0.0/log/*.log
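
For a scheduled cleanup, a gentler alternative is to delete only the logs older than some cutoff rather than all of them. This is a minimal sketch; the function name `clean_old_logs`, the 7-day default, and the script path in the crontab comment are hypothetical.

```shell
# Hypothetical helper: delete *.log files older than a given number of days.
clean_old_logs() {
  local log_dir="$1"
  local keep_days="${2:-7}"
  # -mtime +N matches files whose age, in whole days, is greater than N.
  # -print lists each file as it is deleted.
  find "$log_dir" -maxdepth 1 -type f -name '*.log' -mtime +"$keep_days" -print -delete
}

# Example crontab entry (hypothetical script path), running daily at 03:00:
# 0 3 * * * /usr/local/bin/clean_druid_logs.sh
```

For the installation above, the call would be `clean_old_logs /opt/druid/apache-druid-26.0.0/log 7`, run as a user with write access to that directory.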

To be continued …
