LongVALE Paper Reproduction
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos (paper reproduction)
Reproduction Environment
Hardware
- OS: Ubuntu 18.04
- CPU: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
- GPU: RTX 3090×8
- Memory: 791224272 kB (about 754.6 GiB)
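Environment Setup
This project is built on the VTimeLLM codebase.
Following the README of the official open-source repository as written ran into problems during reproduction; the same problems also exist in VTimeLLM. The adjusted setup is: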
git clone https://github.com/ttgeng233/LongVALE.git
cd LongVALE
conda create --name=longvale python=3.10
conda activate longvale
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
# Explicitly downgrade flash-attn here; otherwise the build fails
pip install flash-attn==1.0.9 --no-build-isolation
pip install "numpy<2.0"
pip install moviepy==1.0.3
Dataset Download
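The paper comes with its own dataset, LongVALE, available from Baidu Netdisk or Hugging Face. The Hugging Face copy is split into multiple archives, so Baidu Netdisk is somewhat more convenient: with an off-peak download pass, leaving the lab machine running overnight is enough to finish the download.
After compression, the raw video data is 190.2 GB for the training set and 40.55 GB for the test set.
Because the campus network bandwidth seemed insufficient, only the test set has been downloaded so far, and it is used for inference only.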
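Before moving on to evaluation, a quick check of the packages pinned in the environment setup can catch a mismatched install early (for example, a newer flash-attn or NumPy 2.x pulled in by another dependency). The following is a minimal sketch, not part of the LongVALE repository: the helper names `installed_version` and `check_pins` are hypothetical, and the pins are simply copied from the setup commands in this post.

```python
from importlib import metadata

# Exact pins copied from the setup commands in this write-up.
EXACT_PINS = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "torchaudio": "2.1.2",
    "flash-attn": "1.0.9",
    "moviepy": "1.0.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(versions):
    """Compare a {package: version-or-None} mapping against the pins.

    Returns a list of human-readable problems; an empty list means no mismatch.
    """
    problems = []
    for package, wanted in EXACT_PINS.items():
        found = versions.get(package)
        # Drop local build tags such as "+cu118" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems.append(f"{package}: expected {wanted}, found {found}")
    numpy_version = versions.get("numpy")
    if numpy_version is None:
        problems.append("numpy: not installed")
    elif int(numpy_version.split(".")[0]) >= 2:
        problems.append(f"numpy: expected <2.0, found {numpy_version}")
    return problems


if __name__ == "__main__":
    names = list(EXACT_PINS) + ["numpy"]
    current = {name: installed_version(name) for name in names}
    for line in check_pins(current) or ["all pinned packages match"]:
        print(line)
```

Run inside the `longvale` conda environment, the script prints one line per mismatch, or a single confirmation line if everything matches the pins.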
Running the Evaluation Code
nohup python longvalellm/eval/eval.py --video_feat_folder features_eval/visual_features_1171 --audio_feat_folder features_eval/audio_features_1171 --asr_feat_folder features_eval/speech_features_1171 --task all --log_path log > output.log0 2>&1 &
Temporal Video Grounding
nohup python longvalellm/eval/eval.py --video_feat_folder features_eval/visual_features_1171 --audio_feat_folder features_eval/audio_features_1171 --asr_feat_folder features_eval/speech_features_1171 --task grounding --log_path log > output.log1 2>&1 &
python longvalellm/eval/metric.py --task grounding --log_path log
- Evaluation results
====================== Grounding ======================
Found 13867 logs
mIoU: 10.88
R1@0.3: 15.68
R1@0.5: 8.62
R1@0.7: 3.87
- Comparison with the paper's results
| | mIoU | R@0.3 | R@0.5 | R@0.7 |
|---|---|---|---|---|
| LongVALE (paper) | 11.0 | 15.7 | 8.6 | 3.9 |
| Test (this reproduction) | 10.88 | 15.68 | 8.62 | 3.87 |
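To make the grounding numbers easier to interpret, here is a minimal sketch of how mIoU and R1@threshold are commonly computed for temporal grounding. It assumes the standard definition of temporal IoU over (start, end) segments and one prediction per query. It is not taken from the repository's `metric.py`, and the function names `temporal_iou` and `grounding_metrics` are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals on the same time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(pairs, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and R1@threshold, in percent, for (pred, gt) pairs."""
    if not pairs:
        raise ValueError("need at least one (prediction, ground truth) pair")
    ious = [temporal_iou(pred, gt) for pred, gt in pairs]
    scores = {"mIoU": 100.0 * sum(ious) / len(ious)}
    for t in thresholds:
        # A query counts as a hit when its single prediction reaches IoU >= t.
        hits = sum(iou >= t for iou in ious)
        scores[f"R1@{t}"] = 100.0 * hits / len(ious)
    return scores


if __name__ == "__main__":
    # Toy segments in seconds: partial overlap, exact match, no overlap.
    toy = [((0, 10), (5, 15)), ((0, 10), (0, 10)), ((0, 5), (10, 20))]
    for name, value in grounding_metrics(toy).items():
        print(f"{name}: {value:.2f}")
```

Under this definition, an R1@0.5 of 8.62 means that roughly 8.6% of the 13867 queries had a predicted segment overlapping the ground truth with IoU of at least 0.5.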
Dense Video Captioning
nohup python longvalellm/eval/eval.py --video_feat_folder features_eval/visual_features_1171 --audio_feat_folder features_eval/audio_features_1171 --asr_feat_folder features_eval/speech_features_1171 --task captioning --log_path log > output2.log 2>&1 &
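python longvalellm/eval/metric.py --task captioning --log_path log
- Evaluation results
====================== Captioning =====================
Found 1171 logs
soda_c: 2.80
METEOR: 4.68
CIDEr: 7.84
- Comparison with the paper's results
| | S (SODA_c) | C (CIDEr) | M (METEOR) |
|---|---|---|---|
| LongVALE (paper) | 2.8 | 7.9 | 4.7 |
| Test (this reproduction) | 2.80 | 7.84 | 4.68 |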
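The `eval.py` and `metric.py` invocations in this write-up share the same three feature-folder arguments and differ only in `--task`. A hypothetical helper such as the one below can generate them consistently. It only builds and prints the command lines (a dry run); the flags and paths are exactly those used in the commands in this post, while `eval_command` and `metric_command` are illustrative names, not part of the repository.

```python
import shlex

# Feature folders exactly as used in the commands in this write-up.
FEATURE_ARGS = [
    "--video_feat_folder", "features_eval/visual_features_1171",
    "--audio_feat_folder", "features_eval/audio_features_1171",
    "--asr_feat_folder", "features_eval/speech_features_1171",
]
TASKS = ("grounding", "captioning", "seg_captioning")


def eval_command(task, log_path="log"):
    """Build the argv list for one eval.py run ('all' runs every task)."""
    if task != "all" and task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return ["python", "longvalellm/eval/eval.py", *FEATURE_ARGS,
            "--task", task, "--log_path", log_path]


def metric_command(task, log_path="log"):
    """Build the argv list for scoring one task with metric.py."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return ["python", "longvalellm/eval/metric.py",
            "--task", task, "--log_path", log_path]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for task in TASKS:
        print(shlex.join(eval_command(task)))
        print(shlex.join(metric_command(task)))
```

The returned lists can be passed to `subprocess.run` directly if sequential execution is preferred over the `nohup ... &` background runs used here.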
Segment Captioning
nohup python longvalellm/eval/eval.py --video_feat_folder features_eval/visual_features_1171 --audio_feat_folder features_eval/audio_features_1171 --asr_feat_folder features_eval/speech_features_1171 --task seg_captioning --log_path log > output3.log 2>&1 &
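python longvalellm/eval/metric.py --task seg_captioning --log_path log
- Evaluation results
======================Segemnt Captioning =====================
BLEU4: 5.58%
METEOR: 10.94%
Rouge: 22.40%
CIDEr: 20.05%
- Comparison with the paper's results
| | B (BLEU4) | R (Rouge) | C (CIDEr) | M (METEOR) |
|---|---|---|---|---|
| LongVALE (paper) | 5.6 | 22.4 | 20.3 | 10.9 |
| Test (this reproduction) | 5.58 | 22.40 | 20.05 | 10.94 |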
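The three comparisons in this post can also be checked programmatically. The sketch below copies the paper and reproduced numbers from the tables above and reports the per-metric gap; the dictionary layout and the helper names `gaps` and `largest_gap` are assumptions made for illustration only.

```python
# Numbers copied from the comparison tables in this write-up.
PAPER = {
    "grounding": {"mIoU": 11.0, "R@0.3": 15.7, "R@0.5": 8.6, "R@0.7": 3.9},
    "captioning": {"SODA_c": 2.8, "CIDEr": 7.9, "METEOR": 4.7},
    "seg_captioning": {"BLEU4": 5.6, "Rouge": 22.4, "CIDEr": 20.3, "METEOR": 10.9},
}
REPRODUCED = {
    "grounding": {"mIoU": 10.88, "R@0.3": 15.68, "R@0.5": 8.62, "R@0.7": 3.87},
    "captioning": {"SODA_c": 2.80, "CIDEr": 7.84, "METEOR": 4.68},
    "seg_captioning": {"BLEU4": 5.58, "Rouge": 22.40, "CIDEr": 20.05, "METEOR": 10.94},
}


def gaps(paper, reproduced):
    """Return {(task, metric): reproduced - paper}, rounded to 2 decimals."""
    return {
        (task, metric): round(reproduced[task][metric] - value, 2)
        for task, metrics in paper.items()
        for metric, value in metrics.items()
    }


def largest_gap(paper, reproduced):
    """Return the (task, metric) key with the largest absolute gap, and the gap."""
    all_gaps = gaps(paper, reproduced)
    key = max(all_gaps, key=lambda k: abs(all_gaps[k]))
    return key, all_gaps[key]


if __name__ == "__main__":
    for (task, metric), gap in gaps(PAPER, REPRODUCED).items():
        print(f"{task:15s} {metric:7s} {gap:+.2f}")
    print("largest gap:", largest_gap(PAPER, REPRODUCED))
```

With the numbers in the tables, every reproduced metric is within 0.25 points of the paper's value; the largest gap is CIDEr on segment captioning (20.05 vs. 20.3).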