1. Audio Set: An ontology and human-labeled dataset for audio events
2. Language models are few-shot learners; Brown; Advances in Neural Information Processing Systems, 2020
3. Stanford Alpaca: An instruction-following LLaMA model; Taori, 2023
4. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models; Li, 2023
5. Exploring the limits of transfer learning with a unified text-to-text transformer; Raffel; The Journal of Machine Learning Research, 2020