This study investigated how different types of enhanced video captions affected listening comprehension and vocabulary acquisition, the latter measured by form recognition, meaning recall, and meaning recognition. A total of 158 low-intermediate Chinese EFL undergraduates were randomly assigned to one of five conditions: English captions (EC), English captions with highlighted target words and L1 gloss (ECL1), Chinese and English captions (CEC), Chinese and English captions with highlighted target words (CECGW), and no captions (NC). For listening comprehension, the CECGW group scored higher than the CEC, EC, and NC groups, while the NC group scored lower than the other groups, with statistical significance. Captioned videos, and videos bilingually captioned with glossed target words in particular, thus aided listening comprehension. For form recognition in the vocabulary tests, no statistically significant differences were detected across the caption types. ECL1 was the most effective condition for meaning recall and meaning recognition. Pedagogical implications are proposed for teachers’ use of the L1 in captioned videos to optimize learning.