Listening can make learning easier and more interesting than reading. An audiobook application is software that converts text to speech. Although this is appealing, the audiobooks available on the market are neither free nor accessible to everyone. In addition, these audiobooks are mostly limited to fiction, novels, and comics. A comprehensive review of the available literature shows that very little intensive work has been done on image-to-speech conversion. In this paper, we employ several strategies across the entire image-to-speech process. As an initial step, deep learning models are built to denoise the images that are fed to the system. This is followed by text extraction with the help of OCR engines. To further improve the quality of the extracted text, a post-processing spell-check mechanism is incorporated. Our result analysis demonstrates that, with denoising and spell checking, our model achieves an accuracy of 98.11%, compared to 84.02% without any denoising or spell-check mechanism.
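To make the post-processing step concrete, the following is a minimal sketch of dictionary-based spell correction applied to OCR output, together with a simple character-level similarity score. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the vocabulary `VOCAB`, the helper names `correct_word`, `spell_check`, and `char_accuracy`, the similarity cutoff of 0.75, and the use of Python's standard `difflib` module are all hypothetical choices. The score shown is not necessarily the accuracy metric behind the reported figures.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical vocabulary; a real system would load a full dictionary.
VOCAB = ["listening", "makes", "learning", "easier", "than", "reading"]


def correct_word(word, vocab=VOCAB, cutoff=0.75):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in vocab:
        return word
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    matches = get_close_matches(lowered, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def spell_check(text, vocab=VOCAB):
    """Correct each whitespace-separated token of the OCR output."""
    return " ".join(correct_word(token, vocab) for token in text.split())


def char_accuracy(predicted, reference):
    """Character-level similarity in [0, 1] between OCR output and ground truth."""
    return SequenceMatcher(None, predicted, reference).ratio()


if __name__ == "__main__":
    reference = "listening makes learning easier than reading"
    # Simulated OCR output in which 'i' was misread as 'l'.
    raw = "listenlng makes learnlng easier than readlng"
    fixed = spell_check(raw)
    print(fixed)
    print(char_accuracy(raw, reference), char_accuracy(fixed, reference))
```

In this sketch, correction operates word by word, so it can repair isolated character confusions but not merged or split words; those cases would need a separate treatment.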