Producing content for Massive Open Online Course (MOOC) platforms like Coursera and EdX can be academically rewarding (and potentially lucrative), but it's time-consuming, particularly where videos are involved. Producing professional-level lecture clips requires not only operating a veritable studio's worth of equipment, but also devoting significant resources to transferring, editing, and uploading footage of each lesson.
That's why research scientists at Udacity, an online learning platform with over 100,000 courses, are investigating a machine learning framework that automatically generates lecture videos from audio narration alone. They claim in a preprint paper ("LumièreNet: Lecture Video Synthesis from Audio") on Arxiv.org that their AI system, LumièreNet, can synthesize footage of any length by directly mapping between audio and corresponding visuals.
"In the current video production pipeline, an AI apparatus which semi (or fully) automates lecture video production at scale would be highly valuable to enable agile video content development (rather than re-shooting each new video)," wrote the paper's coauthors. "To [this] end, we propose a new method to synthesize lecture videos from any length of audio narration: … A simple, modular, and fully neural network-based [AI] which produces an instructor's full pose lecture video given the audio narration input, which has not been addressed before from a deep learning perspective as far as we know."
The researchers' model has a pose estimation component that synthesizes body figure images from video frames extracted from a training data set, chiefly by detecting and localizing major body keypoints to create detailed surface-based human body representations. A second module in the model, a bidirectional long short-term memory (BLSTM) network, which processes data in order (forward and backward) so that each output reflects both the inputs that precede and follow it, takes audio features as input and attempts to learn the relationship between them and visual elements.
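To make the audio-to-visual mapping concrete, here is a minimal sketch of what such a bidirectional LSTM module might look like in PyTorch. This is not Udacity's actual code; the feature dimensions (80-dim audio features, a 128-dim per-frame pose code) and layer sizes are illustrative assumptions.

```python
# Hypothetical sketch of an audio-to-pose BLSTM like the one described in
# the LumièreNet paper. Dimensions and names are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn

class AudioToPoseBLSTM(nn.Module):
    def __init__(self, audio_dim=80, hidden_dim=256, pose_code_dim=128):
        super().__init__()
        # Bidirectional: each timestep's output reflects both past and
        # future audio context, which helps produce smooth body motion.
        self.blstm = nn.LSTM(audio_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Project the concatenated forward/backward states to a compact
        # per-frame code from which pose images could later be decoded.
        self.head = nn.Linear(2 * hidden_dim, pose_code_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim)
        out, _ = self.blstm(audio_feats)
        return self.head(out)  # (batch, frames, pose_code_dim)

model = AudioToPoseBLSTM()
feats = torch.randn(1, 100, 80)  # e.g. 100 frames of 80-dim audio features
codes = model(feats)
print(codes.shape)
```

Because the network is bidirectional, it cannot run strictly in real time (it needs the full narration up front), but for offline lecture synthesis that trade-off buys smoother, context-aware motion.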
To test LumièreNet, the researchers filmed an instructor's lectures for around eight hours at Udacity's in-house studio, which yielded roughly four hours of video and two narrations for training and validation. They report that the trained AI system produces "convincing" clips with smooth body gestures and realistic hair, but that its creations (two of which are here and here) likely won't fool most observers. Because the pose estimator can't capture fine details like eye movement, lips, hair, and clothing, synthesized instructors rarely blink and move their mouths unnaturally. Worse, their eyes sometimes look in different directions and their fingers always appear oddly blurry.
The team posits that adding "face keypoints" (i.e., fine details) could lead to better synthesis, though, and notes that, fortunately, the system's modular design allows each component to be trained and improved independently.
"[There are] many future directions feasible to explore," wrote the researchers. "Though our approach is developed with primary intents to support agile video content development which is critical in current online MOOC courses, we acknowledge there could be potential misuse of the technologies … We hope that our results will catalyze new developments of deep learning technologies for industrial video content production."