FWIW, here's a basic skeleton outlining how to do this:
<block myblock>
/ trials = [1-4=sequence(initiatecoinflip, coinflipresult)]
</block>
<trial initiatecoinflip>
/ stimulusframes = [1=clickoncoin]
/ validresponse = (clickoncoin)
/ inputdevice = mouse
</trial>
<trial coinflipresult>
/ stimulustimes = [0=coininmotion; 1500=coinresult]
/ validresponse = (noresponse)
/ trialduration = 3000
</trial>
<text clickoncoin>
/ items = ("Imagine I'm a picture of a coin. Click me.")
/ size = (25%, 25%)
</text>
<text coininmotion>
/ items = ("Imagine I'm a video of a coin being flipped.")
/ size = (25%, 25%)
</text>
<text coinresult>
/ items = ("Result: Heads", "Result: Tails")
/ size = (25%, 25%)
/ select = replace
</text>
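As you can see, it's fairly straightforward. Note that each flip takes two trials (the click, then the result), so [1-4=sequence(...)] in the <block> runs two complete flips; raise the upper bound if you need more. The exact same logic applies if you use <picture> and/or <video> elements instead of <text>.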
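For illustration, here's a rough sketch of what the stimulus definitions might look like with actual media. It's untested, and the file names (coin.jpg, coinflip.mp4, heads.jpg, tails.jpg) are just placeholders I made up, so substitute your own. Because the element names stay the same, these would replace the three <text> elements above and the <block> and <trial> definitions wouldn't need to change:
<picture clickoncoin>
/ items = ("coin.jpg")
/ size = (25%, 25%)
</picture>
<video coininmotion>
/ items = ("coinflip.mp4")
/ size = (25%, 25%)
</video>
<picture coinresult>
/ items = ("heads.jpg", "tails.jpg")
/ size = (25%, 25%)
/ select = replace
</picture>
If you go this route, you'd probably want to adjust the 1500 onset of coinresult in / stimulustimes (and the / trialduration) to match the length of your video clip, so the result doesn't appear while the coin is still in the air.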