The beginning until "well" works pretty ... well... There's some stuff we could tone down but I would concentrate on other areas right now.
For instance, take the pause between "well" and "he came..." There's a lip-smacking and swallowing sound right before the "he came..." part, and it sounds like the guy is nervous or having trouble getting the words out (trouble mentally, not physically). So I would adjust the acting during that pause accordingly. Right now it looks like he just had a sip of hot chocolate and is really savoring the taste. He looks pretty relaxed too.
The "two men" gesture is a bit too on the nose for me, too acted out. I think you could do something more subtle, or at least a different, more original gesture. Same with the last acting bit (lifting the eyebrow).
It all works as it is now, but I feel like I have seen clips with a guy acting all cool before. You could use the sound to introduce more character and emotion. Think about the context of the shot. Who is the guy who's talking? And who is he talking to? Where are they? How is the content of the audio clip affecting the character's behaviour?
"He came in this morning... with two men. Big guys." Is that "he" in the same room as them, but off screen? Does the speaker need to keep that information secret? Should he not be telling the other guy? So maybe they are in a public place and he's more cautious as he reveals that information? Maybe he looks around? What if a waiter comes by and serves them their beverages, and that's why there is that long pause? The guy could start talking but then stop because of the waiter, then look at him nervously/angrily/anxiously until he leaves (maybe wave him off impatiently), etc.
That's just a suggestion, but hopefully it gets you going idea-wise.