Two sample frames from our dataset. In frame a, individuals are fighting, while in frame b, people are greeting each other. Considering only low-level visual features, a and b are similar; however, they are completely different once crowd emotion is also considered as a high-level semantic representation. Moreover, although both frames belong to the “congestion” behavior class, the individuals in frame a are “angry”, while those in frame b are “happy”.

