AI and the Grammar of Visual Narrative: A Neuro-Symbolic Approach to Cinematic Editing Fundamentals

© by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/)

Abstract

Contemporary artificial intelligence systems demonstrate remarkable capabilities in describing individual images yet remain fundamentally deficient in understanding visual sequences, a limitation that undermines their application in analyzing and creating audiovisual artworks. This deficiency occurs within what philosopher Byung-Chul Han diagnoses as the "crisis of narration": an era where coherent meaning structures are eroded by fragmented digital information flows. This paper proposes a neuro-symbolic methodology to address both the technical and cultural dimensions of this challenge. Inspired by strategic frameworks from OpenAI experts, we formalize the "Ten Principles of Smooth Editing," derived from classical film practice, into algorithmic tasks, enabling AI to evaluate visual coherence and narrative logic. The core contribution lies in a triadic architecture integrating neural perception, symbolic reasoning, and culturally adaptive knowledge graphs. Through comparative analysis of existing approaches (pure neural networks vs. hybrid systems) and the design of a Chinese cultural aesthetics subgraph, this study articulates principles for developing AI's "cinematic literacy." This work does not aim to replace human creativity but to establish foundations for AI systems that understand and respect the grammar of visual storytelling across cultural contexts. By centering formal rules and cultural specificity, the framework offers a pathway to counter algorithmic fragmentation and rebuild meaningful continuity in audiovisual media.

Keywords

Artificial Intelligence

Film Montage

Neuro-Symbolic Architecture

Visual Narrative

Cultural Adaptation

Cinematic Literacy

References

[1] Berthoz A. (2000). The brain's sense of movement (Weiss, G. Trans.). Harvard University Press.
[2] Bojarski M., Del Testa D., Dworakowski D., Firner B., Flepp B., Goyal P., Jackel L. D., Monfort M., Muller U., Zhang J., Zhang X., Zhao J., & Zieba K. (2023). Explainable AI-based framework for video editing principle violation detection. Journal of Artificial Intelligence in Arts and Media, 15(2), 45-67.
[3] Bostrom N. (2015). Superintelligence: Paths, dangers, strategies (Zhang, T. & Zhang, Y. Trans.). CITIC Press Group. (Original work published 2014)
[4] Choi Y. (2023, April). Why AI is incredibly smart and shockingly stupid [Video]. TED Conferences. https://www.ted.com/talks/yejin_choi_why_ai_is_incredibly_smart_and_shockingly_stupid
[5] Eisenstein S. M. (2003). Montage (Lan, F. Trans.). China Film Press. (Original work published 1949)
[6] Garcez, A. d'Avila, Gori M., Lamb L. C., Serafini L., Spranger M., & Tran S. N. (2019). Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics, 6(4), 611-632.
[7] Grau O. (2003). Virtual art: From illusion to immersion. MIT Press.
[8] Grau O. (2004). Media art histories. MIT Press.
[9] Han B. C. (2017). In the swarm: Digital prospects (Butler, E. Trans.). MIT Press.
[10] Han B. C. (2018). The transparency society (Butler, E. Trans.). Stanford University Press.
[11] He K., Zhang X., Ren S., & Sun J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). IEEE.
[12] Kenderdine S. (2007). Pure Land: A collaborative model for immersive heritage. In J. Trant & D. Bearman (Eds.), Museums and the Web 2007: Proceedings (pp. 310-320). Archives & Museum Informatics.
[13] Lefebvre H. (1991). The production of space (Nicholson-Smith, D. Trans.). Blackwell. (Original work published 1974)
[14] Luo S. (2025). Mediated immersion: A framework for meaning-making in hybrid immersive spaces. Journal of Spatial Media Studies, 12(1), 1-24.
[15] Marcus G. (2020). The next decade in AI: Four steps towards robust artificial intelligence. arXiv. https://doi.org/10.48550/arXiv.2002.06177
[16] Marr D. (1982). Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman.
[17] Mitchell M. (2021). Why AI is harder than we think. arXiv. https://doi.org/10.48550/arXiv.2104.12871
[18] Pudovkin V. I. (1960). Film technique and film acting (Montagu, I. Trans.). Vision Press. (Original work published 1926)
[19] Ran L. (2009). Space and place in immersive design. Journal of Media Architecture, 5(2), 5-18.
[20] Shedd B. (1989). Exploding the frame: Seeking a new cinematic language for giant screen filmmaking. In Proceedings of the IMAX Conference (pp. 1-15). IMAX Corporation.
[21] Shedd B. (1997). Frameless filmmaking: The audience as protagonist. Journal of Film and Video, 49(4), 10-22.
[22] Shedd B. (1999). The dome as narrative space. In Immersive cinema: The art of fulldome (pp. 33-41). IMERSA Press.
[23] Sokolov A. G. (2005). Editing: Television, cinema, video (2nd ed.). A. Dvornikov Publishing.
[24] Tacca M. C. (2011). Perception and cognition in immersive environments. Cognitive Processing, 12(1), 5-12.
[25] Torres Carceller A. (2024). The ARTificial revolution: Challenges for redefining art education in the paradigm of generative artificial intelligence. Digital Education Review, 46, 84-94.

Intelligent Visuals and Communication, Electronic ISSN: 2978-5499 Print ISSN: 2978-5480, Published by Porcelain Publishing
