
Application of Embedding Method for both the Study of Binary Codes

Sudhanshu Pandey

Abstract


Binary code analysis is essential for analyzing programs when the original source code is not available. Binaries are difficult to study because they vary widely: with the proliferation of technology vendors, source code is now often compiled for several instruction set architectures (ISAs), yet no systematic dictionary translates between their assembly languages. Compiler optimizations and obfuscated application identifiers make analysis harder still. Because of such details, certain bugs can only be found at a fine-grained level. Recent advances in deep learning for Natural Language Processing (NLP) may provide a solution: deep learning models can process large texts and encode the semantics of individual words into vectors, called word embeddings, that are useful for data visualization and interpretation. By treating assembly as a language and instructions as words, we use NLP ideas to produce embeddings of individual instructions. Specifically, we build on existing models that are either single-architecture or suffer from performance problems when handling several architectures. This research proposes a cross-architecture instruction embedding model that jointly encodes instruction semantics from different ISAs, so that similar instructions are embedded close together both within and across architectures. Results demonstrate that our model is effective at extracting semantics from binaries alone and that our embeddings capture semantic equivalences across multiple architectures. When combined, these instruction embeddings can represent the meaning of functions or basic blocks; thus, this model may prove useful for cross-architecture bug, ransomware, and plagiarism detection.
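The idea of combining instruction embeddings into a representation of a basic block can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-picked rather than learned, the instruction strings and the `EMBEDDINGS` table are hypothetical, and summation followed by cosine similarity is only one plausible way to combine and compare embeddings, not necessarily the method used in this work.

```python
import math

# Hypothetical 3-dimensional instruction embeddings, hand-picked for illustration.
# A trained cross-architecture model would learn such vectors; these values are
# not taken from the paper.
EMBEDDINGS = {
    "x86:mov eax, ebx": [0.9, 0.1, 0.0],
    "arm:mov r0, r1": [0.8, 0.2, 0.1],
    "x86:add eax, 1": [0.1, 0.9, 0.0],
    "arm:add r0, r0, #1": [0.2, 0.8, 0.1],
    "x86:ret": [0.0, 0.1, 0.9],
}


def block_embedding(instructions):
    """Combine instruction vectors into one block vector by summation (an assumption)."""
    dims = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dims
    for instruction in instructions:
        for i, value in enumerate(EMBEDDINGS[instruction]):
            total[i] += value
    return total


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two basic blocks with the same meaning on different architectures,
# plus an unrelated block for contrast.
x86_block = block_embedding(["x86:mov eax, ebx", "x86:add eax, 1"])
arm_block = block_embedding(["arm:mov r0, r1", "arm:add r0, r0, #1"])
other_block = block_embedding(["x86:ret"])

print("x86 vs ARM block:", round(cosine(x86_block, arm_block), 3))
print("x86 vs unrelated:", round(cosine(x86_block, other_block), 3))
```

With these toy vectors, the two semantically equivalent blocks score much higher than the unrelated pair, which is the property a cross-architecture embedding is meant to provide.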

 




