Software forensics


Software forensics is the science of analyzing software source code or binary code to determine whether intellectual property infringement or theft occurred. It is often the centerpiece of lawsuits, trials, and settlements when companies are in dispute over issues involving software patents, copyrights, and trade secrets. Software forensics tools can compare code to determine correlation, a measure of similarity that a software forensics expert can use to guide further analysis.

Past methods of software forensics

Past methods of code comparison included hashing, statistical analysis, text matching, and tokenization. These methods compared software code and produced a single measure indicating whether copying had occurred. However, these measures were not reliable enough to be admissible in court for three reasons: the results were imprecise, the algorithms could be easily fooled by simple substitutions in the code, and the methods did not take into account the fact that code could be similar for reasons other than copying.
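The weakness to simple substitutions can be illustrated with a minimal Python sketch. The helper names (file_hash, tokens, token_overlap) and the sample code strings are hypothetical and chosen only for illustration; they do not reproduce any particular historical tool.

```python
import hashlib
import re


def file_hash(code: str) -> str:
    """Hash the whole text; any single-character change alters the digest."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def tokens(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets: one number for the whole comparison."""
    ta, tb = set(tokens(a)), set(tokens(b))
    return len(ta & tb) / len(ta | tb)


# Hypothetical example: the second line is the first with every variable renamed.
original = "total = price * quantity"
renamed = "t = p * q"

hashes_match = file_hash(original) == file_hash(renamed)
overlap = token_overlap(original, renamed)

print(hashes_match)  # False
print(overlap)       # 0.25
```

The two lines perform the same computation, yet the hashes differ and the only shared tokens are the two operators. A single score of this kind also gives no indication of why two files are similar or different.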

Robert Zeidman

In 2003, Robert Zeidman developed algorithms for multidimensional software correlation, which he incorporated in the CodeSuite tool. These algorithms divide software code into basic elements and determine which elements are similar, or “correlated.” He also created a procedure for filtering and interpreting the correlations to eliminate reasons for correlation that are not due to copying: common algorithms, third-party code, common identifier names, common author, and automatic code generation. The combination of these algorithms and procedures resulted in a more reliable and quantitative analysis than was available previously. Zeidman’s book, "The Software IP Detective's Handbook," is considered the standard textbook for software forensics, and the CodeSuite tools and methodology have been used in many software IP litigations, including ConnectU v. Facebook, Symantec v. IRS Baker & McKenzie, and SCO Group, Inc. v. International Business Machines Corp.
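The idea of dividing code into elements and scoring each kind separately can be sketched in Python as follows. This is an illustrative simplification and not the CodeSuite algorithm: the choice of three element types, the KEYWORDS list, the "//" comment convention, and the scoring formula are all assumptions made for this example.

```python
import re

# Assumed keyword list for a C-like language; a real tool would use a full one.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}


def split_elements(source: str) -> dict[str, set[str]]:
    """Divide source text into statements, comments, and identifiers."""
    statements, comments, identifiers = set(), set(), set()
    for line in source.splitlines():
        # Naive split: treats the first "//" as the start of a comment,
        # even if it appears inside a string literal.
        code, _, comment = line.partition("//")
        code, comment = code.strip(), comment.strip()
        if comment:
            comments.add(comment)
        if code:
            statements.add(" ".join(code.split()))  # normalize whitespace
            for name in re.findall(r"[A-Za-z_]\w*", code):
                if name not in KEYWORDS:
                    identifiers.add(name)
    return {"statements": statements, "comments": comments,
            "identifiers": identifiers}


def correlate(a: str, b: str) -> dict[str, float]:
    """Score each element type separately instead of producing one number."""
    ea, eb = split_elements(a), split_elements(b)
    scores = {}
    for kind in ea:
        smaller = min(len(ea[kind]), len(eb[kind]))
        scores[kind] = len(ea[kind] & eb[kind]) / smaller if smaller else 0.0
    return scores


# Hypothetical files: the second renames one variable but keeps the comment.
file_a = "int total = 0; // running sum\ntotal = total + price;"
file_b = "int sum = 0; // running sum\nsum = sum + price;"

scores = correlate(file_a, file_b)
print(scores)  # {'statements': 0.0, 'comments': 1.0, 'identifiers': 0.5}
```

In this example the renaming drives the statement score to zero, but the identical comment and the shared identifier still register, which is the kind of signal a single combined measure would hide.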

Copyright infringement

After software tools have been used to compare code and measure the amount of correlation, an expert can use an iterative filtering process to determine whether the correlated code is due to third-party code, code generation tools, commonly used names, common algorithms, common programmers, or copying. If the correlation is due to copying, and the copier did not have authorization from the rights holder, then copyright infringement occurred.
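The filtering step can be pictured as repeatedly removing correlated elements that have an innocent explanation. The Python sketch below is a simplified illustration under assumed inputs: the function name filter_correlations, the sample identifiers, and the contents of the explanation sets are hypothetical, and in practice building those sets is part of the expert's work.

```python
def filter_correlations(
    correlated: set[str],
    explanations: dict[str, set[str]],
) -> tuple[dict[str, set[str]], set[str]]:
    """Remove correlated elements that a known, non-copying reason accounts for.

    Returns the elements explained by each reason and the elements left over.
    """
    remaining = set(correlated)
    explained: dict[str, set[str]] = {}
    for reason, known in explanations.items():
        hits = remaining & known
        if hits:
            explained[reason] = hits
            remaining -= hits
    return explained, remaining


# Hypothetical identifiers found in both parties' code.
correlated = {"printf", "i", "parse_header", "zq_decode_frame"}

# Hypothetical reference sets assembled by the expert.
explanations = {
    "third-party code": {"parse_header"},
    "commonly used names": {"i", "printf", "count"},
}

explained, remaining = filter_correlations(correlated, explanations)
print(explained)
print(remaining)  # {'zq_decode_frame'}
```

The elements left in remaining are not proof of copying by themselves. They are the residue that the other explanations did not account for and that the expert must still examine.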

Trade secret protection and infringement

Software can contain trade secrets, which provide a competitive advantage to a business. To determine trade secret theft, the same tools and processes that are used to detect copyright infringement can be applied. If code was copied without authorization, and that code has the characteristics of a trade secret (it is not generally known, the business keeps it secret, and its secrecy maintains its value to the business), then the copying constitutes trade secret theft.

Trade secret theft can also involve the taking of code functionality without literally copying the code. Comparing code functionality is a very difficult problem that no algorithm has yet solved in reasonable time. For this reason, finding the theft of code functionality is still mostly a manual process.

Patent infringement

As with trade secret functionality, it is not currently possible to scientifically detect software patent infringement, because software patents cover a general implementation rather than specific source code. For example, a program that implements a patented invention can be written in many available programming languages, using different function names and variable names and performing operations in different sequences. There are so many ways to implement an invention in software that even the most powerful modern computers cannot consider every combination of code that might infringe a patent. This work is still left to human experts using their knowledge and experience, but it is a problem that many in software forensics are trying to automate by finding an algorithm or a simplifying process.
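A rough back-of-the-envelope count shows how quickly the number of possible implementations grows. The Python sketch below rests on deliberately simplified assumptions that are not drawn from any real case: the invention is written as 20 statements that can be reordered freely, there are 5 candidate languages, and renaming is ignored entirely.

```python
import math


def count_variants(languages: int, reorderable_statements: int) -> int:
    """Count variants under the simplifying assumption that every statement
    can be freely reordered and each language yields one distinct program."""
    return languages * math.factorial(reorderable_statements)


# Assumed figures for illustration only.
variants = count_variants(languages=5, reorderable_statements=20)
print(variants)  # 12164510040883200000
```

Even this toy model yields more than ten quintillion variants before different function and variable names are considered, which suggests why exhaustive comparison is not practical.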

Objective facts before subjective evidence

One important rule of any forensic analysis is that the objective facts must be considered first. Reviewing comments in the code, or searching the Internet for information about the companies that distribute the code and the programmers who wrote it, is useful only after the objective facts regarding correlation have been considered. Once an analysis has been performed using forensic tools and procedures, analysts can then begin looking at subjective evidence such as comments in the code. If the information in that subjective evidence conflicts with the objective analysis, analysts need to doubt the subjective evidence. Fake copyright notices, open source notifications, or programmer names that were added to source code after copying took place, in order to disguise the copying, are not uncommon in real-world cases of code theft.