To build our first antivirus, we first need to know about viruses. Every antivirus program needs to constantly update its database to defend against new virus threats. Day by day, not only are our security systems getting smarter, but so are the viruses.

Let's see some examples of viruses:

Polymorphic virus - This type of virus is hard to detect because it changes its signature every time it creates a replica. Signature-based antivirus software can take more than a few days to detect this type of virus.
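To see why a changing signature defeats a simple lookup, here is a toy sketch. It assumes the "signature" is just a SHA-256 hash of the file contents, which is a simplification; the payload bytes and the `known_signatures` set are made up for illustration.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a stand-in 'signature'."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical signature database built from one known sample.
known_sample = b"MALICIOUS_PAYLOAD_v1"
known_signatures = {signature(known_sample)}

# A replica that differs by a single junk byte (assumed mutation).
mutated_sample = known_sample + b"\x90"


def is_known(payload: bytes) -> bool:
    """Check a payload against the hash-based signature database."""
    return signature(payload) in known_signatures


print("exact copy detected:  ", is_known(known_sample))
print("mutated copy detected:", is_known(mutated_sample))
```

The exact copy matches the database, while the one-byte variant does not, which is why hash-style signatures alone struggle with this virus type.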

Worm - A worm is a type of virus that replicates itself across connected computers, consuming the bandwidth and computing power of every host machine.

There are more types of viruses out there.

WannaCrypt, or WannaCry, is one of the best-known virus attacks. This type of virus is called ransomware.

import pandas as pd
import numpy as np
import pickle
import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
import sklearn.ensemble as ske
# model_selection replaces the removed sklearn.cross_validation module
from sklearn import model_selection, tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

These viruses are created by malicious hackers to gain access to your computer and use its resources for their own purposes.

Malware analysis mainly follows two approaches:

1. Static Approach

2. Dynamic Approach

The static approach is code-based: the antivirus software inspects the code of the program to determine whether it is malicious or safe.
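As a rough illustration of the static approach, the sketch below computes two features from a program's raw bytes without running it. Byte entropy is a commonly used static feature, since packed or encrypted executables tend to have high entropy, but treating it as one of this article's features is an assumption; the function names are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Collect simple features from file contents without executing them."""
    return {"size": len(data), "entropy": byte_entropy(data)}


# Made-up inputs: repetitive bytes versus every byte value exactly once.
print(static_features(b"A" * 64))
print(static_features(bytes(range(256))))
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to `static_features`.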

The dynamic approach observes the tasks a program performs while it is running to determine whether it is malicious or safe.

So, let’s start building our antivirus software using machine learning.

We are going to use these libraries to build our antivirus.

Pandas is for data analysis.

NumPy is used for handling the data as numerical arrays. Sklearn ensemble provides ensemble classifiers such as random forests. Pickle is used to save our trained classifier as a byte string.

The first thing we need to do is load the dataset, in which we stored the features and labels of different programs, saved in CSV format on our local machine. Each row of this CSV file carries one of two possible labels: legitimate or malicious. Then we print the total number of features per row.
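The loading step described above can be sketched as follows. The column names, the sample rows, and the `label` column are all assumptions made for illustration; an in-memory string stands in for the CSV file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; real column names will differ.
csv_text = """size,entropy,num_sections,label
1024,4.2,3,legitimate
20480,7.8,6,malicious
4096,5.1,4,legitimate
"""

# For a file on disk this would be pd.read_csv("<path to your CSV>").
data = pd.read_csv(io.StringIO(csv_text))

# Split into the feature matrix X and binary labels y (1 = malicious).
X = data.drop(columns=["label"]).values
y = (data["label"] == "malicious").astype(int).values

print("Samples:", X.shape[0])
print("Features per row:", X.shape[1])
```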

After that, we are going to set up multiple classifiers to determine which one works best on our dataset. Below I’ve given the whole Python program. Please install these dependencies to run the script.

1. Pandas

2. Numpy

3. Pickle (ships with Python's standard library, so no separate install is needed)

4. Scipy

5. Scikit-learn

You can use pip or Anaconda to install these dependencies.
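With the dependencies installed, the classifier comparison described earlier might look roughly like the sketch below. It is not the original script: it uses synthetic data from `make_classification` in place of the real CSV, and the three candidate models and their parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate classifiers to compare; the choice of models is illustrative.
candidates = {
    "DecisionTree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GaussianNB": GaussianNB(),
}

# Train each model and record its accuracy on the held-out test split.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")

# Keep the classifier with the highest test accuracy.
best_name = max(scores, key=scores.get)
best_clf = candidates[best_name]
print("Best classifier:", best_name)
```

Accuracy on a single held-out split is a simple selection criterion; cross-validation would give a more stable comparison at the cost of extra training time.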

So let's jump right into the program.

This program saves the best classifier for us as a pickle file in the classifier directory. Now we will look at our main program, where we use our trained classifier to detect whether a program is malicious.
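Saving a trained classifier with pickle and loading it back later could be sketched like this. The toy training data, the temporary output directory, and the file name `classifier.pkl` are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Toy training data (assumption): the label equals the first feature.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Write into a "classifier" directory under a temporary folder.
out_dir = os.path.join(tempfile.mkdtemp(), "classifier")
os.makedirs(out_dir, exist_ok=True)
model_path = os.path.join(out_dir, "classifier.pkl")  # hypothetical name

# Serialize the trained model to disk as a byte stream.
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Later, for example in the main program: load the model and predict.
# Only unpickle files you trust, since pickle can execute arbitrary code.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

prediction = restored.predict([[1, 1]])[0]
print("malicious" if prediction == 1 else "legitimate")
```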

Here is the main program: