The verispy package has two main purposes:
- Allow the user to extract VERIS JSON objects into a pandas DataFrame structure. The most likely source of VERIS data is the VERIS Community Database (VCDB).
- Provide some basic data analysis functionality for the DataFrame, including calculating the count and frequency of given enumerations and plotting a simple horizontal bar chart.
To install this package, either git clone this repository and run:

python -m pip install <path>/verispy/

Or simply use:

pip install verispy
You will also need to download the VCDB data:
git clone https://github.com/vz-risk/VCDB.git
After installing, creating a VERIS object is simple. We just need the path to the VCDB json directory:
In [1]: from verispy import VERIS
In [2]: data_dir = '../VCDB/data/json/validated/'
In [3]: v = VERIS(json_dir=data_dir)
We may wish to verify that the VERIS schema URL is correct. Note that the schema comes from the VERIS repo on GitHub. If you cannot connect to the internet every time you run verispy, you can download the schema and point to your local copy with the schema_path parameter of the json_to_df function.
In [4]: v.schema_url
Out[4]: 'https://raw.githubusercontent.com/vz-risk/veris/master/verisc-merged.json'
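If you do download the schema, the offline call might look like this (a minimal sketch; the local path below is illustrative, not part of verispy):

# Offline variant (sketch): point json_to_df at a local copy of the
# merged VERIS schema instead of fetching it from GitHub.
# The path is illustrative only.
veris_df = v.json_to_df(schema_path='path/to/verisc-merged.json')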
Then, we can load the VERIS data from the JSON and assign to a DataFrame:
In [5]: veris_df = v.json_to_df(verbose=True)
Loading schema
Loading JSON files to DataFrame.
Finished loading JSON files to dataframe.
Building DataFrame with enumerations.
Done building DataFrame with enumerations.
Post-Processing DataFrame (A4 Names, Victim Industries, Patterns)
Finished building VERIS DataFrame
Then, we might want to inspect our DataFrame:
In [6]: veris_df.shape
Out[6]: (7839, 2315)
In [7]: veris_df.head()
Out[7]:
action.Environmental action.Error ... victim.state victim.victim_id
0 False False ... NJ C.R. Bard, Inc.
1 False True ... NaN British Columbia Ministry of Finance
2 False False ... NaN NaN
3 False False ... FL Camberwell High School
4 False False ... NaN Loyalist Certification Services Exams
[5 rows x 2315 columns]
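The enumeration columns are ordinary pandas columns, so standard indexing applies; for example, pulling out a few of the columns shown in the head() output above:

# Select a handful of columns from the VERIS DataFrame
subset = veris_df[['action.Error', 'victim.state', 'victim.victim_id']]
subset.head()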
Do a quick value count on one of the enumerations:
In [8]: veris_df['action.malware.variety.Ransomware'].value_counts()
Out[8]:
False 7716
True 123
Name: action.malware.variety.Ransomware, dtype: int64
Most of the enumerations are True/False values.
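Because the enumeration columns are boolean, simple aggregations give proportions directly; for instance, the column mean is the fraction of incidents with that enumeration:

# Mean of a boolean column = proportion of incidents flagged True
veris_df['action.malware.variety.Ransomware'].mean()  # 123 / 7839, about 0.0157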
To see a dictionary of the enumerations, look at the enumerations attribute of the VERIS object:
In [9]: len(v.enumerations)
Out[9]: 68
In [10]: import pprint
In [11]: pprint.pprint(v.enumerations)
{'action.environmental.variety': ['Deterioration',
'Earthquake',
'EMI',
'ESD',
'Fire',
'Flood',
'Hazmat',
'Humidity',
'Hurricane',
'Ice',
'Landslide',
'Leak',
'Lightning',
'Meteorite',
'Particulates',
'Pathogen',
'Power failure',
'Temperature',
'Tornado',
'Tsunami',
'Vermin',
'Volcano',
'Wind',
'Other',
'Unknown'],
'action.error.variety': ['Capacity shortage',
'Classification error',
'Data entry error',
'Disposal error',
'Gaffe',
'Loss',
...
# many more lines
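Because v.enumerations is a plain dictionary keyed by enumeration name, ordinary dict operations work; for example, to list just the action-related enumerations:

# List the enumeration names under the 'action' hierarchy
action_enums = [name for name in v.enumerations if name.startswith('action.')]
len(action_enums), action_enums[:3]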
The enum_summary function (get enumerations with confidence intervals) is the main analysis function in verispy.
We can look at top-level enumerations:
In [12]: v.enum_summary(veris_df, 'action')
Out[12]:
enum x n freq
0 Error 2268 7629.0 0.29729
1 Hacking 2079 7629.0 0.27251
2 Misuse 1604 7629.0 0.21025
3 Physical 1517 7629.0 0.19885
4 Malware 635 7629.0 0.08324
5 Social 517 7629.0 0.06777
6 Environmental 8 7629.0 0.00105
7 Unknown 210 NaN NaN
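Reading this table: x is the number of incidents with that enumeration, n is the number of incidents where the value is known (the 210 Unknown incidents are excluded from the denominator), and freq = x / n. A quick arithmetic check:

# freq for the 'Error' row is x / n
2268 / 7629   # 0.29729
# n excludes the Unknown incidents: 7839 total - 210 Unknown
7839 - 210    # 7629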
Or lower-level enumerations:
In [13]: v.enum_summary(veris_df, 'action.social.variety')
Out[13]:
enum x n freq
0 Phishing 350 501.0 0.69860
1 Bribery 51 501.0 0.10180
2 Pretexting 41 501.0 0.08184
3 Extortion 33 501.0 0.06587
4 Forgery 16 501.0 0.03194
5 Influence 13 501.0 0.02595
6 Other 10 501.0 0.01996
7 Baiting 2 501.0 0.00399
8 Elicitation 2 501.0 0.00399
9 Propaganda 2 501.0 0.00399
10 Scam 2 501.0 0.00399
11 Spam 1 501.0 0.00200
12 Unknown 16 NaN NaN
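Since enum_summary returns a DataFrame (as noted below), the result can be filtered like any other; for example, to keep only the varieties seen in at least 5% of known incidents:

# Keep only the common social varieties
social_df = v.enum_summary(veris_df, 'action.social.variety')
social_df[social_df['freq'] >= 0.05]   # Phishing, Bribery, Pretexting, Extortion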
We can pass a second variable as the by parameter, which subsets the enumeration by each value of the "by" variable:
In [14]: v.enum_summary(veris_df, 'action', by='attribute')
Out[14]:
by enum x n freq
0 attribute.Availability Physical 1153 2350.0 0.49064
1 attribute.Availability Hacking 664 2350.0 0.28255
2 attribute.Availability Error 446 2350.0 0.18979
3 attribute.Availability Malware 138 2350.0 0.05872
4 attribute.Availability Misuse 67 2350.0 0.02851
5 attribute.Availability Social 59 2350.0 0.02511
6 attribute.Availability Environmental 8 2350.0 0.00340
7 attribute.Availability Unknown 5 NaN NaN
8 attribute.Confidentiality Error 2231 7057.0 0.31614
9 attribute.Confidentiality Hacking 1684 7057.0 0.23863
10 attribute.Confidentiality Misuse 1552 7057.0 0.21992
11 attribute.Confidentiality Physical 1492 7057.0 0.21142
12 attribute.Confidentiality Malware 555 7057.0 0.07865
13 attribute.Confidentiality Social 459 7057.0 0.06504
14 attribute.Confidentiality Environmental 2 7057.0 0.00028
15 attribute.Confidentiality Unknown 198 NaN NaN
16 attribute.Integrity Hacking 916 1833.0 0.49973
17 attribute.Integrity Malware 635 1833.0 0.34643
18 attribute.Integrity Social 517 1833.0 0.28205
19 attribute.Integrity Physical 321 1833.0 0.17512
20 attribute.Integrity Misuse 257 1833.0 0.14021
21 attribute.Integrity Error 35 1833.0 0.01909
22 attribute.Integrity Environmental 0 1833.0 0.00000
23 attribute.Integrity Unknown 15 NaN NaN
Note that within each group the freq values can sum to more than 1, because a single incident may involve multiple actions. We can add a confidence interval by specifying ci_method (currently supported methods: wilson, normal, or agresti_coull; see https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_confint.html for more information):
In [15]: v.enum_summary(veris_df, 'action.social.variety', ci_method='wilson')
Out[15]:
enum x n freq method lower upper
0 Phishing 350 501.0 0.69860 wilson 0.65704 0.73715
1 Bribery 51 501.0 0.10180 wilson 0.07828 0.13138
2 Pretexting 41 501.0 0.08184 wilson 0.06090 0.10914
3 Extortion 33 501.0 0.06587 wilson 0.04728 0.09106
4 Forgery 16 501.0 0.03194 wilson 0.01975 0.05124
5 Influence 13 501.0 0.02595 wilson 0.01523 0.04388
6 Other 10 501.0 0.01996 wilson 0.01088 0.03635
7 Baiting 2 501.0 0.00399 wilson 0.00110 0.01444
8 Elicitation 2 501.0 0.00399 wilson 0.00110 0.01444
9 Propaganda 2 501.0 0.00399 wilson 0.00110 0.01444
10 Scam 2 501.0 0.00399 wilson 0.00110 0.01444
11 Spam 1 501.0 0.00200 wilson 0.00035 0.01122
12 Unknown 16 NaN NaN wilson NaN NaN
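These intervals match what statsmodels' proportion_confint (linked above) produces; reproducing the Phishing row by hand:

from statsmodels.stats.proportion import proportion_confint

# Wilson score interval for Phishing: 350 of 501 known social incidents
proportion_confint(count=350, nobs=501, alpha=0.05, method='wilson')
# (0.65704..., 0.73715...), matching the lower/upper columns above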
And we can change the confidence interval width with ci_level (the default is 0.95):
In [16]: v.enum_summary(veris_df, 'action.social.variety', ci_method='wilson', ci_level=0.5)
Out[16]:
enum x n freq method lower upper
0 Phishing 350 501.0 0.69860 wilson 0.68460 0.71224
1 Bribery 51 501.0 0.10180 wilson 0.09304 0.11127
2 Pretexting 41 501.0 0.08184 wilson 0.07395 0.09048
3 Extortion 33 501.0 0.06587 wilson 0.05878 0.07374
4 Forgery 16 501.0 0.03194 wilson 0.02705 0.03767
5 Influence 13 501.0 0.02595 wilson 0.02157 0.03119
6 Other 10 501.0 0.01996 wilson 0.01616 0.02463
7 Baiting 2 501.0 0.00399 wilson 0.00249 0.00639
8 Elicitation 2 501.0 0.00399 wilson 0.00249 0.00639
9 Propaganda 2 501.0 0.00399 wilson 0.00249 0.00639
10 Scam 2 501.0 0.00399 wilson 0.00249 0.00639
11 Spam 1 501.0 0.00200 wilson 0.00103 0.00387
12 Unknown 16 NaN NaN wilson NaN NaN
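The ci_level presumably maps to alpha = 1 - ci_level under the hood (an assumption about verispy's internals, but the numbers agree):

from statsmodels.stats.proportion import proportion_confint

# 50% Wilson interval for Phishing, i.e. alpha = 1 - 0.5;
# the alpha mapping is an assumption about verispy's internals
proportion_confint(count=350, nobs=501, alpha=0.5, method='wilson')
# (0.68460..., 0.71224...), matching the table above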
The enum_summary function returns a DataFrame. With this enumeration DataFrame, we can then draw a simple horizontal bar chart with the plot_barchart function:
In [17]: actionci_df = v.enum_summary(veris_df, 'action')
In [18]: action_fig = v.plot_barchart(actionci_df, 'Actions')
In [19]: action_fig.show()
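Assuming plot_barchart returns a standard matplotlib Figure (which the fig.show() call above suggests, though that is an assumption), the chart can also be saved to disk:

# savefig is standard matplotlib; whether plot_barchart returns a
# Figure is an assumption here
action_fig.savefig('actions_barchart.png', dpi=150, bbox_inches='tight')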
Another useful feature of the verispy package is the df_to_matrix function, which converts the VERIS DataFrame into a matrix of boolean values for selected enumerations. This feature is inspired by the blog post DBIR Data-Driven Cover by Jay Jacobs, which discusses the DBIR "Patterns" originally described in the 2014 DBIR.
In his post, Jay points the reader to a GitHub gist with a function he wrote, getpatternlist.R. We have converted that R function to Python; it may be found at this gist: dbir_patterns.py.
Using this function, we can create a DataFrame with the DBIR patterns:
import pandas as pd

def get_pattern(df):
    """Generates the DBIR "patterns," with liberal inspiration from getpatternlist.R:
    https://gist.github.com/jayjacobs/a145cb87551f551fc719

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with most VERIS encodings already built (from the verispy package).

    Returns
    -------
    pd.DataFrame with the patterns. Not returned as part of the original VERIS DataFrame.
    """
    skimmer = df['action.physical.variety.Skimmer'] | \
        (df['action.physical.variety.Tampering'] & df['attribute.confidentiality.data.variety.Payment'])
    espionage = df['actor.external.motive.Espionage'] | df['actor.external.variety.State-affiliated']
    # ... (more lines, see gist)
In [23]: patterns = get_pattern(veris_df)
In [24]: patterns['pattern'].value_counts()
Out[24]:
Miscellaneous Errors 1814
Privilege Misuse 1597
Lost and Stolen Assets 1460
Everything Else 1028
Web Applications 896
Payment Card Skimmers 278
Crimeware 268
Cyber-Espionage 248
Denial of Service 162
Point of Sale 88
Name: pattern, dtype: int64
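As a sanity check, the pattern counts sum to 7,839, so every incident in veris_df is assigned exactly one pattern:

# Every incident gets exactly one pattern
patterns['pattern'].value_counts().sum() == len(veris_df)   # True: both are 7839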
From here, we can go back to our veris_df DataFrame and make a boolean VERIS matrix:
In [25]: vmat = v.df_to_matrix(veris_df)
In [26]: vmat
Out[26]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 1, 1, ..., 0, 1, 0],
...,
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0]])
In [27]: vmat.shape
Out[27]: (7839, 569)
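The result is a plain numpy array of 0/1 values, so quick summaries come cheap:

# Overall density: fraction of (incident, enumeration) cells set to 1
vmat.mean()
# Number of enumerations present in each of the first five incidents
vmat.sum(axis=1)[:5]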
Then, we can apply t-SNE, a dimensionality-reduction technique, using scikit-learn. The following operation may take several minutes:
In [28]: from sklearn.manifold import TSNE
In [29]: tsne = TSNE(n_components=2, random_state=42)
In [30]: v_tsne = tsne.fit_transform(vmat)
Finally, we can create the following plot, which we have colored by DBIR "Pattern", using Seaborn:
In [31]: import seaborn as sns
In [32]: import pandas as pd
In [33]: import matplotlib.pyplot as plt
In [34]: tsne_df = pd.DataFrame({'x':v_tsne[:, 0], 'y':v_tsne[:, 1], 'pattern':patterns['pattern']})
In [35]: tsne_df.head()
Out[35]:
x y pattern
0 0.411892 -34.907738 Privilege Misuse
1 29.374905 12.816430 Miscellaneous Errors
2 -63.858070 -47.406250 Cyber-Espionage
3 -58.987106 7.611073 Web Applications
4 -75.674927 7.452817 Web Applications
In [36]: tsne_centers = tsne_df.groupby(by='pattern').mean()
...: tsne_centers['pattern'] = tsne_centers.index
In [37]: p1 = sns.lmplot(x='x', y='y', data=tsne_df, fit_reg=False, hue='pattern',
...: scatter_kws={'alpha':0.25}, size=6)
...:
In [38]: def label_point(df, ax):
...: for i, point in df.iterrows():
...: ax.text(point['x'] - 30, point['y'], point['pattern'])
...:
In [39]: label_point(tsne_centers, plt.gca())
In [40]: plt.show()
Unit tested with pytest:
(veris) verispy $ pytest
======================================================= test session starts ========================================================
platform darwin -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /Users/tylerbyers/src/verispy, inifile:
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
collected 8 items
verispy/tests/test_veris.py ........ [100%]
==================================================== 8 passed in 11.50 seconds =====================================================