
Preventing the Financing of Terrorism with Machine Learning and Blockchain Data

September 3, 2021



Graph technology is the best way to maintain context for explainability. It offers a human-friendly way to evaluate connected data, enabling human overseers to better map and visualize AI decision paths. In this article, we explain graph databases and how we used them to build a terrorist knowledge graph that identifies terrorist organizations and the funding behind them. That was the goal of the challenge "Preventing the Financing of Terrorism by Engineering an ML Model with Financial Crime Data".

Terrorism financing originates from both illegal and legal activities, where the persons involved are not necessarily concealing the source of funds but disguising the nature of the financed activity. Although the detection of suspicious activities (e.g., money laundering, terrorism financing, fraud) is typically investigated with machine learning, graph analytics, thanks to its focus on relationships, is an efficient complement: it yields effective predictors for determining and flagging fraudulent records, surfaces meaningful relationships, and helps curate and prepare data before it is actually used.

“Graphs are changing the rules for machine learning. Traditionally, data must be structured as a 2D matrix or 3D tensor for analysis, and feature dimensionality reduction is often used.

Graphs, on the other hand, are an intuitive, adaptable, and efficient way to represent knowledge and relationships in an unlimited number of dimensions, where each type of relationship represents an additional dimension, leading to a potentially richer source of knowledge.”

The goal is to build a knowledge graph by integrating knowledge and data from different sources, analyze its structure, and extract key insights from the data. This can be achieved with the pipeline below:


Graph visualization task pipeline – Source: Omdena

Given the need for highly connected data, the importance of retrieving data rather than just storing it, the inconsistency of the data model, and the potential demand for frequent changes, we decided that a graph database would be a powerful tool for graph analysis and visualization.

Why Graph Databases?

A graph database (GDB) is a NoSQL database that stores graph-structured data as a collection of nodes and edges, where edges represent the relationships between nodes.

Graph databases require no schema or ontology and can readily accept data in any structure, enabling quick re-thinking of the analytics space. By extracting entities, relationships, and facts from different sources and creating a cognitive model based on a knowledge graph, powerful new answers and intuitions can be reached. Graph databases are optimal systems for investigative journeys through knowledge graphs, especially over interconnected data under common semantics.

To store nodes and edges, we selected the Neo4j database. Several factors influenced this decision:

  • Neo4j is fast to deploy and manage and friendly for newcomers; these features were particularly helpful given the short project time and the heterogeneous experience of the team members.
  • Neo4j has a free, open-source Community Edition. It ships with a graph analytics workspace, Neo4j Browser, that provides graph visualization and exploration together with the Cypher query language, which is very easy to learn and operates across Neo4j.

Node and Relationship are the Neo4j terms for entity/vertex and edge. The basic elements of a Neo4j graph are nodes, relationships, labels, and properties.


Example of graph data structure – Source: Omdena

Labels provide categorical information. In the example above, the label Perp is a set-membership marker that groups nodes (e.g., "Hamas"): all nodes carrying the label Perp belong to the same set.

Properties are used to associate specific information with individual nodes.

Relationships must have both a type and a direction. Here the relationship HAS_SIMILAR_NAME connects nodes from the Org set with nodes from the Perp set. Just like nodes, relationships can have properties that express specific attributes.
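To make these elements concrete, here is a minimal Python sketch (the connection details and property values are placeholders, not the project's actual data) that creates a Perp node, an Org node, and a HAS_SIMILAR_NAME relationship carrying its own property:

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A node labeled Perp with a name property, a node labeled Org,
    # and a typed, directed relationship with a property of its own.
    session.run(
        """
        MERGE (p:Perp {name: $perp_name})
        MERGE (o:Org {name: $org_name})
        MERGE (o)-[r:HAS_SIMILAR_NAME]->(p)
        ON CREATE SET r.algorithm = $algorithm
        """,
        perp_name="Hamas", org_name="HAMAS", algorithm="exactMatch",
    )

driver.close()

Using MERGE rather than CREATE keeps the statement idempotent: re-running it does not duplicate nodes or relationships.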

Neo4j setup guidelines

The Neo4j Graph Data Platform offers different tools for graph database development and deployment; in this project, we explored Neo4j Desktop and Neo4j Server. Since Neo4j is based on Java, a JVM must be installed before installing Neo4j. All installation packages are available on the official Neo4j download page.

To set up a Neo4j database on an Azure Ubuntu server, we followed the official guide.

To test and connect locally, we installed Neo4j Desktop (free with registration), which includes a free development license for the Enterprise Edition, allowing us to use Neo4j Enterprise on a local desktop for developing applications. It is for local development only: Neo4j Desktop is not intended or licensed for deployment or use as a server (the server versions of Neo4j Community and Enterprise exist for that instead).

Each Neo4j server (in the Community Edition) can currently host a single Neo4j database.

For our purposes, we used two Neo4j add-ons:

  • Neo4j APOC Library ("Awesome Procedures On Cypher"): we installed the library by copying the apoc*-all.jar file into the neo4j plugins/ directory, then enabled its functions in the Neo4j config file (a quick check that the library loaded is sketched below the list).
  • Neo4j Graph Data Science Library, installed following this guide.
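As a quick sanity check that the APOC jar was picked up, one can call an APOC function through the Python driver. This is a sketch with placeholder connection details:

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # apoc.version() only resolves if the APOC library was loaded from plugins/.
    record = session.run("RETURN apoc.version() AS version").single()
    print("APOC version:", record["version"])

driver.close()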

Data transformation

Data collection from various open-source datasets, pre-processing, and the import of the data into the Neo4j graph database are detailed below.

Datasets used

For the graph analysis and visualization task, several datasets were taken into account and were imported into a graph database (Neo4j):

The global terrorism database (GTD)

This database contains information on multiple dimensions about more than 200,000 international and domestic terrorist attacks that have occurred worldwide since 1970. More than 100 structured variables characterize each attack’s location, tactics and weapons, targets, perpetrators, casualties and consequences, and general information such as definitional criteria and links between coordinated attacks. Unstructured variables include summary descriptions of the attacks and more detailed information on the weapons used, specific motives of the attackers, property damage, and ransom demands (where applicable).

Although the GTD is an open-source database, the National Consortium for the Study of Terrorism and Responses to Terrorism is strict about its ownership of the data; users can therefore only download the dataset by following the links on the "Access the GTD" pages.

We explored a subset of the GTD (data from 2010-2019), whose columns of interest were ingested into Neo4j instances as nodes, properties, or relationships. Several ingestion strategies were tested; we considered three:

1) Using LOAD CSV to get the data into our query

2) Using py2neo and Python

3) Using the officially supported Python driver from Neo4j

Since py2neo is not optimized for large imports, we combined strategies 1 and 3 to import the data and create nodes and relationships.

For example, we applied strategy 3 to create the relationship IS_RESPONSIBLE_FOR between Event and Perpetrator nodes, with the relationship property claimed:

import os
from pathlib import Path

import pandas as pd
from loguru import logger
from neo4j import GraphDatabase

project_dir = str(Path(__file__).resolve().parents[0])
data_dir = os.path.join(project_dir, "data")
logger.add(os.path.join(data_dir, "Perp_Event_Relation.log"))

logger.info("Dataset loading")
GTD_df = pd.read_csv(os.path.join(data_dir, "globalterrorismdb_0221dist.csv"))
GTD_df_small = GTD_df[GTD_df["iyear"] >= 2010]
logger.info("The dataset is loaded")

uri = "bolt://server_address:7687"
user, password = ("username", "password")
database_name = "databaseName"

driver = GraphDatabase.driver(uri, auth=(user, password))
logger.info("Connected to database")

# Create the relationship and set the 'claimed' property on first creation.
query_create_rel_att = """
MATCH (e:Event {eventid: $eid}), (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""

nn = 0
with driver.session(database=database_name) as session:
    for ix, row in GTD_df_small.iterrows():
        nn += 1
        if nn % 1000 == 0:
            logger.info(f"{nn} loaded")
        # Primary perpetrator group (gname) for every event.
        session.run(query_create_rel_att, eid=row["eventid"],
                    claimed=row["claimed"], name_perpetrator=row["gname"])
        # Second and third groups, when present (gname2/gname3 are NaN otherwise).
        nan_entries = row.isna()
        if not nan_entries["gname2"]:
            session.run(query_create_rel_att, eid=row["eventid"],
                        claimed=row["claim2"], name_perpetrator=row["gname2"])
        if not nan_entries["gname3"]:
            session.run(query_create_rel_att, eid=row["eventid"],
                        claimed=row["claim3"], name_perpetrator=row["gname3"])

driver.close()
logger.info(f"Total loaded: {nn}")

Sanction lists

Economic sanctions are an important part of the fight against financial crime for AML regulators. Governments and international authorities publish sanctions lists to target persons engaged in illegal activities. Sanction lists include individuals, organizations, vessels, and more.

The sanction lists from the authorities/countries considered here (UN, EU, US [OFAC non-SDN], and Canada) came in XML/PDF/CSV formats. To get one consolidated sanction-list dataset, we processed all the XML files with Python scripts that populate CSV columns from the data in the XML elements.
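As an illustration of that step, here is a sketch of flattening one such XML file with Python's standard library. The file name, element names, and column names are placeholders; each authority publishes its own schema, so the paths must be adapted per list:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse("sanction_list_un.xml")  # placeholder file name
root = tree.getroot()

with open("Sanction_List_UN.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["NAME", "TYPE", "NATIONALITY", "LIST_TYPE"])
    # "INDIVIDUAL" and the child tags below are illustrative element names.
    for entry in root.iter("INDIVIDUAL"):
        writer.writerow([
            entry.findtext("NAME", default=""),
            "Individual",
            entry.findtext("NATIONALITY", default=""),
            "UN",
        ])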

All the generated CSV files were consolidated into a single file, Sanction_List.csv, which can be imported into a relational or graph database. The individuals and organizations were imported separately following strategy 1, LOAD CSV import statements. For example, one can use this Cypher statement to create nodes labeled individual with the individual details:

LOAD CSV WITH HEADERS FROM 'file:///Sanction_List.csv' AS row
WITH row WHERE row.TYPE = 'Individual'
MERGE (i:individual {name: row.NAME})
SET i.nationality = row.NATIONALITY, i.Alias = row.INDIVIDUAL_ALIAS,
i.address = row.INDIVIDUAL_ADDRESS, i.DateOfBirth = row.INDIVIDUAL_DATE_OF_BIRTH,
i.comments = row.COMMENTS
RETURN count(i)

Furthermore, we created sanction authority nodes and established relationships between individuals/organizations and the authorities, using strategy 1 and running this Cypher LOAD CSV import statement:

LOAD CSV WITH HEADERS FROM 'file:///Sanction_List.csv' AS row WITH row.NAME AS name, row.LIST_TYPE AS listType
MATCH (i:individual {name: name}) 
MATCH (n:sancAuth {type: listType})
MERGE (i)-[rel:SanctionBy]->(n)
RETURN count(rel)

Panama papers

The Panama Papers are over 11 million documents leaked from Mossack Fonseca, one of the largest law firms in the world specializing in offshore accounts and the incorporation of shell companies, and published through ICIJ investigations. The common link among the individuals named is that they used shell companies and offshore accounts to shield their wealth from their home governments. The ICIJ recommends caution before drawing any conclusions, as many people and entities have the same or similar names, and suggests confirming the identities of any individuals or entities found in the database using addresses or other identifying information. Hence the data referenced in this case study should be taken to indicate a risk that needs further investigation.

The zip file containing the node/edge CSV files was downloaded from the ICIJ website. To import the data into the Neo4j database, these CSV files need to be pre-processed: two of them (panama_papers.nodes.address.csv and panama_papers.nodes.officer.csv) contained unwanted characters that had to be removed with the terminal command tr -d, as mentioned in the reference.

The final CSV files were placed in the import folder of the Neo4j server instance. The Cypher statements were run one by one to load the nodes and edges of the Panama dataset into the Neo4j graph database, following strategy 1 (LOAD CSV import).

For example, the Cypher statement used to import all the address nodes is given below:

:auto USING PERIODIC COMMIT 10000  
LOAD CSV WITH HEADERS FROM "file:///panama_papers.nodes.addressm.csv" AS row  
MERGE (n:Node {node_id:row.node_id})  
ON CREATE SET n = row, n:Address;

Twitter data

The Twitter data is the result of combining two datasets: (1) the Global Terrorism Database and (2) the sanction lists. The combination of these two datasets (where the sanction lists acted as a filter on the GTD) was used as input to the Twitter API. The result was a Twitter dataset containing all relevant tweets for each event in the filtered GTD. Based on this Twitter dataset we created a sentiment analysis script.

This script extracted a sentiment score for each tweet associated with a specific event. It first extracted label scores such as hate, non-hate, anger, non-anger, neutral, positive, and negative; these scores were then combined into a single sentiment score per tweet. Having one score per tweet made it easy (by sorting in descending order) to single out people who were either involved in a specific event or whose tweets were in favor of a specific attack. We imported the Twitter data into Neo4j following strategy 1, with a LOAD CSV Cypher statement that creates Tweet nodes for tweets whose sentiment score exceeds a threshold.
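The exact weighting used by the project script is not reproduced here; the sketch below only illustrates the idea of collapsing per-label scores into one number and sorting by it. The aggregation formula and the assumption of one column per label are hypothetical:

import pandas as pd

def combined_sentiment(scores: dict) -> float:
    # Hypothetical aggregation: hostile/supportive labels raise the score,
    # the neutral label lowers it.
    return (scores.get("hate", 0.0) + scores.get("anger", 0.0)
            + scores.get("positive", 0.0) - scores.get("neutral", 0.0))

tweets = pd.read_csv("TwitterDataForaEvent.csv")
tweets["sentiment_analysis"] = tweets.apply(
    lambda row: combined_sentiment(row.to_dict()), axis=1)

# Sorting in descending order surfaces the tweets most supportive of an attack.
tweets = tweets.sort_values("sentiment_analysis", ascending=False)
print(tweets[["username", "tweet", "sentiment_analysis"]].head())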

First, we created Tweet nodes (only for tweets with a sentiment analysis score above 0.4) using this Cypher statement:

LOAD CSV WITH HEADERS FROM "file:///TwitterDataForaEvent.csv" 
AS row WITH row WHERE toFloat(row.sentiment_analysis) > 0.4 
MERGE (n:Tweet {name : row.name}) 
SET n.tweet=row.tweet,n.username = row.username,n.sentiment_score=row.sentiment_analysi,
n.tweet_date=row.date,n.link=row.link,n.attack_ref=row.attack 
RETURN count(*)

To create a relationship with the event, we used this Cypher statement:

MATCH (t:Tweet {attack_ref:'2019-01-09-West Bank-Jerusalem'}) 
MATCH (n:Event)-[:EVENT_LOCATED_IN]->(r)-[]->(c) 
WHERE n.iyear=2019 AND n.imonth=1 AND n.iday=9 AND c.provstate='West Bank' 
MERGE (n)<-[rel:Tweet_is_about]-(t) 
RETURN count(*)

Crypto data


Wallets that received funds – Source: Omdena

To address cyber-enhanced funding campaigns and activities, we populated the graph with wallets associated with sanctioned individuals and with wallets appearing on social media in crowdfunding campaigns for financing terrorist and sanctioned entities. Wallet addresses are unique identifiers and are stored in nodes. From the blockchains, it is possible to retrieve all the transactions that moved funds from these wallets to other wallets.

So the graph is further populated with all the wallets that received funds, and a relationship carrying the amount of the transaction(s) connects the wallets. A third iteration is also considered: wallets with a very large number of transactions (>10k) are included, but their transactions are ignored, because such nodes are very likely associated with institutional entities and exchanges.
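A minimal sketch of this ingestion rule, assuming a DataFrame of extracted transactions; the CSV path, the column names (from_wallet, to_wallet, amount, to_tx_count), the relationship type TRANSFERRED_TO, and the connection details are all placeholders:

import pandas as pd
from neo4j import GraphDatabase

transactions = pd.read_csv("wallet_transactions.csv")  # placeholder path

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for _, tx in transactions.iterrows():
        # Wallets with >10k transactions are likely exchanges or other
        # institutional entities: keep the node, skip the transaction edge.
        if tx["to_tx_count"] > 10_000:
            session.run("MERGE (:Wallet {address: $addr})", addr=tx["to_wallet"])
            continue
        session.run(
            """
            MERGE (a:Wallet {address: $from_addr})
            MERGE (b:Wallet {address: $to_addr})
            MERGE (a)-[t:TRANSFERRED_TO]->(b)
            ON CREATE SET t.amount = $amount
            """,
            from_addr=tx["from_wallet"], to_addr=tx["to_wallet"],
            amount=tx["amount"],
        )
driver.close()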

Mapping datasets of interest into terrorist knowledge graph

Connecting entities and relationships from different data sources (structured and unstructured) can reveal insights at the intersection of node-link and geospatial analysis, so we decided to consolidate them into a knowledge graph. Data consolidation means integrating data from multiple sources into a single destination, with the aim of deriving important information easily and quickly and increasing the efficiency and productivity of certain business processes. We chose an integration strategy that is not optimized for storage space but allows an iterative approach: information from different sources is stored as many times as it appears in the sources (i.e., as different nodes). Then, if two pieces of information are recognized to coincide, a relationship between the corresponding nodes is introduced.

Even if this approach is not storage-efficient, it allows us to proceed iteratively, which can be important in the absence of unique identifiers. It allows, for example, conveying in the relationship the level of confidence in the identification. Moreover, it allows different entity-identification algorithms that can take data-quality problems into account (e.g., spelling errors, typos, or bad encodings).

For example, an individual could appear both in the Panama Papers and in several watch lists. In this case, we create a node for each source and then check whether the records can be considered, with a certain confidence, the same individual. If so, we introduce a relationship between the nodes.

Because of time restrictions, in the process of named-entity recognition (NER) we explored single-field identification by name similarity and direct identification (a sketch of these measures follows the list):

  • Name similarity with Jaro–Winkler distance (ignoring case)
  • Name similarity with Levenshtein distance (ignoring case)
  • Name equivalence (ignoring case)
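A sketch of these three measures in Python, using the jellyfish library (one of several implementations of such string metrics; normalizing the Levenshtein distance to a 0..1 similarity mirrors what apoc.text.levenshteinSimilarity does). The example names are illustrative:

import jellyfish

def name_similarity(a: str, b: str) -> dict:
    """Case-insensitive similarity scores between two names."""
    a, b = a.lower(), b.lower()
    lev = jellyfish.levenshtein_distance(a, b)
    return {
        "jaro_winkler": jellyfish.jaro_winkler_similarity(a, b),
        # Normalize edit distance to a 0..1 similarity score.
        "levenshtein": 1 - lev / max(len(a), len(b)),
        "exact": a == b,
    }

print(name_similarity("Hamas", "HAMAS"))        # exact match ignoring case
print(name_similarity("Al-Qaida", "Al Qaeda"))  # high similarity, not exact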


Once two names are identified as sufficiently similar, a relationship between the two corresponding nodes is created, with properties recording the algorithm that identified the match, the chosen threshold, and the actual similarity between the nodes.

To reduce false positives, one can explore multi-field similarities and leverage unique identifiers such as telephone numbers, bank accounts, and passport IDs.

This can be achieved, for example, by including these data when present in the watch lists and running Cypher queries.

Here is the Cypher statement for creating similarity relationships based on Levenshtein similarity:

MATCH (o:org), (p:Perpetrator)
WITH o, p, apoc.text.levenshteinSimilarity(toLower(p.name), toLower(o.name)) AS distance
WHERE distance > 0.7
MERGE (o)-[r:HAS_SIMILAR_NAME {algorithm: "levenshteinSimilarity"}]->(p)
ON CREATE SET r.distance = distance, r.threshold = 0.7

By connecting different sources into the knowledge graph, we faced a few challenges:

  • Query execution time: similarity relationships are computationally expensive, since all pairs of nodes must be compared, roughly O(N²).
  • Dealing with different types of data (unstructured data needs preprocessing).

To establish relationships between the Panama Papers and the sanction list, the names of individuals and of entities/organizations common to both datasets were identified. Relationships were then created for matching individual/organization names using these Cypher statements:

MATCH (n:individual), (o:Officer) WHERE toLower(n.name) = toLower(o.name)
CREATE (n)-[r:sancIndividual]->(o)
RETURN r

MATCH (e:Entity), (o:org) WHERE toLower(e.name) = toLower(o.name)
CREATE (e)-[r:sancOrg]->(o)
RETURN r

Graph analysis of a complex system reveals many interesting and non-intuitive discoveries about semantic relations between different entities.

Combining the iterative approach with the graph structure makes it possible to identify and investigate relationships, between entities or between entities and financial transactions, that with traditional relational databases would require tailored SQL queries involving several JOIN operations. In a knowledge graph, such relationships are instead apparent when looking at connected components.

We performed a graph analysis by executing a few Cypher queries to learn more about hidden patterns and analyzed connections between entities in the network (Section: Results and insights).

Results and insights

Financing is a constant in terrorism, required to ensure the sustainability of organizational operations and individual activities. Social media and new payment methods have become attractive tools for terrorist organizations, and according to the report of the U.S. Attorney General's Cyber Digital Task Force, "terrorist groups have solicited cryptocurrency donations running into the millions of dollars via online social media campaigns." [U.S. Dep't of Just., Report of the Attorney General's Cyber Digital Task Force 51 (2020). Available at: www.justice.gov/archives/ag/page/file/1326061/download.]

Sources of terrorist funding include, but are not limited to, low-level fraud, kidnapping for ransom, the misuse of non-profit organizations, the illicit trade in commodities (such as oil, charcoal, diamonds, gold, and the narcotic “captagon”), and digital currencies. [https://www.interpol.int/Crimes/Terrorism/Tracing-terrorist-finances]

Understanding financing flows is essential to developing effective counter-terrorist financing measures, regardless of the platform or method employed. Insights into the financial management strategies of terrorist organizations can be derived by visualizing the funding of previous attacks, from the source of funds through intermediaries to other distribution points (i.e., funding streams to organizations, operations, and individuals), and thus help prevent future attacks.

Therefore, we investigated semantically interconnected information from one or multiple datasets and derived several insights about the following:

Collaborator network among Perpetrators

Collaborator network – Source: Omdena

We searched for all collaborator networks between terrorist groups with this Cypher query:

MATCH p=()-[r:IS_COLLABORATOR]->() RETURN p LIMIT 25

Organizations from the sanction list dataset that are present in (have a similar name to) the perpetrators of the GTD dataset can be retrieved with this Cypher query:

MATCH p=()-[r:HAS_SIMILAR_NAME]->() RETURN p LIMIT 25

Identification of organizations present in the sanction list – Source: Omdena

The appearance of tweets with respect to a certain event (Event: 08.11.2019, Perpetrator: Hamas) is shown after running this Cypher query:

MATCH (n:Tweet)-[]->(e:Event)-->(p:Perpetrator) WHERE p.name <> 'Unknown'
RETURN n, e, p

Appearance of tweets with respect to a certain event – Source: Omdena

The shortest path between any officer in the Panama Papers and any individual mentioned in the sanction list, used to determine whether a person had any direct or indirect relationship with a sanctioned individual, can be retrieved with the query below:

MATCH (a:Officer),(b:individual) WHERE a.name CONTAINS 'NIGEL RICHARD JAMES COWIE' 
MATCH p=allShortestPaths((a)-[:officer_of|intermediary_of|registered_address|sancIndividual*..10]-(b)) 
RETURN p

The shortest path between an officer and a sanctioned individual – Source: Omdena

Financial transactions between sanctioned entities and terrorist organizations

MATCH (p:Perpetrator), (e:entity) WITH p, e
MATCH r=allShortestPaths((p)-[rel:IS_OWNED_BY|GSTransaction*..9]-(e))
RETURN r

We found a transaction between a wallet provided in a crowdfunding campaign and a wallet owned by an entity sanctioned by OFAC:


Relationship between wallets – Source: Omdena

Neo4j Streamlit Integration

The knowledge graph can be displayed in a Streamlit dashboard app using the Streamlit agraph component. The data for the graph can be retrieved from the graph database using the Neo4j Python driver.

In this case, we took a sample CSV file with the columns transactionnumber, transaction_description, t_from, and t_to, representing a suspicious transaction report (STR). A field on the Streamlit page displays all the transaction numbers in a dropdown.

Depending on the transaction number selected, the corresponding t_from and t_to names are analyzed for any relationships/connections within the datasets, and the graph output, if any, is displayed in separate columns (a layout sketch follows).
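A sketch of this layout with standard Streamlit widgets; the CSV path is a placeholder and the Cypher lookups are elided, since the full query flow is walked through below:

import pandas as pd
import streamlit as st

str_df = pd.read_csv("suspicious_transaction_report.csv")  # placeholder path

tx_number = st.selectbox("Transaction number", str_df["transactionnumber"])
selected = str_df[str_df["transactionnumber"] == tx_number].iloc[0]

col_from, col_to = st.columns(2)
with col_from:
    st.subheader(f"Connections for {selected['t_from']}")
    # ... run the Cypher lookup for t_from and render the result with agraph
with col_to:
    st.subheader(f"Connections for {selected['t_to']}")
    # ... run the Cypher lookup for t_to and render the result with agraph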

Import the following libraries:

import streamlit as st
from neo4j import GraphDatabase 
import pandas as pd 
from streamlit_agraph import agraph, TripleStore, Config

We use the Neo4j driver to connect to the graph database instance:

driver = GraphDatabase.driver("bolt://IP:Host", auth=("name", "pwd"))

The key class of agraph used here is the TripleStore; it serves as the central data store. The TripleStore holds three sets, containing Node, Edge, and Triple objects respectively.

New triples can be added to the TripleStore with this method:

store = TripleStore() 
store.add_triple(node1, link, node2)

We need to pass the source node, the edge, and the target node to the triple store.

The Cypher query statement is designed to retrieve these three elements. The sample query used here is as follows:

result2 = session.run("""MATCH (n:Officer)-[rel1]->(r:Entity)<-[rel2]- 
(n1:Officer)<-[rel3:sancIndividual]-(n2)-[rel4:SanctionBy]->(n3) 
WHERE n.name = '""" + name + """' 
RETURN n,r,n1,n2,n3,rel1,rel2,rel3,rel4""") 
rList = result2.data()

The list (rList) retrieved from the database is iterated to extract the starting_node-[:edge]->ending_node details, which are then added to the TripleStore.
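A sketch of that loop, assuming each node in the result comes back as a dictionary of properties keyed as in the RETURN clause, with a name property to use as the node identifier (the edge labels below are chosen for display, not pulled from the database):

store = TripleStore()
for record in rList:
    # One triple per pattern segment; directions follow the query above.
    store.add_triple(record["n"]["name"], "rel1", record["r"]["name"])
    store.add_triple(record["n1"]["name"], "rel2", record["r"]["name"])
    store.add_triple(record["n2"]["name"], "sancIndividual", record["n1"]["name"])
    store.add_triple(record["n2"]["name"], "SanctionBy", record["n3"]["name"])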

Configurations matching the display requirements are added and then passed to the agraph component to visualize the graph in the Streamlit app.

config = Config(height=500, width=700, nodeHighlightBehavior=True, 
highlightColor="#F7A7A6", directed=True, collapsible=True, 
node={'labelProperty': 'label'},link={'labelProperty': 'label', 'renderLabel': True}) 
agraph(list(store.getNodes()), list(store.getEdges()), config)

Streamlit dashboard to visualize financing terrorism using knowledge graph – Source: Omdena

Possible improvements

The relationships created in the sections above can be extended to the alias names of individuals/organizations in the sanction list.

For now, alias names are a property of the particular individual/organization node. The relationship can be extended by a) using APOC procedures or b) creating each alias as a separate node linked to the main individual/organization.

Creating an alias as a separate node via CSV import can be done using the statement below:

LOAD CSV WITH HEADERS FROM 'file:///UN_List_Consolidated.csv' AS row
UNWIND split(row.INDIVIDUAL_ALIAS, ';') AS aname
MATCH (i:individual {uid: row.DATAID})
MERGE (a:individual {name: aname, uid: row.DATAID})
MERGE (i)-[r:has_Alias]->(a)
RETURN aname, i.uid AS id

Further improvements include:

  • Exploring multi-field similarity (to reduce false positives), e.g., including DOB and LOB in the similarity measure.
  • Geographical proximity: identifying the proximity of addresses via geolocation. There are institutions that have been targeted, and companies that reside in the same building as sanctioned ones (e.g., highlighting possible intermediaries).
  • Using social-media data for enhanced NER (e.g., names <-> usernames <-> aliases).

Conclusion

Graph technology is the best way to maintain the context for explainability. It offers a human-friendly way to evaluate connected data, enabling human overseers to better map and visualize AI decision paths.

A decision knowledge graph does not drive actions directly but surfaces trends in the data, which can be used in several ways such as to extract a view or subgraph for specific analysis.

(Recommendation) Once we have the results from queries, graph-based visualization tools are helpful for exploring connections in graph data, but we can also use these outputs as training data for ML models.

For example, binary classification with a decision tree model can be used to predict whether an individual/entity in an STR report has any background in terrorism, shell companies, or illicit crypto connections; a sketch follows.
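Here is a sketch of that idea with scikit-learn; the CSV of graph-derived flags, the feature column names, and the is_risky target label are all hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features, e.g., flags produced by Cypher queries like the
# ones above (sanction links, shell-company links, crypto-wallet links).
data = pd.read_csv("str_graph_features.csv")  # placeholder path
features = ["has_sanction_link", "shell_company_links", "crypto_links"]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["is_risky"], test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))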

This article is written by Đurđica Vukić, Ramya Devarajan, and Inna Soltsman-Groysman.

Ready to test your skills?

If you’re interested in collaborating, apply to join an Omdena project at: https://www.omdena.com/projects
