Suspicious customer detection with LDA + Auto-Encoder part1

Suspicious customer detection is one application of fraud detection, or more broadly of anomaly detection. There are many data-driven approaches in this area. This blog introduces a less common approach as an alternative solution for suspicious customer detection: LDA + Auto-Encoder.

Different business units or business lines detect anomalous customers from different perspectives. In this blog, we focus only on the wire transfer service of a commercial bank. The general idea of a wire transfer is easy to understand. In a wire transfer chain, a bank can be the originator bank that initiates a transfer, the beneficiary bank that receives it, or an intermediary bank that passes it along. You can think of the intermediary bank scenario as in the following chart: a customer wants to send money from Bank of America to a friend whose bank is a small bank in China. However, Bank of America does not hold an account with that small bank, so the only way to complete the transfer is to go through Bank of China, which does have an account there. Bank of America sends the message/money to Bank of China, and Bank of China sends the message/money on to the small bank. In this case, Bank of China is the intermediary bank. For further information about the message/money flow between banks, please check this link.

Whatever your position in the chain (originator bank, beneficiary bank or intermediary bank), you need to monitor the behavior of customers. The beneficiary, the friend of the originator in the example, is not actually our customer, but we still monitor his behavior; such wire entities are usually called Pseudo Customers. Commonly, the originator and the beneficiary are not easy to monitor since they are not our real customers, but with this model their suspicious behavior becomes much easier to detect. Now that we know the background of wire transfers, let’s talk about the workflow of our model.

The final inputs to the Auto-Encoder are the topic probability vectors generated by LDA and the predictive features generated from both structured and unstructured data. This blog is Part 1, where we mainly introduce LDA. In Part 2, we will introduce the predictive features and the Auto-Encoder. LDA is a topic model from NLP that generates a topic probability vector (a soft clustering) for each document. In our case, every transaction is converted to a single word and every customer (entity) is converted to a single document. Suppose we have collected one year of transactions as the population: all transactions are words, all distinct entities are documents, and the transactions of a specific entity are the words in that document. We can conduct the conversion as in the following chart:
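The same conversion can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the production code: the field names (`entity`, `method`, `type`, `recv_pay`, `amt_bin`) and the toy transactions are made up for the example.

```python
from collections import defaultdict

# Toy transactions; every field name and value here is invented for illustration.
transactions = [
    {"entity": "ALICE", "method": "103", "type": "1001", "recv_pay": "P", "amt_bin": 3},
    {"entity": "ALICE", "method": "103", "type": "1001", "recv_pay": "P", "amt_bin": 7},
    {"entity": "BOB", "method": "FED", "type": "1002", "recv_pay": "R", "amt_bin": 3},
]

WORD_FIELDS = ("method", "type", "recv_pay", "amt_bin")


def to_word(txn):
    """Join the categorical fields of one transaction into a single token."""
    return "_".join(str(txn[field]) for field in WORD_FIELDS)


def build_documents(txns):
    """Collect the word of every transaction under the entity it belongs to."""
    docs = defaultdict(list)
    for txn in txns:
        docs[txn["entity"]].append(to_word(txn))
    return dict(docs)


documents = build_documents(transactions)
print(documents)
```

Each entity ends up with one list of tokens, which is exactly the "document" that LDA will later consume.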

Now that we have the big picture of the model, let’s check the code for more details. In the production environment, we extract the tokens from raw messages, which are unstructured data. To simplify the problem, suppose tokenization has been completed and the tokens have been stored in a database. We now read the tokens from a SQL or NoSQL database and apply some simple filters to extract the wire transfers.

from tqdm.notebook import tqdm
import sklearn
import pandas as pd
import numpy as np
import pyodbc 
import re

#Load Data
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=;DATABASE=;UID=;PWD=')
#pass the dates as query parameters instead of concatenating them into the SQL string
df=pd.read_sql_query('select * from txn where bookdate between ? and ?', conn, params=[startdate, enddate])
#Only select following features
df=df[['TranNo','Bene', 'ByOrder', 'BookDate', 'BaseAmt', 'Cust','Account','PaymtMethod','Type','RecvPay']]
#only select wire transfer type which is the type in 1001,1002,1003,1004
#remove MT202 txn which is bank to bank txn
Out[1]:['TranNo' 'Bene' 'ByOrder' 'BookDate' 'BaseAmt' 'Cust' 'Account'
 'PaymtMethod' 'Type' 'RecvPay']
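The two filter comments above (wire transfer types only, no MT202) can also be pushed into the query itself. Here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for SQL Server so that it runs anywhere. It assumes that `Type` holds the integer codes 1001 to 1004 and that MT202 transactions are stored as `'202'` in `PaymtMethod`; both are guesses about the schema, and the rows are invented.

```python
import sqlite3

# In-memory stand-in for the txn table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (TranNo INTEGER, BookDate TEXT, Type INTEGER, PaymtMethod TEXT)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?)",
    [
        (1, "2019-01-15", 1001, "103"),
        (2, "2019-02-01", 1003, "202"),    # bank-to-bank, should be dropped
        (3, "2019-03-10", 2005, "CHIPS"),  # not a wire type code, should be dropped
        (4, "2020-06-01", 1002, "FED"),    # outside the date range
        (5, "2019-07-20", 1004, "FED"),
    ],
)

# Assumed wire transfer type codes, taken from the comment in the listing above.
WIRE_TYPES = (1001, 1002, 1003, 1004)


def load_wire_transfers(conn, startdate, enddate):
    """Return wire-transfer rows booked in [startdate, enddate], excluding MT202."""
    placeholders = ",".join("?" for _ in WIRE_TYPES)
    sql = (
        "SELECT TranNo, BookDate, Type, PaymtMethod FROM txn "
        "WHERE BookDate BETWEEN ? AND ? "
        f"AND Type IN ({placeholders}) "
        "AND PaymtMethod <> ? "
        "ORDER BY TranNo"
    )
    return conn.execute(sql, (startdate, enddate, *WIRE_TYPES, "202")).fetchall()


rows = load_wire_transfers(conn, "2019-01-01", "2019-12-31")
print(rows)
```

Only the values are passed as parameters; the number of `?` placeholders is built from the length of `WIRE_TYPES`, so no user input is ever concatenated into the SQL text.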

Note that BookDate is the transfer date, BaseAmt is the transfer amount, and PaymtMethod is the payment method, which includes 103, 202, 202cov, CHIPS and FED. Type is the transfer type and covers all wire transfer type codes; RecvPay is a receive-or-pay indicator. To be converted into a word, BaseAmt has to be an integer, because a word cannot contain any continuous feature. We conduct the transformation as follows:

#transform amt into bucket
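def Amt_bin(amt,bin_num,rankAmt):
    #return the index of the first threshold that amt does not exceed
    for idx, th in enumerate(rankAmt):
        if amt<=th:
            return idx
    #amt is above every threshold, so it falls into the last bucket
    return bin_num

#set total bin_num as 20
bin_num=20
rankAmt=[df_tp['BaseAmt'].quantile(i/bin_num) for i in range(bin_num)]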

#generate new feature
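To see the bucketing on concrete numbers, here is a small standard-library sketch. It is a variation on the listing above rather than a copy: `statistics.quantiles` returns only the interior cut points (no 0-quantile threshold, unlike `rankAmt`), and `bisect_left` replaces the explicit loop in `Amt_bin`. The names `amount_thresholds` and `amount_bin` and the toy amounts are made up for the example.

```python
from bisect import bisect_left
from statistics import quantiles


def amount_thresholds(amounts, bin_num):
    """Return the bin_num - 1 interior quantile cut points of the amounts."""
    return quantiles(amounts, n=bin_num, method="inclusive")


def amount_bin(amt, thresholds):
    """Return the index of the first threshold that amt does not exceed.

    Amounts above every threshold land in the last bucket, len(thresholds).
    """
    return bisect_left(thresholds, amt)


# Toy amounts 1..100 and only 4 buckets, to keep the numbers easy to read.
amounts = list(range(1, 101))
thresholds = amount_thresholds(amounts, bin_num=4)
print(thresholds)
print([amount_bin(a, thresholds) for a in (10, 50, 75.25, 1000)])
```

Because the thresholds are sorted, the binary search gives the same bucket as scanning them one by one, which matters when the function is applied to millions of transactions.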

Note that the originator and beneficiary names are free-format text, so we need to apply fuzzy logic to normalize them. The code is as follows:

#pseudo_high_frequency_words contains all prefixes, suffixes, etc. of a company
def Normalization_Pseudo_Name(name):
    if name is not None:
        #replace every run of non-alphanumeric characters with a single space
        name=re.sub('[^0-9a-zA-Z]+',' ',name)
        for word in pseudo_high_frequency_words:
            name=re.sub(' '+word+' ',' ',name)
            name=re.sub(' '+word,' ',name)
        #drop all remaining spaces
        name=re.sub(' ','',name)
        return name
    return None

#generate normalized name
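As a quick illustration of what the normalization is meant to achieve, here is a token-based sketch. It is a variation, not the function above: it uppercases the name and removes company words only as whole tokens, and the word list is a hypothetical placeholder because the real `pseudo_high_frequency_words` list is not shown in this post.

```python
import re

# Hypothetical list of high-frequency company words; the real list is not shown here.
pseudo_high_frequency_words = ["LTD", "CO", "INC", "LLC"]


def normalize_name(name):
    """Strip punctuation, drop company words as whole tokens, and remove spaces."""
    if name is None:
        return None
    tokens = re.sub(r"[^0-9a-zA-Z]+", " ", name).upper().split()
    kept = [token for token in tokens if token not in pseudo_high_frequency_words]
    return "".join(kept)


print(normalize_name("Acme Trading Co., Ltd."))
print(normalize_name("ACME-TRADING CO LTD"))
```

Both spellings collapse to the same key, so their transactions end up in the same document. Matching whole tokens also avoids a side effect of the pattern `' '+word`, which would cut the first letters off a longer word such as " COMPANY" when the word list contains "CO".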

Now we can group originator and beneficiary by their normalized names.

#group byorder: list of (normalized originator name, transactions) pairs
Byorder_group=list(g_byorder)

#group bene: list of (normalized beneficiary name, transactions) pairs
Bene_group=list(g_bene)

#features used in byorder should exclude byorder name
       'PaymtMethod', 'Type', 'RecvPay', 'Norm_Bene','BaseAmt_Bin']

def words_generator(df,feature_used_in_word):
    #cast every feature to str before joining, since BaseAmt_Bin is an integer
    df['words']=df[feature_used_in_word].apply(lambda x: '_'.join(map(str,x)),axis=1)
    return df['words'].values
#generate all words for each originator, return [(originator1,[word1,word2.....]),...]
                feature_used_in_byorder)) for i in tqdm(range(Byorder_num))]

#features used in beneficiary should exclude beneficiary name
       'PaymtMethod', 'Type', 'RecvPay', 'Norm_ByOrder','BaseAmt_Bin']

#generate all words for each beneficiary, return [(beneficiary1,[word1,word2.....]),...]
                feature_used_in_bene)) for i in tqdm(range(Bene_num))]

Merge all originators and beneficiaries on the name, combine the words that share a name, and generate the distinct entity names together with the words under each distinct entity.

#create dataframe for originator and beneficiary

#merge two df
#combine words with same name
merge_word_df['words_merge']=merge_word_df.apply(lambda x: np.append(x['words_byorder'],
#delete nan from words_merge
#Do not use np.isnan() here: it raises an error on string input and only works for numeric data!
merge_word_df['words_merge']=merge_word_df['words_merge'].apply(lambda x: x[~pd.isnull(x)]) 
#keep two columns only
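The merge step can be pictured without pandas as well. The sketch below assumes the two inputs have the shape described earlier, `[(name, [word1, word2, ...]), ...]`; the function name `merge_entity_words` and the toy names and words are invented for the example.

```python
from collections import defaultdict


def merge_entity_words(byorder_words, bene_words):
    """Combine originator-side and beneficiary-side words under one entity name."""
    merged = defaultdict(list)
    for name, words in byorder_words:
        merged[name].extend(words)
    for name, words in bene_words:
        merged[name].extend(words)
    return dict(merged)


# Toy inputs; an entity can appear as originator, as beneficiary, or as both.
byorder_words = [("ACMETRADING", ["103_1001_P_3"]), ("JOHNDOE", ["FED_1002_P_7"])]
bene_words = [
    ("ACMETRADING", ["103_1001_R_5", "103_1001_R_5"]),
    ("SMALLBANK", ["CHIPS_1003_R_1"]),
]

entity_words = merge_entity_words(byorder_words, bene_words)
print(entity_words)
```

An entity that only ever appears on one side simply keeps the words from that side, which is the dictionary equivalent of dropping the missing values after an outer merge.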

In order to train LDA, all words should be converted to numeric indices with a word2index mapping.

#generate vocabulary
def vocab_generator(Series):
    #collect the words of every document, then keep the distinct ones
    vocab=[]
    for i in Series:
        vocab.extend(i)
    return np.unique(vocab)

#convert word to index
def word2idx(vocab,df,idx_col,col_type='array'):
    word2idx_dic={word:idx for idx,word in enumerate(vocab)}
    if col_type=='array':
        #each cell holds the words of one document
        df[idx_col+'_idx']=df[idx_col].apply(lambda doc:np.array([word2idx_dic[word] for word in doc]))
    elif col_type=='value':
        #each cell holds a single value, e.g. an entity name
        df[idx_col+'_idx']=df[idx_col].apply(lambda value: word2idx_dic[value])
    else:
        raise ValueError('Input col_type is not valid!')
    return word2idx_dic,df
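Stripped of the DataFrame handling, the indexing boils down to the following sketch. The helper names `build_vocab` and `index_docs` and the toy documents are made up for the example; sorting the distinct words mirrors what `np.unique` returns.

```python
def build_vocab(docs):
    """Return the sorted list of distinct words over all documents."""
    return sorted({word for doc in docs for word in doc})


def index_docs(docs, vocab):
    """Map every word to its position in the vocabulary."""
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    indexed = [[word_to_idx[word] for word in doc] for doc in docs]
    return word_to_idx, indexed


# Toy documents: one list of words per entity.
docs = [["b_1", "a_2"], ["a_2", "c_3", "b_1"]]
vocab = build_vocab(docs)
word_to_idx, indexed_docs = index_docs(docs, vocab)
print(vocab)
print(indexed_docs)
```

The same mapping can be reused for the entity names, which is how the `name_idx` column in the output below hides the real names.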


#For business confidentiality reasons, let's also index the entity's name

#Generate LDA Input
 name_idx    words_merge_idx
0	0	[5082, 29357]
1	1	[76626, 30295, 30294]
2	2	[36226]
3	5	[105063]
4	8	[32751, 32759, 32747, 32748, 32757, 32749, 32759]
...	...	...
28632	28623	[85244, 67091]
28633	28625	[45452, 45451]
28634	28632	[79454, 64567, 64566, 23890]
28635	28633	[110589, 110589]
28636	28636	[103360]
28637 rows × 2 columns

So far we have generated a document for each customer, which will be the input to the LDA model. Now, let’s see how to run LDA to generate the topic probability vectors! First of all, to get a good result from LDA we need to remove the outlier customers that have too few or too many words. The following chart displays the distribution of the word count and of the log word count over all customers.

#Generate word count for customers
df['words_count']=df['words_merge_idx'].apply(lambda doc: len(doc))
df['words_count_log']=df['words_count'].apply(lambda x: np.log(x))

#Chart for word count distribution

From the word count boxplot, we know that most customers have a word count between 1 and 15. From the log word count boxplot, most customers have a word count between 1 and exp(4), which is about 54.6, or around 50. On the other hand, generating a topic vector for customers with too few words is meaningless, so we keep the customers with a word count between 5 and 50. Also note that if we first perform a hard segmentation of the originators and beneficiaries (banks, companies, natural persons, etc.), we can set a different word count threshold for each segment, which helps the LDA model considerably. The number of selected customers after filtering is 8577.

#select customers with words from 5 to 50
Out[3]:(8577, 4)
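The filter itself is a one-liner. Here it is as a pure-Python sketch on toy documents; the function name `select_by_word_count` and the data are invented, and only the bounds 5 and 50 come from the text above.

```python
import math


def select_by_word_count(docs, min_words=5, max_words=50):
    """Keep the entities whose document length lies in [min_words, max_words]."""
    return {
        name: words
        for name, words in docs.items()
        if min_words <= len(words) <= max_words
    }


# Toy documents with 2, 5, 30, 50 and 400 words.
docs = {
    "A": ["w"] * 2,
    "B": ["w"] * 5,
    "C": ["w"] * 30,
    "D": ["w"] * 50,
    "E": ["w"] * 400,
}

selected = select_by_word_count(docs)
print(sorted(selected))
print(round(math.exp(4), 1))  # upper whisker of the log boxplot, on the original scale
```

Both bounds are inclusive here; with per-segment thresholds, the same function could be called once per segment with different `min_words` and `max_words`.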

An essential step for LDA is choosing the best topic number. There are usually two metrics for this: topic coherence and perplexity. In our production environment, topic coherence is a good metric to consider, but other things are also taken into account. When we perform parameter tuning, we can also treat the topic number as a parameter of the entire LDA + Auto-Encoder model rather than of LDA alone. Topic coherence is a very popular metric for choosing the topic number, so in this blog we will go through the entire process of choosing the best model based on the topic coherence value.

# preprocess data for LDA
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
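data=[[str(word) for word in doc] for doc in df['words_merge_idx']]
#build the token dictionary, then convert every document to bag-of-words
id2word = corpora.Dictionary(data)
corpus = [id2word.doc2bow(text) for text in data]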

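#We will run 20 models with topic number from 1 to 20
#Let's see which model has the biggest coherence value!
K=20
coherence_values=[]
for num_topics in tqdm(range(1,K+1)):
    model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)
    coherencemodel = CoherenceModel(model=model, texts=data, dictionary=id2word, coherence='c_v')
    coherence_values.append(coherencemodel.get_coherence())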
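If the bag-of-words step in the preprocessing above feels opaque, the following pure-Python sketch shows the shape of the result: every document becomes a list of `(token_id, count)` pairs. It is only an analogy to `doc2bow`; the helper names are made up, and the ids it assigns (order of first appearance) are not guaranteed to match the ids gensim would assign.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return the sorted (token_id, count) pairs of one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Toy documents made of word indices cast to strings, as in the listing above.
data = [["32751", "32759", "32747", "32759"], ["45452", "45451"]]
token2id = build_token_ids(data)
corpus = [to_bow(doc, token2id) for doc in data]
print(corpus)
```

A repeated transaction pattern, like "32759" in the first document, shows up as a count greater than one, which is the signal LDA uses to weight that word inside the entity's topic mixture.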

#generate a chart to display all values 
import matplotlib.pyplot as plt
%matplotlib inline
num_topic_range = range(1,K+1)
for x,y in zip(num_topic_range, coherence_values):
    print("Num Topics =", x, " has Coherence Value of", y)
plt.plot(num_topic_range, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
print('the best topic number is %d'%(1+np.argmax(np.array(coherence_values))))
Out[3]:the best topic number is 3

The best topic number is 3, with the highest coherence value of 0.6876 (a higher coherence value is better). We can now train the final LDA model with the topic number set to 3. Note that in a real production environment, the selected topic number is much bigger than 3.

#train the model with num_topics=3
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3)

#Let's check the most important words/txns for each topic
from pprint import pprint
pprint(lda.print_topics())
[(0,
  '0.003*"99058" + 0.003*"99059" + 0.002*"14641" + 0.002*"14957" + '
  '0.002*"99060" + 0.002*"99070" + 0.002*"99055" + 0.002*"14640" + '
  '0.002*"99068" + 0.002*"99057"'),
 (1,
  '0.006*"14986" + 0.002*"97793" + 0.001*"102618" + 0.001*"14984" + '
  '0.001*"36741" + 0.001*"42136" + 0.001*"4678" + 0.001*"7726" + 0.001*"7312" '
  '+ 0.001*"7392"'),
 (2,
  '0.002*"58731" + 0.002*"78120" + 0.001*"58733" + 0.001*"58734" + '
  '0.001*"110020" + 0.001*"58721" + 0.001*"32869" + 0.001*"52824" + '
  '0.001*"54048" + 0.001*"62471"')]

Now let’s take a look at the dominant topic and keywords for each customer.

def topic_keyword_display(ldamodel, corpus, texts):
    # Collect one row per document and build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        if not row:
            continue
        # Sort topics by probability; the first one is the dominant topic
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        topic_num, prop_topic = row[0]
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic,4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
    return sent_topics_df

#generate df with dominant topic and keywords
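The core of the function above is just "pick the topic with the highest probability". On hand-written topic distributions it looks like this; the `(topic_id, probability)` pairs mimic the shape that the LDA model returns per document, while the entity names and numbers are invented.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])


# Toy per-entity topic distributions for a 3-topic model.
doc_topics = {
    "entity_0": [(0, 0.10), (1, 0.85), (2, 0.05)],
    "entity_1": [(0, 0.62), (2, 0.38)],
}

summary = {name: dominant_topic(probs) for name, probs in doc_topics.items()}
print(summary)
```

Keep in mind that the dominant topic is only a summary for inspection: the Auto-Encoder described at the start of this post takes the full topic probability vector, not just the winner.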

In the next blog, we will dive deep into the Auto-Encoder and briefly introduce some features that are very predictive for customer behavior detection. Do you want to enhance your data science toolkit? Please subscribe to my blog right now!


Published by frank victor xu

I am a data science practitioner. I love math, artificial intelligence and big data. I am looking forward to sharing experience with all data science enthusiasts.
