Data Exploratory Analysis: Loan Repayment Challenge

Teng
4 min readOct 31, 2020

--

Loan

There are 577,682 application records from 459,393 unique applicants.

Figure below shows the unique loan applicants across the United State, OH, IL and TX has the highest unique applicant, while least applicants came from HI, AK, ND.

Figure below shows the number of loan application received in each month from 2014 to 2017. The time-series data shows a upward trend with a seasonal effect. Like most of commercial data, loan applications increased at Q4 of each year. December received the highest application in Q4 with 400%-600% increase compared to October of that quarter.

Figure below shows the number of approved and rejected loan. The loan approval rate is 6.93%.

Figure below shows the approved loan, categorized into the loan status. While loan application is a two-way process, Approximately 97% of the customers accept the loan offer.

Unfortunately, out of the 40,000 approved loan, we observed approximately 17,000 of loan is under external and internal collection. This indicates that many of the borrower has difficulty to repay the loan.

Approximately 8,000 of the loan are new loan while around 11,000 of the loan had been paid off.

Which lead method brought the most customer to apply for loan?

The figure below show the distribution of the lead type. Approximately 94% of the loan borrowers were led by ‘byMandatory’ and ‘lead’— by definition is leads that are bought from the ping tree.

I did a little research about the ping tree lead. Ping tree is a lead method where the lead-industry companies sell the the information to third party that uses information for target-marketing. Ping trees are especially popular in the finance space amongst loan, mortgage and credit card providers.

Among the lead-type, lead by ‘California’ has the highest lead-cost (Average 129.2) followed by ‘lead’ type (Average 31.1).

Next thing that we wonder, is there a higher APR or higher loan amount associated with higher lead-cost?

Looks like a higher loan amount comes with lower APR to attract the borrower. Although California lead-type has a higher cost, it successfully attracts borrowers with higher loan amount.

California lead-type customers has the highest average loan amount ($ 2,777) with a relatively low average APR (129%).

What if the customer wants to apply a lower amount of loan while enjoying a lower APR? My advise is try your luck through Lionpay, they could offer you an average APR of 30% with an average loan amount of $ 862.

Does a good track record matter?

Yes. In fact, if we consider those loan with ‘loan status’ = [‘Paid Off Loan’, ‘New Loan’, ‘Returned Item’, ‘Pending Paid Off’] as a Good Loan.

We can see that a customer that has a higher nPaidoff will eventually paid off the loan.

Did the company start making money??

We assume that the loan.csv and payment.csv captured all the loan and received payment.

loan_out = loan_cuv_df['loanAmount'].sum()
payment_received = pay_df.groupby('paymentStatus')['paymentAmount'].sum().Checked
net = loan_out-payment_received
print(f'Loan out is {loan_out:.2f}, received payment is {payment_received:.2f}')
print(f'Nett is {net:.2f}')

Total of -$ 24.9 M loan was given out, while the company already received $ 24.2M. The net cash flow is -$ 0.7M. Good news, it’s close to breakeven.

For the purpose of predicting whether a loan will end be a delinquent , we will be only focusing on data exploratory analysis ends at loan.csv.

The github link of exploratory analysis:

https://github.com/yuanxy33/Loan-challenge/blob/main/Data%20Exploratory%20Analysis.ipynb

The github link of the Prediction model:

https://github.com/yuanxy33/Loan-challenge/blob/main/Loan%20challenge%20Prediction.ipynb

--

--

Teng
Teng

Written by Teng

Data Scientist | Msc Data Science @ University of Leeds

No responses yet