Discovering whether data are of acceptable quality is a measurement task, and not a very easy one. In this data quality project, I used Excel and Python to work with the Consumer Complaint Database, a collection of complaints about financial products and services. Data quality is important because, without high-quality data, you cannot understand or stay in contact with your customers.

Metadata Updated: Sep 26, 2015. Publisher: Consumer Financial Protection Bureau. (I am not sure how best to measure the size of the dataset.)

Before we analyze the 608,678 consumer complaints, we have to make sure of the quality of the data. Otherwise, the research will end up as "garbage in, garbage out." First of all, I used Python to get an overview of this dataset: Data Cleaning Original Python Code.
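A first overview can be sketched in pandas as below. In the real project the full CSV would be loaded with pd.read_csv("Consumer_Complaints.csv"); the tiny stand-in DataFrame here (hypothetical values) just shows the calls used for a quick look.

```python
import pandas as pd

# Stand-in for pd.read_csv("Consumer_Complaints.csv") -- sample values only.
df = pd.DataFrame({
    "Product":   ["Mortgage", "Credit card", "Mortgage"],
    "Sub-issue": ["Loan servicing", None, "Loan modification"],
    "Company":   ["Bank A", "Bank B", "Bank A"],
})

print(df.shape)    # (number of complaints, number of columns)
print(df.dtypes)   # column names and their types
print(df.head())   # a quick look at the first records
```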



There are 50 companies in the dataset.

There are eleven kinds of Product and 21 types of Sub-product.


The first part of data preparation is separating the data into the fields that will be most useful to you. With pandas' .groupby() function, we can count how many records each product has.

b = a.groupby(['Product']).size()  # a is the DataFrame loaded from the CSV
print(b)
Product
Bank account or service     68655
Consumer Loan               23608
Credit card                 72027
Credit reporting           105557
Debt collection            111680
Money transfers              4250
Mortgage                   198170
Other financial service       662
Payday loan                  4303
Prepaid card                 2826
Student loan                16940
dtype: int64
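The counts printed above can also be turned into shares of the total, which makes the skew toward a few products easier to see. A sketch, using the figures from the output:

```python
import pandas as pd

# The product counts printed above, as a Series.
counts = pd.Series({
    "Bank account or service": 68655,
    "Consumer Loan": 23608,
    "Credit card": 72027,
    "Credit reporting": 105557,
    "Debt collection": 111680,
    "Money transfers": 4250,
    "Mortgage": 198170,
    "Other financial service": 662,
    "Payday loan": 4303,
    "Prepaid card": 2826,
    "Student loan": 16940,
})
print(counts.sum())  # 608678 -- matches the total number of complaints

# Each product's percentage share, largest first.
shares = (counts / counts.sum() * 100).round(1).sort_values(ascending=False)
print(shares)        # Mortgage accounts for roughly a third of complaints
```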

I used the .isnull().sum() function to count the missing values in every available column; when the answer is 0, the column has no nulls. There are 370,198 missing values in Sub-issue.
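A minimal sketch of that null check, run here on a small stand-in DataFrame (hypothetical rows) rather than the full complaints table:

```python
import pandas as pd

# Stand-in data: Product is complete, Sub-issue has gaps.
df = pd.DataFrame({
    "Product":   ["Mortgage", "Credit card", "Payday loan"],
    "Sub-issue": [None, "Billing disputes", None],
})

# Count the missing values per column; 0 means the column is complete.
nulls = df.isnull().sum()
print(nulls)
```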


If we are interested in Consumer Loan complaints, we can use .loc[] to select just those rows.
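One way to do that selection is .loc with a boolean mask on the Product column; the DataFrame below is a hypothetical stand-in for the full dataset:

```python
import pandas as pd

# Stand-in data with a mix of products.
df = pd.DataFrame({
    "Product": ["Consumer Loan", "Mortgage", "Consumer Loan"],
    "Issue":   ["Managing the loan", "Loan servicing", "Taking out the loan"],
})

# .loc with a boolean mask keeps only the Consumer Loan rows.
consumer_loans = df.loc[df["Product"] == "Consumer Loan"]
print(consumer_loans)
```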


Now that we understand what data cleaning is for and what methods and approaches there are to shape up our dataset, there is still the question of what cleaning can and can’t catch. A general rule for cleaning a dataset where each column is a variable and the rows represent the records is:

  • if the number of incorrect or missing values in a row is greater than the number of correct values, it is recommended to exclude that row.
  • if the number of incorrect or missing values in a column is greater than the number of correct values in that column, it is recommended to exclude that column.
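The two rules above can be sketched with pandas' dropna(thresh=...), which keeps a row (or column) only if it has at least thresh non-missing values. The DataFrame here is a made-up stand-in; the threshold is "more than half present," matching the rule:

```python
import math
import pandas as pd

# Stand-in data: column B is mostly missing, and so are rows 1 and 2.
df = pd.DataFrame({
    "A": [1.0, None, 3.0],
    "B": [None, None, None],
    "C": [4.0, 5.0, None],
})

# Keep a row only if its non-missing values outnumber its missing ones.
row_thresh = math.floor(df.shape[1] / 2) + 1
cleaned = df.dropna(axis=0, thresh=row_thresh)

# Then keep a column only if its non-missing values outnumber its missing ones.
col_thresh = math.floor(cleaned.shape[0] / 2) + 1
cleaned = cleaned.dropna(axis=1, thresh=col_thresh)
print(cleaned)
```

On this sample, rows 1 and 2 and column B are dropped, leaving a single complete row with columns A and C.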

A useful tutorial: https://g0v.gitbooks.io/data-design/content/book/ch10-what-data-cleaning-can-and-cant-catch.html
