The class labels in the cluster dataset are strings, so they are first replaced with integer indices, the (sourceIP cluster, destIP cluster, class) triples are counted, and the connections are drawn as size-encoded scatter plots:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Extract the class column ('class' column name assumed; the original
# bracketed indexing was lost when the page was scraped)
classes = cluster_data['class']
# Unique classes
unique_classes = np.unique(classes)

# Replace the string names with their indices in the unique classes array
cluster_data_digit_cls = cluster_data.copy(deep=True)
for i, label in enumerate(unique_classes):
    cluster_data_digit_cls = cluster_data_digit_cls.replace(label, i)
print('Cluster dataset with indices as class names generated:\n',
      cluster_data_digit_cls.head())

# Generate triples with indices of sourceIP cluster, destIP cluster and class
cluster_triples = [tuple(cluster_data_digit_cls.loc[i])
                   for i in cluster_data_digit_cls.index]
# Use Counter method
counter_relation = Counter(cluster_triples)

# Generate the numpy array in shape (n, 4), where n denotes all types of
# triples and the fourth column contains the number of records of the
# corresponding triple. This step may cost about 10 seconds.
relation = np.concatenate(
    (np.array(list(counter_relation.keys())),
     np.array(list(counter_relation.values())).reshape(-1, 1)),
    axis=1)

# Save the dataset with counts
# pd.DataFrame(relation, columns=).to_csv('relation.csv')

# Generate data for the size-encoding scatter plot
x = relation[:, 0]                           # Source IP cluster indices
y = relation[:, 1]                           # Destination IP cluster indices
area = relation[:, 3] ** 2 / 10000           # Marker size with real number of records
log_area = np.log(relation[:, 3]) ** 2 * 15  # Constrained size in logspace
colors = relation[:, 2]                      # Colours defined by classes

# Create new subplots figure
fig, axes = plt.subplots(1, 2)
fig.suptitle('Cluster Connections with Classifications', fontsize=20)
plt.setp(axes.flat, xlabel='sourceIP Clusters', ylabel='destIP Clusters')

# Scatter plot: use alpha to increase transparency
scatter = axes[0].scatter(x, y, s=area, c=colors, alpha=0.8, cmap='Paired')
axes[0].set_title('Real size encoding records')
# Legend of classes
handles, _ = scatter.legend_elements(prop='colors', alpha=0.6)
lgd2 = axes[0].legend(handles, unique_classes, loc='best', title='Classes')

# Scatter plot in logspace
scatter = axes[1].scatter(x, y, s=log_area, c=colors, alpha=0.8, cmap='Paired')
axes[1].set_title('Logspace size encoding records')
# Legend of sizes
kw = dict(prop='sizes', num=5, color=scatter.cmap(0.7))
lgd3 = axes[1].legend(*scatter.legend_elements(**kw), loc='best', title='Sizes')
```

For the decision tree, three impurity measures are implemented: entropy, the Gini index, and the misclassification error, each with an "overall" variant that weights the two child datasets by their sizes. A function to calculate the Laplace-based misclassification probability is also provided; it leads to similar results to computing the misclassification error directly. The reason I implement this method is to reproduce the post-pruning given in the course learning materials.

```python
def calculate_entropy(data):
    """Calculate the entropy of the input data.

    Should be the data whose last column contains the class labels.
    If the data is an empty array, entropy will be 0.
    """
    labels = data[:, -1]
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))
    return entropy


def calculate_overall_entropy(data1, data2):
    """Calculate the overall entropy of the two input datasets.

    Should be the datasets whose last column contains the class labels.
    If the data is an empty array, ZeroDivisionError will be raised.
    """
    total_num = len(data1) + len(data2)
    prob_data1 = len(data1) / total_num
    prob_data2 = len(data2) / total_num
    overall_entropy = (prob_data1 * calculate_entropy(data1)
                       + prob_data2 * calculate_entropy(data2))
    return overall_entropy


def calculate_gini(data):
    """Calculate the Gini index of the input data.

    Should be the data whose last column contains the class labels.
    If the data is an empty array, gini will be 1.
    """
    labels = data[:, -1]
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    gini = 1 - np.sum(np.square(probs))
    return gini


def calculate_overall_gini(data1, data2):
    """Calculate the overall Gini index of the two input datasets.

    Should be the datasets whose last column contains the class labels.
    If the data is an empty array, ZeroDivisionError will be raised.
    """
    total_num = len(data1) + len(data2)
    prob_data1 = len(data1) / total_num
    prob_data2 = len(data2) / total_num
    overall_gini = (prob_data1 * calculate_gini(data1)
                    + prob_data2 * calculate_gini(data2))
    return overall_gini


def calculate_mce(data):
    """Calculate the misclassification error of the input data.

    Should be the data whose last column contains the class labels.
    If the data is an empty array, ValueError will be raised.
    """
    labels = data[:, -1]
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    mce = 1 - max(probs)
    return mce


def calculate_overall_mce(data1, data2):
    """Calculate the overall misclassification error of the two input datasets.

    Should be the datasets whose last column contains the class labels.
    If the data is an empty array, ZeroDivisionError will be raised.
    """
    total_num = len(data1) + len(data2)
    prob_data1 = len(data1) / total_num
    prob_data2 = len(data2) / total_num
    overall_mce = (prob_data1 * calculate_mce(data1)
                   + prob_data2 * calculate_mce(data2))
    return overall_mce


def calculate_overall_impurity(data1, data2, method):
    """Calculate the overall impurity with the chosen method
    ('entropy', 'gini' or 'mce')."""
    if method == 'entropy':
        return calculate_overall_entropy(data1, data2)
    elif method == 'gini':
        return calculate_overall_gini(data1, data2)
    elif method == 'mce':
        return calculate_overall_mce(data1, data2)
```

The scikit-learn documentation is pretty good (scikit-learn 1.3.0 documentation). You do the usual splitting of train/test sets, fitting, etc.
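The triple-counting step above, a `Counter` over the cluster triples whose keys and counts are then packed into an `(n, 4)` array, can be checked on a toy input; the triples here are made up for illustration:

```python
from collections import Counter

import numpy as np

# Hypothetical triples of (sourceIP cluster, destIP cluster, class) indices
triples = [(0, 1, 2), (0, 1, 2), (3, 4, 5)]
counter_relation = Counter(triples)

# Pack the distinct triples and their counts into an (n, 4) array:
# columns are sourceIP cluster, destIP cluster, class, record count
relation = np.concatenate(
    (np.array(list(counter_relation.keys())),
     np.array(list(counter_relation.values())).reshape(-1, 1)),
    axis=1)

print(relation)  # [[0 1 2 2]
                 #  [3 4 5 1]]
```

`Counter` preserves insertion order, so the rows come out in first-seen order of the triples.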
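As a quick sanity check of the three impurity formulas, here they are computed by hand on a toy label column `[0, 0, 0, 1]` (the expressions mirror the function bodies above):

```python
import numpy as np

labels = np.array([0, 0, 0, 1])           # toy class column
_, counts = np.unique(labels, return_counts=True)
probs = counts / counts.sum()             # [0.75, 0.25]

entropy = -np.sum(probs * np.log2(probs))  # -(0.75*log2(0.75) + 0.25*log2(0.25)) ~ 0.8113
gini = 1 - np.sum(np.square(probs))        # 1 - (0.5625 + 0.0625) = 0.375
mce = 1 - max(probs)                       # 1 - 0.75 = 0.25
```

As expected, entropy penalises the minority class most heavily, while the misclassification error only looks at the majority-class share.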
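The Laplace-based misclassification probability mentioned above is not shown in the recovered text. A common form of the Laplace correction, assumed here and not necessarily the author's exact formula, smooths the majority-class probability with the class count k:

```python
import numpy as np

def laplace_misclassification(data, num_classes):
    """Laplace-smoothed misclassification probability: 1 - (n_majority + 1) / (N + k).

    NOTE: this formula is an assumption, not taken from the post.
    `data` follows the same convention as above: last column holds class labels.
    """
    labels = data[:, -1]
    _, counts = np.unique(labels, return_counts=True)
    return 1 - (counts.max() + 1) / (len(labels) + num_classes)

# 3 of 4 samples in the majority class, k = 2 classes -> 1 - 4/6 = 1/3
toy = np.array([[0, 0], [0, 0], [0, 0], [0, 1]])
print(laplace_misclassification(toy, 2))
```

The smoothing keeps the estimate away from 0 at small leaves, which is why it suits post-pruning decisions better than the raw misclassification error.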