ࡱ> GIF%` (bjbjٕ 7>  <\ g| | | | | W W W $h tW W tt | | !tJ| | t| p  ,70gdpW K  W W W   W W W gtttt\\ Project Description CS 7301 Spring 2008 Data Mining Project#1: Clustering Evolving Data Streams and Applying It to Detect Peer-to-Peer Botnet Traffic 1. Clustering evolving data streams We face two major challenges while mining data streams: infinite length and concept drift. Most clustering algorithms are multi-pass, i.e., they require multiple scans of a data set. This could be too expensive for data streams, because of their huge volume. Thus, these algorithms cannot be applied directly to clustering data streams. Although there are some recent works in clustering data streams (O'Callaghan, 2002; Guha, 2000), they are not sufficient for handling data streams because of two reasons. First, they compute the clusters over the entire data stream. But this approach is not applicable to a data stream that evolves with time. If the data stream evolves, the underlying clusters may also undergo significant change with time. Second, they perform incremental clustering by maintaining only a small number of existing clusters (such as k in the k-median or k-means method). But it is possible that with the nature of data evolving, some objects originally belongings to one cluster may now belong to another cluster, or even form a new cluster. Based on these observations, two major ideas have been proposed: microclustering (zhang, 1996) and tilted time frame. The microclusters have an additive property which makes them a natural choice for data streams. When new data arrive, they could either be put into an existing microcluster, or form a new one, depending on how close they are from the nearest microcluster center. If a new microcluster is created, some existing microclusters may be merged, or some small and outdated microclusters may be removed. Thus, we may incrementally maintain and dynamically evolve microclusters in data streams. The summary information in the microclusters can be used for creation of macroclusters (i.e., the high-level k clusters for an input parameter k) dynamically, using k-means clustering algorithm. Besides, by adopting tilted time frame, microclusters are stored as time-stamped snapshots, and the snapshots are maintained at multi-granularity depending on their age. A comprehensive description of efficient clustering of data streams using this approach can be found in Aggarwal (2003). 2. Application to P2P botnet traffic detection P2P botnets are the new evolving technology of botnets. These botnets are distributed, and much smaller than IRC-based botnets. Thus, they are also more difficult to locate and take down. We have proposed an efficient stream data classification algorithm for P2P botnet traffic detection. We would like to integrate this technique with the data stream clustering, for P2P botnet traffic detection. This would be done as follows: we will apply the data stream clustering technique (Aggarwal, 2003) to cluster the stream data into two clusters. We would label one of the clusters as benign and the other as malicious, based on the label of a few sample points in each cluster. Then we would apply classification on the labeled data. The system will be evaluated by the accuracy of the learned classifier. 3. References: C. Aggarwal, J. Han, J. Wang, and P. S. Yu (2003). A framework for clustering evolving data streams. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), pages 81-92, Berlin, Germany, Sept. 2003. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan(2000). Clustering data streams. In Proc. 2000 Symp. Foundations of Computer Science (FOCS'00), pages 359-366, Redondo Beach, CA, 2000. L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha (2002). Streaming-data algorithms for high-quality clustering. In Proc. 2002 Int. Conf. Data Engineering (ICDE'02), pages 685-696, San Fransisco, CA, Apr. 2002. T. Zhang, R. Ramakrishnan, and M. Livny. Birch(1996): an efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103-114, Montreal, Canada, June 1996. POC: Mohammad M Masud Project#2: Ontology Alignment Given 2 data sources, S1 and S2, each of which is represented by ontologies O1 and O2, the goal is to find similar concepts between O1 and O2 by (1: examining their names (2: examining their respective instances (3: examining their structural properties (a nodes parents, children, siblings, etc.). For the purposes of this problem, O1 and O2 may belong to different knowledge domains drawn from any existing knowledge domain, although typically, the ontologies are modeled after the same knowledge domain (ie: the domain of university computer science faculties in the United States). Additionally, these ontologies may vary in breadth, depth, and through the types of relationships between their constituent concepts. The challenge involved in the alignment of these ontologies, assuming that they have already been constructed, is based on the derivation of procedures that will maximize the conceptual similarity between any 2 concepts between the ontologies. This is accomplished through determining the weighed similarity between the concept names and their associated instances and also by structural properties from each ontology close to the concepts being compared in an attempt to maximize the conceptual similarity. Name similarity is determined by comparing the full names of each concept, where the full name is defined as the concatenation of the concept names that lead to the currently compared concept, starting from the root node of the ontology. Content similarity relies on using 2 classifiers, CA and CB, which are trained to reliably determine whether instances belong to concept A and concept B, respectively. CA then must classify those instances belonging to concept B, while CB must classify instances belonging to concept A. The number of instances belonging to both concepts is determined, and from this, 3 separate probability values are calculated: P(A,B), which is the probability that an instance belongs to concepts A and B, P(~A, B), which is the probability that an instance belongs to concept B but does not belong to concept A, and P(A, ~B), which is the probability that a concept belongs to concept A but does not belong to concept B. Once these probabilistic measures are tallied, a user-defined function expresses a similarity matrix that describes pairwise similarity between between all of the concepts of O1 and O2. The final pairwise probability is derived by using Jaccard similarity, which is defined in the context of the joint probability distribution over all of member instances as P(A,B)/P(A,B) + P(A,~B) + P(~A,B). In order to maximize the similarity between concepts, local structural properties of the ontologies are evaluated and compared in a process known as relaxation labeling. This module takes the matrix computed by the similarity estimator and uses domain constraints and heuristic knowledge to determine the most likely overall similarity configuration. The constraints may be domain-dependent, such as There can be at most one department chair for a university faculty ontology, or the constraints may be domain-independent, such as Two concepts match if their children also match. The process of relaxation works as follows: concept nodes are assigned initial labels based solely on their intrinsic properties. Next, iterative local optimization is performed. Within each iteration, the label of the node is re-estimated based on the local features of its neighborhood, included domain constraints and heuristics. The process continues until labels do not change from one iteration to the next, or until some other version of convergence criterion is reached. Here are 4 relevant to GIS, with some emphasis on road networks: http://fcgov.com/gis/downloadable-data.php (Fort Collins GIS data, roads, habitats, zoning, etc., in dbf and shapefile format) http://data.geocomm.com/catalog/index.html (free GIS data, GIS data depot) http://www.geobase.ca/geobase/en/data/nrn/index.html (National Road Network of Canada, includes data in the form of shapefile and KML data) http://www.capcog.org/Information_Clearinghouse/Geospatial_main.asp (Texas GIS data, in mdb and shapefile format) POC: Jeffrey Lee Partyka 8BOPXY]^ijmqrsx 5 P W X j u     R ˧~vvnnfnfhKCJaJhq.CJaJhjE,CJaJhxNCJaJhPhPCJaJhCJaJhhP5CJ\aJh55CJ\aJh}hA5CJ\aJh}5CJ\aJh}hP5CJ\aJh55CJ\aJh5h55CJH\aJh5h55CJH\aJH&*+8  /0_ $7$8$H$a$gd\ $7$8$H$a$gd. $7$8$H$a$gd ! $7$8$H$a$gdrk $7$8$H$a$gd/ ~ $7$8$H$a$gdK $7$8$H$a$gdC s(R ` d q s w 0 5 : < H g r s   ! " - . > v  + < J _ a { ׹׹׹ױױשמז׎׎h\sCJaJh*CJaJhPhCJaJh.CJaJhf?CJaJhPhP6CJ]aJh{a,CJaJhC sCJaJhPhPCJaJhKhKCJaJh/ ~CJaJhKhKCJaJ8 /Sh!&(GH  OST[efjk,./0ļhPhbCJaJhPCJaJhrkCJaJhVz~CJaJhPhVz~CJaJh.CJaJhPhP6CJ]aJh*CJaJhPhPCJaJh\sCJaJ:03BLMPS^_@HJN~:lֽ|tttld\dTh7CJaJh#7CJaJh'"hCJaJhCCJaJh.CJaJh{(CJaJh@uCJaJhBzCJaJhmCJaJh@uh@uCJaJh@uh,CJaJh@uh7@CJaJhP5CJ\aJhoJhb5CJ\aJhoJhP5CJ\aJhoJh !5CJ\aJhoJhoJ5CJ\aJ/9:EFZ[  T[߾tllldhQpCJaJhH6CJaJhfhf6CJ]aJhfCJaJhfhfCJaJh2gCJaJh2g6CJ]aJh2gh2g6CJ]aJhgWCJaJh2gh2gCJaJhp|hp|6CJ]aJhj{CJaJhp|CJaJhp|hp|CJaJhfC\h\5CJ\aJ&[% r&&2'}'' (M({((( 7$8$H$gd5$a$gd5L7$8$H$^`LgdIWL7$8$H$^`LgdQpL7$8$H$^`Lgd9$*03OP^_m$%?@FGuv|}wy~"$)*ĸ|| h5CJhTh5CJH*hTh5CJhTh55>*CJaJhTh55CJaJh55CJaJh5h55CJaJh5CJaJhIWCJaJhQp6CJ]aJhQphQp6CJ]aJhQpCJaJhQphQpCJaJ0 l!m!s!t!q&r&|&}&&&&&u'v'{(|(((hIWh\CJaJh5h55 hTh5h5hTh5CJhTh5CJH*50P:pQp/ =!"#$% @@@ NormalCJ_HaJmH sH tH DA@D Default Paragraph FontRi@R  Table Normal4 l4a (k@(No List >*+8/ 0 _ [ %r2} M { 000000000000000000000000000@0@000@0@0@0@0@00[ r M { J00 J00J00 0 J00J0 _H0H0J00H00H000@0J00 R 0(((8@0(  B S  ?/7407`1727<37647 F57d 67LF77|87_97:7 ;7<7=7>7 y?72> > F cc      D M M  pp  :*urn:schemas-microsoft-com:office:smarttagsStreet;*urn:schemas-microsoft-com:office:smarttagsaddress9*urn:schemas-microsoft-com:office:smarttagsplace8*urn:schemas-microsoft-com:office:smarttagsCityB*urn:schemas-microsoft-com:office:smarttagscountry-region9 *urn:schemas-microsoft-com:office:smarttagsState 8r  fj,;=Ber S_q})6gt3@ % c j @ H ^ b g m r y (05<AGPTis$&0:08LS )a d i r  Z ]%$&F[# 3333333333333*  Z ]({  jw^_[Q O Hw!F.;O[\w 7s  ) B"M":$D&z:'d*}*jE,U,{a,I-.p/122H6#7":f?7@}Z@`BKE*FtbJoJXK\LiMxN Om8O6S@WIWgW{WG@X\fC\J\:]A^uN_h`c.bb2b'"hrk-k?]mhnp p&IpQp=q!DrC s\s@uty*yPycRyBzw{p|h}~/ ~Vz~XP=<$]FB.bDh,*=A3}jT\~w5q.[O4Q !ql$,E\hhX^_wn!j{grv!NBci0.`m5=WC2gpAT;g+L<,e <$agP3*[8qs{(\f9K  @ =  @UnknownGz Times New Roman5Symbol3& z Arial"qh&&;;!24d 2Q HX)?~2XTask: Clustering evolving data streams and integrate it to classification of botnet data Mehedy MasudlkhanOh+'0 ,8H Xd   \Task: Clustering evolving data streams and integrate it to classification of botnet dataMehedy MasudNormallkhan2Microsoft Office Word@G@^߿@^߿՜.+,0@ hp|  !UTD; ' YTask: Clustering evolving data streams and integrate it to classification of botnet data Title !"#$%&')*+,-./012345789:;<=?@ABCDEHRoot Entry FJData  1Table(WordDocument7>SummaryInformation(6DocumentSummaryInformation8>CompObjq  FMicrosoft Office Word Document MSWordDocWord.Document.89q