
Introduction to Data Mining (Steinbach et al.), Posts & Telecom Press — Solutions to Exercises

Introduction to Data Mining
Instructor's Solution Manual

Pang-Ning Tan
Michael Steinbach
Vipin Kumar

Copyright © 2006 Pearson Addison-Wesley. All rights reserved.

Contents

1  Introduction
2  Data
3  Exploring Data
4  Classification: Basic Concepts, Decision Trees, and Model Evaluation
5  Classification: Alternative Techniques
6  Association Analysis: Basic Concepts and Algorithms
7  Association Analysis: Advanced Concepts
8  Cluster Analysis: Basic Concepts and Algorithms
9  Cluster Analysis: Additional Issues and Algorithms
10 Anomaly Detection
1 Introduction

1. Discuss whether or not each of the following activities is a data mining task.

(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.

(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

(c) Computing the total sales of a company.
No. Again, this is simple accounting.

(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.

(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and we needed to estimate the probabilities of each outcome from the data, then this is more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn't consider it to be data mining.

(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the
area of data mining known as predictive modelling. We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.

(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an unusual heart behavior occurred. This would involve the area of data mining known as anomaly detection. This could also be considered as a classification problem if we had examples of both normal and abnormal heart behavior.

(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic wave behavior associated with earthquake activities and raise an alarm when one of these different types of seismic activity was observed. This is an example of the area of data mining known as classification.

(i) Extracting the frequencies of a sound wave.
No. This is signal processing.

2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.

The following are examples of possible answers.

• Clustering can group results with a similar theme and present them to the user in a more concise form, e.g., by reporting the 10 most frequent words in the cluster.
• Classification can assign results to pre-defined categories such as "Sports," "Politics," etc.
• Sequential association analysis can detect that certain queries follow certain other queries with a high probability, allowing for more efficient caching.
• Anomaly detection techniques can discover unusual patterns of user traffic, e.g., that one subject has suddenly become much more popular. Advertising strategies could be adjusted to take advantage of such developments.
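The first bullet above (summarizing each cluster by its most frequent words) is easy to sketch in a few lines; the function name and the toy documents below are my own illustration, not from the text:

```python
from collections import Counter

def top_words(docs, k=10):
    """Summarize one cluster of documents by its k most frequent words."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return [w for w, _ in counts.most_common(k)]

# A hypothetical cluster of search results sharing a "championship" theme.
cluster = [
    "championship game final score",
    "championship game highlights",
    "championship injury report",
]
print(top_words(cluster, k=2))  # -> ['championship', 'game']
```

In a real system the documents would first be grouped by a clustering algorithm; this shows only the summarization step used to present a cluster concisely.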
3. For each of the following data sets, explain whether or not data privacy is an important issue.

(a) Census data collected from 1900–1950. No
(b) IP addresses and visit times of Web users who visit your Web site. Yes
(c) Images from Earth-orbiting satellites. No
(d) Names and addresses of people from the telephone book. No
(e) Names and email addresses collected from the Web. No

2 Data

1. In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
Field 2 / Field 3 ≈ 7 for the values displayed. While it can be dangerous to draw conclusions from such a small sample, the two fields seem to contain essentially the same information.

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people's judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Discrete, qualitative, nominal (ISBN numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Discrete, qualitative, ordinal
(j) Military rank. Discrete, qualitative, ordinal
(k) Distance from the center of campus. Continuous, quantitative, interval/ratio (depends)
(l) Density of a substance in grams per cubic centimeter. Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.) Discrete, qualitative, nominal

3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
The boss is right. A better measure is given by

    Satisfaction(product) = number of complaints for the product / total number of sales for the product.

(b) What can you say about the attribute type of the original product satisfaction attribute?
Nothing can be said about the attribute type of the original measure. For example, two products that have the same level of customer satisfaction may have different numbers of complaints and vice-versa.

4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"

(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.
Yes, the marketing director is in trouble. A customer may give inconsistent rankings. For example, a customer may prefer 1 to 2, 2 to 3, but 3 to 1.

(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?
One solution: For three items, do only the first two comparisons. A more general solution: Put the choice to the customer as one of ordering the products, but still only allow pairwise comparisons. In general, creating an ordinal measurement scale based on pairwise comparisons is difficult because of possible inconsistencies.

(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?
First, there is the issue that the scale is likely not an interval or ratio scale. Nonetheless, for practical purposes, an average may be good enough. A more important concern is that a few extreme ratings might result in an overall rating that is misleading. Thus, the median or a trimmed mean (see Chapter 3) might be a better choice.

5. Can
you think of a situation in which identification numbers would be useful for prediction?
One example: Student IDs are a good predictor of graduation date.

6. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?
Association rule analysis works with binary attributes, so you have to convert the original data into binary form as follows:

    Q1=A  Q1=B  Q1=C  Q1=D  ...  Q100=A  Q100=B  Q100=C  Q100=D
      1     0     0     0   ...    1       0       0       0
      0     0     1     0   ...    0       1       0       0

(b) In particular, what type of attributes would you have and how many of them are there?
400 asymmetric binary attributes.

7. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
A feature shows temporal autocorrelation if values that are close together in time are more similar than values that are farther apart in time. It is more common for temperatures at nearby times to be similar than for amounts of rainfall, since rainfall can be very localized in time; i.e., the amount of rainfall can change abruptly from one day to the next. Therefore, daily temperature shows more temporal autocorrelation than daily rainfall.

8. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
The ij-th entry of a document-term matrix is the number of times that term j occurs in document i. Most documents contain only a small fraction of all the possible terms, and thus, zero entries are not very meaningful, either in describing or comparing documents. Thus, a document-term matrix has asymmetric discrete features. If we apply a TFIDF normalization to terms and normalize the documents to have an L2 norm of 1, then this creates a term-document matrix with continuous features. However, the features are still asymmetric because these transformations do not create non-zero entries for any entries that were previously 0, and thus, zero entries are still not very meaningful.

9. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.
Observational sciences have the issue of not being able to completely control the quality of the data that they obtain. For example, until Earth-orbiting satellites became available, measurements of sea surface temperature relied on measurements from ships. Likewise, weather measurements are often taken from stations located in towns or cities. Thus, it is necessary to work with the data available, rather than data from a carefully designed experiment. In that sense, data analysis for observational science resembles data mining.

10. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
The precision of floating point numbers is a maximum precision. More explicitly, precision is often expressed in terms of the number of significant digits used to represent a value. Thus, a single precision number can only represent values with up to 32 bits, ≈ 9 decimal digits of precision. However, often the precision of a value represented using 32 bits (64 bits) is far less than 32 bits (64 bits).

11. Give at least two advantages to working with data stored in text files instead of in a binary format.
(1) Text files can be easily inspected by typing the file or viewing it with a text editor.
(2) Text files are more portable than binary files, both across systems and programs.
(3) Text files can be more easily modified, for example, using a text editor or perl.

12. Distinguish between noise and outliers. Be sure to consider the following questions.

(a) Is noise ever interesting or desirable? Outliers?
No, by definition. Yes. (See Chapter 10.)
(b) Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
(c) Are noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
(d) Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
(e) Can noise make a typical value into an unusual one, or vice versa?
Yes.

13. Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.1 for this task.

Algorithm 2.1 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in increasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

(a) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.
There are several problems. First, the order of duplicate objects on a nearest neighbor list will depend on details of the algorithm and the order of objects in the data set. Second, if there are enough duplicates, the nearest neighbor list may consist only of duplicates. Third, an object may not be its own nearest neighbor.

(b) How would you fix this problem?
There are various approaches depending on the situation. One approach is to keep only one object for each group of duplicate objects. In this case, each neighbor can represent either a single object or a group of duplicate objects.

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
These attributes are all numerical, but can have widely varying ranges of values, depending on the scale used to measure them. Furthermore, the attributes are not asymmetric and the magnitude of an attribute matters. These latter two facts eliminate the cosine and correlation measures. Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.

15. You are given a set of m objects that is divided into K groups, where the ith group is of size m_i. If the goal is to obtain a sample of size n [...]

[...] (Hint: Two human beings share > 99.9% of the same genes.)
Two human beings share > 99.9% of the same genes. If we want to compare the genetic makeup of two human beings, we should focus on their differences. Thus, the Hamming distance is more appropriate in this situation.

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1,1,1,1), y = (2,2,2,2) cosine, correlation, Euclidean
cos(x,y) = 1, corr(x,y) = 0/0 (undefined), Euclidean(x,y) = 2
(b) x = (0,1,0,1), y = (1,0,1,0) cosine, correlation, Euclidean, Jaccard
cos(x,y) = 0, corr(x,y) = −1, Euclidean(x,y) = 2, Jaccard(x,y) = 0
(c) x = (0,−1,0,1), y = (1,0,−1,0) cosine, correlation, Euclidean
cos(x,y) = 0, corr(x,y) = 0, Euclidean(x,y) = 2
(d) x = (1,1,0,1,0,1), y = (1,1,1,0,0,1) cosine, correlation, Jaccard
cos(x,y) = 0.75, corr(x,y) = 0.25, Jaccard(x,y) = 0.6
(e) x = (2,−1,0,2,0,−3), y = (−1,1,−1,0,0,−1) cosine, correlation
cos(x,y) = 0, corr(x,y) = 0

20. Here, we further explore the cosine and correlation measures.

(a) What is the range of values that are possible for the cosine measure?
[−1, 1]. Many times the data has only positive entries and in that case the range is [0, 1].

(b) If two objects have a cosine measure of 1, are they identical? Explain.
Not necessarily. All we know is that the values of their attributes differ by a positive constant factor.

(c) What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)
For two vectors, x and y, that have a mean of 0, corr(x,y) = cos(x,y).

(d) Figure 2.1(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?
Since all the 100,000 points fall on the curve, there is a functional relationship between Euclidean distance and cosine similarity for normalized data. More specifically, there is an inverse relationship between cosine similarity and Euclidean distance. For example, if two data points are identical, their cosine similarity is one and their Euclidean distance is zero, but if two data points have a high Euclidean distance, their cosine value is close to zero. Note that all the sample data points were from the positive quadrant, i.e., had only positive values. This means that all cosine (and correlation) values will be positive.

(e) Figure 2.1(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?
Same as previous answer, but with correlation substituted for cosine.

(f) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.
Let x and y be two vectors where each vector has an L2 length of 1. For such vectors, the sum of the squared attribute values is 1, and the cosine measure is just the dot product of the two vectors. Therefore,

    d(x,y)^2 = sum_{k=1}^{n} (x_k − y_k)^2
             = sum_{k=1}^{n} x_k^2 − 2 sum_{k=1}^{n} x_k y_k + sum_{k=1}^{n} y_k^2
             = 1 − 2 cos(x,y) + 1
             = 2 (1 − cos(x,y)),

so d(x,y) = sqrt(2 (1 − cos(x,y))).

(g) Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.
Let x and y be two vectors where each vector has a mean of 0 and a standard deviation of 1. For such vectors, the sum of the squared attribute values is n (the variance is the sum of squared values divided by n), and the correlation between the two vectors is their dot product divided by n. Therefore,

    d(x,y)^2 = sum_{k=1}^{n} (x_k − y_k)^2
             = sum_{k=1}^{n} x_k^2 − 2 sum_{k=1}^{n} x_k y_k + sum_{k=1}^{n} y_k^2
             = n − 2n corr(x,y) + n
             = 2n (1 − corr(x,y)),

so d(x,y) = sqrt(2n (1 − corr(x,y))).

21. Show that the set difference metric given by d(A,B) = size(A − B) + size(B − A) satisfies the metric axioms given on page 70. A and B are sets and A − B is the set difference.
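Before the formal proof, the claim can be spot-checked numerically. Below is a quick sketch (the helper name and random-set setup are my own) that verifies the axioms for d(A, B) = size(A − B) + size(B − A) on a sample of random sets:

```python
import random
from itertools import combinations

def set_diff_dist(a, b):
    """d(A, B) = size(A - B) + size(B - A), the size of the symmetric difference."""
    return len(a - b) + len(b - a)

random.seed(0)
universe = list(range(8))
sets = [frozenset(random.sample(universe, random.randint(0, 8)))
        for _ in range(30)]

# Non-negativity, and d(A, A) = 0.
assert all(set_diff_dist(a, b) >= 0 for a in sets for b in sets)
assert all(set_diff_dist(a, a) == 0 for a in sets)
# Symmetry.
assert all(set_diff_dist(a, b) == set_diff_dist(b, a)
           for a, b in combinations(sets, 2))
# Triangle inequality, checked over all ordered triples.
assert all(set_diff_dist(a, c) <= set_diff_dist(a, b) + set_diff_dist(b, c)
           for a in sets for b in sets for c in sets)
print("all metric axioms hold on this sample")
```

A numeric check like this is no substitute for the proof, but it is a cheap way to catch a mis-stated axiom before attempting one.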
[Figure 2.1. Figures for exercise 20. (a) Relationship between Euclidean distance and the cosine measure. (b) Relationship between Euclidean distance and correlation. Both panels plot Euclidean distance (vertical axis, 0 to 1.4) against cosine similarity and correlation (horizontal axis, 0 to 1), respectively.]

1(a). Because the size of a set is greater than or equal to 0, d(A,B) ≥ 0.
1(b). If A = B, then A − B = B − A = empty set and thus d(A,B) = 0.
2. d(A,B) = size(A−B) + size(B−A) = size(B−A) + size(A−B) = d(B,A).
3. First, note that d(A,B) = size(A) + size(B) − 2 size(A∩B). Therefore,

    d(A,B) + d(B,C) = size(A) + size(C) + 2 size(B) − 2 size(A∩B) − 2 size(B∩C).

Since size(A∩B) ≤ size(B) and size(B∩C) ≤ size(B),

    d(A,B) + d(B,C) ≥ size(A) + size(C) + 2 size(B) − 2 size(B)
                    = size(A) + size(C)
                    ≥ size(A) + size(C) − 2 size(A∩C)
                    = d(A,C).

Therefore, d(A,C) ≤ d(A,B) + d(B,C).

22. Discuss how you might map correlation values from the interval [−1,1] to the interval [0,1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

For time series clustering, time series with relatively high positive correlation should be put together. For this purpose, the following transformation would be appropriate:

    sim = corr   if corr ≥ 0
    sim = 0      if corr < 0

For predicting the behavior of one time series from another, it is necessary to consider strong negative, as well as strong positive, correlation. In this case, the following transformation, sim = |corr|, might be appropriate. Note that this assumes that you only want to predict magnitude, not direction.
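The two transformations above, together with the measures from exercises 19 and 20, are easy to check with a short script. This is a plain-Python sketch (the function names are my own); the printed values match the answers to exercise 19(d):

```python
import math

def cos_sim(x, y):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def corr(x, y):
    """Pearson correlation (population form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    """Jaccard similarity for binary vectors: f11 / (f11 + f10 + f01)."""
    f11 = sum(1 for a, b in zip(x, y) if a == b == 1)
    mismatch = sum(1 for a, b in zip(x, y) if a != b)
    return f11 / (f11 + mismatch)

x, y = (1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)  # exercise 19(d)
print(cos_sim(x, y))          # -> 0.75
print(round(corr(x, y), 10))  # -> 0.25
print(jaccard(x, y))          # -> 0.6

# Exercise 20(f): for L2-normalized vectors, d = sqrt(2(1 - cos)).
xn = tuple(a / math.sqrt(sum(v * v for v in x)) for a in x)
yn = tuple(b / math.sqrt(sum(v * v for v in y)) for b in y)
assert math.isclose(euclidean(xn, yn), math.sqrt(2 * (1 - cos_sim(xn, yn))))

# Exercise 22: two maps from correlation in [-1, 1] to similarity in [0, 1].
sim_for_clustering = max(corr(x, y), 0.0)  # keep only positive correlation
sim_for_prediction = abs(corr(x, y))       # strong negative correlation counts too
```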
23. Given a similarity measure with values in the interval [0,1] describe two ways to transform this similarity value into a dissimilarity value in the interval [0,∞].
d = (1 − s)/s and d = −log s.

24. Proximity is typically defined between a pair of objects.

(a) Define two ways in which you might define the proximity among a group of objects.
Two examples are the following: (i) based on pairwise proximity, i.e., minimum pairwise similarity or maximum pairwise dissimilarity, or (ii) for points in Euclidean space compute a centroid (the mean of all the points; see Section 8.2) and then compute the sum or average of the distances of the points to the centroid.

(b) How might you define the distance between two sets of points in Euclidean space?
One approach is to compute the distance between the centroids of the two sets of points.

(c) How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)
One approach is to compute the average pairwise proximity of objects in one group of objects with those objects in the other group. Other approaches are to take the minimum or maximum proximity. Note that the cohesion of a cluster is related to the notion of the proximity of a group of objects among themselves and that the separation of clusters is related to the concept of the proximity of two groups of objects. (See Section 8.4.) Furthermore, the proximity of two clusters is an important concept in agglomerative hierarchical clustering. (See Section 8.2.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

(a) If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary? Hint: The triangle inequality, d(x,z) ≤ d(x,y) + d(y,x), can be rewritten as d(x,y) ≥ d(x,z) − d(y,z).
Unfortunately, there is a typo and a lack of clarity in the hint. The hint should be phrased as follows:
Hint: If z is an arbitrary point of S, then the triangle inequality, d(x,y) ≤ d(x,z) + d(y,z), can be rewritten as d(y,z) ≥ d(x,y) − d(x,z).

Another application of the triangle inequality, starting with d(x,z) ≤ d(x,y) + d(y,z), shows that d(y,z) ≥ d(x,z) − d(x,y). If the lower bound of d(y,z) obtained from either of these inequalities is greater than ε, then d(y,z) does not need to be calculated. Also, if the upper bound of d(y,z) obtained from the inequality d(y,z) ≤ d(y,x) + d(x,z) is less than or equal to ε, then d(y,z) does not need to be calculated.

(b) In general, how would the distance between x and y affect the number of distance calculations?
If x = y then no calculations are necessary. As x becomes farther away, typically more distance calculations are needed.

(c) Suppose that you can find a small subset of points S′, from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.
Let x and y be the two points and let x* and y* be the points in S′ that are closest to the two points, respectively. If d(x*,y*) + 2ε ≤ β, then we can safely conclude d(x,y) ≤ β. Likewise, if d(x*,y*) − 2ε ≥ β, then we can safely conclude d(x,y) ≥ β. These formulas are derived by considering the cases where x and y are as far from x* and y* as possible and as far from or close to each other as possible.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, d(x,y) = 1 − J(x,y).

1(a). Because J(x,y) ≤ 1, d(x,y) ≥ 0.
1(b). Because J(x,x) = 1, d(x,x) = 0.
2. Because J(x,y) = J(y,x), d(x,y) = d(y,x).
3. (Proof due to Jeffrey Ullman)
minhash(x) is the index of the first nonzero entry of x when x is randomly permuted. prob(minhash(x) = k) is the probability that minhash(x) = k. Note that prob(minhash(x) = minhash(y)) = J(x,y) (the minhash lemma). Therefore,

    d(x,y) = 1 − prob(minhash(x) = minhash(y)) = prob(minhash(x) ≠ minhash(y)).

We have to show that

    prob(minhash(x) ≠ minhash(z)) ≤ prob(minhash(x) ≠ minhash(y)) + prob(minhash(y) ≠ minhash(z)).

However, note that whenever minhash(x) ≠ minhash(z), at least one of minhash(x) ≠ minhash(y) and minhash(y) ≠ minhash(z) must be true.

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 70. Specifically, d(x,y) = arccos(cos(x,y)). Note that angles are in the range 0 to 180°.

1(a). Because arccos returns values in the range 0° to 180°, d(x,y) ≥ 0.
1(b). Because cos(x,x) = 1, d(x,x) = arccos(1) = 0.
2. Because cos(x,y) = cos(y,x), d(x,y) = d(y,x).
3. If the three vectors lie in a plane, then it is obvious that the angle between x and z must be less than or equal to the sum of the angles between x and y and between y and z. If y′ is the projection of y onto the plane defined by x and z, then note that the angles between x and y and between y and z are greater than or equal to those between x and y′ and between y′ and z.

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. This can be done straightforwardly by using Equations 2.15 or 2.16, but is still somewhat ad hoc, at least compared to proximity measures such as the Euclidean distance or correlation, which are mathematically well-founded. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.
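The pruning idea in exercise 25(a) can be made concrete for points in the plane. This is a sketch under my own assumptions (the point set, names, and ε are illustrative): the lower bound d(y,z) ≥ |d(x,y) − d(x,z)| lets us skip the exact distance computation whenever the bound already exceeds ε.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def points_within(points, d_to_x, d_xy, y, eps):
    """Return all points z with d(y, z) <= eps, plus the number of exact
    distance computations performed.

    By the triangle inequality, d(y, z) >= |d(x, y) - d(x, z)|; when that
    lower bound already exceeds eps, d(y, z) is never computed.
    """
    result, computed = [], 0
    for z, d_xz in zip(points, d_to_x):
        if abs(d_xy - d_xz) > eps:  # pruned: z cannot be within eps of y
            continue
        computed += 1
        if dist(y, z) <= eps:
            result.append(z)
    return result, computed

random.seed(1)
S = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(1000)]
x = (0.0, 0.0)
d_to_x = [dist(x, p) for p in S]  # the precomputed distances to x
y = (5.0, 5.0)
found, computed = points_within(S, d_to_x, dist(x, y), y, eps=1.0)

brute = [z for z in S if dist(y, z) <= 1.0]
assert sorted(found) == sorted(brute)  # pruning never loses a point
print(f"{computed} exact distances instead of {len(S)}")
```

Because the bound is a true lower bound, pruning is conservative: it only skips points that provably cannot qualify, so the result always matches the brute-force scan.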
3 Exploring Data

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.
MATLAB and R have excellent facilities for visualization. Most of the figures in this chapter were created using MATLAB. R is freely available from http://www.r-project.org/.

2. Identify at least two advantages and two disadvantages of using color to visually represent information.
Advantages: Color makes it much easier to visually distinguish visual elements from one another. For example, three clusters of two-dimensional points are more readily distinguished if the markers representing the points have different colors, rather than only different shapes. Also, figures with color are more interesting to look at.
Disadvantages: Some people are color blind and may not be able to properly interpret a color figure. Grayscale figures can show more detail in some cases. Color can be hard to use properly. For example, a poor color scheme can be garish or can focus attention on unimportant elements.

3. What are the arrangement issues that arise with respect to three-dimensional plots?
It would have been better to state this more generally as "What are the issues ...," since selection, as well as arrangement, plays a key role in displaying a three-dimensional plot.
The key issue for three-dimensional plots is how to display information so that as little information is obscured as possible. If the plot is of a two-dimensional surface, then the choice of a viewpoint is critical. However, if the plot is in electronic form, then it is sometimes possible to interactively change
the viewpoint to get a complete view of the surface. For three-dimensional solids, the situation is even more challenging. Typically, portions of the information must be omitted in order to provide the necessary information. For example, a slice or cross-section of a three-dimensional object is often shown. In some cases, transparency can be used. Again, the ability to change the arrangement of the visual elements interactively can be helpful.

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
Simple random sampling is not the best approach since it will eliminate most of the points in sparse regions. It is better to undersample the regions where data objects are too dense while keeping most or all of the data objects from sparse regions.

5. Describe how you would create visualizations to display information that describes the following types of systems. Be sure to address the following issues:

• Representation. How will you map objects, attributes, and relationships to visual elements?
• Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
• Selection. How will you handle a large number of attributes and data objects?

The following solutions are intended for illustration.

(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.
The connectivity of the network would best be represented as a graph, with the nodes being routers, gateways, or other communications devices and the links representing the connections. The bandwidth of the connection could be represented by the width of the links. Color could be used to show the percent usage of the links and nodes.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.
The simplest approach is to display each species on a separate map of the world and to shade the regions of the world where the species occurs. If several species are to be shown at once, then icons for each species can be placed on a map of the world.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.
The resource usage of each program could be displayed as a bar plot of the three quantities. Since the three quantities would have different scales, a proper scaling of the resources would be necessary for this to work well. For example, resource usage could be displayed as a percentage of the total. Alternatively, we could use three bar plots, one for each type of resource usage. On each of these plots there would be a bar whose height represents the usage of the corresponding program. This approach would not require any scaling. Yet another option would be to display a line plot of each program's resource usage. For each program, a line would be constructed by (1) considering processor time, main memory, and disk as different x locations, (2) letting the percentage resource usage of a particular program for the three quantities be the y values associated with the x values, and then (3) drawing a line to connect these three points. Note that an ordering of the three quantities needs to be specified, but is arbitrary. For this approach, the resource usage of all programs could be displayed on the same plot.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.
For each gender, the occupation breakdown could be displayed as an array of pie charts, where each row of pie charts indicates a particular level of education and each column indicates a particular year. For convenience, the time gap between each column could be five or ten years. Alternatively, we could order the occupations and then, for each gender, compute the cumulative percent employment for each occupation. If this quantity is plotted for each gender, then the area between two successive lines shows the percentage of employment for this occupation. If a color is associated with each occupation, then the area between each set of lines can also be colored with the color associated with each occupation. A similar way to show the same information would be to use a sequence of stacked bar graphs.

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.
A stem and leaf plot shows you the actual distribution of values. On the other hand, a stem and leaf plot becomes rather unwieldy for a large number of values.

7. How might you address the problem that a histogram depends on the number and location of the bins?
The best approach is to estimate what the actual distribution function of the data looks like using kernel density estimation. This branch of data analysis is relatively well-developed and is more appropriate if the widely available, but simplistic, approach of a histogram is not sufficient.

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

(a) If the line representing the median of the data is in the middle of the box, then the data is symmetrically distributed, at least in terms of the 50% of the data between the first and third quartiles. For the remaining data, the length of the whiskers and outliers is also an indication, although, since these features do not involve as many points, they may be misleading.
(b) Sepal width and length seem to be relatively symmetrically distributed, petal length seems to be rather skewed, and petal width is somewhat skewed.

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.
For Setosa, sepal length > sepal width > petal length > petal width. For Versicolour and Virginica, sepal length > sepal width and petal length > petal width, but although sepal length > petal length, petal length > sepal width.

10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
A great deal of information can be obtained by looking at (1) the box plots for each attribute, and (2) the box plots for a particular attribute across various categories of a second attribute. For example, if we compare the box plots of weight for different categories of age, we would see that weight increases with age.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets
along the diagonal in Figure 3.9.

We would expect such a distribution if the three species of Iris can be ordered according to their size, and if petal length and width are both correlated to the size of the plant and each other.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

There is a relatively flat area in the curves of the empirical CDFs and the percentile plots for both petal length and petal width. This indicates a set of flowers for which these attributes have a relatively uniform value.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

The fact that the attribute values are ordered.

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

Any set of data for which all combinations of values are unlikely to occur would produce sparse data cubes. This would include sets of continuous attributes where the set of objects described by the attributes doesn't occupy the entire data space, but only a fraction of it, as well as discrete attributes, where many combinations of values don't occur.

A dense data cube would tend to arise when either almost all combinations of the categories of the underlying attributes occur, or the level of aggregation is high enough that all combinations are likely to have values. For example, consider a data set that contains the type of traffic accident, as well as its location and date. The original data cube would be very sparse, but if it is aggregated to have categories consisting of single- or multiple-car accidents, the state in which the accident occurred, and the month in which it occurred, then we would obtain a dense data cube.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

A summary statistic that would be of interest would be the frequencies with which values or combinations of values, target and otherwise, occur. From this we could derive conditional relationships among various values. In turn, these relationships could be displayed using a graph similar to that used to display Bayesian networks.

16. Construct a data cube from Table 3.1. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.

The data cube is shown in Table 3.2. It is a dense cube; only two cells are empty.

Table 3.1. Fact table for Exercise 16.

    ProductID  LocationID  Number Sold
    1          1           10
    1          3           6
    2          1           5
    2          2           22

Table 3.2. Data cube for Exercise 16.

                      LocationID
               1      2      3      Total
    ProductID
    1          10     0      6      16
    2          5      22     0      27
    Total      15     22     6      43

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

The dimensionality reduction of PCA or SVD can be viewed as a projection of the data onto a reduced set of dimensions. In aggregation, groups of dimensions are combined. In some cases, as when days are aggregated into months or the sales of a product are aggregated by store location, the aggregation can be viewed as a change of scale. In contrast, the dimensionality reduction provided by PCA and SVD does not have such an interpretation.

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

The class is T when an even number of the four attributes are T, and F otherwise:

    A B C D  Class      A B C D  Class
    T T T T    T        F T T T    F
    T T T F    F        F T T F    T
    T T F T    F        F T F T    T
    T T F F    T        F T F F    F
    T F T T    F        F F T T    T
    T F T F    T        F F T F    F
    T F F T    T        F F F T    F
    T F F F    F        F F F F    T

The full tree tests A at the root, B at the next level, then C, then D, with one leaf per row of the table above.

Figure 4.1. Decision tree for parity function of four Boolean attributes.
The preceding tree cannot be simplified.

2. Consider the training examples shown in Table 4.1 for a binary classification problem.

Table 4.1. Data set for Exercise 2.

    CustomerID  Gender  CarType  ShirtSize    Class
    1           M       Family   Small        C0
    2           M       Sports   Medium       C0
    3           M       Sports   Medium       C0
    4           M       Sports   Large        C0
    5           M       Sports   ExtraLarge   C0
    6           M       Sports   ExtraLarge   C0
    7           F       Sports   Small        C0
    8           F       Sports   Small        C0
    9           F       Sports   Medium       C0
    10          F       Luxury   Large        C0
    11          M       Family   Large        C1
    12          M       Family   ExtraLarge   C1
    13          M       Family   Medium       C1
    14          M       Luxury   ExtraLarge   C1
    15          F       Luxury   Small        C1
    16          F       Luxury   Small        C1
    17          F       Luxury   Medium       C1
    18          F       Luxury   Medium       C1
    19          F       Luxury   Medium       C1
    20          F       Luxury   Large        C1

(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.

(b) Compute the Gini index for the CustomerID attribute.
Answer: The Gini for each CustomerID value is 0. Therefore, the overall Gini for CustomerID is 0.

(c) Compute the Gini index for the Gender attribute.
Answer: The Gini for Male is 1 − 2 × 0.5² = 0.5. The Gini for Female is also 0.5. Therefore, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.

(d) Compute the Gini index for the CarType attribute using a multiway split.
Answer: The Gini for the Family car type is 0.375, for Sports it is 0, and for Luxury it is 0.2188. The overall Gini is 0.1625.

(e) Compute the Gini index for the ShirtSize attribute using a multiway split.
Answer: The Gini for the Small shirt size is 0.48, for Medium it is 0.4898, for Large it is 0.5, and for Extra Large it is 0.5. The overall Gini for the ShirtSize attribute is 0.4914.

(f) Which attribute is better, Gender, CarType, or ShirtSize?
Answer: CarType, because it has the lowest Gini among the three attributes.

(g) Explain why CustomerID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.

3. Consider the training examples shown in Table 4.2 for a binary classification problem.

Table 4.2. Data set for Exercise 3.

    Instance  a1  a2  a3   Target Class
    1         T   T   1.0  +
    2         T   T   6.0  +
    3         T   F   5.0  −
    4         F   F   4.0  +
    5         F   T   7.0  −
    6         F   T   3.0  −
    7         F   F   8.0  −
    8         T   F   7.0  +
    9         F   T   5.0  −

(a) What is the entropy of this collection of training examples with respect to the positive class?
Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.

(b) What are the information gains of a1 and a2 relative to these training examples?
Answer: For attribute a1, the corresponding counts are:

    a1   +   −
    T    3   1
    F    1   4

The entropy for a1 is

    (4/9)[−(3/4) log2(3/4) − (1/4) log2(1/4)] + (5/9)[−(1/5) log2(1/5) − (4/5) log2(4/5)] = 0.7616.

Therefore, the information gain for a1 is 0.9911 − 0.7616 = 0.2294.

For attribute a2, the corresponding counts are:

    a2   +   −
    T    2   3
    F    2   2

The entropy for a2 is

    (5/9)[−(2/5) log2(2/5) − (3/5) log2(3/5)] + (4/9)[−(2/4) log2(2/4) − (2/4) log2(2/4)] = 0.9839.

Therefore, the information gain for a2 is 0.9911 − 0.9839 = 0.0072.

(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
Answer:

    a3    Class label   Split point   Entropy   Info Gain
    1.0   +             2.0           0.8484    0.1427
    3.0   −             3.5           0.9885    0.0026
    4.0   +             4.5           0.9183    0.0728
    5.0   −             —             —         —
    5.0   −             5.5           0.9839    0.0072
    6.0   +             6.5           0.9728    0.0183
    7.0   +             —             —         —
    7.0   −             7.5           0.8889    0.1022

The best split for a3 occurs at the split point equal to 2.
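The entropy and information-gain arithmetic in parts (a) and (b) is mechanical and easy to check programmatically. The following is a minimal Python sketch (not part of the original solution; the function names are my own) that reproduces the numbers above from the class counts in Table 4.2:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_count_lists):
    """Information gain: parent entropy minus the weighted entropy
    of the child nodes produced by a split."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_count_lists)
    return entropy(parent_counts) - weighted

# Table 4.2: 4 positive and 5 negative examples overall.
parent = [4, 5]                                # entropy ~ 0.9911
gain_a1 = info_gain(parent, [[3, 1], [1, 4]])  # a1=T -> (3+,1-), a1=F -> (1+,4-)
gain_a2 = info_gain(parent, [[2, 3], [2, 2]])  # a2=T -> (2+,3-), a2=F -> (2+,2-)
```

Running this gives entropy(parent) ≈ 0.9911, gain_a1 ≈ 0.2294, and gain_a2 ≈ 0.0072, matching parts (a) and (b).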
(d) What is the best split (among a1, a2, and a3) according to the information gain?
Answer: According to the information gain, a1 produces the best split.

(e) What is the best split (between a1 and a2) according to the classification error rate?
Answer: For attribute a1 the error rate is 2/9, while for attribute a2 it is 4/9. Therefore, according to the error rate, a1 produces the best split.

(f) What is the best split (between a1 and a2) according to the Gini index?
Answer: For attribute a1, the Gini index is

    (4/9)[1 − (3/4)² − (1/4)²] + (5/9)[1 − (1/5)² − (4/5)²] = 0.3444.

For attribute a2, the Gini index is

    (5/9)[1 − (2/5)² − (3/5)²] + (4/9)[1 − (2/4)² − (2/4)²] = 0.4889.

Since the Gini index for a1 is smaller, it produces the better split.

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.

Answer: Let Y = {y1, y2, ..., yc} denote the c classes and X = {x1, x2, ..., xk} denote the k attribute values of an attribute X. Before a node is split on X, the entropy is:

    E(Y) = −Σ_{j=1}^{c} P(yj) log2 P(yj) = −Σ_{j=1}^{c} Σ_{i=1}^{k} P(xi, yj) log2 P(yj),    (4.1)

where we have used the fact that P(yj) = Σ_{i=1}^{k} P(xi, yj) from the law of total probability.

After splitting on X, the entropy for each child node X = xi is:

    E(Y|xi) = −Σ_{j=1}^{c} P(yj|xi) log2 P(yj|xi),    (4.2)
where P(yj|xi) is the fraction of examples with X = xi that belong to class yj. The entropy after splitting on X is given by the weighted entropy of the child nodes:

    E(Y|X) = Σ_{i=1}^{k} P(xi) E(Y|xi)
           = −Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi) P(yj|xi) log2 P(yj|xi)
           = −Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj|xi),    (4.3)

where we have used a known fact from probability theory that P(xi, yj) = P(yj|xi) × P(xi). Note that E(Y|X) is also known as the conditional entropy of Y given X.

To answer this question, we need to show that E(Y|X) ≤ E(Y). Let us compute the difference between the entropies after splitting and before splitting, i.e., E(Y|X) − E(Y), using Equations 4.1 and 4.3:

    E(Y|X) − E(Y) = −Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj|xi) + Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj)
                  = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [ P(yj) / P(yj|xi) ]
                  = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [ P(xi) P(yj) / P(xi, yj) ].    (4.4)

To prove that Equation 4.4 is non-positive, we use the following property of a logarithmic function:

    Σ_{k=1}^{d} ak log(zk) ≤ log( Σ_{k=1}^{d} ak zk ),    (4.5)

subject to the condition that Σ_{k=1}^{d} ak = 1. This property is a special case of a more general theorem involving convex functions (which include the logarithmic function) known as Jensen's inequality.

By applying Jensen's inequality, Equation 4.4 can be bounded as follows:

    E(Y|X) − E(Y) ≤ log2 [ Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) × P(xi) P(yj) / P(xi, yj) ]
                  = log2 [ Σ_{i=1}^{k} P(xi) Σ_{j=1}^{c} P(yj) ]
                  = log2(1)
                  = 0.

Because E(Y|X) − E(Y) ≤ 0, it follows that entropy never increases after splitting on an attribute.

5. Consider the following data set for a binary class problem.

    A   B   Class Label
    T   F   +
    T   T   +
    T   T   +
    T   F   −
    T   T   +
    F   F   −
    F   F   −
    F   F   −
    T   T   −
    T   F   −

(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer: The contingency tables after splitting on attributes A and B are:

        A=T   A=F           B=T   B=F
    +   4     0          +  3     1
    −   3     3          −  1     5

The overall entropy before splitting is:

    Eorig = −0.4 log2 0.4 − 0.6 log2 0.6 = 0.9710.

The information gain after splitting on A is:

    E(A=T) = −(4/7) log2(4/7) − (3/7) log2(3/7) = 0.9852
    E(A=F) = −(3/3) log2(3/3) − (0/3) log2(0/3) = 0
    ∆ = Eorig − (7/10) E(A=T) − (3/10) E(A=F) = 0.2813.
The information gain after splitting on B is:

    E(B=T) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.8113
    E(B=F) = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.6500
    ∆ = Eorig − (4/10) E(B=T) − (6/10) E(B=F) = 0.2565.

Therefore, attribute A will be chosen to split the node.

(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer: The overall Gini before splitting is:

    Gorig = 1 − 0.4² − 0.6² = 0.48.

The gain in Gini after splitting on A is:

    G(A=T) = 1 − (4/7)² − (3/7)² = 0.4898
    G(A=F) = 1 − (3/3)² − (0/3)² = 0
    ∆ = Gorig − (7/10) G(A=T) − (3/10) G(A=F) = 0.1371.

The gain in Gini after splitting on B is:

    G(B=T) = 1 − (3/4)² − (1/4)² = 0.3750
    G(B=F) = 1 − (1/6)² − (5/6)² = 0.2778
    ∆ = Gorig − (4/10) G(B=T) − (6/10) G(B=F) = 0.1633.

Therefore, attribute B will be chosen to split the node.

(c) Figure 4.13 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.
Answer: Yes. Even though these measures have similar range and monotone behavior, their respective gains, ∆, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (a) and (b).

6. Consider the following set of training examples.
    X  Y  Z   No. of Class C1 Examples   No. of Class C2 Examples
    0  0  0   5                          40
    0  0  1   0                          15
    0  1  0   10                         5
    0  1  1   45                         0
    1  0  0   10                         5
    1  0  1   25                         0
    1  1  0   5                          20
    1  1  1   0                          15

(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
Answer:

Splitting attribute at level 1. To determine the test condition at the root node, we need to compute the error rates for attributes X, Y, and Z. For attribute X, the corresponding counts are:

    X   C1   C2
    0   60   60
    1   40   40

Therefore, the error rate using attribute X is (60 + 40)/200 = 0.5.

For attribute Y, the corresponding counts are:

    Y   C1   C2
    0   40   60
    1   60   40

Therefore, the error rate using attribute Y is (40 + 40)/200 = 0.4.

For attribute Z, the corresponding counts are:

    Z   C1   C2
    0   30   70
    1   70   30

Therefore, the error rate using attribute Z is (30 + 30)/200 = 0.3.

Since Z gives the lowest error rate, it is chosen as the splitting attribute at level 1.

Splitting attribute at level 2. After splitting on attribute Z, the subsequent test condition may involve either attribute X or Y. This depends on the training examples distributed to the Z = 0 and Z = 1 child nodes.

For Z = 0, the corresponding counts for attributes X and Y are the same, as shown in the table below.
    X   C1   C2        Y   C1   C2
    0   15   45        0   15   45
    1   15   25        1   15   25

The error rates in both cases (X and Y) are (15 + 15)/100 = 0.3.

For Z = 1, the corresponding counts for attributes X and Y are shown in the tables below.

    X   C1   C2        Y   C1   C2
    0   45   15        0   25   15
    1   25   15        1   45   15

Although the counts are somewhat different, their error rates remain the same, (15 + 15)/100 = 0.3.

The corresponding two-level decision tree is shown below: the root tests Z; each child then tests X (or, equivalently, Y); for Z = 0 both leaves predict C2, and for Z = 1 both leaves predict C1.

The overall error rate of the induced tree is (15 + 15 + 15 + 15)/200 = 0.3.

(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
Answer: After choosing attribute X to be the first splitting attribute, the subsequent test condition may involve either attribute Y or attribute Z.

For X = 0, the corresponding counts for attributes Y and Z are shown in the table below.

    Y   C1   C2        Z   C1   C2
    0   5    55        0   15   45
    1   55   5         1   45   15

The error rates using attributes Y and Z are 10/120 and 30/120, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.

For X = 1, the corresponding counts for attributes Y and Z are shown in the tables below.
    Y   C1   C2        Z   C1   C2
    0   35   5         0   15   25
    1   5    35        1   25   15

The error rates using attributes Y and Z are 10/80 and 30/80, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.

The corresponding two-level decision tree is shown below: the root tests X; both children test Y; for X = 0 the leaves are C2 (Y = 0) and C1 (Y = 1), while for X = 1 they are C1 (Y = 0) and C2 (Y = 1).

The overall error rate of the induced tree is (10 + 10)/200 = 0.1.

(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
Answer: From the preceding results, the error rate for part (a) is significantly larger than that for part (b). This example shows that a greedy heuristic does not always produce an optimal solution.

7. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

    A   B   C   Number of + Instances   Number of − Instances
    T   T   T   5                       0
    F   T   T   0                       20
    T   F   T   20                      0
    F   F   T   0                       5
    T   T   F   0                       0
    F   T   F   25                      0
    T   F   F   0                       0
    F   F   F   0                       25

(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
Answer: The error rate for the data without partitioning on any attribute is

    Eorig = 1 − max(50/100, 50/100) = 50/100.

After splitting on attribute A, the contingency table and the gain in error rate are:

        A=T   A=F
    +   25    25
    −   0     50

    E(A=T) = 1 − max(25/25, 0/25) = 0
    E(A=F) = 1 − max(25/75, 50/75) = 25/75
    ∆A = Eorig − (25/100) E(A=T) − (75/100) E(A=F) = 50/100 − 25/100 = 25/100.

After splitting on attribute B:

        B=T   B=F
    +   30    20
    −   20    30

    E(B=T) = 20/50,  E(B=F) = 20/50
    ∆B = Eorig − (50/100) E(B=T) − (50/100) E(B=F) = 10/100.

After splitting on attribute C:

        C=T   C=F
    +   25    25
    −   25    25

    E(C=T) = 25/50,  E(C=F) = 25/50
    ∆C = Eorig − (50/100) E(C=T) − (50/100) E(C=F) = 0.

The algorithm chooses attribute A because it has the highest gain.

(b) Repeat for the two children of the root node.
Answer: Because the A = T child node is pure, no further splitting is needed. For the A = F child node, the distribution of training instances is:

    B   C   Class +   Class −
    T   T   0         20
    F   T   0         5
    T   F   25        0
    F   F   0         25

The classification error of the A = F child node is:
    Eorig = 25/75.

After splitting on attribute B, the gain in error rate is:

        B=T   B=F
    +   25    0
    −   20    30

    E(B=T) = 20/45,  E(B=F) = 0
    ∆B = Eorig − (45/75) E(B=T) − (30/75) E(B=F) = 5/75.

After splitting on attribute C:

        C=T   C=F
    +   0     25
    −   25    25

    E(C=T) = 0,  E(C=F) = 25/50
    ∆C = Eorig − (25/75) E(C=T) − (50/75) E(C=F) = 0.

The split will be made on attribute B.

(c) How many instances are misclassified by the resulting decision tree?
Answer: 20 instances are misclassified. (The error rate is 20/100.)

(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
Answer: For the C = T child node, the error rate before splitting is Eorig = 25/50.

After splitting on attribute A:

        A=T   A=F
    +   25    0
    −   0     25

    E(A=T) = 0,  E(A=F) = 0
    ∆A = 25/50.

After splitting on attribute B:

        B=T   B=F
    +   5     20
    −   20    5

    E(B=T) = 5/25,  E(B=F) = 5/25
    ∆B = 25/50 − (25/50)(5/25) − (25/50)(5/25) = 15/50.

Therefore, A is chosen as the splitting attribute.

Training set for Exercise 8:

    Instance   A B C   Class
    1          0 0 0   +
    2          0 0 1   +
    3          0 1 0   +
    4          0 1 1   −
    5          1 0 0   +
    6          1 0 0   +
    7          1 1 0   −
    8          1 0 1   +
    9          1 1 0   −
    10         1 1 0   −

Validation set for Exercise 8:

    Instance   A B C   Class
    11         0 0 0   +
    12         0 1 1   +
    13         1 1 0   +
    14         1 0 1   −
    15         1 0 0   +

[Tree diagram not reproduced in this extraction: the tree tests A at the root, with subtrees testing B and C and leaves labeled + and −.] Figure 4.2. Decision tree and data sets for Exercise 8.

For the C = F child, the error rate before splitting is Eorig = 25/50.

After splitting on attribute A:

        A=T   A=F
    +   0     25
    −   0     25

    E(A=T) = 0,  E(A=F) = 25/50
    ∆A = 0.

After splitting on attribute B:

        B=T   B=F
    +   25    0
    −   0     25

    E(B=T) = 0,  E(B=F) = 0
    ∆B = 25/50.

Therefore, B is used as the splitting attribute. The overall error rate of the induced tree is 0.

(e) Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.
Answer: The greedy heuristic does not necessarily lead to the best tree.

8. Consider the decision tree shown in Figure 4.2.
(a) Compute the generalization error rate of the tree using the optimistic approach.
Answer: According to the optimistic approach, the generalization error rate is 3/10 = 0.3.

(b) Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
Answer: According to the pessimistic approach, the generalization error rate is (3 + 4 × 0.5)/10 = 0.5.

(c) Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
Answer: According to the reduced error pruning approach, the generalization error rate is 4/5 = 0.8.

9. Consider the decision trees shown in Figure 4.3. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length principle.

[Figure 4.3, not reproduced in this extraction: (a) a decision tree with 7 errors, having two internal nodes and three leaves labeled with classes C1, C2, and C3; (b) a decision tree with 4 errors, having four internal nodes and five leaves.]

• The total description length of a tree is given by:

    Cost(tree, data) = Cost(tree) + Cost(data|tree).
• Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.

• Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.

• Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

• Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?

Answer: Because there are 16 attributes, the cost for each internal node in the decision tree is:

    log2(m) = log2(16) = 4.

Furthermore, because there are 3 classes, the cost for each leaf node is:

    log2(k) = log2(3) ≈ 2.

The cost for each misclassification error is log2(n). The overall cost for decision tree (a) is 2 × 4 + 3 × 2 + 7 × log2 n = 14 + 7 log2 n, and the overall cost for decision tree (b) is 4 × 4 + 5 × 2 + 4 × log2 n = 26 + 4 log2 n. According to the MDL principle, tree (a) is better than (b) if n < 16 and is worse than (b) if n > 16.

10. While the .632 bootstrap approach is useful for obtaining a reliable estimate of model accuracy, it has a known limitation. Consider a two-class problem, where there are an equal number of positive and negative examples in the data. Suppose the class labels for the examples are generated randomly. The classifier used is an unpruned decision tree (i.e., a perfect memorizer). Determine the accuracy of the classifier using each of the following methods.

(a) The holdout method, where two-thirds of the data are used for training and the remaining one-third are used for testing.
Answer: Assuming that the training and test samples are equally representative, the test error rate will be close to 50%.
(b) Ten-fold cross-validation.
Answer: Assuming that the training and test samples for each fold are equally representative, the test error rate will be close to 50%.

(c) The .632 bootstrap method.
Answer: The training accuracy of a perfect memorizer is 100%, while the accuracy on each bootstrap sample is close to 50%. Substituting this information into the formula for the .632 bootstrap method, the accuracy estimate is:

    (1/b) Σ_{i=1}^{b} [ 0.632 × 0.5 + 0.368 × 1 ] = 0.684.

(d) From the results in parts (a), (b), and (c), which method provides a more reliable evaluation of the classifier's accuracy?
Answer: The ten-fold cross-validation and holdout methods provide a better error estimate than the .632 bootstrap method.

11. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

    Z = (pA − pB) / sqrt( 2 p (1 − p) / N ).

Classifier A is assumed to be better than classifier B if Z > 1.96.

Table 4.3 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 5.)
Table 4.3. Comparing the accuracy of various classification methods.

    Data Set      Size (N)  Decision Tree (%)  naïve Bayes (%)  Support vector machine (%)
    Anneal        898       92.09              79.62            87.19
    Australia     690       85.51              76.81            84.78
    Auto          205       81.95              58.05            70.73
    Breast        699       95.14              95.99            96.42
    Cleve         303       76.24              83.50            84.49
    Credit        690       85.80              77.54            85.07
    Diabetes      768       72.40              75.91            76.82
    German        1000      70.90              74.70            74.40
    Glass         214       67.29              48.59            59.81
    Heart         270       80.00              84.07            83.70
    Hepatitis     155       81.94              83.23            87.10
    Horse         368       85.33              78.80            82.61
    Ionosphere    351       89.17              82.34            88.89
    Iris          150       94.67              95.33            96.00
    Labor         57        78.95              94.74            92.98
    Led7          3200      73.34              73.16            73.56
    Lymphography  148       77.03              83.11            86.49
    Pima          768       74.35              76.04            76.95
    Sonar         208       78.85              69.71            76.92
    Tic-tac-toe   958       83.72              70.04            98.33
    Vehicle       846       71.04              45.04            74.94
    Wine          178       94.38              96.63            98.88
    Zoo           101       93.07              93.07            96.04

Answer: A summary of the relative performance of the classifiers is given below:

    win-loss-draw           Decision tree   naïve Bayes   Support vector machine
    Decision tree           0-0-23          9-3-11        2-7-14
    naïve Bayes             3-9-11          0-0-23        0-8-15
    Support vector machine  7-2-14          8-0-15        0-0-23

12. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.

Answer: Let r = X/N. Since X has a binomial distribution, r also has the same distribution. The mean and variance of r can be computed as follows:

    Mean: E[r] = E[X/N] = E[X]/N = (Np)/N = p.

    Variance: E[(r − E[r])²] = E[(X/N − E[X/N])²] = E[(X − E[X])²]/N² = Np(1 − p)/N² = p(1 − p)/N.
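The Z-statistic of Exercise 11 is simple to apply to any row of Table 4.3. A small illustrative sketch (my own code, not the manual's) comparing two rows:

```python
import math

def z_statistic(p_a, p_b, n):
    """Z = (pA - pB) / sqrt(2 p (1 - p) / N), where p is the average
    accuracy of the two classifiers on a data set of size N."""
    p = (p_a + p_b) / 2
    return (p_a - p_b) / math.sqrt(2 * p * (1 - p) / n)

# Tic-tac-toe row: SVM (98.33%) vs. naive Bayes (70.04%), N = 958.
z_ttt = z_statistic(0.9833, 0.7004, 958)

# Led7 row: decision tree (73.34%) vs. naive Bayes (73.16%), N = 3200.
z_led7 = z_statistic(0.7334, 0.7316, 3200)
```

With the 1.96 threshold, the first comparison counts as a win and the second as a draw, which is how the win-loss-draw summary above is tallied.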
5 Classification: Alternative Techniques

1. Consider a binary classification problem with the following set of attributes and attribute values:

• Air Conditioner = {Working, Broken}
• Engine = {Good, Bad}
• Mileage = {High, Medium, Low}
• Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

    Mileage = High −→ Value = Low
    Mileage = Low −→ Value = High
    Air Conditioner = Working, Engine = Good −→ Value = High
    Air Conditioner = Working, Engine = Bad −→ Value = Low
    Air Conditioner = Broken −→ Value = Low

(a) Are the rules mutually exclusive?
Answer: No.

(b) Is the rule set exhaustive?
Answer: Yes.

(c) Is ordering needed for this set of rules?
Answer: Yes, because a test instance may trigger more than one rule.

(d) Do you need a default class for the rule set?
Answer: No, because every instance is guaranteed to trigger at least one rule.

2. The RIPPER algorithm (by Cohen [1]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [3]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

    R1: A −→ C
    R2: A ∧ B −→ C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

    vIREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. vIREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of vIREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

    vRIPPER = (p − n) / (p + n).

(a) Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered
by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.
Answer: For this problem, p0 = 350, n0 = 150, p1 = 300, and n1 = 50. Therefore, the FOIL's information gain for R2 with respect to R1 is:

    Gain = 300 × [ log2(300/350) − log2(350/500) ] = 87.65.

(b) Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute vIREP for both rules. Which rule does IREP prefer?
Answer: For this problem, P = 500 and N = 500.

For rule R1, p = 200 and n = 50. Therefore,

    vIREP(R1) = (p + (N − n)) / (P + N) = (200 + (500 − 50)) / 1000 = 0.65.

For rule R2, p = 100 and n = 5. Therefore,

    vIREP(R2) = (p + (N − n)) / (P + N) = (100 + (500 − 5)) / 1000 = 0.595.

Thus, IREP prefers rule R1.

(c) Compute vRIPPER for the previous problem. Which rule does RIPPER prefer?
Answer:

    vRIPPER(R1) = (p − n)/(p + n) = 150/250 = 0.6
    vRIPPER(R2) = (p − n)/(p + n) = 95/105 = 0.9.

Thus, RIPPER prefers the rule R2.

3. C4.5rules is an implementation of an indirect method for generating rules from a decision tree. RIPPER is an implementation of a direct method for generating rules directly from data.

(a) Discuss the strengths and weaknesses of both methods.
Answer: The C4.5rules algorithm generates classification rules from a global perspective. This is because the rules are derived from decision trees, which are induced with the objective of partitioning the feature space into homogeneous regions, without focusing on any classes. In contrast, RIPPER generates rules one class at a time. Thus, it is more biased towards the classes that are generated first.

(b) Consider a data set that has a large difference in the class sizes (i.e., some classes are much bigger than others). Which method (between C4.5rules and RIPPER) is better in terms of finding high-accuracy rules for the small classes?
Answer: The class-ordering scheme used by C4.5rules has an easier interpretation than the scheme used by RIPPER.
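The three rule-evaluation measures used in Exercise 2 can be captured in a few lines of code. The following is a minimal sketch (my own helper names, not from the manual) that reproduces the numbers above:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain of a rule covering (p1, n1) positive and
    negative examples, relative to the rule it extends, covering (p0, n0)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def v_irep(p, n, P, N):
    """IREP's pruning measure: validation-set accuracy of the rule."""
    return (p + (N - n)) / (P + N)

def v_ripper(p, n):
    """RIPPER's pruning measure."""
    return (p - n) / (p + n)

gain_r2 = foil_gain(350, 150, 300, 50)   # part (a): ~87.65
irep_r1 = v_irep(200, 50, 500, 500)      # part (b): 0.65
irep_r2 = v_irep(100, 5, 500, 500)       # part (b): 0.595
ripper_r1 = v_ripper(200, 50)            # part (c): 0.6
ripper_r2 = v_ripper(100, 5)             # part (c): ~0.905
```

Note that IREP ranks R1 above R2 while RIPPER ranks R2 above R1, illustrating that the two pruning criteria can disagree on the same validation data.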
4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

    R1: A −→ + (covers 4 positive and 1 negative examples),
    R2: B −→ + (covers 30 positive and 10 negative examples),
    R3: C −→ + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

(a) Rule accuracy.
Answer: The accuracies of the rules are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore R1 is the best candidate and R3 is the worst candidate according to rule accuracy.

(b) FOIL's information gain.
Answer: Assume the initial rule is ∅ −→ +. This rule covers p0 = 100 positive examples and n0 = 400 negative examples.

The rule R1 covers p1 = 4 positive examples and n1 = 1 negative example. Therefore, the FOIL's information gain for this rule is

    4 × [ log2(4/5) − log2(100/500) ] = 8.

The rule R2 covers p1 = 30 positive examples and n1 = 10 negative examples. Therefore, the FOIL's information gain for this rule is

    30 × [ log2(30/40) − log2(100/500) ] = 57.2.

The rule R3 covers p1 = 100 positive examples and n1 = 90 negative examples. Therefore, the FOIL's information gain for this rule is

    100 × [ log2(100/190) − log2(100/500) ] = 139.6.

Therefore, R3 is the best candidate and R1 is the worst candidate according to FOIL's information gain.

(c) The likelihood ratio statistic.
Answer: For R1, the expected frequency for the positive class is 5 × 100/500 = 1 and the expected frequency for the negative class is 5 × 400/500 = 4. Therefore, the likelihood ratio for R1 is

    2 × [ 4 × log2(4/1) + 1 × log2(1/4) ] = 12.
For R2, the expected frequency for the positive class is 40 × 100/500 = 8 and the expected frequency for the negative class is 40 × 400/500 = 32. Therefore, the likelihood ratio for R2 is

    2 × [ 30 × log2(30/8) + 10 × log2(10/32) ] = 80.85.

For R3, the expected frequency for the positive class is 190 × 100/500 = 38 and the expected frequency for the negative class is 190 × 400/500 = 152. Therefore, the likelihood ratio for R3 is

    2 × [ 100 × log2(100/38) + 90 × log2(90/152) ] = 143.09.

Therefore, R3 is the best candidate and R1 is the worst candidate according to the likelihood ratio statistic.

(d) The Laplace measure.
Answer: The Laplace measures of the rules are 71.43% (for R1), 73.81% (for R2), and 52.6% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the Laplace measure.

(e) The m-estimate measure (with k = 2 and p+ = 0.2).
Answer: The m-estimate measures of the rules are 62.86% (for R1), 72.38% (for R2), and 52.3% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the m-estimate measure.

5. Figure 5.1 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

(a) The likelihood ratio statistic.
Answer: There are 29 positive examples and 21 negative examples in the data set.

R1 covers 12 positive examples and 3 negative examples. The expected frequency for the positive class is 15 × 29/50 = 8.7 and the expected frequency for the negative class is 15 × 21/50 = 6.3. Therefore, the likelihood ratio for R1 is

    2 × [ 12 × log2(12/8.7) + 3 × log2(3/6.3) ] = 4.71.

R2 covers 7 positive examples and 3 negative examples. The expected frequency for the positive class is 10 × 29/50 = 5.8 and the expected
frequency for the negative class is 10 × 21/50 = 4.2. Therefore, the likelihood ratio for R2 is

    2 × [ 7 × log2(7/5.8) + 3 × log2(3/4.2) ] = 0.89.

R3 covers 8 positive examples and 4 negative examples. The expected frequency for the positive class is 12 × 29/50 = 6.96 and the expected frequency for the negative class is 12 × 21/50 = 5.04. Therefore, the likelihood ratio for R3 is

    2 × [ 8 × log2(8/6.96) + 4 × log2(4/5.04) ] = 0.5472.

R1 is the best rule and R3 is the worst rule according to the likelihood ratio statistic.

[Figure not reproduced in this extraction: regions R1, R2, and R3 drawn over a collection of + and − training records.] Figure 5.1. Elimination of training records by the sequential covering algorithm. R1, R2, and R3 represent regions covered by three different rules.

(b) The Laplace measure.
Answer: The Laplace measures for the rules are 76.47% (for R1), 66.67% (for R2), and 64.29% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the Laplace measure.

(c) The m-estimate measure (with k = 2 and p+ = 0.58).
Answer: The m-estimate measures for the rules are 77.41% (for R1), 68.0% (for R2), and 65.43% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the m-estimate measure.

(d) The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.
Answer:
If the examples for R1 are not discarded, then R2 will be chosen because it has a higher accuracy (70%) than R3 (66.7%).

(e) The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.
Answer:
If the positive examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 60%, respectively. Therefore R2 is preferred over R3.

(f) The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.
Answer:
If the positive and negative examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 75%, respectively. In this case, R3 is preferred over R2.

6. (a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
Answer:
Given P(S|UG) = 0.15, P(S|G) = 0.23, P(G) = 0.2, P(UG) = 0.8. We want to compute P(G|S). According to Bayes' theorem,

P(G|S) = (0.23 × 0.2) / (0.15 × 0.8 + 0.23 × 0.2) = 0.277.    (5.1)

(b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
Answer:
An undergraduate student, because P(UG) > P(G).

(c) Repeat part (b) assuming that the student is a smoker.
Answer:
An undergraduate student, because P(UG|S) > P(G|S).

(d) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.
Answer:
First, we need to estimate all the probabilities.
P(D|UG) = 0.1, P(D|G) = 0.3.
P(D) = P(UG) × P(D|UG) + P(G) × P(D|G) = 0.8 × 0.1 + 0.2 × 0.3 = 0.14.
P(S) = P(S|UG)P(UG) + P(S|G)P(G) = 0.15 × 0.8 + 0.23 × 0.2 = 0.166.
P(DS|G) = P(D|G) × P(S|G) = 0.3 × 0.23 = 0.069 (using the conditional independence assumption).
P(DS|UG) = P(D|UG) × P(S|UG) = 0.1 × 0.15 = 0.015.

We need to compute P(G|DS) and P(UG|DS).

P(G|DS) = (0.069 × 0.2)/P(DS) = 0.0138/P(DS)
P(UG|DS) = (0.015 × 0.8)/P(DS) = 0.012/P(DS)

Since P(G|DS) > P(UG|DS), he/she is more likely to be a graduate student.

7. Consider the data set shown in Table 5.1.

Table 5.1. Data set for Exercise 7.
Record  A  B  C  Class
1       0  0  0  +
2       0  0  1  −
3       0  1  1  −
4       0  1  1  −
5       0  0  1  +
6       1  0  1  +
7       1  0  1  −
8       1  0  1  −
9       1  1  1  +
10      1  0  1  +

(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
Answer:
P(A=1|−) = 2/5 = 0.4, P(B=1|−) = 2/5 = 0.4, P(C=1|−) = 1, P(A=0|−) = 3/5 = 0.6, P(B=0|−) = 3/5 = 0.6, P(C=0|−) = 0;
P(A=1|+) = 3/5 = 0.6, P(B=1|+) = 1/5 = 0.2, P(C=1|+) = 2/5 = 0.4, P(A=0|+) = 2/5 = 0.4, P(B=0|+) = 4/5 = 0.8, P(C=0|+) = 3/5 = 0.6.

(b) Use the estimates of conditional probabilities given in the previous question to predict the class label for a test sample (A=0, B=1, C=0) using the naïve Bayes approach.
Answer:
Let P(A=0, B=1, C=0) = K.

P(+|A=0, B=1, C=0)
= P(A=0, B=1, C=0|+) × P(+) / P(A=0, B=1, C=0)
= P(A=0|+)P(B=1|+)P(C=0|+) × P(+) / K
= 0.4 × 0.2 × 0.6 × 0.5 / K
= 0.024/K.

P(−|A=0, B=1, C=0)
= P(A=0, B=1, C=0|−) × P(−) / P(A=0, B=1, C=0)
= P(A=0|−) × P(B=1|−) × P(C=0|−) × P(−) / K
= 0/K.

The class label should be '+'.

(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
Answer:
P(A=0|+) = (2+2)/(5+4) = 4/9, P(A=0|−) = (3+2)/(5+4) = 5/9,
P(B=1|+) = (1+2)/(5+4) = 3/9, P(B=1|−) = (2+2)/(5+4) = 4/9,
P(C=0|+) = (3+2)/(5+4) = 5/9, P(C=0|−) = (0+2)/(5+4) = 2/9.

(d) Repeat part (b) using the conditional probabilities given in part (c).
Answer:
Let P(A=0, B=1, C=0) = K.
P(+|A=0, B=1, C=0)
= P(A=0, B=1, C=0|+) × P(+) / P(A=0, B=1, C=0)
= P(A=0|+)P(B=1|+)P(C=0|+) × P(+) / K
= (4/9) × (3/9) × (5/9) × 0.5 / K
= 0.0412/K

P(−|A=0, B=1, C=0)
= P(A=0, B=1, C=0|−) × P(−) / P(A=0, B=1, C=0)
= P(A=0|−) × P(B=1|−) × P(C=0|−) × P(−) / K
= (5/9) × (4/9) × (2/9) × 0.5 / K
= 0.0274/K

The class label should be '+'.

(e) Compare the two methods for estimating probabilities. Which method is better and why?
Answer:
When one of the conditional probabilities is zero, the estimate obtained with the m-estimate approach is better, since we do not want the entire expression to become zero.

8. Consider the data set shown in Table 5.2.

(a) Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.
Answer:
P(A=1|+) = 0.6, P(B=1|+) = 0.4, P(C=1|+) = 0.8, P(A=1|−) = 0.4, P(B=1|−) = 0.4, and P(C=1|−) = 0.2.

(b) Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.
Answer:
Let R : (A=1, B=1, C=1) be the test record. To determine its class, we need to compute P(+|R) and P(−|R). Using Bayes' theorem,
Table 5.2. Data set for Exercise 8.
Instance  A  B  C  Class
1         0  0  1  −
2         1  0  1  +
3         0  1  0  −
4         1  0  0  −
5         1  0  1  +
6         0  0  1  +
7         1  1  0  −
8         0  0  0  −
9         0  1  0  +
10        1  1  1  +

P(+|R) = P(R|+)P(+)/P(R) and P(−|R) = P(R|−)P(−)/P(R).
Since P(+) = P(−) = 0.5 and P(R) is constant, R can be classified by comparing P(+|R) and P(−|R).
For this question,
P(R|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = 0.192
P(R|−) = P(A=1|−) × P(B=1|−) × P(C=1|−) = 0.032
Since P(R|+) is larger, the record is assigned to the (+) class.

(c) Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships between A and B.
Answer:
P(A=1) = 0.5, P(B=1) = 0.4, and P(A=1, B=1) = 0.2 = P(A=1) × P(B=1). Therefore, A and B are independent.

(d) Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).
Answer:
P(A=1) = 0.5, P(B=0) = 0.6, and P(A=1, B=0) = 0.3 = P(A=1) × P(B=0). A and B are still independent.

(e) Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and P(B=1|Class=+). Are the variables conditionally independent given the class?
Answer:
Compare P(A=1, B=1|+) = 0.2 against P(A=1|+) = 0.6 and P(B=1|Class=+) = 0.4. Since the product of P(A=1|+) and P(B=1|+) is not the same as P(A=1, B=1|+), A and B are not conditionally independent given the class.
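The estimates and the prediction in Exercise 8 can be checked with a short script (a sketch in Python; the record list transcribes Table 5.2, and the helper names are ours, not the book's):

```python
# Table 5.2: (A, B, C, Class)
records = [(0,0,1,'-'), (1,0,1,'+'), (0,1,0,'-'), (1,0,0,'-'), (1,0,1,'+'),
           (0,0,1,'+'), (1,1,0,'-'), (0,0,0,'-'), (0,1,0,'+'), (1,1,1,'+')]

def cond_prob(attr_idx, value, cls):
    """P(attribute = value | class) estimated by the relative frequency."""
    in_class = [r for r in records if r[3] == cls]
    return sum(1 for r in in_class if r[attr_idx] == value) / len(in_class)

def class_conditional(test, cls):
    """Naive Bayes likelihood P(A, B, C | class) under conditional independence."""
    p = 1.0
    for i, v in enumerate(test):
        p *= cond_prob(i, v, cls)
    return p

p_plus = class_conditional((1, 1, 1), '+')   # 0.6 * 0.4 * 0.8 = 0.192
p_minus = class_conditional((1, 1, 1), '-')  # 0.4 * 0.4 * 0.2 = 0.032
prediction = '+' if p_plus > p_minus else '-'  # priors are equal (5 records each)
```

Since the priors are equal, comparing the two class-conditional likelihoods is enough to reproduce the '+' prediction in part (b).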
[Figure 5.2. Data set for Exercise 9: records of classes A and B, each split into two groups (A1, A2 and B1, B2), described by distinguishing attributes and noise attributes.]

9. (a) Explain how naïve Bayes performs on the data set shown in Figure 5.2.
Answer:
NB will not do well on this data set because the conditional probabilities for each distinguishing attribute given the class are the same for both class A and class B.

(b) If each class is further divided such that there are four classes (A1, A2, B1, and B2), will naïve Bayes perform better?
Answer:
The performance of NB will improve on the subclasses because the product of conditional probabilities among the distinguishing attributes will be different for each subclass.

(c) How will a decision tree perform on this data set (for the two-class problem)? What if there are four classes?
Answer:
For the two-class problem, a decision tree will not perform well because the entropy will not improve after splitting the data using the distinguishing attributes. If there are four classes, then the decision tree will improve considerably.

10. Repeat the analysis shown in Example 5.3 for finding the location of a decision boundary using the following information:

(a) The prior probabilities are P(Crocodile) = 2 × P(Alligator).
Answer:
x̂ = 13.0379.
(b) The prior probabilities are P(Alligator) = 2 × P(Crocodile).
Answer:
x̂ = 13.9621.

(c) The prior probabilities are the same, but their standard deviations are different; i.e., σ(Crocodile) = 4 and σ(Alligator) = 2.
Answer:
x̂ = 22.1668.

11. Figure 5.3 illustrates the Bayesian belief network for the data set shown in Table 5.3. (Assume that all the attributes are binary.)

[Figure 5.3. Bayesian belief network: Mileage is a parent of Engine, and Engine and Air Conditioner are parents of Car Value.]

Table 5.3. Data set for Exercise 11.
Mileage  Engine  Air Conditioner  Number of Records      Number of Records
                                  with Car Value = Hi    with Car Value = Lo
Hi       Good    Working          3                      4
Hi       Good    Broken           1                      2
Hi       Bad     Working          1                      5
Hi       Bad     Broken           0                      4
Lo       Good    Working          9                      0
Lo       Good    Broken           5                      1
Lo       Bad     Working          1                      2
Lo       Bad     Broken           0                      2

(a) Draw the probability table for each node in the network.

P(Mileage = Hi) = 0.5
P(AirCond = Working) = 0.625
P(Engine = Good | Mileage = Hi) = 0.5
P(Engine = Good | Mileage = Lo) = 0.75

[Figure 5.4. Bayesian belief network for Exercise 12: Battery (B) and Fuel (F) are parents of Gauge (G) and of Start (S), with the following probability tables:]
P(B = bad) = 0.1
P(F = empty) = 0.2
P(G = empty | B = good, F = not empty) = 0.1
P(G = empty | B = good, F = empty) = 0.8
P(G = empty | B = bad, F = not empty) = 0.2
P(G = empty | B = bad, F = empty) = 0.9
P(S = no | B = good, F = not empty) = 0.1
P(S = no | B = good, F = empty) = 0.8
P(S = no | B = bad, F = not empty) = 0.9
P(S = no | B = bad, F = empty) = 1.0

P(Value = High | Engine = Good, AirCond = Working) = 0.750
P(Value = High | Engine = Good, AirCond = Broken) = 0.667
P(Value = High | Engine = Bad, AirCond = Working) = 0.222
P(Value = High | Engine = Bad, AirCond = Broken) = 0

(b) Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).

P(Engine = Bad, AirCond = Broken)
= Σ_{α,β} P(Engine = Bad, AirCond = Broken, Mileage = α, Value = β)
= Σ_{α,β} P(Value = β | Engine = Bad, AirCond = Broken) × P(Engine = Bad | Mileage = α) P(Mileage = α) P(AirCond = Broken)
= 0.1453.

12. Given the Bayesian network shown in Figure 5.4, compute the following probabilities:

(a) P(B = good, F = empty, G = empty, S = yes).
Answer:
P(B = good, F = empty, G = empty, S = yes)
= P(B = good) × P(F = empty) × P(G = empty | B = good, F = empty) × P(S = yes | B = good, F = empty)
= 0.9 × 0.2 × 0.8 × 0.2 = 0.0288.

(b) P(B = bad, F = empty, G = not empty, S = no).
Answer:
P(B = bad, F = empty, G = not empty, S = no)
= P(B = bad) × P(F = empty) × P(G = not empty | B = bad, F = empty) × P(S = no | B = bad, F = empty)
= 0.1 × 0.2 × 0.1 × 1.0 = 0.002.

(c) Given that the battery is bad, compute the probability that the car will start.
Answer:
P(S = yes | B = bad) = Σ_α P(S = yes | B = bad, F = α) P(F = α) = 0.1 × 0.8 + 0 × 0.2 = 0.08

13. Consider the one-dimensional data set shown in Table 5.4.

Table 5.4. Data set for Exercise 13.
x  0.5  3.0  4.5  4.6  4.9  5.2  5.3  5.5  7.0  9.5
y  −    −    +    +    +    −    −    +    −    −

(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
Answer:
1-nearest neighbor: +, 3-nearest neighbor: −, 5-nearest neighbor: +, 9-nearest neighbor: −.

(b) Repeat the previous analysis using the distance-weighted voting approach described in Section 5.2.1.
Answer:
1-nearest neighbor: +, 3-nearest neighbor: +, 5-nearest neighbor: +, 9-nearest neighbor: +.

14. The nearest-neighbor algorithm described in Section 5.2 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [2] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = Σ_{i=1}^k |n_i1/n_1 − n_i2/n_2|,    (5.2)

where n_ij is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.
Consider the training set for the loan classification problem shown in Figure 5.9. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.
Answer:
The training set shown in Figure 5.9 can be summarized for the Home Owner and Marital Status attributes as follows.

Marital Status:
Class  Single  Married  Divorced
Yes    2       0        1
No     2       4        1

Home Owner:
Class  Yes  No
Yes    0    3
No     3    4

d(Single, Married) = 1
d(Single, Divorced) = 0
d(Married, Divorced) = 1
d(Refund = Yes, Refund = No) = 6/7

15. For each of the Boolean functions given below, state whether the problem is linearly separable.
(a) A AND B AND C
Answer: Yes
(b) NOT A AND B
Answer: Yes
(c) (A OR B) AND (A OR C)
Answer: Yes
(d) (A XOR B) AND (A OR B)
Answer: No

16. (a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.
Answer:
Let x1 and x2 be a pair of Boolean variables and y be the output. For the AND function, a possible perceptron model is:
y = sgn(x1 + x2 − 1.5).
For the OR function, a possible perceptron model is:
y = sgn(x1 + x2 − 0.5).

(b) Comment on the disadvantage of using linear functions as activation functions for multilayer neural networks.
Answer:
Multilayer neural networks are useful for modeling nonlinear relationships between the input and output attributes. However, if linear functions are used as activation functions (instead of sigmoid or hyperbolic tangent functions), the output is still a linear combination of its input attributes. Such a network is just as expressive as a perceptron.

17. You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 5.5 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A,...,Z) = 1 − P(+|A,...,Z). Assume that we are mostly interested in detecting instances from the positive class.
Table 5.5. Posterior probabilities for Exercise 17.
Instance  True Class  P(+|A,...,Z, M1)  P(+|A,...,Z, M2)
1         +           0.73              0.61
2         +           0.69              0.03
3         −           0.44              0.68
4         −           0.55              0.31
5         +           0.67              0.45
6         +           0.47              0.09
7         −           0.08              0.38
8         −           0.15              0.05
9         +           0.45              0.01
10        −           0.35              0.04

(a) Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.
Answer:
The ROC curves for M1 and M2 are shown in Figure 5.5.

[Figure 5.5. ROC curves for M1 and M2 (TPR versus FPR).]

M1 is better, since its area under the ROC curve is larger than the area under the ROC curve for M2.

(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instances whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.

When t = 0.5, the confusion matrix for M1 is shown below.

          Predicted +  Predicted −
Actual +  3            2
Actual −  1            4

Precision = 3/4 = 75%.
Recall = 3/5 = 60%.
F-measure = (2 × 0.75 × 0.6)/(0.75 + 0.6) = 0.667.

(c) Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?
Answer:
When t = 0.5, the confusion matrix for M2 is shown below.

          Predicted +  Predicted −
Actual +  1            4
Actual −  1            4

Precision = 1/2 = 50%.
Recall = 1/5 = 20%.
F-measure = (2 × 0.5 × 0.2)/(0.5 + 0.2) = 0.2857.
Based on the F-measure, M1 is still better than M2. This result is consistent with the ROC plot.

(d) Repeat part (c) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?
Answer:
When t = 0.1, the confusion matrix for M1 is shown below.

          Predicted +  Predicted −
Actual +  5            0
Actual −  4            1

Precision = 5/9 = 55.6%.
Recall = 5/5 = 100%.
F-measure = (2 × 0.556 × 1)/(0.556 + 1) = 0.715.
According to the F-measure, t = 0.1 is better than t = 0.5.
When t = 0.1, FPR = 0.8 and TPR = 1. On the other hand, when t = 0.5, FPR = 0.2 and TPR = 0.6. Since (0.2, 0.6) is closer to the point (0, 1), we favor t = 0.5. This result is inconsistent with the results using the F-measure. We can also show this by computing the area under the ROC curve.
For t = 0.5, area = 0.6 × (1 − 0.2) = 0.6 × 0.8 = 0.48.
For t = 0.1, area = 1 × (1 − 0.8) = 1 × 0.2 = 0.2.
Since the area for t = 0.5 is larger than the area for t = 0.1, we prefer t = 0.5.

18. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2.

         Number of Instances
X   Y    +     −
0   0    0     100
1   0    0     0
2   0    0     100
0   1    10    100
1   1    10    0
2   1    10    100
0   2    0     100
1   2    0     0
2   2    0     100

The concept for the "+" class is Y = 1 and the concept for the "−" class is X = 0 ∨ X = 2.

(a) Build a decision tree on the data set. Does the tree capture the "+" and "−" concepts?
Answer:
There are 30 positive and 600 negative examples in the data. Therefore, at the root node, the error rate is

E_orig = 1 − max(30/630, 600/630) = 30/630.

If we split on X, the gain in error rate is:

     X=0   X=1   X=2
+    10    10    10
−    300   0     300

E_{X=0} = 10/310, E_{X=1} = 0, E_{X=2} = 10/310,

∆X = E_orig − (310/630)(10/310) − 0 − (310/630)(10/310) = 10/630.

If we split on Y, the gain in error rate is:

     Y=0   Y=1   Y=2
+    0     30    0
−    200   200   200

E_{Y=0} = 0, E_{Y=1} = 30/230, E_{Y=2} = 0,

∆Y = E_orig − (230/630)(30/230) = 0.

Therefore, X is chosen to be the first splitting attribute. Since the X = 1 child node is pure, it does not require further splitting. We may use attribute Y to split the impure nodes, X = 0 and X = 2, as follows:
• The Y = 0 and Y = 2 nodes contain 100 − instances.
• The Y = 1 node contains 100 − and 10 + instances.
In all three cases for Y, the child nodes are labeled as −. The resulting concept is

class = +, if X = 1; class = −, otherwise.

(b) What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)
Answer:
The confusion matrix on the training data:

          Predicted +  Predicted −
Actual +  10           20
Actual −  0            600

accuracy = 610/630 = 0.9683
precision = 10/10 = 1.0
recall = 10/30 = 0.3333
F-measure = (2 × 0.3333 × 1.0)/(1.0 + 0.3333) = 0.5

(c) Build a new decision tree with the following cost function:

C(i,j) = 0, if i = j;
         1, if i = +, j = −;
         (Number of − instances)/(Number of + instances), if i = −, j = +.

(Hint: only the leaves of the old decision tree need to be changed.) Does the decision tree capture the "+" concept?
Answer:
The cost matrix can be summarized as follows:
          Predicted +  Predicted −
Actual +  0            600/30 = 20
Actual −  1            0

The decision tree in part (a) has 7 leaf nodes: X = 1, X = 0 ∧ Y = 0, X = 0 ∧ Y = 1, X = 0 ∧ Y = 2, X = 2 ∧ Y = 0, X = 2 ∧ Y = 1, and X = 2 ∧ Y = 2. Only X = 0 ∧ Y = 1 and X = 2 ∧ Y = 1 are impure nodes. The cost of misclassifying these impure nodes as positive class is:

10 × 0 + 1 × 100 = 100

while the cost of misclassifying them as negative class is:

10 × 20 + 0 × 100 = 200.

These nodes are therefore labeled as +. The resulting concept is

class = +, if X = 1 ∨ (X = 0 ∧ Y = 1) ∨ (X = 2 ∧ Y = 1); class = −, otherwise.

(d) What are the accuracy, precision, recall, and F1-measure of the new decision tree?
Answer:
The confusion matrix of the new tree:

          Predicted +  Predicted −
Actual +  30           0
Actual −  200          400

accuracy = 430/630 = 0.6825
precision = 30/230 = 0.1304
recall = 30/30 = 1.0
F-measure = (2 × 0.1304 × 1.0)/(1.0 + 0.1304) = 0.2307

19. (a) Consider the cost matrix for a two-class problem. Let C(+,+) = C(−,−) = p, C(+,−) = C(−,+) = q, and q > p. Show that minimizing the cost function is equivalent to maximizing the classifier's accuracy.
Answer:

Confusion matrix:      Cost matrix:
     +   −                  +   −
+    a   b             +    p   q
−    c   d             −    q   p

The total cost is F = p(a + d) + q(b + c). Since acc = (a + d)/N, where N = a + b + c + d, we may write

F = N[acc × p + (1 − acc) × q] = N × acc × (p − q) + qN.

Because p − q is negative, minimizing the total cost is equivalent to maximizing accuracy.

(b) Show that a cost matrix is scale-invariant. For example, if the cost matrix is rescaled from C(i,j) → βC(i,j), where β is the scaling factor, the decision threshold (Equation 5.82) will remain unchanged.
Answer:
The cost matrix is:

     +        −
+    c(+,+)   c(+,−)
−    c(−,+)   c(−,−)

A node t is classified as positive if:

c(+,−) p(+|t) + c(−,−) p(−|t) > c(−,+) p(−|t) + c(+,+) p(+|t)
⟹ c(+,−) p(+|t) + c(−,−)[1 − p(+|t)] > c(−,+)[1 − p(+|t)] + c(+,+) p(+|t)
⟹ p(+|t) > [c(−,+) − c(−,−)] / ([c(−,+) − c(−,−)] + [c(+,−) − c(+,+)])

The transformed cost matrix is:

     +         −
+    βc(+,+)   βc(+,−)
−    βc(−,+)   βc(−,−)

Therefore, the decision rule is:

p(+|t) > [βc(−,+) − βc(−,−)] / ([βc(−,+) − βc(−,−)] + [βc(+,−) − βc(+,+)])
       = [c(−,+) − c(−,−)] / ([c(−,+) − c(−,−)] + [c(+,−) − c(+,+)])

which is the same as the original decision rule.

(c) Show that a cost matrix is translation-invariant. In other words, adding a constant factor to all entries in the cost matrix will not affect the decision threshold (Equation 5.82).
Answer:
The transformed cost matrix is:

     +            −
+    c(+,+) + β   c(+,−) + β
−    c(−,+) + β   c(−,−) + β

Therefore, the decision rule is:

p(+|t) > [(c(−,+) + β) − (c(−,−) + β)] / ([(c(−,+) + β) − (c(−,−) + β)] + [(c(+,−) + β) − (c(+,+) + β)])
       = [c(−,+) − c(−,−)] / ([c(−,+) − c(−,−)] + [c(+,−) − c(+,+)])

which is the same as the original decision rule.

20. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, "+" and "−." Half of the data set is used for training while the remaining half is used for testing.

(a) Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?
Answer: 50%.

(b) Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 0.8 and negative class with probability 0.2.
Answer: 50%.

(c) Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test record to be positive?
Answer: 33%.

(d) Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 2/3 and negative class with probability 1/3.
Answer: 44.4%.

21. Derive the dual Lagrangian for the linear SVM with nonseparable data where the objective function is

f(w) = ‖w‖²/2 + C(Σ_{i=1}^N ξ_i)².

Answer:

L_D = Σ_{i=1}^N λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i · x_j − C(Σ_i ξ_i)².
Notice that the dual Lagrangian depends on the slack variables ξ_i.

22. Consider the XOR problem where there are four training points: (1,1,−), (1,0,+), (0,1,+), (0,0,−). Transform the data into the following feature space:

Φ = (1, √2·x1, √2·x2, √2·x1x2, x1², x2²).

Find the maximum margin linear decision boundary in the transformed space.
Answer:
The decision boundary is f(x1, x2) = x1x2.

23. Given the data sets shown in Figure 5.6, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.
Answer:
(a) Both decision tree and NB will do well on this data set because the distinguishing attributes have better discriminating power than the noise attributes in terms of entropy gain and conditional probability. k-NN will not do as well, due to the relatively large number of noise attributes.
(b) NB will not work at all with this data set due to attribute dependency. Other schemes will do better than NB.
(c) NB will do very well on this data set, because each discriminating attribute has higher conditional probability in one class over the other and the overall classification is done by multiplying these individual conditional probabilities. Decision tree will not do as well, due to the relatively large number of distinguishing attributes; it will have an overfitting problem. k-NN will do reasonably well.
(d) k-NN will do well on this data set. Decision trees will also work, but will result in a fairly large decision tree. The first few splits will be quite random, because the tree may not find a good initial split at the beginning. NB will not perform quite as well due to the attribute dependency.
(e) k-NN will do well on this data set. Decision trees will also work, but will result in a large decision tree. If the decision tree uses an oblique split instead of just vertical and horizontal splits, then the resulting decision tree will be more compact and highly accurate. NB will not perform quite as well due to attribute dependency.
(f) k-NN works the best. NB does not work well for this data set due to attribute dependency. Decision tree will have a large tree in order to capture the circular decision boundaries.
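The expected error rates quoted in Exercise 20 follow from error = P(+) × P(predict −) + P(−) × P(predict +), since on random data the prediction is independent of the true label. A quick check (a sketch; the function name is ours):

```python
def expected_error(p_pos, p_pred_pos):
    """Expected error rate on random data: a record is wrong when a positive
    is predicted negative or a negative is predicted positive."""
    return p_pos * (1 - p_pred_pos) + (1 - p_pos) * p_pred_pos

# Exercise 20, parts (a)-(d)
a = expected_error(0.5, 1.0)    # 0.5
b = expected_error(0.5, 0.8)    # 0.5
c = expected_error(2/3, 1.0)    # 1/3, about 33%
d = expected_error(2/3, 2/3)    # 4/9, about 44.4%
```

Note that with balanced classes the error is 50% regardless of how the classifier randomizes its predictions, which is why parts (a) and (b) agree.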
[Figure 5.6. Data sets for Exercise 23: synthetic data sets 1–6, panels (a)–(f). Panels (a)–(c) describe records of classes A and B over distinguishing attributes and noise attributes; panels (d)–(f) plot the classes over attributes X and Y.]

6 Association Analysis: Basic Concepts and Algorithms

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

(a) A rule that has high support and high confidence.
Answer: Milk −→ Bread. Such an obvious rule tends to be uninteresting.

(b) A rule that has reasonably high support but low confidence.
Answer: Milk −→ Tuna. While the sale of tuna and milk may be higher than the support threshold, not all transactions that contain milk also contain tuna. Such a low-confidence rule tends to be uninteresting.

(c) A rule that has low support and low confidence.
Answer: Cooking oil −→ Laundry detergent. Such a low-confidence rule tends to be uninteresting.

(d) A rule that has low support and high confidence.
Answer: Vodka −→ Caviar. Such a rule tends to be interesting.

2. Consider the data set shown in Table 6.1.

(a) Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.
Answer:
Table 6.1. Example of market basket transactions.
Customer ID  Transaction ID  Items Bought
1            0001            {a, d, e}
1            0024            {a, b, c, e}
2            0012            {a, b, d, e}
2            0031            {a, c, d, e}
3            0015            {b, c, e}
3            0022            {b, d, e}
4            0029            {c, d}
4            0040            {a, b, c}
5            0033            {a, d, e}
5            0038            {a, b, e}

s({e}) = 8/10 = 0.8
s({b,d}) = 2/10 = 0.2
s({b,d,e}) = 2/10 = 0.2    (6.1)

(b) Use the results in part (a) to compute the confidence for the association rules {b,d} −→ {e} and {e} −→ {b,d}. Is confidence a symmetric measure?
Answer:
c(bd −→ e) = 0.2/0.2 = 100%
c(e −→ bd) = 0.2/0.8 = 25%
No, confidence is not a symmetric measure.

(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise.)
Answer:
s({e}) = 4/5 = 0.8
s({b,d}) = 5/5 = 1
s({b,d,e}) = 4/5 = 0.8

(d) Use the results in part (c) to compute the confidence for the association rules {b,d} −→ {e} and {e} −→ {b,d}.
Answer:
c(bd −→ e) = 0.8/1 = 80%
c(e −→ bd) = 0.8/0.8 = 100%

(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.
Answer:
There are no apparent relationships between s1, s2, c1, and c2.

3. (a) What is the confidence for the rules ∅ −→ A and A −→ ∅?
Answer:
c(∅ −→ A) = s(∅ −→ A). c(A −→ ∅) = 100%.

(b) Let c1, c2, and c3 be the confidence values of the rules {p} −→ {q}, {p} −→ {q,r}, and {p,r} −→ {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?
Answer:
c1 = s(p ∪ q)/s(p)
c2 = s(p ∪ q ∪ r)/s(p)
c3 = s(p ∪ q ∪ r)/s(p ∪ r)
Considering s(p) ≥ s(p ∪ q) ≥ s(p ∪ q ∪ r), we have c1 ≥ c2 and c3 ≥ c2. Therefore c2 has the lowest confidence.

(c) Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
Answer:
Considering s(p ∪ q) = s(p ∪ q ∪ r) but s(p) ≥ s(p ∪ r), we have c3 ≥ (c1 = c2). Either all rules have the same confidence or c3 has the highest confidence.
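The support and confidence numbers in Exercise 2 can be reproduced directly from Table 6.1 (a sketch; the dictionary transcribes the table and the helper names are ours):

```python
# Transactions from Table 6.1, keyed by (customer ID, transaction ID)
transactions = {
    (1, '0001'): {'a','d','e'}, (1, '0024'): {'a','b','c','e'},
    (2, '0012'): {'a','b','d','e'}, (2, '0031'): {'a','c','d','e'},
    (3, '0015'): {'b','c','e'}, (3, '0022'): {'b','d','e'},
    (4, '0029'): {'c','d'}, (4, '0040'): {'a','b','c'},
    (5, '0033'): {'a','d','e'}, (5, '0038'): {'a','b','e'},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Parts (a)-(b): each transaction ID is a basket
tx = list(transactions.values())
assert support({'e'}, tx) == 0.8 and support({'b','d'}, tx) == 0.2

# Confidence c(X -> Y) = s(X and Y) / s(X)
conf_bd_e = support({'b','d','e'}, tx) / support({'b','d'}, tx)  # 1.0 (100%)
conf_e_bd = support({'b','d','e'}, tx) / support({'e'}, tx)      # 0.25 (25%)

# Parts (c)-(d): each customer ID is a basket (union of that customer's items)
cust = {}
for (cid, _), items in transactions.items():
    cust.setdefault(cid, set()).update(items)
baskets = list(cust.values())
```

The asymmetry of confidence is visible here: the same itemset support 0.2 yields 100% in one direction and 25% in the other.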
(d) Transitivity: Suppose the confidence of the rules A −→ B and B −→ C are larger than some threshold, minconf. Is it possible that A −→ C has a confidence less than minconf?
Answer:
Yes. It depends on the supports of the itemsets A, B, and C. For example:
s(A,B) = 60%, s(A) = 90%, s(A,C) = 20%, s(B) = 70%, s(B,C) = 50%, s(C) = 60%.
Let minconf = 50%. Therefore:
c(A → B) = 66% > minconf
c(B → C) = 71% > minconf
But c(A → C) = 22% < minconf.

(b) …(assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of a frequent itemset is 4.

(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
Answer: C(6,3) = 20.

(d) Find an itemset (of size 2 or larger) that has the largest support.
Answer: {Bread, Butter}.

(e) Find a pair of items, a and b, such that the rules {a} −→ {b} and {b} −→ {a} have the same confidence.
Answer: (Beer, Cookies) or (Bread, Butter).

7. Consider the following set of frequent 3-itemsets:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.
Assume that there are only five items in the data set.

(a) List all candidate 4-itemsets obtained by a candidate generation procedure using the F_{k−1} × F_1 merging strategy.
Answer:
{1,2,3,4}, {1,2,3,5}, {1,2,3,6}, {1,2,4,5}, {1,2,4,6}, {1,2,5,6},
{1,3,4,5}, {1,3,4,6}, {2,3,4,5}, {2,3,4,6}, {2,3,5,6}.

(b) List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.
Answer:
{1,2,3,4}, {1,2,3,5}, {1,2,4,5}, {2,3,4,5}, {2,3,4,6}.

(c) List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.
Answer:
{1,2,3,4}.

8. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 6.3 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.

Table 6.3. Example of market basket transactions.
Transaction ID  Items Bought
1               {a, b, d, e}
2               {b, c, d}
3               {a, b, d, e}
4               {a, c, d, e}
5               {b, c, d, e}
6               {b, d, e}
7               {c, d}
8               {a, b, c}
9               {a, d, e}
10              {b, d}

(a) Draw an itemset lattice representing the data set given in Table 6.3. Label each node in the lattice with the following letter(s):
• N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during
the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
• F: If the candidate itemset is found to be frequent by the Apriori algorithm.
• I: If the candidate itemset is found to be infrequent after support counting.
Answer:
The lattice structure is shown below.

[Figure 6.1. Solution: the itemset lattice over items a–e, with each node labeled N, F, or I.]

(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
Answer:
Percentage of frequent itemsets = 16/32 = 50.0% (including the null set).

(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
Answer:

[Figure 6.2. An example of a hash tree structure for candidate 3-itemsets: items 1, 4, 7, items 2, 5, 8, and items 3, 6, 9 each hash to one of the three branches at every level, and the candidates are stored in leaf nodes L1–L12.]

The pruning ratio is the ratio of N to the total number of itemsets. Since the count of N = 11, the pruning ratio is 11/32 = 34.4%.

(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
Answer:
The false alarm rate is the ratio of I to the total number of itemsets. Since the count of I = 5, the false alarm rate is 5/32 = 15.6%.

9. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 6.2.

(a) Given a transaction that contains items {1,3,4,5,8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
Answer:
The leaf nodes visited are L1, L3, L5, L9, and L11.

(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1,3,4,5,8}.
Answer:
The candidates contained in the transaction are {1,4,5}, {1,5,8}, and {4,5,8}.

10. Consider the following set of candidate 3-itemsets:
{1,2,3}, {1,2,6}, {1,3,4}, {2,3,4}, {2,4,5}, {3,4,6}, {4,5,6}

(a) Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.
Answer:

[Figure 6.3. Hash tree for Exercise 10, with the candidate 3-itemsets stored in leaf nodes L1–L5.]
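The answer to Exercise 9(b) can be cross-checked without traversing the tree: enumerate every 3-item subset of the transaction and intersect with the full candidate set (a sketch; the candidate list transcribes the leaf contents of Figure 6.2):

```python
from itertools import combinations

# Candidate 3-itemsets stored in the leaves of Figure 6.2
candidates = {(1,4,5), (1,7,8), (1,2,7), (1,2,5), (4,5,9), (4,5,7), (1,5,8),
              (4,5,6), (4,5,8), (7,8,9), (1,6,8), (2,4,6), (2,5,8), (5,6,8),
              (3,4,6), (3,5,6), (3,6,7), (2,7,8), (2,8,9), (3,7,9), (6,8,9),
              (6,7,8)}

transaction = (1, 3, 4, 5, 8)

# Every 3-subset of the (sorted) transaction that is also a candidate is counted;
# the hash tree just prunes which leaves need to be compared.
contained = sorted(set(combinations(transaction, 3)) & candidates)
# -> [(1, 4, 5), (1, 5, 8), (4, 5, 8)]
```

The brute-force enumeration inspects all C(5,3) = 10 subsets, whereas the hash tree visits only 5 of the 12 leaves, which is the point of the data structure.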
[Figure 6.4. An itemset lattice over the items a, b, c, d, e.]

(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
Answer:
There are 5 leaf nodes and 4 internal nodes.

(c) Consider a transaction that contains the following items: {1,2,3,5,6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
Answer:
The leaf nodes L1, L2, L3, and L4 will be checked against the transaction. The candidate itemsets contained in the transaction are {1,2,3} and {1,2,6}.

11. Given the lattice structure shown in Figure 6.4 and the transactions given in Table 6.3, label each node with the following letter(s):
• M if the node is a maximal frequent itemset,
• C if it is a closed frequent itemset,
• N if it is frequent but neither maximal nor closed, and
• I if it is infrequent.
Assume that the support threshold is equal to 30%.

Answer:
The lattice structure is shown below.

[Figure 6.5. Solution: the itemset lattice with each node labeled M, C, N, or I.]

12. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.

(a) Draw a contingency table for each of the following rules using the transactions shown in Table 6.4.
Rules: {b} −→ {c}, {a} −→ {d}, {b} −→ {d}, {e} −→ {c}, {c} −→ {a}.
Answer:

b → c:        c   c̄      a → d:        d   d̄
        b     3   4              a     4   1
        b̄     2   1              ā     5   0

b → d:        d   d̄      e → c:        c   c̄
        b     6   1              e     2   4
        b̄     3   0              ē     3   1

c → a:        a   ā
        c     2   3
        c̄     3   2

(b) Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.
Table 6.4. Example of market basket transactions.

Transaction ID   Items Bought
      1          {a, b, d, e}
      2          {b, c, d}
      3          {a, b, d, e}
      4          {a, c, d, e}
      5          {b, c, d, e}
      6          {b, d, e}
      7          {c, d}
      8          {a, b, c}
      9          {a, d, e}
     10          {b, d}

i. Support.
Answer:
Rule     Support   Rank
b → c      0.3       3
a → d      0.4       2
b → d      0.6       1
e → c      0.2       4
c → a      0.2       4

ii. Confidence.
Answer:
Rule     Confidence   Rank
b → c       3/7         3
a → d       4/5         2
b → d       6/7         1
e → c       2/6         5
c → a       2/5         4

iii. Interest(X → Y) = (P(X,Y)/P(X)) × P(Y).
Answer:
Rule     Interest   Rank
b → c      0.214      3
a → d      0.72       2
b → d      0.771      1
e → c      0.167      5
c → a      0.2        4

iv. IS(X → Y) = P(X,Y)/√(P(X)P(Y)).
Answer:
Rule     IS      Rank
b → c    0.507     3
a → d    0.596     2
b → d    0.756     1
e → c    0.365     5
c → a    0.4       4

v. Klosgen(X → Y) = √P(X,Y) × (P(Y|X) − P(Y)), where P(Y|X) = P(X,Y)/P(X).
Answer:
Rule     Klosgen   Rank
b → c    −0.039      2
a → d    −0.063      4
b → d    −0.033      1
e → c    −0.075      5
c → a    −0.045      3

vi. Odds ratio(X → Y) = (P(X,Y) P(¬X,¬Y)) / (P(X,¬Y) P(¬X,Y)).
Answer:
Rule     Odds Ratio   Rank
b → c      0.375        2
a → d      0            4
b → d      0            4
e → c      0.167        3
c → a      0.444        1

13. Given the rankings you had obtained in Exercise 12, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?
Answer:
Correlation(Confidence, Support) = 0.97.
Correlation(Confidence, Interest) = 1.
Correlation(Confidence, IS) = 1.
Correlation(Confidence, Klosgen) = 0.7.
Correlation(Confidence, Odds Ratio) = −0.606.
Interest and IS are the most highly correlated with confidence, while odds ratio is the least correlated.

14. Answer the following questions using the data sets shown in Figure 6.6. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of
items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).

(a) Which data set(s) will produce the most number of frequent itemsets?
Answer: Data set (e), because it has to generate the longest frequent itemset along with its subsets.

(b) Which data set(s) will produce the fewest number of frequent itemsets?
Answer: Data set (d), which does not produce any frequent itemsets at the 10% support threshold.

(c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e).

(d) Which data set(s) will produce frequent itemsets with highest maximum support?
Answer: Data set (b).

(e) Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?
Answer: Data set (e).

15. (a) Prove that the φ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
Answer: Instead of proving f11 = f1+ = f+1, we will show that P(A,B) = P(A) = P(B), where P(A,B) = f11/N, P(A) = f1+/N, and P(B) = f+1/N. When the φ-coefficient equals 1:

    φ = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))) = 1

The preceding equation can be simplified as follows:

    (P(A,B) − P(A)P(B))² = P(A)P(B)(1 − P(A))(1 − P(B))
    P(A,B)² − 2P(A,B)P(A)P(B) = P(A)P(B)(1 − P(A) − P(B))
    P(A,B)² = P(A)P(B)(1 − P(A) − P(B) + 2P(A,B))

We may rewrite the equation in terms of P(B) as follows:

    P(A)P(B)² − P(A)(1 − P(A) + 2P(A,B))P(B) + P(A,B)² = 0

The solution to the quadratic equation in P(B) is:

    P(B) = (P(A)β − √(P(A)²β² − 4P(A)P(A,B)²)) / (2P(A)),

[Figure 6.6. Figures for Exercise 14: six 10,000-transaction × 1000-item binary matrices, (a)-(f); in (f), 10% of the cells are 1s and 90% are 0s, uniformly distributed.]
where β = 1 − P(A) + 2P(A,B). Note that the second solution, in which the second term on the left-hand side is positive, is not a feasible solution because it corresponds to φ = −1. Furthermore, the solution for P(B) must satisfy the following constraint: P(B) ≥ P(A,B). It can be shown that:

    P(B) − P(A,B) = (1 − P(A))/2 − √((1 − P(A))² + 4P(A,B)(1 − P(A))(1 − P(A,B)/P(A)))/2 ≤ 0

Because of the constraint, P(B) = P(A,B), which can be achieved by setting P(A,B) = P(A).

(b) Show that if A and B are independent, then P(A,B) × P(¬A,¬B) = P(A,¬B) × P(¬A,B).
Answer: When A and B are independent, P(A,B) = P(A) × P(B), or equivalently:

    P(A,B) − P(A)P(B) = 0
    P(A,B) − [P(A,B) + P(A,¬B)][P(A,B) + P(¬A,B)] = 0
    P(A,B)[1 − P(A,B) − P(A,¬B) − P(¬A,B)] − P(A,¬B)P(¬A,B) = 0
    P(A,B)P(¬A,¬B) − P(A,¬B)P(¬A,B) = 0.

(c) Show that Yule's Q and Y coefficients

    Q = (f11 f00 − f10 f01) / (f11 f00 + f10 f01)
    Y = (√(f11 f00) − √(f10 f01)) / (√(f11 f00) + √(f10 f01))

are normalized versions of the odds ratio.
Answer: The odds ratio can be written as:

    α = (f11 f00) / (f10 f01).

We can express Q and Y in terms of α as follows:

    Q = (α − 1) / (α + 1)
    Y = (√α − 1) / (√α + 1)

In both cases, Q and Y increase monotonically with α. Furthermore, when α = 0, Q = Y = −1, which represents perfect negative correlation. When α = 1, which is the condition for attribute independence, Q = Y = 0. Finally, when α = ∞, Q = Y = +1. This suggests that Q and Y are normalized versions of α.

(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
Answer:

Measure                 Value under independence
φ-coefficient           0
Odds ratio              1
Kappa κ                 0
Interest                1
Cosine, IS              √(P(A,B))
Piatetsky-Shapiro's     0
Collective strength     1
Jaccard                 0 ··· 1
Conviction              1
Certainty factor        0
Added value             0

16. Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B.

(a) What is the range of this measure? When does the measure attain its maximum and minimum values?
Answer: The range of the measure is from 0 to 1. The measure attains its maximum value when P(B|A) = 1 and its minimum value when P(B|A) = P(B).

(b) How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?
Answer: The measure can be rewritten as follows:

    (P(A,B) − P(A)P(B)) / (P(A)(1 − P(B))).

It increases when P(A,B) is increased.

(c) How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?
Answer: The measure decreases with increasing P(A).

(d) How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?
Answer: The measure decreases with increasing P(B).

(e) Is the measure symmetric under variable permutation?
Answer: No.

(f) What is the value of the measure when A and B are statistically independent?
Answer: 0.

(g) Is the measure null-invariant?
Answer: No.

(h) Does the measure remain invariant under row or column scaling operations?
Answer: No.

(i) How does the measure behave under the inversion operation?
Answer: It is asymmetric.

17. Suppose we have market basket data consisting of 100 transactions and 20 items. The support for item a is 25%, the support for item b is 90%, and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

(a) Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?
Answer: Confidence is 0.2/0.25 = 80%. The rule is interesting because it exceeds the confidence threshold.

(b) Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
Answer: The interest measure is 0.2/(0.25 × 0.9) = 0.889. The items are negatively correlated according to the interest measure.

(c) What conclusions can you draw from the results of parts (a) and (b)?
Answer: High confidence rules may not be interesting.

(d) Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:
i. c({¬a} → {b}) > c({a} → {b}),
ii. c({¬a} → {b}) > s({b}),
where c(·) denotes the rule confidence and s(·) denotes the support of an itemset.
Answer: Let

    c({a} → {b}) = P({a,b}) / P({a}).

Furthermore,

    c({¬a} → {b}) = P({¬a,b}) / P({¬a}) = (P({b}) − P({a,b})) / (1 − P({a})).

i. Therefore, we may write

    c({¬a} → {b}) − c({a} → {b}) = (P({b}) − P({a,b}))/(1 − P({a})) − P({a,b})/P({a})
                                 = (P({a})P({b}) − P({a,b})) / (P({a})(1 − P({a})))

which is positive because P({a})P({b}) > P({a,b}).

ii. We can also show that

    c({¬a} → {b}) − s({b}) = (P({b}) − P({a,b}))/(1 − P({a})) − P({b})
                           = (P({a})P({b}) − P({a,b})) / (1 − P({a}))

is always positive because P({a})P({b}) > P({a,b}).

18. Table 6.5 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.

Table 6.5. A Contingency Table.

                       A
                    1      0
 C = 0   B   1      0     15
             0     15     30
 C = 1   B   1      5      0
             0      0     15

(a) Compute the φ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that φ({A,B}) = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))).
Answer:
i. When C = 0, φ(A,B) = −1/3.
ii. When C = 1, φ(A,B) = 1.
iii. When C = 0 or C = 1, φ = 0.

(b) What conclusions can you draw from the above result?
Answer: The result shows that some interesting relationships may disappear if the confounding factors are not taken into account.

Table 6.6. Contingency tables for Exercise 19.

           B   ¬B                B   ¬B
     A     9    1          A    89    1
    ¬A     1   89         ¬A     1    9

    (a) Table I.          (b) Table II.

19. Consider the contingency tables shown in Table 6.6.

(a) For Table I, compute support, the interest measure, and the φ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer:
s(A) = 0.1, s(B) = 0.1, s(A,B) = 0.09.
I(A,B) = 9, φ(A,B) = 0.89.
c(A → B) = 0.9, c(B → A) = 0.9.

(b) For Table II, compute support, the interest measure, and the φ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Answer:
s(A) = 0.9, s(B) = 0.9, s(A,B) = 0.89.
I(A,B) = 1.09, φ(A,B) = 0.89.
c(A → B) = 0.98, c(B → A) = 0.98.

(c) What conclusions can you draw from the results of (a) and (b)?
Answer: Interest, support, and confidence are non-invariant while the φ-coefficient is invariant under the inversion operation. This is because the φ-coefficient
takes into account the absence as well as the presence of an item in a transaction.

20. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 6.19 and 6.20.

(a) Compute the odds ratios for both tables.
Answer: For Table 6.19, odds ratio = 1.4938. For Table 6.20, the odds ratios are 0.8333 and 0.98.

(b) Compute the φ-coefficient for both tables.
Answer: For Table 6.19, φ = 0.098. For Table 6.20, the φ-coefficients are −0.0233 and −0.0047.

(c) Compute the interest factor for both tables.
Answer: For Table 6.19, I = 1.0784. For Table 6.20, the interest factors are 0.88 and 0.9971.

For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.
Answer: The direction of association changes sign (from negative to positive correlation) when the data is pooled together.

7 Association Analysis: Advanced Concepts

1. Consider the traffic accident data set shown in Table 7.1.

Table 7.1. Traffic accident data set.

Weather    Driver's           Traffic                  Seat Belt   Crash
Condition  Condition          Violation                            Severity
Good       Alcohol-impaired   Exceed speed limit       No          Major
Bad        Sober              None                     Yes         Minor
Good       Sober              Disobey stop sign        Yes         Minor
Good       Sober              Exceed speed limit       Yes         Major
Bad        Sober              Disobey traffic signal   No          Major
Good       Alcohol-impaired   Disobey stop sign        Yes         Minor
Bad        Alcohol-impaired   None                     Yes         Major
Good       Sober              Disobey traffic signal   Yes         Major
Good       Alcohol-impaired   None                     No          Major
Bad        Sober              Disobey traffic signal   No          Major
Good       Alcohol-impaired   Exceed speed limit       Yes         Major
Bad        Sober              Disobey stop sign        Yes         Minor

(a) Show a binarized version of the data set.
Answer: See Table 7.2.

(b) What is the maximum width of each transaction in the binarized data?
Answer: 5.

(c) Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
Table 7.2. Binarized traffic accident data set.

Good Bad Alcohol Sober Exceed None Disobey Disobey Belt Belt Major Minor
                       speed        stop    traffic =No  =Yes
  1   0    1      0      1     0     0       0       1    0     1    0
  0   1    0      1      0     1     0       0       0    1     0    1
  1   0    0      1      0     0     1       0       0    1     0    1
  1   0    0      1      1     0     0       0       0    1     1    0
  0   1    0      1      0     0     0       1       1    0     1    0
  1   0    1      0      0     0     1       0       0    1     0    1
  0   1    1      0      0     1     0       0       0    1     1    0
  1   0    0      1      0     0     0       1       0    1     1    0
  1   0    1      0      0     1     0       0       1    0     1    0
  0   1    0      1      0     0     0       1       1    0     1    0
  1   0    1      0      1     0     0       0       0    1     1    0
  0   1    0      1      0     0     1       0       0    1     0    1

Answer: The number of candidate itemsets from size 1 to size 3 is 10 + 28 + 3 = 41. The number of frequent itemsets from size 1 to size 3 is 8 + 10 + 0 = 18.

(d) Create a data set that contains only the following asymmetric binary attributes: (Weather = Bad, Driver's condition = Alcohol-impaired, Traffic violation = Yes, Seat Belt = No, Crash Severity = Major). For Traffic violation, only None has a value of 0. The rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
Answer: The binarized data is shown in Table 7.3.

Table 7.3. Traffic accident data set (asymmetric binary attributes).

Bad   Alcohol-   Traffic     Belt   Major
      impaired   violation   =No
 0       1          1         1       1
 1       0          0         0       0
 0       0          1         0       0
 0       0          1         0       1
 1       0          1         1       1
 0       1          1         0       0
 1       1          0         0       1
 0       0          1         0       1
 0       1          0         1       1
 1       0          1         1       1
 0       1          1         0       1
 1       0          1         0       0

The number of candidate itemsets from size 1 to size 3 is 5 + 10 + 0 = 15.
The number of frequent itemsets from size 1 to size 3 is 5 + 3 + 0 = 8.

(e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d).
Answer: The second method produces a smaller number of candidate and frequent itemsets.

2. (a) Consider the data set shown in Table 7.4. Suppose we apply the following discretization strategies to the continuous attributes of the data set.

D1: Partition the range of each continuous attribute into 3 equal-sized bins.
D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.

For each strategy, answer the following questions:
i. Construct a binarized version of the data set.
ii. Derive all the frequent itemsets having support ≥ 30%.

Table 7.4. Data set for Exercise 2.

TID   Temperature   Pressure   Alarm 1   Alarm 2   Alarm 3
 1        95          1105        0         0         1
 2        85          1040        1         1         0
 3       103          1090        1         1         1
 4        97          1084        1         0         0
 5        80          1038        0         1         1
 6       100          1080        1         1         0
 7        83          1025        1         0         1
 8        86          1030        1         0         0
 9       101          1100        1         1         1

Answer: Table 7.5 shows the discretized data using D1, where the discretized intervals are:
• X1: Temperature between 80 and 87,
• X2: Temperature between 88 and 95,
• X3: Temperature between 96 and 103,
• Y1: Pressure between 1025 and 1051,
• Y2: Pressure between 1052 and 1078,
• Y3: Pressure between 1079 and 1105.
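The D1 assignment can be reproduced with a short equal-width binning routine; a minimal sketch (the function name is ours; bin indices 0, 1, 2 correspond to X1-X3 for temperature and Y1-Y3 for pressure):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # the maximum value is clamped into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

temperature = [95, 85, 103, 97, 80, 100, 83, 86, 101]
pressure = [1105, 1040, 1090, 1084, 1038, 1080, 1025, 1030, 1100]

temp_bins = equal_width_bins(temperature, 3)   # 0 = X1, 1 = X2, 2 = X3
pres_bins = equal_width_bins(pressure, 3)      # 0 = Y1, 1 = Y2, 2 = Y3
```

Running this recovers exactly the X and Y columns of Table 7.5 below (e.g., TID 1 falls in X2 and Y3).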
Table 7.5. Discretized data using D1.

TID   X1   X2   X3   Y1   Y2   Y3   Alarm 1   Alarm 2   Alarm 3
 1     0    1    0    0    0    1      0         0         1
 2     1    0    0    1    0    0      1         1         0
 3     0    0    1    0    0    1      1         1         1
 4     0    0    1    0    0    1      1         0         0
 5     1    0    0    1    0    0      0         1         1
 6     0    0    1    0    0    1      1         1         0
 7     1    0    0    1    0    0      1         0         1
 8     1    0    0    1    0    0      1         0         0
 9     0    0    1    0    0    1      1         1         1

Table 7.6. Discretized data using D2.

TID   X1   X2   X3   Y1   Y2   Y3   Alarm 1   Alarm 2   Alarm 3
 1     0    1    0    0    0    1      0         0         1
 2     1    0    0    0    1    0      1         1         0
 3     0    0    1    0    0    1      1         1         1
 4     0    1    0    0    1    0      1         0         0
 5     1    0    0    1    0    0      0         1         1
 6     0    0    1    0    1    0      1         1         0
 7     1    0    0    1    0    0      1         0         1
 8     0    1    0    1    0    0      1         0         0
 9     0    0    1    0    0    1      1         1         1

Table 7.6 shows the discretized data using D2, where the discretized intervals are:
• X1: Temperature between 80 and 85,
• X2: Temperature between 86 and 97,
• X3: Temperature between 100 and 103,
• Y1: Pressure between 1025 and 1038,
• Y2: Pressure between 1039 and 1084,
• Y3: Pressure between 1085 and 1105.

For D1, there are 7 frequent 1-itemsets, 12 frequent 2-itemsets, and 5 frequent 3-itemsets. For D2, there are 9 frequent 1-itemsets, 7 frequent 2-itemsets, and 1 frequent 3-itemset.

(b) The continuous attributes can also be discretized using a clustering approach.

i. Plot a graph of temperature versus pressure for the data points shown in Table 7.4.
Answer: [Figure 7.1. Temperature versus Pressure: the nine points form two groups, labeled C1 (low temperature, low pressure) and C2 (high temperature, high pressure).]

ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.
Answer: There are two natural clusters in the data.

iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.
Answer: The K-means algorithm.

iv. Replace the temperature and pressure attributes in Table 7.4 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm 1, Alarm 2, and Alarm 3).
Answer:

Table 7.7. Transactions with cluster attributes.

TID   C1   C2   Alarm 1   Alarm 2   Alarm 3
 1     0    1      0         0         1
 2     1    0      1         1         0
 3     0    1      1         1         1
 4     0    1      1         0         0
 5     1    0      0         1         1
 6     0    1      1         1         0
 7     1    0      1         0         1
 8     1    0      1         0         0
 9     0    1      1         1         1
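The suggestion in part iii can be checked directly: a plain K-means pass over the nine (temperature, pressure) points recovers the C1/C2 split of Table 7.7. A minimal sketch (initial centroids hand-picked, one per apparent cluster; this is an illustration, not the book's code):

```python
def kmeans(points, centroids, iters=10):
    """Basic K-means: alternate nearest-centroid assignment and mean update."""
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean)
        labels = [min(range(len(centroids)),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(pt, centroids[j])))
                  for pt in points]
        # recompute each centroid as the mean of its assigned points
        for j in range(len(centroids)):
            members = [pt for pt, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels

points = [(95, 1105), (85, 1040), (103, 1090), (97, 1084), (80, 1038),
          (100, 1080), (83, 1025), (86, 1030), (101, 1100)]
labels = kmeans(points, [(83, 1025), (103, 1090)])   # 0 = C1, 1 = C2
```

Because the two groups are well separated, the assignment stabilizes after the first iteration and matches the C1/C2 columns of Table 7.7.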
v. Derive all the frequent itemsets having support ≥ 30% from the binarized data.
Answer: There are 5 frequent 1-itemsets, 7 frequent 2-itemsets, and 1 frequent 3-itemset.

3. Consider the data set shown in Table 7.8. The first attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is considered to be strong if its support exceeds 15% and its confidence exceeds 60%. The data given in Table 7.8 supports the following two strong rules:

(i) {(1 ≤ A ≤ 2), B = 1} → {C = 1}
(ii) {(5 ≤ A ≤ 8), B = 1} → {C = 1}

Table 7.8. Data set for Exercise 3.

 A    B   C
 1    1   1
 2    1   1
 3    1   0
 4    1   0
 5    1   1
 6    0   1
 7    0   0
 8    1   1
 9    0   0
10    0   1
11    0   0
12    0   1

(a) Compute the support and confidence for both rules.
Answer:
s({(1 ≤ A ≤ 2), B = 1} → {C = 1}) = 1/6
c({(1 ≤ A ≤ 2), B = 1} → {C = 1}) = 1
s({(5 ≤ A ≤ 8), B = 1} → {C = 1}) = 1/6
c({(5 ≤ A ≤ 8), B = 1} → {C = 1}) = 1

(b) To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width binning approach to discretize the data, with bin-width = 2, 3, 4. For each bin-width, state whether the above two rules are discovered by the Apriori algorithm. (Note that the rules may not be in the same exact form as before because they may contain wider or narrower intervals for A.) For each rule that corresponds to one of the above two rules, compute its support and confidence.
Answer: When bin-width = 2:

Table 7.9. Binarized data set (bin-width = 2).

A1   A2   A3   A4   A5   A6   B   C
 1    0    0    0    0    0   1   1
 1    0    0    0    0    0   1   1
 0    1    0    0    0    0   1   0
 0    1    0    0    0    0   1   0
 0    0    1    0    0    0   1   1
 0    0    1    0    0    0   0   1
 0    0    0    1    0    0   0   0
 0    0    0    1    0    0   1   1
 0    0    0    0    1    0   0   0
 0    0    0    0    1    0   0   1
 0    0    0    0    0    1   0   0
 0    0    0    0    0    1   0   1

where A1 = 1 ≤ A ≤ 2; A2 = 3 ≤ A ≤ 4; A3 = 5 ≤ A ≤ 6; A4 = 7 ≤ A ≤ 8; A5 = 9 ≤ A ≤ 10; A6 = 11 ≤ A ≤ 12.

For the first rule, there is one corresponding rule:

{A1 = 1, B = 1} → {C = 1}
s({A1 = 1, B = 1} → {C = 1}) = 1/6
c({A1 = 1, B = 1} → {C = 1}) = 1

Since the support and confidence are greater than the thresholds, the rule can be discovered. For the second rule, there are two corresponding rules:

{A3 = 1, B = 1} → {C = 1}
{A4 = 1, B = 1} → {C = 1}

For both rules, the support is 1/12 and the confidence is 1. Since the support is less than the threshold (15%), these rules cannot be generated.
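The support and confidence figures quoted in part (a) can be verified with a direct scan of Table 7.8; a sketch (the function name is ours):

```python
from fractions import Fraction

# (A, B, C) rows of Table 7.8.
rows = [(1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 1, 0), (5, 1, 1), (6, 0, 1),
        (7, 0, 0), (8, 1, 1), (9, 0, 0), (10, 0, 1), (11, 0, 0), (12, 0, 1)]

def rule_stats(lo, hi):
    """Support and confidence of {(lo <= A <= hi), B = 1} -> {C = 1}."""
    body = [r for r in rows if lo <= r[0] <= hi and r[1] == 1]
    both = [r for r in body if r[2] == 1]
    return Fraction(len(both), len(rows)), Fraction(len(both), len(body))
```

For example, `rule_stats(1, 2)` and `rule_stats(5, 8)` both return support 1/6 and confidence 1, as stated above.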
When bin-width = 3:

Table 7.10. Binarized data set (bin-width = 3).

A1   A2   A3   A4   B   C
 1    0    0    0   1   1
 1    0    0    0   1   1
 1    0    0    0   1   0
 0    1    0    0   1   0
 0    1    0    0   1   1
 0    1    0    0   0   1
 0    0    1    0   0   0
 0    0    1    0   1   1
 0    0    1    0   0   0
 0    0    0    1   0   1
 0    0    0    1   0   0
 0    0    0    1   0   1

where A1 = 1 ≤ A ≤ 3; A2 = 4 ≤ A ≤ 6; A3 = 7 ≤ A ≤ 9; A4 = 10 ≤ A ≤ 12.

For the first rule, there is one corresponding rule:

{A1 = 1, B = 1} → {C = 1}
s({A1 = 1, B = 1} → {C = 1}) = 1/6
c({A1 = 1, B = 1} → {C = 1}) = 2/3

Since the support and confidence are greater than the thresholds, the rule can be discovered. The discovered rule is in a more general form than the original rule. For the second rule, there are two corresponding rules:

{A2 = 1, B = 1} → {C = 1}
{A3 = 1, B = 1} → {C = 1}

For both rules, the support is 1/12. Since the support is less than the threshold (15%), these rules cannot be generated.

When bin-width = 4:

Table 7.11. Binarized data set (bin-width = 4).

A1   A2   A3   B   C
 1    0    0   1   1
 1    0    0   1   1
 1    0    0   1   0
 1    0    0   1   0
 0    1    0   1   1
 0    1    0   0   1
 0    1    0   0   0
 0    1    0   1   1
 0    0    1   0   0
 0    0    1   0   1
 0    0    1   0   0
 0    0    1   0   1

where A1 = 1 ≤ A ≤ 4; A2 = 5 ≤ A ≤ 8; A3 = 9 ≤ A ≤ 12.

For the first rule, there is one corresponding rule:

{A1 = 1, B = 1} → {C = 1}
s({A1 = 1, B = 1} → {C = 1}) = 1/6
c({A1 = 1, B = 1} → {C = 1}) = 1/2

Since the confidence is less than the threshold (60%), the rule cannot be generated. For the second rule, there is one corresponding rule:

{A2 = 1, B = 1} → {C = 1}
s({A2 = 1, B = 1} → {C = 1}) = 1/6
c({A2 = 1, B = 1} → {C = 1}) = 1

Since the support and confidence are greater than the thresholds, the rule can be discovered.

(c) Comment on the effectiveness of using the equal width approach for classifying the above data set. Is there a bin width that allows you to
find both rules satisfactorily? If not, what alternative approach can you take to ensure that you will find both rules?
Answer: None of the discretization methods can effectively find both rules. One approach to ensure that you can find both rules is to start with a bin width equal to 2 and consider all possible mergings of the adjacent intervals. For example, the discrete intervals are:

1 ≤ A ≤ 2, 3 ≤ A ≤ 4, 5 ≤ A ≤ 6, ..., 11 ≤ A ≤ 12
1 ≤ A ≤ 4, 5 ≤ A ≤ 8, 9 ≤ A ≤ 12
...

4. Consider the data set shown in Table 7.12.

Table 7.12. Data set for Exercise 4.

 Age      Number of Hours Online per Week (B)
 (A)      0-5   5-10   10-20   20-30   30-40
10-15      2      3      5       3       2
15-25      2      5     10      10       3
25-35     10     15      5       3       2
35-50      4      6      5       3       2

(a) For each combination of rules given below, specify the rule that has the highest confidence.

... ≥ minsup, which of the following itemsets are guaranteed to be frequent? (i) s({p̂, q}), (ii) s({p, q̂}), and (iii) s({p̂, q̂}).
Answer: All three itemsets are guaranteed to be frequent.

(c) Consider the association rule {p} → {q}. Suppose the confidence of the rule exceeds minconf. Which of the following rules are guaranteed to have confidence higher than minconf? (i) {p} → {q̂}, (ii) {p̂} → {q}, and (iii) {p̂} → {q̂}.
Answer: Only {p} → {q̂} is guaranteed to have confidence higher than minconf.

9. (a) List all the 4-subsequences contained in the following data sequence:

<{1,3} {2} {2,3} {4}>,

assuming no timing constraints.
Answer:
<{1,3}{2}{2}>   <{1,3}{2}{3}>   <{1,3}{2}{4}>
<{1,3}{2,3}>    <{1,3}{3}{4}>   <{1}{2}{2,3}>
<{1}{2}{2}{4}>  <{1}{2}{3}{4}>  <{1}{2,3}{4}>
<{3}{2}{2,3}>   <{3}{2}{2}{4}>  <{3}{2}{3}{4}>
<{3}{2,3}{4}>   <{2}{2,3}{4}>

(b) List all the 3-element subsequences contained in the data sequence for part (a) assuming that no timing constraints are imposed.
Answer:
<{1,3}{2}{2,3}>  <{1,3}{2}{4}>  <{1,3}{3}{4}>
<{1,3}{2}{2}>    <{1,3}{2}{3}>  <{1,3}{2,3}{4}>
<{1}{2}{2,3}>    <{1}{2}{4}>    <{1}{3}{4}>
<{1}{2}{2}>      <{1}{2}{3}>    <{1}{2,3}{4}>
<{3}{2}{2,3}>    <{3}{2}{4}>    <{3}{3}{4}>
<{3}{2}{2}>      <{3}{2}{3}>    <{3}{2,3}{4}>
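Containment of a sequence w in a data sequence s without timing constraints, as used in Exercises 9-11, can be checked with a greedy left-to-right scan; a minimal sketch (the function name is ours):

```python
def is_subsequence(w, s):
    """True if the itemsets of w can be matched, in order, to distinct
    itemsets of s that contain them (no timing constraints)."""
    i = 0
    for element in s:
        if i < len(w) and set(w[i]) <= set(element):
            i += 1
    return i == len(w)

s = [{1, 3}, {2}, {2, 3}, {4}]   # the data sequence of Exercise 9
```

For instance, <{1}{2}{2}{4}> and <{3}{2,3}{4}> are contained in the Exercise 9 sequence, while <{1,2}> is not.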
(c) List all the 4-subsequences contained in the data sequence for part (a) (assuming the timing constraints are flexible).
Answer: This will include all the subsequences in part (a) as well as the following:
<{1,2,3,4}>      <{1,2,3}{2}>    <{1,2,3}{3}>
<{1,2,3}{4}>     <{1,3}{2,4}>    <{1,3}{3,4}>
<{1}{2}{2,4}>    <{1}{2}{3,4}>   <{3}{2}{2,4}>
<{3}{2}{3,4}>    <{1,2}{2,3}>    <{1,2}{2,4}>
<{1,2}{3,4}>     <{1,2}{2}{4}>   <{1,2}{3}{4}>
<{2,3}{2,3}>     <{2,3}{2,4}>    <{2,3}{3,4}>
<{2,3}{2}{4}>    <{2,3}{3}{4}>   <{1}{2,3,4}>
<{3}{2,3,4}>     <{2}{2,3,4}>

(d) List all the 3-element subsequences contained in the data sequence for part (a) (assuming the timing constraints are flexible).
Answer: This will include all the subsequences in part (b) as well as the following:
<{1,2,3}{2}{4}>  <{1,2,3}{3}{4}>  <{1,2,3}{2,3}{4}>
<{1,2}{2}{4}>    <{1,2}{3}{4}>    <{1,2}{2,3}{4}>
<{2,3}{2}{4}>    <{2,3}{3}{4}>    <{2,3}{2,3}{4}>
<{1}{2}{2,4}>    <{1}{2}{3,4}>    <{1}{2}{2,3,4}>
<{3}{2}{2,4}>    <{3}{2}{3,4}>    <{3}{2}{2,3,4}>
<{1,3}{2}{2,4}>  <{1,3}{2}{3,4}>  <{1,3}{2}{2,3,4}>

10. Find all the frequent subsequences with support ≥ 50% given the sequence database shown in Table 7.15. Assume that there are no timing constraints imposed on the sequences.
Answer:
<{A}>, <{B}>, <{C}>, <{D}>, <{E}>,
<{A}{C}>, <{A}{D}>, <{A}{E}>, <{B}{C}>, <{B}{D}>, <{B}{E}>,
<{C}{D}>, <{C}{E}>, <{D,E}>

11. (a) For each of the sequences w = <e1 e2 ... elast> given below, determine whether they are subsequences of the sequence

<{1,2,3} {2,4} {2,4,5} {3,5} {6}>
Table 7.15. Example of event sequences generated by various sensors.

Sensor   Timestamp   Events
S1           1       A, B
             2       C
             3       D, E
             4       C
S2           1       A, B
             2       C, D
             3       E
S3           1       B
             2       A
             3       B
             4       D, E
S4           1       C
             2       D, E
             3       C
             4       E
S5           1       B
             2       A
             3       B, C
             4       A, D

subjected to the following timing constraints:

mingap = 0    (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 3    (interval between first event in ei and last event in ei+1 is ≤ 3)
maxspan = 5   (interval between first event in e1 and last event in elast is ≤ 5)
ws = 1        (time between first and last events in ei is ≤ 1)

• w = <{1}{2}{3}>          Answer: Yes.
• w = <{1,2,3,4}{5,6}>     Answer: No.
• w = <{2,4}{2,4}{6}>      Answer: Yes.
• w = <{1}{2,4}{6}>        Answer: Yes.
• w = <{1,2}{3,4}{5,6}>    Answer: No.

(b) Determine whether each of the subsequences w given in the previous question are contiguous subsequences of the following sequences s.

• s = <{1,2,3,4,5,6} {1,2,3,4,5,6} {1,2,3,4,5,6}>
  – w = <{1}{2}{3}>          Answer: Yes.
  – w = <{1,2,3,4}{5,6}>     Answer: Yes.
  – w = <{2,4}{2,4}{6}>      Answer: Yes.
  – w = <{1}{2,4}{6}>        Answer: Yes.
  – w = <{1,2}{3,4}{5,6}>    Answer: Yes.
• s = <{1,2,3,4} {1,2,3,4,5,6} {3,4,5,6}>
  – w = <{1}{2}{3}>          Answer: Yes.
  – w = <{1,2,3,4}{5,6}>     Answer: Yes.
  – w = <{2,4}{2,4}{6}>      Answer: Yes.
  – w = <{1}{2,4}{6}>        Answer: Yes.
  – w = <{1,2}{3,4}{5,6}>    Answer: Yes.
• s = <{1,2} {1,2,3,4} {3,4,5,6} {5,6}>
  – w = <{1}{2}{3}>          Answer: Yes.
  – w = <{1,2,3,4}{5,6}>     Answer: Yes.
  – w = <{2,4}{2,4}{6}>      Answer: No.
  – w = <{1}{2,4}{6}>        Answer: Yes.
  – w = <{1,2}{3,4}{5,6}>    Answer: Yes.
• s = <{1,2,3} {2,3,4,5} {4,5,6}>
  – w = <{1}{2}{3}>          Answer: No.
  – w = <{1,2,3,4}{5,6}>     Answer: No.
  – w = <{2,4}{2,4}{6}>      Answer: No.
  – w = <{1}{2,4}{6}>        Answer: Yes.
  – w = <{1,2}{3,4}{5,6}>    Answer: Yes.

12. For each of the sequences w = <e1 ... elast> below, determine whether they are subsequences of the following data sequence:

<{A,B} {C,D} {A,B} {C,D} {A,B} {C,D}>

subjected to the following timing constraints:

mingap = 0    (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 2    (interval between first event in ei and last event in ei+1 is ≤ 2)
maxspan = 6   (interval between first event in e1 and last event in elast is ≤ 6)
ws = 1        (time between first and last events in ei is ≤ 1)

(a) w = <{A}{B}{C}{D}>          Answer: Yes.
(b) w = <{A}{B,C,D}{A}>         Answer: No.
(c) w = <{A}{B,C,D}{A}>         Answer: No.
(d) w = <{B,C}{A,D}{B,C}>       Answer: No.
(e) w = <{A,B,C,D}{A,B,C,D}>    Answer: No.

13. Consider the following frequent 3-sequences:

<{1,2,3}>, <{1,2}{3}>, <{1}{2,3}>, <{1,2}{4}>, <{1,3}{4}>, <{1,2,4}>, <{2,3}{3}>, <{2,3}{4}>, <{2}{3}{3}>, and <{2}{3}{4}>.

(a) List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.
Answer: <{1,2,3}{3}>, <{1,2,3}{4}>, <{1,2}{3}{3}>, <{1,2}{3}{4}>, <{1}{2,3}{3}>, <{1}{2,3}{4}>.

(b) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints).
Answer:
When there are no timing constraints, all subsequences of a candidate must be frequent. Therefore, the pruned candidates are: <{1,2,3}{3}>, <{1,2}{3}{3}>, <{1,2}{3}{4}>, <{1}{2,3}{3}>, <{1}{2,3}{4}>.

(c) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming maxgap = 1).
Answer: With the timing constraint, only contiguous subsequences of a candidate must be frequent. Therefore, the pruned candidates are: <{1,2,3}{3}>, <{1,2}{3}{3}>, <{1,2}{3}{4}>, <{1}{2,3}{3}>, <{1}{2,3}{4}>.

14. Consider the data sequence shown in Table 7.16 for a given object. Count the number of occurrences for the sequence <{p}{q}{r}> according to the following counting methods. (Assume that ws = 0, mingap = 0, maxgap = 3, and maxspan = 5.)

Table 7.16. Example of event sequence data for Exercise 14.

Timestamp   Events
    1       p, q
    2       r
    3       s
    4       p, q
    5       r, s
    6       p
    7       q, r
    8       q, s
    9       p
   10       q, r, s

(a) COBJ (one occurrence per object).
Answer: 1.

(b) CWIN (one occurrence per sliding window).
Answer: 2.

(c) CMINWIN (number of minimal windows of occurrence).
Answer: 2.

(d) CDIST_O (distinct occurrences with possibility of event-timestamp overlap).
Answer: 3.
(e) CDIST (distinct occurrences with no event-timestamp overlap allowed).
Answer: 2.

15. Describe the types of modifications necessary to adapt the frequent subgraph mining algorithm to handle:
(a) Directed graphs
(b) Unlabeled graphs
(c) Acyclic graphs
(d) Disconnected graphs

For each type of graph given above, describe which step of the algorithm will be affected (candidate generation, candidate pruning, and support counting), and any further optimization that can help improve the efficiency of the algorithm.
Answer:
(a) The adjacency matrix may not be symmetric, which affects candidate generation using the vertex growing approach.
(b) An unlabeled graph is equivalent to a labeled graph where all the vertices have identical labels.
(c) No effect on the algorithm. If the graph is a rooted labeled tree, more efficient techniques can be developed to encode the tree (see: M. J. Zaki, Efficiently Mining Frequent Trees in a Forest, In Proc. of the Eighth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2002).

16. Draw all candidate subgraphs obtained from joining the pair of graphs shown in Figure 7.2. Assume the edge-growing method is used to expand the subgraphs.
Answer: See Figure 7.3.

17. Draw all the candidate subgraphs obtained by joining the pair of graphs shown in Figure 7.4. Assume the edge-growing method is used to expand the subgraphs.
Answer: See Figure 7.5.

18. (a) If support is defined in terms of induced subgraph relationship, show that the confidence of the rule g1 → g2 can be greater than 1 if g1 and g2 are allowed to have overlapping vertex sets.
Answer: We illustrate this with an example. Consider the five graphs, G1, G2, ..., G5, shown in Figure 7.6. The graph g1 shown in the top-right-hand
[Figure 7.2. Graphs for Exercise 16.]
[Figure 7.3. Solution for Exercise 16.]

diagram is a subgraph of G1, G3, G4, and G5. Therefore, s(g1) = 4/5 = 80%. Similarly, we can show that s(g2) = 60% because g2 is a subgraph of G1, G2, and G3, while s(g3) = 40% because g3 is a subgraph of G1 and G3. Consider the association rule g2 → g1. Using the standard definition of confidence as the ratio between the support of g2 ∪ g1 ≡ g3 to the

[Figure 7.4. Graphs for Exercise 17.]
[Figure 7.5. Solution for Exercise 17.]

support of g2, we obtain a confidence value greater than 1 because s(g3) > s(g2) (with the induced subgraph supports shown in Figure 7.6, s(g2) = 20% and s(g3) = 40%).

(b) What is the time complexity needed to determine the canonical label of a graph that contains |V| vertices?
Answer:

[Figure 7.6. Computing the support of a subgraph from a set of graphs. For the graph data set G1-G5: g1 has subgraph support 80% and induced subgraph support 80%; g2 has subgraph support 60% and induced subgraph support 20%; g3 has subgraph support 40% and induced subgraph support 40%.]

A naïve approach requires |V|! computations to examine all possible permutations of the canonical label.

(c) The core of a subgraph can have multiple automorphisms. This will increase the number of candidate subgraphs obtained after merging two frequent subgraphs that share the same core. Determine the maximum number of candidate subgraphs obtained due to automorphism of a core of size k.
Answer: k.

(d) Two frequent subgraphs of size k may share multiple cores. Determine the maximum number of cores that can be shared by the two frequent subgraphs.
Answer: k − 1.

19. (a) Consider a graph mining algorithm that uses the edge-growing method to join the two undirected and unweighted subgraphs shown in Figure 19a.

i. Draw all the distinct cores obtained when merging the two subgraphs.
Answer: See Figure 7.7.
[Figure 7.7. Solution to Exercise 19: the distinct cores obtained when merging the two subgraphs.]

ii. How many candidates are generated using the following core?

[A core with vertices labeled A, A, B, A, A.]

Answer: No candidate (k+1)-subgraph can be generated from the core.

20. The original association rule mining framework considers only presence of items together in the same transaction. There are situations in which itemsets that are infrequent may also be informative. For instance, the itemset {TV, DVD, ¬VCR} suggests that many customers who buy TVs and DVDs do not buy VCRs. In this problem, you are asked to extend the association rule framework to negative itemsets (i.e., itemsets that contain both presence and absence of items). We will use the negation symbol (¬) to refer to absence of items.

(a) A naïve way for deriving negative itemsets is to extend each transaction to include absence of items as shown in Table 7.17.

Table 7.17. Example of extended transactions.

TID   TV   ¬TV   DVD   ¬DVD   VCR   ¬VCR   ...
 1     1    0     0      1     0      1    ...
 2     1    0     0      1     0      1    ...

i. Suppose the transaction database contains 1000 distinct items. What is the total number of positive itemsets that can be generated from these items? (Note: A positive itemset does not contain any negated items.)
Answer: 2^1000 − 1.

ii. What is the maximum number of frequent itemsets that can be generated from these transactions? (Assume that a frequent itemset may contain positive, negative, or both types of items.)
Answer: 2^2000 − 1.

iii. Explain why such a naïve method of extending each transaction with negative items is not practical for deriving negative itemsets.
Answer: The number of candidate itemsets is too large, and many of them are also redundant and useless (e.g., an itemset that contains both items x and ¬x).

(b) Consider the database shown in Table 7.14. What are the support and confidence values for the following negative association rules involving regular and diet soda?

i. ¬Regular → Diet.      Answer: s = 42.9%, c = 100%.
ii. Regular → ¬Diet.     Answer: s = 42.9%, c = 100%.
iii. ¬Diet → Regular.    Answer: s = 42.9%, c = 100%.
iv. Diet → ¬Regular.     Answer: s = 42.9%, c = 100%.

21. Suppose we would like to extract positive and negative itemsets from a data set that contains d items.

(a) Consider an approach where we introduce a new variable to represent each negative item. With this approach, the number of items grows from d to 2d. What is the total size of the itemset lattice, assuming that an itemset may contain both positive and negative items of the same variable?
Answer: 2^(2d).

(b) Assume that an itemset must contain positive or negative items of different variables. For example, the itemset {a, ¬a, b, c} is invalid because it contains both positive and negative items for variable a. What is the total size of the itemset lattice?
Answer: Σ_{k=1}^{d} C(d,k) · 2^k = 3^d − 1.

22. For each type of pattern defined below, determine whether the support measure is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone) with respect to increasing itemset size.

(a) Itemsets that contain both positive and negative items such as {a, b, ¬c, ¬d}. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?
Answer: Anti-monotone.

(b) Boolean logical patterns such as {(a ∨ b ∨ c), d, e}, which may contain both disjunctions and conjunctions of items. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?
Answer: Non-monotone.

23. Many association analysis algorithms rely on an Apriori-like approach for finding frequent patterns. The overall structure of the algorithm is given below.

Algorithm 7.1 Apriori-like algorithm.
 1: k = 1.
 2: Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }.    {Find frequent 1-patterns.}
 3: repeat
 4:   k = k + 1.
 5:   Ck = genCandidate(Fk−1).                 {Candidate Generation}
 6:   Ck = pruneCandidate(Ck, Fk−1).           {Candidate Pruning}
 7:   Ck = count(Ck, D).                       {Support Counting}
 8:   Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }.   {Extract frequent patterns}
 9: until Fk = ∅
10: Answer = ∪ Fk.

Suppose we are interested in finding boolean logical rules such as {a ∨ b} → {c, d}, which may contain both disjunctions and conjunctions of items. The corresponding itemset can be written as {(a ∨ b), c, d}.

(a) Does the Apriori principle still hold for such itemsets?
(b) How should the candidate generation step be modified to find such patterns?
(c) How should the candidate pruning step be modified to find such patterns?
(d) How should the support counting step be modified to find such patterns?

Answer: Refer to R. Srikant, Q. Vu, and R. Agrawal: Mining Association Rules with Item Constraints. In Proc. of the Third Int'l Conf. on Knowledge Discovery and Data Mining, 1997.

8 Cluster Analysis: Basic Concepts and Algorithms

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

Before compression, the data set requires 4 × 32 × 2^20 = 134,217,728 bytes. After compression, the data set requires 4 × 32 × 2^16 = 8,388,608 bytes for the prototype vectors and 2 × 2^20 = 2,097,152 bytes for the data vectors, since identifying the prototype vector associated with each data vector requires only two bytes. Thus, after compression, 10,485,760 bytes are needed to represent the data. The compression ratio is 12.8.

2. Find all well-separated clusters in the set of points shown in Figure 8.1.

The solutions are also indicated in Figure 8.1.

Figure 8.1. Points for Exercise 2.
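The storage arithmetic in Exercise 1 can be checked with a few lines of Python (a sketch; the variable names are ours, not the book's):

```python
# Storage before and after vector quantization (Exercise 1):
# 2**20 vectors, 32 components, 4 bytes per component, and 2**16
# prototype vectors, so each vector's prototype index fits in 2 bytes.
n_vectors = 2 ** 20
n_components = 32
bytes_per_component = 4
n_prototypes = 2 ** 16

before = bytes_per_component * n_components * n_vectors
prototypes = bytes_per_component * n_components * n_prototypes
indices = 2 * n_vectors          # 2 bytes identify one of 2**16 prototypes
after = prototypes + indices

print(before, after, before / after)   # 134217728 10485760 12.8
```

Note that the index table, not the prototype table, dominates the compressed size here.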
3. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.

(a) When there is hierarchical structure in the data. Most algorithms that automatically determine the number of clusters are partitional, and thus, ignore the possibility of subclusters.

(b) When clustering for utility. If a certain reduction in data size is needed, then it is necessary to specify how many clusters (cluster centroids) are produced.

4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 8.1. (This assumes sampling with replacement.) From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/4^4 = 0.0938.

p = (number of ways to select one centroid from each cluster)/(number of ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K    (8.1)

(a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.

The solution is shown in Figure 8.2. Note that the probability is essentially 0 by the time K = 10.

(b) For K clusters, K = 10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster. You can use either mathematical methods or statistical simulation to determine the answer.

We used simulation to compute the answer. Respectively, the probabilities are 0.21, < 10^-6, and < 10^-6.

Proceeding analytically, the probability that a point doesn't come from a particular cluster is 1 - 1/K, and thus, the probability that all 2K points don't come from a particular cluster is (1 - 1/K)^{2K}. Hence, the probability that at least one of the 2K points comes from a particular cluster is 1 - (1 - 1/K)^{2K}. If we assume independence (which is too optimistic, but becomes approximately true for larger values of K), then an upper bound for the probability that all clusters are represented in the final sample is given by (1 - (1 - 1/K)^{2K})^K. The values given by this bound are 0.27, 5.7e-07, and 8.2e-64, respectively.

Figure 8.2. Probability of at least one point from each cluster. Exercise 4.

5. Identify the clusters in Figure 8.3 using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.

Figure 8.3. Clusters for Exercise 5.

(a) center-based: 2 clusters. The rectangular region will be split in half. Note that the noise is included in the two clusters.
contiguity-based: 1 cluster because the two circular regions will be joined by noise.
density-based: 2 clusters, one for each circular region. Noise will be eliminated.

(b) center-based: 1 cluster that includes both rings.
contiguity-based: 2 clusters, one for each ring.
density-based: 2 clusters, one for each ring.
(c) center-based: 3 clusters, one for each triangular region. One cluster is also an acceptable answer.
contiguity-based: 1 cluster. The three triangular regions will be joined together because they touch.
density-based: 3 clusters, one for each triangular region. Even though the three triangles touch, the density in the region where they touch is lower than throughout the interior of the triangles.

(d) center-based: 2 clusters. The two groups of lines will be split in two.
contiguity-based: 5 clusters. Each set of lines that intertwines becomes a cluster.
density-based: 2 clusters. The two groups of lines define two regions of high density separated by a region of low density.

6. For the following sets of two-dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be. Assume that we are using the squared error objective function. If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum. Note that the label of each diagram in Figure 8.4 matches the corresponding part of this question, e.g., Figure 8.4(a) goes with part (a).

Figure 8.4. Diagrams for Exercise 6 (global and local minima are labeled in the figure).

(a) K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don't need to provide exact centroid locations, just a qualitative description.)

In theory, there are an infinite number of ways to split the circle into two clusters: just take any line that bisects the circle. This line can
make any angle 0° ≤ θ ≤ 180° with the x axis. The centroids will lie on the perpendicular bisector of the line that splits the circle into two clusters and will be symmetrically positioned. All these solutions will have the same, globally minimal, error.

(b) K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.

If you start with initial centroids that are real points, you will necessarily get this solution because of the restriction that the circles are more than one radius apart. Of course, the bisector could have any angle, as above, and it could be the other circle that is split. All these solutions have the same globally minimal error.

(c) K = 3. The distance between the edges of the circles is much less than the radii of the circles.

The three boxes show the three clusters that will result in the realistic case that the initial centroids are actual data points.

(d) K = 2.

In both cases, the rectangles show the clusters. In the first case, the two clusters are only a local minimum while in the second case the clusters represent a globally minimal solution.

(e) K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.

For the solution shown in the top figure, the two top clusters are enclosed in two boxes, while the third cluster is enclosed by the regions defined by a triangle and a rectangle. (The two smaller clusters in the drawing are supposed to be symmetrical.) I believe that the second solution, suggested by a student, is also possible, although it is a local minimum and might rarely be seen in practice for this configuration of points. Note that while the two pie shaped cuts out of the larger circle are shown as meeting at a point, this is not necessarily the case; it depends on the exact positions and sizes of the circles. There could be a gap between the two pie shaped cuts which is filled by the third (larger) cluster. (Imagine the small circles on opposite sides.) Or the boundary between the two pie shaped cuts could actually be a line segment.

7. Suppose that for a data set
• there are m points and K clusters,
• half the points and clusters are in "more dense" regions,
• half the points and clusters are in "less dense" regions, and
• the two regions are well-separated from each other.
For the given data set, which of the following should occur in order to minimize the squared error when finding K clusters:

(a) Centroids should be equally distributed between more dense and less dense regions.
(b) More centroids should be allocated to the less dense region.
(c) More centroids should be allocated to the denser region.

Note: Do not get distracted by special cases or bring in factors other than density. However, if you feel the true answer is different from any given above, justify your response.

The correct answer is (b). Less dense regions require more centroids if the squared error is to be minimized.

8. Consider the mean of a cluster of objects from a binary transaction data set. What are the minimum and maximum values of the components of the mean? What is the interpretation of components of the cluster mean? Which components most accurately characterize the objects in the cluster?

(a) The components of the mean range between 0 and 1.

(b) For any specific component, its value is the fraction of the objects in the cluster that have a 1 for that component. If we have asymmetric binary data, such as market basket data, then this can be viewed as the probability that, for example, a customer in the group represented by the cluster buys that particular item.

(c) This depends on the type of data. For binary asymmetric data, the components with higher values characterize the data, since, for most clusters, the vast majority of components will have values of zero. For regular binary data, such as the results of a true-false test, the significant components are those that are unusually high or low with respect to the entire set of data.

9. Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.

Consider a data set that consists of three circular clusters, that are identical in terms of the number and distribution of points, and whose centers lie on a line and are located such that the center of the middle cluster is equally distant from the other two. Bisecting K-means would always split the middle cluster during its first iteration, and thus, could never produce the correct set of clusters. (Postprocessing could be applied to address this.)

10. Would the cosine measure be the appropriate similarity measure to use with K-means clustering for time series data? Why or why not? If not, what similarity measure would be more appropriate?

Time series data is dense high-dimensional data, and thus, the cosine measure would not be appropriate since the cosine measure is appropriate for sparse data. If the magnitude of a time series is important, then Euclidean distance would be appropriate. If only the shapes of the time series are important, then correlation would be appropriate. Note that if the comparison of the time series needs to take into account that one time series might lead or lag another or only be related to another during specific time periods, then more sophisticated approaches to modeling time series similarity must be used.

11. Total SSE is the sum of the SSE for each separate attribute. What does it mean if the SSE for one variable is low for all clusters? Low for just one cluster? High for all clusters? High for just one cluster? How could you use the per variable SSE information to improve your clustering?

(a) If the SSE of one attribute is low for all clusters, then the variable is essentially a constant and of little use in dividing the data into groups.

(b) If the SSE of one attribute is relatively low for just one cluster, then this attribute helps define the cluster.

(c) If the SSE of an attribute is relatively high for all clusters, then it could well mean that the attribute is noise.

(d) If the SSE of an attribute is relatively high for one cluster, then it is at odds with the information provided by the attributes with low SSE that define the cluster. It could merely be the case that the clusters defined by this attribute are different from those defined by the other attributes, but in any case, it means that this attribute does not help define the cluster.

(e) The idea is to eliminate attributes that have poor distinguishing power between clusters, i.e., low or high SSE for all clusters, since they are useless for clustering. Note that attributes with high SSE for all clusters are particularly troublesome if they have a relatively high SSE with respect to other attributes (perhaps because of their scale) since they introduce a lot of noise into the computation of the overall SSE.
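These cases can be illustrated numerically. The sketch below uses a small, made-up two-cluster data set in which attribute 0 is nearly constant everywhere (low SSE for every cluster, case (a)) while attribute 1 is what actually separates the clusters; `per_attribute_sse` is an illustrative helper name, not the book's:

```python
# Hypothetical two-cluster data set: each point is (attribute 0, attribute 1).
# Attribute 0 is nearly constant; attribute 1 separates the two clusters.
clusters = {
    0: [(5.0, 1.0), (5.1, 1.2), (4.9, 0.8)],
    1: [(5.0, 9.0), (5.1, 9.2), (4.9, 8.8)],
}

def per_attribute_sse(points, attr):
    """SSE of one attribute for a set of points, around that attribute's mean."""
    values = [p[attr] for p in points]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

for label, points in clusters.items():
    for attr in range(2):
        print(f"cluster {label}, attribute {attr}: SSE = {per_attribute_sse(points, attr):.2f}")

# SSE of each attribute over the whole data set, for comparison: a
# near-constant attribute is low here too, an informative one is high.
all_points = [p for pts in clusters.values() for p in pts]
for attr in range(2):
    print(f"all data, attribute {attr}: SSE = {per_attribute_sse(all_points, attr):.2f}")
```

Attribute 0 has low SSE both per cluster and overall (near-constant), while attribute 1 has low SSE per cluster but high SSE overall, which is the signature of an attribute that helps define the clusters.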
12. The leader algorithm (Hartigan [4]) represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader, unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.

Note that the algorithm described here is not quite the leader algorithm described in Hartigan, which assigns a point to the first leader that is within the threshold distance. The answers apply to the algorithm as stated in the problem.

(a) What are the advantages and disadvantages of the leader algorithm as compared to K-means?
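The variant stated in the problem can be sketched in a few lines; the one-dimensional data, threshold, and the name `leader_cluster` are made up for illustration:

```python
def leader_cluster(points, threshold):
    """Leader algorithm as stated in the problem: each point joins the
    cluster of its closest existing leader, unless that distance exceeds
    the threshold, in which case the point becomes a new leader."""
    leaders, labels = [], []
    for p in points:
        # Distance to the closest existing leader, if any.
        d, idx = min(
            ((abs(p - l), i) for i, l in enumerate(leaders)),
            default=(float("inf"), -1),
        )
        if d <= threshold:
            labels.append(idx)
        else:
            leaders.append(p)                # p starts a new cluster
            labels.append(len(leaders) - 1)
    return leaders, labels

print(leader_cluster([6, 12, 18, 42, 48], threshold=15))
# ([6, 42], [0, 0, 0, 1, 1])
```

A single pass suffices, and for a fixed ordering of the points the result is deterministic, though the threshold only controls the number of clusters indirectly.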
inesplitstheplaneintotworegions,eachcontainingpointsthatareclosesttothecentroidtheregioncontains.Figure8.5.VoronoidiagramforExercise13.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn13314.Youaregivenadatasetwith100recordsandareaskedtoclusterthedata.YouuseK-meanstoclusterthedata,butforallvaluesofK,1≤K≤100,theK-meansalgorithmreturnsonlyonenon-emptycluster.YouthenapplyanincrementalversionofK-means,butobtainexactlythesameresult.Howisthispossible?HowwouldsinglelinkorDBSCANhandlesuchdata?(a)Thedataconsistscompletelyofduplicatesofoneobject.(b)Singlelink(andmanyoftheotheragglomerativehierarchicalschemes)wouldproduceahierarchicalclustering,butwhichpointsappearinwhichclusterwoulddependontheorderingofthepointsandtheexactalgorithm.However,ifthedendrogramwereplottedshowingtheprox-imityatwhicheachobjectismerged,thenitwouldbeobviousthatthedataconsistedofduplicates.DBSCANwouldfindthatallpointswerecorepointsconnectedtooneanotherandproduceasinglecluster.15.Traditionalagglomerativehierarchicalclusteringroutinesmergetwoclustersateachstep.Doesitseemlikelythatsuchanapproachaccuratelycapturesthe(nested)clusterstructureofasetofdatapoints?Ifnot,explainhowyoumightpostprocessthedatatoobtainamoreaccurateviewoftheclusterstructure.(a)Suchanapproachdoesnotaccuratelycapturethenestedclusterstruc-tureofthedata.Forexample,considerasetofthreeclusters,eachofwhichhastwo,three,andfoursubclusters,respectively.Anidealhi-erarchicalclusteringwouldhavethreebranchesfromtheroot—onetoeachofthethreemainclusters—andthentwo,three,andfourbranchesfromeachoftheseclusters,respectively.Atraditionalagglomerative课后答案网:www.hackshp.cnapproachcannotproducesuchastructure.(b)Thesimplesttypeofpostprocessingwouldattempttoflattenthehier-archicalclusteringbymovingclustersupthetree.16.UsethesimilaritymatrixinTable8.1toperformsingleandcompletelinkhierarchicalclustering.Showyourresultsbydrawingadendrogram.Thedendrogramshouldclearlyshowtheorderinwhichthepointsaremerged.ThesolutionsareshowninFigures8.6(a)and8.6(b).17.Hierarchical
clusteringissometimesusedtogenerateKclusters,K>1bytakingtheclustersattheKthlevelofthedendrogram.(Rootisatlevel1.)Bylookingattheclustersproducedinthisway,wecanevaluatethebehaviorofhierarchicalclusteringondifferenttypesofdataandclusters,andalsocomparehierarchicalapproachestoK-means.Thefollowingisasetofone-dimensionalpoints:{6,12,18,24,30,42,48}.(a)Foreachofthefollowingsetsofinitialcentroids,createtwoclustersbyassigningeachpointtothenearestcentroid,andthencalculatethe若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn134Chapter8ClusterAnalysis:BasicConceptsandAlgorithmsTable8.1.SimilaritymatrixforExercise16.p1p2p3p4p5p11.000.100.410.550.35p20.101.000.640.470.98p30.410.641.000.440.85p40.550.470.441.000.76p50.350.980.850.761.000.550.10.60.20.650.30.70.40.750.5Similarity0.8Similarity0.60.850.70.90.80.950.9112534125314(a)Singlelink.(b)Completelink.Figure8.6.DendrogramsforExercise16.课后答案网:www.hackshp.cntotalsquarederrorforeachsetoftwoclusters.Showboththeclustersandthetotalsquarederrorforeachsetofcentroids.i.{18,45}Firstclusteris6,12,18,24,30.Error=360.Secondclusteris42,48.Error=18.TotalError=378ii.{15,40}Firstclusteris6,12,18,24.Error=180.Secondclusteris30,42,48.Error=168.TotalError=348.(b)Dobothsetsofcentroidsrepresentstablesolutions;i.e.,iftheK-meansalgorithmwasrunonthissetofpointsusingthegivencentroidsasthestartingcentroids,wouldtherebeanychangeintheclustersgenerated?若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn135Yes,bothcentroidsarestablesolutions.(c)Whatarethetwoclustersproducedbysinglelink?Thetwoclustersare{6,12,18,24,30}and{42,48}.(d)Whichtechnique,K-meansorsinglelink,seemstoproducethe“mostnatural”clusteringinthissituation?(ForK-means,taketheclusteringwiththelowestsquarederror.)MINproducesthemostnaturalclustering.(e)Whatdefinition(s)ofclusteringdoesthisnaturalclusteringcorrespondto?(Well-separated,center-based,contiguous,ordensity.)MINproducescontiguousclusters.However,densityisalsoanaccept-ableanswer.Evencenter-basedisacceptable,sinceonesetofcentersgivesthedesiredclusters.(f)Whatwell-knowncharacteristicoftheK-meansalgorithmexplainsthepreviousbehavior?K-meansisnotgoodatfindingclustersofdifferentsizes,atleastwhentheyarenotwellseparated.Thereasonforthisisthattheobjectiveofminimizingsquarederrorcausesitto“break”thelargercluster.Thus,inthisproblem,thelowerrorclusteringsolutionisthe“unnatural”one.18.SupposewefindKclustersusingWard’smethod,bisectingK-means,andordinaryK-means.Whichofthesesolutionsrepresentsalocalorglobalminimum?Explain.课后答案网:www.hackshp.cnAlthoughWard’smethodpicksapairofclusterstomergebasedonmini-mizingSSE,thereisnorefinementstepasinregularK-means.Likewise,bisectingK-meanshasnooverallrefinementstep.Thus,unlesssucharefine-mentstepisadded,neitherWard’smethodnorbisectingK-meansproducesalocalminimum.OrdinaryK-meansproducesalocalminimum,butliketheothertwoalgorithms,itisnotguaranteedtoproduceaglobalminimum.19.HierarchicalclusteringalgorithmsrequireO(m2log(m))time,andconse-quently,areimpracticaltousedirectlyonlargerdatasets.Onepossibletechniqueforreducingthetimerequiredistosamplethedataset.Forex-√ample,ifKclustersaredesiredandmpointsaresampledfromthempoints,thenahierarchicalclusteringalgorithmwillproduceahierarchicalclusteringinroughlyO(m)time.Kclusterscanbeextractedfromthishier-archicalclusteringbytakingtheclustersontheKthlevelofthedendrogram.Theremainingpointscanthenbeassignedtoaclusterinlineartime,byusingvariousstrategies.Togiveaspecificexamp
le,thecentroidsoftheK√clusterscanbecomputed,andtheneachofthem−mremainingpointscanbeassignedtotheclusterassociatedwiththeclosestcentroid.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn136Chapter8ClusterAnalysis:BasicConceptsandAlgorithmsForeachofthefollowingtypesofdataorclusters,discussbrieflyif(1)sam-plingwillcauseproblemsforthisapproachand(2)whatthoseproblemsare.Assumethatthesamplingtechniquerandomlychoosespointsfromtheto-talsetofmpointsandthatanyunmentionedcharacteristicsofthedataorclustersareasoptimalaspossible.Inotherwords,focusonlyonproblemscausedbytheparticularcharacteristicmentioned.Finally,assumethatKisverymuchlessthanm.(a)Datawithverydifferentsizedclusters.Thiscanbeaproblem,particularlyifthenumberofpointsinaclusterissmall.Forexample,ifwehaveathousandpoints,withtwoclusters,oneofsize900andoneofsize100,andtakea5%sample,thenwewill,onaverage,endupwith45pointsfromthefirstclusterand5pointsfromthesecondcluster.Fivepointsismucheasiertomissorclusterimproperlythan50.Also,thesecondclusterwillsometimesberepresentedbyfewerthan5points,justbythenatureofrandomsamples.(b)High-dimensionaldata.Thiscanbeaproblembecausedatainhigh-dimensionalspaceistypi-callysparseandmorepointsmaybeneededtodefinethestructureofaclusterinhigh-dimensionalspace.(c)Datawithoutliers,i.e.,atypicalpoints.Bydefinition,outliersarenotveryfrequentandmostofthemwillbeomittedwhensampling.Thus,iffindingthecorrectclusteringdepends课后答案网:www.hackshp.cnonhavingtheoutlierspresent,theclusteringproducedbysamplingwilllikelybemisleading.Otherwise,itisbeneficial.(d)Datawithhighlyirregularregions.Thiscanbeaproblembecausethestructureofthebordercanbelostwhensamplingunlessalargenumberofpointsaresampled.(e)Datawithglobularclusters.Thisistypicallynotaproblemsincenotasmanypointsneedtobesampledtoretainthestructureofaglobularclusterasanirregularone.(f)Datawithwidelydifferentdensities.Inthiscasethedatawilltendtocomefromthedenserregion.Notethattheeffectofsamplingistoreducethedensityofallclustersbythesamplingfactor,e.g.,ifwetak
ea10%sample,thenthedensityoftheclustersisdecreasedbyafactorof10.Forclustersthataren’tverydensetobeginwith,thismaymeansthattheyarenowtreatedasnoiseoroutliers.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn137(g)Datawithasmallpercentageofnoisepoints.Samplingwillnotcauseaproblem.Actually,sincewewouldliketoexcludenoise,andsincetheamountofnoiseissmall,thismaybebeneficial.(h)Non-Euclideandata.Thishasnoparticularimpact.(i)Euclideandata.Thishasnoparticularimpact.(j)Datawithmanyandmixedattributetypes.Manyattributeswasdiscussedunderhigh-dimensionality.Mixedat-tributeshavenoparticularimpact.20.ConsiderthefollowingfourfacesshowninFigure8.7.Again,darknessornumberofdotsrepresentsdensity.Linesareusedonlytodistinguishregionsanddonotrepresentpoints.课后答案网:www.hackshp.cn(a)(b)(c)(d)Figure8.7.FigureforExercise20.(a)Foreachfigure,couldyouusesinglelinktofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.Onlyfor(b)and(d).For(b),thepointsinthenose,eyes,andmoutharemuchclosertogetherthanthepointsbetweentheseareas.For(d)thereisonlyspacebetweentheseregions.(b)Foreachfigure,couldyouuseK-meanstofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.Onlyfor(b)and(d).For(b),K-meanswouldfindthenose,eyes,andmouth,butthelowerdensitypointswouldalsobeincluded.For(d),K-若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn138Chapter8ClusterAnalysis:BasicConceptsandAlgorithmsmeanswouldfindthenose,eyes,andmouthstraightforwardlyaslongasthenumberofclusterswassetto4.(c)WhatlimitationdoesclusteringhaveindetectingallthepatternsformedbythepointsinFigure8.7(c)?Clusteringtechniquescanonlyfindpatternsofpoints,notofemptyspaces.21.ComputetheentropyandpurityfortheconfusionmatrixinTable8.2.Table8.2.ConfusionmatrixforExercise21.ClusterEntertainmentFinancialForeignMetroNationalSportsTotalEntropyPurity#11101146766930.200.98#227893338272533315621.840.53#3326465810516299491.700.49Total35455534194327373832041.440.6122.Youaregiventwosetsof100pointsthatfallwithintheunitsquare.Onesetofpointsisarrangedsothatthepointsareuniformlyspaced.Theothersetofpointsisgeneratedfromauniformdistributionovertheunitsquare.(a)Isthereadifferencebetweenthetwosetsofpoints?Yes.Therandompointswillhaveregionsoflesserorgreaterdensity,whiletheuniformlydistributedpointswill,ofcourse,haveuniformdensitythroughouttheunitsquare.课后答案网:www.hackshp.cn(b)Ifso,whichsetofpointswilltypicallyhaveasmallerSSEforK=10clusters?TherandomsetofpointswillhavealowerSSE.(c)WhatwillbethebehaviorofDBSCANontheuniformdataset?Therandomdataset?DBSCANwillmergeallpointsintheuniformdatasetintooneclusterorclassifythemallasnoise,dependingonthethreshold.Theremightbesomeboundaryissuesforpointsattheedgeoftheregion.However,DBSCANcanoftenfindclustersintherandomdata,sinceitdoeshavesomevariationindensity.23.UsingthedatainExercise24,computethesilhouettecoefficientforeachpoint,eachofthetwoclusters,andtheoverallclustering.Cluster1contains{P1,P2},Cluster2contains{P3,P4}.Thedissimilaritymatrixthatweobtainfromthesimilaritymatrixisthefollowing:若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn139Table8.3.TableofdistancesforExercise23P1P2P3P4P100.100.650.55.P20.1000.700.60P30.650.7000.30P40.550.600.300Letaindicatetheaveragedistanceofapointtootherpointsinitscluster.Letbindicatetheminimumoftheaveragedistanceofapointtopointsinanothercluster.PointP1:SC=1-a/b=1-0.1/((0.65+0.55)/2)=5/6=0.833PointP2:SC=1-a/b=1-0.1/((0.7+0.6)/2)=0.846PointP2:SC=1-a/b=1-0.3/((0.65+0.7)/2)=0.556PointP2:SC=1-a/b=1-0.3/((0.55+0.6)/2)=0.478Cluster1AverageSC=(0.833+0.846)/2=0.84Cluster2AverageSC=(0.556+0.478)/2=0.52OverallAverageSC=(0.840+0.517)/2=0.6824.GiventhesetofclusterlabelsandsimilaritymatrixshowninTables8.4and8.5,respectively,computethecorrelationbetweenthesimilaritymatrixandtheidealsimilaritymatrix,i.e.,thematrixwhoseijthentryis1iftwoobjectsbelongtothesamecluster,and0otherwise.课后答案网:www.hackshp.cnTable8.4.TableofclusterlabelsforExercise24.Table8.5.SimilaritymatrixforExercise24.PointClusterLabelPointP1P2P3P4P11P110.80.650.55P21P20.810.70.6P32P30.650.710.9P42P40.550.60.91Weneedtocomputethecorrelationbetweenthevectorx=<1,0,0,0,0,1>andthevectory=<0.8,0.65,0.55,0.7,0.6,0.3>,whichisthecorrelationbetweentheoff-diagonalelementsofthedistancematrixandtheidealsimi-laritymatrix.Weget:Standarddeviationofthevectorx:σx=0.5164Standarddeviationofthevectory:σy=0.1703Covarianceofxandy:cov(x,y)=−0.200若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn140Chapter8ClusterAnalysis:BasicConceptsandAlgorithmsTherefore,corr(x,y)=cov(x,y)/σxσy=−0.22725.ComputethehierarchicalF-measurefortheeightobjects{p1,p2,p3,p4,p5,p6,p7,p8}andhierarchicalclusteringshowninFigure8.8.ClassAcontainspointsp1,p2,andp3,whilep4,p5,p6,p7,andp8belongtoclassB.{p1,p2,p3,p4,p5,p6,p7,p8}{p1,p2,p4,p5,}{p3,p6,p7,p8}{p1,p2}{p4,p5}{p3,p6}{p7,p8}Figure8.8.HierarchicalclusteringforExercise25.LetR(i,j)=nij/niindicatetherecallofclassiwithrespecttoclusterj.LetP(i,j)=nij/njindicatetheprecisionofclassiwithrespecttoclusterj.F(i,j)=2R(i,j)×P(i,j)/(R(i,j)+P(i,j))istheF-measureforclassiandclusterj.Forcluster#1={p1,p2,p3,p4,p5,p6,p7,p8}:Class=A:R(A,课后答案网:www.hackshp.cn1)=3/3=1,P(A,1)=3/8=0.375F(A,1)=2×1×0.375/(1+0.375)=0.55Class=B:R(B,1)=5/5=1,P(A,1)=5/8=0.625,F(A,1)=0.77Forcluster#2={p1,p2,p4,p5}Class=A:R(A,2)=2/3,P(A,2)=2/4,F(A,2)=0.57Class=B:R(B,2)=2/5,P(B,2)=2/4,F(B,2)=0.44Forcluster#3={p3,p6,p7,p8}Class=A:R(A,3)=1/3,P(A,3)=1/4,F(A,3)=0.29Class=B:R(B,3)=3/5,P(B,3)=3/4,F(B,3)=0.67Forcluster#4={p1,p2}Class=A:若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn141R(A,4)=2/3,P(A,4)=2/2,F(A,4)=0.8Class=B:R(B,4)=0/5,P(B,4)=0/2,F(B,4)=0Forcluster#5={p4,p5}Class=A:R(A,5)=0,P(A,5)=0,F(A,5)=0Class=B:R(B,5)=2/5,P(B,5)=2/2,F(B,5)=0.57Forcluster#6={p3,p6}Class=A:R(A,6)=1/3,P(A,6)=1/2,F(A,6)=0.4Class=B:R(B,6)=1/5,P(B,6)=1/2,F(B,6)=0.29Forcluster#7=课后答案网:www.hackshp.cn{p7,p8}Class=A:R(A,7)=0,P(A,7)=1,F(A,7)=0Class=B:R(B,7)=2/5,P(B,7)=2/2,F(B,7)=0.57ClassA:F(A)=max{F(A,j)}=max{0.55,0.57,0.29,0.8,0,0.4,0}=0.8ClassB:F(B)=max{F(B,j)}=max{0.77,0.44,0.67,0,0.57,0.29,0.57}=0.772nOverallClustering:F=imaxF(i,j)=3/8∗F(A)+5/8∗F(B)=0.781ni26.ComputethecopheneticcorrelationcoefficientforthehierarchicalclusteringsinExercise16.(Youwillneedtoconvertthesimilaritiesintodissimilarities.)Thiscanbeeasilycomputedusingapackage,e.g.,MATLAB.Theanswersaresinglelink,0.8116,andcompletelink,0.7480.27.ProveEquation8.14.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn142Chapter8ClusterAnalysis:BasicConceptsandAlgorithms11(x−y)2=((x−c)−(y−c))2ii2|Ci|2|Ci|x∈Ciy∈Cix∈Ciy∈Ci1=(x−c)2−2(x−c)(y−c)iii2|Ci|x∈Ciy∈Cix∈Ciy∈Ci+(y−c)2ix∈Ciy∈Ci1=(x−c)2+(y−c)2ii2|Ci|x∈Ciy∈Cix∈Ciy∈Ci1=|C|(x−c)2ii|Ci|x∈Ci=SSEThecrosstermx∈Ciy∈Ci(x−ci)(y−ci)is0.28.ProveEquation8.15.1KK1KK|C|(c−c)2=|C|((m−c)−(m−c))2ijiiijK2Ki=1j=1i=1j=1课后答案网:www.hackshp.cn1KKKK=|C|(m−c)2−2|C|(m−c)(m−c)iiiij2Ki=1j=1i=1j=1KK+|C|(m−c)2iji=1j=11KKKK=|C|(m−c)2+|C|(m−c)2iiij2Ki=1j=1i=1j=11K=K|C|(m−c)2iiKi=1=SSBAgain,thecrosstermcancels.K29.Provethati=1x∈Ci(x−mi)(m−mi)=0.ThisfactwasusedintheproofthatTSS=SSE+SSBonpage557.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn143KK(x−c)(c−c)=(xc−cc−xc+c2)iiiiii=1x∈Cii=1x∈CiKKKK=xc−cc−xc+c2iiii=1x∈Cii=1x∈Cii=1x∈Cii=1x∈Ci=mcc−mcc−mc2+mc2iiiiiiii=030.Clustersofdocumentscanbesummarizedbyfindingthetopterms(words)forthedocumentsinthecluster,e.g.,bytakingthemostfrequentkterms,wherekisaconstant,say10,orbytakingalltermsthatoccurmorefre-quentlythanaspecifiedthreshold.SupposethatK-meansisusedtofindclustersofbothdocumentsandwordsforadocumentdataset.(a)HowmightasetoftermclustersdefinedbythetoptermsinadocumentclusterdifferfromthewordclustersfoundbyclusteringthetermswithK-means?First,thetopwordsclusterscould,andlikelywould,overlapsomewhat.Second,itislikelythatmanytermswouldnotappearinanyoftheclustersformedbythetopterms.Incontrast,aK-meansclusteringofthetermswouldcoverallthetermsandwouldnotbeoverlapping.(b)Howcouldtermclusteringbeusedtodefineclustersofdocuments?Anobviousapproachwouldbetotakethetopdocumentsforatermcluster;i.e.,thosedocumentsthatmostfrequentlycontainthetermsin课后答案网:www.hackshp.cnthecluster.31.Wecanrepresentadatasetasacollectionofobjectnodesandacollectionofattributenodes,wherethereisalinkbetweeneachobjectandeachat-tribute,andwheretheweightofthatlinkisthevalueoftheobjectforthatattribute.Forsparsedata,ifthevalueis0,thelinkisomitted.Bipartiteclusteringattemptstopartitionthisgraphintodisjointclusters,whereeachclusterconsistsofase
tofobjectnodesandasetofattributenodes.Theobjectiveistomaximizetheweightoflinksbetweentheobjectandattributenodesofacluster,whileminimizingtheweightoflinksbetweenobjectandattributelinksindifferentclusters.Thistypeofclusteringisalsoknownasco-clusteringsincetheobjectsandattributesareclusteredatthesametime.(a)Howisbipartiteclustering(co-clustering)differentfromclusteringthesetsofobjectsandattributesseparately?Inregularclustering,onlyonesetofconstraints,relatedeithertoob-jectsorattributes,isapplied.Inco-clusteringbothsetsofconstraints若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn144Chapter8ClusterAnalysis:BasicConceptsandAlgorithmsareappliedsimultaneously.Thus,partitioningtheobjectsandat-tributesindependentlyofoneanothertypicallydoesnotproducethesameresults.(b)Arethereanycasesinwhichtheseapproachesyieldthesameclusters?Yes.Forexample,ifasetofattributesisassociatedonlywiththeobjectsinoneparticularcluster,i.e.,has0weightforobjectsinallotherclusters,andconversely,thesetofobjectsinaclusterhas0weightforallotherattributes,thentheclustersfoundbyco-clusteringwillmatchthosefoundbyclusteringtheobjectsandattributesseparately.Tousedocumentsasanexample,thiswouldcorrespondtoadocumentdatasetthatconsistsofgroupsofdocumentsthatonlycontaincertainwordsandgroupsofwordsthatonlyappearincertaindocuments.(c)Whatarethestrengthsandweaknessesofco-clusteringascomparedto课后答案网:www.hackshp.cnordinaryclustering?Co-clusteringautomaticallyprovidesadescriptionofaclusterofobjectsintermsofattributes,whichcanbemoreusefulthanadescriptionofclustersasapartitioningofobjects.However,theattributesthatdistinguishdifferentclustersofobjects,mayoverlapsignificantly,andinsuchcases,co-clusteringwillnotworkwell.32.InFigure8.9,matchthesimilaritymatrices,whicharesortedaccordingtoclusterlabels,withthesetsofpoints.Differencesinshadingandmarkershapedistinguishbetweenclusters,andeachsetofpointscontains100pointsandthreeclusters.Inthesetofpointslabeled2,therearethreeverytight,equal-sizedclusters.Answers:1-D,2-C,3-A,4-B若侵犯了您
的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn14512110.90.90.80.80.70.70.60.6y0.5y0.50.40.40.30.30.20.20.10.10000.20.40.60.8100.20.40.60.81xx34110.90.90.80.80.70.70.60.6y0.5y0.50.40.40.30.30.20.20.10.10000.20.40.60.8100.20.40.60.81xx课后答案网:www.hackshp.cnFigure8.9.PointsandsimilaritymatricesforExercise32.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn课后答案网:www.hackshp.cn若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn9ClusterAnalysis:AdditionalIssuesandAlgorithms1.Forsparsedata,discusswhyconsideringonlythepresenceofnon-zerovaluesmightgiveamoreaccurateviewoftheobjectsthanconsideringtheactualmagnitudesofvalues.Whenwouldsuchanapproachnotbedesirable?Considerdocumentdata.Intuitively,twodocumentsaresimilariftheycon-tainmanyofthesamewords.Althoughwecanalsoincludethefrequencywithwhichthosewordsoccurinthesimilaritycomputation,thiscansome-timesgivealessreliableassessmentofsimilarity.Inparticular,ifoneofthewordsinadocumentoccursratherfrequentlycomparedtootherwords,课后答案网:www.hackshp.cnthenthiswordcandominatethesimilaritycomparisonwhenmagnitudesaretakenintoaccount.Inthatcase,thedocumentwillonlybehighlysimilartootherdocumentsthatalsocontainthesamewordwithahighfrequency.Whilethismaybeappropriateinmanyorevenmostcases,itmayleadtothewrongconclusionifthewordcanappearindifferentcontexts,thatcanonlybedistinguishedbyotherwords.Forinstance,theword,‘game,’appearsfrequentlyindiscussionsofsportsandvideogames.2.DescribethechangeinthetimecomplexityofK-meansasthenumberofclusterstobefoundincreases.Asthenumberofclustersincreases,thecomplexityofK-meansapproachesO(m2).3.Considerasetofdocuments.Assumethatalldocumentshavebeennormal-izedtohaveunitlengthof1.Whatisthe“shape”ofaclusterthatconsistsofalldocumentswhosecosinesimilaritytoacentroidisgreaterthansomespecifiedconstant?Inotherwords,cos(d,c)≥δ,where0<δ≤1.若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 
课后答案网:www.hackshp.cn148Chapter9ClusterAnalysis:AdditionalIssuesandAlgorithmsOncedocumentvectorshavebeennormalized,theylieonamn-dimensionalhypershpere.Theconstraintthatalldocumentshaveaminimumcosinesimilaritywithrespecttoacentroidsisaconstraintthatthedocumentvectorsliewithinacone,whoseintersectionwiththesphereisacircleonthesurfaceofthesphere.4.Discusstheadvantagesanddisadvantagesoftreatingclusteringasanopti-mizationproblem.Amongotherfactors,considerefficiency,non-determinism,andwhetheranoptimization-basedapproachcapturesalltypesofclusteringsthatareofinterest.Twokeyadvantagetotreatingclusteringasanoptimizationproblemarethat(1)itprovidesacleardefinitionofwhattheclusteringprocessisdo-ing,and(2)itallowstheuseofpowerfuloptimizationtechniquesthathavebeendevelopedinawidevarietyoffields.Unfortunately,mostoftheseop-timizationtechniqueshaveahightimecomplexity.Furthermore,itcanbeshownthatmanyoptimizationproblemsareNPhard,andtherefore,itisnecessarytouseheuristicoptimizationapproachesthatcanonlyguaranteealocallyoptimalsolution.Oftensuchtechniquesworkbestwhenusedwithrandominitialization,andthus,thesolutionfoundcanvaryfromoneruntoanother.Anotherproblemwithoptimizationapproachesisthattheobjectivefunctionstheyusetendtofavorlargeclustersattheexpenseofsmallerones.5.Whatisthetimeandspacecomplexityoffuzzyc-means?OfSOM?HowdothesecomplexitiescomparetothoseofK-means?ThetimecomplexityofK-meansO(I∗K∗m∗n),whereIisthenumberofiterationsrequiredforconvergence,Kisthenumberofclusters,misthenumberofpoints,and课后答案网:www.hackshp.cnnisthenumberofattributes.Thetimerequiredbyfuzzyc-meansisbasicallythesameasK-means,althoughtheconstantismuchhigher.ThetimecomplexityofSOMisalsobasicallythesameasK-meansbecauseitconsistsofmultiplepassesinwhichobjectsareassignedtocentroidsandthecentroidsareupdated.However,becausethesurroundingcentroidsarealsoupdatedandthenumberofpassescanbelarge,SOMwilltypicallybeslowerthanK-means.6.TraditionalK-meanshasanumberoflimitations,suchassensitivitytoout-liersanddifficultyinhandlingclus
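As a concrete reference point for this exercise, the sketch below implements the usual sum-to-one membership update of fuzzy c-means, w_ij = 1 / Σ_k (d_ij/d_ik)^(2/(p−1)), which is the constraint the exercise proposes relaxing. The specific points, centroids, and fuzzifier p = 2 are hypothetical values chosen for illustration, not taken from the text:

```python
import math

def fcm_memberships(points, centroids, p=2.0):
    """Standard fuzzy c-means membership update (fuzzifier p > 1):
    w_ij = 1 / sum_k (d_ij / d_ik)^(2/(p-1)), so each point's
    memberships are non-negative and sum to 1 over the clusters."""
    weights = []
    for x in points:
        d = [math.dist(x, c) for c in centroids]
        if 0.0 in d:
            # A point that coincides with a centroid gets full membership there.
            weights.append([1.0 if di == 0.0 else 0.0 for di in d])
            continue
        exponent = 2.0 / (p - 1.0)
        w = [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]
        weights.append(w)
    return weights

if __name__ == "__main__":
    # Hypothetical points and centroids, purely for illustration.
    points = [(0.0, 0.0), (0.9, 0.1), (2.0, 2.0)]
    centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0)]
    for x, w in zip(points, fcm_memberships(points, centroids)):
        print(x, [round(wi, 3) for wi in w])
```

Dropping the normalization (keeping only 0 ≤ w_ij ≤ 1) would let an outlier take low membership in every cluster, which is exactly the trade-off discussed next.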
tersofdifferentsizesanddensities,orwithnon-globularshapes.Commentontheabilityoffuzzyc-meanstohandlethesesituations.Fuzzyc-meanshasallthelimitationsoftraditionalK-means,exceptthatitdoesnotmakeahardassignmentofanobjecttoacluster.7.Forthefuzzyc-meansalgorithmdescribedinthisbook,thesumofthemem-bershipdegreeofanypointoverallclustersis1.Instead,wecouldonlyrequirethatthemembershipdegreeofapointinaclusterbebetween0and1.Whataretheadvantagesanddisadvantagesofsuchanapproach?若侵犯了您的版权利益,敬请来信告知!www.hackshp.cn 课后答案网:www.hackshp.cn149Themainadvantageofthisapproachoccurswhenapointisanoutlieranddoesnotreallybelongverystronglytoanycluster,sinceinthatsituation,thepointcanhavelowmembershipinallclusters.However,thisapproachisoftenhardertoinitializeproperlyandcanperformpoorlywhentheclustersarenotarenotdistinct.Inthatcase,severalclustercentersmaymergetogether,oraclustercentermayvarysignificantlyfromoneiterationtoanother,insteadofchangingonlyslightly,asinordinaryK-meansorfuzzyc-means.8.Explainthedifferencebetweenlikelihoodandprobability.Probabilityis,accordingtoonecommonstatisticaldefinition,thefrequencywithwhichaneventoccursasthenumberofexperimentstendstoinfinity.Probabilityisdefinedbyaprobabilitydensityfunctionwhichisafunctionoftheattributevaluesofanobject.Typically,aprobabilitydensityfunctiondependsonsomeparameters.Consideringprobabilitydensityfunctiontobeafunctionoftheparametersyieldsthelikelihoodfunction.9.Equation9.12givesthelikelihoodforasetofpointsfromaGaussiandis-tributionasafunctionofthemeanµandthestandarddeviationσ.Showmathematicallythatthemaximumlikelihoodestimateofµandσarethesamplemeanandthesamplestandarddeviation,respectively.First,wesolveforµ.∂((µ,σ)|X)∂m(x−µ)2i=−−0.5mlog2π−mlogσ∂µ∂µ2σ2i=1m课后答案网:www.hackshp.cn2(xi−µ)=−2σ2i=11mSettingthisequalto0andsolving,wegetµ=mi=1xi.Likewise,wecansolveforσ.∂((µ,σ)|X)∂m(x−µ)2i=−−0.5mlog2π−mlogσ∂σ∂σ2σ2i=1m2(x−µ)2mi=−2σ3σi=121m2Settingthisequalto0andsolving,wegetσ=mi=1(xi−µ).10.Wetakeasampleofadultsandmeasuretheirheights.Ifwerecordthegenderof
each person, we can calculate the average height and the variance of the height, separately, for men and women. Suppose, however, that this information was not recorded. Would it be possible to still obtain this information? Explain.

The heights of men and women will have separate Gaussian distributions with different means and perhaps different variances. By using a mixture model approach, we can obtain estimates of the mean and variance of the two height distributions. Given a large enough sample size, the estimated parameters should be close to those that could be computed if we knew the gender of each person.

11. Compare the membership weights and probabilities of Figures 9.1 and 9.4, which come, respectively, from applying fuzzy and EM clustering to the same set of data points. What differences do you detect, and how might you explain these differences?

The fuzzy clustering approach only assigns very high weights to those points in the center of the clusters. Those points that are close to two or three clusters have relatively low weights. The points that are on the far edges of the clusters, away from other clusters, also have lower weights than the center points, but not as low as points that are near two or three clusters. The EM clustering approach assigns high weights both to points in the center of the clusters and those on the far edges. The points that are near two or three clusters have lower weights, but not as much so as with the fuzzy clustering procedure. The main difference between the approaches is that as a point on the far edge of a cluster gets further away from the center of the cluster, the weight with which it belongs to a cluster becomes more equal among clusters for the fuzzy clustering approach, but for the EM approach the point tends to belong more strongly to the cluster to which it is closest.

12. Figure 9.1 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering, but not K-means clustering.

In EM clustering, we compute the probability that a point belongs to a cluster. In turn, this probability depends on both the distance from the cluster center and the spread (variance) of the cluster. Hence, a point that is closer to the centroid of one cluster than another can still have a higher probability with respect to the more distant cluster if that cluster has a higher spread than the closer cluster. K-means only takes into account the distance to the closest cluster when assigning points to clusters. This is equivalent to an EM approach where all clusters are assumed to have the same variance.

[Figure 9.1. Data set for Exercise 12: EM clustering of a two-dimensional point set with two clusters of differing density.]

13. Show that the MST clustering technique of Section 9.4.2 produces the same clusters as single link. To avoid complications and special cases, assume that all the pairwise similarities are distinct.

In single link, we start with clusters of individual points and then successively join the two clusters that have the pair of points that are closest together. Conceptually, we can view the merging of the clusters as putting an edge between the two closest points of the two clusters. Note that if both clusters are currently connected, then the resulting cluster will also be connected. However, since the clusters are formed from disjoint sets of points, and edges are only placed between points in different clusters, no cycle can be formed. From these observations, and by noting that we start with clusters (graphs) of size one that are vacuously connected, we can deduce by induction that at any stage in the single link clustering process, each cluster consists of a connected set of points without any cycles. Thus, when the last two clusters are merged to form a cluster containing all the points, we also have a connected graph of all the points that is a spanning tree of the graph. Furthermore, since each point in the graph is connected to its nearest point, the spanning tree must be a minimum spanning tree. All that remains to establish the equivalence of MST and single link is to note that MST essentially reverses the process by which single link built the minimum spanning tree; i.e., by breaking edges beginning with the longest and proceeding until the smallest. Thus, it generates the same clusters as single link, but in reverse order.

14. One way to sparsify a proximity matrix is the following: For each object (row in the matrix), set all entries to 0 except for those corresponding to the object's k-nearest neighbors. However, the sparsified proximity matrix is typically not symmetric.

(a) If object a is among the k-nearest neighbors of object b, why is b not guaranteed to be among the k-nearest neighbors of a?

Consider a dense set of k+1 objects and another object, an outlier, that is farther from any of the objects than they are from each other. None of the objects in the dense set will have the outlier on their k-nearest neighbor list, but the outlier will have k of the objects from the dense set on its k-nearest neighbor list.

(b) Suggest at least two approaches that could be used to make the sparsified proximity matrix symmetric.

One approach is to set the ij-th entry to 0 if the ji-th entry is 0, or vice versa. Another approach is to set the ij-th entry to 1 if the ji-th entry is 1, or vice versa.

15. Give an example of a set of clusters in which merging based on the closeness of clusters leads to a more natural set of clusters than merging based on the strength of connection (interconnectedness) of clusters.

An example of this is given in the Chameleon paper, which can be found at http://www.cs.umn.edu/karypis/publications/Papers/PDF/chameleon.pdf. The example consists of two narrow rectangles of points that are side by side. The top rectangle is split into two clusters, one much smaller than the other. Even though the two rectangles on the top are close, they are not strongly connected since the strong links between them are across a small area. On the other hand, the largest rectangle on the top and the rectangle on the bottom are strongly connected. Each individual connection is not as strong, because these two rectangles are not as close, but there are more of them because the area of connection is large. Thus, an approach based on connectivity will merge the largest rectangle on top with the bottom rectangle.

16. Table 9.1 lists the two nearest neighbors of four points. Calculate the SNN similarity between each pair of points using the definition of SNN similarity defined in Algorithm 9.10.

Table 9.1. Two nearest neighbors of four points.
Point   First Neighbor   Second Neighbor
1       4                3
2       3                4
3       4                2
4       3                1

The following is the SNN similarity matrix:

Table 9.2. SNN similarity matrix for the four points.
Point   1   2   3   4
1       2   0   0   1
2       0   2   1   0
3       0   1   2   0
4       1   0   0   2

17. For the definition of SNN similarity provided by Algorithm 9.10, the calculation of SNN distance does not take into account the position of shared neighbors in the two nearest neighbor lists. In other words, it might be desirable to give higher similarity to two points that share the same nearest neighbors in the same or roughly the same order.

(a) Describe how you might modify the definition of SNN similarity to give higher similarity to points whose shared neighbors are in roughly the same order.

This can be done by assigning weights to the points based on their position in the nearest neighbor list. For example, we can weight the i-th point in the nearest neighbor list by n − i + 1. For each point, we then take the sum or product of its rank on both lists. These values are then summed to compute the similarity between the two objects. This approach was suggested by Jarvis and Patrick [5].

(b) Discuss the advantages and disadvantages of such a modification.

Such an approach is more complex. However, it is advantageous if it is the case that two objects are more similar if the shared neighbors are roughly of the same rank. Furthermore, it may also help to compensate for arbitrariness in the choice of k.

18. Name at least one situation in which you would not want to use clustering based on SNN similarity or density.

When you wish to cluster based on absolute density or distance.

19. Grid-clustering techniques are different from other clustering techniques in that they partition space instead of sets of points.

(a) How does this affect such techniques in terms of the description of the resulting clusters and the types of clusters that can be found?
In grid-based clustering, the clusters are described in terms of collections of adjacent cells. In some cases, as in CLIQUE, a more compact description is generated. In any case, the description of the clusters is given in terms of a region of space, not a set of objects. (However, such a description can easily be generated.) Because it is necessary to work in terms of rectangular regions, the shapes of non-rectangular clusters can only be approximated. However, the groups of adjacent cells can have holes.

(b) What kind of cluster can be found with grid-based clusters that cannot be found by other types of clustering approaches? (Hint: See Exercise 20 in Chapter 8, page 564.)

Typically, grid-based clustering techniques only pay attention to dense regions. However, such techniques could also be used to identify sparse or empty regions and thus find patterns of the absence of points. Note, however, that this would not be appropriate for a sparse data space.

20. In CLIQUE, the threshold used to find cluster density remains constant, even as the number of dimensions increases. This is a potential problem since density drops as dimensionality increases; i.e., to find clusters in higher dimensions the threshold has to be set at a level that may well result in the merging of low-dimensional clusters. Comment on whether you feel this is truly a problem and, if so, how you might modify CLIQUE to address this problem.

This is a real problem. A similar problem exists in association analysis. In particular, the support of association patterns with a large number of items is often low. To find such patterns using an algorithm such as Apriori is difficult because the low support threshold required results in a large number of association patterns with few items that are of little interest. In other words, association patterns with many items (patterns in higher-dimensional space) are interesting at support levels (densities) that do not make for interesting patterns when the size of the association pattern (number of dimensions) is low. One approach is to decrease the support threshold (density threshold) as the number of items (number of dimensions) is increased.

21. Given a set of points in Euclidean space, which are being clustered using the K-means algorithm with Euclidean distance, the triangle inequality can be used in the assignment step to avoid calculating all the distances of each point to each cluster centroid. Provide a general discussion of how this might work.

Charles Elkan presented the following theorem in his keynote speech at the Workshop on Clustering High-Dimensional Data at SIAM 2004.

Lemma 1: Let x be a point, and let b and c be centers. If d(b,c) ≥ 2d(x,b), then d(x,c) ≥ d(x,b).

Proof: We know d(b,c) ≤ d(b,x) + d(x,c). So d(b,c) − d(x,b) ≤ d(x,c). Now d(b,c) − d(x,b) ≥ 2d(x,b) − d(x,b) = d(x,b). So d(x,b) ≤ d(x,c).

This theorem can be used to eliminate a large number of unnecessary distance calculations.

22. Instead of using the formula derived in CURE (see Equation 9.19), we could run a Monte Carlo simulation to directly estimate the probability that a sample of size s would contain at least a certain fraction of the points from a cluster. Using a Monte Carlo simulation, compute the probability that a sample of size s contains 50% of the elements of a cluster of size 100, where the total number of points is 1000, and where s can take the values 100, 200, or 500.

This question should have said "contains at least 50%." The results of our simulation, consisting of 100,000 trials, were 0, 0, and 0.54 for the sample sizes 100, 200, and 500, respectively.
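The Monte Carlo simulation described in this exercise is easy to reproduce. The sketch below uses a smaller trial count than the 100,000 quoted above, and takes the cluster, without loss of generality, to be the first 100 point indices:

```python
import random

def prob_at_least_half(sample_size, n_points=1000, cluster_size=100,
                       trials=10000, seed=42):
    """Estimate the probability that a uniform random sample (drawn
    without replacement) of the given size contains at least 50% of a
    cluster of cluster_size points out of n_points total."""
    rng = random.Random(seed)
    need = cluster_size // 2  # at least half of the cluster
    hits = 0
    for _ in range(trials):
        sample = rng.sample(range(n_points), sample_size)
        in_cluster = sum(1 for i in sample if i < cluster_size)
        if in_cluster >= need:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for s in (100, 200, 500):
        print(s, prob_at_least_half(s))
```

With enough trials the three estimates settle near 0, 0, and roughly 0.54, in agreement with the figures quoted in the solution.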
10 Anomaly Detection

1. Compare and contrast the different techniques for anomaly detection that were presented in Section 10.1.2. In particular, try to identify circumstances in which the definitions of anomalies used in the different techniques might be equivalent or situations in which one might make sense, but another would not. Be sure to consider different types of data.

First, note that the proximity- and density-based anomaly detection techniques are related. Specifically, high density in the neighborhood of a point implies that many points are close to it, and vice-versa. In practice, density is often defined in terms of distance, although it can also be defined using a grid-based approach. The model-based approach can be used with virtually any underlying technique that fits a model to the data. However, note that a particular model, statistical or otherwise, must be assumed. Consequently, model-based approaches are restricted in terms of the data to which they can be applied. For example, if the model assumes a Gaussian distribution, then it cannot be applied to data with a non-Gaussian distribution. On the other hand, the proximity- and density-based approaches do not make any particular assumption about the data, although the definition of an anomaly does vary from one proximity- or density-based technique to another. Proximity-based approaches can be used for virtually any type of data, although the proximity metric used must be chosen appropriately. For example, Euclidean distance is typically used for dense, low-dimensional data, while the cosine similarity measure is used for sparse, high-dimensional data. Since density is typically defined in terms of proximity, density-based approaches can also be used for virtually any type of data. However, the meaning of density is less clear in a non-Euclidean data space. Proximity- and density-based anomaly detection techniques can often produce similar results, although there are significant differences between techniques that do not account for the variations in density throughout a data set or that use different proximity measures for the same data set. Model-based methods will often differ significantly from one another and from proximity- and density-based approaches.

2. Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.

(a) Compare this definition to that of the standard model-based definition of an anomaly.

The standard model-based definition labels objects that don't fit the model very well as anomalies. Although these objects often are unusually influential in the model, it can also be the case that an unusually influential object can fit the model very well.

(b) For what sizes of data sets (small, medium, or large) is this definition appropriate?

This definition is typically more appropriate for smaller data sets, at least if we are talking about one very influential object. However, a relatively small group of highly influential objects can have a significant impact on a model, but still fit it well, even for medium or large data sets.

3. In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.

(a) To which of the definitions of an anomaly in Section 10.1.2 is this definition most closely related?

This definition is most closely related to the distance-based approach.

(b) Name two problems with this definition of an anomaly.

i. For the convex hull approach, the distance of the points in a convex hull from the center of the points can vary significantly if the distribution of points is not symmetrical.
ii. This approach does not lend itself to assigning meaningful numerical anomaly scores.

4. Association analysis can be used to find anomalies as follows. Find strong association patterns, which involve some minimum number of objects. Anomalies are those objects that do not belong to any such patterns. To make this more concrete, we note that the hyperclique association pattern discussed in Section 6.8 is particularly suitable for such an approach. Specifically, given a user-selected h-confidence level, maximal hyperclique patterns of objects are found. All objects that do not appear in a maximal hyperclique pattern of at least size three are classified as outliers.

(a) Does this technique fall into any of the categories discussed in this chapter? If so, which one?

In a hyperclique, all pairs of objects have a guaranteed cosine similarity of the h-confidence or higher. Thus, this approach can be viewed as a proximity-based approach. However, rather than a condition on the proximity of objects with respect to a particular object, there is a requirement on the pairwise proximities of all objects in a group.

(b) Name one potential strength and one potential weakness of this approach.

Strengths of this approach are that (1) the objects that do not belong to any size 3 hyperclique are not strongly connected to other objects and are likely anomalous, and (2) it is computationally efficient. Potential weaknesses are that (1) this approach does not assign a numerical anomaly score, but simply classifies an object as normal or an anomaly, (2) it is not possible to directly control the number of objects classified as anomalies because the only parameters are the h-confidence and support threshold, and (3) the data needs to be discretized.

5. Discuss techniques for combining multiple anomaly detection techniques to improve the identification of anomalous objects. Consider both supervised and unsupervised cases.

In the supervised case, we could use ensemble classification techniques. In these approaches, the classification of an object is determined by combining the classifications of a number of classifiers, e.g., by voting. In the unsupervised approach, a voting approach could also be used. Note that this assumes that we have a binary assignment of an object as an anomaly. If we have anomaly scores, then the scores could be combined in some manner, e.g., an average or minimum, to yield an overall score.

6. Describe the potential time complexity of anomaly detection approaches based on the following approaches: model-based using clustering, proximity-based, and density. No knowledge of specific techniques is required. Rather, focus on the basic computational requirements of each approach, such as the time required to compute the density of each object.

If K-means clustering is used, then the complexity is dominated by finding the clusters. This requires time proportional to the number of objects, i.e., O(m). The distance- and density-based approaches usually require computing all the pairwise proximities, and thus the complexity is often O(m²). In some cases, such as low-dimensional data, special techniques, such as the R* tree or k-d trees, can be used to compute the nearest neighbors of an object more efficiently, i.e., O(m log m), and this can reduce the overall complexity when the technique is based only on nearest neighbors. Also, a grid-based approach to computing density can reduce the complexity of density-based anomaly detection to O(m), but such techniques are not as accurate and are only effective for low dimensions.

7. The Grubbs' test, which is described by Algorithm 10.1, is a more statistically sophisticated procedure for detecting outliers than that of Definition 10.3. It is iterative and also takes into account the fact that the z-score does not have a normal distribution. This algorithm computes the z-score of each value based on the sample mean and standard deviation of the current set of values. The value with the largest magnitude z-score is discarded if its z-score is larger than g_c, the critical value of the test for an outlier at significance level α. This process is repeated until no objects are eliminated. Note that the sample mean, standard deviation, and g_c are updated at each iteration.

Algorithm 10.1 Grubbs' approach for outlier elimination.
1: Input the values and α {m is the number of values, α is a parameter, and t_c is a value chosen so that α = prob(x ≥ t_c) for a t distribution with m − 2 degrees of freedom.}
2: repeat
3:   Compute the sample mean (x̄) and standard deviation (s_x).
4:   Compute a value g_c so that prob(|z| ≥ g_c) = α.
     (In terms of t_c and m, g_c = ((m − 1)/√m) · √(t_c² / (m − 2 + t_c²)).)
5:   Compute the z-score of each value, i.e., z = (x − x̄)/s_x.
6:   Let g = max |z|, i.e., find the z-score of largest magnitude and call it g.
7:   if g > g_c then
8:     Eliminate the value corresponding to g.
9:     m ← m − 1
10:  end if
11: until No objects are eliminated.

(a) What is the limit of the value ((m − 1)/√m) · √(t_c² / (m − 2 + t_c²)) used for Grubbs' test as m approaches infinity? Use a significance level of 0.05.

Note that this could have been phrased better. The value of this expression approaches t_c, but strictly speaking this is not a limit, as t_c depends on m.

lim_{m→∞} ((m − 1)/√m) · √(t_c² / (m − 2 + t_c²)) = lim_{m→∞} ((m − 1)/√(m(m − 2 + t_c²))) × t_c = 1 × t_c = t_c.

Also, the value of t_c will continue to increase with m, although slowly. For m = 10^20, t_c = 9.3 for a significance value of 0.05.

(b) Describe, in words, the meaning of the previous result.

The distribution of g becomes a t distribution as m increases.

8. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.

(a) For a set of 1,000,000 values, how likely are we to have outliers according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)

This question should have asked how many outliers we would have, since the object of this question is to show that even a small probability of an outlier yields a large number of outliers for a large data set. The probability is unaffected by the number of objects. The probability is either 0.00135 for a single-sided deviation of 3 standard deviations or 0.0027 for a double-sided deviation. Thus, the number of anomalous objects will be either 1,350 or 2,700.

(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?

There are thousands of outliers (under the specified definition) in a million objects. We may choose to accept these objects as outliers or prefer to increase the threshold so that fewer outliers result.

9. The probability density of a point x with respect to a multivariate normal distribution having a mean µ and covariance matrix Σ is given by the equation

prob(x) = (1 / ((2π)^{m/2} |Σ|^{1/2})) · exp(−(x − µ) Σ^{-1} (x − µ)^T / 2).   (10.1)

Using the sample mean x̄ and covariance matrix S as estimates of the mean µ and covariance matrix Σ, respectively, show that the log(prob(x)) is equal to the Mahalanobis distance between a data point x and the sample mean x̄ plus a constant that does not depend on x.

log prob(x) = −log((2π)^{m/2} |Σ|^{1/2}) − (1/2)(x − µ) Σ^{-1} (x − µ)^T.

If we use the sample mean and covariance as estimates of µ and Σ, respectively
, then

log prob(x) = −log((2π)^{m/2} |S|^{1/2}) − (1/2)(x − x̄) S^{-1} (x − x̄)^T.

The constant term and the constant factor do not affect the ordering of this quantity, only its magnitude. Thus, if we want to base a distance on this quantity, we can keep only the variable part, which is the Mahalanobis distance.

10. Compare the following two measures of the extent to which an object belongs to a cluster: (1) distance of an object from the centroid of its closest cluster and (2) the silhouette coefficient described in Section 8.5.2.

The first measure is somewhat limited since it disregards the fact that the object may also be close to another cluster. The silhouette coefficient takes into account both the distance of an object to its cluster and its distance to other clusters. Thus, it can be more informative about how strongly an object belongs to the cluster to which it was assigned.

11. Consider the (relative distance) K-means scheme for outlier detection described in Section 10.5 and the accompanying figure, Figure 10.10.

(a) The points at the bottom of the compact cluster shown in Figure 10.10 have a somewhat higher outlier score than those points at the top of the compact cluster. Why?

The mean of the points is pulled somewhat upward from the center of the compact cluster by point D.

(b) Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?

No. This point would become a cluster by itself.

(c) The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.

If absolute distances are important. For example, consider heart rate monitors for patients. If the heart rate goes above or below a specified range of values, then this has a physical meaning. It would be incorrect not to identify any patient outside that range as abnormal, even though there may be a group of patients that are relatively similar to each other and all have abnormal heart rates.

12. If the probability that a normal object is classified as an anomaly is 0.01 and the probability that an anomalous object is classified as anomalous is 0.99, then what is the false alarm rate and detection rate if 99% of the objects are normal? (Use the definitions given below.)

detection rate = (number of anomalies detected) / (total number of anomalies)
false alarm rate = (number of false anomalies) / (number of objects classified as anomalies)

The detection rate is simply 99%. The false alarm rate = (0.99m × 0.01) / (0.99m × 0.01 + 0.01m × 0.99) = 0.50 = 50%.

13. When a comprehensive training set is available, a supervised anomaly detection technique can typically outperform an unsupervised anomaly technique when performance is evaluated using measures such as the detection and false alarm rate. However, in some cases, such as fraud detection, new types of anomalies are always developing. Performance can be evaluated according to the detection and false alarm rates, because it is usually possible to determine, upon investigation, whether an object (transaction) is anomalous. Discuss the relative merits of supervised and unsupervised anomaly detection under such conditions.

When new anomalies are to be detected, an unsupervised anomaly detection scheme must be used. However, supervised anomaly detection techniques are still important for detecting known types of anomalies. Thus, both supervised and unsupervised anomaly detection methods should be used. A good example of such a situation is network intrusion detection. Profiles or signatures can be created for well-known types of intrusions, but cannot detect new types of intrusions.

14. Consider a group of documents that has been selected from a much larger set of diverse documents so that the selected documents are as dissimilar from one another as possible. If we consider documents that are not highly related (connected, similar) to one another as being anomalous, then all of the documents that we have selected might be classified as anomalies. Is it possible for a data set to consist only of anomalous objects, or is this an abuse of the terminology?

The connotation of anomalous is that of rarity, and many of the definitions of an anomaly incorporate this notion to some extent. However, there are situations in which an anomaly typically does not occur very often, e.g., a network failure, but has a very concrete definition. This makes it possible to distinguish an anomaly in an absolute sense and for a situation to arise where the majority of objects are anomalous. Also, in providing mathematical or algorithmic definitions of an anomaly, it can happen that these definitions produce situations in which many or all objects of a data set can be classified as anomalies. Another viewpoint might say that if it is impossible to define a meaningful normal situation, then all objects are anomalous. ("Unique" is the term typically used in this context.) In summary, this can be regarded as a philosophical or semantic question. A good argument (although likely not an uncontested one) can be made that it is possible to have a collection of objects that are mostly or all anomalies.

15. Consider a set of points, where most points are in regions of low density, but a few points are in regions of high density. If we define an anomaly as a point in a region of low density, then most points will be classified as anomalies. Is this an appropriate use of the density-based definition of an anomaly, or should the definition be modified in some way?

If the density has an absolute meaning, such as assigned by the domain, then it may be perfectly reasonable to consider most of the points as anomalous. (See the answer to the previous exercise.) However, in many circumstances, the appropriate approach would be to use an anomaly detection technique that takes the relative density into account.

16. Consider a set of points that are uniformly distributed on the interval [0,1]. Is the statistical notion of an outlier as an infrequently observed value meaningful for this data?

Not really. The traditional statistical notion of an outlier relies on the idea that an object with a relatively low probability is suspect. With a uniform distribution, no such distinction can be made.

17. An analyst applies an anomaly detection algorithm to a data set and finds a set of anomalies. Being curious, the analyst then applies the anomaly detection algorithm to the set of anomalies.

(a) Discuss the behavior of each of the anomaly detection techniques described in this chapter. (If possible, try this for real data sets and algorithms.)

(b) What do you think the behavior of an anomaly detection algorithm should be when applied to a set of anomalous objects?

In some cases, such as the statistically-based anomaly detection techniques, it would not be valid to apply the technique a second time, since the assumptions would no longer hold. This could also be true for other model-based approaches. The behavior of the proximity- and density-based approaches would depend on the particular techniques. Interestingly, the approaches that use absolute thresholds of distance or density would likely classify the set of anomalies as anomalies, at least if the original parameters were kept. The relative approaches would likely classify most of the anomalies as normal and some as anomalies. Whether an object is anomalous depends on the entire group of objects, and thus, it is probably unreasonable to expect that an anomaly detection technique will identify a set of anomalies as such in the absence of the original data set.

Bibliography

[1] W. W. Cohen. Fast effective rule induction. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 115-123, Tahoe City, CA, July 1995.
[2] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57-78, 1993.
[3] J. Fürnkranz and G. Widmer. Incremental reduced error pruning. In Proc. of the 11th Intl. Conf. on Machine Learning, pages 70-77, New Brunswick, NJ, July 1994.
[4] J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.
[5] R. A. Jarvis and E. A. Patrick. Clustering using a similarity measure based on shared nearest neighbors. IEEE Transactions on Computers, C-22(11):1025-1034, 1973.
