这里 --我可以看到存在类clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN
,但是当我试图调用它时,我得到了错误:
java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN -algorithm.distancefunction EuclideanDistanceFunction -dbc.in infile.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out OUTFOLDER
Class 'clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN' not found for given value. Must be a subclass / implementation of de.lmu.ifi.dbs.elki.algorithm.Algorithm
这个类确实没有出现在可用类的列表中,该列表中输出了错误消息:
-> clustering.CanopyPreClustering
-> clustering.DBSCAN
-> clustering.affinitypropagation.AffinityPropagationClusteringAlgorithm
-> clustering.em.EM
-> clustering.gdbscan.GeneralizedDBSCAN
-> clustering.gdbscan.LSDBC
-> clustering.GriDBSCAN
-> clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-> clustering.hierarchical.extraction.SimplifiedHierarchyExtraction
-> clustering.hierarchical.extraction.ExtractFlatClusteringFromHierarchy
-> clustering.hierarchical.SLINK
-> clustering.hierarchical.AnderbergHierarchicalClustering
-> clustering.hierarchical.AGNES
-> clustering.hierarchical.CLINK
-> clustering.hierarchical.SLINKHDBSCANLinearMemory
-> clustering.hierarchical.HDBSCANLinearMemory
-> clustering.kmeans.KMeansSort
-> clustering.kmeans.KMeansCompare
-> clustering.kmeans.KMeansHamerly
-> clustering.kmeans.KMeansElkan
-> clustering.kmeans.KMeansLloyd
-> clustering.kmeans.parallel.ParallelLloydKMeans
-> clustering.kmeans.KMeansMacQueen
-> clustering.kmeans.KMediansLloyd
-> clustering.kmeans.KMedoidsPAM
-> clustering.kmeans.KMedoidsEM
-> clustering.kmeans.CLARA
-> clustering.kmeans.BestOfMultipleKMeans
-> clustering.kmeans.KMeansBisecting
-> clustering.kmeans.KMeansBatchedLloyd
-> clustering.kmeans.KMeansHybridLloydMacQueen
-> clustering.kmeans.SingleAssignmentKMeans
-> clustering.kmeans.XMeans
-> clustering.NaiveMeanShiftClustering
-> clustering.optics.DeLiClu
-> clustering.optics.OPTICSXi
-> clustering.optics.OPTICSHeap
-> clustering.optics.OPTICSList
-> clustering.optics.FastOPTICS
-> clustering.SNNClustering
-> clustering.biclustering.ChengAndChurch
-> clustering.correlation.CASH
-> clustering.correlation.COPAC
-> clustering.correlation.ERiC
-> clustering.correlation.FourC
-> clustering.correlation.HiCO
-> clustering.correlation.LMCLUS
-> clustering.correlation.ORCLUS
-> clustering.onedimensional.KNNKernelDensityMinimaClustering
-> clustering.subspace.CLIQUE
-> clustering.subspace.DiSH
-> clustering.subspace.DOC
-> clustering.subspace.HiSC
-> clustering.subspace.P3C
-> clustering.subspace.PreDeCon
-> clustering.subspace.PROCLUS
-> clustering.subspace.SUBCLU
-> clustering.meta.ExternalClustering
-> clustering.trivial.ByLabelClustering
-> clustering.trivial.ByLabelHierarchicalClustering
-> clustering.trivial.ByModelClustering
-> clustering.trivial.TrivialAllInOne
-> clustering.trivial.TrivialAllNoise
-> clustering.trivial.ByLabelOrAllInOneClustering
-> clustering.uncertain.FDBSCAN
-> clustering.uncertain.CKMeans
-> clustering.uncertain.UKMeans
-> clustering.uncertain.RepresentativeUncertainClustering
-> clustering.uncertain.CenterOfMassMetaClustering
我认为这个方法可能是内部的,并且是由clustering.gdbscan.GeneralizedDBSCAN
调用的,但是它对我来说是一个单一的核心。也许我需要添加一些命令行参数来启用多处理?
编辑:感谢@erich,现在我可以看到时间估计了。我在那里使用了M树索引,如文档所示
java -Xmx32000M -cp elki-bundle-0.7.2-SNAPSHOT.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN -db.index tree.metrical.mtreevariants.mtree.MTreeFactory -treeindex.pagesize 4096 -mtree.distancefunction EuclideanDistanceFunction -algorithm.distancefunction EuclideanDistanceFunction -dbc.in dump_txt.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out RES
我收到了关于被忽略的参数的警告:following parameters were not processed: [-treeindex.pagesize, 4096]
相当压抑的时间估计,而且还在继续增长:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 553728 ms
Relation does not have a dimensionality -- simulating M-tree as external index!
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.capacity: 200
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.capacity: 333
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.construction: 806160 ms
Index statistics before running algorithms:
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.reads: 22344677
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.writes: 3831053
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.numpages: 17472
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.height: 2
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.distancecalcs: 1773733054
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.knnqueries: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.rangequeries: 0
DBSCAN clustering: 708 [ 0%] 33738 min remaining
我的数据是3.5米,300维word2vec矢量(浮点).我能以某种方式优化它以便在一个合理的时间内运行吗?
我使用-dbscan.minpts 1
,因为我刚刚找到了向量之间的距离,它对应于相似性。
EDIT2: r树索引更快:DBSCAN clustering: 4423 [ 0%] 17248 min remaining
发布于 2018-01-24 10:07:40
并行DBSCAN版本不在0.7.1版本中,但您需要自己编译它。
它目前不包括进度日志记录,而且它是一个相当幼稚的并行化。如果大部分时间都花在邻居搜索上,它可以正常工作,因为集群标记是同步的。(但如果加载了所有内核,则同步应该是很好的)。
我刚刚推动了一个将进度日志添加到并行GDBSCAN的更改。
确保添加索引。对于大多数数据集,索引会产生相当大的加速。有了索引,这个实现的并行性就会很差,你会看到越来越多的线程在等待同步。
https://stackoverflow.com/questions/48383925
复制相似问题