
Sparse matrices/arrays in Java

Java Algorithms

慕妹3146593 2019-07-22 19:12:41

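I'm working on a project written in Java that requires me to build a very large 2-D sparse array. Very sparse, if that makes a difference. Anyway: the most crucial aspect of this application is efficiency in terms of time (assume loads of memory, though not nearly enough to allow a standard 2-D array: the key range is in the billions in both dimensions). Out of the kajillion cells in the array, several hundred thousand will contain an object. I need to be able to modify cell contents very quickly. Does anyone know a particularly good library for this? It would have to be under a Berkeley, LGPL or similar license (no GPL, as the product can't be entirely open-sourced). Or, if there's a very simple way to make a homebrew sparse array object, that would be fine too. I'm considering MTJ, but haven't heard any opinions on its quality.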


3 Answers

隔江千里

Contributed 1906 experience points, earned over 10 likes

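Sparse arrays built with hashmaps are very inefficient for frequently read data. The most efficient implementations use a Trie, which gives access to a single vector in which the segments are distributed.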

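A Trie can determine whether an element is present in the table with only two read-only array indexing operations, which yield the effective position where the element is stored or reveal that it is absent from the underlying store.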

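It can also provide a default position in the backing store for the default value of the sparse array, so that no test is needed on the returned index: the Trie guarantees that every possible source index maps at least to the default position in the backing store (where you will often store a zero, an empty string, or a null object).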
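The two-read lookup with a default position can be sketched as follows. This is only a minimal illustration under assumed sizes (a 1024*1024 matrix, 16*16 subranges); the class name TinyTrieLookup and its fields are hypothetical and chosen for this sketch.

```java
// Illustrative sketch only: the class and field names are assumptions.
final class TinyTrieLookup {
    static final int BITS = 4;                    // 16 x 16 cells per subrange
    static final int MASK = (1 << BITS) - 1;
    static final int SUBRANGES_PER_ROW = 64;      // 1024 / 16
    static final int CELLS = 1 << (2 * BITS);     // 256 cells per subrange

    // One entry per subrange: where its cells start in values.
    // Java zero-fills the array, so every subrange starts at the default block 0.
    final int[] subrangePositions = new int[SUBRANGES_PER_ROW * SUBRANGES_PER_ROW];

    // Block 0 holds the default zeros; room for one more block is reserved
    // so that a caller can experiment with a non-default subrange.
    final double[] values = new double[2 * CELLS];

    double get(int i, int j) {
        int subrange = (i >> BITS) * SUBRANGES_PER_ROW + (j >> BITS);
        int offset = ((i & MASK) << BITS) + (j & MASK);
        // Two array reads and no presence test: absent cells land in block 0.
        return values[subrangePositions[subrange] + offset];
    }
}
```

Because every index entry initially points at block 0, a read of a cell that was never written returns the default value without any branch.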

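Some implementations support fast-updatable Tries, with an optional "compact()" operation that optimizes the size of the backing store after a series of operations. Tries are much faster than hashmaps because they need no complex hashing function and do not have to handle collisions on reads (with a Hashmap, collisions occur on both reads and writes, which requires a loop to skip to the next candidate position and a test on each candidate to compare the effective source index).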

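In addition, Java Hashmaps can only index on objects, and creating an Integer object for each hashed source index (needed for every read, not just for writes) is costly in memory operations because it stresses the garbage collector.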
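For comparison, a hash-based sparse matrix might look like the sketch below. The class HashSparseMatrix and the way two int coordinates are packed into one long key are assumptions made for illustration; they are not taken from any library. The point is that every get() has to box its key before the map can hash it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a naive hash-based sparse matrix.
final class HashSparseMatrix {
    private final Map<Long, Double> cells = new HashMap<>();

    // Pack the two coordinates into a single long key (assumed scheme).
    private static long key(int i, int j) {
        return ((long) i << 32) | (j & 0xFFFFFFFFL);
    }

    double get(int i, int j) {
        // The long key is boxed to a Long on every read, then hashed and
        // compared against the candidate entries in its bucket.
        Double v = cells.get(key(i, j));
        return v == null ? 0.0 : v;
    }

    void set(int i, int j, double value) {
        if (value == 0.0) {
            cells.remove(key(i, j));      // keep only non-default cells
        } else {
            cells.put(key(i, j), value);  // boxes both the key and the value
        }
    }

    int storedCells() {
        return cells.size();
    }
}
```

Whether the boxing actually allocates depends on the JVM (small values are cached and the JIT may remove some allocations), so the cost described above should be measured rather than assumed.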

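I really hoped the JRE would include an IntegerTrieMap<Object> as the default implementation for the slow HashMap<Integer, Object>, or a LongTrieMap<Object> as the default implementation for the even slower HashMap<Long, Object>. But this is still not the case.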

You may wonder: what is a Trie?

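It is just a small array of integers (covering a smaller range than the full coordinate range of your matrix) that maps coordinates to an integer position in a vector.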

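For example, suppose you want a 1024*1024 matrix containing only a few non-zero values. Instead of storing that matrix in an array of 1024*1024 elements (more than 1 million), you can split it into subranges of size 16*16, and you need only 64*64 such subranges.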

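In that case the Trie index contains just 64*64 integers (4096), and there are at least 16*16 data elements (holding the default zeros, or the most common subrange found in your sparse matrix).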

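The vector used to store the values holds only one copy of subranges that are equal to each other (most of them are full of zeros, so they are all represented by the same subrange).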
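The storage arithmetic of this example can be checked with a small sketch. The helper TrieFootprint is hypothetical, and it counts only the raw array payload (8 bytes per double, 4 bytes per int), ignoring object headers and alignment.

```java
// Illustrative sketch only: raw payload sizes, ignoring JVM object overhead.
final class TrieFootprint {
    /** Bytes needed by a dense rows x cols array of double. */
    static long denseBytes(int rows, int cols) {
        return (long) rows * cols * Double.BYTES;
    }

    /**
     * Bytes needed by a trie with square subranges of 2^bits x 2^bits cells,
     * when distinctBlocks different subranges (including the default one)
     * are stored in the values vector.
     */
    static long trieBytes(int rows, int cols, int bits, int distinctBlocks) {
        int side = 1 << bits;
        long subrangesI = (rows + side - 1) / side;
        long subrangesJ = (cols + side - 1) / side;
        long indexBytes = subrangesI * subrangesJ * Integer.BYTES;
        long valueBytes = (long) distinctBlocks * side * side * Double.BYTES;
        return indexBytes + valueBytes;
    }
}
```

For the 1024*1024 case, the dense array takes 8,388,608 bytes, while an all-default trie takes 4096 * 4 + 256 * 8 = 18,432 bytes; each additional distinct subrange adds 2,048 bytes.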

So instead of a syntax like matrix[i][j], you would use a syntax like:

trie.values[trie.subrangePositions[((i >> 4) << 6) + (j >> 4)] +
            ((i & 15) << 4) + (j & 15)]

This is more conveniently handled through an access method of the Trie object.

Here is an example, built into a commented class (I hope it compiles OK, as it was simplified; let me know if there are errors to correct):

/**
 * Implement a sparse matrix. Currently limited to a static size
 * (<code>SIZE_I</code>, <code>SIZE_J</code>).
 */
public class DoubleTrie {

    /* Matrix logical options */        
    public static final int SIZE_I = 1024;
    public static final int SIZE_J = 1024;
    public static final double DEFAULT_VALUE = 0.0;

    /* Internal splitting options */
    private static final int SUBRANGEBITS_I = 4;
    private static final int SUBRANGEBITS_J = 4;

    /* Internal derived splitting constants */
    private static final int SUBRANGE_I =
        1 << SUBRANGEBITS_I;
    private static final int SUBRANGE_J =
        1 << SUBRANGEBITS_J;
    private static final int SUBRANGEMASK_I =
        SUBRANGE_I - 1;
    private static final int SUBRANGEMASK_J =
        SUBRANGE_J - 1;
    private static final int SUBRANGE_POSITIONS =
        SUBRANGE_I * SUBRANGE_J;

    /* Internal derived default values for constructors */
    private static final int SUBRANGES_I =
        (SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;
    private static final int SUBRANGES_J =
        (SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;
    private static final int SUBRANGES =
        SUBRANGES_I * SUBRANGES_J;
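    /* All zeros: every subrange initially maps to position 0. */
    private static final int[] DEFAULT_POSITIONS =
        new int[SUBRANGES];
    private static final double[] DEFAULT_VALUES =
        new double[SUBRANGE_POSITIONS];
    static {
        java.util.Arrays.fill(DEFAULT_VALUES, DEFAULT_VALUE);
    }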

    /* Internal fast computations of the splitting subrange and offset. */
    private static final int subrangeOf(
            final int i, final int j) {
        /* Row of subranges times the number of subranges per row. */
        return (i >> SUBRANGEBITS_I) * SUBRANGES_J +
               (j >> SUBRANGEBITS_J);
    }
    private static final int positionOffsetOf(
            final int i, final int j) {
        /* Row inside the subrange times the width of a subrange. */
        return (i & SUBRANGEMASK_I) * SUBRANGE_J +
               (j & SUBRANGEMASK_J);
    }

    /**
     * Utility missing in java.lang.System for arrays of comparable
     * component types, including all native types like double here.
     */
    public static final int arraycompare(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            int length) { /* Not final: also used as the countdown index. */
        if (position1 >= 0 && position2 >= 0 && length >= 0) {
            while (length-- > 0) {
                double value1, value2;
                if ((value1 = values1[position1 + length]) !=
                    (value2 = values2[position2 + length])) {
                    /* Note: NaN values are different from everything including
                     * all NaN values; they are also neither lower than nor
                     * greater than everything including NaN. Note that the two
                     * infinite values, as well as denormal values, are exactly
                     * ordered and comparable with <, <=, ==, >=, >, !=. Note
                     * that in comments below, infinite is considered "defined".
                     */
                    if (value1 < value2)
                        return -1;        /* defined < defined. */
                    if (value1 > value2)
                        return 1;         /* defined > defined. */
                    /* One or both are NaN. */
                    if (value1 == value1) /* Is not a NaN? */
                        return -1;        /* defined < NaN. */
                    if (value2 == value2) /* Is not a NaN? */
                        return 1;         /* NaN > defined. */
                    /* Otherwise, both are NaN: check their precise bits in
                     * range 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL
                     * including the canonical 0x7FF8000000000000L, or in
                     * range 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL.
                     * Needed for sort stability only (NaNs are otherwise
                     * unordered).
                     */
                    long raw1, raw2;
                    if ((raw1 = Double.doubleToRawLongBits(value1)) !=
                        (raw2 = Double.doubleToRawLongBits(value2)))
                        return raw1 < raw2 ? -1 : 1;
                    /* Otherwise the NaN are strictly equal, continue. */
                }
            }
            return 0;
        }
        throw new ArrayIndexOutOfBoundsException(
                "The positions and length can't be negative");
    }

    /**
     * Utility shortcut for comparing ranges in the same array.
     */
    public static final int arraycompare(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arraycompare(values, position1, values, position2, length);
    }

    /**
     * Utility missing in java.lang.System for arrays of equalizable
     * component types, including all native types like double here.
     */ 
    public static final boolean arrayequals(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            final int length) {
        return arraycompare(values1, position1, values2, position2, length) ==
            0;
    }

    /**
     * Utility shortcut for identifying ranges in the same array.
     */
    public static final boolean arrayequals(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arrayequals(values, position1, values, position2, length);
    }

    /**
     * Utility shortcut for copying ranges in the same array.
     */
    public static final void arraycopy(
            final double[] values,
            final int srcPosition, final int dstPosition,
            final int length) {
        System.arraycopy(values, srcPosition, values, dstPosition, length);
    }

    /**
     * Utility shortcut for resizing an array, preserving values at start.
     */
    public static final double[] arraysetlength(
            double[] values,
            final int newLength) {
        final int oldLength =
            values.length < newLength ? values.length : newLength;
        System.arraycopy(values, 0, values = new double[newLength], 0,
            oldLength);
        return values;
    }

    /* Internal instance members. */
    private double values[];
    private int subrangePositions[];
    private boolean isSharedValues;
    private boolean isSharedSubrangePositions;

    /* Internal method. */
    private void reset(
            final double[] values,
            final int[] subrangePositions) {
        this.isSharedValues =
            (this.values = values) == DEFAULT_VALUES;
        this.isSharedSubrangePositions =
            (this.subrangePositions = subrangePositions) ==
                DEFAULT_POSITIONS;
    }

    /**
     * Reset the matrix to fill it with the same initial value.
     *
     * @param initialValue  The value to set in all cell positions.
     */
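    public void reset(final double initialValue) {
        final double[] initialValues;
        if (initialValue == DEFAULT_VALUE) {
            initialValues = DEFAULT_VALUES;
        } else {
            initialValues = new double[SUBRANGE_POSITIONS];
            java.util.Arrays.fill(initialValues, initialValue);
        }
        reset(initialValues, DEFAULT_POSITIONS);
    }

    /**
     * Reset the matrix to fill it with DEFAULT_VALUE (Java has no default
     * parameter values, so this is a separate overload).
     */
    public void reset() {
        reset(DEFAULT_VALUE);
    }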

    /**
     * Default constructor, filling all positions with DEFAULT_VALUE.
     */
    public DoubleTrie() {
        this(DEFAULT_VALUE);
    }

    /**
     * Constructor using a single initial value.
     *
     * @param initialValue  Alternate default value to initialize all
     *                      positions in the matrix.
     */
    public DoubleTrie(final double initialValue) {
        this.reset(initialValue);
    }

    /**
     * This is a useful preinitialized instance containing the
     * DEFAULT_VALUE in all cells.
     */
    public static final DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();

    /**
     * Copy constructor. Note that the source trie may be immutable
     * or not; but this constructor will create a new mutable trie
     * even if the new trie initially shares some storage with its
     * source when that source also uses shared storage.
     */
    public DoubleTrie(final DoubleTrie source) {
        this.values = (this.isSharedValues =
            source.isSharedValues) ?
            source.values :
            source.values.clone();
        this.subrangePositions = (this.isSharedSubrangePositions =
            source.isSharedSubrangePositions) ?
            source.subrangePositions :
            source.subrangePositions.clone();
    }

    /**
     * Fast indexed getter.
     *
     * @param i  Row of position to get in the matrix.
     * @param j  Column of position to get in the matrix.
     * @return   The value stored in matrix at that position.
     */
    public double getAt(final int i, final int j) {
        return values[subrangePositions[subrangeOf(i, j)] +
                      positionOffsetOf(i, j)];
    }

    /**
     * Fast indexed setter.
     *
     * @param i      Row of position to set in the sparse matrix.
     * @param j      Column of position to set in the sparse matrix.
     * @param value  The value to set at this position.
     * @return       The passed value.
     * Note: this does not compact the sparse matrix after setting.
     * @see #compact()
     */
    public double setAt(final int i, final int j, final double value) {
       final int subrange       = subrangeOf(i, j);
       final int positionOffset = positionOffsetOf(i, j);
       // Fast check to see if the assignment will change something.
       int subrangePosition, valuePosition;
       if (Double.compare(
               values[valuePosition =
                   (subrangePosition = subrangePositions[subrange]) +
                   positionOffset],
               value) != 0) {
               /* So we'll need to perform an effective assignment in values.
                * Check if the current subrange to assign is shared or not.
                * Note that we also include the DEFAULT_VALUES which may be
                * shared by several other (not tested) trie instances,
                * including those instantiated by the copy constructor. */
               if (isSharedValues) {
                   values = values.clone();
                   isSharedValues = false;
               }
               /* Scan all other subranges to check if the position in values
                * to assign is shared by another subrange. */
               for (int otherSubrange = subrangePositions.length;
                       --otherSubrange >= 0; ) {
                   if (otherSubrange == subrange)
                       continue; /* Ignore the target subrange. */
                   /* Note: the following test of range is safe with future
                    * interleaving of common subranges (TODO in compact()),
                    * even though, for now, subranges are sharing positions
                    * only between their common start and end position, so we
                    * could as well only perform the simpler test <code>
                    * (otherSubrangePosition == subrangePosition)</code>,
                    * instead of testing the two bounds of the positions
                    * interval of the other subrange. */
                   int otherSubrangePosition;
                   if ((otherSubrangePosition =
                           subrangePositions[otherSubrange]) <=
                           valuePosition &&
                           valuePosition <
                           otherSubrangePosition + SUBRANGE_POSITIONS) {
                       /* The target position is shared by some other
                        * subrange, we need to make it unique by cloning the
                        * subrange to a larger values vector, copying all the
                        * current subrange values at end of the new vector,
                        * before assigning the new value. This will require
                        * changing the position of the current subrange, but
                        * before doing that, we first need to check if the
                        * subrangePositions array itself is also shared
                        * between instances (including the DEFAULT_POSITIONS
                        * that should be preserved, and possible arrays
                        * shared by an external factory constructor whose
                        * source trie was declared immutable in a derived
                        * class). */
                       if (isSharedSubrangePositions) {
                           subrangePositions = subrangePositions.clone();
                           isSharedSubrangePositions = false;
                       }
                       /* TODO: no attempt is made to allocate less than a
                        * fully independent subrange, using possible
                        * interleaving: this would require scanning all
                        * other existing values to find a match for the
                        * modified subrange of values; but this could
                        * potentially leave positions (in the current subrange
                        * of values) unreferenced by any subrange, after the
                        * change of position for the current subrange. This
                        * scanning could be prohibitively long for each
                        * assignment, and for now it's assumed that compact()
                        * will be used later, after those assignments. */
                       final int newSubrangePosition = values.length;
                       values = arraysetlength(
                           values,
                           newSubrangePosition + SUBRANGE_POSITIONS);
                       /* Copy the current subrange to the end of the vector. */
                       System.arraycopy(values, subrangePosition,
                           values, newSubrangePosition,
                           SUBRANGE_POSITIONS);
                       subrangePositions[subrange] = newSubrangePosition;
                       valuePosition = newSubrangePosition + positionOffset;
                       break;
                   }
               }
               /* Now perform the effective assignment of the value. */
               values[valuePosition] = value;
       }
       return value;
    }

    /**
     * Compact the storage of common subranges.
     * TODO: This is a simple implementation without interleaving, which
     * would offer a better data compression. However, interleaving with its
     * O(N^2) complexity where N is the total length of values, should
     * be attempted only after this basic compression whose complexity is
     * O(n^2) with n being SUBRANGE_POSITIONS times smaller than N.
     */
    public void compact() {
        final int oldValuesLength = values.length;
        int newValuesLength = 0;
        for (int oldPosition = 0;
                 oldPosition < oldValuesLength;
                 oldPosition += SUBRANGE_POSITIONS) {
            /* Position of this subrange after compaction: an equal subrange
             * already kept, or else the next free position. */
            int newPosition = newValuesLength;
            /* Scan the values kept so far for a common subrange. */
            for (int keptPosition = newValuesLength;
                    (keptPosition -= SUBRANGE_POSITIONS) >= 0; )
                if (arrayequals(values, keptPosition, oldPosition,
                        SUBRANGE_POSITIONS)) {
                    newPosition = keptPosition;
                    break;
                }
            if (newPosition != oldPosition) {
                /* The arrays are about to be modified in place, so they
                 * must no longer be shared with other instances. */
                if (isSharedValues) {
                    values = values.clone();
                    isSharedValues = false;
                }
                if (isSharedSubrangePositions) {
                    subrangePositions = subrangePositions.clone();
                    isSharedSubrangePositions = false;
                }
                /* Update subrangePositions[] for all entries referring to
                 * oldPosition. There may be several indexes to change, if
                 * the trie has already been compacted before, and later
                 * reassigned. */
                for (int subrange = subrangePositions.length;
                     --subrange >= 0; )
                    if (subrangePositions[subrange] == oldPosition)
                        subrangePositions[subrange] = newPosition;
            }
            if (newPosition == newValuesLength) {
                /* Not a common subrange: move its values down if some
                 * previous subranges have been compressed, then advance
                 * the compressed length to preserve these values. */
                if (oldPosition != newValuesLength)
                    System.arraycopy(values, oldPosition,
                        values, newValuesLength, SUBRANGE_POSITIONS);
                newValuesLength += SUBRANGE_POSITIONS;
            }
        }
        /* Check the number of compressed values. */
        if (newValuesLength < oldValuesLength) {
            values = arraysetlength(values, newValuesLength);
            isSharedValues = false;
        }
    }
}

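Note: this code is not complete, because it handles a single matrix size, and its compactor is limited to detecting common subranges, without interleaving them.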

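Also, the code does not determine the best width or height to use for splitting the matrix into subranges (for the x or y coordinates) according to the matrix size. It just uses the same static subrange size of 16 for both coordinates, but it could conveniently be any other small power of 2 (a non-power of 2 would slow down the internal methods int subrangeOf(int, int) and int positionOffsetOf(int, int)), chosen independently for the two coordinates, up to the maximum width or height of the matrix. Ideally, the compact() method should be able to determine the best-fitting sizes.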
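The reason a power of 2 is convenient can be sketched by putting the two indexing styles side by side. The class SubrangeIndexing and its method names are hypothetical; the sketch assumes square subranges and non-negative coordinates.

```java
// Illustrative sketch only: shift/mask indexing versus divide/modulo indexing.
final class SubrangeIndexing {
    /** Subrange number when the subrange side is 2^bits (shifts only). */
    static int subrangePow2(int i, int j, int bits, int subrangesPerRow) {
        return (i >> bits) * subrangesPerRow + (j >> bits);
    }

    /** Offset inside the subrange when its side is 2^bits (masks only). */
    static int offsetPow2(int i, int j, int bits) {
        int mask = (1 << bits) - 1;
        return ((i & mask) << bits) + (j & mask);
    }

    /** Same subrange number for an arbitrary side: needs integer divisions. */
    static int subrangeGeneral(int i, int j, int side, int subrangesPerRow) {
        return (i / side) * subrangesPerRow + (j / side);
    }

    /** Same offset for an arbitrary side: needs remainders. */
    static int offsetGeneral(int i, int j, int side) {
        return (i % side) * side + (j % side);
    }
}
```

Both pairs give identical results when side equals 1 << bits; the general versions also work for a side such as 10, at the price of a division and a remainder per coordinate on every access.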

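If these subrange sizes can vary, you will need to add instance members for them in place of the static SUBRANGE_POSITIONS, and make the static methods int subrangeOf(int i, int j) and int positionOffsetOf(int i, int j) non-static; the initialization arrays DEFAULT_POSITIONS and DEFAULT_VALUES will also need to be dropped or redefined differently.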

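If you want to support interleaving, you basically start by dividing the existing values into two subsets of about the same size (both a multiple of the minimum subrange size, the first subset possibly having one more subrange than the second), and you scan the larger one at all successive positions to find a matching interleaving; then you try to match these values. Next, you loop recursively by dividing the subsets in halves (again a multiple of the minimum subrange size) and scan again to match these subsets. This multiplies the number of subsets by 2, so you have to ask whether the doubled size of the subrangePositions index is worth it compared with the existing size of the values, that is, whether it offers an effective compression. If not, you stop there: you have found the optimum subrange size directly from the interleaving compression process. In that case, the subrange size will be mutable during compaction.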

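Still, this code shows how to assign non-zero values and reallocate the values array for additional (non-zero) subranges, and then how to optimize the storage of this data (with compact(), after assignments have been made with the setAt(int i, int j, double value) method) when it contains duplicate subranges that can be unified and reindexed at the same position in the subrangePositions array.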
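The unification of duplicate subranges can be illustrated in isolation with the sketch below. It is not the class's compact() method: the helper BlockCompactor is hypothetical, works on any block size, returns a new values array instead of compacting in place, and does no interleaving.

```java
import java.util.Arrays;

// Illustrative sketch only: merge identical blocks and reindex them.
final class BlockCompactor {
    /**
     * Returns a values array in which each distinct block of `cells`
     * elements appears once. The entries of `index` (block start positions)
     * are rewritten in place to point into the returned array.
     */
    static double[] compact(double[] values, int[] index, int cells) {
        double[] out = values.clone();
        int newLength = 0;
        for (int oldPos = 0; oldPos < values.length; oldPos += cells) {
            int target = newLength;           // where the block goes if it is new
            for (int pos = 0; pos < newLength; pos += cells) {
                if (sameBlock(out, pos, values, oldPos, cells)) {
                    target = pos;             // an equal block is already kept
                    break;
                }
            }
            if (target == newLength) {        // first occurrence: keep it
                System.arraycopy(values, oldPos, out, newLength, cells);
                newLength += cells;
            }
            if (target != oldPos) {           // reindex every reference to the block
                for (int b = 0; b < index.length; b++) {
                    if (index[b] == oldPos) index[b] = target;
                }
            }
        }
        return Arrays.copyOf(out, newLength);
    }

    private static boolean sameBlock(double[] a, int aPos,
                                     double[] b, int bPos, int cells) {
        for (int k = 0; k < cells; k++) {
            if (Double.compare(a[aPos + k], b[bPos + k]) != 0) return false;
        }
        return true;
    }
}
```

With blocks of 2 cells, values {0,0, 1,2, 0,0, 1,2, 3,4} and index {0, 2, 4, 6, 8, 4} become values {0,0, 1,2, 3,4} and index {0, 2, 0, 2, 4, 0}.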

In any case, all the principles of a Trie are implemented here:

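It is always faster (and more compact in memory, meaning better locality) to represent a matrix with a single vector instead of a double-indexed array of arrays (each one allocated separately). The improvement is visible in the double getAt(int, int) method!
You save a lot of space, but when assigning values, reallocating new subranges can take time. For this reason, subranges should not be too small, or reallocations will occur too frequently while you set up your matrix.
A large initial matrix can be transformed automatically into a more compact one by detecting common subranges. A typical implementation will then contain a method such as compact() above. However, while getAt() access is very fast and setAt() is quite fast, compact() can be very slow if there are many common subranges to compress (for example, when subtracting a large, non-sparse, randomly filled matrix from itself, or multiplying it by zero: in that case it is simpler and much faster to reset the Trie by instantiating a new one and dropping the old one).
Common subranges use common storage in the data, so this shared data must be read-only. If you must change a single value without changing the rest of the matrix, you must first make sure that its subrange is referenced only once in the subrangePositions index. Otherwise, you need to allocate a new subrange somewhere in the values vector (conveniently at the end) and then store the position of this new subrange in the subrangePositions index.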
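The last principle, copying a shared subrange before writing into it, can be sketched with a deliberately small, self-contained class. CowTrie and its 4*4 subranges are assumptions for illustration; unlike the class above, it does not share storage between instances and never compacts.

```java
import java.util.Arrays;

// Illustrative sketch only: copy-on-write of a shared subrange.
final class CowTrie {
    private static final int BITS = 2;               // 4 x 4 cells per subrange
    private static final int SIDE = 1 << BITS;
    private static final int MASK = SIDE - 1;
    private static final int CELLS = SIDE * SIDE;

    private final int subrangesPerRow;
    private final int[] index;                        // subrange -> start in values
    private double[] values = new double[CELLS];      // block 0: default zeros

    CowTrie(int rows, int cols) {
        subrangesPerRow = (cols + SIDE - 1) / SIDE;
        index = new int[((rows + SIDE - 1) / SIDE) * subrangesPerRow];
    }

    double get(int i, int j) {
        return values[index[subrangeOf(i, j)] + offsetOf(i, j)];
    }

    void set(int i, int j, double value) {
        int subrange = subrangeOf(i, j);
        int start = index[subrange];
        if (Double.compare(values[start + offsetOf(i, j)], value) == 0) {
            return;                                   // nothing would change
        }
        if (isShared(subrange, start)) {
            // Copy the block to the end of values and repoint this subrange only.
            int newStart = values.length;
            values = Arrays.copyOf(values, newStart + CELLS);
            System.arraycopy(values, start, values, newStart, CELLS);
            index[subrange] = newStart;
            start = newStart;
        }
        values[start + offsetOf(i, j)] = value;
    }

    int storedCells() {
        return values.length;
    }

    private boolean isShared(int subrange, int start) {
        for (int other = 0; other < index.length; other++) {
            if (other != subrange && index[other] == start) return true;
        }
        return false;
    }

    private int subrangeOf(int i, int j) {
        return (i >> BITS) * subrangesPerRow + (j >> BITS);
    }

    private static int offsetOf(int i, int j) {
        return ((i & MASK) << BITS) + (j & MASK);
    }
}
```

In a 16*16 matrix, the first write into a subrange grows the vector by 16 cells; a second write into the same subrange reuses the private copy and allocates nothing.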

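Note that the generic Colt library, though very good, is not as good when working with sparse matrices, because it uses hashing (or row-compressed) techniques and does not yet implement support for Tries, even though they are an excellent optimization that saves both space and time, notably for the most frequent getAt() operations.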

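Even the setAt() operation described here for Tries saves a lot of time (as implemented here, i.e. without automatic compaction after setting, which could still be added on demand, based on an estimate of when compaction would save a lot of storage space at the price of time): the time saving is proportional to the number of cells in a subrange, and the space saving is inversely proportional to the number of cells per subrange. A good compromise is to choose a subrange size such that the number of cells per subrange is the square root of the total number of cells in a 2D matrix (it would be a cube root for 3D matrices).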
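The square-root compromise can be turned into a rough sizing rule. The helper below is a hypothetical sketch that assumes square subranges whose side is a power of 2: a subrange of 2^bits * 2^bits cells holds about sqrt(rows * cols) cells when bits is one quarter of log2(rows * cols).

```java
// Illustrative sketch only: a rough heuristic, not a tuned recommendation.
final class SubrangeSizing {
    /**
     * Suggested number of bits per coordinate so that a square subrange of
     * 2^bits x 2^bits cells holds about sqrt(rows * cols) cells.
     */
    static int suggestedBits(long rows, long cols) {
        double cells = (double) rows * cols;
        double log2Cells = Math.log(cells) / Math.log(2.0);
        return Math.max(1, (int) Math.round(log2Cells / 4.0));
    }
}
```

For a 1024*1024 matrix this suggests 5 bits, i.e. 32*32 subranges of 1024 cells each, which is somewhat larger than the 16*16 subranges used in the class above.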

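The hashing techniques used in Colt's sparse matrix implementations have the drawback of adding a lot of storage overhead, and of slow access times due to possible collisions. Tries avoid all collisions, and can therefore guarantee O(1) time instead of linear O(n) time in the worst case, where n is the number of possible collisions (which, for a sparse matrix, may be up to the number of non-default-value cells in the matrix, i.e. up to the total size of the matrix times a factor proportional to the hashing fill factor, for a non-sparse, i.e. full, matrix).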

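The RC (row-compressed) technique used in Colt is closer to Tries, but it comes at another price: the compression technique it uses has very slow access times for the most frequent read-only get() operations, and very slow compression for setAt() operations. In addition, the compression used is not orthogonal, unlike this presentation of Tries, where orthogonality is preserved. Tries would also preserve this orthogonality for related viewing operations such as striding, transposition (viewed as a striding operation based on integer cyclic modular operations), and subranging (and subselections in general, including sorting views).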

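I just hope that Colt will be updated in the future to add another implementation using Tries (i.e. a TrieSparseMatrix, instead of just HashSparseMatrix and RCSparseMatrix). The ideas are in this article.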

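The Trove implementations (based on int->int maps) also rely on hashing techniques similar to Colt's HashedSparseMatrix, i.e. they have the same drawbacks. Tries will be a lot faster, at the cost of a moderate amount of additional space (but this space can be optimized, and can even become better than with Trove and Colt, at a deferred time, by running a final compact() operation on the resulting matrix/trie).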

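Note: this Trie implementation is bound to a specific native type (here double). This is deliberate, because a generic implementation using boxed types has a huge space overhead (and is much slower in access time). Here it just uses native one-dimensional arrays of double rather than a generic Vector. It is certainly possible to derive a generic implementation for Tries as well. Unfortunately, Java still does not allow writing truly generic classes with all the benefits of native types, except by writing multiple implementations (one for a generic Object type and one for each native type) and serving all of them through a type factory. The language should be able to instantiate the native implementations and build the factory automatically (for now this is not the case, even in Java 7, and this is where .NET still keeps its advantage, with truly generic types that are as fast as native types).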

2019-07-22
